Nano Banana for me. After the initial wow phase it's meh now. Randomly refuses to adhere to the prompt. Randomly makes unexpected changes. Randomly triggers censorship filter. Randomly returns the image as is without making any changes.
The result announcement blog post sounded too hypey and gave the impression that he had changed his tone, but for the past year he has been consistently saying that LLM + "discrete program search" is the way to go. All the top scoring submissions before o3 had followed the same bruteforce strategy. Even o3 is more or less doing the same bruteforce under the hood, maybe not discrete.
We can extend this phenomenon to all of knowledge. You ask ChatGPT something and it gives you some very generic 10-point blogspam-esque answer. You immediately think ChatGPT is intelligent and that it has understood your particular question and that it has given you a tailor-made answer.
His book "The Algebraic Mind" goes into great detail about connectionism (neural networks), symbolic systems, limits of connectionism and proposals to integrate neural networks with symbolic systems (hybrid systems).
The "deep learning alone is enough" camp (especially LeCun) has abused him for years but is now slowly coming to the realization that we need to feed neural networks explicit inductive biases to attain AGI, which is exactly what Marcus has been saying since the 90s. LeCun, for some reason, refuses to call these explicit biases symbols, and that's the only disagreement between Marcus and LeCun these days.
This. It is a little bit like talking about politics. People will agree with sufficiently generic statements along the lines of 'there should be some accountability'; they will start arguing over what that means, though. If you listen to her songs, they are the equivalent of those generic statements most people can relate to.
I am aware of her even though I am very detached from US pop culture. It is catchy.
I think if we replaced "AI" with "taking averages over subsets of historical examples", then there'd be no mystery for when "AI" will be good or bad at anything.
Would we expect a discrete melodic structure to be expressible as averages of prior music? No.
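To make the caricature concrete, "taking averages over subsets of historical examples" can be written down literally as a k-nearest-neighbour average. The function name and toy data here are purely illustrative:

```python
def knn_average(query, examples, k=3):
    # "AI" as caricatured above: find the k historical examples
    # closest to the query and average their recorded outputs.
    by_distance = sorted(examples, key=lambda xy: abs(xy[0] - query))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy "history" of (input, output) pairs.
history = [(1, 10.0), (2, 12.0), (3, 11.0), (10, 50.0)]
```

Interpolation near the history works fine; a genuinely novel query just gets pulled toward whatever happens to be closest, which is the point being made about discrete melodic structure.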
Pretty sure the first continuation is a famous piece with a few notes messed up. Can't remember the name. Honestly it only sounds marginally better than the old Markov chain continuations.
Isn’t that as good as it gets? The whole point of the continuations is that given a short leading prompt from a real piece that it should continue it realistically.
It didn’t get to train on the test set, if that’s what you’re implying, and I find it hard to believe the assertion that continuations are copies of the train set (if that’s your claim).
Wow, good find! They definitely sound similar but it’s not a facsimile. I wonder if this holds for the other samples.
I guess in retrospect we asked it to continue the music in a likely way, not be novel. And it definitely convinced me enough to be impressive. An NN that composes completely fresh music, whatever that means (I’m sure most modern human music has a hefty dose of cross-song sampling), would certainly be a good next goalpost.
Indeed, there is lots of denial or ignorance in this thread (ignorance in the technical sense). AudioLM already produced impressive results and it's a tiny fraction of what is already possible because performance simply improves with scale. One can probably solve music generation today with a ~$1B budget for most purposes like film or game music, or personalized soundtracks. This is not science fiction.
What's more interesting and concerning - listen carefully to the first piano continuation example from AudioLM, notice the similarity of the last 7 seconds to Moonlight sonata: https://youtu.be/4Tr0otuiQuU?t=516
I'm afraid we will see a lot of this with music generation models in the near future.
There are quite simple tricks to avoid repetition/copying in NNs, e.g. (1) training a model to predict the "popularity" of the main model's outputs and penalizing popular/copied productions by backpropping through that model so as to decrease the predicted popularity, or (2) conditioning on random inputs (LLMs can be prompted with imaginary "ID XXX" prefixes before each example to mitigate repetitions), or (3) increasing temperature or optimizing for higher entropy. LLM outputs are already extremely diverse and verbatim copying is not a huge issue at all. The point being, all evidence points to this not being a showstopper if you massage these methods for long enough in one or more of the various right ways.
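Of the tricks listed, (3) is the easiest to show concretely. A minimal pure-Python sketch of temperature scaling (all names mine; real systems apply this to model logits before sampling):

```python
import math

def softmax(logits, temperature=1.0):
    # Dividing logits by T > 1 flattens the distribution (more diverse
    # samples); T < 1 sharpens it toward the most likely token.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, 0.1]
sharp = softmax(logits, temperature=0.5)  # closer to copying the argmax
flat = softmax(logits, temperature=2.0)   # higher entropy, fewer verbatim repeats
```

Raising the temperature trades fidelity for diversity, which is exactly the tension the replies below raise for "conservative but still high quality" generation.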
I'm not sure what you mean by "backpropping through that model so as to decrease the predicted popularity". During training, we train a model to literally reproduce famous chunks of music exactly as they are in the training set. We can also learn to predict popularity at the same time, but we can't backpropagate anything that will reduce popularity, because this would directly contradict the main loss objective of exact reproduction.
Having said that, I think the idea of predicting popularity is good - we can use it for filtering already-generated chunks during the post-training evaluation phase.
I don't think the other two methods you suggest would help here, we want to generate while conditioning on famous pieces, and we don't want to increase temperature if we want to generate conservative, but still high quality pieces.
It's true that we (humans) are less sensitive to plagiarism in text output, but even for LLMs it is a problem when they try to generate something highly creative, such as poetry. I personally noticed multiple times a particularly beautiful poetry phrase generated by GPT-2, only to google it and find out it was copied verbatim from a human poem.
What I had in mind was kind of like a reward model that is trained on longer outputs that have a very high similarity to training examples. Something similar has been done to prevent LLMs from using toxic language. You'd simply backprop through that model like in GANs. And no, it does not completely contradict the overall training objective, because the criterion would be long verbatim copies; it would not affect shorter copies of sound fragments and the like, which you would want a music model to produce in order for it to sound realistic and natural.
Oh OK, so you mean training the model after it has already been trained on the main task, right? Like finetuning. Yes, I think the GAN-like finetuning is a good idea. Though it's less clear where the labels would come from, it seems like some sort of fingerprint would need to be computed for each generated sequence, and this fingerprint would need to be compared against a database of fingerprints for every sequence in the training set. This could be a huge database.
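A crude sketch of that fingerprint lookup — hashing every n-token window of a sequence and checking overlap against a database built from the training set. The window size, hash choice, and toy token encoding are all arbitrary assumptions:

```python
import hashlib

def fingerprints(tokens, n=8):
    # One SHA-1 hash per length-n window; any hash shared between a
    # generated sequence and the training database flags a near-verbatim copy.
    out = set()
    for i in range(len(tokens) - n + 1):
        window = ",".join(map(str, tokens[i:i + n]))
        out.add(hashlib.sha1(window.encode()).hexdigest())
    return out

# Toy training "database" and a generated sequence that copies a chunk of it.
train_db = fingerprints([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
generated = [0, 0, 3, 4, 5, 6, 7, 8, 9, 10, 0]
copied = bool(fingerprints(generated) & train_db)  # True -> filter it out
```

For raw audio rather than tokens this would need a perceptual fingerprint (e.g. spectral hashes) instead of exact windows, and yes, the database would be large: roughly one entry per window in the training set.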
It doesn't surprise me that an AI model for language can't grok maths or music. I can't see how a language model can map to maths. Hell, I don't even know how to describe music in words. It's possible to articulate some maths in words, but that often involves using words with unexpected definitions.
MIDI is extraordinarily expressive and is likely used to sequence a large majority of music produced within the last three decades. A lot of the instruments you hear are synthesizers or samplers running directly from MIDI. There is a lot more to what MIDI can do, and is used for, than the conception most people have from "canyon.mid" or old website background music. If an AI can do MIDI just fine then it's an extremely small leap to doing audio just fine.
> If an AI can do MIDI just fine then it's an extremely small leap to doing audio just fine.
Unfortunately this is not true. It takes a huge amount of human effort to make MIDI encoded music sound good. The difference between MIDI and raw audio music generation is the same as the difference between drawing a cartoon and producing a photograph.
To clarify, yes MIDI can be expressive, but what's being generated when people say "AI generates MIDI music" is basically a piano roll.
I'm not familiar enough with existing implementations of such systems to dispute it, but there's no fundamental reason algorithmic composition systems could not include modulation parameters of all kinds (pitch/breath/effects/synthesizer controls/etc) in their output. I am envisioning a DAW set up with several VSTs and samplers with routing and effects in place, then using some combination of genetic algorithms and other methods to "tweak the knobs" in the search for something pleasing.
The search space is absolutely enormous, though, so I don't dispute that it's very difficult, but I wouldn't go so far as to say that it can't be done. In such a space there are "no wrong answers", so to speak. I have a Python script which creates randomized sequences of notes/rhythms and gives each one a different combination of LP/HP filters and random envelopes - it's not music, but it takes on a much less mechanical quality by emulating different attacks and timbres over time, even though it's completely random.
I would go so far as to say I'd be genuinely surprised if algorithmic composition and production hasn't been used to some extent significantly greater than "basically a piano roll" in at least some of the past decade's top 40 music on the radio.
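A script along those lines fits in a few dozen lines. A minimal sketch — the scale, field names, and parameter ranges below are my assumptions, not the commenter's actual script:

```python
import random

def random_phrase(length=8, seed=None):
    rng = random.Random(seed)
    scale = [60, 62, 64, 65, 67, 69, 71, 72]  # C major, as MIDI note numbers
    phrase = []
    for _ in range(length):
        phrase.append({
            "note": rng.choice(scale),
            "duration": rng.choice([0.25, 0.5, 1.0]),    # in beats
            # Per-note modulation is what breaks the mechanical feel:
            "attack": round(rng.uniform(0.005, 0.2), 3), # envelope attack, seconds
            "cutoff": rng.randint(400, 8000),            # LP filter cutoff, Hz
        })
    return phrase
```

Giving each note its own envelope and filter settings emulates the varying attacks and timbres described above; rendering the result through synths is where the DAW/VST routing would come in.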
> there's no fundamental reason algorithmic composition systems could not include modulation parameters of all kinds (pitch/breath/effects/synthesizer controls/etc) in their output
There is such a reason - lack of training data. Very few high quality detailed MIDI samples exist to train machine learning models like AudioLM.
For state of the art in MIDI generation, take a look at what https://aiva.ai/ produces (it's free for personal use). There you can compare raw MIDI output to an automatically generated mp3 output (using "VSTs and samplers with routing and effects in place, then using some combination of genetic algorithms and other methods to 'tweak the knobs' in the search for something pleasing").
The mp3 version will sound much better than the raw MIDI, but (usually) significantly worse than music recorded in a studio and arranged/processed by a human.
As a classically-trained pianist who then got into electronica and synthesis, it was mind-blowing to me that people could wrangle expression and phrasing from a MIDI sequencer.
That particular niche has had some pretty amazing successes already. It's coming.
We can't produce arbitrary media streams with many "stack layers" of meaning and detail yet, but we can do a lot of specific instrumental transformations...
That’s what a musician does. They make short loops and loop them.
This reads like someone who knows sheet music and theory but does not listen to music. It’s repetition of short phrases over and over.
I’m not really sure what people expect of general AI trained on human-generated outputs. It can’t make up anything “net new”; it can only compose based upon what we feed it.
I like to think AI is just showing us how simple minded we really are and how our habit of sharing vain fairy tales about history makes us believe we’re masters of the universe.
Those models are not trained on short loops. They are trained on whole songs just like image generation models are trained on whole images. And yet they struggle to repeat sections, modulate to a different key, create bridges, intros and outros. After a few seconds of hallucinating a melodic line they simply abandon the idea and migrate to another one. There is no global structure whatsoever.
We're trying to train a full composer AI without allowing it to learn about the different instrument sections independently first. A human composer will have a good idea of the different parts and know how to merge them in harmony.
I think we might get better results training separate AI systems on percussions, strings, vocals etc. then somehow create connections between them so they learn together. A band AI if you will.
We could try a BERT for each, with the generator learning to output logical sequences of sounds instead of words.
Musicians don’t spit out an album in one sitting and they’re highly trained in theory. They get bored and tired of a process and take breaks. They come up with an album of loops composed together over time.
AI’s state will forever be constrained to the limits of human cognition and behavior, as that’s what it’s trained on.
I read published research all year. Circular reasoning. Tautology. It’s all over PhD theses.
There’s no “global structure” to humanity. Relativity is a bitch.
Seeing the world through the vacuum of embedded inner monologue ignores the constraints of the physical one. It’s exhausting dealing with the mentality some clean room idea we imagine in a hammock can actually exist in a universe being ripped asunder by entropy.
It’s living in memory of what we were sold; some ideal state. Very akin to religious and nation state idealism.
I think it's deeply depressing that AI has been sold as something even capable of modelling anything humans do; and quite depressing that this comment exists.
"AI" is just taking `mean()` over our choice of encodings of our choice of measurements of our selection of things we've created.
There is as much "alike humans" in patterns in tree bark.
AI is an embarrassingly dumb procedure, incapable of the most basic homology with anything any animal has ever done; us especially.
We are embedded in our environments, on which we act, and which act on us. In doing so we physically grow, mould our structure and that of our environment, and develop sensory-motor conceptualisations of the world. Everything we do, every act of the imagination or of movement of our limbs, is preconditioned-on and symptomatic-of our profound understanding of the world and how we are in it.
The idea that `mean(424,34324,223123,3424,....)` even has any relevance to us at all is quite absurd. The idea that such a thing might sound pleasant thru' a speaker, irrelevant.
This is a product of I don't know what. On the optimist side, a cultish desire to see Science produce a new utopia. On the pessimist side, a likewise delusional desire to see Humans as dumb machines.
I lack your confidence, and find it a bit religious.
> The idea that `mean(424,34324,223123,3424,....)` even has any revelance to us at all is quite absurd.
Most of what I say to anyone is exactly this.
When I'm about to give anyone any information, I look back at all of the relevant past information that I can recall (through word and sensory association, not by logic, unless I have a recollection of an associated internal or external dialog that also used logical rules.) I multiply those by strength of recollection and similarity of situation (e.g. can I create a metaphor for the current situation from the recalled one?). I take the mean, then I share it, along with caveats about the aforementioned strength of recollection and similarity of situation.
This is what it feels like I actually do. Any of these steps can be either taken consciously or by reflex. It's not hidden.
> I think it's deeply depressing that AI has been sold as something even capable of modelling anything humans do
This is a bizarre position. All computers ever do is model things that humans do. All a computer consists of is a receptacle for placing human will that will continue to apply that will after the human is removed. They are a way of crystallizing will in a way that you can sustain it with things (like electricity) other than the particular combination of air, water, food, space, pressure, temperature, etc. that is a person. An overflow drain is a computer that models the human will. An automatic switch/regulator is the basic electrical model of human will, and a computer is just a bunch of those stitched together in a complementary way.
You're an animal. You've no idea what you do, and you're using machines as a model. Likewise, in the 16th C. it was brass cogs; and in ancient Greece, air/fire/etc.
You're no more made of clay & god's breath than you are sand and electricity.
You're an oozing, growing, malleable organic organism being physiologically and dynamically shaped by your sensory-motor oozing. You're a mystery to yourself, and these self-reports, heavily coloured by the in-vogue tech, are not science; they're pseudoscience.
If you want to study how animals work, you'd need to study that. Not these impoverished metaphors that mystify both machines and men. No machine has ever acquired a concept through sensory-motor action, nor used one to imagine, nor thereby planned its actions. No machine is ever at play, nor has grown its muscles to be better at-play. No machine has, therefore, learned to play the piano. No machine has thought about food, because no machine has been hungry; no machine has cared, nor been motivated to care by a harsh environment.
An inorganic mechanism is nothing at all like an animal, and an algorithm over a discrete sequence of numbers with electronic semantics, is nothing like tissue development.
What you are doing is not something you can introspect. And you aren't really doing that. Rather, you've learned a "way of speaking" about machine action and are back-projecting it onto yourself. In this way, you're obliterating 95% of the things you are.
This isn't really responsive. Not only am I not using machines as any sort of model for human behavior, I'm trying to think about weird things you could do to a machine to make it ape a human.
> these self-reports, heavily coloured by the in-vogue tech are not science, they're pseudoscience.
I simply don't know what you're referring to. If you're referring to retrieving memories through associations, there's mountains of empirical evidence for that. If you're referring to wondering if I remember things, and being unsure of the information I'm recalling when I have less recall of that, or wondering if past situations compare well to current situations, well you got me. It's my personal belief that conscious thought is an epiphenomenon that is a rationalization of decisions already made.
But the rest of this is nonsense. Vivid imagery is not an argument for exceptionalism, no matter how much I say things drip or ooze. This is just association in action. You're trying to create a distinction for life (or rather, what you recognize as life): life oozes and has viscera, so using a bunch of words that feel wet and organ-y can substitute for reason contra the robots.
This one suffers from the same problem that previous audio generation methods had. It correctly mimics the piano timbre but there is no global structure in the generated melodic lines. There is some style imitation but no melodic coherence.