I think audio models will be much more sensitive to input issues than text or image models. Humans are very good at picking up nuances in audio and also process it very quickly. I wonder how far we are from being able to manipulate the emotional quality of how something sounds. In my opinion, that's the Turing test for any audio generative AI. Native speakers will immediately know when something is AI generated or adjusted, for the same reason they immediately detect accents.
I am curious what kind of audio repair AI models are being worked on to help make outputs sound more natural. This research feels like progress towards that goal as well.
Possibly weird question, but have there been any attempts at modeling this sort of audio model where tokens aren't defined by the audio itself, but instead by the movement of the tongue/mouth/lips/vocal cords, etc.?
I think this is off-topic since it's not discussing the content of the post. But I agree it's worth discussing.
I don't agree. I think ChatGPT may have been part of the editing process, but not the primary draft. The intro is complex in a way I haven't ever noticed in ChatGPT output. I have a suspicion that participating in academic-ish/research discussions pushes you to use very specific language. That matches the goal ChatGPT was trained toward: producing neutral, clinical-sounding answers.
I had some existential concerns about not knowing whether I was just AI, and read similar sentiment from others in some other AI posts. ChatGPT describes its personality using the same words people used to describe my work in the past. There's no meaningful insight to gain from that parallel, other than that it's good at mirroring the sentiment of the user. [1]
I put my own past HN comments into the AI detectors from OpenAI and GPTZero, and I got a few false positives ranging from "possibly" to "completely" written by AI. The comments are from years ago. In case you're curious, I didn't use AI in any part of writing and editing this comment. I'm not using Grammarly or even spellcheck, so excuse any mistakes you find. I've been thinking a lot about the 1995 Ghost in the Shell: "What if a cyber brain could possibly generate its own ghost, create a soul all by itself? And if it did, just what would be the importance of being human then?"
Those pics are out of order. I asked the house question first, then about sharing traits, then about itself, then about fictional characters. I didn't intentionally try to guide its answers.
Dude, even that comment could sound a bit ChatGPT-ish, but as you said, it's because of the tone and style expected when constructing an argument or presenting a topic to a certain group. It's mostly how you flow from one idea to the next, logically and in sequence, without going back and forth, using correct punctuation.
That, and just like with culture: if you interact a lot with a group (and now add similar-sounding LLMs to the mix), you end up absorbing their characteristics, then imitating some without a second thought. That's humans. Sorry if you wanted to be an AI.
It's interesting: when I've asked for style corrections on passages from past college essays or formal letters, ChatGPT sometimes makes a nice improvement and I can learn a thing or two from it, but it always ends up adding formalities and flow modifications that get in the way instead of being precise.
The difference between those thought experiments and the LLMs I'm working with is that I can test LLMs and myself empirically. I don't feel I can tell human and AI writing apart with more than 70% confidence, and as these models get better, I feel that's going to approach chance, especially for writing on topics I'm not familiar with. Since the simulation hypothesis arguments aren't directly testable, they don't feel scientific. I also have a personal problem with the implied existence of a conscious greater power. So it doesn't give me the same existential panic that thinking about the future of AI-generated content does. I never took the time to read the original paper by Bostrom; thank you for the link.
Edit: I should have used the word disprovable instead of testable.
In the same vein, you could argue that you can't empirically test that every proton in your hand has, say, three quarks or whatever, since you'd lose your hand. You might then say that all knowledge about the protons in your particular hand isn't directly testable and therefore doesn't feel scientific. A lot of things aren't directly testable but can still be scientific (as with the simulation argument). BTW, I haven't read the full paper either.
Good catch! The author's name "Heorhii Skovorodnikov" sounds Ukrainian, and he describes himself as a graduate of NYU Abu Dhabi on his GitHub page. It looks like he is not a native English speaker, which might explain the need to use ChatGPT in the first place. Verbose-sounding explanations are becoming kind of a dead giveaway of ChatGPT use in the wild.