
What would be really cool is if somebody figured out how to do embeddings -> text.


Hmm, as a very stupid first pass...

0. Generate an embedding of some text, so that you have a known good embedding, this will be your target.

1. Generate an array of random tokens the length of the response you want.

2. Compute the embedding of this response.

3. Pick a random sub-section of the response and randomize the tokens in it again.

4. Compute the embedding of your new response.

5. If the new embedding is closer to the target, keep your random changes; otherwise discard them. Go back to step 2.

6. Repeat this process until going back to step 2 stops improving your score. You'll probably want to shrink the size of the sub-section you're randomizing as your computed embedding gets closer to your target embedding. You might also be able to be cleverer with some kind of masking strategy. Say the first half of your response text was already the true text behind the target embedding: an ideal randomizer would notice that randomizing the first half almost always makes the result worse, and so would target the second half more often (I'm hoping embeddings work like this?).

7. Do this N times and use an LLM to score the results, keeping only the best one. I expect that 99.9% of the time you're basically producing adversarial examples with this strategy.

8. Feed this last result into an LLM and ask it to clean it up.
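The loop in steps 1-6 can be sketched as a toy hill climb. Everything here is made up for illustration: `embed` is a stand-in (each token id maps to a fixed random vector, and a "text" embeds to the mean of its token vectors), not a real embedding model, and the LLM scoring/cleanup steps are omitted.

```python
import random
import math

# Toy stand-in for a real embedding model. A real system would call a
# sentence-embedding model here instead; `embed` is an assumption.
DIM = 32
VOCAB = 100
rng = random.Random(0)
TOKEN_VECS = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def embed(tokens):
    # "Embedding" of a token sequence = mean of its per-token vectors.
    return [sum(TOKEN_VECS[t][d] for t in tokens) / len(tokens)
            for d in range(DIM)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def invert(target_emb, length, steps=2000, seed=1):
    r = random.Random(seed)
    # Step 1: random token array of the desired length.
    tokens = [r.randrange(VOCAB) for _ in range(length)]
    best = cosine(embed(tokens), target_emb)
    for _ in range(steps):
        # Step 3: re-randomize a random sub-section.
        i = r.randrange(length)
        j = min(length, i + r.randint(1, max(1, length // 4)))
        candidate = tokens[:]
        for k in range(i, j):
            candidate[k] = r.randrange(VOCAB)
        # Steps 4-5: recompute the embedding, keep only improvements.
        score = cosine(embed(candidate), target_emb)
        if score > best:
            tokens, best = candidate, score
    return tokens, best

# Step 0: embed some known text to get a target.
secret = [rng.randrange(VOCAB) for _ in range(8)]
target = embed(secret)
recovered, score = invert(target, len(secret))
```

Because only improvements are kept, the similarity score is monotonically non-decreasing; with a real (non-linear) embedding model the search space would be far rougher, which is presumably where the adversarial-example concern in step 7 kicks in.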


We'd be happy to sponsor research on this topic. If interested, email me.


Is it not possible? I'm not that familiar with the topic. Doing some sort of averaging over a large corpus of separate texts could be interesting and probably would also have a lot of applications. Let's say that you are gathering feedback from a large group of people and want to summarize it in an anonymized way. I imagine you'd need embeddings with a somewhat large dimensionality though?
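The averaging idea is just a centroid over the individual embedding vectors, which could then (if embeddings -> text worked) be decoded into one summary. A minimal sketch, with made-up vectors and a hypothetical `mean_embedding` helper:

```python
# Toy sketch: average several already-computed feedback embeddings into
# one centroid. The vectors below are invented for illustration; a real
# system would get them from an embedding model.
def mean_embedding(vectors):
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]

feedback = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.0, 0.2],
]
centroid = mean_embedding(feedback)  # centroid is approximately [0.8, 0.1, 0.1]
```

No individual response is recoverable from the centroid alone, which is what would make this attractive for anonymized feedback summaries, assuming the averaged point still decodes to something meaningful.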



I wonder if someone has already tried to do that. Though this might go in a similar direction: https://arxiv.org/abs/1711.00043


That's ChatGPT.




