> As docs site owners, I wonder if we should start freely providing embeddings for our content to anyone who wants them, via REST APIs or well-known URIs. Who knows what kinds of cool stuff our communities can build with this extra type of data about our docs?
Interesting idea. You'd have to specify the exact embedding model used to generate an embedding, right? Is there a well-understood convention for such identification, like say model_name:model_version:model_hash or something? For technical docs (obviously a very broad field), is there an embedding model, or a small number of them, so widely used or so obviously suitable that a site owner could choose one and reasonably expect that publishing embeddings for their docs generated with that model would be useful to others? (Naive questions, I am not embedded in the field.)
It seems like sharing the text itself would be a better API, since it lets API users calculate their own embeddings easily. This is what the crawlers for search engines do. If they use embeddings internally, that’s up to them, and it doesn’t need to be baked into the protocol.
Could we work toward standardization at some point? Obviously, there will always be a newer model. I just hate that all the embedding work I did was with a now-deprecated OpenAI model. At the very least, individual providers should have an interest in ensuring this across their own model releases. Some trick like Matryoshka embeddings could ensure that embeddings from newer models nest or work within the space of older models, preserving some form of comparability or alignment.
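For anyone unfamiliar with the Matryoshka idea, here is a minimal sketch of the part that already works today within a single model: if a model was trained Matryoshka-style, you can keep only the leading dimensions of a vector and renormalize. The function name `truncate_and_normalize` is made up for illustration, and this does not show cross-model compatibility, which remains speculative.

```python
import math


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale to unit length.

    Assumes the vector came from a Matryoshka-trained model, where leading
    dimensions carry the most information. For ordinary models this
    truncation is not expected to preserve quality.
    """
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


short = truncate_and_normalize([3.0, 4.0, 12.0], 2)
```

The shortened vectors are still only comparable to other vectors from the same model truncated to the same length.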
Yeah, this is the main issue with the suggestion. Embeddings can only be compared to each other if they are in the same space (e.g., generated by the same model). Providing embeddings of a specific kind would require users to use the same model, which can quickly become problematic if you're using a closed-source embedding model (like OpenAI's or Cohere's).
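To make the same-space constraint concrete, here is a rough sketch of what a consumer of published embeddings would have to do before comparing anything. The `PublishedEmbedding` type and the `model_id` strings are hypothetical, not an existing convention; the point is only that the similarity check has to refuse vectors from different models.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PublishedEmbedding:
    model_id: str              # hypothetical identifier, e.g. "name:version:hash"
    vector: tuple[float, ...]


def cosine_similarity(a: PublishedEmbedding, b: PublishedEmbedding) -> float:
    """Cosine similarity, refusing to compare vectors from different models."""
    if a.model_id != b.model_id:
        raise ValueError("embeddings come from different models; not comparable")
    if len(a.vector) != len(b.vector):
        raise ValueError("embeddings have different dimensions")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm_a = math.sqrt(sum(x * x for x in a.vector))
    norm_b = math.sqrt(sum(y * y for y in b.vector))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


doc = PublishedEmbedding("example-model:v1:abc123", (1.0, 0.0))
query = PublishedEmbedding("example-model:v1:abc123", (1.0, 1.0))
score = cosine_similarity(doc, query)
```

With a closed-source model, the `query` side is the hard part: the user can only produce a vector with a matching `model_id` by calling that provider.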