Embedding

"Embeddings", in the AI context, are numerical representations of real-world objects that machine learning models use to capture complex data and relationships.

Unlike traditional "records" in a database, which store only the "value" of the data, embeddings also carry a lot of information about "relationships" between data.

Value

Embeddings are used by machine learning to represent different "things", and there are different types of embeddings, e.g. for images, for text, for audio, for relationships, etc.

Embeddings are just a series of numbers, so if the "things" they represent are different, their series of numbers will be different too.

For example, there can be TWO embeddings for the word "play":

  1. one embedding for "play" as in "play with a toy"
  2. one embedding for "play" as in "going to a play"

The more numbers there are in the series, the more information we have about that "thing".
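The idea above can be sketched in a few lines of Python. The vectors here are tiny and the numbers are made up purely for illustration; a real model would produce hundreds of learned dimensions.

```python
# Hypothetical 4-dimensional embeddings for two senses of the word "play".
# The values are illustrative only -- a real embedding model would produce
# hundreds of dimensions learned from data.
play_toy = [0.91, 0.12, -0.33, 0.05]      # "play" as in "play with a toy"
play_theatre = [0.10, 0.88, 0.41, -0.27]  # "play" as in "going to a play"

# The two senses get different series of numbers, so they are distinguishable.
print(play_toy == play_theatre)  # → False
```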

Relationship

By comparing the embeddings of two things (their two series of numbers) we can tell how "similar" those two things are to each other; that is, we can discover the relationship between them.

Machine learning can operate on multiple embeddings by putting many of them (many series of numbers) from different "things" into a matrix.
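Stacking embeddings into a matrix can be sketched as follows: each row is one "thing", and multiplying the matrix by its own transpose produces every pairwise comparison at once. The vectors are the same illustrative made-up values as above.

```python
# Each row of the matrix is one embedding (illustrative values, not from
# a real model).
embeddings = {
    "cat":    [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car":    [0.1, 0.2, 0.9],
}
matrix = list(embeddings.values())  # 3 rows x 3 numbers

# Pairwise dot products: equivalent to multiplying the matrix by its
# transpose, giving one similarity score for every pair of "things".
sims = [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
        for row_i in matrix]

for name, row in zip(embeddings, sims):
    print(name, [round(s, 2) for s in row])
```

In practice a library such as NumPy would do this multiplication in one call, but the principle is the same.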

Model

We don't normally use the large models (e.g. Llama) for generating embeddings, since they are slow and the generated embeddings are very large. Instead, a smaller, more specialised model (e.g. Nomic for text) is used.
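As a concrete sketch, a specialised model such as nomic-embed-text can be queried through a locally running Ollama server. The server URL, endpoint, and model name below are assumptions about a typical local setup; adjust them for yours. The network call itself is left commented out so the snippet stands alone.

```python
import json
from urllib import request

# Assumed local Ollama setup: server on the default port, with the
# nomic-embed-text model pulled beforehand (e.g. `ollama pull nomic-embed-text`).
payload = {"model": "nomic-embed-text", "prompt": "play with a toy"}
req = request.Request(
    "http://localhost:11434/api/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment when a server is actually running:
# with request.urlopen(req) as resp:
#     embedding = json.loads(resp.read())["embedding"]
#     print(len(embedding))  # the series of numbers representing the prompt
```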

More details: