You can dive deeper into this end-to-end process by reading [A Deep Dive into NLP Tokenization and Encoding with Word and Sentence Embeddings – Data Jenius](https://datajenius.com/2022/03/13/a-deep-dive-into-nlp-tokenization-encoding-word-embeddings-sentence-embeddings-word2vec-bert/), or get a quick overview through this [Reddit comment](https://www.reddit.com/r/learnmachinelearning/comments/vkrzoo/does_anyone_have_resources_about_how_words_are/idr8a7o/).
We're fortunate that we can run this whole process without working through that complexity ourselves (though you should!). Many ready-made embedding models turn text into numerical embeddings, in the same way that `gpt-xx-xx` and `claude-sonnet-3.5` are ready-made, pre-trained LLMs.
Once again, the one you select depends on your overall search and RAG strategy. The two main types of vector representation to weigh are:
- **Dense Vectors** #flashcard
- Description: continuous, fixed-dimensional embeddings where each dimension contains information about the input data. Dense vectors are compact and encode semantic relationships between inputs.
- **Strengths**:
- Great for semantic search and understanding.
- Effective at handling synonyms and context-aware queries.
- **Limitations**:
- Can be computationally expensive to train and query.
- May not work well with highly sparse, keyword-heavy, or domain-specific data without fine-tuning.
- **Examples**: OpenAI's `text-embedding-ada-002`, Sentence Transformers (e.g., SBERT)
<!--ID: 1751507777530-->
- **Sparse Vectors** #flashcard
- Description: high-dimensional vectors where most values are zeros. They represent data in a keyword or feature-based manner, making them explicitly interpretable and directly tied to input tokens or features.
- **Strengths**:
- Excellent for **exact matches** and keyword-heavy domains (e.g., legal, scientific texts).
- Easy to interpret and debug.
- Low computational cost for indexing.
- **Limitations**:
- Lack semantic understanding; struggle with synonyms and paraphrased queries.
- High-dimensional nature may require careful storage optimizations.
- **Examples**: [[BM25]], [[Term Frequency-Inverse Document Frequency (TF-IDF)]], [[Apache Lucene]] (a search library that implements both scoring schemes)
<!--ID: 1751507777532-->
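To make the trade-off above concrete, here is a minimal sketch in plain Python. The sparse side uses raw term counts (a simplification of TF-IDF/BM25, which also weight terms by rarity). The dense side uses hand-written, hypothetical 3-dimensional vectors in `TOY_DENSE`; a real embedding model would produce hundreds or thousands of dimensions, and none of these numbers come from an actual model.

```python
import math
from collections import Counter
from typing import Dict, List


def sparse_vector(text: str) -> Dict[str, int]:
    """Term-count sparse vector: only non-zero dimensions (tokens) are stored."""
    return dict(Counter(text.lower().split()))


def sparse_cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse vectors keyed by token."""
    dot = sum(weight * b[token] for token, weight in a.items() if token in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical dense embeddings, invented for illustration only.
TOY_DENSE = {
    "cheap car": [0.9, 0.8, 0.1],
    "affordable automobile": [0.85, 0.75, 0.15],
    "banana bread recipe": [0.05, 0.1, 0.95],
}

query = "cheap car"

# Sparse: shared tokens score well, synonyms share no tokens at all.
sparse_exact = sparse_cosine(sparse_vector(query), sparse_vector("cheap car for sale"))
sparse_synonym = sparse_cosine(sparse_vector(query), sparse_vector("affordable automobile"))

# Dense: similar meaning sits close together even with no shared tokens.
dense_synonym = dense_cosine(TOY_DENSE[query], TOY_DENSE["affordable automobile"])
dense_unrelated = dense_cosine(TOY_DENSE[query], TOY_DENSE["banana bread recipe"])

print(f"sparse, exact keywords: {sparse_exact:.3f}")
print(f"sparse, synonyms:       {sparse_synonym:.3f}")
print(f"dense, synonyms:        {dense_synonym:.3f}")
print(f"dense, unrelated:       {dense_unrelated:.3f}")
```

The sparse score for the synonym document is exactly zero because the two texts share no tokens, while the toy dense vectors still place them close together. This is also why sparse vectors are easy to debug: every non-zero dimension maps directly to a token you can read.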
# Example embedding models:
- [[Cohere Embed]] v3
- [[bge-large-en]]
- [[ada-2]]
- [[Sentence Transformers]]
- Google's [[PaLM2 Gecko-001]]
To compare how current embedding models perform, see the [MTEB Leaderboard - a Hugging Face Space by mteb](https://huggingface.co/spaces/mteb/leaderboard).