You can dive deeper into this end-to-end process by reading [A Deep Dive into NLP Tokenization and Encoding with Word and Sentence Embeddings – Data Jenius](https://datajenius.com/2022/03/13/a-deep-dive-into-nlp-tokenization-encoding-word-embeddings-sentence-embeddings-word2vec-bert/), or get a quick overview through this [Reddit comment](https://www.reddit.com/r/learnmachinelearning/comments/vkrzoo/does_anyone_have_resources_about_how_words_are/idr8a7o/). We're in a fortunate position where we can do this whole process without diving into the complexity (but, you should!). There are many ready-made embedding models that turn text into numerical embeddings, just like how `gpt-xx-xx` and `claude-sonnet-3.5` are pre-trained LLMs. Once again, the one you select will be based on your overall search and RAG strategy, but here are a few you can consider: - **Dense Vectors** #flashcard - Description: continuous, fixed-dimensional embeddings where each dimension contains information about the input data. Dense vectors are compact and encode semantic relationships between inputs. - **Strengths**: - Great for semantic search and understanding. - Effective at handling synonyms and context-aware queries. - **Limitations**: - Can be computationally expensive to train and query. - May not work well with highly sparse, keyword-heavy, or domain-specific data without fine-tuning. - **Examples**: OpenAI’s Ada-002. Sentence Transformers (e.g., SBERT) <!--ID: 1751507777530--> - **Sparse Vectors** #flashcard - Description: high-dimensional vectors where most values are zeros. They represent data in a keyword or feature-based manner, making them explicitly interpretable and directly tied to input tokens or features. - **Strengths**: - Excellent for **exact matches** and keyword-heavy domains (e.g., legal, scientific texts). - Easy to interpret and debug. - Low computational cost for indexing. - **Limitations**: - Lack semantic understanding; struggles with synonyms and paraphrased queries. - High-dimensional nature may require careful storage optimizations. - **Examples:** [[BM25]], [[Term Frequency-Inverse Document Frequency (TF-IDF)]], [[Apache Lucene]] <!--ID: 1751507777532--> # Examples: - [[Cohere Embed]] v3 - [[bge-large-en]] - [[ada-2]] - [[Sentence Transformers]] - Google's [[PaLM2 Gecko-001]] [MTEB Leaderboard - a Hugging Face Space by mteb](https://huggingface.co/spaces/mteb/leaderboard)