# Overview
Retrieval Augmented Generation (RAG) is an approach for improving the accuracy and relevance of [[Large Language Models (LLMs)]] by using an additional, specific set of documents / text / data (e.g., a company's policy manual, a company's website). After the user submits a prompt, it is *augmented* with the most relevant portions of the additional data. The combined package of prompt and retrieved data is then sent to the LLM, so the LLM has the additional context when answering.
This page focuses purely on text-based RAG. For RAG on other types of data, see [[Multi-modal RAG]]. It also focuses on RAG using embeddings and vector storage; [[Retrieval Augmented Generation (RAG) with Knowledge Graphs]] is a separate topic.
## Diagram
### RAG Reference Architecture
![[Retrieval Augmented Generation (RAG) 2024-12-05 07.08.28.excalidraw.svg]]
%%[[Retrieval Augmented Generation (RAG) 2024-12-05 07.08.28.excalidraw|🖋 Edit in Excalidraw]]%%
### RAG Components
![[Retrieval Augmented Generation (RAG) 2025-03-25 19.59.07.excalidraw.svg]]
%%[[Retrieval Augmented Generation (RAG) 2025-03-25 19.59.07.excalidraw|🖋 Edit in Excalidraw]]%%
# Key Considerations
## Data Transformation for RAG
### Context and Content Enrichment #flashcard
#### Contextual Chunk Headers (CCH)
A method for creating headers at the start of each chunk that contain higher-level context. Types of context you could include in the header are:
- Document summary
- Section / sub-section titles
#### Document Augmentation through Question Generation
Augment each document (or each document chunk) by adding questions that it answers directly into its text. The questions are generated by an LLM and are added to help improve retrieval results.
<!--ID: 1751507777396-->
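The two enrichment methods above can be combined in one step before embedding. The following is a minimal sketch, not a reference implementation: `enrich_chunk` and its parameters are hypothetical names, and the questions are assumed to have been generated by an LLM elsewhere and passed in as plain strings.

```python
def enrich_chunk(chunk, doc_title, section_path, questions=()):
    """Return the chunk text with a contextual header and optional questions.

    Assumption: `questions` were produced by an LLM in a separate step.
    """
    # Contextual chunk header: document title plus the section hierarchy.
    header = f"Document: {doc_title}\nSection: {' > '.join(section_path)}"
    parts = [header, chunk]
    if questions:
        # Question augmentation: append the questions this chunk can answer.
        question_block = "\n".join(f"- {q}" for q in questions)
        parts.append("Questions this chunk answers:\n" + question_block)
    return "\n\n".join(parts)


enriched = enrich_chunk(
    chunk="Employees accrue 1.5 vacation days per month.",
    doc_title="Policy Manual",
    section_path=["Benefits", "Paid Time Off"],
    questions=["How many vacation days do employees accrue?"],
)
print(enriched)
```

The enriched text is what gets embedded; the original chunk can still be stored separately if the header should not be shown to the LLM.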
### Indexing
#### Hierarchical Indexing
In addition to the traditional storage of chunks, also store an LLM-generated summary. The summary can be at any level larger than the chunk (document, section, page, etc.). Then, first search the summary vector store. Use the summary results to retrieve relevant chunks by filtering on document / page / etc.
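The two-stage lookup described above might look like the following sketch. It assumes small in-memory lists standing in for the summary and chunk vector stores, and hand-written toy vectors standing in for real embeddings; `hierarchical_search` and the dictionary keys are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hierarchical_search(query_vec, summaries, chunks, top_docs=1, top_chunks=2):
    # Stage 1: rank the document-level summaries against the query.
    ranked_docs = sorted(
        summaries, key=lambda s: cosine(query_vec, s["vector"]), reverse=True
    )
    doc_ids = {s["doc_id"] for s in ranked_docs[:top_docs]}
    # Stage 2: search only the chunks that belong to the selected documents.
    candidates = [c for c in chunks if c["doc_id"] in doc_ids]
    ranked_chunks = sorted(
        candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )
    return ranked_chunks[:top_chunks]


summaries = [
    {"doc_id": "hr", "vector": [1.0, 0.0]},
    {"doc_id": "it", "vector": [0.0, 1.0]},
]
chunks = [
    {"doc_id": "hr", "text": "Vacation policy", "vector": [0.9, 0.1]},
    {"doc_id": "hr", "text": "Dress code", "vector": [0.6, 0.4]},
    {"doc_id": "it", "text": "Password rules", "vector": [0.1, 0.9]},
]
results = hierarchical_search([1.0, 0.0], summaries, chunks)
print([r["text"] for r in results])
```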
#### Multi-Representation Indexing
### Transform into LLM-Friendly Formats (e.g., HTML -> Markdown)
Convert HTML to Markdown. The [Textractor - txtai](https://neuml.github.io/txtai/pipeline/data/textractor/) package may be helpful.
### Miscellaneous
| Transformation | Purpose |
| --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Preserve headings** (`#`, `##`, etc.) | Help semantic chunking and create better retrieval anchors. |
| **Normalize whitespace and line breaks** | Remove or collapse newlines that are arbitrary (e.g., due to PDF line wrapping). Keep newlines that convey structure (e.g., Markdown, bullet lists, paragraphs, code). |
| **Convert lists to plain bullets** | e.g., `- item` or `* item` to `• item`, or retain if consistent. |
| **Convert markdown tables to text tables** | Markdown tables aren’t always parsed well by embedding models. |
| **Remove image references** (`![alt](url)`) | Unless the alt-text is meaningful, this adds noise. |
| **Flatten links** (`[text](url)` → `text`) | URLs usually don’t add semantic value in embeddings. |
| **Strip HTML/JSX blocks** (unless needed) | Raw HTML often adds noise. |
| **Decode or remove inline code** (`` `code` ``) depending on your domain | Retain only if the LLM needs to see actual code. |
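
A few of the transformations in the table can be approximated with regular expressions. This is a rough sketch under simple assumptions (well-formed Markdown, no nested brackets); `clean_markdown` is a hypothetical helper, and the naive HTML pattern will also strip any other text that sits between `<` and `>`.

```python
import re


def clean_markdown(text):
    """Apply a handful of the cleanup steps from the table above."""
    # Remove image references first, so the link rule does not match them.
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)
    # Flatten links: keep the link text, drop the URL.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Strip HTML tags (naive: removes anything between < and >).
    text = re.sub(r"<[^>]+>", "", text)
    # Drop trailing spaces and tabs at line ends.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of blank lines while keeping paragraph breaks.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = (
    "# Title\n\n\n\n"
    "See [the docs](https://example.com) for details.\n"
    "![logo](logo.png)\n"
    "<div>Raw</div>\n"
)
cleaned = clean_markdown(sample)
print(cleaned)
```

Headings are left untouched, which matches the first row of the table.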
### Chunking
- Consider context you'd like captured
- Consider token limits
- Can pair the chunks with summaries
A list of detailed approaches can be found in: [[Chunking for RAG]].
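As a baseline to compare the detailed approaches against, here is a minimal fixed-size chunker with overlap. It is a sketch only: `chunk_words` is a hypothetical helper, and it counts whitespace-separated words as a rough proxy for tokens, which is an assumption rather than how any particular tokenizer behaves.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of `chunk_size` words sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


text = " ".join(str(i) for i in range(10))
print(chunk_words(text, chunk_size=4, overlap=1))
```

The overlap keeps some shared context between neighboring chunks, which addresses the first bullet above; the chunk size addresses the token limit.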
## Feature Generation
### Synthetic Label Generation
- **Context Setting**: This initial step establishes a domain-specific context to guide the data generation. For example, using a prompt such as “Imagine you are a movie reviewer” helps set the scene for generating data relevant to movie reviews.
- **Data Generation Prompt**: After setting the context, the LLM is given specific instructions about the desired output. This includes details on the style of text (e.g., movie review), the sentiment (positive or negative), and any constraints like word count or specific terminology. This ensures the synthetic data aligns closely with the requirements of the recommendation system, making it useful for enhancing its accuracy and effectiveness.
### Classification of Documents and Text
Use [[Sentence Transformers]] (e.g., BERT-based models) to classify documents and text.
## Embedding
[[Embedding Models]]
## Prompt / Query Refinement
### Prompt Engineering
[[Prompt Engineering]] - [[Zero-shot Classification]] vs. [[Few-shot Classification]]
### Query Translation #flashcard
![[2025-01-30_Retrieval Augmented Generation (RAG)-1.png]]
#### Query Rewriting
Rewrite the query to make it more specific and detailed.
#### Multi-Query Decomposition
Break down a question into multiple different questions that may work better when compared in the vector space.
##### Sub-query Decomposition
Decompose a question into sub-problems. Answer the sub-problems in order, feeding the answer from each previous sub-problem into the context for the next.
#### Step-back Prompting
Generate broader queries to help retrieve relevant background information.
#### Hypothetical Document Embedding (HyDE)
Use an LLM to turn the question into a "hypothetical document" that answers it. Embed the hypothetical document and use that embedding for the search.
<!--ID: 1751507777398-->
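The HyDE flow described above can be sketched as a small function. Everything here is an assumption for illustration: `generate`, `embed`, and `search` are hypothetical callables standing in for an LLM call, an embedding model, and a vector store query, and the prompt wording is made up.

```python
def hyde_search(question, generate, embed, search, k=3):
    """Search with the embedding of a hypothetical answer, not the question.

    generate: callable(prompt) -> text        (hypothetical LLM call)
    embed:    callable(text) -> vector        (hypothetical embedding model)
    search:   callable(vector, k) -> results  (hypothetical vector store)
    """
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {question}\nPassage:"
    )
    hypothetical_doc = generate(prompt)
    # The hypothetical document, not the question, drives the search.
    return search(embed(hypothetical_doc), k)


# Stand-in components so the sketch runs without any external service.
fake_generate = lambda prompt: "fake passage"
fake_embed = lambda text: [float(len(text))]
fake_search = lambda vector, k: [("doc", vector[0])][:k]

hits = hyde_search("What is the refund window?", fake_generate, fake_embed, fake_search)
print(hits)
```

Passing the components in as callables keeps the flow independent of any specific LLM or vector store.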
### Query Construction #flashcard
Translate the input question into a query that fits the data source used for RAG. For example, if a relational DB is being used, this means text-to-SQL.
<!--ID: 1751507777401-->
## Routing #flashcard
- Logical Routing - route a question to a different set of documents based on the question
- Semantic Routing - route a question based on semantic similarity (i.e., pick a specific prompt from a group of prompt options based on its similarity to the question)
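Semantic routing can be illustrated with a toy example. This sketch uses a bag-of-words count as a stand-in for a real embedding model, and the `ROUTES` descriptions are invented for the example; only the idea of picking the most similar option comes from the text above.

```python
import math
from collections import Counter

# Hypothetical routes: each name maps to a short description to compare against.
ROUTES = {
    "hr": "vacation leave benefits payroll policy",
    "it": "password laptop vpn software access",
}


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def route(question):
    """Return the name of the route most similar to the question."""
    q = embed(question)
    return max(ROUTES, key=lambda name: cosine(q, embed(ROUTES[name])))


print(route("How do I reset my vpn password"))
print(route("How many vacation days do I get"))
```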
## Retrieval Approaches
### Keyword Search
[[BM25]]
### Semantic Search (Embedding-Based Retrieval)
In semantic search, you find results based on similar *semantic* meaning rather than exact keyword matches. This means converting the query and the text into numerical embeddings. Then, you can measure similarity using your preferred method for [[vector search]].
<!--ID: 1751507777404-->
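For semantic search, a brute-force top-k lookup with cosine similarity is the simplest form of vector search. The sketch below uses hand-written three-dimensional vectors in place of real embeddings, and `semantic_search` is a hypothetical helper; a real system would use an embedding model and a vector store.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec, index, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


# Toy index: text -> made-up embedding.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.0],
    "return window": [0.8, 0.2, 0.1],
}
top = semantic_search([1.0, 0.0, 0.0], index)
print(top)
```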
#### Contextual Compression
Uses semantic search to retrieve an initial set of results. Then, uses an LLM-based "contextual retriever" to extract the most relevant information from each chunk. The term is overly complicated in my opinion: it just uses an LLM to keep only the relevant context from the initial search, so "compression" doesn't really apply.
### Fusion / Hybrid Retrieval
Combine keyword and semantic search with a weighted approach: normalize each method's scores, then merge them into a single weighted score that is used to select results.
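One way to build that single score is min-max normalization followed by a weighted sum. This is a sketch under assumptions: `alpha` (the weight given to the semantic score) and the function names are hypothetical, and the input scores are made-up numbers rather than real BM25 or embedding outputs.

```python
def min_max(scores):
    """Scale a {doc: score} mapping to the range 0..1."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(keyword_scores, semantic_scores, alpha=0.5):
    """Rank documents by alpha * semantic + (1 - alpha) * keyword."""
    kw = min_max(keyword_scores)
    sem = min_max(semantic_scores)
    fused = {
        doc: alpha * sem.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in set(kw) | set(sem)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


keyword_scores = {"a": 12.0, "b": 3.0, "c": 0.0}   # e.g., keyword-search scores
semantic_scores = {"a": 0.2, "b": 0.9, "c": 0.5}   # e.g., cosine similarities
ranking = hybrid_rank(keyword_scores, semantic_scores)
print(ranking)
```

Normalizing first matters because keyword scores and cosine similarities live on different scales; without it, one method would dominate the weighted sum regardless of `alpha`.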
## Ranking and Reranking
Tools include [Jina Reranker](https://jina.ai/reranker/) and Cohere.
### Score Generation
Use a method, such as cross-encoders, for scoring document relevance.
#### Dartboard Retrieval
A retrieval approach that scores for "information gain", so that the top-k documents do not all offer the same information.
#### Score Prediction from LLM
Use prompts to ask an LLM to rate document relevance.
### Knowledge Distillation
[Knowledge Distillation Paper](https://arxiv.org/pdf/2405.00338)
## Managing Retrieval Results
### Filtering #flashcard
#### Content Filtering
Remove results that don't match specific content criteria or essential keywords.
#### Document Relevancy Checks
Use an LLM to score the relevancy of a document. Filter out irrelevant documents.
#### Metadata Filtering
Filter out documents based on attributes like date, source, author, or document type.
#### Diversity Filtering
Compare retrieved documents to filter out highly similar documents.
<!--ID: 1751507777406-->
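Diversity filtering, as described above, can be sketched as a greedy pass over the ranked results. This example assumes Jaccard overlap of word sets as the similarity measure and a made-up `threshold`; embedding-based cosine similarity would be a natural substitute. Both function names are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def diversity_filter(ranked_docs, threshold=0.8):
    """Keep a document only if it is not too similar to any kept document.

    Assumes `ranked_docs` is ordered best-first, so the highest-ranked
    member of each near-duplicate group is the one that survives.
    """
    kept = []
    for doc in ranked_docs:
        if all(jaccard(doc, other) < threshold for other in kept):
            kept.append(doc)
    return kept


docs = [
    "Refunds are issued within 30 days",
    "refunds are issued within 30 days",
    "Shipping takes 5 business days",
]
filtered = diversity_filter(docs)
print(filtered)
```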
### Adding Additional Context Post-Retrieval
#### Relevant Segment Extraction (RSE)
After a relevant chunk is identified and returned, add the surrounding chunks for additional context before sending to the LLM. This approach considers the relevance of the surrounding chunks to find the right ones to combine.
This general approach can also be applied at the sentence level, as opposed to the chunk level.
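A simplified version of this idea expands outward from the retrieved chunk while the neighbors remain relevant enough. This is a sketch, not the full RSE algorithm: `extract_segment`, `min_score`, and `max_expand` are hypothetical, and the per-chunk relevance scores are assumed to be computed elsewhere.

```python
def extract_segment(chunks, scores, hit_index, min_score=0.3, max_expand=2):
    """Merge the hit chunk with adjacent chunks that are relevant enough.

    chunks: chunk texts in document order
    scores: relevance score for each chunk (same order, computed elsewhere)
    """
    start = end = hit_index
    # Grow to the left while the previous chunk clears the relevance bar.
    while start > 0 and hit_index - start < max_expand and scores[start - 1] >= min_score:
        start -= 1
    # Grow to the right under the same conditions.
    while end < len(chunks) - 1 and end - hit_index < max_expand and scores[end + 1] >= min_score:
        end += 1
    return " ".join(chunks[start:end + 1])


chunks = ["A", "B", "C", "D", "E"]
scores = [0.1, 0.5, 0.9, 0.4, 0.2]
segment = extract_segment(chunks, scores, hit_index=2)
print(segment)
```

The same function works at the sentence level if `chunks` holds sentences instead.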
## Response Tuning
### Hallucination Detection
## Iterative and Adaptive Techniques
### User Feedback Loops
Gather user feedback on responses. Ensure there is traceability between the documents retrieved and the responses. Incorporate a user feedback component into the scoring done during retrieval and ranking.
### Adaptive RAG #flashcard
Adjust the RAG strategy based on the "type" of query submitted by the user. Use an LLM to classify the "type" of query. Your types could be "factual", "analytical", "opinion" and "contextual".
<!--ID: 1751507777409-->
![[2025-03-29_Retrieval Augmented Generation (RAG)-1.png]]
## Evaluation
[[ML Experimentation and Evaluation#Evaluating LLMs]]
Using [[deepeval]] - [RAG\_Techniques/evaluation/evaluation\_deep\_eval.ipynb at main · NirDiamant/RAG\_Techniques · GitHub](https://github.com/NirDiamant/RAG_Techniques/blob/main/evaluation/evaluation_deep_eval.ipynb)
Using [[GroUSE]] - [RAG\_Techniques/evaluation/evaluation\_grouse.ipynb at main · NirDiamant/RAG\_Techniques · GitHub](https://github.com/NirDiamant/RAG_Techniques/blob/main/evaluation/evaluation_grouse.ipynb)
## Advanced Techniques
Corrective RAG
# Pros
# Cons
# Use Cases
- [[Chatbots]]
- [How Perplexity Built an AI Google - ByteByteGo](https://blog.bytebytego.com/p/how-perplexity-built-an-ai-google)
# References
## Tools for RAG
- Chunking:
    - [GitHub - chonkie-ai/chonkie: 🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library](https://github.com/chonkie-ai/chonkie)
- [GitHub - neuml/txtai: 💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows](https://github.com/neuml/txtai)
- [[LLM Orchestration Frameworks]]
## Other Resources
- [Local Retrieval Augmented Generation (RAG) from Scratch (step by step tutorial) - YouTube](https://www.youtube.com/watch?v=qN_2fnOPY-M)
- [The best resource for latest RAG Techniques](https://github.com/NirDiamant/RAG_Techniques)
- [Learn RAG From Scratch – Python AI Tutorial from a LangChain Engineer - YouTube](https://www.youtube.com/watch?v=sVcwVQRHIc8)
- [From zero to a RAG system: successes and failures \| Andros Fenollosa](https://en.andros.dev/blog/aa31d744/from-zero-to-a-rag-system-successes-and-failures/)
- [Building Enterprise AI: Hard-Won Lessons from 1200+ Hours of RAG Development \| ByteVagabond – Digital Tinkering & Real-World Adventures](https://bytevagabond.com/post/how-to-build-enterprise-ai-rag/)
# Related Topics