
RAG Data Preparation: The Foundation That Makes or Breaks Your AI System

Comprehensive guide to preparing data for RAG systems covering document parsing, chunking strategies, contextual enrichment, and embedding optimization

Most RAG implementation failures trace back to data preparation, not retrieval architecture. Teams spend weeks tuning retrieval parameters when the real problem is poorly parsed documents or inappropriate chunking. This guide covers the critical foundation that determines the quality ceiling of your RAG system.

Why Data Preparation Is the Most Critical RAG Step

There is a common pattern in RAG implementations: sophisticated retrieval architectures (hybrid search, reranking, CRAG) that still produce poor results. The root cause is almost always upstream in the data preparation layer.

The key insight: if data preparation tops out at 60% quality, no amount of architectural sophistication can push retrieval quality above that ceiling. Teams report 40-60% quality improvements from fixing data preparation alone, often without touching retrieval logic.

Document Parsing: Extracting Clean Text from Messy Sources

Real-world documents are messy. PDFs store text as positioned glyphs, not logical sequences. Tables get mangled. Multi-column layouts require layout analysis. Scanned documents need OCR with 5-15% error rates.

Parsing Tool Selection

| Tool | Table Accuracy | Text Fidelity | Speed | Best For |
|---|---|---|---|---|
| Docling | 97.9% | Excellent | ~10s/page | Complex documents |
| LlamaParse | 75-90% | Good | ~6s/doc | Speed-critical |
| Unstructured | 75-100%* | Good | Variable | OCR-heavy |
| PyMuPDF/PyPDF | 60-70% | Fair | Fast | Simple PDFs |

*Unstructured achieves 100% on simple tables, 75% on complex structures

Practical PDF Parsing

python
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

converter = DocumentConverter()

# Parse PDF with layout analysis
result = converter.convert("technical-manual.pdf")

# Access structured content
for element in result.document.body:
    if element.type == "table":
        # Table extracted with structure preserved
        markdown_table = element.export_to_markdown()
    elif element.type == "text":
        # Text with section context
        text = element.text
        section = element.section_header

# Export as markdown for RAG ingestion
markdown_output = result.document.export_to_markdown()

HTML Content Extraction

python
from bs4 import BeautifulSoup
from readability import Document

def extract_html_content(html: str) -> dict:
    """Extract meaningful content from HTML, handling diverse structures."""
    # Use readability for main content extraction
    doc = Document(html)
    main_content = doc.summary()
    title = doc.title()

    # Parse with BeautifulSoup for structure
    soup = BeautifulSoup(main_content, 'html.parser')

    # Remove navigation, ads, footers
    for element in soup.find_all(['nav', 'footer', 'aside', 'script', 'style']):
        element.decompose()

    # Extract text with structure preservation
    text_blocks = []
    for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'li', 'td']):
        text = element.get_text(strip=True)
        if text:
            text_blocks.append({
                'type': element.name,
                'text': text,
                'level': int(element.name[1]) if element.name.startswith('h') else 0
            })

    return {
        'title': title,
        'blocks': text_blocks,
        'full_text': soup.get_text(separator='\n', strip=True)
    }

Tip: Start with rule-based parsing before resorting to LLM-based parsing. Use hybrid approaches: heuristics for structure combined with Vision-Language Models only for the most challenging elements.
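
As a concrete illustration of that hybrid approach, a minimal routing sketch might look like the following. The `needs_vlm` heuristic and the `rule_based_parse`/`vlm_parse` callbacks are hypothetical placeholders, not a specific library's API; the point is simply to reserve expensive model-based parsing for pages that rule-based extraction cannot handle.

python
from typing import Callable

def parse_with_fallback(pages: list, rule_based_parse: Callable, vlm_parse: Callable) -> list:
    """Route each page: cheap rule-based parsing first, VLM only when needed.

    `rule_based_parse` and `vlm_parse` are placeholders for whatever parsers
    you use (e.g. Docling for the rule-based pass, a vision model for hard pages).
    """
    def needs_vlm(parsed_text: str) -> bool:
        # Hypothetical heuristics: almost no text extracted, or many unrecognized glyphs
        too_little_text = len(parsed_text.split()) < 20
        garbled = parsed_text.count('\ufffd') > 5
        return too_little_text or garbled

    results = []
    for page in pages:
        text = rule_based_parse(page)
        if needs_vlm(text):
            text = vlm_parse(page)  # Only pay for the hard cases
        results.append(text)
    return results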

Text Preprocessing: Cleaning for Embedding Quality

Embeddings encode noise along with signal. Inconsistent formatting creates spurious similarity. PII in embeddings creates security risks. Preprocessing removes these issues before they propagate through the pipeline.

python
import re
import unicodedata
from typing import List

class TextPreprocessor:
    def __init__(self):
        self.pii_patterns = {
            'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
            'phone': r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b',
            'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
            'credit_card': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
        }

    def normalize_whitespace(self, text: str) -> str:
        """Normalize all whitespace to single spaces."""
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def normalize_unicode(self, text: str) -> str:
        """Normalize unicode characters to consistent form."""
        return unicodedata.normalize('NFKC', text)

    def redact_pii(self, text: str) -> str:
        """Detect and redact PII patterns."""
        for pii_type, pattern in self.pii_patterns.items():
            text = re.sub(pattern, f'[REDACTED_{pii_type.upper()}]', text)
        return text

    def remove_boilerplate(self, text: str, patterns: List[str] = None) -> str:
        """Remove known boilerplate text patterns."""
        default_patterns = [
            r'Page \d+ of \d+',
            r'Copyright \d{4}.*?(?=\n|$)',
            r'All rights reserved\.?',
        ]
        patterns = patterns or default_patterns
        for pattern in patterns:
            text = re.sub(pattern, '', text, flags=re.IGNORECASE)
        return text

    def process(self, text: str, redact_pii: bool = True) -> str:
        """Run full preprocessing pipeline."""
        text = self.normalize_unicode(text)
        text = self.remove_boilerplate(text)
        text = self.normalize_whitespace(text)
        if redact_pii:
            text = self.redact_pii(text)
        return text

Deduplication

Near-duplicate content wastes storage and skews retrieval results. MinHash LSH provides efficient near-duplicate detection:

python
from datasketch import MinHash, MinHashLSH
import hashlib
from typing import List, Set

class Deduplicator:
    def __init__(self, threshold: float = 0.8, num_perm: int = 128):
        self.threshold = threshold
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
        self.exact_hashes: Set[str] = set()
        self.doc_id = 0

    def _compute_minhash(self, text: str) -> MinHash:
        """Compute MinHash signature for text."""
        minhash = MinHash(num_perm=self.num_perm)
        words = text.lower().split()
        for i in range(len(words) - 2):
            shingle = ' '.join(words[i:i+3])
            minhash.update(shingle.encode('utf-8'))
        return minhash

    def is_duplicate(self, text: str) -> bool:
        """Check for exact or near duplicates."""
        # Exact duplicate check
        text_hash = hashlib.md5(text.encode()).hexdigest()
        if text_hash in self.exact_hashes:
            return True
        self.exact_hashes.add(text_hash)

        # Near-duplicate check
        minhash = self._compute_minhash(text)
        if self.lsh.query(minhash):
            return True

        self.lsh.insert(f"doc_{self.doc_id}", minhash)
        self.doc_id += 1
        return False

    def deduplicate(self, documents: List[str]) -> List[str]:
        """Remove duplicates from document list."""
        return [doc for doc in documents if not self.is_duplicate(doc)]
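
Usage is a one-liner once documents are parsed. The strings below are purely illustrative; the exact duplicate is always dropped via the hash check, while near duplicates depend on the similarity threshold:

python
docs = [
    "Refund requests must be submitted within 30 days of purchase.",
    "Refund requests must be submitted within 30 days of purchase.",  # exact duplicate
    "Our support team responds to tickets within one business day.",
]

dedup = Deduplicator(threshold=0.8)
unique_docs = dedup.deduplicate(docs)
print(f"Kept {len(unique_docs)} of {len(docs)} documents")  # the exact duplicate is dropped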

Chunking Strategies: The Art of Splitting Documents

Chunking determines how information is organized for retrieval. The core tension: too small loses context, too large dilutes relevance signal.

Recursive Character Chunking (Baseline)

python
from langchain_text_splitters import RecursiveCharacterTextSplitter
def recursive_chunking(text: str, chunk_size: int = 512, overlap: int = 50) -> list:
    """
    Split text recursively using a hierarchy of separators.
    Tries to keep paragraphs together, then sentences, then words.
    """
    splitter = RecursiveCharacterTextSplitter(
        separators=[
            "\n\n",  # Paragraphs first
            "\n",    # Then line breaks
            ". ",    # Then sentences
            ", ",    # Then clauses
            " ",     # Finally words
        ],
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    return splitter.split_text(text)

Semantic Chunking

python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

def semantic_chunking(text: str) -> list:
    """
    Split based on semantic similarity between sentences.
    Groups semantically related content together.
    """
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    splitter = SemanticChunker(
        embeddings=embeddings,
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=95,  # Split at top 5% dissimilarity
        min_chunk_size=100
    )

    return splitter.split_text(text)

# Performance: up to 70% accuracy improvement in retrieval (varies by content type)
# Trade-off: requires embedding calls during chunking

Hierarchical Parent-Child Chunking

python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma

def setup_hierarchical_chunking(documents: list, embeddings):
    """
    Create a parent-child hierarchy for precision + context.
    Search on small child chunks, return large parent chunks.
    """
    # Parent splitter: large chunks for context
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=200
    )

    # Child splitter: small chunks for precise retrieval
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=50
    )

    # Storage for parent documents
    docstore = InMemoryStore()

    # Vector store indexes child chunks
    vectorstore = Chroma(
        collection_name="child_chunks",
        embedding_function=embeddings
    )

    # Retriever searches children, returns parents
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=docstore,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter
    )

    retriever.add_documents(documents)
    return retriever

# Performance: improved relevance on structured documents by preserving context
# Trade-off: 2-3x storage overhead
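
A quick usage sketch for the retriever above. The file name and query are illustrative; in recent LangChain versions, `invoke` runs the retrieval:

python
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

docs = [Document(page_content=open("technical-manual.md").read(),
                 metadata={"source": "technical-manual.md"})]

retriever = setup_hierarchical_chunking(
    docs, OpenAIEmbeddings(model="text-embedding-3-small")
)

# Matching happens against the 400-char child chunks,
# but the full 2000-char parent chunks come back
parent_chunks = retriever.invoke("How is the cooling system calibrated?")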

Chunk Size Guidelines

| Content Type | Recommended Size | Overlap | Rationale |
|---|---|---|---|
| Technical docs | 512 tokens | 50-100 | Balance detail with context |
| Conversational | 256 tokens | 25-50 | Shorter exchanges |
| Legal/contracts | 1024 tokens | 100-150 | Preserve clause context |
| Code | 1000 chars | 100 | Keep functions intact |
| Q&A pairs | 128-256 tokens | 0 | Each Q&A is self-contained |
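
One way to operationalize these guidelines is a small dispatch table. The sketch below is an assumption about how you might wire it up (the profile names and exact values are illustrative): token-denominated rows use tiktoken-based sizing, the code row uses character sizing.

python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# (chunk_size, overlap, unit) per content type, following the table above
CHUNKING_PROFILES = {
    "technical": (512, 75, "tokens"),
    "conversational": (256, 40, "tokens"),
    "legal": (1024, 125, "tokens"),
    "code": (1000, 100, "chars"),
    "qa": (192, 0, "tokens"),
}

def chunk_by_content_type(text: str, content_type: str) -> list:
    """Pick chunk size and overlap based on content type."""
    size, overlap, unit = CHUNKING_PROFILES.get(content_type, (512, 50, "tokens"))
    if unit == "tokens":
        # Token-based length function via tiktoken
        splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            encoding_name="cl100k_base",
            chunk_size=size,
            chunk_overlap=overlap,
        )
    else:
        splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    return splitter.split_text(text)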

Contextual Chunking: Solving the Lost Context Problem

Traditional chunking destroys document context. A chunk saying "This approach reduces latency by 40%" is useless without knowing which approach. Contextual chunking addresses this.

Anthropic's Contextual Retrieval Technique

python
from anthropic import Anthropic
from typing import List

def add_contextual_headers(
    document: str,
    chunks: List[str],
    model: str = "claude-3-5-haiku-latest"
) -> List[str]:
    """
    Prepend chunk-specific context using Claude.
    Reduces retrieval failures by 35% with contextual embeddings alone,
    49% when combined with BM25 hybrid search, and 67% with reranking added.
    """
    client = Anthropic()
    contextualized_chunks = []

    context_prompt = """Here is the full document:
<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Provide a short context (2-3 sentences) to situate this chunk within the document. Focus on:
1. What section/topic this chunk belongs to
2. Key entities or concepts being discussed
3. How it relates to the document's main subject

Context:"""

    for chunk in chunks:
        response = client.messages.create(
            model=model,
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": context_prompt.format(document=document, chunk=chunk)
            }]
        )

        context = response.content[0].text
        contextualized_chunks.append(f"{context}\n\n{chunk}")

    return contextualized_chunks

# Cost with prompt caching: ~$1.02 per million document tokens
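
A note on that cost figure: it assumes the repeated document prefix is actually cached, so the full document is processed once per document rather than once per chunk. Below is a hedged sketch of how the loop above could mark the document as cacheable; exact cache-control support varies by model and SDK version.

python
from anthropic import Anthropic
from typing import List

def add_contextual_headers_cached(
    document: str,
    chunks: List[str],
    model: str = "claude-3-5-haiku-latest"
) -> List[str]:
    """Variant of add_contextual_headers that caches the shared document prefix."""
    client = Anthropic()
    contextualized = []

    for chunk in chunks:
        response = client.messages.create(
            model=model,
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": [
                    # The large, repeated document goes first and gets cached across calls
                    {"type": "text",
                     "text": f"Here is the full document:\n<document>\n{document}\n</document>",
                     "cache_control": {"type": "ephemeral"}},
                    # Only this part changes from chunk to chunk
                    {"type": "text",
                     "text": f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n\n"
                             "Provide a short context (2-3 sentences) to situate this chunk "
                             "within the document.\n\nContext:"},
                ],
            }]
        )
        contextualized.append(f"{response.content[0].text}\n\n{chunk}")

    return contextualized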

Rule-Based Context (Zero-Cost Alternative)

python
from typing import List

def add_structural_context(chunks: List[dict]) -> List[dict]:
    """
    Add context based on document structure without LLM calls.
    Uses metadata from structure-aware chunking.
    """
    contextualized = []

    for chunk in chunks:
        metadata = chunk.get('metadata', {})
        content = chunk['content']

        context_parts = []
        if 'document_title' in metadata:
            context_parts.append(f"From: {metadata['document_title']}")
        if 'header_1' in metadata:
            context_parts.append(f"Section: {metadata['header_1']}")
        if 'header_2' in metadata:
            context_parts.append(f"Subsection: {metadata['header_2']}")

        context = " | ".join(context_parts)
        contextualized.append({
            'content': f"{context}\n\n{content}" if context else content,
            'metadata': metadata
        })

    return contextualized
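
The header metadata this function expects has to come from somewhere. If your parser emits markdown (as the Docling example above does), a header-aware splitter can produce it. A sketch using LangChain's MarkdownHeaderTextSplitter, mapping headers onto the `header_1`/`header_2` keys used above; the helper name is illustrative:

python
from typing import List
from langchain_text_splitters import MarkdownHeaderTextSplitter

def structure_aware_chunks(markdown_text: str, document_title: str) -> List[dict]:
    """Split on markdown headers and carry them along as chunk metadata."""
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "header_1"), ("##", "header_2")]
    )
    docs = splitter.split_text(markdown_text)

    return [{
        'content': doc.page_content,
        'metadata': {'document_title': document_title, **doc.metadata},
    } for doc in docs]

# chunks = structure_aware_chunks(markdown_output, "Technical Manual")
# contextualized = add_structural_context(chunks)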

Embedding Model Selection

Choosing the right embedding model depends on content type, chunk size, and deployment constraints. Note that MTEB scores change frequently as models are updated and new benchmarks are added, so always verify current scores before making decisions.

| Model | MTEB Score | Dimensions | Cost/1M tokens | Best For |
|---|---|---|---|---|
| Cohere embed-v4 | 65.2 | 1024 | $0.10 | Multilingual, production |
| text-embedding-3-large | 64.6 | 3072 | $0.13 | General purpose |
| text-embedding-3-small | 62.3 | 1536 | $0.02 | Cost-sensitive |
| Voyage voyage-3-large | 63.8 | 1536 | $0.12 | RAG-optimized |
| BGE-M3 | 63.0 | 1024 | Self-hosted | Privacy-critical |

Embedding Optimization

python
from typing import List
import numpy as np
from openai import OpenAI

client = OpenAI()

def get_embeddings(
    texts: List[str],
    dimensions: int = 1024
) -> List[List[float]]:
    """
    Get OpenAI embeddings with dimension reduction.
    256-dim text-embedding-3-large outperforms full ada-002.
    """
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=dimensions  # Matryoshka truncation
    )
    return [item.embedding for item in response.data]


def batch_embed_with_normalization(
    texts: List[str],
    batch_size: int = 100,
    dimensions: int = 1024
) -> np.ndarray:
    """
    Embed texts in batches with L2 normalization.
    Normalization enables cosine similarity via dot product.
    """
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = get_embeddings(batch, dimensions)
        all_embeddings.extend(embeddings)

    embeddings_array = np.array(all_embeddings)

    # L2 normalize for cosine similarity
    norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True)
    return embeddings_array / norms

Metadata Extraction and Enrichment

Metadata enables filtering before semantic search, provides ranking signals, and supports source attribution.

python
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime

@dataclass
class ChunkMetadata:
    # Content-based
    keywords: List[str]
    entities: List[str]
    content_type: str

    # Structural
    document_title: str
    section_header: Optional[str]
    chunk_index: int

    # Contextual
    source_url: Optional[str]
    ingestion_date: datetime
    language: str

    # Technical
    word_count: int
    has_code: bool
    has_table: bool

Automated Extraction

python
import spacy
from collections import Counter
from typing import Dict, List

class MetadataExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")

    def extract_entities(self, text: str) -> Dict[str, List[str]]:
        """Extract named entities using spaCy."""
        doc = self.nlp(text)
        entities = {}
        for ent in doc.ents:
            if ent.label_ not in entities:
                entities[ent.label_] = []
            entities[ent.label_].append(ent.text)
        return entities

    def extract_keywords(self, text: str, top_n: int = 10) -> List[str]:
        """Extract keywords using noun chunks."""
        doc = self.nlp(text)
        noun_chunks = [chunk.text.lower() for chunk in doc.noun_chunks]
        chunk_counts = Counter(noun_chunks)
        return [word for word, _ in chunk_counts.most_common(top_n)]

    def detect_content_type(self, text: str) -> str:
        """Heuristic content type detection."""
        code_patterns = ['def ', 'function ', 'class ', 'import ', '```']
        if any(pattern in text for pattern in code_patterns):
            return 'code'

        tech_indicators = ['API', 'database', 'server', 'deployment']
        if sum(1 for ind in tech_indicators if ind.lower() in text.lower()) >= 2:
            return 'technical'

        return 'general'
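
Tying the extractor back to the ChunkMetadata dataclass above, a small glue function might look like the following. The helper name and field choices (language default, table heuristic) are illustrative assumptions, not a prescribed schema:

python
from datetime import datetime

def build_chunk_metadata(chunk_text: str, document_title: str,
                         chunk_index: int, extractor: MetadataExtractor) -> ChunkMetadata:
    """Populate ChunkMetadata for a single chunk using the extractor above."""
    entities = extractor.extract_entities(chunk_text)
    return ChunkMetadata(
        keywords=extractor.extract_keywords(chunk_text),
        entities=[e for values in entities.values() for e in values],
        content_type=extractor.detect_content_type(chunk_text),
        document_title=document_title,
        section_header=None,          # fill from structure-aware chunking if available
        chunk_index=chunk_index,
        source_url=None,
        ingestion_date=datetime.utcnow(),
        language="en",                # or plug in a language detector
        word_count=len(chunk_text.split()),
        has_code='```' in chunk_text,
        has_table='|' in chunk_text,  # crude heuristic for markdown tables
    )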

Storing with Vector Database

python
from typing import List
from qdrant_client import QdrantClient
from qdrant_client.models import (
    PointStruct, VectorParams, Distance, Filter, FieldCondition, MatchValue
)

def store_chunks_with_metadata(
    chunks: List[str],
    embeddings: List[List[float]],
    metadata_list: List[dict],
    collection_name: str = "documents"
):
    """Store chunks with rich metadata in Qdrant."""
    client = QdrantClient(host="localhost", port=6333)

    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=len(embeddings[0]),
            distance=Distance.COSINE
        )
    )

    points = [
        PointStruct(
            id=idx,
            vector=embedding,
            payload={"text": chunk, **metadata}
        )
        for idx, (chunk, embedding, metadata)
        in enumerate(zip(chunks, embeddings, metadata_list))
    ]

    client.upsert(collection_name=collection_name, points=points)


def search_with_filter(
    query_embedding: List[float],
    collection_name: str,
    content_type: str = None,
    top_k: int = 10
) -> List[dict]:
    """Search with optional metadata filter."""
    client = QdrantClient(host="localhost", port=6333)

    query_filter = None
    if content_type:
        query_filter = Filter(
            must=[FieldCondition(
                key="content_type",
                match=MatchValue(value=content_type)
            )]
        )

    results = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        query_filter=query_filter,
        limit=top_k
    )

    return [{"text": hit.payload["text"], "score": hit.score} for hit in results]

Quality Metrics for Data Preparation

Measuring data quality enables data-driven optimization and early issue detection.

python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List

def evaluate_chunk_coherence(chunks: List[str], embedding_model) -> dict:
    """
    Measure semantic coherence within chunks.
    High coherence = chunk discusses a single topic.
    """
    coherence_scores = []

    for chunk in chunks:
        sentences = chunk.split('. ')
        if len(sentences) < 2:
            coherence_scores.append(1.0)
            continue

        embeddings = np.array(embedding_model.embed(sentences))
        similarities = cosine_similarity(embeddings)

        n = len(sentences)
        coherence = (similarities.sum() - n) / (n * (n - 1)) if n > 1 else 1.0
        coherence_scores.append(coherence)

    return {
        'mean_coherence': np.mean(coherence_scores),
        'min_coherence': np.min(coherence_scores),
        'low_coherence_count': sum(1 for s in coherence_scores if s < 0.5)
    }


def evaluate_boundary_quality(chunks: List[str]) -> dict:
    """Check whether chunks have clean boundaries."""
    bad_starts = 0
    bad_ends = 0

    lowercase_starters = ['and', 'but', 'or', 'so', 'because', 'however']

    for chunk in chunks:
        first_word = chunk.split()[0].lower() if chunk.split() else ''
        if first_word in lowercase_starters:
            bad_starts += 1

        if chunk and chunk.rstrip()[-1] not in '.!?:':
            bad_ends += 1

    return {
        'bad_start_ratio': bad_starts / len(chunks),
        'bad_end_ratio': bad_ends / len(chunks),
        'clean_boundary_ratio': 1 - (bad_starts + bad_ends) / (2 * len(chunks))
    }


def evaluate_retrieval_quality(
    embeddings: np.ndarray,
    test_queries: List[str],
    relevant_chunk_ids: List[List[int]],
    embedding_model
) -> dict:
    """Evaluate embedding quality using retrieval tests."""
    query_embeddings = np.array(embedding_model.embed(test_queries))
    similarities = cosine_similarity(query_embeddings, embeddings)

    hits_at_k = {1: 0, 5: 0, 10: 0}
    mrr_sum = 0

    for i, relevant_ids in enumerate(relevant_chunk_ids):
        ranked_indices = np.argsort(similarities[i])[::-1]

        for rank, idx in enumerate(ranked_indices):
            if idx in relevant_ids:
                mrr_sum += 1 / (rank + 1)
                for k in hits_at_k:
                    if rank < k:
                        hits_at_k[k] += 1
                break

    n_queries = len(test_queries)
    return {
        'mrr': mrr_sum / n_queries,
        'hit_rate@1': hits_at_k[1] / n_queries,
        'hit_rate@5': hits_at_k[5] / n_queries,
        'hit_rate@10': hits_at_k[10] / n_queries
    }

Common Pitfalls and Solutions

Pitfall 1: Skipping Parsing Validation

Assuming parsing tools work perfectly on all documents leads to missing content and mangled tables in retrieval results. Always validate parsing output on representative samples before full ingestion.

Pitfall 2: One-Size-Fits-All Chunking

Using the same chunk size for all content types results in code split mid-function and tables losing context. Match chunking strategy to content structure.

Pitfall 3: Ignoring Lost Context

Chunks that reference "it", "this method", "as mentioned" become meaningless in isolation. Implement contextual chunking (LLM or rule-based) to make chunks self-contained.

Pitfall 4: Choosing Models by Benchmark Alone

MTEB scores do not reflect performance on specific content. A high-benchmark model can perform poorly on domain-specific queries. Evaluate embedding models on your own test queries.

Pitfall 5: Processing in Wrong Order

Chunking before cleaning or embedding before deduplication creates noisy results. Follow the pipeline: parse -> clean -> dedupe -> chunk -> enrich -> embed.
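
To make the ordering concrete, here is a hedged end-to-end sketch that wires together components defined earlier in this guide; parsing is assumed to have already produced a list of dicts like `{"text": ..., "title": ...}`, and the function name is illustrative.

python
def prepare_documents(documents: list):
    """Illustrative pipeline: clean -> dedupe -> chunk -> enrich -> embed -> store."""
    preprocessor = TextPreprocessor()
    dedup = Deduplicator(threshold=0.8)

    chunks = []
    for doc in documents:
        # 1. Clean
        text = preprocessor.process(doc["text"])
        # 2. Drop exact/near duplicates before spending money on embeddings
        if dedup.is_duplicate(text):
            continue
        # 3. Chunk, then 4. enrich with lightweight structural context
        for piece in recursive_chunking(text, chunk_size=512, overlap=50):
            chunks.append({'content': piece,
                           'metadata': {'document_title': doc["title"]}})
    chunks = add_structural_context(chunks)

    # 5. Embed, 6. Store with metadata
    texts = [c['content'] for c in chunks]
    vectors = batch_embed_with_normalization(texts).tolist()
    store_chunks_with_metadata(texts, vectors, [c['metadata'] for c in chunks])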

Building Your Pipeline: A Practical Approach

The order in which you tackle data preparation matters. Here's how to think about building your pipeline.

Start with parsing validation. Before writing any pipeline code, manually inspect parsing output for 10-20 representative documents. Look for mangled tables, missing sections, and garbled text. If your parser fails on 30% of samples, no downstream optimization will save you.

Next, establish your preprocessing baseline. Run your text through normalization, PII detection, and boilerplate removal. Compare before/after samples. The goal is clean, consistent text without losing meaningful content.

Then choose your chunking strategy based on what you learned from parsing. If your documents have clear hierarchical structure (headers, sections), leverage it with structure-aware chunking. If they're dense technical prose, recursive splitting is your friend. If you're dealing with mixed content types, consider routing different document types to different strategies.

Add context only after chunking works well. Contextual enrichment is powerful but adds cost and complexity. Get your basic pipeline producing reasonable results first, then measure whether contextual chunking improves your specific retrieval scenarios.

Finally, close the loop with metrics. Implement coherence and boundary quality checks. Create a small test set of queries with known relevant chunks. Run retrieval evaluations weekly as you tune parameters. Without measurement, you're guessing.
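
The test set itself can be tiny to start with. A sketch of the shape, feeding the evaluate_retrieval_quality function from the quality-metrics section; the queries and chunk IDs here are made up:

python
import numpy as np

# A handful of queries with the chunk IDs a human judged relevant (illustrative)
TEST_SET = {
    "How do I reset my API key?": [12, 13],
    "What is the refund window for annual plans?": [47],
}

def run_weekly_eval(chunk_embeddings: np.ndarray, embedding_model) -> dict:
    """Run the retrieval evaluation from the quality-metrics section on a fixed test set."""
    queries = list(TEST_SET.keys())
    relevant = list(TEST_SET.values())
    return evaluate_retrieval_quality(chunk_embeddings, queries, relevant, embedding_model)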

The key insight: each step depends on the previous one working correctly. Resist the urge to implement everything at once. A simple pipeline you understand beats a complex one you can't debug.

Key Takeaways

Data preparation sets the quality ceiling: The most sophisticated RAG architecture cannot compensate for poorly prepared data.

Parsing determines everything downstream: Invest in quality parsing tools and validate output before proceeding.

Context matters more than chunk size: Contextual chunking reduces retrieval failures by 35% alone, 49% with BM25, and 67% with reranking.

Quality metrics are non-negotiable: Measure parsing accuracy, chunk coherence, and retrieval quality throughout the pipeline.

Start simple, measure, enhance: Begin with RecursiveCharacterTextSplitter and quality parsing. Add complexity only when metrics justify it.
