
RAG Data Preparation: The Foundation That Makes or Breaks Your AI System

Comprehensive guide to preparing data for RAG systems covering document parsing, chunking strategies, contextual enrichment, and embedding optimization

Most RAG implementation failures trace back to data preparation, not retrieval architecture. Teams spend weeks tuning retrieval parameters when the real problem is poorly parsed documents or inappropriate chunking. This guide covers the critical foundation that determines the quality ceiling of your RAG system.

Why Data Preparation Is the Most Critical RAG Step

There is a common pattern in RAG implementations: sophisticated retrieval architectures (hybrid search, reranking, CRAG) that still produce poor results. The root cause is almost always upstream in the data preparation layer.

The key insight: if data preparation tops out at 60% quality, no amount of architectural sophistication can push retrieval quality above that ceiling. Teams report 40-60% quality improvements from fixing data preparation alone, often without touching retrieval logic.

Document Parsing: Extracting Clean Text from Messy Sources

Real-world documents are messy. PDFs store text as positioned glyphs, not logical sequences. Tables get mangled. Multi-column layouts require layout analysis. Scanned documents need OCR with 5-15% error rates.

Parsing Tool Selection

| Tool | Table Accuracy | Text Fidelity | Speed | Best For |
|---|---|---|---|---|
| Docling | 97.9% | Excellent | ~10s/page | Complex documents |
| LlamaParse | 75-90% | Good | ~6s/doc | Speed-critical |
| Unstructured | 75-100%* | Good | Variable | OCR-heavy |
| PyMuPDF/PyPDF | 60-70% | Fair | Fast | Simple PDFs |

*Unstructured achieves 100% on simple tables, 75% on complex structures

Practical PDF Parsing

python
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

converter = DocumentConverter()

# Parse PDF with layout analysis
result = converter.convert("technical-manual.pdf")

# Access structured content
for element in result.document.body:
    if element.type == "table":
        # Table extracted with structure preserved
        markdown_table = element.export_to_markdown()
    elif element.type == "text":
        # Text with section context
        text = element.text
        section = element.section_header

# Export as markdown for RAG ingestion
markdown_output = result.document.export_to_markdown()

HTML Content Extraction

python
from bs4 import BeautifulSoup
from readability import Document

def extract_html_content(html: str) -> dict:
    """Extract meaningful content from HTML, handling diverse structures."""
    # Use readability for main content extraction
    doc = Document(html)
    main_content = doc.summary()
    title = doc.title()

    # Parse with BeautifulSoup for structure
    soup = BeautifulSoup(main_content, 'html.parser')

    # Remove navigation, ads, footers
    for element in soup.find_all(['nav', 'footer', 'aside', 'script', 'style']):
        element.decompose()

    # Extract text with structure preservation
    text_blocks = []
    for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'li', 'td']):
        text = element.get_text(strip=True)
        if text:
            text_blocks.append({
                'type': element.name,
                'text': text,
                'level': int(element.name[1]) if element.name.startswith('h') else 0
            })

    return {
        'title': title,
        'blocks': text_blocks,
        'full_text': soup.get_text(separator='\n', strip=True)
    }

Tip: Start with rule-based parsing before resorting to LLM-based parsing. Use hybrid approaches: heuristics for structure combined with Vision-Language Models only for the most challenging elements.
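
As a concrete illustration of that hybrid approach, a minimal routing sketch might look like the following. The `needs_vlm` heuristic and the `rule_based_parse`/`vlm_parse` callbacks are hypothetical placeholders, not a specific library's API; the point is simply to reserve expensive model-based parsing for pages that rule-based extraction cannot handle.

python
from typing import Callable

def parse_with_fallback(pages: list, rule_based_parse: Callable, vlm_parse: Callable) -> list:
    """Route each page: cheap rule-based parsing first, VLM only when needed.

    `rule_based_parse` and `vlm_parse` are placeholders for whatever parsers
    you use (e.g. Docling for the rule-based pass, a vision model for hard pages).
    """
    def needs_vlm(parsed_text: str) -> bool:
        # Hypothetical heuristics: almost no text extracted, or many unrecognized glyphs
        too_little_text = len(parsed_text.split()) < 20
        garbled = parsed_text.count('\ufffd') > 5
        return too_little_text or garbled

    results = []
    for page in pages:
        text = rule_based_parse(page)
        if needs_vlm(text):
            text = vlm_parse(page)  # Only pay for the hard cases
        results.append(text)
    return results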

Text Preprocessing: Cleaning for Embedding Quality

Embeddings encode noise along with signal. Inconsistent formatting creates spurious similarity. PII in embeddings creates security risks. Preprocessing removes these issues before they propagate through the pipeline.

python
import re
import unicodedata
from typing import List

class TextPreprocessor:
    def __init__(self):
        self.pii_patterns = {
            'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
            'phone': r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b',
            'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
            'credit_card': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
        }

    def normalize_whitespace(self, text: str) -> str:
        """Normalize all whitespace to single spaces."""
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def normalize_unicode(self, text: str) -> str:
        """Normalize unicode characters to consistent form."""
        return unicodedata.normalize('NFKC', text)

    def redact_pii(self, text: str) -> str:
        """Detect and redact PII patterns."""
        for pii_type, pattern in self.pii_patterns.items():
            text = re.sub(pattern, f'[REDACTED_{pii_type.upper()}]', text)
        return text

    def remove_boilerplate(self, text: str, patterns: List[str] = None) -> str:
        """Remove known boilerplate text patterns."""
        default_patterns = [
            r'Page \d+ of \d+',
            r'Copyright \d{4}.*?(?=\n|$)',
            r'All rights reserved\.?',
        ]
        patterns = patterns or default_patterns
        for pattern in patterns:
            text = re.sub(pattern, '', text, flags=re.IGNORECASE)
        return text

    def process(self, text: str, redact_pii: bool = True) -> str:
        """Run full preprocessing pipeline."""
        text = self.normalize_unicode(text)
        text = self.remove_boilerplate(text)
        text = self.normalize_whitespace(text)
        if redact_pii:
            text = self.redact_pii(text)
        return text

Deduplication

Near-duplicate content wastes storage and skews retrieval results. MinHash LSH provides efficient near-duplicate detection:

python
from datasketch import MinHash, MinHashLSH
import hashlib
from typing import List, Set

class Deduplicator:
    def __init__(self, threshold: float = 0.8, num_perm: int = 128):
        self.threshold = threshold
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
        self.exact_hashes: Set[str] = set()
        self.doc_id = 0

    def _compute_minhash(self, text: str) -> MinHash:
        """Compute MinHash signature for text."""
        minhash = MinHash(num_perm=self.num_perm)
        words = text.lower().split()
        for i in range(len(words) - 2):
            shingle = ' '.join(words[i:i+3])
            minhash.update(shingle.encode('utf-8'))
        return minhash

    def is_duplicate(self, text: str) -> bool:
        """Check for exact or near duplicates."""
        # Exact duplicate check
        text_hash = hashlib.md5(text.encode()).hexdigest()
        if text_hash in self.exact_hashes:
            return True
        self.exact_hashes.add(text_hash)

        # Near-duplicate check
        minhash = self._compute_minhash(text)
        if self.lsh.query(minhash):
            return True

        self.lsh.insert(f"doc_{self.doc_id}", minhash)
        self.doc_id += 1
        return False

    def deduplicate(self, documents: List[str]) -> List[str]:
        """Remove duplicates from document list."""
        return [doc for doc in documents if not self.is_duplicate(doc)]
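
Usage is a one-liner once documents are parsed. The strings below are purely illustrative; the exact duplicate is always dropped via the hash check, while near duplicates depend on the similarity threshold:

python
docs = [
    "Refund requests must be submitted within 30 days of purchase.",
    "Refund requests must be submitted within 30 days of purchase.",  # exact duplicate
    "Our support team responds to tickets within one business day.",
]

dedup = Deduplicator(threshold=0.8)
unique_docs = dedup.deduplicate(docs)
print(f"Kept {len(unique_docs)} of {len(docs)} documents")  # the exact duplicate is dropped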

Chunking Strategies: The Art of Splitting Documents

Chunking determines how information is organized for retrieval. The core tension: too small loses context, too large dilutes relevance signal.

Recursive Character Chunking (Baseline)

python
from langchain_text_splitters import RecursiveCharacterTextSplitter
def recursive_chunking(text: str, chunk_size: int = 512, overlap: int = 50) -> list:
    """
    Split text recursively using a hierarchy of separators.
    Tries to keep paragraphs together, then sentences, then words.
    """
    splitter = RecursiveCharacterTextSplitter(
        separators=[
            "\n\n",  # Paragraphs first
            "\n",    # Then line breaks
            ". ",    # Then sentences
            ", ",    # Then clauses
            " ",     # Finally words
        ],
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    return splitter.split_text(text)

Semantic Chunking

python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

def semantic_chunking(text: str) -> list:
    """
    Split based on semantic similarity between sentences.
    Groups semantically related content together.
    """
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    splitter = SemanticChunker(
        embeddings=embeddings,
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=95,  # Split at top 5% dissimilarity
        min_chunk_size=100
    )

    return splitter.split_text(text)

# Performance: up to 70% accuracy improvement in retrieval (varies by content type)
# Trade-off: requires embedding calls during chunking

Hierarchical Parent-Child Chunking

python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma

def setup_hierarchical_chunking(documents: list, embeddings):
    """
    Create a parent-child hierarchy for precision + context.
    Search on small child chunks, return large parent chunks.
    """
    # Parent splitter: large chunks for context
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=200
    )

    # Child splitter: small chunks for precise retrieval
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=50
    )

    # Storage for parent documents
    docstore = InMemoryStore()

    # Vector store indexes child chunks
    vectorstore = Chroma(
        collection_name="child_chunks",
        embedding_function=embeddings
    )

    # Retriever searches children, returns parents
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=docstore,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter
    )

    retriever.add_documents(documents)
    return retriever

# Performance: improved relevance on structured documents by preserving context
# Trade-off: 2-3x storage overhead
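
A quick usage sketch for the retriever above. The file name and query are illustrative; in recent LangChain versions, `invoke` runs the retrieval:

python
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

docs = [Document(page_content=open("technical-manual.md").read(),
                 metadata={"source": "technical-manual.md"})]

retriever = setup_hierarchical_chunking(
    docs, OpenAIEmbeddings(model="text-embedding-3-small")
)

# Matching happens against the 400-char child chunks,
# but the full 2000-char parent chunks come back
parent_chunks = retriever.invoke("How is the cooling system calibrated?")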

Chunk Size Guidelines

| Content Type | Recommended Size | Overlap | Rationale |
|---|---|---|---|
| Technical docs | 512 tokens | 50-100 | Balance detail with context |
| Conversational | 256 tokens | 25-50 | Shorter exchanges |
| Legal/contracts | 1024 tokens | 100-150 | Preserve clause context |
| Code | 1000 chars | 100 | Keep functions intact |
| Q&A pairs | 128-256 tokens | 0 | Each Q&A is self-contained |
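
One way to operationalize these guidelines is a small dispatch table. The sketch below is an assumption about how you might wire it up (the profile names and exact values are illustrative): token-denominated rows use tiktoken-based sizing, the code row uses character sizing.

python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# (chunk_size, overlap, unit) per content type, following the table above
CHUNKING_PROFILES = {
    "technical": (512, 75, "tokens"),
    "conversational": (256, 40, "tokens"),
    "legal": (1024, 125, "tokens"),
    "code": (1000, 100, "chars"),
    "qa": (192, 0, "tokens"),
}

def chunk_by_content_type(text: str, content_type: str) -> list:
    """Pick chunk size and overlap based on content type."""
    size, overlap, unit = CHUNKING_PROFILES.get(content_type, (512, 50, "tokens"))
    if unit == "tokens":
        # Token-based length function via tiktoken
        splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            encoding_name="cl100k_base",
            chunk_size=size,
            chunk_overlap=overlap,
        )
    else:
        splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    return splitter.split_text(text)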

Contextual Chunking: Solving the Lost Context Problem

Traditional chunking destroys document context. A chunk saying "This approach reduces latency by 40%" is useless without knowing which approach. Contextual chunking addresses this.

Anthropic's Contextual Retrieval Technique

python
from anthropic import Anthropic
from typing import List

def add_contextual_headers(
    document: str,
    chunks: List[str],
    model: str = "claude-3-5-haiku-latest"
) -> List[str]:
    """
    Prepend chunk-specific context using Claude.
    Reduces retrieval failures by 35% with contextual embeddings alone,
    49% when combined with BM25 hybrid search, and 67% with reranking added.
    """
    client = Anthropic()
    contextualized_chunks = []

    context_prompt = """Here is the full document:
<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Provide a short context (2-3 sentences) to situate this chunk within the document. Focus on:
1. What section/topic this chunk belongs to
2. Key entities or concepts being discussed
3. How it relates to the document's main subject

Context:"""

    for chunk in chunks:
        response = client.messages.create(
            model=model,
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": context_prompt.format(document=document, chunk=chunk)
            }]
        )

        context = response.content[0].text
        contextualized_chunks.append(f"{context}\n\n{chunk}")

    return contextualized_chunks

# Cost with prompt caching: ~$1.02 per million document tokens
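
A note on that cost figure: it assumes the repeated document prefix is actually cached, so the full document is processed once per document rather than once per chunk. Below is a hedged sketch of how the loop above could mark the document as cacheable; exact cache-control support varies by model and SDK version.

python
from anthropic import Anthropic
from typing import List

def add_contextual_headers_cached(
    document: str,
    chunks: List[str],
    model: str = "claude-3-5-haiku-latest"
) -> List[str]:
    """Variant of add_contextual_headers that caches the shared document prefix."""
    client = Anthropic()
    contextualized = []

    for chunk in chunks:
        response = client.messages.create(
            model=model,
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": [
                    # The large, repeated document goes first and gets cached across calls
                    {"type": "text",
                     "text": f"Here is the full document:\n<document>\n{document}\n</document>",
                     "cache_control": {"type": "ephemeral"}},
                    # Only this part changes from chunk to chunk
                    {"type": "text",
                     "text": f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n\n"
                             "Provide a short context (2-3 sentences) to situate this chunk "
                             "within the document.\n\nContext:"},
                ],
            }]
        )
        contextualized.append(f"{response.content[0].text}\n\n{chunk}")

    return contextualized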

Rule-Based Context (Zero-Cost Alternative)

python
from typing import List

def add_structural_context(chunks: List[dict]) -> List[dict]:
    """
    Add context based on document structure without LLM calls.
    Uses metadata from structure-aware chunking.
    """
    contextualized = []

    for chunk in chunks:
        metadata = chunk.get('metadata', {})
        content = chunk['content']

        context_parts = []
        if 'document_title' in metadata:
            context_parts.append(f"From: {metadata['document_title']}")
        if 'header_1' in metadata:
            context_parts.append(f"Section: {metadata['header_1']}")
        if 'header_2' in metadata:
            context_parts.append(f"Subsection: {metadata['header_2']}")

        context = " | ".join(context_parts)
        contextualized.append({
            'content': f"{context}\n\n{content}" if context else content,
            'metadata': metadata
        })

    return contextualized
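
The header metadata this function expects has to come from somewhere. If your parser emits markdown (as the Docling example above does), a header-aware splitter can produce it. A sketch using LangChain's MarkdownHeaderTextSplitter, mapping headers onto the `header_1`/`header_2` keys used above; the helper name is illustrative:

python
from typing import List
from langchain_text_splitters import MarkdownHeaderTextSplitter

def structure_aware_chunks(markdown_text: str, document_title: str) -> List[dict]:
    """Split on markdown headers and carry them along as chunk metadata."""
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "header_1"), ("##", "header_2")]
    )
    docs = splitter.split_text(markdown_text)

    return [{
        'content': doc.page_content,
        'metadata': {'document_title': document_title, **doc.metadata},
    } for doc in docs]

# chunks = structure_aware_chunks(markdown_output, "Technical Manual")
# contextualized = add_structural_context(chunks)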

Embedding Model Selection

Choosing the right embedding model depends on content type, chunk size, and deployment constraints. Note that MTEB scores change frequently as models are updated and new benchmarks are added, so always verify current scores before making decisions.

| Model | MTEB Score | Dimensions | Cost/1M tokens | Best For |
|---|---|---|---|---|
| Cohere embed-v4 | 65.2 | 1024 | $0.10 | Multilingual, production |
| text-embedding-3-large | 64.6 | 3072 | $0.13 | General purpose |
| text-embedding-3-small | 62.3 | 1536 | $0.02 | Cost-sensitive |
| Voyage voyage-3-large | 63.8 | 1536 | $0.12 | RAG-optimized |
| BGE-M3 | 63.0 | 1024 | Self-hosted | Privacy-critical |

Embedding Optimization

python
from typing import List
import numpy as np
from openai import OpenAI

client = OpenAI()

def get_embeddings(
    texts: List[str],
    dimensions: int = 1024
) -> List[List[float]]:
    """
    Get OpenAI embeddings with dimension reduction.
    256-dim text-embedding-3-large outperforms full ada-002.
    """
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=dimensions  # Matryoshka truncation
    )
    return [item.embedding for item in response.data]


def batch_embed_with_normalization(
    texts: List[str],
    batch_size: int = 100,
    dimensions: int = 1024
) -> np.ndarray:
    """
    Embed texts in batches with L2 normalization.
    Normalization enables cosine similarity via dot product.
    """
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = get_embeddings(batch, dimensions)
        all_embeddings.extend(embeddings)

    embeddings_array = np.array(all_embeddings)

    # L2 normalize for cosine similarity
    norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True)
    return embeddings_array / norms

Metadata Extraction and Enrichment

Metadata enables filtering before semantic search, provides ranking signals, and supports source attribution.

python
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime

@dataclass
class ChunkMetadata:
    # Content-based
    keywords: List[str]
    entities: List[str]
    content_type: str

    # Structural
    document_title: str
    section_header: Optional[str]
    chunk_index: int

    # Contextual
    source_url: Optional[str]
    ingestion_date: datetime
    language: str

    # Technical
    word_count: int
    has_code: bool
    has_table: bool

Automated Extraction

python
import spacy
from collections import Counter
from typing import Dict, List

class MetadataExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")

    def extract_entities(self, text: str) -> Dict[str, List[str]]:
        """Extract named entities using spaCy."""
        doc = self.nlp(text)
        entities = {}
        for ent in doc.ents:
            if ent.label_ not in entities:
                entities[ent.label_] = []
            entities[ent.label_].append(ent.text)
        return entities

    def extract_keywords(self, text: str, top_n: int = 10) -> List[str]:
        """Extract keywords using noun chunks."""
        doc = self.nlp(text)
        noun_chunks = [chunk.text.lower() for chunk in doc.noun_chunks]
        chunk_counts = Counter(noun_chunks)
        return [word for word, _ in chunk_counts.most_common(top_n)]

    def detect_content_type(self, text: str) -> str:
        """Heuristic content type detection."""
        code_patterns = ['def ', 'function ', 'class ', 'import ', '```']
        if any(pattern in text for pattern in code_patterns):
            return 'code'

        tech_indicators = ['API', 'database', 'server', 'deployment']
        if sum(1 for ind in tech_indicators if ind.lower() in text.lower()) >= 2:
            return 'technical'

        return 'general'
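
Tying the extractor back to the ChunkMetadata dataclass above, a small glue function might look like the following. The helper name and field choices (language default, table heuristic) are illustrative assumptions, not a prescribed schema:

python
from datetime import datetime

def build_chunk_metadata(chunk_text: str, document_title: str,
                         chunk_index: int, extractor: MetadataExtractor) -> ChunkMetadata:
    """Populate ChunkMetadata for a single chunk using the extractor above."""
    entities = extractor.extract_entities(chunk_text)
    return ChunkMetadata(
        keywords=extractor.extract_keywords(chunk_text),
        entities=[e for values in entities.values() for e in values],
        content_type=extractor.detect_content_type(chunk_text),
        document_title=document_title,
        section_header=None,          # fill from structure-aware chunking if available
        chunk_index=chunk_index,
        source_url=None,
        ingestion_date=datetime.utcnow(),
        language="en",                # or plug in a language detector
        word_count=len(chunk_text.split()),
        has_code='```' in chunk_text,
        has_table='|' in chunk_text,  # crude heuristic for markdown tables
    )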

Storing with Vector Database

python
from typing import List
from qdrant_client import QdrantClient
from qdrant_client.models import (
    PointStruct, VectorParams, Distance, Filter, FieldCondition, MatchValue
)

def store_chunks_with_metadata(
    chunks: List[str],
    embeddings: List[List[float]],
    metadata_list: List[dict],
    collection_name: str = "documents"
):
    """Store chunks with rich metadata in Qdrant."""
    client = QdrantClient(host="localhost", port=6333)

    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=len(embeddings[0]),
            distance=Distance.COSINE
        )
    )

    points = [
        PointStruct(
            id=idx,
            vector=embedding,
            payload={"text": chunk, **metadata}
        )
        for idx, (chunk, embedding, metadata)
        in enumerate(zip(chunks, embeddings, metadata_list))
    ]

    client.upsert(collection_name=collection_name, points=points)


def search_with_filter(
    query_embedding: List[float],
    collection_name: str,
    content_type: str = None,
    top_k: int = 10
) -> List[dict]:
    """Search with optional metadata filter."""
    client = QdrantClient(host="localhost", port=6333)

    query_filter = None
    if content_type:
        query_filter = Filter(
            must=[FieldCondition(
                key="content_type",
                match=MatchValue(value=content_type)
            )]
        )

    results = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        query_filter=query_filter,
        limit=top_k
    )

    return [{"text": hit.payload["text"], "score": hit.score} for hit in results]

Quality Metrics for Data Preparation

Measuring data quality enables data-driven optimization and early issue detection.

python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List

def evaluate_chunk_coherence(chunks: List[str], embedding_model) -> dict:
    """
    Measure semantic coherence within chunks.
    High coherence = chunk discusses a single topic.
    """
    coherence_scores = []

    for chunk in chunks:
        sentences = chunk.split('. ')
        if len(sentences) < 2:
            coherence_scores.append(1.0)
            continue

        embeddings = np.array(embedding_model.embed(sentences))
        similarities = cosine_similarity(embeddings)

        n = len(sentences)
        coherence = (similarities.sum() - n) / (n * (n - 1)) if n > 1 else 1.0
        coherence_scores.append(coherence)

    return {
        'mean_coherence': np.mean(coherence_scores),
        'min_coherence': np.min(coherence_scores),
        'low_coherence_count': sum(1 for s in coherence_scores if s < 0.5)
    }


def evaluate_boundary_quality(chunks: List[str]) -> dict:
    """Check whether chunks have clean boundaries."""
    bad_starts = 0
    bad_ends = 0

    lowercase_starters = ['and', 'but', 'or', 'so', 'because', 'however']

    for chunk in chunks:
        first_word = chunk.split()[0].lower() if chunk.split() else ''
        if first_word in lowercase_starters:
            bad_starts += 1

        if chunk and chunk.rstrip()[-1] not in '.!?:':
            bad_ends += 1

    return {
        'bad_start_ratio': bad_starts / len(chunks),
        'bad_end_ratio': bad_ends / len(chunks),
        'clean_boundary_ratio': 1 - (bad_starts + bad_ends) / (2 * len(chunks))
    }


def evaluate_retrieval_quality(
    embeddings: np.ndarray,
    test_queries: List[str],
    relevant_chunk_ids: List[List[int]],
    embedding_model
) -> dict:
    """Evaluate embedding quality using retrieval tests."""
    query_embeddings = np.array(embedding_model.embed(test_queries))
    similarities = cosine_similarity(query_embeddings, embeddings)

    hits_at_k = {1: 0, 5: 0, 10: 0}
    mrr_sum = 0

    for i, relevant_ids in enumerate(relevant_chunk_ids):
        ranked_indices = np.argsort(similarities[i])[::-1]

        for rank, idx in enumerate(ranked_indices):
            if idx in relevant_ids:
                mrr_sum += 1 / (rank + 1)
                for k in hits_at_k:
                    if rank < k:
                        hits_at_k[k] += 1
                break

    n_queries = len(test_queries)
    return {
        'mrr': mrr_sum / n_queries,
        'hit_rate@1': hits_at_k[1] / n_queries,
        'hit_rate@5': hits_at_k[5] / n_queries,
        'hit_rate@10': hits_at_k[10] / n_queries
    }

Common Pitfalls and Solutions

Pitfall 1: Skipping Parsing Validation

Assuming parsing tools work perfectly on all documents leads to missing content and mangled tables in retrieval results. Always validate parsing output on representative samples before full ingestion.

Pitfall 2: One-Size-Fits-All Chunking

Using the same chunk size for all content types results in code split mid-function and tables losing context. Match chunking strategy to content structure.

Pitfall 3: Ignoring Lost Context

Chunks that reference "it", "this method", "as mentioned" become meaningless in isolation. Implement contextual chunking (LLM or rule-based) to make chunks self-contained.

Pitfall 4: Choosing Models by Benchmark Alone

MTEB scores do not reflect performance on specific content. A high-benchmark model can perform poorly on domain-specific queries. Evaluate embedding models on your own test queries.

Pitfall 5: Processing in Wrong Order

Chunking before cleaning or embedding before deduplication creates noisy results. Follow the pipeline: parse -> clean -> dedupe -> chunk -> enrich -> embed.
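
To make the ordering concrete, here is a hedged end-to-end sketch that wires together components defined earlier in this guide; parsing is assumed to have already produced a list of dicts like `{"text": ..., "title": ...}`, and the function name is illustrative.

python
def prepare_documents(documents: list):
    """Illustrative pipeline: clean -> dedupe -> chunk -> enrich -> embed -> store."""
    preprocessor = TextPreprocessor()
    dedup = Deduplicator(threshold=0.8)

    chunks = []
    for doc in documents:
        # 1. Clean
        text = preprocessor.process(doc["text"])
        # 2. Drop exact/near duplicates before spending money on embeddings
        if dedup.is_duplicate(text):
            continue
        # 3. Chunk, then 4. enrich with lightweight structural context
        for piece in recursive_chunking(text, chunk_size=512, overlap=50):
            chunks.append({'content': piece,
                           'metadata': {'document_title': doc["title"]}})
    chunks = add_structural_context(chunks)

    # 5. Embed, 6. Store with metadata
    texts = [c['content'] for c in chunks]
    vectors = batch_embed_with_normalization(texts).tolist()
    store_chunks_with_metadata(texts, vectors, [c['metadata'] for c in chunks])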

Building Your Pipeline: A Practical Approach

The order in which you tackle data preparation matters. Here's how to think about building your pipeline.

Start with parsing validation. Before writing any pipeline code, manually inspect parsing output for 10-20 representative documents. Look for mangled tables, missing sections, and garbled text. If your parser fails on 30% of samples, no downstream optimization will save you.

Next, establish your preprocessing baseline. Run your text through normalization, PII detection, and boilerplate removal. Compare before/after samples. The goal is clean, consistent text without losing meaningful content.

Then choose your chunking strategy based on what you learned from parsing. If your documents have clear hierarchical structure (headers, sections), leverage it with structure-aware chunking. If they're dense technical prose, recursive splitting is your friend. If you're dealing with mixed content types, consider routing different document types to different strategies.

Add context only after chunking works well. Contextual enrichment is powerful but adds cost and complexity. Get your basic pipeline producing reasonable results first, then measure whether contextual chunking improves your specific retrieval scenarios.

Finally, close the loop with metrics. Implement coherence and boundary quality checks. Create a small test set of queries with known relevant chunks. Run retrieval evaluations weekly as you tune parameters. Without measurement, you're guessing.
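
The test set itself can be tiny to start with. A sketch of the shape, feeding the evaluate_retrieval_quality function from the quality-metrics section; the queries and chunk IDs here are made up:

python
import numpy as np

# A handful of queries with the chunk IDs a human judged relevant (illustrative)
TEST_SET = {
    "How do I reset my API key?": [12, 13],
    "What is the refund window for annual plans?": [47],
}

def run_weekly_eval(chunk_embeddings: np.ndarray, embedding_model) -> dict:
    """Run the retrieval evaluation from the quality-metrics section on a fixed test set."""
    queries = list(TEST_SET.keys())
    relevant = list(TEST_SET.values())
    return evaluate_retrieval_quality(chunk_embeddings, queries, relevant, embedding_model)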

The key insight: each step depends on the previous one working correctly. Resist the urge to implement everything at once. A simple pipeline you understand beats a complex one you can't debug.

Key Takeaways

Data preparation sets the quality ceiling: The most sophisticated RAG architecture cannot compensate for poorly prepared data.

Parsing determines everything downstream: Invest in quality parsing tools and validate output before proceeding.

Context matters more than chunk size: Contextual chunking reduces retrieval failures by 35% alone, 49% with BM25, and 67% with reranking.

Quality metrics are non-negotiable: Measure parsing accuracy, chunk coherence, and retrieval quality throughout the pipeline.

Start simple, measure, enhance: Begin with RecursiveCharacterTextSplitter and quality parsing. Add complexity only when metrics justify it.
