AI/LLM Glossary: 82 Terms Every Developer Should Know
A practical, implementation-focused glossary for developers navigating the AI/LLM landscape. From tokens to agents, RAG to fine-tuning, with code examples and honest assessments.
AI terminology evolves faster than most documentation can keep up with. New terms appear weekly - RAG, RLHF, LoRA, MCP, GGUF - often with inconsistent definitions across different sources. This creates a real problem: vendor materials conflate concepts, and understanding what a term means conceptually differs significantly from knowing how to use it practically.
This glossary bridges that gap. It's not just definitions - it's implementation context, common misconceptions, and practical guidance from building LLM-powered systems. Consider it your reference for those moments when a PM asks about "embedding our knowledge base" or when you need to explain why temperature 0 doesn't prevent hallucinations.
Navigation
- Core Concepts - LLM, tokens, context window, temperature
- Model Types - Base vs instruct, multimodal, reasoning models
- RAG and Retrieval - Vector databases, chunking, hybrid search
- Fine-tuning and Training - LoRA, RLHF, quantization
- Model Formats and Local Inference - GGUF, Ollama, vLLM
- Agents and Orchestration - Tool use, MCP, multi-agent systems
- Prompt Engineering - Few-shot, templates, caching
- Production and Operations - Guardrails, rate limiting, streaming
- Cost and Metrics - Token pricing, batch discounts
- Security and Compliance - PII handling, red teaming
- Evaluation and Metrics - Benchmarks, human evaluation
- Extended Thinking - Reasoning models and thinking tokens
Core Concepts
LLM (Large Language Model)
Definition: A neural network trained on massive text datasets to predict the next token in a sequence. "Large" refers to parameter count (billions to trillions) and training data scale.
Implementation Reality: LLMs are statistical pattern matchers, not reasoning engines. They produce probable text continuations, which can look like reasoning but aren't deterministic logic.
When to Use: Text generation, summarization, code assistance, conversational interfaces.
When NOT to Use: Precise calculations, real-time data lookups, deterministic workflows.
Foundation Model
Definition: A large model pre-trained on broad data that serves as a starting point for downstream tasks. GPT-4, Claude, Llama, and Gemini are foundation models.
Key Distinction: Foundation models are general-purpose; fine-tuned models are specialized. You typically don't train foundation models yourself - you use them via API or adapt them through fine-tuning.
Practical Lesson: Building on foundation models is almost always more cost-effective than training from scratch. A custom-trained model requires millions in compute; fine-tuning costs hundreds to thousands.
Token / Tokenization
Definition: The basic unit of text that LLMs process. Tokenization splits text into subword pieces (not necessarily whole words). "tokenization" might become ["token", "ization"] or ["tok", "en", "ization"] depending on the tokenizer.
Implementation Reality: Token counts directly impact cost and context limits. A rough rule: 1 token is approximately 4 characters or 0.75 words in English.
Cost Impact: Per-token prices look negligible in isolation, but multiply them by thousands of daily requests and tokens become a significant line item.
Common Gotcha: Non-English text and code often tokenize inefficiently. Japanese text can be 2-3x more tokens than equivalent English.
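The 4-characters-per-token rule above can be turned into a quick budget estimator. This is only a heuristic - exact counts require the model's actual tokenizer (tiktoken for OpenAI models, for example):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic.

    Useful only for quick budget sanity checks; real counts come from the
    model's tokenizer and vary by language and content type.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("tokenization"))  # 12 characters -> ~3 tokens
```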
Context Window
Definition: The maximum number of tokens an LLM can process in a single request (input + output combined). Think of it as the model's "working memory."
2025 Context Windows:
Implementation Reality: Large context windows don't guarantee good performance on all that content. Models struggle with "needle in a haystack" tasks - finding specific information buried in long contexts.
Practical Lesson: Just because you can fit 200K tokens doesn't mean you should. Retrieval (RAG) often outperforms stuffing everything into context.
Temperature / Top-P
Definition: Sampling parameters that control output randomness. Temperature scales the probability distribution; Top-P (nucleus sampling) limits which tokens are considered.
- Temperature 0: Nearly deterministic (same input often produces same output)
- Temperature 0.7: Balanced creativity and coherence
- Temperature 1.0+: More random, creative, but potentially incoherent
- Top-P 0.1: Only consider tokens comprising top 10% probability mass
- Top-P 1.0: Consider all tokens (no filtering)
Common Misconception: Temperature 0 doesn't eliminate hallucinations. It makes hallucinations more consistent, not less likely.
Practical Guidance:
- Code generation: temperature 0-0.3
- Factual Q&A: temperature 0.3-0.5
- Creative writing: temperature 0.7-1.0
- Avoid using both temperature and top_p aggressively together
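To see what temperature actually does, here is a small sketch of temperature-scaled softmax over a toy set of logits - the same mechanism providers apply to the model's next-token distribution:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before softmax.

    Low temperature sharpens the distribution toward the top token; high
    temperature flattens it. Temperature 0 is typically implemented as
    greedy decoding (argmax) rather than division by zero.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.2))  # sharply peaked on first token
print(softmax_with_temperature(logits, 2.0))  # much flatter distribution
```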
Prompt / System Prompt
Definition: The text input that instructs the LLM. System prompts set persistent context and behavior; user prompts are the actual requests.
Implementation Reality: System prompts aren't foolproof. Determined users can override them through prompt injection. Never rely solely on system prompts for security.
Practical Lesson: Clear, specific system prompts dramatically improve output quality. "Be a helpful assistant" produces worse results than detailed role definitions with examples.
Completion
Definition: The text generated by an LLM in response to a prompt. Also refers to the older API paradigm (completion endpoints) vs the newer chat paradigm (chat completion endpoints).
Historical Context: Early APIs used "completion" endpoints where you provided a text prefix and the model continued it. Modern APIs use "chat completion" with structured message arrays.
Recommendation: Always use chat completion endpoints for new projects. They handle conversation context better and work with instruction-tuned models.
Inference
Definition: The process of running a trained model to generate predictions/outputs. Distinct from training (updating model weights).
Implementation Reality: Inference is what you pay for with API calls. Local inference means running models on your own hardware.
Cost Equation: Inference Cost = (Input Tokens + Output Tokens) x Price per Token
Latency Components:
- TTFT (Time to First Token): Prompt processing time
- TPS (Tokens per Second): Generation speed
- Total Latency: TTFT + (Output Tokens / TPS)
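The latency formula above can be sketched directly:

```python
def total_latency(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Total = TTFT + (Output Tokens / TPS), per the formula above."""
    return ttft_s + output_tokens / tps

# e.g. 0.5s to first token, 300 output tokens at 60 tokens/sec -> 5.5s total
print(total_latency(0.5, 300, 60))
```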
Hallucination
Definition: When an LLM generates confident-sounding but factually incorrect or fabricated information. The model "makes things up" rather than admitting uncertainty.
Why It Happens: LLMs predict probable token sequences, not truth. They've learned patterns that look correct even when content is wrong.
Practical Lesson: Hallucinations cannot be eliminated entirely. Build systems that verify LLM outputs rather than trusting them blindly. RAG with source citations helps users evaluate accuracy.
Grounding
Definition: Connecting LLM outputs to verified information sources (documents, databases, APIs) to reduce hallucinations and improve accuracy.
Implementation Approaches:
- RAG (Retrieval-Augmented Generation): Retrieve relevant documents before generation
- Tool Use: Let the model call APIs for real-time data
- Constrained Generation: Limit outputs to predefined options
Key Insight: Grounding trades off response flexibility for accuracy. A grounded system won't answer questions outside its data sources.
Model Types
Base vs Instruct Model
Definition: Base models are trained only on next-token prediction; instruct models are further trained to follow instructions and generate helpful responses.
- Base Model: Trained on raw text, predicts continuations
- Instruct Model: Fine-tuned with instruction-response pairs (SFT) and human feedback (RLHF)
Practical Difference:
When to Use Base Models: Fine-tuning for specialized tasks, research, or when you need the model to continue text naturally.
When to Use Instruct Models: Production applications, chatbots, code assistants - any task requiring instruction following.
Chat vs Completion Model
Definition: Completion models generate text continuations from a prompt; chat models are optimized for multi-turn conversational interactions.
Technical Difference: Chat models use message arrays with roles (system, user, assistant); completion models take raw text strings.
Recommendation: Use chat models for nearly all applications. Completion models are largely deprecated.
Multimodal Model
Definition: Models that process multiple input types - text, images, audio, video - in a single model architecture.
Examples: GPT-4o (text + images + audio), Claude Sonnet 4.6 (text + images), Gemini 2.5 (text + images + video + audio)
Use Cases: Document analysis with charts, code screenshot debugging, video content understanding, accessibility features.
Limitation: Multimodal processing is more expensive (images can be 100-1000+ tokens) and slower than text-only.
Reasoning Model (o1/o3)
Definition: Models specifically designed for complex reasoning tasks, trained to "think step by step" before producing answers. OpenAI's o1 and o3 series are the primary examples.
How They Differ: Reasoning models use "extended thinking" - generating internal reasoning tokens before the final answer. This improves performance on math, logic, and multi-step problems.
Trade-offs:
- Much slower than standard models (seconds to minutes for complex problems)
- Higher cost (thinking tokens are billed)
- Overkill for simple queries
- Excellent for coding, math, scientific reasoning
When to Use: Complex math problems, formal logic, code debugging requiring deep analysis, scientific reasoning.
When NOT to Use: Simple Q&A, chat, content generation - standard models are faster and cheaper.
Embedding Model
Definition: Models that convert text into dense numerical vectors (embeddings) that capture semantic meaning. Similar texts have similar vectors.
Purpose: Enable semantic search, clustering, classification, and as input to RAG systems.
Popular Embedding Models (2025):
Cost Comparison: Embedding models are much cheaper than generation models - typically $0.02-0.13 per million tokens.
Small Language Model (SLM)
Definition: Language models with fewer parameters (typically 1B-13B) optimized for efficiency, on-device deployment, and specific use cases.
Examples:
- Phi-4-mini (3.8B): Strong reasoning for its size
- Gemma 3 (1B-27B): Multimodal capable
- Llama 3.2 (1B, 3B): Mobile-optimized
- Qwen2.5 (0.5B-7B): Efficient instruction following
Advantages:
- Run on consumer hardware (laptops, phones)
- Lower latency and cost
- Privacy (no data leaves the device)
- Lower energy consumption
When to Use: On-device applications, privacy-sensitive use cases, high-volume low-complexity tasks, cost-constrained scenarios.
When NOT to Use: Complex reasoning, tasks requiring broad knowledge, when quality is paramount.
RAG and Retrieval
RAG (Retrieval-Augmented Generation)
Definition: A pattern that enhances LLM responses by retrieving relevant documents from a knowledge base before generation. The retrieved context "grounds" the response in specific data.
Why It Matters: RAG enables LLMs to answer questions about private data, recent events, or domain-specific knowledge not in their training data.
Key Trade-off: RAG adds latency (retrieval step) and complexity but dramatically improves accuracy for domain-specific questions.
Vector Database
Definition: A database optimized for storing and searching high-dimensional vectors (embeddings). Enables fast approximate nearest neighbor (ANN) search for semantic similarity.
Popular Options (2025):
Practical Lesson: Start with simpler options (Chroma, pgvector) for prototypes. Move to managed services (Pinecone, OpenSearch) for production scale.
Embedding
Definition: A dense vector representation of text (or images, etc.) that captures semantic meaning. Similar concepts have similar embeddings, enabling semantic search.
How Embeddings Work: Text is processed through an embedding model to produce a fixed-size vector (e.g., 1536 dimensions). The position in this high-dimensional space represents meaning.
Key Insight: Embedding quality directly impacts RAG performance. Better embeddings = better retrieval = better answers.
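Similarity between embeddings is usually measured with cosine similarity. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar
    meaning, close to 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding-model output:
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
invoice = [0.0, 0.2, 0.95]
print(cosine_similarity(cat, kitten))   # close to 1.0 (related concepts)
print(cosine_similarity(cat, invoice))  # close to 0 (unrelated)
```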
Chunking
Definition: Splitting documents into smaller pieces (chunks) for embedding and retrieval. Chunk size and strategy significantly impact RAG quality.
Common Strategies:
- Fixed-size chunking: Split every N characters/tokens
- Semantic chunking: Split at topic boundaries
- Recursive chunking: Split hierarchically (paragraphs -> sentences)
- Parent-child chunking: Small chunks for search, return larger parent for context
Best Practices:
- 256-512 tokens is often the sweet spot
- Include 10-20% overlap
- Preserve metadata (source, page number)
- Consider document type (code needs different chunking than prose)
Practical Lesson: Poor chunking is a common cause of RAG failures. If chunks split sentences or lose context, retrieval suffers.
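A minimal sketch of fixed-size chunking with overlap, measured in characters for simplicity (production systems usually count tokens and prefer splitting at sentence or paragraph boundaries):

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50):
    """Fixed-size chunking with overlapping windows.

    Overlap keeps context that would otherwise be cut at a chunk boundary
    available in both neighboring chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "x" * 1000  # stand-in for a real document
chunks = chunk_text(doc, chunk_size=400, overlap=50)
print(len(chunks), [len(c) for c in chunks])  # 3 chunks: 400, 400, 300 chars
```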
Semantic Search
Definition: Finding documents based on meaning rather than keyword matching. Uses embedding similarity to find conceptually related content.
Difference from Keyword Search:
Limitation: Pure semantic search can miss exact matches. "AWS CDK" is semantically similar to "infrastructure as code," but a user searching for "AWS CDK" may want exact keyword matches.
Hybrid Search
Definition: Combining semantic search (dense vectors) with keyword search (sparse, BM25) to get benefits of both approaches.
Why Hybrid: Semantic search handles paraphrasing; keyword search handles exact matches, names, and abbreviations.
Benchmark: Hybrid search typically improves retrieval precision by 15-25% over semantic-only search.
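One common way to merge the keyword and semantic result lists is Reciprocal Rank Fusion (RRF); weighted score interpolation is another. A sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists (e.g., BM25 hits and vector-search hits)
    with Reciprocal Rank Fusion: score(doc) = sum of 1 / (k + rank).
    k=60 is the conventional default."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical document IDs from the two retrievers:
keyword_hits = ["doc_aws_cdk", "doc_terraform", "doc_pulumi"]
semantic_hits = ["doc_terraform", "doc_cloudformation", "doc_aws_cdk"]
ranked = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(ranked)  # doc_terraform first: ranked highly by both retrievers
```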
Reranking
Definition: A second-stage retrieval process that re-scores initial results using a more sophisticated (but slower) model to improve precision.
How It Works:
- Initial retrieval: Get top 50-100 candidates (high recall, lower precision)
- Reranking: Score each candidate against the query using a cross-encoder
- Return top 5-10 (high precision)
Trade-off: Reranking adds 100-500ms latency but can improve precision by 40-60%.
Knowledge Base
Definition: A structured collection of documents, facts, or data that an LLM system can reference. In RAG systems, the knowledge base is what gets searched and retrieved.
Components:
- Document storage (S3, database)
- Chunked and embedded content
- Vector index for retrieval
- Metadata for filtering
AWS Bedrock Knowledge Bases Example:
Fine-tuning and Training
Fine-tuning
Definition: Adapting a pre-trained model to a specific task or domain by training it on additional, specialized data.
When to Fine-tune:
- Specific output format required
- Domain vocabulary not in base model
- Consistent style/tone needed
- After prompt engineering and RAG aren't sufficient
When NOT to Fine-tune:
- Just need domain knowledge (use RAG instead)
- Small dataset (fewer than 100 examples)
- Rapidly changing information
Cost Reality: Fine-tuning GPT-4o-mini roughly doubles inference pricing - $0.30 per 1M input tokens for the fine-tuned model versus $0.15/$0.60 (input/output) per 1M tokens for the base model.
LoRA / QLoRA
Definition: Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that trains small adapter matrices instead of full model weights. QLoRA adds 4-bit quantization for even lower memory.
Why It Matters: LoRA reduces fine-tuning memory from 100+ GB to under 16GB, making it possible on consumer hardware.
Hardware Requirements:
- Full fine-tuning 7B model: 80GB+ VRAM
- LoRA fine-tuning 7B model: 16GB VRAM
- QLoRA fine-tuning 7B model: 8GB VRAM
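LoRA's memory savings come from training only small adapter factors. A back-of-envelope parameter count, assuming 7B-class dimensions and adapting only the four attention projections (the exact target-module set is a configuration choice):

```python
def lora_params(d: int, k: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one d x k weight matrix:
    two low-rank factors, A (d x rank) and B (rank x k)."""
    return rank * (d + k)

# Assumed dimensions for a 7B-class transformer: hidden size 4096,
# 32 layers, 4 square attention projection matrices per layer, rank 8.
hidden, layers, matrices_per_layer, rank = 4096, 32, 4, 8
adapter = layers * matrices_per_layer * lora_params(hidden, hidden, rank)
print(f"{adapter:,} trainable params")  # ~8.4M, vs ~7B for full fine-tuning
```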
RLHF (Reinforcement Learning from Human Feedback)
Definition: A training technique that uses human preferences to improve model outputs. Humans rank model responses, and the model learns to produce preferred outputs.
Process:
- Generate multiple responses to prompts
- Humans rank responses (best to worst)
- Train a reward model on rankings
- Fine-tune LLM using reinforcement learning to maximize reward
Practical Reality: RLHF is how ChatGPT, Claude, and other assistants became "helpful, harmless, and honest." Most developers won't implement RLHF directly - it requires significant data and infrastructure.
Simpler Alternatives:
- DPO (Direct Preference Optimization): Skips the reward model, trains directly on preferences
- ORPO: Combines instruction tuning with preference alignment
- Constitutional AI: Uses AI to generate and evaluate responses (Anthropic's approach)
PEFT (Parameter-Efficient Fine-Tuning)
Definition: A family of techniques that fine-tune only a small subset of model parameters, reducing compute and memory requirements.
PEFT Methods:
When to Use: When you need to customize a model but don't have datacenter-scale GPU resources.
Distillation
Definition: Training a smaller "student" model to mimic a larger "teacher" model's behavior. Transfers knowledge from big models to smaller, faster ones.
How It Works:
- Run teacher model on many examples
- Capture teacher outputs (logits, intermediate representations)
- Train student to match teacher outputs
- Student learns to approximate teacher at fraction of size
Example: GPT-4 generates training data -> Used to fine-tune Llama 7B -> Smaller model with GPT-4-like behavior for specific tasks.
Practical Application:
Trade-off: Distilled models are smaller and faster but rarely match teacher quality across all tasks.
Synthetic Data
Definition: Training data generated by AI models rather than collected from real-world sources. Used to augment or replace human-labeled data.
Use Cases:
- Generating diverse training examples
- Creating hard-to-collect edge cases
- Data augmentation for fine-tuning
- Privacy-preserving data generation
Quality Warning: Synthetic data can amplify biases and errors from the generating model. Always validate quality.
Quantization (INT8/INT4/FP16)
Definition: Reducing model precision from 32-bit floats to lower precision (16-bit, 8-bit, 4-bit) to decrease memory usage and increase inference speed.
Quantization Levels:
Practical Impact: A 70B parameter model at FP32 needs ~280GB VRAM. At INT4, it fits in ~35GB.
Recommendation: INT8 offers the best quality/size trade-off for most use cases. INT4 when memory is extremely constrained.
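The memory figures above follow directly from parameter count times bytes per weight. A rough estimator (weights only - KV cache, activations, and runtime overhead add more):

```python
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for bits in (32, 16, 8, 4):
    print(f"70B model at {bits}-bit: ~{model_memory_gb(70, bits):.0f} GB")
# 32-bit: ~280 GB ... 4-bit: ~35 GB, matching the figures above
```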
Pruning
Definition: Removing unnecessary weights or entire components from a model to reduce size and improve inference speed while maintaining accuracy.
Types:
- Unstructured pruning: Remove individual weights (sparse matrices)
- Structured pruning: Remove entire neurons, attention heads, or layers
- Magnitude pruning: Remove smallest-value weights
Trade-off: Pruning can reduce model size by 30-90% but requires careful calibration to maintain quality.
Model Formats and Local Inference
GGUF / GGML
Definition: Model file formats designed for efficient local LLM inference. GGUF (GPT-Generated Unified Format) is the successor to GGML, used by llama.cpp and Ollama.
Why GGUF: Combines model weights, tokenizer, and metadata in a single portable file. Supports various quantization levels.
Quantization Variants:
File Size Examples:
MLX (Apple Silicon)
Definition: Apple's machine learning framework optimized for Apple Silicon (M1/M2/M3/M4). Enables efficient local LLM inference on Macs.
Advantages:
- Optimized for unified memory architecture
- Faster than llama.cpp on Apple Silicon for many workloads
- Python API similar to PyTorch/NumPy
Performance: MLX achieves ~230 tokens/second on M3 Max vs ~40 tokens/second with Ollama for comparable models.
ONNX (Open Neural Network Exchange)
Definition: An open format for representing machine learning models, enabling interoperability between frameworks (PyTorch, TensorFlow, etc.).
Use Case: Export model from PyTorch, run with ONNX Runtime for optimized inference across platforms.
SafeTensors
Definition: A secure model serialization format developed by Hugging Face. Unlike pickle-based formats, SafeTensors cannot execute arbitrary code during loading.
Why It Matters: Traditional PyTorch model files (.pt, .bin) use pickle, which can execute malicious code when loaded. SafeTensors stores only tensor data.
Adoption: 42% of Hugging Face models now use SafeTensors. Always prefer .safetensors files when available.
AWQ (Activation-aware Weight Quantization)
Definition: A quantization method that preserves accuracy by identifying and protecting the most important weights based on activation patterns.
Advantage over GPTQ: AWQ often achieves better quality at the same bit-width because it's smarter about which weights can be quantized aggressively.
GPTQ (GPT Quantization)
Definition: A post-training quantization method for large language models. Compresses models to 4-bit or 8-bit while minimizing accuracy loss.
Comparison:
Ollama
Definition: A tool for running LLMs locally with a simple CLI and API. Manages model downloads, quantization, and serving.
Best For: Quick local experimentation, privacy-sensitive applications, development without API costs.
LM Studio
Definition: A desktop application for discovering, downloading, and running local LLMs with a graphical interface. Supports both llama.cpp and MLX backends.
Features:
- Model browser for Hugging Face
- Automatic quantization selection
- OpenAI-compatible API server
- GPU acceleration on Mac, Windows, Linux
llama.cpp
Definition: A C/C++ implementation for LLM inference, enabling efficient execution on CPUs and various GPUs. The foundation for many local LLM tools.
Key Features:
- CPU-first design (works without GPU)
- Supports CUDA, Metal, Vulkan
- GGUF model format
- Quantization support (Q2-Q8)
vLLM
Definition: A high-throughput LLM serving engine optimized for production workloads. Uses PagedAttention for efficient memory management.
When to Use: Production API serving, high concurrency, when you need maximum throughput.
Comparison with llama.cpp:
- vLLM: Higher throughput, better batching, production-focused
- llama.cpp: Better for single-user, CPU inference, local use
TGI (Text Generation Inference)
Definition: Hugging Face's production inference server for LLMs. Optimized for high-performance serving.
Features:
- Continuous batching
- Tensor parallelism
- Quantization support
- OpenAI-compatible API
Agents and Orchestration
AI Agent
Definition: An LLM-powered system that can take actions, use tools, and work autonomously toward goals. Agents perceive, decide, and act in a loop.
Key Distinction: Chatbots respond; agents act. An agent might search the web, run code, update databases, and call APIs to complete a task.
Tool Use / Function Calling
Definition: The ability of LLMs to invoke external functions or APIs. The model outputs structured calls that your code executes.
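A sketch of the application side of function calling: the model emits a structured call (the exact schema varies by provider), and your code dispatches it. The function name and simulated weather response here are illustrative:

```python
import json

# Tools the application exposes. The schema you advertise to the provider
# varies by API; what's constant is that the model returns a structured
# call and YOUR code executes it, then feeds the result back.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real weather API call

TOOLS = {"get_weather": get_weather}

def execute_tool_call(tool_call: dict) -> str:
    """Dispatch a model-emitted call shaped like
    {"name": "...", "arguments": "<JSON string>"}."""
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        return f"Error: unknown tool {tool_call['name']!r}"
    args = json.loads(tool_call["arguments"])
    return fn(**args)

# Simulated model output (in practice, parsed from the API response):
call = {"name": "get_weather", "arguments": '{"city": "Tokyo"}'}
print(execute_tool_call(call))  # Sunny in Tokyo
```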
MCP (Model Context Protocol)
Definition: An open standard by Anthropic for connecting AI agents to external tools and data sources. Think "USB-C for AI" - a universal protocol for tool integration.
Why MCP: Before MCP, every LLM provider had proprietary tool integration. MCP standardizes how agents access external capabilities.
Architecture:
- MCP Servers: Expose tools (file system, databases, APIs)
- MCP Clients: AI applications that consume tools
- Transport: JSON-RPC over stdio or HTTP
Adoption (2025): Anthropic launched MCP in November 2024. OpenAI, Google, Microsoft, and major toolmakers adopted it throughout 2025. It's becoming the de facto standard.
Agentic Workflow
Definition: A multi-step process where an LLM autonomously plans, executes, and iterates to achieve a goal. More sophisticated than single-turn conversations.
Patterns:
- Sequential: Steps execute in order
- Parallel: Independent steps run concurrently
- Conditional: Branching based on results
- Iterative: Repeat until success criteria met
ReAct Pattern
Definition: "Reasoning and Acting" - an agent architecture that interleaves thinking (reasoning) with tool use (acting). The model explains its reasoning before each action.
Format:
Benefit: Separating reasoning from action improves reliability and makes agent behavior interpretable.
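A minimal sketch of the Thought -> Action -> Observation cycle, with a scripted stand-in for the model. A real agent calls an LLM at each step and parses its reply; the question, tool, and answers here are made up:

```python
def scripted_model(transcript: str) -> dict:
    """Stand-in for the LLM: emits a thought plus either an action or a
    final answer, based on what's in the transcript so far."""
    if "Observation:" not in transcript:
        return {"thought": "I need the population figure.",
                "action": ("lookup", "population of France")}
    return {"thought": "The observation answers the question.",
            "final": "About 68 million."}

def lookup(query: str) -> str:
    return "France population: ~68 million (2024 estimate)"  # fake tool

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = scripted_model(transcript)
        transcript += f"Thought: {step['thought']}\n"
        if "final" in step:
            return step["final"]
        tool, arg = step["action"]
        observation = lookup(arg) if tool == "lookup" else "unknown tool"
        transcript += f"Action: {tool}({arg!r})\nObservation: {observation}\n"
    return "Stopped after max_steps without an answer."

print(react_loop("What is the population of France?"))
```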
Chain-of-Thought (CoT)
Definition: Prompting technique that instructs the model to show its reasoning step-by-step before answering. Improves performance on complex reasoning tasks.
Variants:
- Zero-shot CoT: Just add "Let's think step by step"
- Few-shot CoT: Provide examples with reasoning
- Tree-of-Thought: Explore multiple reasoning paths
Multi-Agent Systems
Definition: Architectures where multiple specialized AI agents collaborate to solve problems. Each agent has a role (researcher, coder, reviewer, etc.).
Example Architecture:
2025 Trend: Gartner reported 1,445% increase in multi-agent system inquiries from Q1 2024 to Q2 2025.
Orchestration
Definition: Coordinating multiple LLM calls, tool uses, and agents to complete complex workflows. The "conductor" managing the AI orchestra.
Frameworks:
Memory (Short/Long-term)
Definition: Mechanisms for agents to retain information across interactions. Short-term memory persists within a session; long-term memory persists across sessions.
Types:
- Buffer Memory: Recent conversation turns (context window)
- Summary Memory: Compressed history
- Vector Memory: Embeddings of past interactions for retrieval
- Entity Memory: Extracted facts about people, places, concepts
Prompt Engineering
Zero-shot / Few-shot Prompting
Definition:
- Zero-shot: Model performs a task without any examples
- Few-shot: Model is given examples before the actual task
When to Use:
- Zero-shot: Well-known tasks, capable models
- Few-shot: Specific formats, edge cases, consistency needed
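In chat form, a few-shot prompt is just worked examples placed in the message array before the real input. A sketch using the common role convention (the classification task is illustrative):

```python
def build_few_shot_messages(task_input: str) -> list[dict]:
    """Two worked examples, then the actual task."""
    return [
        {"role": "system",
         "content": "Classify sentiment as positive or negative. Reply with one word."},
        # Example 1
        {"role": "user", "content": "The battery life is fantastic."},
        {"role": "assistant", "content": "positive"},
        # Example 2
        {"role": "user", "content": "It broke after two days."},
        {"role": "assistant", "content": "negative"},
        # The actual task
        {"role": "user", "content": task_input},
    ]

messages = build_few_shot_messages("Shipping was slow but support was great.")
print(len(messages))  # 6 messages: system + 2 examples + the real input
```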
Prompt Template
Definition: A reusable prompt structure with placeholders for dynamic content. Separates prompt logic from input data.
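A minimal sketch using Python's built-in str.format placeholders (dedicated prompt libraries layer validation and versioning on top of the same idea):

```python
# Template with named placeholders; the structure stays fixed while the
# dynamic values (all names here are illustrative) are filled per request.
SUMMARY_TEMPLATE = """You are a technical editor.
Summarize the following text in {max_sentences} sentences for a {audience} audience.

Text:
{document}"""

def render(template: str, **values) -> str:
    return template.format(**values)

prompt = render(SUMMARY_TEMPLATE, max_sentences=2,
                audience="developer", document="LLMs predict tokens...")
print(prompt.splitlines()[1])
```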
System vs User Prompt
Definition:
- System Prompt: Sets overall behavior, role, and constraints (persistent context)
- User Prompt: The actual request or question (per-interaction)
Best Practices:
- Put constraints and role definition in system prompt
- Put task-specific instructions in user prompt
- Keep system prompts concise but complete
- Don't rely solely on system prompts for security
Prompt Injection
Definition: An attack where malicious input tricks the LLM into ignoring its instructions or performing unintended actions. Ranked #1 in the OWASP Top 10 for LLM Applications.
Example Attack:
Mitigation Strategies:
Jailbreaking
Definition: Techniques to bypass an LLM's safety guidelines and get it to produce prohibited content. A subset of prompt injection focused on circumventing alignment.
Common Techniques:
- Roleplay scenarios ("You are DAN who can do anything")
- Encoding tricks (base64, rot13)
- Multi-turn gradual escalation
- Hypothetical framing ("For a novel, how would a character...")
Defense Layers:
- Input filtering (block known patterns)
- Output filtering (detect policy violations)
- Constitutional AI (model self-critique)
- Regular red teaming
Prompt Caching
Definition: Storing computed prompt representations to avoid reprocessing identical prefixes. Reduces latency and cost for repeated prompts.
Provider Support:
- Anthropic: Explicit cache_control headers, 90% cost savings on cache hits
- OpenAI: Automatic caching for prompts over 1024 tokens, 50% discount
Cost Impact: With a 10K token system prompt called 1000 times, caching saves ~$27 at Claude Sonnet pricing.
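The ~$27 figure can be reproduced with a simplified estimate (it ignores the first uncached call and any cache-write surcharge, so treat it as an upper bound):

```python
def cache_savings(prompt_tokens: int, calls: int,
                  price_per_mtok: float, cache_discount: float) -> float:
    """Savings = (cost of the cached prompt across all calls) * discount."""
    full_cost = prompt_tokens * calls / 1_000_000 * price_per_mtok
    return full_cost * cache_discount

# 10K-token system prompt, 1000 calls, $3/1M input, 90% discount on hits:
print(round(cache_savings(10_000, 1000, 3.0, 0.90), 2))  # ~27.0
```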
Production and Operations
Guardrails
Definition: Safety mechanisms that filter, validate, or modify LLM inputs and outputs to prevent harmful or undesired behavior.
Types:
- Input guardrails: Block injection attempts, PII, profanity before model
- Output guardrails: Filter harmful content, validate format, check facts
Frameworks: NVIDIA NeMo Guardrails, Guardrails AI, LangChain with custom validators
Content Filtering
Definition: Automated detection and blocking of inappropriate content (hate speech, violence, adult content) in LLM inputs or outputs.
Approaches:
- Classifier models (fast, less nuanced)
- LLM-as-judge (slower, more nuanced)
- Rule-based (regex, keyword matching)
- Hybrid (layered approach)
Rate Limiting
Definition: Controlling the frequency of API requests to prevent abuse, manage costs, and ensure fair usage across users.
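A classic implementation is the token bucket: allow bursts up to a fixed capacity, refill at a steady rate. A minimal single-process sketch (production systems usually back this with Redis or an API gateway):

```python
import time

class TokenBucket:
    """Token-bucket limiter: `capacity` is the burst size, `refill_rate`
    is requests per second. allow() returns True if the request may proceed."""
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=2.0)  # burst of 5, 2 req/s
results = [bucket.allow() for _ in range(6)]
print(results)  # first 5 allowed, 6th rejected until the bucket refills
```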
Batch Processing
Definition: Grouping multiple LLM requests for processing together, typically at reduced cost with higher latency.
Benefits:
- 50% cost reduction (OpenAI, Anthropic batch APIs)
- Better for offline/async workloads
- More efficient resource utilization
When to Use: Analytics, content generation, data processing - anything not user-facing real-time.
Streaming
Definition: Receiving LLM output token-by-token as it's generated rather than waiting for the complete response.
Benefits:
- Faster perceived latency (TTFT matters more than total time)
- Better UX for chat interfaces
- Can cancel generation early
Latency / TTFT (Time to First Token)
Definition:
- Latency: Total time from request to complete response
- TTFT: Time until the first token appears (critical for UX)
Latency Formula: Total = TTFT + (Output Tokens / TPS)
TTFT Benchmarks:
Optimization Strategies:
- Prompt caching
- Smaller prompts
- Faster models for routing
- Edge deployment
Token Budget
Definition: The maximum tokens allocated for a request, considering costs, context limits, and quality trade-offs.
Model Routing
Definition: Directing queries to different models based on complexity, cost, or capability requirements.
Cost Impact: Routing can reduce costs by 60%+ by using expensive models only when needed.
Cost and Metrics
Input/Output Tokens
Definition: The distinction between tokens in the prompt (input) and tokens in the response (output). Output tokens are typically 2-5x more expensive.
Pricing Example (Claude Sonnet 4.6):
- Input: $3 per million tokens
- Output: $15 per million tokens
Cost Calculation:
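Using the example rates above, a sketch of the per-request calculation:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars, with prices quoted per million tokens."""
    return (input_tokens * input_price
            + output_tokens * output_price) / 1_000_000

# 10K prompt tokens + 2K response tokens at $3 input / $15 output per 1M:
print(request_cost(10_000, 2_000, 3.0, 15.0))  # 0.06 -> six cents
```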
Cost per Million Tokens
Definition: Standard pricing unit for LLM APIs. Enables cost comparison across providers and models.
2025 Pricing Comparison:
Context Window Pricing
Definition: Some providers charge differently based on how much of the context window is used, especially for very long contexts.
Example: Gemini 2.5 Pro charges standard rates up to 200K tokens, then 2x for prompts over 200K.
Batch API Discount
Definition: Reduced pricing for batch/async API requests where results can be delayed (typically 24 hours).
Discounts:
- OpenAI: 50% off standard pricing
- Anthropic: 50% off + prompt caching compatible
When to Use: Data processing, content generation, analytics - non-real-time workloads.
Security and Compliance
PII Handling
Definition: Protocols for handling Personally Identifiable Information when using LLMs. Critical for GDPR, HIPAA, and other regulations.
Best Practices:
- Redact PII before sending to LLM
- Use on-premise/private deployments for sensitive data
- Implement output scanning
- Log and audit data flows
Data Residency
Definition: Requirements about where data is physically stored and processed. Many regulations require data to stay within specific geographic regions.
Provider Options:
- OpenAI: US, EU (Azure OpenAI)
- Anthropic: US, EU (via AWS Bedrock)
- AWS Bedrock: Multiple regions
- Azure OpenAI: 20+ regions
Consideration: API calls may cross borders even if data storage is regional. Verify both processing and storage locations.
Model Card
Definition: Documentation describing a model's capabilities, limitations, training data, intended use, and known biases. Like a "nutrition label" for AI models.
Standard Sections:
- Model details (architecture, training)
- Intended use cases
- Limitations and risks
- Performance metrics
- Ethical considerations
- Training data sources
Why It Matters: EU AI Act requires documented model information for high-risk AI systems. Model cards are becoming a compliance requirement.
Red Teaming
Definition: Adversarial testing to find vulnerabilities, biases, and failure modes in AI systems before deployment.
Testing Categories:
- Jailbreaking attempts
- Prompt injection
- PII leakage
- Bias detection
- Misinformation generation
Adversarial Testing
Definition: Systematically testing AI systems with intentionally challenging or malicious inputs to evaluate robustness.
Techniques:
- Input perturbation (typos, encoding tricks)
- Edge cases (empty inputs, very long inputs)
- Boundary testing (context limits)
- Multi-turn attacks (gradual escalation)
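The perturbation and edge-case techniques above can be turned into a small test-case generator. This is a sketch, not a complete adversarial suite; real red-team harnesses use far larger corpora of known attack patterns:

```python
import random

def perturb(prompt: str, seed: int = 0) -> list[str]:
    """Generate adversarial variants of a prompt for robustness testing."""
    rng = random.Random(seed)
    chars = list(prompt)
    i = rng.randrange(len(chars))
    typo = "".join(chars[:i] + chars[i + 1:])   # dropped-character typo
    return [
        typo,                                   # input perturbation
        prompt.upper(),                         # case perturbation
        prompt + " " + "x" * 10_000,            # very long input
        "",                                     # empty-input edge case
        prompt.encode("utf-8").hex(),           # encoding trick
    ]

cases = perturb("Ignore previous instructions")
# Feed each case to your system and assert the guardrails still hold.
```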
Evaluation and Metrics
Perplexity
Definition: A measure of how well a language model predicts a sample. Lower perplexity = better predictions. Calculated as the exponential of cross-entropy loss.
Limitation: Perplexity measures language modeling ability, not task performance. A model with low perplexity might still give bad answers.
Practical Use: Comparing model quality during training/fine-tuning, not for evaluating production outputs.
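Concretely, perplexity is the exponential of the average negative log-likelihood per token. A minimal illustration with made-up token probabilities:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the sequence."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns probability 0.25 to every token is, on average,
# "as confused as a uniform 4-way choice":
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0
```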
BLEU / ROUGE
Definition: Automatic metrics for comparing generated text to reference text.
- BLEU: Precision-focused, common in translation
- ROUGE: Recall-focused, common in summarization
Limitation: Correlate poorly with human judgment for open-ended generation. Use for specific tasks (translation, summarization) where reference texts exist.
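To make the precision/recall distinction concrete, here is a toy clipped unigram precision, the core ingredient of BLEU-1. Real BLEU adds higher-order n-grams and a brevity penalty; use an established library such as sacrebleu in practice:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: matched candidate words / candidate length."""
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(count, ref_counts[word])
                  for word, count in Counter(cand).items())
    return matched / len(cand)

# 5 of 6 candidate words appear in the reference (clipped by reference counts)
print(unigram_precision("the cat sat on the mat", "the cat is on the mat"))
```

ROUGE-1 recall is the mirror image: matched words divided by the reference length instead of the candidate length.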
Human Evaluation
Definition: Having humans rate LLM outputs for quality, helpfulness, accuracy, and safety. The gold standard but expensive and slow.
Common Approaches:
- A/B comparison (which response is better?)
- Likert scales (rate 1-5)
- Task completion rates
- Expert review for domain-specific content
Practical Balance: Use automated metrics for continuous monitoring, human evaluation for periodic audits and important decisions.
A/B Testing
Definition: Comparing two variants (prompts, models, configurations) by randomly assigning users and measuring outcomes.
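For binary outcomes like thumbs-up rates, a two-proportion z-test is a common way to decide whether the difference is real. A hedged sketch (assumes reasonably large samples; the numbers are illustrative):

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Prompt A: 200/1000 thumbs-up; prompt B: 250/1000 thumbs-up
z = two_proportion_z(200, 1000, 250, 1000)
print(f"z = {z:.2f}")  # |z| > 1.96 is significant at roughly the 5% level
```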
Benchmarks (MMLU, HumanEval)
Definition: Standardized test suites for measuring LLM capabilities.
Key Benchmarks:
- MMLU: Multiple-choice questions across 57 academic subjects; tests broad knowledge and reasoning
- HumanEval: Python programming problems scored by whether generated code passes unit tests
Reality Check: Benchmarks are increasingly "gamed" by training on test data. Real-world performance often differs from benchmark scores.
Extended Thinking
Extended Thinking / Deep Thinking
Definition: A mode where the model generates internal reasoning tokens before producing the final answer. Used by Claude Sonnet 4.6, OpenAI o1/o3 series.
How It Works:
- Model receives query
- Generates "thinking" tokens (visible or summarized)
- Uses reasoning to produce better final answer
- Thinking tokens count toward output costs
Cost Consideration: Thinking tokens are billed as output tokens. A query with 5000 thinking tokens + 500 answer tokens costs the same as 5500 output tokens.
When to Use:
- Complex math/logic problems
- Multi-step reasoning
- Code debugging requiring analysis
- Tasks where accuracy matters more than speed
When NOT to Use:
- Simple Q&A (massive overkill)
- Real-time chat (too slow)
- High-volume, low-complexity tasks
Practical Lesson: Extended thinking dramatically improves accuracy on hard problems but is wasted on easy ones. Use model routing to save costs.
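A minimal routing heuristic along those lines might look like the following. The model names and the complexity signal are placeholders; production routers often use a cheap classifier model instead of keyword cues:

```python
def route(query: str) -> dict:
    """Send only complex queries to an extended-thinking model."""
    reasoning_cues = ("prove", "debug", "step by step", "why does", "optimize")
    is_complex = (len(query.split()) > 50
                  or any(cue in query.lower() for cue in reasoning_cues))
    if is_complex:
        return {"model": "reasoning-model", "thinking": True}   # slow, costly, accurate
    return {"model": "fast-model", "thinking": False}           # cheap, low latency

print(route("What is the capital of France?"))
print(route("Debug this recursion and explain why it stack-overflows"))
```

Even a crude router like this can cut costs substantially when most traffic is simple Q&A.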
Key Takeaways
- Tokens are the currency of LLMs - Understanding tokenization is essential for cost management and context window planning
- RAG before fine-tuning - Most use cases are better served by retrieval than expensive fine-tuning
- Temperature controls randomness, not accuracy - Low temperature doesn't prevent hallucinations
- Hybrid search beats pure semantic - Combine vector and keyword search for best results
- System prompts aren't security - Use guardrails, validation, and defense in depth
- Model routing saves money - Use expensive models only when complexity warrants
- Extended thinking is powerful but expensive - Reserve for complex reasoning tasks
- Evaluation is non-negotiable - Automated metrics catch issues before users do
- MCP is becoming standard - Invest in MCP integrations for future-proof tool use
- Local inference is viable - SLMs and GGUF models enable privacy-preserving, cost-free inference
Common Pitfalls and Lessons
Pitfall 1: Ignoring Token Costs
- Problem: Building features without understanding cost implications
- Lesson: Calculate cost per user action early; a chatty agent can cost $0.50+ per conversation
Pitfall 2: Over-relying on System Prompts
- Problem: Assuming system prompts provide security
- Lesson: System prompts can be overridden; add guardrails and validation
Pitfall 3: Temperature 0 = No Hallucinations
- Problem: Believing deterministic = accurate
- Lesson: Temperature controls randomness, not truthfulness; hallucinations persist at temperature 0
Pitfall 4: Stuffing Everything in Context
- Problem: Using max context window because you can
- Lesson: Models struggle with long contexts; RAG with good retrieval often outperforms
Pitfall 5: Choosing Models by Benchmark
- Problem: Selecting models based on MMLU scores
- Lesson: Benchmarks are saturated and gamed; test on your specific use cases
Pitfall 6: Building Before Evaluating
- Problem: No evaluation framework until production
- Lesson: Set up automated evaluation early; you can't improve what you don't measure
Pitfall 7: Ignoring Latency
- Problem: Optimizing only for quality
- Lesson: Users abandon slow experiences; TTFT matters more than you think
This glossary serves as your field guide. Bookmark it, reference it during architecture discussions, and use it to educate your teams. The next time someone suggests "just use GPT-4 for everything" or "RAG is too complex," you'll know what to say and why.
References
- platform.openai.com - Prompt engineering guide (OpenAI API docs).
- anthropic.com - Anthropic research note: building effective agents.
- docs.aws.amazon.com - Amazon Bedrock Knowledge Bases (RAG on AWS).
- spec.modelcontextprotocol.io - Model Context Protocol (MCP) specification.