
FinOps for AI Workloads: Managing LLM Costs in Production

Token-based pricing creates unique cost challenges for production LLM applications. Learn systematic optimization strategies including prompt caching, model routing, and token budgets to reduce costs by 60-80% without sacrificing quality.

Abstract

Running Large Language Models in production introduces a fundamentally different cost model than traditional cloud infrastructure. Token-based pricing means costs can vary 100x based on usage patterns, prompt design, and model selection. Unlike predictable compute-hour billing, LLM expenses can spike unexpectedly from poorly optimized prompts or unbounded tool usage.

This guide explores systematic approaches to LLM cost optimization, including prompt caching (90% savings), intelligent model routing (30-50% reduction), token budget enforcement, and semantic caching. Teams implementing these patterns typically achieve 60-80% cost reduction while maintaining quality.

The Token-Based Billing Challenge

Traditional cloud FinOps principles don't translate directly to LLM workloads. A single poorly designed prompt can consume more tokens than thousands of optimized requests.

Cost Variability Example

python
# Simple query: "What's the weather?"
# Input: 50 tokens (user message + system prompt)
# Output: 30 tokens
# Cost (GPT-4): 50 × $10/1M + 30 × $30/1M ≈ $0.0014

# Complex RAG query: "Analyze Q4 sales trends and recommend strategies"
# Input: 8,050 tokens (50 user + 6,000 system prompt + 2,000 RAG context)
# Output: 500 tokens
# Cost (GPT-4): $0.095 (~68x more expensive)

# Tool-call storm: Agent invokes 20 tools in one turn
# Input across tool calls: 8,050 tokens × 20 = 161,000 tokens
# Output across tool calls: 500 tokens × 20 = 10,000 tokens
# Cost (GPT-4): $1.91 (~1,365x more expensive)

The problem compounds when applications scale from proof-of-concept to production. I've worked with teams where monthly bills jumped from $500 to $15,000 within weeks of launch, catching everyone by surprise.

Understanding Provider Pricing Models

Different providers offer distinct pricing structures that significantly impact total cost of ownership.

AWS Bedrock Pricing Tiers

Standard (On-Demand): Token-based billing with no commitments

  • Claude Sonnet 4.6: $3 input, $15 output per 1M tokens
  • Most flexible option, highest per-token cost

Batch Inference: 50% discount for asynchronous workloads

  • Ideal for overnight reports, bulk document analysis
  • Non-real-time processing acceptable

Provisioned Throughput: Time-based pricing for high-volume scenarios

  • Reserved capacity with predictable costs
  • Example: Claude Haiku 4.5 with Provisioned Throughput (6-month commitment)
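To make these tiers concrete, here is a minimal sketch comparing on-demand and batch pricing for an overnight reporting workload. The request volume and token counts are assumptions for illustration; the rates are the Claude Sonnet on-demand prices listed above.

python
# Hypothetical workload: 2,000 overnight document summaries per day,
# 5,000 input / 500 output tokens each, at Claude Sonnet on-demand rates.
INPUT_RATE, OUTPUT_RATE = 3.00 / 1_000_000, 15.00 / 1_000_000

requests_per_day = 2_000
input_tokens, output_tokens = 5_000, 500

on_demand_daily = requests_per_day * (
    input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
)
batch_daily = on_demand_daily * 0.50  # Batch inference: 50% discount

print(f"On-demand: ${on_demand_daily:.2f}/day, batch: ${batch_daily:.2f}/day")
# On-demand: $45.00/day, batch: $22.50/day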

OpenAI Pricing Structure

python
PRICING = {
    'gpt-4-turbo': {
        'input': 10.00 / 1_000_000,
        'output': 30.00 / 1_000_000
    },
    'gpt-4o': {
        'input': 2.50 / 1_000_000,
        'output': 10.00 / 1_000_000
    },
    'gpt-4o-mini': {
        'input': 0.15 / 1_000_000,
        'output': 0.60 / 1_000_000
    }
}

# Key insight: Output tokens cost 2-5x more than input tokens
# GPT-4 Turbo: Output is 3x more expensive ($30 vs $10 per 1M)
# GPT-4o: Output is 4x more expensive ($10 vs $2.50 per 1M)

Anthropic Direct Pricing

  • Claude Opus 4.1: $15 input, $75 output per 1M tokens
  • Claude Opus 4.5: $5 input, $25 output per 1M tokens (newer, more cost-effective)
  • Claude Sonnet 3.5: $3 input, $15 output per 1M tokens
  • Claude Haiku 3: $0.25 input, $1.25 output per 1M tokens
  • Claude Haiku 4.5: $1 input, $5 output per 1M tokens (newer generation)
  • Prompt Caching: 90% discount on cached tokens, with up to 85% latency reduction
  • Cache Write Premium: 25% premium on cache writes (one-time cost when content is first cached)
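The write premium and read discount combine into a simple rule of thumb: the one-time 25% premium is already recovered on the first cache hit, since each hit saves 90% of the prefix cost. A minimal sketch using the Sonnet rates above:

python
# Minimal sketch: cost of serving N requests that share a 10,000-token static
# prefix at Claude Sonnet rates, with and without prompt caching.
BASE = 3.00 / 1_000_000          # $ per input token
CACHE_WRITE = BASE * 1.25        # 25% premium on the first (cache-writing) request
CACHE_READ = BASE * 0.10         # 90% discount on cache hits

def prefix_cost(num_requests: int, prefix_tokens: int = 10_000) -> tuple[float, float]:
    uncached = num_requests * prefix_tokens * BASE
    cached = prefix_tokens * CACHE_WRITE + (num_requests - 1) * prefix_tokens * CACHE_READ
    return uncached, cached

for n in (1, 2, 10, 100):
    u, c = prefix_cost(n)
    print(f"{n:>3} requests: uncached ${u:.3f} vs cached ${c:.3f}")
# Caching costs slightly more for a single call (the 25% write premium)
# and is already cheaper from the second call onward.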

Optimization Strategy 1: Prompt Caching

Prompt caching provides the highest cost reduction with minimal implementation effort. By marking static prompt components as cacheable, subsequent requests within the cache TTL period receive a 90% discount on those tokens.

Implementation with AWS Bedrock

python
import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

# Large system prompt with company policies (10,000 tokens)
SYSTEM_PROMPT = """You are a customer support agent for Acme Corp.

Company Policies:
[... 8,000 tokens of policies, procedures, FAQs ...]

Communication Style:
- Professional but friendly
- Concise responses (max 200 words)
- Always include relevant policy references

Tool Usage Guidelines:
[... 2,000 tokens of tool documentation ...]"""

def invoke_with_caching(user_message: str):
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-sonnet-4-20250929-v1:0",  # Include -v1:0 suffix
        messages=[
            {
                "role": "user",
                "content": [{"text": user_message}]
            }
        ],
        system=[
            {"text": SYSTEM_PROMPT},
            {"cachePoint": {"type": "default"}}  # Cache everything above for ~5 minutes
        ]
    )

    # Cost analysis
    usage = response['usage']

    # First call: full input token cost + cache write premium (25%)
    #   input_tokens: 10,000 (system) + 50 (user message) = 10,050
    #   cacheReadInputTokensCount: 0
    #   cacheCreationInputTokensCount: 10,000 (25% premium on first write)

    # Subsequent calls within 5 minutes:
    #   input_tokens: 50 (user message only)
    #   cacheReadInputTokensCount: 10,000 (90% discount)

    print(f"Input tokens: {usage.get('inputTokens', 0)}")
    print(f"Cached input tokens: {usage.get('cacheReadInputTokensCount', 0)}")
    print(f"Cache creation tokens: {usage.get('cacheCreationInputTokensCount', 0)}")
    print(f"Output tokens: {usage.get('outputTokens', 0)}")

    return response

Cost Impact Analysis

python
# Without caching (100 requests/day):
# Daily cost: (10,050 * 100 * $3/1M) + (200 * 100 * $15/1M) = $3.32
# Monthly cost: $3.32 * 30 = $99.60

# With caching (1 cache write + 99 cache hits per day):
# First request (cache write, 25% premium on the cached prefix):
#   (10,000 * $3.75/1M) + (50 * $3/1M) + (200 * $15/1M) = $0.041
# Cached requests: (50 * $3/1M) + (10,000 * $0.30/1M) + (200 * $15/1M) = $0.0062
# Daily cost: $0.041 + (99 * $0.0062) = $0.65
# Monthly cost: $0.65 * 30 ≈ $19.60
# Savings: ~80% reduction
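The same arithmetic generalizes to any hit rate. A small helper, assuming the same request shape and Sonnet prices as above, makes it easy to model your own traffic (the one-time 25% cache-write premium is ignored here):

python
def monthly_cost_with_caching(
    requests_per_day: int,
    hit_rate: float,                  # fraction of requests served from cache
    system_tokens: int = 10_000,
    user_tokens: int = 50,
    output_tokens: int = 200,
    input_rate: float = 3.00 / 1e6,
    output_rate: float = 15.00 / 1e6,
    cached_rate: float = 0.30 / 1e6,  # 90% discount on cache reads
) -> float:
    """Rough monthly cost estimate; ignores the one-time 25% cache-write premium."""
    miss = (system_tokens + user_tokens) * input_rate + output_tokens * output_rate
    hit = user_tokens * input_rate + system_tokens * cached_rate + output_tokens * output_rate
    per_request = hit_rate * hit + (1 - hit_rate) * miss
    return per_request * requests_per_day * 30

print(f"${monthly_cost_with_caching(100, 0.0):.2f}")   # no caching    -> ~$99.45
print(f"${monthly_cost_with_caching(100, 0.8):.2f}")   # 80% hit rate  -> ~$34.65
print(f"${monthly_cost_with_caching(100, 0.99):.2f}")  # 99% hit rate  -> ~$19.26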

Implementation Best Practices

Structure Prompts for Caching:

  • Place static content (policies, instructions) first
  • Dynamic context (user data, timestamps) goes in user messages
  • Avoid changing cached sections unnecessarily

Common Pitfalls:

  • Dynamic Timestamps: Adding current_time to system prompt invalidates cache every request
  • Intermittent Traffic: with the 5-minute TTL, gaps between requests let the cache expire (a keep-warm option is sketched after the example below)
  • Prompt Versioning: Deploy prompt changes during low-traffic periods
python
# BAD: Dynamic content invalidates the cache
system_prompt = f"""You are a support agent.
Current time: {datetime.now().isoformat()}  # Changes every request!
[... rest of prompt ...]"""

# GOOD: Static prompt; dynamic context goes in the user message
system_prompt = """You are a support agent.
[... static policies and instructions ...]"""

user_message = f"""Current time: {datetime.now().isoformat()}
User question: {question}"""

Optimization Strategy 2: Intelligent Model Routing

Not all queries require the most powerful (and expensive) model. Routing queries based on complexity can reduce costs by 30-50% with minimal quality impact.

Custom Routing Implementation

typescript
import OpenAI from 'openai';

interface ModelRoutingConfig {
  simpleThreshold: number;   // < 0.3 = simple query
  complexThreshold: number;  // > 0.7 = complex query
  models: {
    simple: string;
    medium: string;
    complex: string;
  };
}

interface QueryComplexity {
  score: number;
  factors: {
    wordCount: number;
    questionType: string;
    contextRequired: boolean;
    multiStepReasoning: boolean;
  };
}

class IntelligentRouter {
  private openai: OpenAI;
  private config: ModelRoutingConfig;

  constructor() {
    this.openai = new OpenAI();
    this.config = {
      simpleThreshold: 0.3,
      complexThreshold: 0.7,
      models: {
        simple: 'gpt-4o-mini',   // $0.15 input, $0.60 output per 1M
        medium: 'gpt-4o',        // $2.50 input, $10.00 output per 1M
        complex: 'gpt-4-turbo'   // $10.00 input, $30.00 output per 1M
      }
    };
  }

  /**
   * Analyze query complexity using heuristics.
   * Production systems might use a lightweight classifier model.
   */
  analyzeComplexity(query: string): QueryComplexity {
    const words = query.split(/\s+/);
    const wordCount = words.length;

    // Detect question type
    const questionType = this.detectQuestionType(query);

    // Check for multi-step reasoning indicators
    const multiStepKeywords = ['compare', 'analyze', 'design', 'implement',
                               'evaluate', 'recommend', 'strategize'];
    const multiStepReasoning = multiStepKeywords.some(kw =>
      query.toLowerCase().includes(kw)
    );

    // Context required (references to previous conversation, documents)
    const contextRequired = query.toLowerCase().includes('previous') ||
                            query.toLowerCase().includes('earlier') ||
                            query.toLowerCase().includes('mentioned');

    // Calculate complexity score (0.0 - 1.0)
    let score = 0.0;

    // Word count factor (longer = potentially more complex)
    if (wordCount < 10) score += 0.1;
    else if (wordCount < 30) score += 0.3;
    else score += 0.5;

    // Question type factor
    if (questionType === 'factual') score += 0.1;
    else if (questionType === 'analytical') score += 0.5;
    else score += 0.3;

    // Multi-step reasoning adds significant complexity
    if (multiStepReasoning) score += 0.3;

    // Context requirement adds complexity
    if (contextRequired) score += 0.2;

    // Clamp to 0.0 - 1.0 range
    score = Math.min(1.0, score);

    return {
      score,
      factors: {
        wordCount,
        questionType,
        contextRequired,
        multiStepReasoning
      }
    };
  }

  private detectQuestionType(query: string): string {
    const lower = query.toLowerCase();

    // Factual questions
    if (lower.match(/^(what|when|where|who) is/)) return 'factual';

    // Analytical questions
    if (lower.match(/(how|why|explain|compare|analyze)/)) return 'analytical';

    // Procedural questions
    if (lower.match(/(how to|steps|process|implement)/)) return 'procedural';

    return 'general';
  }

  selectModel(complexity: QueryComplexity): string {
    if (complexity.score < this.config.simpleThreshold) {
      return this.config.models.simple;
    } else if (complexity.score < this.config.complexThreshold) {
      return this.config.models.medium;
    } else {
      return this.config.models.complex;
    }
  }

  async invoke(query: string, systemPrompt: string) {
    const complexity = this.analyzeComplexity(query);
    const model = this.selectModel(complexity);

    console.log(`Query complexity: ${complexity.score.toFixed(2)} -> ${model}`);

    const response = await this.openai.chat.completions.create({
      model,
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: query }
      ],
      temperature: 0.7
    });

    return {
      response: response.choices[0].message.content,
      model,
      complexity: complexity.score,
      usage: response.usage
    };
  }
}

// Usage example
const router = new IntelligentRouter();

// Simple query -> gpt-4o-mini
await router.invoke(
  "What's your return policy?",
  "You are a customer support agent"
);

// Complex query -> gpt-4-turbo
await router.invoke(
  "Compare our enterprise and business plans, analyze which would be better for a mid-sized company with 500 employees, considering scalability and cost over 3 years",
  "You are a customer support agent"
);

AWS Bedrock Intelligent Prompt Routing

For teams using AWS Bedrock, intelligent routing is available through the prompt router:

python
import boto3
bedrock = boto3.client('bedrock-runtime')
# Use a prompt router ARN (not a model "family" ARN)
# The prompt router analyzes query complexity and routes to an appropriate model
PROMPT_ROUTER_ARN = "arn:aws:bedrock:us-east-1::prompt-router/anthropic.claude"

response = bedrock.converse(
    modelId=PROMPT_ROUTER_ARN,  # Bedrock prompt router auto-routes to Haiku or Sonnet
    messages=[
        {
            "role": "user",
            "content": [{"text": "What's the weather in Seattle?"}]
        }
    ]
)

# Bedrock automatically routes:
# - Simple queries  -> Claude Haiku 4.5 ($1.00/1M input)
# - Complex queries -> Claude 3.5 Sonnet ($3/1M input)
# Average cost reduction: ~30% without quality loss
# Observed routing: 87% Haiku, 13% Sonnet on RAG datasets

Optimization Strategy 3: Token Budget Enforcement

Unbounded token consumption leads to cost storms. Implementing hard limits prevents runaway expenses while maintaining system functionality.

Budget Tracking Implementation

python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

import openai
import redis


class BudgetExceededError(Exception):
    """Raised when a request would exceed a token budget."""
    pass


@dataclass
class TokenBudget:
    max_input_tokens_per_request: int
    max_output_tokens_per_request: int
    max_tokens_per_user_daily: int
    max_tokens_per_team_monthly: int


@dataclass
class BudgetUsage:
    user_id: str
    team_id: str
    tokens_used_today: int
    tokens_used_this_month: int
    last_reset: datetime


class TokenBudgetEnforcer:
    def __init__(self, budget: TokenBudget):
        self.budget = budget
        self.redis_client = redis.Redis(host='localhost', decode_responses=True)

    def check_and_reserve(
        self,
        user_id: str,
        team_id: str,
        estimated_input_tokens: int,
        estimated_output_tokens: int
    ) -> tuple[bool, Optional[str]]:
        """
        Check if request is within budget and reserve tokens.
        Returns (allowed, error_message).
        """
        # Check per-request limits
        if estimated_input_tokens > self.budget.max_input_tokens_per_request:
            return False, f"Input tokens ({estimated_input_tokens}) exceed per-request limit ({self.budget.max_input_tokens_per_request})"

        if estimated_output_tokens > self.budget.max_output_tokens_per_request:
            return False, f"Output tokens ({estimated_output_tokens}) exceed per-request limit ({self.budget.max_output_tokens_per_request})"

        # Check daily user limit
        user_daily_key = f"budget:user:{user_id}:daily"
        user_tokens_today = int(self.redis_client.get(user_daily_key) or 0)

        total_estimated = estimated_input_tokens + estimated_output_tokens

        if user_tokens_today + total_estimated > self.budget.max_tokens_per_user_daily:
            return False, f"User daily limit exceeded ({user_tokens_today}/{self.budget.max_tokens_per_user_daily})"

        # Check monthly team limit
        team_monthly_key = f"budget:team:{team_id}:monthly"
        team_tokens_this_month = int(self.redis_client.get(team_monthly_key) or 0)

        if team_tokens_this_month + total_estimated > self.budget.max_tokens_per_team_monthly:
            return False, f"Team monthly limit exceeded ({team_tokens_this_month}/{self.budget.max_tokens_per_team_monthly})"

        # Reserve tokens (optimistic locking)
        pipe = self.redis_client.pipeline()

        # Increment user daily counter (expires at midnight)
        tomorrow = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0) + timedelta(days=1)
        seconds_until_midnight = int((tomorrow - datetime.now()).total_seconds())
        pipe.incrby(user_daily_key, total_estimated)
        pipe.expire(user_daily_key, seconds_until_midnight)

        # Increment team monthly counter (expires at month end)
        next_month = (datetime.now().replace(day=1) + timedelta(days=32)).replace(day=1)
        seconds_until_month_end = int((next_month - datetime.now()).total_seconds())
        pipe.incrby(team_monthly_key, total_estimated)
        pipe.expire(team_monthly_key, seconds_until_month_end)

        pipe.execute()

        return True, None

    def record_actual_usage(
        self,
        user_id: str,
        team_id: str,
        actual_input_tokens: int,
        actual_output_tokens: int,
        estimated_input_tokens: int,
        estimated_output_tokens: int
    ):
        """Adjust budget counters based on actual vs estimated usage."""
        actual_total = actual_input_tokens + actual_output_tokens
        estimated_total = estimated_input_tokens + estimated_output_tokens
        difference = actual_total - estimated_total

        if difference != 0:
            pipe = self.redis_client.pipeline()
            pipe.incrby(f"budget:user:{user_id}:daily", difference)
            pipe.incrby(f"budget:team:{team_id}:monthly", difference)
            pipe.execute()


# Usage in an LLM application
budget_enforcer = TokenBudgetEnforcer(
    budget=TokenBudget(
        max_input_tokens_per_request=8000,       # Prevent huge contexts
        max_output_tokens_per_request=2000,      # Limit response length
        max_tokens_per_user_daily=100_000,       # ~$1/day per user at GPT-4 Turbo input rates
        max_tokens_per_team_monthly=10_000_000   # ~$100/month per team
    )
)


def invoke_llm_with_budget(user_id: str, team_id: str, prompt: str):
    # Estimate tokens (rough approximation)
    estimated_input = len(prompt.split()) * 1.3  # Account for tokenization
    estimated_output = 500  # Conservative estimate

    # Check budget
    allowed, error = budget_enforcer.check_and_reserve(
        user_id, team_id, int(estimated_input), estimated_output
    )

    if not allowed:
        raise BudgetExceededError(error)

    # Invoke LLM
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=budget_enforcer.budget.max_output_tokens_per_request
    )

    # Record actual usage
    budget_enforcer.record_actual_usage(
        user_id,
        team_id,
        response.usage.prompt_tokens,
        response.usage.completion_tokens,
        int(estimated_input),
        estimated_output
    )

    return response.choices[0].message.content
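The word-count heuristic above deliberately overestimates a little. If you want tighter reservations for OpenAI models, a tokenizer-based estimate with the tiktoken library takes only a few more lines; this is a sketch, not part of the enforcer above:

python
import tiktoken

def estimate_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens the same way the model's tokenizer does."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # reasonable fallback
    return len(encoding.encode(text))

# Drop-in replacement for the rough estimate in invoke_llm_with_budget():
# estimated_input = estimate_tokens(prompt)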

Alert Configuration

python
def check_budget_alerts(user_id: str, team_id: str, redis_client, budget):
    """
    Trigger alerts at 70%, 90%, and 100% of the daily budget.
    Assumes an application-specific send_alert() notification helper.
    """
    user_daily_key = f"budget:user:{user_id}:daily"
    user_tokens_today = int(redis_client.get(user_daily_key) or 0)

    daily_limit = budget.max_tokens_per_user_daily
    usage_percentage = (user_tokens_today / daily_limit) * 100

    if usage_percentage >= 100:
        send_alert(
            level="CRITICAL",
            message=f"User {user_id} exceeded daily budget",
            action="BLOCK"
        )
    elif usage_percentage >= 90:
        send_alert(
            level="WARNING",
            message=f"User {user_id} at 90% of daily budget",
            action="NOTIFY"
        )
    elif usage_percentage >= 70:
        send_alert(
            level="INFO",
            message=f"User {user_id} at 70% of daily budget",
            action="MONITOR"
        )

Common Budget Pitfalls

Tool-Call Storms: Agents invoke 50+ tools without limits, consuming millions of tokens

python
# Solution: Set max_tool_calls_per_turn
agent = Agent(
    tools=[get_product_details, get_reviews, get_pricing],
    max_tool_calls_per_turn=5,  # Hard limit
    instructions="Use batch queries when possible."
)
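If your agent framework doesn't expose a per-turn cap, the same guard is easy to enforce in your own tool-execution loop. A minimal sketch, where run_tool is a hypothetical application-specific dispatcher:

python
MAX_TOOL_CALLS_PER_TURN = 5

def execute_tool_calls(tool_calls: list) -> list:
    """Run at most MAX_TOOL_CALLS_PER_TURN tools, then force the model to answer."""
    results = []
    for i, call in enumerate(tool_calls):
        if i >= MAX_TOOL_CALLS_PER_TURN:
            results.append({
                "tool": call["name"],
                "error": "Tool budget for this turn exhausted; answer with what you have."
            })
            continue
        results.append(run_tool(call))  # run_tool: hypothetical dispatcher
    return results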

RAG Over-Retrieval: Retrieving 50 chunks when 5 would suffice

python
# BAD: Too many chunks
retriever = VectorStoreRetriever(
    vector_store=vector_db,
    search_kwargs={"k": 50}  # ~25,000 tokens of context
)

# GOOD: Focused retrieval
retriever = VectorStoreRetriever(
    vector_store=vector_db,
    search_kwargs={"k": 5}  # ~2,500 tokens (90% reduction)
)

Optimization Strategy 4: Semantic Caching

Traditional caching only matches exact queries. Semantic caching uses vector similarity to cache responses for semantically similar questions, dramatically increasing cache hit rates.

Implementation with Vector Similarity

python
import hashlib
import json
from datetime import datetime
from typing import Optional

import numpy as np
import openai
import redis
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(
        self,
        redis_client: redis.Redis,
        similarity_threshold: float = 0.95,
        ttl_seconds: int = 3600
    ):
        self.redis = redis_client
        self.similarity_threshold = similarity_threshold
        self.ttl_seconds = ttl_seconds

        # Lightweight embedding model for semantic matching
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

    def _get_embedding(self, text: str) -> np.ndarray:
        """Generate embedding vector for a query"""
        return self.embedding_model.encode(text, normalize_embeddings=True)

    def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
        """Calculate cosine similarity between two vectors"""
        return np.dot(vec1, vec2)  # Vectors are already normalized

    def get(self, query: str, system_prompt: str = "") -> Optional[dict]:
        """
        Retrieve a cached response if a semantically similar query exists.
        Note: KEYS plus a linear scan is fine for small caches; larger caches
        should use a proper vector index instead.
        """
        cache_key_prefix = f"semantic_cache:{hashlib.md5(system_prompt.encode()).hexdigest()}"

        # Get all cached queries for this system prompt
        cached_keys = self.redis.keys(f"{cache_key_prefix}:*")

        if not cached_keys:
            return None

        query_embedding = self._get_embedding(query)

        best_match = None
        best_similarity = 0.0

        # Find the most similar cached query
        for key in cached_keys:
            cached_data = self.redis.get(key)
            if not cached_data:
                continue

            cached = json.loads(cached_data)
            cached_embedding = np.array(cached['embedding'])

            similarity = self._cosine_similarity(query_embedding, cached_embedding)

            if similarity > best_similarity:
                best_similarity = similarity
                best_match = cached

        # Return cached response if similarity exceeds threshold
        if best_similarity >= self.similarity_threshold:
            return {
                'response': best_match['response'],
                'similarity': best_similarity,
                'cached': True,
                'original_query': best_match['query']
            }

        return None

    def set(self, query: str, response: str, system_prompt: str = ""):
        """Cache a query-response pair with its semantic embedding"""
        cache_key_prefix = f"semantic_cache:{hashlib.md5(system_prompt.encode()).hexdigest()}"
        query_hash = hashlib.md5(query.encode()).hexdigest()
        cache_key = f"{cache_key_prefix}:{query_hash}"

        embedding = self._get_embedding(query)

        cache_data = {
            'query': query,
            'response': response,
            'embedding': embedding.tolist(),
            'timestamp': datetime.utcnow().isoformat()
        }

        self.redis.setex(
            cache_key,
            self.ttl_seconds,
            json.dumps(cache_data)
        )


# Usage in production
semantic_cache = SemanticCache(
    redis_client=redis.Redis(host='localhost', decode_responses=False),
    similarity_threshold=0.95,  # 95% similarity required
    ttl_seconds=3600            # Cache for 1 hour
)


def invoke_with_semantic_cache(query: str, system_prompt: str):
    # Check semantic cache first
    cached = semantic_cache.get(query, system_prompt)

    if cached:
        print(f"Cache hit! Similarity: {cached['similarity']:.2%}")
        print(f"Original query: {cached['original_query']}")
        return cached['response']

    # Cache miss - invoke LLM
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ]
    )

    result = response.choices[0].message.content

    # Cache for future semantically similar queries
    semantic_cache.set(query, result, system_prompt)

    return result


# Example: Semantically similar queries
# Query 1: "What's your refund policy?"
# Query 2: "How do I get my money back?"
# Query 3: "Can I return items for a refund?"
# All three would match with > 95% similarity and return the cached response

Performance Impact

python
# Exact-match caching: 10-20% hit rate (only identical queries)
# Semantic caching: 40-60% hit rate (semantically similar queries)
# Cost reduction: 40-60% for customer support / FAQ use cases

# Trade-offs:
# - Embedding computation: ~1ms for MiniLM (minimal overhead)
# - Redis memory: one 384-dimensional embedding per cached query (a few KB as JSON)
# - Similarity tuning: too low = wrong answers, too high = fewer hits

Cost Monitoring and Observability

Without real-time visibility into token consumption, cost problems remain hidden until the bill arrives.

CloudWatch Metrics Implementation

python
import boto3
from datetime import datetime
from dataclasses import dataclass


@dataclass
class CostMetrics:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    total_cost: float
    user_id: str
    team_id: str
    request_type: str  # 'simple', 'medium', 'complex'


class LLMCostTracker:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')

        # Provider pricing (updated 2025)
        self.pricing = {
            'gpt-4-turbo': {
                'input': 10.00 / 1_000_000,
                'output': 30.00 / 1_000_000
            },
            'gpt-4o': {
                'input': 2.50 / 1_000_000,
                'output': 10.00 / 1_000_000
            },
            'gpt-4o-mini': {
                'input': 0.15 / 1_000_000,
                'output': 0.60 / 1_000_000
            },
            'claude-sonnet-3.5': {
                'input': 3.00 / 1_000_000,
                'output': 15.00 / 1_000_000,
                'cached_input': 0.30 / 1_000_000  # 90% discount
            }
        }

    def calculate_cost(self, metrics: CostMetrics) -> float:
        """Calculate cost based on token usage and model pricing"""
        pricing = self.pricing.get(metrics.model)
        if not pricing:
            raise ValueError(f"Unknown model: {metrics.model}")

        input_cost = metrics.input_tokens * pricing['input']
        output_cost = metrics.output_tokens * pricing['output']

        # Apply caching discount if applicable
        if metrics.cached_tokens > 0 and 'cached_input' in pricing:
            cached_cost = metrics.cached_tokens * pricing['cached_input']
            # Cached tokens are already counted in input_tokens, so adjust
            uncached_tokens = metrics.input_tokens - metrics.cached_tokens
            input_cost = (uncached_tokens * pricing['input']) + cached_cost

        return input_cost + output_cost

    def publish_metrics(self, metrics: CostMetrics):
        """Publish metrics to CloudWatch for dashboard visualization"""
        cost = self.calculate_cost(metrics)

        metric_data = [
            {
                'MetricName': 'TokenUsage',
                'Dimensions': [
                    {'Name': 'Model', 'Value': metrics.model},
                    {'Name': 'TokenType', 'Value': 'Input'}
                ],
                'Value': metrics.input_tokens,
                'Unit': 'Count',
                'Timestamp': metrics.timestamp
            },
            {
                'MetricName': 'TokenUsage',
                'Dimensions': [
                    {'Name': 'Model', 'Value': metrics.model},
                    {'Name': 'TokenType', 'Value': 'Output'}
                ],
                'Value': metrics.output_tokens,
                'Unit': 'Count',
                'Timestamp': metrics.timestamp
            },
            {
                'MetricName': 'LLMCost',
                'Dimensions': [
                    {'Name': 'Model', 'Value': metrics.model},
                    {'Name': 'Team', 'Value': metrics.team_id},
                    {'Name': 'RequestType', 'Value': metrics.request_type}
                ],
                'Value': cost,
                'Unit': 'None',  # Dollars
                'Timestamp': metrics.timestamp
            }
        ]

        # Add cache hit rate metric if caching is used
        if metrics.cached_tokens > 0:
            cache_hit_rate = (metrics.cached_tokens / metrics.input_tokens) * 100
            metric_data.append({
                'MetricName': 'CacheHitRate',
                'Dimensions': [{'Name': 'Model', 'Value': metrics.model}],
                'Value': cache_hit_rate,
                'Unit': 'Percent',
                'Timestamp': metrics.timestamp
            })

        self.cloudwatch.put_metric_data(
            Namespace='LLM/Costs',
            MetricData=metric_data
        )

    def create_cost_anomaly_alarm(self, threshold_dollars: float):
        """Create a CloudWatch alarm for cost anomalies"""
        self.cloudwatch.put_metric_alarm(
            AlarmName='LLM-Daily-Cost-Anomaly',
            ComparisonOperator='GreaterThanThreshold',
            EvaluationPeriods=1,
            MetricName='LLMCost',
            Namespace='LLM/Costs',
            Period=86400,  # 24 hours
            Statistic='Sum',
            Threshold=threshold_dollars,
            ActionsEnabled=True,
            AlarmActions=[
                'arn:aws:sns:us-east-1:123456789:llm-cost-alerts'
            ],
            AlarmDescription=f'Alert when daily LLM costs exceed ${threshold_dollars}'
        )

Key Metrics Dashboard

CloudWatch Insights Queries:

sql
-- Cost per model per day
fields @timestamp, model, sum(cost) as daily_cost
| filter namespace = "LLM/Costs"
| stats sum(daily_cost) by model, bin(@timestamp, 1d)

-- Top 10 most expensive users
fields user_id, sum(cost) as user_cost
| filter namespace = "LLM/Costs"
| stats sum(user_cost) by user_id
| sort user_cost desc
| limit 10

-- Cache effectiveness (cost savings)
fields @timestamp,
       sum(cached_tokens) / sum(input_tokens) * 100 as cache_hit_rate,
       sum(cached_tokens) * (standard_price - cached_price) as savings
| filter namespace = "LLM/Costs" and model = "claude-sonnet-3.5"
| stats avg(cache_hit_rate), sum(savings) by bin(@timestamp, 1h)

Common Pitfalls and Lessons Learned

Ignoring Output Token Costs

Output tokens cost 2-5x more than input tokens, yet optimization often focuses only on input.

python
# Example: RAG application
# Input: 8,000 tokens (2,000 user query + 6,000 retrieved context)
# Output: 2,000 tokens (detailed answer)

# GPT-4 cost:
# - Input: 8,000 × $10/1M = $0.08
# - Output: 2,000 × $30/1M = $0.06
# Output is 43% of total cost despite being 20% of tokens

# Solution: Aggressive max_tokens limits + conciseness prompts
system_prompt = """Answer concisely in under 150 words.
Prioritize clarity over exhaustive detail."""

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[...],
    max_tokens=200  # Hard limit
)

Cache Invalidation from Minor Changes

Small prompt variations invalidate entire cache, destroying effectiveness.

python
# BAD: Dynamic timestamp invalidates the cache every request
system_prompt = f"""You are a support agent.
Current time: {datetime.now().isoformat()}  # Different every time!
[... rest of prompt ...]"""

# GOOD: Static content only; dynamic context goes in the user message
system_prompt = """You are a support agent.
[... static policies ...]"""

user_message = f"""Current time: {datetime.now().isoformat()}
User question: {question}"""

No Monitoring Until Bill Shock

Deploying to production without observability means discovering problems after damage is done.

Example timeline from a project I worked on:

Week 1: POC with 100 requests/day = $50/month
Week 2: Beta with 1,000 requests/day = $500/month
Week 3: Production with 10,000 requests/day = $5,000/month
Week 4: Feature went viral, 50,000 requests/day = $25,000/month

Solution: Instrument from day one, publish metrics to CloudWatch immediately, set budget alerts before launching.

Optimization Impact Matrix

| Optimization     | Cost Savings | Quality Impact        | Implementation Effort    |
|------------------|--------------|-----------------------|--------------------------|
| Prompt caching   | 50-90%       | None                  | Low (provider feature)   |
| Model routing    | 30-50%       | Low (5-10% accuracy)  | Medium (routing logic)   |
| Output limits    | 20-40%       | Low (conciseness)     | Low (parameter setting)  |
| Semantic caching | 40-60%       | Medium (staleness)    | Medium (vector DB setup) |
| Token budgets    | 10-30%       | None (prevents waste) | Medium (budget system)   |
| Batch inference  | 50%          | None (async only)     | Low (provider feature)   |

Key Takeaways

Token-Based Billing Requires New Mindset: Traditional cloud costs are predictable and linear. LLM costs vary 100x based on usage patterns. Optimization is mandatory.

Output Tokens Cost More: Focus on concise responses with max_tokens limits and prompt engineering for brevity.

Prompt Caching is Low-Hanging Fruit: 90% discount on cached tokens (Anthropic), 50-70% cost reduction for typical applications, and only minimal code changes (marking cache points in your prompts).

Model Routing Balances Cost and Quality: 70% of queries can use cheaper models. Intelligent routing saves 30-50%. AWS Bedrock provides zero-config routing.

Observability Prevents Surprises: Instrument all LLM calls from day one. Set budget alerts at 70%, 90%, 100%. Review metrics weekly.

Optimization Compounds: Combining multiple techniques can achieve 60-80% total cost reduction while maintaining quality.
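As a rough illustration of how independent savings multiply (the percentages below are illustrative assumptions, not measurements):

python
# Illustrative only: each factor is the fraction of spend remaining after a technique.
after_caching = 1 - 0.50   # prompt caching trims ~50% of spend
after_routing = 1 - 0.35   # routing trims ~35% of what's left
after_limits  = 1 - 0.20   # output limits trim ~20% of what's left

remaining = after_caching * after_routing * after_limits
print(f"Total reduction: {1 - remaining:.0%}")  # ~74%, within the 60-80% range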

LLM cost management is fundamentally different from traditional cloud FinOps, but systematic application of these patterns makes costs predictable and controllable.
