AI Performance Optimization Guide
Learn how to optimize model selection, prompt engineering, and infrastructure for speed, quality, and cost.
Performance Dimensions
When optimizing AI performance, consider three key dimensions:
- Speed - How fast you get results
- Quality - How good the outputs are
- Cost - How much you spend per output
These often trade off against each other, so optimization is about finding the right balance for your use case.
Model Selection Optimization
Matching Models to Tasks
Choose the right model for each task:
| Task Type | Recommended | Why |
|---|---|---|
| Simple generation | GPT-4o Mini, Gemini Flash | Fast, cheap, sufficient quality |
| Complex reasoning | Claude Sonnet, GPT-4o | Better understanding, nuanced output |
| Code generation | Claude Sonnet, DeepSeek | Code-optimized training |
| Creative writing | Claude, GPT-4o | Richer, more creative output |
| Classification | Any fast model | Simple task, minimize cost |
| Summarization | Gemini Flash | Good at compression |
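One way to apply this table in code is a simple task-to-model lookup. This is a minimal sketch, assuming the generate(prompt, model) helper used throughout this guide; the model IDs (including 'gemini-flash') are illustrative placeholders rather than exact API identifiers.
const MODEL_BY_TASK = {
  simple_generation: 'gpt-4o-mini',
  complex_reasoning: 'claude-sonnet-4-20250514',
  code_generation: 'claude-sonnet-4-20250514',
  creative_writing: 'claude-sonnet-4-20250514',
  classification: 'gpt-4o-mini',
  summarization: 'gemini-flash' // placeholder ID; use your provider's actual model name
};
async function routedGenerate(taskType, prompt) {
  // Unknown task types fall back to a cheap default
  const model = MODEL_BY_TASK[taskType] || 'gpt-4o-mini';
  return generate(prompt, model);
}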
Model Cascading
Start with cheaper models, escalate if needed:
async function cascadeGeneration(prompt, qualityThreshold = 0.7) {
// Start with fast, cheap model
const fastResult = await generate(prompt, 'gpt-4o-mini');
// Check quality (your own scoring function)
const quality = await scoreQuality(fastResult);
if (quality >= qualityThreshold) {
return { result: fastResult, model: 'gpt-4o-mini', cost: 'low' };
}
// Escalate to premium model
const premiumResult = await generate(prompt, 'claude-sonnet-4-20250514');
return { result: premiumResult, model: 'claude-sonnet', cost: 'high' };
}
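The cascade depends on the scoreQuality function you supply. Below is a minimal heuristic sketch; the length and phrasing checks are assumptions that illustrate the shape of the function, and in practice you might use an LLM-as-judge call or task-specific validation instead.
// Minimal heuristic quality scorer returning a value between 0 and 1.
// The thresholds below are illustrative; tune them for your task.
async function scoreQuality(text) {
  if (!text || text.trim().length === 0) return 0;
  let score = 1.0;
  if (text.trim().length < 50) score -= 0.5;                   // suspiciously short
  if (/as an ai (language )?model/i.test(text)) score -= 0.3;  // refusal boilerplate
  if (!/[.!?]$/.test(text.trim())) score -= 0.2;               // likely cut off mid-sentence
  return Math.max(score, 0);
}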
A/B Testing Models
Test different models to find the best fit:
async function abTestModels(prompt, models, iterations = 100) {
const results = {};
for (const model of models) {
const samples = [];
for (let i = 0; i < iterations; i++) {
const start = Date.now();
const result = await generate(prompt, model);
const latency = Date.now() - start;
const quality = await scoreQuality(result);
samples.push({ latency, quality, result });
}
results[model] = {
avgLatency: average(samples.map(s => s.latency)),
avgQuality: average(samples.map(s => s.quality)),
samples
};
}
return results;
}
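The abTestModels function above relies on a small average helper. Here is one way to define it, along with a hypothetical pickModel example that ranks models by quality per millisecond of latency; that weighting policy is an assumption, and you may prefer to factor in cost instead.
function average(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
// Example: choose the model with the best quality-to-latency ratio
async function pickModel(prompt) {
  const results = await abTestModels(prompt, ['gpt-4o-mini', 'claude-sonnet-4-20250514'], 20);
  const ranked = Object.entries(results)
    .sort((a, b) => (b[1].avgQuality / b[1].avgLatency) - (a[1].avgQuality / a[1].avgLatency));
  return ranked[0][0]; // model name of the best performer
}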
Prompt Optimization
Efficient Prompting
Write prompts that minimize tokens while maximizing clarity:
Before (verbose):
I would like you to please help me write a product description for a new product we are launching. The product is a wireless headphone. It has noise cancellation. It has 30 hours of battery life. Please make it compelling and include all the features I mentioned.
After (optimized):
Write a compelling product description:
Product: Wireless headphones
Features: Noise cancellation, 30h battery
Tone: Engaging, benefit-focused
Length: 100 words
Structured Outputs
Request structured output to reduce post-processing:
Generate product copy in JSON format:
{
"headline": "...",
"description": "...",
"bullets": ["...", "...", "..."],
"cta": "..."
}
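Structured output is only useful if you can rely on its shape. Below is a minimal parsing and validation sketch for the JSON above; the fence-stripping step and the error behavior are assumptions about how you want to handle malformed responses.
function parseProductCopy(raw) {
  // Some models wrap JSON in a markdown code fence; strip it before parsing
  const cleaned = raw.replace(/^```(json)?\s*|\s*```$/g, '').trim();
  const parsed = JSON.parse(cleaned);
  // Make sure every field requested in the prompt is present
  const required = ['headline', 'description', 'bullets', 'cta'];
  const missing = required.filter(key => !(key in parsed));
  if (missing.length > 0) {
    throw new Error(`Structured output missing fields: ${missing.join(', ')}`);
  }
  return parsed;
}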
Few-Shot vs Zero-Shot
Use few-shot examples only when necessary:
Zero-shot (cheaper, faster):
Classify sentiment: "Great product, loved it!"
Output: positive/negative/neutral
Few-shot (better accuracy for complex tasks):
Classify sentiment:
"Terrible experience" -> negative
"It was okay" -> neutral
"Absolutely amazing!" -> positive
"Great product, loved it!" ->
Caching Strategies
Response Caching
Cache responses to identical requests; semantically similar requests are covered in the next section:
const { createHash } = require('crypto');
const cache = new Map();
async function cachedGenerate(prompt, model, ttl = 3600000) {
const cacheKey = createHash('md5').update(`${model}:${prompt}`).digest('hex');
// Check cache
const cached = cache.get(cacheKey);
if (cached && Date.now() - cached.timestamp < ttl) {
return { ...cached.result, fromCache: true };
}
// Generate fresh
const result = await generate(prompt, model);
// Store in cache
cache.set(cacheKey, {
result,
timestamp: Date.now()
});
return { ...result, fromCache: false };
}
Semantic Caching
Cache based on semantic similarity:
async function semanticCache(prompt, threshold = 0.95) {
const embedding = await getEmbedding(prompt);
// Find similar cached prompts
const similar = await findSimilarCached(embedding, threshold);
if (similar) {
return { result: similar.result, similarity: similar.score };
}
// Generate and cache
const result = await generate(prompt);
await storeWithEmbedding(prompt, embedding, result);
return { result, fromCache: false };
}
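The getEmbedding, findSimilarCached, and storeWithEmbedding helpers are left to you. Below is a minimal in-memory sketch of the last two using cosine similarity; a production setup would typically use a vector database, and getEmbedding itself depends on your embedding provider.
const semanticStore = [];
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
async function findSimilarCached(embedding, threshold) {
  // Return the best cached entry above the similarity threshold, if any
  let best = null;
  for (const entry of semanticStore) {
    const score = cosineSimilarity(embedding, entry.embedding);
    if (score >= threshold && (!best || score > best.score)) {
      best = { result: entry.result, score };
    }
  }
  return best;
}
async function storeWithEmbedding(prompt, embedding, result) {
  semanticStore.push({ prompt, embedding, result });
}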
Batching and Parallelization
Batch Processing
Group similar requests for efficiency:
async function batchGenerate(prompts, model, batchSize = 10) {
const results = [];
for (let i = 0; i < prompts.length; i += batchSize) {
const batch = prompts.slice(i, i + batchSize);
// Process batch in parallel
const batchResults = await Promise.all(
batch.map(prompt => generate(prompt, model))
);
results.push(...batchResults);
}
return results;
}
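One caveat with Promise.all: a single failed request rejects the whole batch. If you would rather keep the successful results and collect the failures for retry, a variant with Promise.allSettled looks like this (the return shape is an assumption):
async function batchGenerateSettled(prompts, model, batchSize = 10) {
  const succeeded = [];
  const failed = [];
  for (let i = 0; i < prompts.length; i += batchSize) {
    const batch = prompts.slice(i, i + batchSize);
    const outcomes = await Promise.allSettled(
      batch.map(prompt => generate(prompt, model))
    );
    outcomes.forEach((outcome, j) => {
      if (outcome.status === 'fulfilled') {
        succeeded.push(outcome.value);
      } else {
        failed.push({ prompt: batch[j], error: outcome.reason });
      }
    });
  }
  return { succeeded, failed };
}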
Smart Parallelization
Parallelize independent operations:
async function generateProductContent(product) {
// Run independent generations in parallel
const [description, seoMeta, socialPosts] = await Promise.all([
generate(descriptionPrompt(product), 'claude-sonnet-4-20250514'),
generate(seoPrompt(product), 'gpt-4o-mini'),
generate(socialPrompt(product), 'gpt-4o-mini')
]);
return { description, seoMeta, socialPosts };
}
Token Optimization
Minimizing Input Tokens
- Remove redundancy: Eliminate repeated information
- Use shorthand: "desc" vs "description"
- Compress context: Summarize long documents before including them (see the sketch after this list)
- Use references: "Generate 3 more like above" instead of repeating
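A sketch of the context-compression idea, assuming a cheap model handles the summarization and a rough character threshold stands in for proper token counting; longDocument is a placeholder for your own content.
async function compressContext(document, maxChars = 4000) {
  // Short documents can be passed through unchanged
  if (document.length <= maxChars) return document;
  // Summarize with a fast, cheap model before including it in the main prompt
  return generate(
    `Summarize the following in under 200 words, keeping key facts:\n\n${document}`,
    'gpt-4o-mini'
  );
}
// Usage: include the compressed version in the expensive prompt
const context = await compressContext(longDocument); // longDocument is a placeholder for your source text
const answer = await generate(`Using this context:\n${context}\n\nWrite a product FAQ.`, 'claude-sonnet-4-20250514');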
Controlling Output Tokens
Generate a product description.
Constraints:
- Max 150 words
- 3-4 sentences
- No emojis
Token Counting
Monitor token usage:
const { encode } = require('gpt-tokenizer');
function countTokens(text) {
return encode(text).length;
}
async function monitoredGenerate(prompt, model, maxTokens) {
const inputTokens = countTokens(prompt);
if (inputTokens > maxTokens * 0.5) {
console.warn('Large input prompt', { inputTokens });
}
const result = await generate(prompt, model, { maxTokens });
const outputTokens = countTokens(result);
logUsage({ model, inputTokens, outputTokens });
return result;
}
Latency Optimization
Streaming Responses
Use streaming for better perceived performance:
async function* streamGenerate(prompt, model) {
const response = await fetch('/api/generate', {
method: 'POST',
body: JSON.stringify({ prompt, model, stream: true })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
yield decoder.decode(value);
}
}
// Usage
for await (const chunk of streamGenerate(prompt, model)) {
process.stdout.write(chunk);
}
Edge Deployment
Reduce latency with edge functions:
// Vercel Edge Function
export const config = { runtime: 'edge' };
export default async function handler(req) {
const { prompt } = await req.json();
// Call Promptha from edge location
const result = await generate(prompt);
return new Response(JSON.stringify(result), {
headers: { 'Content-Type': 'application/json' }
});
}
Cost Optimization
Usage Monitoring
Track and analyze usage:
class UsageTracker {
constructor() {
this.usage = [];
}
track(model, inputTokens, outputTokens, cost) {
this.usage.push({
timestamp: Date.now(),
model,
inputTokens,
outputTokens,
cost
});
}
getStats(period = 'day') {
const cutoff = this.getCutoff(period);
const relevant = this.usage.filter(u => u.timestamp > cutoff);
return {
totalCost: relevant.reduce((sum, u) => sum + u.cost, 0),
totalTokens: relevant.reduce((sum, u) => sum + u.inputTokens + u.outputTokens, 0),
byModel: this.groupBy(relevant, 'model')
};
}
  getCutoff(period) {
    // Convert a named period into a cutoff timestamp; defaults to one day
    const ms = { hour: 3600000, day: 86400000, week: 604800000 };
    return Date.now() - (ms[period] || ms.day);
  }
  groupBy(entries, key) {
    // Group usage entries by a field, e.g. model name
    return entries.reduce((groups, entry) => {
      (groups[entry[key]] = groups[entry[key]] || []).push(entry);
      return groups;
    }, {});
  }
}
Budget Alerts
Set up cost alerts:
async function checkBudget(currentSpend, budget, alertThreshold = 0.8) {
const usage = currentSpend / budget;
if (usage >= alertThreshold) {
await sendAlert({
type: 'budget_warning',
message: `AI spend at ${Math.round(usage * 100)}% of budget`,
currentSpend,
budget
});
}
if (usage >= 1) {
// Implement fallback or throttling
return { throttle: true, fallbackModel: 'gpt-4o-mini' };
}
return { throttle: false };
}
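To put the budget check to work, you can wrap generation so that exceeding the budget downgrades the model rather than failing outright. A minimal sketch, assuming currentSpend and budget come from your usage tracker:
async function guardedGenerate(prompt, preferredModel, currentSpend, budget) {
  const { throttle, fallbackModel } = await checkBudget(currentSpend, budget);
  // Over budget: fall back to the cheaper model instead of refusing the request
  const model = throttle ? fallbackModel : preferredModel;
  return generate(prompt, model);
}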
Measuring Performance
Key Metrics
Track these metrics:
| Metric | How to Measure | Target |
|---|---|---|
| Latency P50/P99 | Request timing | <2s / <5s |
| Throughput | Requests per minute | Based on load |
| Error rate | Failed requests | <1% |
| Cost per request | Total cost / requests | Varies |
| Quality score | User ratings or automated | >4/5 |
Performance Dashboard
async function getPerformanceMetrics(timeRange) {
const requests = await getRequests(timeRange);
return {
latency: {
p50: percentile(requests.map(r => r.latency), 50),
p99: percentile(requests.map(r => r.latency), 99)
},
throughput: requests.length / timeRange.hours,
errorRate: requests.filter(r => r.error).length / requests.length,
costPerRequest: totalCost(requests) / requests.length,
byModel: groupByModel(requests)
};
}
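The percentile and totalCost helpers used above can be simple utilities; getRequests and groupByModel remain whatever your storage layer provides. A sketch of the two math helpers, using the nearest-rank percentile method:
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank method: the value at the p-th percentile position
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(Math.max(index, 0), sorted.length - 1)];
}
function totalCost(requests) {
  return requests.reduce((sum, r) => sum + r.cost, 0);
}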
Best Practices Summary
- Match model to task - Don't overpay for simple tasks
- Optimize prompts - Shorter prompts = faster, cheaper
- Cache aggressively - Same inputs = same outputs
- Batch when possible - Reduce overhead
- Stream for UX - Better perceived performance
- Monitor continuously - Track metrics and costs
- Set budgets - Prevent runaway costs
- Test and iterate - A/B test optimizations
Start optimizing: Performance Tools