Intermediate · 12 min read

AI Performance Optimization Guide

Optimize AI model selection, prompt engineering, and infrastructure for better performance.

Learn how to optimize your AI workflows for speed, cost, and quality.

Performance Dimensions

When optimizing AI performance, consider three key dimensions:

  1. Speed - How fast you get results
  2. Quality - How good the outputs are
  3. Cost - How much you spend per output

These often trade off against each other, so optimization is about finding the right balance for your use case.

Model Selection Optimization

Matching Models to Tasks

Choose the right model for each task:

Task Type          | Recommended Models         | Why
Simple generation  | GPT-4o Mini, Gemini Flash  | Fast, cheap, sufficient quality
Complex reasoning  | Claude Sonnet, GPT-4o      | Better understanding, nuanced output
Code generation    | Claude Sonnet, DeepSeek    | Code-optimized training
Creative writing   | Claude, GPT-4o             | Richer, more creative output
Classification     | Any fast model             | Simple task, minimize cost
Summarization      | Gemini Flash               | Good at compression
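
As a rough sketch, the table above can be encoded as a simple routing helper. The model identifiers below are placeholders; use whatever names your generate helper expects.

// Sketch: route a request to a model based on task type.
// Model IDs are placeholders; adjust them to your provider's identifiers.
const MODEL_BY_TASK = {
  'simple-generation': 'gpt-4o-mini',
  'complex-reasoning': 'claude-sonnet-4-20250514',
  'code-generation': 'claude-sonnet-4-20250514',
  'creative-writing': 'claude-sonnet-4-20250514',
  'classification': 'gpt-4o-mini',
  'summarization': 'gemini-flash'
};

function selectModel(taskType) {
  // Fall back to a fast, cheap default for unknown task types
  return MODEL_BY_TASK[taskType] || 'gpt-4o-mini';
}

// Usage: const result = await generate(prompt, selectModel('summarization'));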

Model Cascading

Start with cheaper models, escalate if needed:

async function cascadeGeneration(prompt, qualityThreshold = 0.7) {
  // Start with fast, cheap model
  const fastResult = await generate(prompt, 'gpt-4o-mini');

  // Check quality (your own scoring function)
  const quality = await scoreQuality(fastResult);

  if (quality >= qualityThreshold) {
    return { result: fastResult, model: 'gpt-4o-mini', cost: 'low' };
  }

  // Escalate to premium model
  const premiumResult = await generate(prompt, 'claude-sonnet-4-20250514');
  return { result: premiumResult, model: 'claude-sonnet', cost: 'high' };
}

A/B Testing Models

Test different models to find the best fit:

// Simple mean helper used below
const average = values => values.reduce((sum, v) => sum + v, 0) / values.length;

async function abTestModels(prompt, models, iterations = 100) {
  const results = {};

  for (const model of models) {
    const samples = [];

    for (let i = 0; i < iterations; i++) {
      const start = Date.now();
      const result = await generate(prompt, model);
      const latency = Date.now() - start;
      const quality = await scoreQuality(result);

      samples.push({ latency, quality, result });
    }

    results[model] = {
      avgLatency: average(samples.map(s => s.latency)),
      avgQuality: average(samples.map(s => s.quality)),
      samples
    };
  }

  return results;
}

Prompt Optimization

Efficient Prompting

Write prompts that minimize tokens while maximizing clarity:

Before (verbose):

I would like you to please help me write a product description for a new product we are launching. The product is a wireless headphone. It has noise cancellation. It has 30 hours of battery life. Please make it compelling and include all the features I mentioned.

After (optimized):

Write a compelling product description:
Product: Wireless headphones
Features: Noise cancellation, 30h battery
Tone: Engaging, benefit-focused
Length: 100 words
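
If you assemble prompts in code, a small template helper keeps them this compact and consistent. A minimal sketch (the field names are arbitrary):

// Sketch: build the optimized prompt above from structured fields.
function productDescriptionPrompt({ product, features, tone, length }) {
  return [
    'Write a compelling product description:',
    `Product: ${product}`,
    `Features: ${features.join(', ')}`,
    `Tone: ${tone}`,
    `Length: ${length}`
  ].join('\n');
}

// Usage
const prompt = productDescriptionPrompt({
  product: 'Wireless headphones',
  features: ['Noise cancellation', '30h battery'],
  tone: 'Engaging, benefit-focused',
  length: '100 words'
});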

Structured Outputs

Request structured output to reduce post-processing:

Generate product copy in JSON format:
{
  "headline": "...",
  "description": "...",
  "bullets": ["...", "...", "..."],
  "cta": "..."
}
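
Models sometimes wrap JSON in extra prose, so parse defensively. A minimal sketch, assuming the same generate helper used throughout this guide returns plain text:

// Sketch: request JSON and parse it defensively.
async function generateProductCopy(product) {
  const prompt = `Generate product copy for "${product}" as JSON with keys: headline, description, bullets, cta.`;
  const raw = await generate(prompt, 'gpt-4o-mini');

  // Keep only the first {...} block in case the model adds surrounding text
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('Model did not return JSON');

  return JSON.parse(match[0]);
}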

Few-Shot vs Zero-Shot

Use few-shot examples only when necessary:

Zero-shot (cheaper, faster):

Classify sentiment: "Great product, loved it!"
Output: positive/negative/neutral

Few-shot (better accuracy for complex tasks):

Classify sentiment:
"Terrible experience" -> negative
"It was okay" -> neutral
"Absolutely amazing!" -> positive

"Great product, loved it!" ->

Caching Strategies

Response Caching

Cache identical or similar requests:

const { createHash } = require('crypto');

const cache = new Map();

async function cachedGenerate(prompt, model, ttl = 3600000) {
  const cacheKey = createHash('md5').update(`${model}:${prompt}`).digest('hex');

  // Check cache
  const cached = cache.get(cacheKey);
  if (cached && Date.now() - cached.timestamp < ttl) {
    return { ...cached.result, fromCache: true };
  }

  // Generate fresh
  const result = await generate(prompt, model);

  // Store in cache
  cache.set(cacheKey, {
    result,
    timestamp: Date.now()
  });

  return { ...result, fromCache: false };
}

Semantic Caching

Cache based on semantic similarity:

async function semanticCache(prompt, threshold = 0.95) {
  const embedding = await getEmbedding(prompt);

  // Find similar cached prompts
  const similar = await findSimilarCached(embedding, threshold);

  if (similar) {
    return { result: similar.result, similarity: similar.score, fromCache: true };
  }

  // Generate and cache
  const result = await generate(prompt);
  await storeWithEmbedding(prompt, embedding, result);

  return { result, fromCache: false };
}
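
The findSimilarCached and storeWithEmbedding helpers above are left abstract. For small caches, a cosine-similarity scan over an in-memory list is enough; at scale you would use a vector database. A sketch:

// Sketch: in-memory implementations of the helpers used above.
const semanticEntries = []; // { prompt, embedding, result }

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function findSimilarCached(embedding, threshold) {
  let best = null;
  for (const entry of semanticEntries) {
    const score = cosineSimilarity(embedding, entry.embedding);
    if (score >= threshold && (!best || score > best.score)) {
      best = { result: entry.result, score };
    }
  }
  return best;
}

async function storeWithEmbedding(prompt, embedding, result) {
  semanticEntries.push({ prompt, embedding, result });
}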

Batching and Parallelization

Batch Processing

Group similar requests for efficiency:

async function batchGenerate(prompts, model, batchSize = 10) {
  const results = [];

  for (let i = 0; i < prompts.length; i += batchSize) {
    const batch = prompts.slice(i, i + batchSize);

    // Process batch in parallel
    const batchResults = await Promise.all(
      batch.map(prompt => generate(prompt, model))
    );

    results.push(...batchResults);
  }

  return results;
}
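
One design note: Promise.all rejects the whole batch if a single request fails. If partial results are acceptable, Promise.allSettled is the safer choice. A sketch using the same assumed generate helper:

// Sketch: tolerate individual failures within a batch.
async function batchGenerateSettled(prompts, model, batchSize = 10) {
  const results = [];

  for (let i = 0; i < prompts.length; i += batchSize) {
    const batch = prompts.slice(i, i + batchSize);
    const settled = await Promise.allSettled(
      batch.map(prompt => generate(prompt, model))
    );

    // Keep successful results; record errors instead of failing the batch
    results.push(
      ...settled.map(s =>
        s.status === 'fulfilled' ? s.value : { error: s.reason }
      )
    );
  }

  return results;
}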

Smart Parallelization

Parallelize independent operations:

async function generateProductContent(product) {
  // Run independent generations in parallel
  const [description, seoMeta, socialPosts] = await Promise.all([
    generate(descriptionPrompt(product), 'claude-sonnet-4-20250514'),
    generate(seoPrompt(product), 'gpt-4o-mini'),
    generate(socialPrompt(product), 'gpt-4o-mini')
  ]);

  return { description, seoMeta, socialPosts };
}

Token Optimization

Minimizing Input Tokens

  1. Remove redundancy: Eliminate repeated information
  2. Use shorthand: "desc" vs "description"
  3. Compress context: Summarize long documents before including them (see the sketch after this list)
  4. Use references: "Generate 3 more like above" instead of repeating
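
For point 3, a common pattern is to compress long context with a cheap model before passing it to the main prompt. A minimal sketch, assuming the generate helper used throughout this guide:

// Sketch: summarize long context with a cheap model before reuse.
async function compressContext(document, maxWords = 200) {
  const summaryPrompt = `Summarize the following in at most ${maxWords} words, keeping key facts:\n\n${document}`;
  return generate(summaryPrompt, 'gpt-4o-mini');
}

// Usage: include the summary instead of the full document in later prompts
// const summary = await compressContext(longDocument);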

Controlling Output Tokens

Constrain output length directly in the prompt:

Generate a product description.
Constraints:
- Max 150 words
- 3-4 sentences
- No emojis

Token Counting

Monitor token usage:

const { encode } = require('gpt-tokenizer');

function countTokens(text) {
  return encode(text).length;
}

async function monitoredGenerate(prompt, model, maxTokens) {
  const inputTokens = countTokens(prompt);

  if (inputTokens > maxTokens * 0.5) {
    console.warn('Large input prompt', { inputTokens });
  }

  const result = await generate(prompt, model, { maxTokens });
  const outputTokens = countTokens(result);

  logUsage({ model, inputTokens, outputTokens });

  return result;
}

Latency Optimization

Streaming Responses

Use streaming for better perceived performance:

async function* streamGenerate(prompt, model) {
  const response = await fetch('/api/generate', {
    method: 'POST',
    body: JSON.stringify({ prompt, model, stream: true })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    yield decoder.decode(value);
  }
}

// Usage
for await (const chunk of streamGenerate(prompt, model)) {
  process.stdout.write(chunk);
}

Edge Deployment

Reduce latency with edge functions:

// Vercel Edge Function
export const config = { runtime: 'edge' };

export default async function handler(req) {
  const { prompt } = await req.json();

  // Call Promptha from an edge location
  const result = await generate(prompt);

  return new Response(JSON.stringify(result), {
    headers: { 'Content-Type': 'application/json' }
  });
}

Cost Optimization

Usage Monitoring

Track and analyze usage:

class UsageTracker {
  constructor() {
    this.usage = [];
  }

  track(model, inputTokens, outputTokens, cost) {
    this.usage.push({
      timestamp: Date.now(),
      model,
      inputTokens,
      outputTokens,
      cost
    });
  }

  getStats(period = 'day') {
    const cutoff = this.getCutoff(period);
    const relevant = this.usage.filter(u => u.timestamp > cutoff);

    return {
      totalCost: relevant.reduce((sum, u) => sum + u.cost, 0),
      totalTokens: relevant.reduce((sum, u) => sum + u.inputTokens + u.outputTokens, 0),
      byModel: this.groupBy(relevant, 'model')
    };
  }

  // Start of the reporting window for the given period
  getCutoff(period) {
    const ms = { hour: 3600000, day: 86400000, week: 604800000 }[period] || 86400000;
    return Date.now() - ms;
  }

  // Group usage entries by a field (e.g. model)
  groupBy(entries, key) {
    return entries.reduce((groups, entry) => {
      (groups[entry[key]] = groups[entry[key]] || []).push(entry);
      return groups;
    }, {});
  }
}

Budget Alerts

Set up cost alerts:

async function checkBudget(currentSpend, budget, alertThreshold = 0.8) {
  const usage = currentSpend / budget;

  if (usage >= alertThreshold) {
    await sendAlert({
      type: 'budget_warning',
      message: `AI spend at ${Math.round(usage * 100)}% of budget`,
      currentSpend,
      budget
    });
  }

  if (usage >= 1) {
    // Implement fallback or throttling
    return { throttle: true, fallbackModel: 'gpt-4o-mini' };
  }

  return { throttle: false };
}

Measuring Performance

Key Metrics

Track these metrics:

Metric             | How to Measure                     | Target
Latency (P50/P99)  | Request timing                     | <2s / <5s
Throughput         | Requests per minute                | Based on load
Error rate         | Failed requests / total requests   | <1%
Cost per request   | Total cost / request count         | Varies
Quality score      | User ratings or automated scoring  | >4/5

Performance Dashboard

async function getPerformanceMetrics(timeRange) {
  const requests = await getRequests(timeRange);

  return {
    latency: {
      p50: percentile(requests.map(r => r.latency), 50),
      p99: percentile(requests.map(r => r.latency), 99)
    },
    throughput: requests.length / timeRange.hours,
    errorRate: requests.filter(r => r.error).length / requests.length,
    costPerRequest: totalCost(requests) / requests.length,
    byModel: groupByModel(requests)
  };
}
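
The percentile helper used above is left undefined; a minimal nearest-rank implementation looks like this:

// Sketch: nearest-rank percentile over an array of numbers.
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[index];
}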

Best Practices Summary

  1. Match model to task - Don't overpay for simple tasks
  2. Optimize prompts - Shorter prompts = faster, cheaper
  3. Cache aggressively - Same inputs = same outputs
  4. Batch when possible - Reduce overhead
  5. Stream for UX - Better perceived performance
  6. Monitor continuously - Track metrics and costs
  7. Set budgets - Prevent runaway costs
  8. Test and iterate - A/B test optimizations

Start optimizing: Performance Tools
