AI Performance Optimization Guide
Learn how to optimize model selection, prompt engineering, and infrastructure for speed, quality, and cost.
Performance Dimensions
When optimizing AI performance, consider three key dimensions:
- Speed - How fast you get results
- Quality - How good the outputs are
- Cost - How much you spend per output
These often trade off against each other, so optimization is about finding the right balance for your use case.
Model Selection Optimization
Matching Models to Tasks
Choose the right model for each task:
| Task Type | Recommended | Why |
|---|---|---|
| Simple generation | GPT-4o Mini, Gemini Flash | Fast, cheap, sufficient quality |
| Complex reasoning | Claude Sonnet, GPT-4o | Better understanding, nuanced output |
| Code generation | Claude Sonnet, DeepSeek | Code-optimized training |
| Creative writing | Claude, GPT-4o | Richer, more creative output |
| Classification | Any fast model | Simple task, minimize cost |
| Summarization | Gemini Flash | Good at compression |
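One way to apply this table in code is a simple task-to-model lookup. This is a minimal sketch, assuming the generate(prompt, model) helper used throughout this guide; the model IDs (including 'gemini-flash') are illustrative placeholders rather than exact API identifiers.
const MODEL_BY_TASK = {
  simple_generation: 'gpt-4o-mini',
  complex_reasoning: 'claude-sonnet-4-20250514',
  code_generation: 'claude-sonnet-4-20250514',
  creative_writing: 'claude-sonnet-4-20250514',
  classification: 'gpt-4o-mini',
  summarization: 'gemini-flash' // placeholder ID; use your provider's actual model name
};
async function routedGenerate(taskType, prompt) {
  // Unknown task types fall back to a cheap default
  const model = MODEL_BY_TASK[taskType] || 'gpt-4o-mini';
  return generate(prompt, model);
}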
Model Cascading
Start with cheaper models, escalate if needed:
async function cascadeGeneration(prompt, qualityThreshold = 0.7) {
// Start with fast, cheap model
const fastResult = await generate(prompt, 'gpt-4o-mini');
// Check quality (your own scoring function)
const quality = await scoreQuality(fastResult);
if (quality >= qualityThreshold) {
return { result: fastResult, model: 'gpt-4o-mini', cost: 'low' };
}
// Escalate to premium model
const premiumResult = await generate(prompt, 'claude-sonnet-4-20250514');
return { result: premiumResult, model: 'claude-sonnet', cost: 'high' };
}
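The cascade depends on the scoreQuality function you supply. Below is a minimal heuristic sketch; the length and phrasing checks are assumptions that illustrate the shape of the function, and in practice you might use an LLM-as-judge call or task-specific validation instead.
// Minimal heuristic quality scorer returning a value between 0 and 1.
// The thresholds below are illustrative; tune them for your task.
async function scoreQuality(text) {
  if (!text || text.trim().length === 0) return 0;
  let score = 1.0;
  if (text.trim().length < 50) score -= 0.5;                   // suspiciously short
  if (/as an ai (language )?model/i.test(text)) score -= 0.3;  // refusal boilerplate
  if (!/[.!?]$/.test(text.trim())) score -= 0.2;               // likely cut off mid-sentence
  return Math.max(score, 0);
}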
A/B Testing Models
Test different models to find the best fit:
async function abTestModels(prompt, models, iterations = 100) {
const results = {};
for (const model of models) {
const samples = [];
for (let i = 0; i < iterations; i++) {
const start = Date.now();
const result = await generate(prompt, model);
const latency = Date.now() - start;
const quality = await scoreQuality(result);
samples.push({ latency, quality, result });
}
results[model] = {
avgLatency: average(samples.map(s => s.latency)),
avgQuality: average(samples.map(s => s.quality)),
samples
};
}
return results;
}
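The abTestModels function above relies on a small average helper. Here is one way to define it, along with a hypothetical pickModel example that ranks models by quality per millisecond of latency; that weighting policy is an assumption, and you may prefer to factor in cost instead.
function average(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
// Example: choose the model with the best quality-to-latency ratio
async function pickModel(prompt) {
  const results = await abTestModels(prompt, ['gpt-4o-mini', 'claude-sonnet-4-20250514'], 20);
  const ranked = Object.entries(results)
    .sort((a, b) => (b[1].avgQuality / b[1].avgLatency) - (a[1].avgQuality / a[1].avgLatency));
  return ranked[0][0]; // model name of the best performer
}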
Prompt Optimization
Efficient Prompting
Write prompts that minimize tokens while maximizing clarity:
Before (verbose):
I would like you to please help me write a product description for a new product we are launching. The product is a wireless headphone. It has noise cancellation. It has 30 hours of battery life. Please make it compelling and include all the features I mentioned.
After (optimized):
Write a compelling product description:
Product: Wireless headphones
Features: Noise cancellation, 30h battery
Tone: Engaging, benefit-focused
Length: 100 words
Structured Outputs
Request structured output to reduce post-processing:
Generate product copy in JSON format:
{
"headline": "...",
"description": "...",
"bullets": ["...", "...", "..."],
"cta": "..."
}
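Structured output is only useful if you can rely on its shape. Below is a minimal parsing and validation sketch for the JSON above; the fence-stripping step and the error behavior are assumptions about how you want to handle malformed responses.
function parseProductCopy(raw) {
  // Some models wrap JSON in a markdown code fence; strip it before parsing
  const cleaned = raw.replace(/^```(json)?\s*|\s*```$/g, '').trim();
  const parsed = JSON.parse(cleaned);
  // Make sure every field requested in the prompt is present
  const required = ['headline', 'description', 'bullets', 'cta'];
  const missing = required.filter(key => !(key in parsed));
  if (missing.length > 0) {
    throw new Error(`Structured output missing fields: ${missing.join(', ')}`);
  }
  return parsed;
}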
Few-Shot vs Zero-Shot
Use few-shot examples only when necessary:
Zero-shot (cheaper, faster):
Classify sentiment: "Great product, loved it!"
Output: positive/negative/neutral
Few-shot (better accuracy for complex tasks):
Classify sentiment:
"Terrible experience" -> negative
"It was okay" -> neutral
"Absolutely amazing!" -> positive
"Great product, loved it!" ->
Caching Strategies
Response Caching
Cache responses to identical requests; semantically similar requests are covered in the next section:
const { createHash } = require('crypto');
const cache = new Map();
async function cachedGenerate(prompt, model, ttl = 3600000) {
const cacheKey = createHash('md5').update(`${model}:${prompt}`).digest('hex');
// Check cache
const cached = cache.get(cacheKey);
if (cached && Date.now() - cached.timestamp < ttl) {
return { ...cached.result, fromCache: true };
}
// Generate fresh
const result = await generate(prompt, model);
// Store in cache
cache.set(cacheKey, {
result,
timestamp: Date.now()
});
return { ...result, fromCache: false };
}
Semantic Caching
Cache based on semantic similarity:
async function semanticCache(prompt, threshold = 0.95) {
const embedding = await getEmbedding(prompt);
// Find similar cached prompts
const similar = await findSimilarCached(embedding, threshold);
if (similar) {
return { result: similar.result, similarity: similar.score };
}
// Generate and cache
const result = await generate(prompt);
await storeWithEmbedding(prompt, embedding, result);
return { result, fromCache: false };
}
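The getEmbedding, findSimilarCached, and storeWithEmbedding helpers are left to you. Below is a minimal in-memory sketch of the last two using cosine similarity; a production setup would typically use a vector database, and getEmbedding itself depends on your embedding provider.
const semanticStore = [];
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
async function findSimilarCached(embedding, threshold) {
  // Return the best cached entry above the similarity threshold, if any
  let best = null;
  for (const entry of semanticStore) {
    const score = cosineSimilarity(embedding, entry.embedding);
    if (score >= threshold && (!best || score > best.score)) {
      best = { result: entry.result, score };
    }
  }
  return best;
}
async function storeWithEmbedding(prompt, embedding, result) {
  semanticStore.push({ prompt, embedding, result });
}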
Batching and Parallelization
Batch Processing
Group similar requests for efficiency:
async function batchGenerate(prompts, model, batchSize = 10) {
const results = [];
for (let i = 0; i < prompts.length; i += batchSize) {
const batch = prompts.slice(i, i + batchSize);
// Process batch in parallel
const batchResults = await Promise.all(
batch.map(prompt => generate(prompt, model))
);
results.push(...batchResults);
}
return results;
}
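One caveat with Promise.all: a single failed request rejects the whole batch. If you would rather keep the successful results and collect the failures for retry, a variant with Promise.allSettled looks like this (the return shape is an assumption):
async function batchGenerateSettled(prompts, model, batchSize = 10) {
  const succeeded = [];
  const failed = [];
  for (let i = 0; i < prompts.length; i += batchSize) {
    const batch = prompts.slice(i, i + batchSize);
    const outcomes = await Promise.allSettled(
      batch.map(prompt => generate(prompt, model))
    );
    outcomes.forEach((outcome, j) => {
      if (outcome.status === 'fulfilled') {
        succeeded.push(outcome.value);
      } else {
        failed.push({ prompt: batch[j], error: outcome.reason });
      }
    });
  }
  return { succeeded, failed };
}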
Smart Parallelization
Parallelize independent operations:
async function generateProductContent(product) {
// Run independent generations in parallel
const [description, seoMeta, socialPosts] = await Promise.all([
generate(descriptionPrompt(product), 'claude-sonnet-4-20250514'),
generate(seoPrompt(product), 'gpt-4o-mini'),
generate(socialPrompt(product), 'gpt-4o-mini')
]);
return { description, seoMeta, socialPosts };
}
Token Optimization
Minimizing Input Tokens
- Remove redundancy: Eliminate repeated information
- Use shorthand: "desc" vs "description"
- Compress context: Summarize long documents before including them (see the sketch after this list)
- Use references: "Generate 3 more like above" instead of repeating
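A sketch of the context-compression idea, assuming a cheap model handles the summarization and a rough character threshold stands in for proper token counting; longDocument is a placeholder for your own content.
async function compressContext(document, maxChars = 4000) {
  // Short documents can be passed through unchanged
  if (document.length <= maxChars) return document;
  // Summarize with a fast, cheap model before including it in the main prompt
  return generate(
    `Summarize the following in under 200 words, keeping key facts:\n\n${document}`,
    'gpt-4o-mini'
  );
}
// Usage: include the compressed version in the expensive prompt
const context = await compressContext(longDocument); // longDocument is a placeholder for your source text
const answer = await generate(`Using this context:\n${context}\n\nWrite a product FAQ.`, 'claude-sonnet-4-20250514');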
Controlling Output Tokens
Generate a product description.
Constraints:
- Max 150 words
- 3-4 sentences
- No emojis
Token Counting
Monitor token usage:
const { encode } = require('gpt-tokenizer');
function countTokens(text) {
return encode(text).length;
}
async function monitoredGenerate(prompt, model, maxTokens) {
const inputTokens = countTokens(prompt);
if (inputTokens > maxTokens * 0.5) {
console.warn('Large input prompt', { inputTokens });
}
const result = await generate(prompt, model, { maxTokens });
const outputTokens = countTokens(result);
logUsage({ model, inputTokens, outputTokens });
return result;
}
Latency Optimization
Streaming Responses
Use streaming for better perceived performance:
async function* streamGenerate(prompt, model) {
const response = await fetch('/api/generate', {
method: 'POST',
body: JSON.stringify({ prompt, model, stream: true })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
yield decoder.decode(value);
}
}
// Usage
for await (const chunk of streamGenerate(prompt, model)) {
process.stdout.write(chunk);
}
Edge Deployment
Reduce latency with edge functions:
// Vercel Edge Function
export const config = { runtime: 'edge' };
export default async function handler(req) {
const { prompt } = await req.json();
// Call Promptha from edge location
const result = await generate(prompt);
return new Response(JSON.stringify(result), {
headers: { 'Content-Type': 'application/json' }
});
}
Cost Optimization
Usage Monitoring
Track and analyze usage:
class UsageTracker {
constructor() {
this.usage = [];
}
track(model, inputTokens, outputTokens, cost) {
this.usage.push({
timestamp: Date.now(),
model,
inputTokens,
outputTokens,
cost
});
}
getStats(period = 'day') {
const cutoff = this.getCutoff(period);
const relevant = this.usage.filter(u => u.timestamp > cutoff);
return {
totalCost: relevant.reduce((sum, u) => sum + u.cost, 0),
totalTokens: relevant.reduce((sum, u) => sum + u.inputTokens + u.outputTokens, 0),
byModel: this.groupBy(relevant, 'model')
};
}
  getCutoff(period) {
    // Convert a named period into a cutoff timestamp; defaults to one day
    const ms = { hour: 3600000, day: 86400000, week: 604800000 };
    return Date.now() - (ms[period] || ms.day);
  }
  groupBy(entries, key) {
    // Group usage entries by a field, e.g. model name
    return entries.reduce((groups, entry) => {
      (groups[entry[key]] = groups[entry[key]] || []).push(entry);
      return groups;
    }, {});
  }
}
Budget Alerts
Set up cost alerts:
async function checkBudget(currentSpend, budget, alertThreshold = 0.8) {
const usage = currentSpend / budget;
if (usage >= alertThreshold) {
await sendAlert({
type: 'budget_warning',
message: `AI spend at ${Math.round(usage * 100)}% of budget`,
currentSpend,
budget
});
}
if (usage >= 1) {
// Implement fallback or throttling
return { throttle: true, fallbackModel: 'gpt-4o-mini' };
}
return { throttle: false };
}
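To put the budget check to work, you can wrap generation so that exceeding the budget downgrades the model rather than failing outright. A minimal sketch, assuming currentSpend and budget come from your usage tracker:
async function guardedGenerate(prompt, preferredModel, currentSpend, budget) {
  const { throttle, fallbackModel } = await checkBudget(currentSpend, budget);
  // Over budget: fall back to the cheaper model instead of refusing the request
  const model = throttle ? fallbackModel : preferredModel;
  return generate(prompt, model);
}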
Measuring Performance
Key Metrics
Track these metrics:
| Metric | How to Measure | Target |
|---|---|---|
| Latency P50/P99 | Request timing | <2s / <5s |
| Throughput | Requests per minute | Based on load |
| Error rate | Failed requests | <1% |
| Cost per request | Total cost / requests | Varies |
| Quality score | User ratings or automated | >4/5 |
Performance Dashboard
async function getPerformanceMetrics(timeRange) {
const requests = await getRequests(timeRange);
return {
latency: {
p50: percentile(requests.map(r => r.latency), 50),
p99: percentile(requests.map(r => r.latency), 99)
},
throughput: requests.length / timeRange.hours,
errorRate: requests.filter(r => r.error).length / requests.length,
costPerRequest: totalCost(requests) / requests.length,
byModel: groupByModel(requests)
};
}
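The percentile and totalCost helpers used above can be simple utilities; getRequests and groupByModel remain whatever your storage layer provides. A sketch of the two math helpers, using the nearest-rank percentile method:
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank method: the value at the p-th percentile position
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(Math.max(index, 0), sorted.length - 1)];
}
function totalCost(requests) {
  return requests.reduce((sum, r) => sum + r.cost, 0);
}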
Best Practices Summary
- Match model to task - Don't overpay for simple tasks
- Optimize prompts - Shorter prompts = faster, cheaper
- Cache aggressively - Same inputs = same outputs
- Batch when possible - Reduce overhead
- Stream for UX - Better perceived performance
- Monitor continuously - Track metrics and costs
- Set budgets - Prevent runaway costs
- Test and iterate - A/B test optimizations
Start optimizing: Performance Tools