The real engineering behind integrating large language models into web applications. Streaming responses, managing costs, handling failures, prompt management, caching strategies, and building AI features users actually want.
Every tutorial on integrating LLMs into web apps follows the same script: install the SDK, call the API, display the response. Twenty lines of code, maybe thirty. "Look how easy it is!" And it is easy — until you ship it to real users and discover that your AI feature costs $400/day, times out on 12% of requests, occasionally tells users to eat glass, and streams responses so slowly that people close the tab before they see the answer.
I have spent the better part of a year building AI-powered features into production web applications. Not chatbot wrappers. Not demo projects. Features that handle real traffic, run against real cost constraints, and have to work reliably for users who do not care that there is an LLM behind the curtain — they just want the thing to work.
What follows is everything I have learned about doing this properly. The architecture decisions, the streaming implementation details, the cost traps, the failure modes, and the UX patterns that make the difference between an AI feature that delights users and one that embarrasses you.
Before you write a single line of code, you need to decide where the LLM sits in your architecture. This sounds obvious, but the wrong decision here will haunt you for months.
Option 1: Direct API calls from your backend. Your Next.js API route calls the LLM provider directly. Simple, fast to implement, and the right choice for most teams starting out.
Option 2: A dedicated AI service layer. A separate service (or at least a separate module) that owns all LLM interactions. Prompt templates, model selection, response parsing, caching — all in one place. More work upfront, dramatically easier to maintain.
Option 3: Edge functions. Running LLM calls at the edge sounds appealing for latency. In practice, edge runtimes have strict execution time limits and memory constraints that make them a poor fit for most LLM workloads. The LLM API call itself dominates latency anyway — shaving 50ms off the network hop to your server doesn't matter when the model takes 3 seconds to respond.
I started with Option 1 and migrated to Option 2 within two months. The trigger was discovering that I had the same prompt scattered across seven different API routes, each with slightly different system messages, and a bug fix in one place meant hunting down six other places. Learn from my mistake: centralize from the start.
Here is the pattern I settled on:
// src/lib/ai/provider.ts
import { createOpenAI } from "@ai-sdk/openai";
import { createAnthropic } from "@ai-sdk/anthropic";
const providers = {
openai: createOpenAI({
apiKey: process.env.OPENAI_API_KEY,
compatibility: "strict",
}),
anthropic: createAnthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
}),
};
export type ModelConfig = {
provider: keyof typeof providers;
model: string;
maxTokens: number;
temperature: number;
fallback?: ModelConfig;
};
export const models = {
fast: {
provider: "openai" as const,
model: "gpt-4o-mini",
maxTokens: 1024,
temperature: 0.3,
fallback: {
provider: "anthropic" as const,
model: "claude-3-5-haiku-20241022",
maxTokens: 1024,
temperature: 0.3,
},
},
quality: {
provider: "anthropic" as const,
model: "claude-sonnet-4-20250514",
maxTokens: 4096,
temperature: 0.4,
fallback: {
provider: "openai" as const,
model: "gpt-4o",
maxTokens: 4096,
temperature: 0.4,
},
},
reasoning: {
provider: "anthropic" as const,
model: "claude-opus-4-20250514",
maxTokens: 8192,
temperature: 0.2,
},
} satisfies Record<string, ModelConfig>;

The key insight: you need at least two tiers of model. A fast, cheap model for simple tasks (summaries, classifications, short generations) and a slower, expensive model for tasks that require actual reasoning. Trying to use one model for everything either bankrupts you or disappoints users.
Non-streaming LLM responses are unusable in production. A 3-second wait with no feedback feels like an eternity to users. Streaming changes the perceived latency from "how long until I see anything" to "how long until the response is complete" — and the first metric is the one that matters for user experience.
I have seen people try to stream LLM responses over WebSockets. Do not do this. SSE (Server-Sent Events) exists specifically for this pattern: the server pushes data to the client over a long-lived HTTP connection. It is simpler, works through more proxies and CDNs, reconnects automatically via the browser's EventSource, and avoids maintaining a bidirectional channel you don't need.
Here is a production-grade streaming endpoint:
// src/app/api/ai/generate/route.ts
import { streamText } from "ai";
import { getModel } from "@/lib/ai/provider";
import { getPrompt } from "@/lib/ai/prompts";
import { rateLimit } from "@/lib/rate-limit";
import { estimateTokens } from "@/lib/ai/tokens";
// selectModel and handleAIError are defined later in this post
export const runtime = "nodejs";
export const maxDuration = 60;
export async function POST(req: Request) {
// x-forwarded-for can be a comma-separated chain; take the client IP
const ip = (req.headers.get("x-forwarded-for") ?? "unknown").split(",")[0].trim();
const limiter = await rateLimit(ip, { max: 20, window: 60 });
if (!limiter.success) {
return new Response("Rate limit exceeded", {
status: 429,
headers: { "Retry-After": String(limiter.retryAfter) },
});
}
const { input, feature, context } = await req.json();
// Token budget enforcement BEFORE calling the model
const estimatedInputTokens = estimateTokens(input + (context ?? ""));
if (estimatedInputTokens > 8000) {
return new Response(
JSON.stringify({ error: "Input too long", maxTokens: 8000 }),
{ status: 400 }
);
}
const prompt = getPrompt(feature, { input, context });
const modelConfig = selectModel(feature, estimatedInputTokens);
try {
const result = streamText({
model: getModel(modelConfig),
messages: prompt.messages,
maxTokens: modelConfig.maxTokens,
temperature: modelConfig.temperature,
abortSignal: req.signal,
});
return result.toDataStreamResponse();
} catch (error) {
return handleAIError(error);
}
}

A few things to notice here that tutorials skip:
maxDuration: 60 — Next.js API routes have a default timeout. LLM responses can take 30+ seconds for long generations. If you don't increase this, your responses will be truncated silently.
abortSignal: req.signal — When a user navigates away, the request is aborted. Without this, you keep paying for a response nobody will see. This alone saved me measurable money.
Token estimation before the call — You do not want to send a 50,000 token prompt to the API and discover it fails (or costs $2) after the fact. Estimate first, reject early.
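The `estimateTokens` helper does not need to be exact — it exists to reject obviously oversized inputs before you pay for them. A rough sketch using the ~4-characters-per-token rule of thumb for English text (my implementation here, not a real tokenizer — use the provider's tokenizer when you need billing-accurate counts):

```typescript
// src/lib/ai/tokens.ts
// Rough token estimate: ~4 characters per token for English text.
// Good enough for early rejection; the provider's tokenizer is the
// source of truth for billing.
export function estimateTokens(text: string): number {
  if (!text) return 0;
  const chars = text.length;
  const words = text.split(/\s+/).filter(Boolean).length;
  // Take the larger of chars/4 and words*1.33 to avoid
  // underestimating dense, low-whitespace input
  return Math.ceil(Math.max(chars / 4, words * 1.33));
}
```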
On the client side, the Vercel AI SDK handles most of the complexity, but understanding what happens underneath matters for debugging:
// src/hooks/useAIGeneration.ts
"use client";
import { useChat } from "ai/react";
import { useCallback, useState } from "react";
export function useAIGeneration(feature: string) {
const [isGenerating, setIsGenerating] = useState(false);
const { messages, append, isLoading, error, stop } = useChat({
api: "/api/ai/generate",
body: { feature },
onResponse(response) {
if (!response.ok) {
setIsGenerating(false);
}
},
onFinish() {
setIsGenerating(false);
},
onError(err) {
setIsGenerating(false);
console.error(`AI generation failed for ${feature}:`, err);
},
});
const generate = useCallback(
async (input: string, context?: string) => {
setIsGenerating(true);
await append(
{ role: "user", content: input },
// Forward per-request context alongside the static body
{ body: { context } }
);
},
[append]
);
const cancel = useCallback(() => {
stop();
setIsGenerating(false);
}, [stop]);
return {
messages,
generate,
cancel,
isGenerating,
isStreaming: isLoading,
error,
};
}

Critical detail: always give users a cancel button. LLM generations can take 30 seconds. If a user realizes they asked the wrong question at second 3, they should not have to wait 27 more seconds. The cancel also saves you money — an aborted stream stops billing for output tokens.
Sometimes you cannot or do not want to use the Vercel AI SDK. Here is what streaming looks like from scratch, because understanding this will save you hours of debugging when something goes wrong:
// Raw SSE streaming without the AI SDK
export async function POST(req: Request) {
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
const response = await fetch(
"https://api.anthropic.com/v1/messages",
{
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": process.env.ANTHROPIC_API_KEY!,
"anthropic-version": "2023-06-01",
},
body: JSON.stringify({
model: "claude-sonnet-4-20250514",
max_tokens: 2048,
stream: true,
messages: [{ role: "user", content: "..." }],
}),
}
);
const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop() ?? "";
for (const line of lines) {
if (line.startsWith("data: ")) {
const data = line.slice(6);
try {
const parsed = JSON.parse(data);
// Anthropic signals completion with a message_stop event
// (there is no OpenAI-style "[DONE]" sentinel here)
if (parsed.type === "message_stop") {
controller.close();
return;
}
if (parsed.type === "content_block_delta") {
const text = parsed.delta?.text ?? "";
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ text })}\n\n`)
);
}
} catch {
// Partial line without its trailing newline stays in the
// buffer and is completed by the next chunk
}
}
}
}
}
controller.close();
} catch (error) {
controller.enqueue(
encoder.encode(
`data: ${JSON.stringify({ error: "Generation failed" })}\n\n`
)
);
controller.close();
}
},
});
return new Response(stream, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
},
});
}

The buffer handling is the part that bites people. SSE data arrives in chunks that do not necessarily align with JSON boundaries. You need to buffer incomplete lines and only parse complete ones. I have seen production code that crashes because it tries to JSON.parse a partial chunk.
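If you want that buffering logic testable, pull it out into a pure function. A sketch (the name `splitSSELines` is mine — the idea is just: split on newlines, carry the trailing partial line forward):

```typescript
// Split an SSE text chunk into complete "data:" payloads, carrying
// any trailing partial line over to the next call.
export function splitSSELines(
  buffer: string,
  chunk: string
): { payloads: string[]; rest: string } {
  const lines = (buffer + chunk).split("\n");
  // Last element is "" if the chunk ended on a newline, otherwise a
  // partial line that must wait for the next chunk
  const rest = lines.pop() ?? "";
  const payloads = lines
    .filter((line) => line.startsWith("data: "))
    .map((line) => line.slice(6));
  return { payloads, rest };
}
```

Feeding it two chunks that split a JSON payload mid-object yields the complete payload only once the second chunk arrives — exactly the behavior the inline version above implements.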
Prompt engineering in a production application is nothing like the prompt engineering you see on Twitter. You are not writing clever one-liners. You are building a system of composable, versioned, testable prompt templates that must produce consistent results across thousands of diverse inputs.
// src/lib/ai/prompts.ts
type PromptTemplate = {
system: string;
userTemplate: (vars: Record<string, string>) => string;
version: string;
model: "fast" | "quality" | "reasoning";
};
const prompts: Record<string, PromptTemplate> = {
"tool-description": {
system: `You are a technical writer for a developer tools website.
Write concise, accurate descriptions of web-based tools.
Rules:
- Maximum 2 sentences
- Focus on what the tool DOES, not how it works
- No marketing language ("powerful", "revolutionary", "amazing")
- No first person
- Include the primary use case`,
userTemplate: ({ toolName, toolCategory, existingDescription }) =>
`Write a description for the tool "${toolName}" in the ${toolCategory} category.
${existingDescription ? `Current description (improve this): ${existingDescription}` : ""}`,
version: "2.1",
model: "fast",
},
"code-review": {
system: `You are a senior software engineer reviewing code.
Provide specific, actionable feedback. No generic advice.
Focus on:
- Bugs and logic errors (critical)
- Security issues (critical)
- Performance problems (important)
- Readability improvements (nice to have)
Format: Use markdown. List issues by severity.
Do NOT suggest stylistic changes unless they impact readability.
Do NOT rewrite the entire code — point out specific lines.`,
userTemplate: ({ code, language, context }) =>
`Review this ${language} code:
\`\`\`${language}
${code}
\`\`\`
${context ? `Context: ${context}` : ""}`,
version: "3.0",
model: "quality",
},
"content-moderation": {
system: `You are a content moderation system. Classify the input text.
Respond with ONLY a JSON object:
{
"safe": boolean,
"category": "safe" | "harassment" | "hate" | "sexual" | "violence" | "self-harm" | "illegal",
"confidence": number (0-1),
"reason": string (brief explanation, max 20 words)
}
No other text. No markdown. No explanation outside the JSON.`,
userTemplate: ({ text }) => `Classify this text:\n\n${text}`,
version: "1.4",
model: "fast",
},
};
export function getPrompt(
feature: string,
vars: Record<string, string>
): { messages: Array<{ role: string; content: string }> } {
const template = prompts[feature];
if (!template) throw new Error(`Unknown prompt: ${feature}`);
return {
messages: [
{ role: "system", content: template.system },
{ role: "user", content: template.userTemplate(vars) },
],
};
}

Lessons from managing prompts in production:
Version your prompts. When you change a prompt, you change the behavior of your application. You need to know which version produced which output, especially when debugging user reports.
System messages are your guardrails. The system message is the only thing between your application and the model deciding to go off-script. Be explicit about format, constraints, and forbidden behaviors. "Be concise" is not a constraint. "Maximum 2 sentences, no marketing language, no first person" is a constraint.
Template variables must be sanitized. If your user template includes user input, that input can contain prompt injection attacks. At minimum, truncate to a maximum length and strip control characters. For sensitive features, run the input through a moderation check first.
This is the elephant in the room that most LLM tutorials completely ignore. If your application takes user input and puts it into a prompt, users can manipulate the model's behavior:
// src/lib/ai/safety.ts
const INJECTION_PATTERNS = [
/ignore\s+(all\s+)?previous\s+instructions/i,
/you\s+are\s+now\s+/i,
/system\s*:\s*/i,
/\[INST\]/i,
/<<SYS>>/i,
/forget\s+(everything|all|your)/i,
/new\s+instructions?\s*:/i,
/override\s+(system|instructions|prompt)/i,
];
export function detectPromptInjection(input: string): {
suspicious: boolean;
patterns: string[];
} {
const matches = INJECTION_PATTERNS.filter((p) => p.test(input)).map(
(p) => p.source
);
return {
suspicious: matches.length > 0,
patterns: matches,
};
}
export function sanitizeInput(input: string, maxLength = 4000): string {
return input
.slice(0, maxLength)
.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "") // control chars
.trim();
}

This is not bulletproof. Prompt injection is an unsolved problem. But defense in depth helps: input sanitization, output validation, least-privilege system prompts, and treating LLM output as untrusted data (never executing it, never inserting it into SQL, never rendering it as raw HTML).
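One concrete piece of that defense: escape model output before it ever touches HTML. Rendering model output with innerHTML or dangerouslySetInnerHTML is an XSS vector, because a prompt-injected model can be coaxed into emitting script tags. A minimal helper (my sketch, not from the code above):

```typescript
// Escape LLM output before interpolating it into HTML. Treat model
// output exactly like user input: untrusted until proven otherwise.
export function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```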
LLM API calls are expensive and slow. Caching identical or semantically similar requests can cut your costs by 40-60% depending on your use case.
The simplest and most effective cache: hash the prompt, check if you have seen it before.
// src/lib/ai/cache.ts
import { createHash } from "crypto";
import { redis } from "@/lib/redis"; // wherever your Redis client lives
function hashPrompt(
messages: Array<{ role: string; content: string }>,
model: string,
temperature: number
): string {
const input = JSON.stringify({ messages, model, temperature });
return createHash("sha256").update(input).digest("hex");
}
export async function getCachedResponse(
messages: Array<{ role: string; content: string }>,
model: string,
temperature: number
): Promise<string | null> {
// Only cache deterministic requests
if (temperature > 0.1) return null;
const key = `ai:cache:${hashPrompt(messages, model, temperature)}`;
try {
const cached = await redis.get(key);
if (cached) {
await redis.hincrby("ai:stats", "cache_hits", 1);
return cached;
}
await redis.hincrby("ai:stats", "cache_misses", 1);
return null;
} catch {
// Cache failures should never block AI responses
return null;
}
}
export async function cacheResponse(
messages: Array<{ role: string; content: string }>,
model: string,
temperature: number,
response: string,
ttl = 86400 // 24 hours default
): Promise<void> {
if (temperature > 0.1) return;
const key = `ai:cache:${hashPrompt(messages, model, temperature)}`;
try {
await redis.setex(key, ttl, response);
} catch {
// Silent failure — caching is optimization, not requirement
}
}

Important: only cache when temperature is near zero. With higher temperatures, the same prompt should produce different responses. Serving cached responses for a creative writing feature would make it feel broken.
TTL strategy matters. I use different TTLs per feature: long-lived for content that rarely changes, short-lived for anything derived from fast-moving data.
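Concretely, that can be as simple as a per-feature lookup table. The values below are illustrative, not my production numbers — tune them to how often the underlying content actually changes:

```typescript
// Illustrative per-feature cache TTLs in seconds. The principle:
// match the TTL to the content's rate of change, not one global default.
const cacheTTLs: Record<string, number> = {
  "tool-description": 7 * 86400, // stable content, regenerate weekly
  "code-review": 3600, // code changes constantly, keep it short
  "content-moderation": 30 * 86400, // identical text classifies identically
};

export function ttlFor(feature: string, fallback = 86400): number {
  return cacheTTLs[feature] ?? fallback;
}
```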
For some features, exact-match caching is not enough. "How do I center a div?" and "centering a div in CSS" should hit the same cache. This is where embeddings come in:
// src/lib/ai/semantic-cache.ts
import { embed } from "ai";
import { openai } from "@ai-sdk/openai";
const SIMILARITY_THRESHOLD = 0.92;
export async function getSemanticallyCachedResponse(
query: string
): Promise<string | null> {
const { embedding } = await embed({
model: openai.embedding("text-embedding-3-small"),
value: query,
});
// Search for similar cached queries using cosine similarity
// In production, use a vector DB (pgvector, Pinecone, etc.)
const candidates = await vectorStore.search(embedding, {
topK: 1,
threshold: SIMILARITY_THRESHOLD,
});
if (candidates.length > 0) {
return candidates[0].metadata.response;
}
return null;
}

Semantic caching is powerful but adds latency (the embedding call itself takes 50-100ms) and complexity. I only use it for features with high query repetition and expensive underlying model calls. For most features, exact-match caching gets you 80% of the benefit with 20% of the complexity.
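For completeness, here is the similarity math behind SIMILARITY_THRESHOLD. Vector databases compute this natively, but a plain implementation helps when you are debugging why a query did or did not hit the cache:

```typescript
// Cosine similarity: 1 means identical direction, 0 means orthogonal.
// A threshold of 0.92 demands near-identical meaning between queries.
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```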
Here is the thing nobody tells you until the bill arrives: LLM costs scale with usage in a way that traditional compute costs do not. A busy API endpoint might cost you $20/month in compute. The same endpoint backed by an LLM can cost $20/day.
Every feature in your application should have a token budget:
// src/lib/ai/budget.ts
type FeatureBudget = {
maxInputTokens: number;
maxOutputTokens: number;
maxRequestsPerUser: number;
maxRequestsPerDay: number;
costPerRequest: number; // estimated, in cents
dailyBudgetCents: number;
};
const budgets: Record<string, FeatureBudget> = {
"tool-description": {
maxInputTokens: 500,
maxOutputTokens: 200,
maxRequestsPerUser: 50,
maxRequestsPerDay: 5000,
costPerRequest: 0.02,
dailyBudgetCents: 100, // $1/day
},
"code-review": {
maxInputTokens: 8000,
maxOutputTokens: 2000,
maxRequestsPerUser: 10,
maxRequestsPerDay: 1000,
costPerRequest: 0.8,
dailyBudgetCents: 800, // $8/day
},
"content-moderation": {
maxInputTokens: 2000,
maxOutputTokens: 100,
maxRequestsPerUser: 200,
maxRequestsPerDay: 50000,
costPerRequest: 0.005,
dailyBudgetCents: 250, // $2.50/day
},
};
export async function checkBudget(
feature: string,
userId: string
): Promise<{ allowed: boolean; reason?: string }> {
const budget = budgets[feature];
if (!budget) return { allowed: false, reason: "Unknown feature" };
const today = new Date().toISOString().slice(0, 10);
// Check per-user limit
const userKey = `ai:usage:${feature}:user:${userId}:${today}`;
const userCount = await redis.incr(userKey);
if (userCount === 1) await redis.expire(userKey, 86400);
if (userCount > budget.maxRequestsPerUser) {
return { allowed: false, reason: "Daily user limit reached" };
}
// Check global daily limit
const globalKey = `ai:usage:${feature}:global:${today}`;
const globalCount = await redis.incr(globalKey);
if (globalCount === 1) await redis.expire(globalKey, 86400);
if (globalCount > budget.maxRequestsPerDay) {
return { allowed: false, reason: "Feature daily limit reached" };
}
// Check cost budget
const costKey = `ai:cost:${feature}:${today}`;
const currentCost = parseFloat((await redis.get(costKey)) ?? "0");
if (currentCost > budget.dailyBudgetCents) {
return { allowed: false, reason: "Daily cost budget exceeded" };
}
return { allowed: true };
}
export async function recordUsage(
feature: string,
inputTokens: number,
outputTokens: number,
model: string
): Promise<void> {
const cost = calculateCost(model, inputTokens, outputTokens);
const today = new Date().toISOString().slice(0, 10);
const costKey = `ai:cost:${feature}:${today}`;
await redis.incrbyfloat(costKey, cost);
// Also track aggregate stats
await redis.hincrby("ai:stats:tokens", `${feature}:input`, inputTokens);
await redis.hincrby("ai:stats:tokens", `${feature}:output`, outputTokens);
}

The cost trap with streaming: when you stream, you often do not know the total output tokens until the stream is complete. You have to track this post-hoc, which means your budget checks are always slightly behind reality. Build in a 20% buffer.
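The `calculateCost` helper referenced in `recordUsage` is just a rate-table lookup. A sketch — the per-million-token rates below are placeholders, so check your provider's current price sheet before trusting any of them:

```typescript
// Per-million-token rates in cents. PLACEHOLDER numbers — providers
// change pricing; load these from config, not hardcoded constants.
const rates: Record<string, { input: number; output: number }> = {
  "gpt-4o-mini": { input: 15, output: 60 },
  "claude-3-5-haiku-20241022": { input: 80, output: 400 },
};

export function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const rate = rates[model];
  if (!rate) return 0; // unknown model: record zero, alert separately
  return (
    (inputTokens / 1_000_000) * rate.input +
    (outputTokens / 1_000_000) * rate.output
  );
}
```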
Not every request needs your most expensive model. A classification task (is this spam? what category is this?) can use the cheapest model available. A code review needs something smarter. Build this intelligence into your model selection:
function selectModel(
feature: string,
estimatedInputTokens: number
): ModelConfig {
const budget = budgets[feature];
const today = new Date().toISOString().slice(0, 10);
// If we're over 80% of daily budget, downgrade to cheaper model
const currentCost = getCachedCost(feature, today);
if (currentCost > budget.dailyBudgetCents * 0.8) {
return models.fast; // Always fall back to cheapest
}
// Short inputs with simple tasks -> fast model
if (estimatedInputTokens < 500 && isSimpleTask(feature)) {
return models.fast;
}
return models[prompts[feature].model];
}

This adaptive model selection saved me roughly 35% on monthly costs. The trick is being honest about which tasks actually need the expensive model. Most of them don't.
LLM APIs fail in ways that traditional APIs do not: rate limits, provider overloads, timeouts mid-stream, and content-policy refusals. You need to handle all of these:
// src/lib/ai/errors.ts
export async function withAIRetry<T>(
fn: () => Promise<T>,
options: {
maxRetries?: number;
feature: string;
fallback?: () => Promise<T>;
}
): Promise<T> {
const maxRetries = options.maxRetries ?? 3;
let lastError: Error | null = null;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error: any) {
lastError = error;
// Don't retry on client errors (except rate limits)
if (error.status >= 400 && error.status < 500 && error.status !== 429) {
break;
}
// Rate limit: respect Retry-After header
if (error.status === 429) {
const retryAfter = parseInt(
error.headers?.["retry-after"] ?? "5",
10
);
await sleep(retryAfter * 1000);
continue;
}
// Exponential backoff for other errors
if (attempt < maxRetries) {
await sleep(Math.pow(2, attempt) * 1000);
}
}
}
// All retries failed — try fallback model
if (options.fallback) {
try {
return await options.fallback();
} catch (fallbackError) {
// Fallback also failed — log and throw
console.error(
`AI fallback failed for ${options.feature}:`,
fallbackError
);
}
}
throw lastError;
}
function sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}

This is one of the most valuable patterns I implemented. When your primary model provider has an outage (and they will — I have seen every major provider go down at least once in the past year), your AI features should not disappear entirely:
// src/lib/ai/fallback.ts
import { generateText } from "ai";
import { getModel, type ModelConfig } from "@/lib/ai/provider";
export async function generateWithFallback(
config: ModelConfig,
params: {
messages: Array<{ role: string; content: string }>;
maxTokens: number;
temperature: number;
}
) {
try {
return await generateText({
model: getModel(config),
...params,
});
} catch (error: any) {
if (config.fallback && shouldFallback(error)) {
console.warn(
`Primary model ${config.model} failed, falling back to ${config.fallback.model}`
);
return await generateText({
model: getModel(config.fallback),
...params,
});
}
throw error;
}
}
function shouldFallback(error: any): boolean {
// Fallback on server errors, timeouts, and rate limits
// Don't fallback on client errors (bad prompt, content policy)
return (
error.status >= 500 ||
error.status === 429 ||
error.code === "ETIMEDOUT" ||
error.code === "ECONNRESET"
);
}

I run with dual providers (different companies). When one is down, the other usually isn't. This has saved me from multiple outage-related incidents where my AI features would have been completely unavailable.
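Failover works better with a circuit breaker in front of each provider, so you stop hammering one that is clearly down. A minimal in-memory sketch — this assumes a single server instance; share the state via Redis or similar once you run more than one:

```typescript
// Minimal circuit breaker: after `threshold` consecutive failures,
// skip the provider for `cooldownMs`, then allow a probe request.
export class ProviderBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(
    private threshold = 5,
    private cooldownMs = 30_000
  ) {}
  isOpen(now = Date.now()): boolean {
    if (this.failures < this.threshold) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      this.failures = 0; // half-open: let one request probe the provider
      return false;
    }
    return true;
  }
  recordFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures === this.threshold) this.openedAt = now;
  }
  recordSuccess(): void {
    this.failures = 0;
  }
}
```

Check `isOpen()` before calling the primary; if it is open, go straight to the fallback config instead of waiting for another timeout.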
LLMs return text. Your application needs structured data. This mismatch is the source of endless pain.
You ask the model to return JSON. Sometimes it does. Sometimes it wraps it in a markdown code block. Sometimes it adds a chatty preamble. Sometimes the JSON is almost valid but has a trailing comma. You need robust parsing:
// src/lib/ai/parse.ts
import { z } from "zod";
export function extractJSON<T>(
text: string,
schema: z.ZodSchema<T>
): { success: true; data: T } | { success: false; error: string } {
// Try direct parse first
try {
const parsed = JSON.parse(text);
const result = schema.safeParse(parsed);
if (result.success) return { success: true, data: result.data };
} catch {
// Not direct JSON, try extraction
}
// Extract from markdown code blocks
const codeBlockMatch = text.match(/```(?:json)?\s*\n?([\s\S]*?)```/);
if (codeBlockMatch) {
try {
const parsed = JSON.parse(codeBlockMatch[1].trim());
const result = schema.safeParse(parsed);
if (result.success) return { success: true, data: result.data };
} catch {
// Malformed JSON in code block
}
}
// Try to find JSON object/array in text
const jsonMatch = text.match(/(\{[\s\S]*\}|\[[\s\S]*\])/);
if (jsonMatch) {
try {
// Fix common LLM JSON mistakes. Note the quote swap is crude —
// it corrupts strings containing apostrophes — so treat this
// branch as a last resort
const fixed = jsonMatch[1]
.replace(/,\s*}/g, "}") // trailing commas in objects
.replace(/,\s*]/g, "]") // trailing commas in arrays
.replace(/'/g, '"'); // single quotes to double
const parsed = JSON.parse(fixed);
const result = schema.safeParse(parsed);
if (result.success) return { success: true, data: result.data };
} catch {
// Still not valid
}
}
return {
success: false,
error: `Failed to extract valid JSON from response: ${text.slice(0, 200)}...`,
};
}

Better approach: use the model provider's structured output features when available. Both OpenAI and Anthropic now support JSON mode and tool use that constrains the output format. This is far more reliable than parsing free text:
import { generateObject } from "ai";
import { z } from "zod";
const moderationSchema = z.object({
safe: z.boolean(),
category: z.enum([
"safe",
"harassment",
"hate",
"sexual",
"violence",
"self-harm",
"illegal",
]),
confidence: z.number().min(0).max(1),
reason: z.string().max(100),
});
const { object } = await generateObject({
model: getModel(models.fast),
schema: moderationSchema,
prompt: `Classify this text: ${sanitizedInput}`,
});
// object is fully typed and validated
console.log(object.safe, object.category);

This structured output approach eliminates an entire class of parsing bugs. Use it whenever your output has a known schema.
RAG is the pattern where you retrieve relevant context from your own data and include it in the prompt. It is how you make LLMs answer questions about your specific content without fine-tuning.
User Query
  |
  v
[Embed Query] --> [Vector Search] --> [Retrieve Top-K Documents]
  |
  v
[Build Prompt with Retrieved Context]
  |
  v
[LLM Generates Answer Grounded in Your Data]
// src/lib/ai/rag.ts
import { embed, generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { prisma } from "@/lib/prisma";
import { getModel, models } from "@/lib/ai/provider";
type Document = {
id: string;
content: string;
metadata: Record<string, string>;
similarity: number;
};
export async function queryWithRAG(
query: string,
options: {
collection: string;
topK?: number;
minSimilarity?: number;
}
): Promise<{ answer: string; sources: Document[] }> {
const { topK = 5, minSimilarity = 0.7 } = options;
// 1. Embed the query
const { embedding } = await embed({
model: openai.embedding("text-embedding-3-small"),
value: query,
});
// 2. Vector search (using pgvector here, but Pinecone/Weaviate work too)
const documents = await prisma.$queryRaw<Document[]>`
SELECT
id,
content,
metadata,
1 - (embedding <=> ${embedding}::vector) as similarity
FROM documents
WHERE collection = ${options.collection}
AND 1 - (embedding <=> ${embedding}::vector) > ${minSimilarity}
ORDER BY embedding <=> ${embedding}::vector
LIMIT ${topK}
`;
if (documents.length === 0) {
return {
answer:
"I don't have enough information to answer that question accurately.",
sources: [],
};
}
// 3. Build context-augmented prompt
const context = documents
.map((d, i) => `[Source ${i + 1}]: ${d.content}`)
.join("\n\n");
const { text } = await generateText({
model: getModel(models.quality),
messages: [
{
role: "system",
content: `Answer the user's question based ONLY on the provided sources.
If the sources don't contain enough information, say so.
Cite sources using [Source N] notation.
Do not make up information not present in the sources.`,
},
{
role: "user",
content: `Sources:\n${context}\n\nQuestion: ${query}`,
},
],
});
return { answer: text, sources: documents };
}

Chunk size matters enormously. Too small (100 tokens) and you lose context. Too large (2000 tokens) and you waste context window space on irrelevant text. I settled on 500-800 tokens with 100-token overlap between chunks. But this varies by content type — code needs larger chunks than prose.
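A word-based approximation of that chunking strategy looks like this. It is a sketch: words stand in for tokens, and a real tokenizer (or sentence-boundary-aware splitting) gives better chunks in practice:

```typescript
// Split text into overlapping chunks. Sizes are in words as a cheap
// proxy for tokens; the overlap preserves context across boundaries.
export function chunkText(
  text: string,
  chunkSize = 600,
  overlap = 100
): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= chunkSize) {
    return words.length ? [words.join(" ")] : [];
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window covered the tail
  }
  return chunks;
}
```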
Embedding models have blind spots. Short queries like "auth" or "deploy" embed poorly because there is not enough semantic signal. For short queries, I augment with keyword search (BM25) and merge the results.
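Merging the vector and BM25 result lists is the fiddly part, because their scores are not comparable. Reciprocal rank fusion sidesteps that by using only ranks — a sketch, with k=60 as the conventional constant from the RRF literature:

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
// per document id. Robust when the underlying scores (BM25 vs cosine
// similarity) live on incomparable scales.
export function reciprocalRankFusion(
  rankings: string[][],
  k = 60
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```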
"Based only on the provided sources" is not as strong a guardrail as you think. Models will still hallucinate if the sources are tangentially related but don't actually contain the answer. You need to validate the output, not just trust the instruction.
Freshness is a real problem. If you embed your documentation once and never update it, your RAG system will serve outdated answers. I run re-embedding on a schedule — daily for frequently changing content, weekly for stable content.
LLM latency is brutal. A typical request involves the network round trip to the provider, time queued before the model starts, time to first token, and then token-by-token generation — several seconds end to end.
You cannot make the model think faster. But you can optimize everything around it.
If a feature needs multiple LLM calls, run them in parallel:
// Bad: sequential calls (6 seconds total)
const summary = await generateText({ ... });
const keywords = await generateText({ ... });
const sentiment = await generateText({ ... });
// Good: parallel calls (2 seconds total)
const [summary, keywords, sentiment] = await Promise.all([
generateText({ model: getModel(models.fast), prompt: summaryPrompt }),
generateText({ model: getModel(models.fast), prompt: keywordsPrompt }),
generateText({ model: getModel(models.fast), prompt: sentimentPrompt }),
]);

For features where you can predict the next user action, start the LLM call before the user explicitly requests it:
// When user starts typing code, speculatively begin analysis
// (debounce from lodash-es; speculativeResultRef is a useRef
// holding the in-flight promise)
const debouncedAnalysis = useMemo(
() =>
debounce(async (code: string) => {
if (code.length > 50) {
// Start analysis in background
speculativeResultRef.current = generateText({
model: getModel(models.fast),
prompt: buildAnalysisPrompt(code),
});
}
}, 2000),
[]
);

This makes the response feel instant when the user clicks "Analyze" because the work already started 2 seconds ago.
For predictable queries (e.g., tool descriptions that will be needed on page load), generate and cache them ahead of time rather than on-demand:
// scripts/prefetch-descriptions.ts
// Run this as a cron job, not on every request
async function prefetchToolDescriptions() {
const tools = await getAllTools();
for (const tool of tools) {
const cacheKey = `ai:desc:${tool.slug}`;
const cached = await redis.get(cacheKey);
if (cached) continue;
const { text } = await generateText({
model: getModel(models.fast),
prompt: buildDescriptionPrompt(tool),
});
await redis.setex(cacheKey, 604800, text); // 7 days
// Respect rate limits
await sleep(200);
}
}

The UX around AI features matters more than the AI itself. A mediocre model with great UX beats a great model with terrible UX every time.
Do not just dump streaming text into a container. Parse it incrementally and render it properly:
// src/components/ai/StreamingMarkdown.tsx
"use client";
import { memo, useMemo } from "react";
import ReactMarkdown from "react-markdown";
import { Prism as SyntaxHighlighter } from "react-syntax-highlighter";
interface StreamingMarkdownProps {
content: string;
isStreaming: boolean;
}
export const StreamingMarkdown = memo(function StreamingMarkdown({
content,
isStreaming,
}: StreamingMarkdownProps) {
// Avoid re-parsing markdown on every token
const stableContent = useMemo(() => {
if (!isStreaming) return content;
// While streaming, ensure we don't render incomplete
// markdown that causes layout shifts
const lines = content.split("\n");
const lastLine = lines[lines.length - 1];
// If the last line looks like an incomplete code block, don't render it yet
if (lastLine?.startsWith("```") && !content.endsWith("```")) {
return lines.slice(0, -1).join("\n");
}
return content;
}, [content, isStreaming]);
return (
<div className="prose prose-neutral dark:prose-invert max-w-none">
<ReactMarkdown
components={{
code({ className, children, ...props }) {
const match = /language-(\w+)/.exec(className ?? "");
if (match) {
return (
<SyntaxHighlighter language={match[1]}>
{String(children).replace(/\n$/, "")}
</SyntaxHighlighter>
);
}
return (
<code className={className} {...props}>
{children}
</code>
);
},
}}
>
{stableContent}
</ReactMarkdown>
{isStreaming && (
<span className="inline-block w-2 h-4 bg-current animate-pulse ml-0.5" />
)}
</div>
);
});

The blinking cursor matters. That tiny animated cursor at the end of streaming text tells the user "I'm still working." Without it, users don't know if the response is done or frozen. It is a small detail with a massive impact on perceived quality.
When the AI fails, do not show "Something went wrong." Tell the user something useful:
function getErrorMessage(error: any, feature: string): string {
if (error.status === 429) {
return "This feature is temporarily busy. Please try again in a few seconds.";
}
if (error.status === 413 || error.message?.includes("token")) {
return "Your input is too long for AI analysis. Try with a shorter text.";
}
if (error.code === "ETIMEDOUT") {
return "The AI took too long to respond. This sometimes happens with complex requests. Please try again.";
}
if (error.message?.includes("content_policy")) {
return "This content cannot be processed by our AI. Please modify your input.";
}
return "AI analysis is temporarily unavailable. Your other tools still work fine.";
}

Notice the last message: "Your other tools still work fine." AI features should degrade gracefully. If the LLM is down, the rest of your application should be completely unaffected. Never let an AI feature outage take down non-AI features.
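Graceful degradation is easier to enforce with one small wrapper than with try/catch scattered through every route. A minimal sketch — the `withAIFallback` helper and its signature are my own, not from any SDK:

```typescript
// Hypothetical helper: run an AI call, but never let its failure
// propagate into the rest of the request. If the call throws (timeout,
// rate limit, provider outage), return a predefined fallback instead.
async function withAIFallback<T>(
  aiCall: () => Promise<T>,
  fallback: T,
  onError?: (error: unknown) => void
): Promise<T> {
  try {
    return await aiCall();
  } catch (error) {
    // Log for telemetry, but swallow the error: the page still renders.
    onError?.(error);
    return fallback;
  }
}
```

The non-AI parts of the page never see the exception; they just receive the fallback (an empty summary, a static description) and render normally.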
// src/components/ai/AILoadingState.tsx
export function AILoadingState({ feature }: { feature: string }) {
const estimates: Record<string, string> = {
"code-review": "10-20 seconds",
"tool-description": "2-5 seconds",
summarize: "5-10 seconds",
};
return (
<div className="flex items-center gap-3 text-muted-foreground">
<div className="flex gap-1">
<span className="w-2 h-2 bg-current rounded-full animate-bounce" />
<span className="w-2 h-2 bg-current rounded-full animate-bounce [animation-delay:0.1s]" />
<span className="w-2 h-2 bg-current rounded-full animate-bounce [animation-delay:0.2s]" />
</div>
<span>
Analyzing... This usually takes {estimates[feature] ?? "a few seconds"}.
</span>
</div>
);
}

Setting expectations upfront ("this usually takes 10-20 seconds") dramatically reduces perceived wait time. Users who know to expect a wait are far more patient than users who are wondering if the feature is broken.
Testing non-deterministic systems is hard. But "it's non-deterministic" is not an excuse to skip testing. Here is how I approach it:
Test everything around the LLM call deterministically. The prompt construction, the response parsing, the error handling, the caching logic — all of this is deterministic and should have thorough unit tests:
// src/lib/ai/__tests__/parse.test.ts
import { describe, it, expect } from "vitest";
import { extractJSON } from "../parse";
import { z } from "zod";
const schema = z.object({
safe: z.boolean(),
category: z.string(),
});
describe("extractJSON", () => {
it("parses direct JSON", () => {
const result = extractJSON('{"safe": true, "category": "safe"}', schema);
expect(result.success).toBe(true);
});
it("extracts JSON from markdown code block", () => {
const input = `Here is my analysis:\n\`\`\`json\n{"safe": true, "category": "safe"}\n\`\`\``;
const result = extractJSON(input, schema);
expect(result.success).toBe(true);
});
it("handles trailing commas", () => {
const input = '{"safe": true, "category": "safe",}';
const result = extractJSON(input, schema);
expect(result.success).toBe(true);
});
it("fails gracefully on garbage input", () => {
const result = extractJSON("I cannot help with that.", schema);
expect(result.success).toBe(false);
});
});

When you change a prompt, you want to know. Snapshot tests catch unintended prompt modifications:
describe("prompts", () => {
it("tool-description prompt matches snapshot", () => {
const prompt = getPrompt("tool-description", {
toolName: "JSON Formatter",
toolCategory: "developer",
existingDescription: "",
});
expect(prompt.messages[0].content).toMatchSnapshot();
});
});

For critical features, I run periodic integration tests against the actual API. Not on every commit — that would be expensive and slow — but on a daily schedule:
// src/lib/ai/__tests__/integration.test.ts
// Only runs with AI_INTEGRATION_TESTS=true
describe.skipIf(!process.env.AI_INTEGRATION_TESTS)(
"AI integration",
() => {
it("moderation correctly flags harmful content", async () => {
const result = await moderate("I will hurt someone");
expect(result.safe).toBe(false);
expect(result.confidence).toBeGreaterThan(0.8);
}, 30000);
it("moderation allows benign content", async () => {
const result = await moderate("How do I center a div in CSS?");
expect(result.safe).toBe(true);
}, 30000);
it("structured output matches schema", async () => {
const result = await analyzeCode(
"function add(a, b) { return a + b; }",
"javascript"
);
expect(result).toHaveProperty("issues");
expect(Array.isArray(result.issues)).toBe(true);
}, 60000);
}
);

For any AI feature that matters, build an evaluation set — a collection of inputs with expected outputs that you can run periodically to catch regressions:
const evaluationSet = [
{
input: "A React hook for fetching data",
expectedCategory: "developer",
expectedContains: ["hook", "fetch", "React"],
expectedNotContains: ["revolutionary", "powerful"],
},
{
input: "Convert PDF to Word document",
expectedCategory: "converter",
expectedContains: ["PDF", "Word"],
},
];
async function runEvaluation() {
const results = [];
for (const testCase of evaluationSet) {
const response = await generateDescription(testCase.input);
const passed =
testCase.expectedContains?.every((word) =>
response.toLowerCase().includes(word.toLowerCase())
) &&
!testCase.expectedNotContains?.some((word) =>
response.toLowerCase().includes(word.toLowerCase())
);
results.push({
input: testCase.input,
response,
passed,
});
}
const passRate = results.filter((r) => r.passed).length / results.length;
console.log(`Evaluation pass rate: ${(passRate * 100).toFixed(1)}%`);
if (passRate < 0.85) {
// Alert: prompt may have regressed
await sendAlert(`AI evaluation pass rate dropped to ${passRate}`);
}
}

If your AI feature processes user input, you need moderation. Not optional. Not "we'll add it later." Now.
// src/lib/ai/moderation.ts
export async function moderateInput(input: string): Promise<{
allowed: boolean;
reason?: string;
}> {
// Layer 1: Keyword blocklist (fast, catches obvious cases)
const blocklist = getBlocklist();
for (const term of blocklist) {
if (input.toLowerCase().includes(term)) {
return { allowed: false, reason: "blocked_content" };
}
}
// Layer 2: Prompt injection detection
const injection = detectPromptInjection(input);
if (injection.suspicious) {
return { allowed: false, reason: "prompt_injection" };
}
// Layer 3: AI-based moderation (most accurate, but costs money)
const moderation = await classifyContent(input);
if (!moderation.safe && moderation.confidence > 0.8) {
return { allowed: false, reason: moderation.category };
}
return { allowed: true };
}

Layer your defenses. The keyword blocklist catches the obvious stuff for free. Prompt injection detection catches manipulation attempts. AI-based moderation catches nuanced cases. Each layer has different cost/accuracy tradeoffs.
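The `detectPromptInjection` helper referenced above can start as simple pattern matching. A heuristic sketch — the patterns here are illustrative, not exhaustive, and determined attackers will get past regexes, which is exactly why the AI-based layer sits behind it:

```typescript
interface InjectionCheck {
  suspicious: boolean;
  matchedPattern?: string;
}

// Heuristic layer only: cheap regexes that catch the most common
// manipulation phrasings before spending money on AI-based moderation.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all |any )?(previous|prior|above) (instructions|prompts)/i,
  /disregard (your|the) (system )?(prompt|instructions)/i,
  /you are now (a|an|in) /i,
  /reveal (your|the) (system )?prompt/i,
  /repeat (your|the) (instructions|system prompt)/i,
];

function detectPromptInjection(input: string): InjectionCheck {
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(input)) {
      return { suspicious: true, matchedPattern: pattern.source };
    }
  }
  return { suspicious: false };
}
```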
Moderate outputs too. Even with a perfect system prompt, models can sometimes generate inappropriate content. Run the output through at minimum a basic check before showing it to users:
export function sanitizeOutput(text: string): string {
// Remove any system prompt leakage
const systemLeakPatterns = [
/you are a .* assistant/gi,
/as an ai/gi,
/i'?m (just )?an? ai/gi,
/my (system )?instructions/gi,
];
let cleaned = text;
for (const pattern of systemLeakPatterns) {
cleaned = cleaned.replace(pattern, "");
}
return cleaned.trim();
}

If you are building any kind of search, recommendation, or RAG feature, you need an embedding pipeline. Here is a minimal but production-ready setup:
// src/lib/ai/embeddings.ts
import { embed, embedMany } from "ai";
import { openai } from "@ai-sdk/openai";
const embeddingModel = openai.embedding("text-embedding-3-small");
export async function embedDocument(
content: string,
metadata: Record<string, string>
): Promise<void> {
const chunks = splitIntoChunks(content, {
maxTokens: 600,
overlap: 100,
});
const { embeddings } = await embedMany({
model: embeddingModel,
values: chunks.map((c) => c.text),
});
// Store in database with pgvector
for (let i = 0; i < chunks.length; i++) {
await prisma.embedding.create({
data: {
content: chunks[i].text,
embedding: embeddings[i],
metadata,
chunkIndex: i,
totalChunks: chunks.length,
},
});
}
}
function splitIntoChunks(
text: string,
options: { maxTokens: number; overlap: number }
): Array<{ text: string; startOffset: number }> {
const { maxTokens, overlap } = options;
const chunks: Array<{ text: string; startOffset: number }> = [];
// Split on paragraph boundaries when possible
const paragraphs = text.split(/\n\n+/);
let currentChunk = "";
let currentOffset = 0;
let chunkStartOffset = 0;
for (const paragraph of paragraphs) {
const combined = currentChunk
? `${currentChunk}\n\n${paragraph}`
: paragraph;
const estimatedTokens = Math.ceil(combined.length / 4);
if (estimatedTokens > maxTokens && currentChunk) {
chunks.push({
text: currentChunk.trim(),
startOffset: chunkStartOffset,
});
// Overlap: keep last portion of previous chunk
// ~4 characters per token, matching the estimate above
const overlapText = currentChunk.slice(-(overlap * 4));
currentChunk = overlapText + "\n\n" + paragraph;
chunkStartOffset = currentOffset - overlapText.length;
} else {
if (!currentChunk) chunkStartOffset = currentOffset;
currentChunk = combined;
}
currentOffset += paragraph.length + 2;
}
if (currentChunk.trim()) {
chunks.push({
text: currentChunk.trim(),
startOffset: chunkStartOffset,
});
}
return chunks;
}

Embedding model choice matters less than you think. I have tested several embedding models, and for most practical use cases, the quality difference is marginal. The cheap, fast model is usually good enough. Save your money for the generative model.
Re-embedding is expensive. If you have 100,000 documents and need to re-embed them because you changed models, that is a significant cost and time investment. Choose your embedding model carefully upfront, and do not switch casually.
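On the retrieval side, the database usually does the similarity math (pgvector's distance operators), but it helps to understand what it computes. A minimal in-memory cosine similarity sketch, useful for reranking a shortlist of candidates after the database query — the `rankBySimilarity` helper is my own naming:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). 1 means same direction,
// 0 means orthogonal. Only compare vectors from the same embedding model.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunks against a query embedding, highest similarity first.
function rankBySimilarity<T extends { embedding: number[] }>(
  query: number[],
  candidates: T[]
): Array<T & { score: number }> {
  return candidates
    .map((c) => ({ ...c, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score);
}
```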
After a year of building AI features, here is my honest assessment:
Works well:
Overhyped:
Underrated:
You cannot improve what you cannot measure. Every AI feature needs:
// src/lib/ai/telemetry.ts
type AITelemetry = {
feature: string;
model: string;
inputTokens: number;
outputTokens: number;
latencyMs: number;
cached: boolean;
fallbackUsed: boolean;
error?: string;
userSatisfaction?: "positive" | "negative";
};
export async function trackAICall(telemetry: AITelemetry): Promise<void> {
// Real-time metrics for alerting
await redis.lpush(
"ai:telemetry:recent",
JSON.stringify({ ...telemetry, timestamp: Date.now() })
);
await redis.ltrim("ai:telemetry:recent", 0, 9999);
// Aggregate stats for dashboards
const today = new Date().toISOString().slice(0, 10);
const prefix = `ai:metrics:${today}:${telemetry.feature}`;
await Promise.all([
redis.hincrby(prefix, "requests", 1),
redis.hincrby(prefix, "input_tokens", telemetry.inputTokens),
redis.hincrby(prefix, "output_tokens", telemetry.outputTokens),
redis.hincrby(prefix, "total_latency_ms", telemetry.latencyMs),
redis.hincrby(prefix, telemetry.cached ? "cache_hits" : "cache_misses", 1),
...(telemetry.error ? [redis.hincrby(prefix, "errors", 1)] : []),
...(telemetry.fallbackUsed ? [redis.hincrby(prefix, "fallbacks", 1)] : []),
]);
await redis.expire(prefix, 2592000); // 30 days
}

The metrics I watch daily: cost per feature (input and output tokens), average latency, cache hit rate, error rate, fallback usage, and the ratio of positive to negative user feedback.
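The daily hash written by `trackAICall` holds raw counters; the numbers worth alerting on are derived from them. A small sketch of the read side — the field names match the `hincrby` calls above, and the hash comes back from `HGETALL` as strings:

```typescript
interface FeatureMetrics {
  requests: number;
  cacheHitRate: number; // 0..1
  errorRate: number; // 0..1
  avgLatencyMs: number;
  fallbackRate: number; // 0..1
}

// `raw` is the HGETALL result for one ai:metrics:<date>:<feature> hash.
function deriveMetrics(raw: Record<string, string>): FeatureMetrics {
  const n = (field: string) => Number(raw[field] ?? 0);
  const requests = n("requests");
  const hits = n("cache_hits");
  const misses = n("cache_misses");
  return {
    requests,
    cacheHitRate: hits + misses > 0 ? hits / (hits + misses) : 0,
    errorRate: requests > 0 ? n("errors") / requests : 0,
    avgLatencyMs: requests > 0 ? n("total_latency_ms") / requests : 0,
    fallbackRate: requests > 0 ? n("fallbacks") / requests : 0,
  };
}
```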
After building AI features into production web applications for the past year, these are the things I wish someone had told me on day one:
1. Your first LLM cost estimate is wrong by at least 3x. You will underestimate how many tokens your prompts use, how many requests users make, and how expensive retries are. Budget 3x what your napkin math says, then add a hard spending cap anyway.
2. Streaming is not optional. I shipped a non-streaming AI feature once. User feedback was universally negative. "Is it broken?" "Did it freeze?" "Nothing is happening." Rebuilt it with streaming in two days. Problem solved. Never ship a user-facing LLM feature without streaming again.
3. The LLM is the least reliable part of your stack. It will time out. It will rate limit you. It will occasionally return garbage. It will refuse benign inputs because its safety filters are overzealous. Design every AI feature to degrade gracefully.
4. Prompt engineering is software engineering. Treat prompts like code: version them, test them, review them, monitor their performance. A "small tweak" to a prompt can change the behavior of your feature in ways that are invisible until users report problems.
5. Users do not care that it is AI. They care that it works. They care that it is fast. They care that it does not waste their time. The word "AI" in your feature name does not make it better. The feature being genuinely useful makes it better.
6. Cache everything that can be cached. The single biggest cost and latency improvement I made was implementing aggressive caching. Many AI features receive the same or similar inputs repeatedly. A 60% cache hit rate cuts both your cost and your latency by more than half.
7. Test the edges, not the happy path. Your AI feature works great with well-formed English input. What happens with empty input? With 50,000 characters? With Chinese? With prompt injection attempts? With pure emoji? With HTML? Test the weird stuff.
8. Start with the cheapest model that works. Most developers reach for GPT-4 or Claude Opus first. For 80% of AI features, a smaller and cheaper model works just as well. Start cheap, upgrade only if quality demands it.
9. Multi-provider is not premature optimization. When your only LLM provider goes down for 4 hours on a Tuesday afternoon and your entire AI feature set is dead, your boss will not accept "they had an outage" as an answer. Having a fallback provider is table stakes for production.
10. The AI hype cycle will pass, but the engineering fundamentals will not. Streaming, caching, error handling, cost management, observability — these are not AI-specific skills. They are distributed systems engineering skills applied to a new type of external dependency. Learn them well, and you will be valuable long after the hype settles.
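Takeaway 9 in code: try providers in priority order and fail over on error. A sketch under stated assumptions — the `Provider` shape is my own, and real code would also distinguish retryable errors (timeouts, 429s) from content refusals, which should not trigger failover:

```typescript
type Provider = {
  name: string;
  call: (prompt: string) => Promise<string>;
};

// Try each provider in priority order; the first success wins.
async function generateWithFailover(
  providers: Provider[],
  prompt: string
): Promise<{ text: string; provider: string }> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      const text = await provider.call(prompt);
      return { text, provider: provider.name };
    } catch (error) {
      // Remember the failure and move on to the next provider.
      lastError = error;
    }
  }
  throw new Error(`All providers failed: ${String(lastError)}`);
}
```

Record which provider served each request in your telemetry; a rising fallback rate is often the first sign of a primary-provider incident.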
Building AI features that users love is hard. Not because the AI is hard — the APIs are remarkably simple. It is hard because the engineering around the AI is hard. The reliability, the cost, the latency, the UX, the safety — that is where the real work is. And that is where the real value is too.