The real engineering behind integrating large language models into web applications. Streaming responses, managing costs, handling failures, prompt management, caching strategies, and building AI features users actually want.
Every tutorial on integrating LLMs into web apps follows the same script: install the SDK, call the API, display the response. Twenty lines of code, maybe thirty. "Look how easy it is!" And it is easy — until you ship it to real users and discover that your AI feature costs $400/day, times out on 12% of requests, occasionally tells users to eat glass, and streams responses so slowly that people close the tab before they see the answer.
I have spent the better part of a year building AI-powered features into production web applications. Not chatbot wrappers. Not demo projects. Features that handle real traffic, run against real cost constraints, and have to work reliably for users who do not care that there is an LLM behind the curtain — they just want the thing to work.
What follows is everything I have learned about doing this properly. The architecture decisions, the streaming implementation details, the cost traps, the failure modes, and the UX patterns that make the difference between an AI feature that delights users and one that embarrasses you.
Before you write a single line of code, you need to decide where the LLM sits in your architecture. This sounds obvious, but the wrong decision here will haunt you for months.
Option 1: Direct API calls from your backend. Your Next.js API route calls the LLM provider directly. Simple, fast to implement, and the right choice for most teams starting out.
Option 2: A dedicated AI service layer. A separate service (or at least a separate module) that owns all LLM interactions. Prompt templates, model selection, response parsing, caching — all in one place. More work upfront, dramatically easier to maintain.
Option 3: Edge functions. Running LLM calls at the edge sounds appealing for latency. In practice, edge runtimes have strict execution time limits and memory constraints that make them a poor fit for most LLM workloads. The LLM API call itself dominates latency anyway — shaving 50ms off the network hop to your server doesn't matter when the model takes 3 seconds to respond.
I started with Option 1 and migrated to Option 2 within two months. The trigger was discovering that I had the same prompt scattered across seven different API routes, each with slightly different system messages, and a bug fix in one place meant hunting down six other places. Learn from my mistake: centralize from the start.
Here is the pattern I settled on:
// src/lib/ai/provider.ts
import { createOpenAI } from "@ai-sdk/openai";
import { createAnthropic } from "@ai-sdk/anthropic";
const providers = {
openai: createOpenAI({
apiKey: process.env.OPENAI_API_KEY,
compatibility: "strict",
}),
anthropic: createAnthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
}),
};
export type ModelConfig = {
provider: keyof typeof providers;
model: string;
maxTokens: number;
temperature: number;
fallback?: ModelConfig;
};
export const models = {
fast: {
provider: "openai" as const,
model: "gpt-4o-mini",
maxTokens: 1024,
temperature: 0.3,
fallback: {
provider: "anthropic" as const,
model: "claude-3-5-haiku-20241022",
maxTokens: 1024,
temperature: 0.3,
},
},
quality: {
provider: "anthropic" as const,
model: "claude-sonnet-4-20250514",
maxTokens: 4096,
temperature: 0.4,
fallback: {
provider: "openai" as const,
model: "gpt-4o",
maxTokens: 4096,
temperature: 0.4,
},
},
reasoning: {
provider: "anthropic" as const,
model: "claude-opus-4-20250514",
maxTokens: 8192,
temperature: 0.2,
},
} satisfies Record<string, ModelConfig>;

The key insight: you need at least two tiers of model. A fast, cheap model for simple tasks (summaries, classifications, short generations) and a slower, expensive model for tasks that require actual reasoning. Trying to use one model for everything either bankrupts you or disappoints users.
Non-streaming LLM responses are unusable in production. A 3-second wait with no feedback feels like an eternity to users. Streaming changes the perceived latency from "how long until I see anything" to "how long until the response is complete" — and the first metric is the one that matters for user experience.
I have seen people try to stream LLM responses over WebSockets. Do not do this. SSE (Server-Sent Events) exists specifically for this pattern: the server pushes data to the client over a long-lived HTTP connection. It is simpler, works through more proxies and CDNs, reconnects automatically via the browser's EventSource, and avoids maintaining a bidirectional channel you don't need.
Here is a production-grade streaming endpoint:
// src/app/api/ai/generate/route.ts
import { streamText } from "ai";
import { getModel } from "@/lib/ai/provider";
import { getPrompt } from "@/lib/ai/prompts";
import { rateLimit } from "@/lib/rate-limit";
import { estimateTokens } from "@/lib/ai/tokens";
// selectModel and handleAIError are defined later in this post
export const runtime = "nodejs";
export const maxDuration = 60;
export async function POST(req: Request) {
// x-forwarded-for can be a comma-separated chain; take the client IP
const ip = (req.headers.get("x-forwarded-for") ?? "unknown").split(",")[0].trim();
const limiter = await rateLimit(ip, { max: 20, window: 60 });
if (!limiter.success) {
return new Response("Rate limit exceeded", {
status: 429,
headers: { "Retry-After": String(limiter.retryAfter) },
});
}
const { input, feature, context } = await req.json();
// Token budget enforcement BEFORE calling the model
const estimatedInputTokens = estimateTokens(input + (context ?? ""));
if (estimatedInputTokens > 8000) {
return new Response(
JSON.stringify({ error: "Input too long", maxTokens: 8000 }),
{ status: 400 }
);
}
const prompt = getPrompt(feature, { input, context });
const modelConfig = selectModel(feature, estimatedInputTokens);
try {
const result = streamText({
model: getModel(modelConfig),
messages: prompt.messages,
maxTokens: modelConfig.maxTokens,
temperature: modelConfig.temperature,
abortSignal: req.signal,
});
return result.toDataStreamResponse();
} catch (error) {
return handleAIError(error);
}
}

A few things to notice here that tutorials skip:
maxDuration: 60 — Next.js API routes have a default timeout. LLM responses can take 30+ seconds for long generations. If you don't increase this, your responses will be truncated silently.
abortSignal: req.signal — When a user navigates away, the request is aborted. Without this, you keep paying for a response nobody will see. This alone saved me measurable money.
Token estimation before the call — You do not want to send a 50,000 token prompt to the API and discover it fails (or costs $2) after the fact. Estimate first, reject early.
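The `estimateTokens` helper does not need to be exact — it exists to reject obviously oversized inputs before you pay for them. A rough sketch using the ~4-characters-per-token rule of thumb for English text (my implementation here, not a real tokenizer — use the provider's tokenizer when you need billing-accurate counts):

```typescript
// src/lib/ai/tokens.ts
// Rough token estimate: ~4 characters per token for English text.
// Good enough for early rejection; the provider's tokenizer is the
// source of truth for billing.
export function estimateTokens(text: string): number {
  if (!text) return 0;
  const chars = text.length;
  const words = text.split(/\s+/).filter(Boolean).length;
  // Take the larger of chars/4 and words*1.33 to avoid
  // underestimating dense, low-whitespace input
  return Math.ceil(Math.max(chars / 4, words * 1.33));
}
```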
On the client side, the Vercel AI SDK handles most of the complexity, but understanding what happens underneath matters for debugging:
// src/hooks/useAIGeneration.ts
"use client";
import { useChat } from "ai/react";
import { useCallback, useState } from "react";
export function useAIGeneration(feature: string) {
const [isGenerating, setIsGenerating] = useState(false);
const { messages, append, isLoading, error, stop } = useChat({
api: "/api/ai/generate",
body: { feature },
onResponse(response) {
if (!response.ok) {
setIsGenerating(false);
}
},
onFinish() {
setIsGenerating(false);
},
onError(err) {
setIsGenerating(false);
console.error(`AI generation failed for ${feature}:`, err);
},
});
const generate = useCallback(
async (input: string, context?: string) => {
setIsGenerating(true);
await append(
{ role: "user", content: input },
// Forward per-request context alongside the static body
{ body: { context } }
);
},
[append]
);
const cancel = useCallback(() => {
stop();
setIsGenerating(false);
}, [stop]);
return {
messages,
generate,
cancel,
isGenerating,
isStreaming: isLoading,
error,
};
}

Critical detail: always give users a cancel button. LLM generations can take 30 seconds. If a user realizes they asked the wrong question at second 3, they should not have to wait 27 more seconds. The cancel also saves you money — an aborted stream stops billing for output tokens.
Sometimes you cannot or do not want to use the Vercel AI SDK. Here is what streaming looks like from scratch, because understanding this will save you hours of debugging when something goes wrong:
// Raw SSE streaming without the AI SDK
export async function POST(req: Request) {
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
const response = await fetch(
"https://api.anthropic.com/v1/messages",
{
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": process.env.ANTHROPIC_API_KEY!,
"anthropic-version": "2023-06-01",
},
body: JSON.stringify({
model: "claude-sonnet-4-20250514",
max_tokens: 2048,
stream: true,
messages: [{ role: "user", content: "..." }],
}),
}
);
const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop() ?? "";
for (const line of lines) {
if (line.startsWith("data: ")) {
const data = line.slice(6);
try {
const parsed = JSON.parse(data);
// Anthropic signals completion with a message_stop event
// (there is no OpenAI-style "[DONE]" sentinel here)
if (parsed.type === "message_stop") {
controller.close();
return;
}
if (parsed.type === "content_block_delta") {
const text = parsed.delta?.text ?? "";
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ text })}\n\n`)
);
}
} catch {
// Partial line without its trailing newline stays in the
// buffer and is completed by the next chunk
}
}
}
}
}
controller.close();
} catch (error) {
controller.enqueue(
encoder.encode(
`data: ${JSON.stringify({ error: "Generation failed" })}\n\n`
)
);
controller.close();
}
},
});
return new Response(stream, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
},
});
}

The buffer handling is the part that bites people. SSE data arrives in chunks that do not necessarily align with JSON boundaries. You need to buffer incomplete lines and only parse complete ones. I have seen production code that crashes because it tries to JSON.parse a partial chunk.
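If you want that buffering logic testable, pull it out into a pure function. A sketch (the name `splitSSELines` is mine — the idea is just: split on newlines, carry the trailing partial line forward):

```typescript
// Split an SSE text chunk into complete "data:" payloads, carrying
// any trailing partial line over to the next call.
export function splitSSELines(
  buffer: string,
  chunk: string
): { payloads: string[]; rest: string } {
  const lines = (buffer + chunk).split("\n");
  // Last element is "" if the chunk ended on a newline, otherwise a
  // partial line that must wait for the next chunk
  const rest = lines.pop() ?? "";
  const payloads = lines
    .filter((line) => line.startsWith("data: "))
    .map((line) => line.slice(6));
  return { payloads, rest };
}
```

Feeding it two chunks that split a JSON payload mid-object yields the complete payload only once the second chunk arrives — exactly the behavior the inline version above implements.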
Prompt engineering in a production application is nothing like the prompt engineering you see on Twitter. You are not writing clever one-liners. You are building a system of composable, versioned, testable prompt templates that must produce consistent results across thousands of diverse inputs.
// src/lib/ai/prompts.ts
type PromptTemplate = {
system: string;
userTemplate: (vars: Record<string, string>) => string;
version: string;
model: "fast" | "quality" | "reasoning";
};
const prompts: Record<string, PromptTemplate> = {
"tool-description": {
system: `You are a technical writer for a developer tools website.
Write concise, accurate descriptions of web-based tools.
Rules:
- Maximum 2 sentences
- Focus on what the tool DOES, not how it works
- No marketing language ("powerful", "revolutionary", "amazing")
- No first person
- Include the primary use case`,
userTemplate: ({ toolName, toolCategory, existingDescription }) =>
`Write a description for the tool "${toolName}" in the ${toolCategory} category.
${existingDescription ? `Current description (improve this): ${existingDescription}` : ""}`,
version: "2.1",
model: "fast",
},
"code-review": {
system: `You are a senior software engineer reviewing code.
Provide specific, actionable feedback. No generic advice.
Focus on:
- Bugs and logic errors (critical)
- Security issues (critical)
- Performance problems (important)
- Readability improvements (nice to have)
Format: Use markdown. List issues by severity.
Do NOT suggest stylistic changes unless they impact readability.
Do NOT rewrite the entire code — point out specific lines.`,
userTemplate: ({ code, language, context }) =>
`Review this ${language} code:
\`\`\`${language}
${code}
\`\`\`
${context ? `Context: ${context}` : ""}`,
version: "3.0",
model: "quality",
},
"content-moderation": {
system: `You are a content moderation system. Classify the input text.
Respond with ONLY a JSON object:
{
"safe": boolean,
"category": "safe" | "harassment" | "hate" | "sexual" | "violence" | "self-harm" | "illegal",
"confidence": number (0-1),
"reason": string (brief explanation, max 20 words)
}
No other text. No markdown. No explanation outside the JSON.`,
userTemplate: ({ text }) => `Classify this text:\n\n${text}`,
version: "1.4",
model: "fast",
},
};
export function getPrompt(
feature: string,
vars: Record<string, string>
): { messages: Array<{ role: string; content: string }> } {
const template = prompts[feature];
if (!template) throw new Error(`Unknown prompt: ${feature}`);
return {
messages: [
{ role: "system", content: template.system },
{ role: "user", content: template.userTemplate(vars) },
],
};
}

Lessons from managing prompts in production:
Version your prompts. When you change a prompt, you change the behavior of your application. You need to know which version produced which output, especially when debugging user reports.
System messages are your guardrails. The system message is the only thing between your application and the model deciding to go off-script. Be explicit about format, constraints, and forbidden behaviors. "Be concise" is not a constraint. "Maximum 2 sentences, no marketing language, no first person" is a constraint.
Template variables must be sanitized. If your user template includes user input, that input can contain prompt injection attacks. At minimum, truncate to a maximum length and strip control characters. For sensitive features, run the input through a moderation check first.
This is the elephant in the room that most LLM tutorials completely ignore. If your application takes user input and puts it into a prompt, users can manipulate the model's behavior:
// src/lib/ai/safety.ts
const INJECTION_PATTERNS = [
/ignore\s+(all\s+)?previous\s+instructions/i,
/you\s+are\s+now\s+/i,
/system\s*:\s*/i,
/\[INST\]/i,
/<<SYS>>/i,
/forget\s+(everything|all|your)/i,
/new\s+instructions?\s*:/i,
/override\s+(system|instructions|prompt)/i,
];
export function detectPromptInjection(input: string): {
suspicious: boolean;
patterns: string[];
} {
const matches = INJECTION_PATTERNS.filter((p) => p.test(input)).map(
(p) => p.source
);
return {
suspicious: matches.length > 0,
patterns: matches,
};
}
export function sanitizeInput(input: string, maxLength = 4000): string {
return input
.slice(0, maxLength)
.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "") // control chars
.trim();
}

This is not bulletproof. Prompt injection is an unsolved problem. But defense in depth helps: input sanitization, output validation, least-privilege system prompts, and treating LLM output as untrusted data (never executing it, never inserting it into SQL, never rendering it as raw HTML).
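One concrete piece of that defense: escape model output before it ever touches HTML. Rendering model output with innerHTML or dangerouslySetInnerHTML is an XSS vector, because a prompt-injected model can be coaxed into emitting script tags. A minimal helper (my sketch, not from the code above):

```typescript
// Escape LLM output before interpolating it into HTML. Treat model
// output exactly like user input: untrusted until proven otherwise.
export function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```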
LLM API calls are expensive and slow. Caching identical or semantically similar requests can cut your costs by 40-60% depending on your use case.
The simplest and most effective cache: hash the prompt, check if you have seen it before.
// src/lib/ai/cache.ts
import { createHash } from "crypto";
import { redis } from "@/lib/redis"; // wherever your Redis client lives
function hashPrompt(
messages: Array<{ role: string; content: string }>,
model: string,
temperature: number
): string {
const input = JSON.stringify({ messages, model, temperature });
return createHash("sha256").update(input).digest("hex");
}
export async function getCachedResponse(
messages: Array<{ role: string; content: string }>,
model: string,
temperature: number
): Promise<string | null> {
// Only cache deterministic requests
if (temperature > 0.1) return null;
const key = `ai:cache:${hashPrompt(messages, model, temperature)}`;
try {
const cached = await redis.get(key);
if (cached) {
await redis.hincrby("ai:stats", "cache_hits", 1);
return cached;
}
await redis.hincrby("ai:stats", "cache_misses", 1);
return null;
} catch {
// Cache failures should never block AI responses
return null;
}
}
export async function cacheResponse(
messages: Array<{ role: string; content: string }>,
model: string,
temperature: number,
response: string,
ttl = 86400 // 24 hours default
): Promise<void> {
if (temperature > 0.1) return;
const key = `ai:cache:${hashPrompt(messages, model, temperature)}`;
try {
await redis.setex(key, ttl, response);
} catch {
// Silent failure — caching is optimization, not requirement
}
}

Important: only cache when temperature is near zero. With higher temperatures, the same prompt should produce different responses. Serving cached responses for a creative writing feature would make it feel broken.
TTL strategy matters. I use different TTLs per feature: long-lived for content that rarely changes, short-lived for anything derived from fast-moving data.
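Concretely, that can be as simple as a per-feature lookup table. The values below are illustrative, not my production numbers — tune them to how often the underlying content actually changes:

```typescript
// Illustrative per-feature cache TTLs in seconds. The principle:
// match the TTL to the content's rate of change, not one global default.
const cacheTTLs: Record<string, number> = {
  "tool-description": 7 * 86400, // stable content, regenerate weekly
  "code-review": 3600, // code changes constantly, keep it short
  "content-moderation": 30 * 86400, // identical text classifies identically
};

export function ttlFor(feature: string, fallback = 86400): number {
  return cacheTTLs[feature] ?? fallback;
}
```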
For some features, exact-match caching is not enough. "How do I center a div?" and "centering a div in CSS" should hit the same cache. This is where embeddings come in:
// src/lib/ai/semantic-cache.ts
import { embed } from "ai";
import { openai } from "@ai-sdk/openai";
const SIMILARITY_THRESHOLD = 0.92;
export async function getSemanticallyCachedResponse(
query: string
): Promise<string | null> {
const { embedding } = await embed({
model: openai.embedding("text-embedding-3-small"),
value: query,
});
// Search for similar cached queries using cosine similarity
// In production, use a vector DB (pgvector, Pinecone, etc.)
const candidates = await vectorStore.search(embedding, {
topK: 1,
threshold: SIMILARITY_THRESHOLD,
});
if (candidates.length > 0) {
return candidates[0].metadata.response;
}
return null;
}

Semantic caching is powerful but adds latency (the embedding call itself takes 50-100ms) and complexity. I only use it for features with high query repetition and expensive underlying model calls. For most features, exact-match caching gets you 80% of the benefit with 20% of the complexity.
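For completeness, here is the similarity math behind SIMILARITY_THRESHOLD. Vector databases compute this natively, but a plain implementation helps when you are debugging why a query did or did not hit the cache:

```typescript
// Cosine similarity: 1 means identical direction, 0 means orthogonal.
// A threshold of 0.92 demands near-identical meaning between queries.
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```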
Here is the thing nobody tells you until the bill arrives: LLM costs scale with usage in a way that traditional compute costs do not. A busy API endpoint might cost you $20/month in compute. The same endpoint backed by an LLM can cost $20/day.
Every feature in your application should have a token budget:
// src/lib/ai/budget.ts
type FeatureBudget = {
maxInputTokens: number;
maxOutputTokens: number;
maxRequestsPerUser: number;
maxRequestsPerDay: number;
costPerRequest: number; // estimated, in cents
dailyBudgetCents: number;
};
const budgets: Record<string, FeatureBudget> = {
"tool-description": {
maxInputTokens: 500,
maxOutputTokens: 200,
maxRequestsPerUser: 50,
maxRequestsPerDay: 5000,
costPerRequest: 0.02,
dailyBudgetCents: 100, // $1/day
},
"code-review": {
maxInputTokens: 8000,
maxOutputTokens: 2000,
maxRequestsPerUser: 10,
maxRequestsPerDay: 1000,
costPerRequest: 0.8,
dailyBudgetCents: 800, // $8/day
},
"content-moderation": {
maxInputTokens: 2000,
maxOutputTokens: 100,
maxRequestsPerUser: 200,
maxRequestsPerDay: 50000,
costPerRequest: 0.005,
dailyBudgetCents: 250, // $2.50/day
},
};
export async function checkBudget(
feature: string,
userId: string
): Promise<{ allowed: boolean; reason?: string }> {
const budget = budgets[feature];
if (!budget) return { allowed: false, reason: "Unknown feature" };
const today = new Date().toISOString().slice(0, 10);
// Check per-user limit
const userKey = `ai:usage:${feature}:user:${userId}:${today}`;
const userCount = await redis.incr(userKey);
if (userCount === 1) await redis.expire(userKey, 86400);
if (userCount > budget.maxRequestsPerUser) {
return { allowed: false, reason: "Daily user limit reached" };
}
// Check global daily limit
const globalKey = `ai:usage:${feature}:global:${today}`;
const globalCount = await redis.incr(globalKey);
if (globalCount === 1) await redis.expire(globalKey, 86400);
if (globalCount > budget.maxRequestsPerDay) {
return { allowed: false, reason: "Feature daily limit reached" };
}
// Check cost budget
const costKey = `ai:cost:${feature}:${today}`;
const currentCost = parseFloat((await redis.get(costKey)) ?? "0");
if (currentCost > budget.dailyBudgetCents) {
return { allowed: false, reason: "Daily cost budget exceeded" };
}
return { allowed: true };
}
export async function recordUsage(
feature: string,
inputTokens: number,
outputTokens: number,
model: string
): Promise<void> {
const cost = calculateCost(model, inputTokens, outputTokens);
const today = new Date().toISOString().slice(0, 10);
const costKey = `ai:cost:${feature}:${today}`;
await redis.incrbyfloat(costKey, cost);
// Also track aggregate stats
await redis.hincrby("ai:stats:tokens", `${feature}:input`, inputTokens);
await redis.hincrby("ai:stats:tokens", `${feature}:output`, outputTokens);
}

The cost trap with streaming: when you stream, you often do not know the total output tokens until the stream is complete. You have to track this post-hoc, which means your budget checks are always slightly behind reality. Build in a 20% buffer.
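The `calculateCost` helper referenced in `recordUsage` is just a rate-table lookup. A sketch — the per-million-token rates below are placeholders, so check your provider's current price sheet before trusting any of them:

```typescript
// Per-million-token rates in cents. PLACEHOLDER numbers — providers
// change pricing; load these from config, not hardcoded constants.
const rates: Record<string, { input: number; output: number }> = {
  "gpt-4o-mini": { input: 15, output: 60 },
  "claude-3-5-haiku-20241022": { input: 80, output: 400 },
};

export function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const rate = rates[model];
  if (!rate) return 0; // unknown model: record zero, alert separately
  return (
    (inputTokens / 1_000_000) * rate.input +
    (outputTokens / 1_000_000) * rate.output
  );
}
```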
Not every request needs your most expensive model. A classification task (is this spam? what category is this?) can use the cheapest model available. A code review needs something smarter. Build this intelligence into your model selection:
function selectModel(
feature: string,
estimatedInputTokens: number
): ModelConfig {
const budget = budgets[feature];
const today = new Date().toISOString().slice(0, 10);
// If we're over 80% of daily budget, downgrade to cheaper model
const currentCost = getCachedCost(feature, today);
if (currentCost > budget.dailyBudgetCents * 0.8) {
return models.fast; // Always fall back to cheapest
}
// Short inputs with simple tasks -> fast model
if (estimatedInputTokens < 500 && isSimpleTask(feature)) {
return models.fast;
}
return models[prompts[feature].model];
}

This adaptive model selection saved me roughly 35% on monthly costs. The trick is being honest about which tasks actually need the expensive model. Most of them don't.
LLM APIs fail in ways that traditional APIs do not: rate limits, provider overloads, timeouts mid-stream, and content-policy refusals. You need to handle all of these:
// src/lib/ai/errors.ts
export async function withAIRetry<T>(
fn: () => Promise<T>,
options: {
maxRetries?: number;
feature: string;
fallback?: () => Promise<T>;
}
): Promise<T> {
const maxRetries = options.maxRetries ?? 3;
let lastError: Error | null = null;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error: any) {
lastError = error;
// Don't retry on client errors (except rate limits)
if (error.status >= 400 && error.status < 500 && error.status !== 429) {
break;
}
// Rate limit: respect Retry-After header
if (error.status === 429) {
const retryAfter = parseInt(
error.headers?.["retry-after"] ?? "5",
10
);
await sleep(retryAfter * 1000);
continue;
}
// Exponential backoff for other errors
if (attempt < maxRetries) {
await sleep(Math.pow(2, attempt) * 1000);
}
}
}
// All retries failed — try fallback model
if (options.fallback) {
try {
return await options.fallback();
} catch (fallbackError) {
// Fallback also failed — log and throw
console.error(
`AI fallback failed for ${options.feature}:`,
fallbackError
);
}
}
throw lastError;
}
function sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}

This is one of the most valuable patterns I implemented. When your primary model provider has an outage (and they will — I have seen every major provider go down at least once in the past year), your AI features should not disappear entirely:
// src/lib/ai/fallback.ts
import { generateText } from "ai";
import { getModel, type ModelConfig } from "@/lib/ai/provider";
export async function generateWithFallback(
config: ModelConfig,
params: {
messages: Array<{ role: string; content: string }>;
maxTokens: number;
temperature: number;
}
) {
try {
return await generateText({
model: getModel(config),
...params,
});
} catch (error: any) {
if (config.fallback && shouldFallback(error)) {
console.warn(
`Primary model ${config.model} failed, falling back to ${config.fallback.model}`
);
return await generateText({
model: getModel(config.fallback),
...params,
});
}
throw error;
}
}
function shouldFallback(error: any): boolean {
// Fallback on server errors, timeouts, and rate limits
// Don't fallback on client errors (bad prompt, content policy)
return (
error.status >= 500 ||
error.status === 429 ||
error.code === "ETIMEDOUT" ||
error.code === "ECONNRESET"
);
}

I run with dual providers (different companies). When one is down, the other usually isn't. This has saved me from multiple outage-related incidents where my AI features would have been completely unavailable.
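Failover works better with a circuit breaker in front of each provider, so you stop hammering one that is clearly down. A minimal in-memory sketch — this assumes a single server instance; share the state via Redis or similar once you run more than one:

```typescript
// Minimal circuit breaker: after `threshold` consecutive failures,
// skip the provider for `cooldownMs`, then allow a probe request.
export class ProviderBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(
    private threshold = 5,
    private cooldownMs = 30_000
  ) {}
  isOpen(now = Date.now()): boolean {
    if (this.failures < this.threshold) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      this.failures = 0; // half-open: let one request probe the provider
      return false;
    }
    return true;
  }
  recordFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures === this.threshold) this.openedAt = now;
  }
  recordSuccess(): void {
    this.failures = 0;
  }
}
```

Check `isOpen()` before calling the primary; if it is open, go straight to the fallback config instead of waiting for another timeout.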
LLMs return text. Your application needs structured data. This mismatch is the source of endless pain.
You ask the model to return JSON. Sometimes it does. Sometimes it wraps it in a markdown code block. Sometimes it adds a chatty preamble. Sometimes the JSON is almost valid but has a trailing comma. You need robust parsing:
// src/lib/ai/parse.ts
import { z } from "zod";
export function extractJSON<T>(
text: string,
schema: z.ZodSchema<T>
): { success: true; data: T } | { success: false; error: string } {
// Try direct parse first
try {
const parsed = JSON.parse(text);
const result = schema.safeParse(parsed);
if (result.success) return { success: true, data: result.data };
} catch {
// Not direct JSON, try extraction
}
// Extract from markdown code blocks
const codeBlockMatch = text.match(/```(?:json)?\s*\n?([\s\S]*?)```/);
if (codeBlockMatch) {
try {
const parsed = JSON.parse(codeBlockMatch[1].trim());
const result = schema.safeParse(parsed);
if (result.success) return { success: true, data: result.data };
} catch {
// Malformed JSON in code block
}
}
// Try to find JSON object/array in text
const jsonMatch = text.match(/(\{[\s\S]*\}|\[[\s\S]*\])/);
if (jsonMatch) {
try {
// Fix common LLM JSON mistakes. Note the quote swap is crude —
// it corrupts strings containing apostrophes — so treat this
// branch as a last resort
const fixed = jsonMatch[1]
.replace(/,\s*}/g, "}") // trailing commas in objects
.replace(/,\s*]/g, "]") // trailing commas in arrays
.replace(/'/g, '"'); // single quotes to double
const parsed = JSON.parse(fixed);
const result = schema.safeParse(parsed);
if (result.success) return { success: true, data: result.data };
} catch {
// Still not valid
}
}
return {
success: false,
error: `Failed to extract valid JSON from response: ${text.slice(0, 200)}...`,
};
}

Better approach: use the model provider's structured output features when available. Both OpenAI and Anthropic now support JSON mode and tool use that constrains the output format. This is far more reliable than parsing free text:
import { generateObject } from "ai";
import { z } from "zod";
const moderationSchema = z.object({
safe: z.boolean(),
category: z.enum([
"safe",
"harassment",
"hate",
"sexual",
"violence",
"self-harm",
"illegal",
]),
confidence: z.number().min(0).max(1),
reason: z.string().max(100),
});
const { object } = await generateObject({
model: getModel(models.fast),
schema: moderationSchema,
prompt: `Classify this text: ${sanitizedInput}`,
});
// object is fully typed and validated
console.log(object.safe, object.category);

This structured output approach eliminates an entire class of parsing bugs. Use it whenever your output has a known schema.
RAG is the pattern where you retrieve relevant context from your own data and include it in the prompt. It is how you make LLMs answer questions about your specific content without fine-tuning.
User Query
  |
  v
[Embed Query] --> [Vector Search] --> [Retrieve Top-K Documents]
  |
  v
[Build Prompt with Retrieved Context]
  |
  v
[LLM Generates Answer Grounded in Your Data]
// src/lib/ai/rag.ts
import { embed, generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { prisma } from "@/lib/prisma";
import { getModel, models } from "@/lib/ai/provider";
type Document = {
id: string;
content: string;
metadata: Record<string, string>;
similarity: number;
};
export async function queryWithRAG(
query: string,
options: {
collection: string;
topK?: number;
minSimilarity?: number;
}
): Promise<{ answer: string; sources: Document[] }> {
const { topK = 5, minSimilarity = 0.7 } = options;
// 1. Embed the query
const { embedding } = await embed({
model: openai.embedding("text-embedding-3-small"),
value: query,
});
// 2. Vector search (using pgvector here, but Pinecone/Weaviate work too)
const documents = await prisma.$queryRaw<Document[]>`
SELECT
id,
content,
metadata,
1 - (embedding <=> ${embedding}::vector) as similarity
FROM documents
WHERE collection = ${options.collection}
AND 1 - (embedding <=> ${embedding}::vector) > ${minSimilarity}
ORDER BY embedding <=> ${embedding}::vector
LIMIT ${topK}
`;
if (documents.length === 0) {
return {
answer:
"I don't have enough information to answer that question accurately.",
sources: [],
};
}
// 3. Build context-augmented prompt
const context = documents
.map((d, i) => `[Source ${i + 1}]: ${d.content}`)
.join("\n\n");
const { text } = await generateText({
model: getModel(models.quality),
messages: [
{
role: "system",
content: `Answer the user's question based ONLY on the provided sources.
If the sources don't contain enough information, say so.
Cite sources using [Source N] notation.
Do not make up information not present in the sources.`,
},
{
role: "user",
content: `Sources:\n${context}\n\nQuestion: ${query}`,
},
],
});
return { answer: text, sources: documents };
}

Chunk size matters enormously. Too small (100 tokens) and you lose context. Too large (2000 tokens) and you waste context window space on irrelevant text. I settled on 500-800 tokens with 100-token overlap between chunks. But this varies by content type — code needs larger chunks than prose.
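A word-based approximation of that chunking strategy looks like this. It is a sketch: words stand in for tokens, and a real tokenizer (or sentence-boundary-aware splitting) gives better chunks in practice:

```typescript
// Split text into overlapping chunks. Sizes are in words as a cheap
// proxy for tokens; the overlap preserves context across boundaries.
export function chunkText(
  text: string,
  chunkSize = 600,
  overlap = 100
): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= chunkSize) {
    return words.length ? [words.join(" ")] : [];
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window covered the tail
  }
  return chunks;
}
```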
Embedding models have blind spots. Short queries like "auth" or "deploy" embed poorly because there is not enough semantic signal. For short queries, I augment with keyword search (BM25) and merge the results.
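Merging the vector and BM25 result lists is the fiddly part, because their scores are not comparable. Reciprocal rank fusion sidesteps that by using only ranks — a sketch, with k=60 as the conventional constant from the RRF literature:

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
// per document id. Robust when the underlying scores (BM25 vs cosine
// similarity) live on incomparable scales.
export function reciprocalRankFusion(
  rankings: string[][],
  k = 60
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```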
"Based only on the provided sources" is not as strong a guardrail as you think. Models will still hallucinate if the sources are tangentially related but don't actually contain the answer. You need to validate the output, not just trust the instruction.
Freshness is a real problem. If you embed your documentation once and never update it, your RAG system will serve outdated answers. I run re-embedding on a schedule — daily for frequently changing content, weekly for stable content.
LLM latency is brutal. A typical request involves the network round trip to the provider, time queued before the model starts, time to first token, and then token-by-token generation — several seconds end to end.
You cannot make the model think faster. But you can optimize everything around it.
If a feature needs multiple LLM calls, run them in parallel:
// Bad: sequential calls (6 seconds total)
const summary = await generateText({ ... });
const keywords = await generateText({ ... });
const sentiment = await generateText({ ... });
// Good: parallel calls (2 seconds total)
const [summary, keywords, sentiment] = await Promise.all([
generateText({ model: getModel(models.fast), prompt: summaryPrompt }),
generateText({ model: getModel(models.fast), prompt: keywordsPrompt }),
generateText({ model: getModel(models.fast), prompt: sentimentPrompt }),
]);

For features where you can predict the next user action, start the LLM call before the user explicitly requests it:
// When user starts typing code, speculatively begin analysis
// (debounce from lodash-es; speculativeResultRef is a useRef
// holding the in-flight promise)
const debouncedAnalysis = useMemo(
() =>
debounce(async (code: string) => {
if (code.length > 50) {
// Start analysis in background
speculativeResultRef.current = generateText({
model: getModel(models.fast),
prompt: buildAnalysisPrompt(code),
});
}
}, 2000),
[]
);

This makes the response feel instant when the user clicks "Analyze" because the work already started 2 seconds ago.
For predictable queries (e.g., tool descriptions that will be needed on page load), generate and cache them ahead of time rather than on-demand:
// scripts/prefetch-descriptions.ts
// Run this as a cron job, not on every request
async function prefetchToolDescriptions() {
const tools = await getAllTools();
for (const tool of tools) {
const cacheKey = `ai:desc:${tool.slug}`;
const cached = await redis.get(cacheKey);
if (cached) continue;
const { text } = await generateText({
model: getModel(models.fast),
prompt: buildDescriptionPrompt(tool),
});
await redis.setex(cacheKey, 604800, text); // 7 days
// Respect rate limits
await sleep(200);
}
}

The UX around AI features matters more than the AI itself. A mediocre model with great UX beats a great model with terrible UX every time.
Do not just dump streaming text into a container. Parse it incrementally and render it properly:
// src/components/ai/StreamingMarkdown.tsx
"use client";
import { memo, useMemo } from "react";
import ReactMarkdown from "react-markdown";
import { Prism as SyntaxHighlighter } from "react-syntax-highlighter";
interface StreamingMarkdownProps {
content: string;
isStreaming: boolean;
}
export const StreamingMarkdown = memo(function StreamingMarkdown({
content,
isStreaming,
}: StreamingMarkdownProps) {
// Avoid re-parsing markdown on every token
const stableContent = useMemo(() => {
if (!isStreaming) return content;
// While streaming, ensure we don't render incomplete
// markdown that causes layout shifts
const lines = content.split("\n");
const lastLine = lines[lines.length - 1];
// If the last line looks like an incomplete code block, don't render it yet
if (lastLine?.startsWith("```") && !content.endsWith("```")) {
return lines.slice(0, -1).join("\n");
}
return content;
}, [content, isStreaming]);
return (
<div className="prose prose-neutral dark:prose-invert max-w-none">
<ReactMarkdown
components={{
code({ className, children, ...props }) {
const match = /language-(\w+)/.exec(className ?? "");
if (match) {
return (
<SyntaxHighlighter language={match[1]}>
{String(children).replace(/\n$/, "")}
</SyntaxHighlighter>
);
}
return (
<code className={className} {...props}>
{children}
</code>
);
},
}}
>
{stableContent}
</ReactMarkdown>
{isStreaming && (
<span className="inline-block w-2 h-4 bg-current animate-pulse ml-0.5" />
)}
</div>
);
});

The blinking cursor matters. That tiny animated cursor at the end of streaming text tells the user "I'm still working." Without it, users don't know if the response is done or frozen. It is a small detail with a massive impact on perceived quality.
When the AI fails, do not show "Something went wrong." Tell the user something useful:
function getErrorMessage(error: any, feature: string): string {
if (error.status === 429) {
return "This feature is temporarily busy. Please try again in a few seconds.";
}
if (error.status === 413 || error.message?.includes("token")) {
return "Your input is too long for AI analysis. Try with a shorter text.";
}
if (error.code === "ETIMEDOUT") {
return "The AI took too long to respond. This sometimes happens with complex requests. Please try again.";
}
if (error.message?.includes("content_policy")) {
return "This content cannot be processed by our AI. Please modify your input.";
}
return "AI analysis is temporarily unavailable. Your other tools still work fine.";
}

Notice the last message: "Your other tools still work fine." AI features should degrade gracefully. If the LLM is down, the rest of your application should be completely unaffected. Never let an AI feature outage take down non-AI features.
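Graceful degradation is easier to enforce with one small wrapper than with try/catch scattered through every route. A minimal sketch — the `withAIFallback` helper and its signature are my own, not from any SDK:

```typescript
// Hypothetical helper: run an AI call, but never let its failure
// propagate into the rest of the request. If the call throws (timeout,
// rate limit, provider outage), return a predefined fallback instead.
async function withAIFallback<T>(
  aiCall: () => Promise<T>,
  fallback: T,
  onError?: (error: unknown) => void
): Promise<T> {
  try {
    return await aiCall();
  } catch (error) {
    // Log for telemetry, but swallow the error: the page still renders.
    onError?.(error);
    return fallback;
  }
}
```

The non-AI parts of the page never see the exception; they just receive the fallback (an empty summary, a static description) and render normally.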
// src/components/ai/AILoadingState.tsx
export function AILoadingState({ feature }: { feature: string }) {
const estimates: Record<string, string> = {
"code-review": "10-20 seconds",
"tool-description": "2-5 seconds",
summarize: "5-10 seconds",
};
return (
<div className="flex items-center gap-3 text-muted-foreground">
<div className="flex gap-1">
<span className="w-2 h-2 bg-current rounded-full animate-bounce" />
<span className="w-2 h-2 bg-current rounded-full animate-bounce [animation-delay:0.1s]" />
<span className="w-2 h-2 bg-current rounded-full animate-bounce [animation-delay:0.2s]" />
</div>
<span>
Analyzing... This usually takes {estimates[feature] ?? "a few seconds"}.
</span>
</div>
);
}

Setting expectations upfront ("this usually takes 10-20 seconds") dramatically reduces perceived wait time. Users who know to expect a wait are far more patient than users who are wondering if the feature is broken.
Testing non-deterministic systems is hard. But "it's non-deterministic" is not an excuse to skip testing. Here is how I approach it:
Test everything around the LLM call deterministically. The prompt construction, the response parsing, the error handling, the caching logic — all of this is deterministic and should have thorough unit tests:
// src/lib/ai/__tests__/parse.test.ts
import { describe, it, expect } from "vitest";
import { extractJSON } from "../parse";
import { z } from "zod";
const schema = z.object({
safe: z.boolean(),
category: z.string(),
});
describe("extractJSON", () => {
it("parses direct JSON", () => {
const result = extractJSON('{"safe": true, "category": "safe"}', schema);
expect(result.success).toBe(true);
});
it("extracts JSON from markdown code block", () => {
const input = `Here is my analysis:\n\`\`\`json\n{"safe": true, "category": "safe"}\n\`\`\``;
const result = extractJSON(input, schema);
expect(result.success).toBe(true);
});
it("handles trailing commas", () => {
const input = '{"safe": true, "category": "safe",}';
const result = extractJSON(input, schema);
expect(result.success).toBe(true);
});
it("fails gracefully on garbage input", () => {
const result = extractJSON("I cannot help with that.", schema);
expect(result.success).toBe(false);
});
});

When you change a prompt, you want to know. Snapshot tests catch unintended prompt modifications:
describe("prompts", () => {
it("tool-description prompt matches snapshot", () => {
const prompt = getPrompt("tool-description", {
toolName: "JSON Formatter",
toolCategory: "developer",
existingDescription: "",
});
expect(prompt.messages[0].content).toMatchSnapshot();
});
});

For critical features, I run periodic integration tests against the actual API. Not on every commit — that would be expensive and slow — but on a daily schedule:
// src/lib/ai/__tests__/integration.test.ts
// Only runs with AI_INTEGRATION_TESTS=true
describe.skipIf(!process.env.AI_INTEGRATION_TESTS)(
"AI integration",
() => {
it("moderation correctly flags harmful content", async () => {
const result = await moderate("I will hurt someone");
expect(result.safe).toBe(false);
expect(result.confidence).toBeGreaterThan(0.8);
}, 30000);
it("moderation allows benign content", async () => {
const result = await moderate("How do I center a div in CSS?");
expect(result.safe).toBe(true);
}, 30000);
it("structured output matches schema", async () => {
const result = await analyzeCode(
"function add(a, b) { return a + b; }",
"javascript"
);
expect(result).toHaveProperty("issues");
expect(Array.isArray(result.issues)).toBe(true);
}, 60000);
}
);

For any AI feature that matters, build an evaluation set — a collection of inputs with expected outputs that you can run periodically to catch regressions:
const evaluationSet = [
{
input: "A React hook for fetching data",
expectedCategory: "developer",
expectedContains: ["hook", "fetch", "React"],
expectedNotContains: ["revolutionary", "powerful"],
},
{
input: "Convert PDF to Word document",
expectedCategory: "converter",
expectedContains: ["PDF", "Word"],
},
];
async function runEvaluation() {
const results = [];
for (const testCase of evaluationSet) {
const response = await generateDescription(testCase.input);
const passed =
testCase.expectedContains?.every((word) =>
response.toLowerCase().includes(word.toLowerCase())
) &&
!testCase.expectedNotContains?.some((word) =>
response.toLowerCase().includes(word.toLowerCase())
);
results.push({
input: testCase.input,
response,
passed,
});
}
const passRate = results.filter((r) => r.passed).length / results.length;
console.log(`Evaluation pass rate: ${(passRate * 100).toFixed(1)}%`);
if (passRate < 0.85) {
// Alert: prompt may have regressed
await sendAlert(`AI evaluation pass rate dropped to ${passRate}`);
}
}

If your AI feature processes user input, you need moderation. Not optional. Not "we'll add it later." Now.
// src/lib/ai/moderation.ts
export async function moderateInput(input: string): Promise<{
allowed: boolean;
reason?: string;
}> {
// Layer 1: Keyword blocklist (fast, catches obvious cases)
const blocklist = getBlocklist();
for (const term of blocklist) {
if (input.toLowerCase().includes(term)) {
return { allowed: false, reason: "blocked_content" };
}
}
// Layer 2: Prompt injection detection
const injection = detectPromptInjection(input);
if (injection.suspicious) {
return { allowed: false, reason: "prompt_injection" };
}
// Layer 3: AI-based moderation (most accurate, but costs money)
const moderation = await classifyContent(input);
if (!moderation.safe && moderation.confidence > 0.8) {
return { allowed: false, reason: moderation.category };
}
return { allowed: true };
}

Layer your defenses. The keyword blocklist catches the obvious stuff for free. Prompt injection detection catches manipulation attempts. AI-based moderation catches nuanced cases. Each layer has different cost/accuracy tradeoffs.
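The `detectPromptInjection` helper referenced above can start as simple pattern matching. A heuristic sketch — the patterns here are illustrative, not exhaustive, and determined attackers will get past regexes, which is exactly why the AI-based layer sits behind it:

```typescript
interface InjectionCheck {
  suspicious: boolean;
  matchedPattern?: string;
}

// Heuristic layer only: cheap regexes that catch the most common
// manipulation phrasings before spending money on AI-based moderation.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all |any )?(previous|prior|above) (instructions|prompts)/i,
  /disregard (your|the) (system )?(prompt|instructions)/i,
  /you are now (a|an|in) /i,
  /reveal (your|the) (system )?prompt/i,
  /repeat (your|the) (instructions|system prompt)/i,
];

function detectPromptInjection(input: string): InjectionCheck {
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(input)) {
      return { suspicious: true, matchedPattern: pattern.source };
    }
  }
  return { suspicious: false };
}
```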
Moderate outputs too. Even with a perfect system prompt, models can sometimes generate inappropriate content. Run the output through at minimum a basic check before showing it to users:
export function sanitizeOutput(text: string): string {
// Remove any system prompt leakage
const systemLeakPatterns = [
/you are a .* assistant/gi,
/as an ai/gi,
/i'?m (just )?an? ai/gi,
/my (system )?instructions/gi,
];
let cleaned = text;
for (const pattern of systemLeakPatterns) {
cleaned = cleaned.replace(pattern, "");
}
return cleaned.trim();
}

If you are building any kind of search, recommendation, or RAG feature, you need an embedding pipeline. Here is a minimal but production-ready setup:
// src/lib/ai/embeddings.ts
import { embed, embedMany } from "ai";
import { openai } from "@ai-sdk/openai";
const embeddingModel = openai.embedding("text-embedding-3-small");
export async function embedDocument(
content: string,
metadata: Record<string, string>
): Promise<void> {
const chunks = splitIntoChunks(content, {
maxTokens: 600,
overlap: 100,
});
const { embeddings } = await embedMany({
model: embeddingModel,
values: chunks.map((c) => c.text),
});
// Store in database with pgvector
for (let i = 0; i < chunks.length; i++) {
await prisma.embedding.create({
data: {
content: chunks[i].text,
embedding: embeddings[i],
metadata,
chunkIndex: i,
totalChunks: chunks.length,
},
});
}
}
function splitIntoChunks(
text: string,
options: { maxTokens: number; overlap: number }
): Array<{ text: string; startOffset: number }> {
const { maxTokens, overlap } = options;
const chunks: Array<{ text: string; startOffset: number }> = [];
// Split on paragraph boundaries when possible
const paragraphs = text.split(/\n\n+/);
let currentChunk = "";
let currentOffset = 0;
let chunkStartOffset = 0;
for (const paragraph of paragraphs) {
const combined = currentChunk
? `${currentChunk}\n\n${paragraph}`
: paragraph;
const estimatedTokens = Math.ceil(combined.length / 4);
if (estimatedTokens > maxTokens && currentChunk) {
chunks.push({
text: currentChunk.trim(),
startOffset: chunkStartOffset,
});
// Overlap: keep last portion of previous chunk
// ~4 characters per token, matching the estimate above
const overlapText = currentChunk.slice(-(overlap * 4));
currentChunk = overlapText + "\n\n" + paragraph;
chunkStartOffset = currentOffset - overlapText.length;
} else {
if (!currentChunk) chunkStartOffset = currentOffset;
currentChunk = combined;
}
currentOffset += paragraph.length + 2;
}
if (currentChunk.trim()) {
chunks.push({
text: currentChunk.trim(),
startOffset: chunkStartOffset,
});
}
return chunks;
}

Embedding model choice matters less than you think. I have tested several embedding models, and for most practical use cases, the quality difference is marginal. The cheap, fast model is usually good enough. Save your money for the generative model.
Re-embedding is expensive. If you have 100,000 documents and need to re-embed them because you changed models, that is a significant cost and time investment. Choose your embedding model carefully upfront, and do not switch casually.
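On the retrieval side, the database usually does the similarity math (pgvector's distance operators), but it helps to understand what it computes. A minimal in-memory cosine similarity sketch, useful for reranking a shortlist of candidates after the database query — the `rankBySimilarity` helper is my own naming:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). 1 means same direction,
// 0 means orthogonal. Only compare vectors from the same embedding model.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunks against a query embedding, highest similarity first.
function rankBySimilarity<T extends { embedding: number[] }>(
  query: number[],
  candidates: T[]
): Array<T & { score: number }> {
  return candidates
    .map((c) => ({ ...c, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score);
}
```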
After a year of building AI features, here is my honest assessment:
Works well:
Overhyped:
Underrated:
You cannot improve what you cannot measure. Every AI feature needs:
// src/lib/ai/telemetry.ts
type AITelemetry = {
feature: string;
model: string;
inputTokens: number;
outputTokens: number;
latencyMs: number;
cached: boolean;
fallbackUsed: boolean;
error?: string;
userSatisfaction?: "positive" | "negative";
};
export async function trackAICall(telemetry: AITelemetry): Promise<void> {
// Real-time metrics for alerting
await redis.lpush(
"ai:telemetry:recent",
JSON.stringify({ ...telemetry, timestamp: Date.now() })
);
await redis.ltrim("ai:telemetry:recent", 0, 9999);
// Aggregate stats for dashboards
const today = new Date().toISOString().slice(0, 10);
const prefix = `ai:metrics:${today}:${telemetry.feature}`;
await Promise.all([
redis.hincrby(prefix, "requests", 1),
redis.hincrby(prefix, "input_tokens", telemetry.inputTokens),
redis.hincrby(prefix, "output_tokens", telemetry.outputTokens),
redis.hincrby(prefix, "total_latency_ms", telemetry.latencyMs),
redis.hincrby(prefix, telemetry.cached ? "cache_hits" : "cache_misses", 1),
...(telemetry.error ? [redis.hincrby(prefix, "errors", 1)] : []),
...(telemetry.fallbackUsed ? [redis.hincrby(prefix, "fallbacks", 1)] : []),
]);
await redis.expire(prefix, 2592000); // 30 days
}

The metrics I watch daily: cost per feature (input and output tokens), average latency, cache hit rate, error rate, fallback usage, and the ratio of positive to negative user feedback.
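The daily hash written by `trackAICall` holds raw counters; the numbers worth alerting on are derived from them. A small sketch of the read side — the field names match the `hincrby` calls above, and the hash comes back from `HGETALL` as strings:

```typescript
interface FeatureMetrics {
  requests: number;
  cacheHitRate: number; // 0..1
  errorRate: number; // 0..1
  avgLatencyMs: number;
  fallbackRate: number; // 0..1
}

// `raw` is the HGETALL result for one ai:metrics:<date>:<feature> hash.
function deriveMetrics(raw: Record<string, string>): FeatureMetrics {
  const n = (field: string) => Number(raw[field] ?? 0);
  const requests = n("requests");
  const hits = n("cache_hits");
  const misses = n("cache_misses");
  return {
    requests,
    cacheHitRate: hits + misses > 0 ? hits / (hits + misses) : 0,
    errorRate: requests > 0 ? n("errors") / requests : 0,
    avgLatencyMs: requests > 0 ? n("total_latency_ms") / requests : 0,
    fallbackRate: requests > 0 ? n("fallbacks") / requests : 0,
  };
}
```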
After building AI features into production web applications for the past year, these are the things I wish someone had told me on day one:
1. Your first LLM cost estimate is wrong by at least 3x. You will underestimate how many tokens your prompts use, how many requests users make, and how expensive retries are. Budget 3x what your napkin math says, then add a hard spending cap anyway.
2. Streaming is not optional. I shipped a non-streaming AI feature once. User feedback was universally negative. "Is it broken?" "Did it freeze?" "Nothing is happening." Rebuilt it with streaming in two days. Problem solved. Never ship a user-facing LLM feature without streaming again.
3. The LLM is the least reliable part of your stack. It will time out. It will rate limit you. It will occasionally return garbage. It will refuse benign inputs because its safety filters are overzealous. Design every AI feature to degrade gracefully.
4. Prompt engineering is software engineering. Treat prompts like code: version them, test them, review them, monitor their performance. A "small tweak" to a prompt can change the behavior of your feature in ways that are invisible until users report problems.
5. Users do not care that it is AI. They care that it works. They care that it is fast. They care that it does not waste their time. The word "AI" in your feature name does not make it better. The feature being genuinely useful makes it better.
6. Cache everything that can be cached. The single biggest cost and latency improvement I made was implementing aggressive caching. Many AI features receive the same or similar inputs repeatedly. A 60% cache hit rate cuts both your cost and your latency by more than half.
7. Test the edges, not the happy path. Your AI feature works great with well-formed English input. What happens with empty input? With 50,000 characters? With Chinese? With prompt injection attempts? With pure emoji? With HTML? Test the weird stuff.
8. Start with the cheapest model that works. Most developers reach for GPT-4 or Claude Opus first. For 80% of AI features, a smaller and cheaper model works just as well. Start cheap, upgrade only if quality demands it.
9. Multi-provider is not premature optimization. When your only LLM provider goes down for 4 hours on a Tuesday afternoon and your entire AI feature set is dead, your boss will not accept "they had an outage" as an answer. Having a fallback provider is table stakes for production.
10. The AI hype cycle will pass, but the engineering fundamentals will not. Streaming, caching, error handling, cost management, observability — these are not AI-specific skills. They are distributed systems engineering skills applied to a new type of external dependency. Learn them well, and you will be valuable long after the hype settles.
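Takeaway 9 in code: try providers in priority order and fail over on error. A sketch under stated assumptions — the `Provider` shape is my own, and real code would also distinguish retryable errors (timeouts, 429s) from content refusals, which should not trigger failover:

```typescript
type Provider = {
  name: string;
  call: (prompt: string) => Promise<string>;
};

// Try each provider in priority order; the first success wins.
async function generateWithFailover(
  providers: Provider[],
  prompt: string
): Promise<{ text: string; provider: string }> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      const text = await provider.call(prompt);
      return { text, provider: provider.name };
    } catch (error) {
      // Remember the failure and move on to the next provider.
      lastError = error;
    }
  }
  throw new Error(`All providers failed: ${String(lastError)}`);
}
```

Record which provider served each request in your telemetry; a rising fallback rate is often the first sign of a primary-provider incident.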
Building AI features that users love is hard. Not because the AI is hard — the APIs are remarkably simple. It is hard because the engineering around the AI is hard. The reliability, the cost, the latency, the UX, the safety — that is where the real work is. And that is where the real value is too.