From Model Building to LLM Integration

November 7, 2025

What building AI features actually teaches you — start with constraints, treat LLMs like distributed systems, and build for the failure modes.


The constraint comes before the model

The most common mistake in shipping AI features is choosing a model before writing down what "good" actually means. It sounds obvious until you watch teams spend three weeks evaluating GPT-4 vs Claude vs Gemini before they've decided how they'll measure success.

Write the constraints first:

  • Latency budget in milliseconds, not "fast"
  • Cost per call and total monthly estimate
  • Allowed failure modes — what does the UI show when the model is wrong?
  • Data handling rules — can this data leave your infrastructure?
  • One measurable metric the feature must move

With these written down, model selection becomes a lookup, not a deliberation.
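
As a rough illustration of "selection as a lookup", the constraints can be written as a typed object and applied as hard filters before ranking on the one metric. This is a minimal sketch: the field names, the `Candidate` shape, and the `selectModel` helper are hypothetical, and the numbers you feed it would come from your own measurements rather than vendor claims.

```typescript
// Hypothetical shapes for illustration; adapt the fields to your own constraints.
type Constraints = {
    maxLatencyMs: number;        // latency budget, e.g. measured p95
    maxCostPerCallUsd: number;   // cost ceiling per call
    dataMayLeaveInfra: boolean;  // data handling rule
};

type Candidate = {
    name: string;
    p95LatencyMs: number;
    costPerCallUsd: number;
    selfHosted: boolean;
    metricScore: number;         // score on the one metric the feature must move
};

// Drop every candidate that violates a hard constraint,
// then pick the best remaining one by the target metric.
function selectModel(c: Constraints, candidates: Candidate[]): Candidate | undefined {
    return candidates
        .filter(m => m.p95LatencyMs <= c.maxLatencyMs)
        .filter(m => m.costPerCallUsd <= c.maxCostPerCallUsd)
        .filter(m => c.dataMayLeaveInfra || m.selfHosted)
        .sort((a, b) => b.metricScore - a.metricScore)[0];
}
```

If `selectModel` returns `undefined`, no candidate fits, and the constraints (not the model shortlist) are what need revisiting.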

Training your own model: start boring

If you're fine-tuning or training from scratch, start with something you can explain to a non-ML engineer. Logistic regression on engineered features. Gradient boosting. A small pretrained transformer with a classification head.

The baseline tells you how much headroom exists for complex approaches. More often than not, the baseline gets you 80% of the way there and the remaining 20% isn't worth the operational complexity of a larger model.

When you do need more: track every experiment. Features, data version, split strategy, evaluation metrics. The experiment that looked like a dead end in week two is often the thing you need in week six.
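
One lightweight way to track experiments is an append-only log of records that carry exactly those fields. The sketch below is an assumption about what such a log might look like, not a replacement for a dedicated tracking tool; `ExperimentRecord`, `ExperimentLog`, and the "higher is better" ranking are all hypothetical choices.

```typescript
// Hypothetical record shape mirroring the fields listed above.
type ExperimentRecord = {
    id: string;
    features: string[];
    dataVersion: string;
    splitStrategy: string;
    metrics: Record<string, number>;
    notes?: string;
};

class ExperimentLog {
    private records: ExperimentRecord[] = [];

    // Append a record; ids must be unique so old runs are never overwritten.
    add(record: ExperimentRecord): void {
        if (this.records.some(r => r.id === record.id)) {
            throw new Error(`duplicate experiment id: ${record.id}`);
        }
        this.records.push({ ...record });
    }

    // Best run for a metric (assumes higher is better),
    // optionally restricted to a single data version.
    best(metric: string, dataVersion?: string): ExperimentRecord | undefined {
        return this.records
            .filter(r => metric in r.metrics)
            .filter(r => dataVersion === undefined || r.dataVersion === dataVersion)
            .sort((a, b) => b.metrics[metric] - a.metrics[metric])[0];
    }
}
```

Because nothing is ever deleted, the week-two run is still there to query in week six.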

LLMs are non-deterministic third-party services

This mental model matters more than any prompt engineering technique. LLMs fail like APIs fail: rate limits, latency spikes, output format violations, and occasional hallucinations on inputs you didn't anticipate.

Build for that from the start:

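type Answer = { answer: string; citations: string[] }

// Type guard: narrows an unknown model response to Answer.
function isAnswer(x: unknown): x is Answer {
    if (typeof x !== "object" || x === null) return false
    const r = x as Record<string, unknown>
    return typeof r.answer === "string" &&
        Array.isArray(r.citations) &&
        // Check the elements too, not just that citations is an array.
        r.citations.every(c => typeof c === "string")
}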

Validate the response before you use it. Use structured output modes (JSON schema constraints) when the model supports them. Add retries with exponential backoff. Keep a fallback — a smaller model, a cached response, or a graceful degradation path.
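
The validate, retry, and fallback steps can be combined in one small wrapper. This is a sketch under assumptions: `callWithRetry` and `RetryOptions` are hypothetical names, `call` stands in for whatever function hits your model and returns parsed output, and the default attempt count and delays are placeholders to tune against your latency budget.

```typescript
type RetryOptions = {
    maxAttempts: number;  // total attempts, including the first
    baseDelayMs: number;  // delay after the first failure; doubles after each later one
};

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// `call` is a hypothetical function that queries the model and returns parsed output.
// `validate` is a type guard for the expected shape; `fallback` supplies the degraded result.
async function callWithRetry<T>(
    call: () => Promise<unknown>,
    validate: (x: unknown) => x is T,
    fallback: () => T,
    opts: RetryOptions = { maxAttempts: 3, baseDelayMs: 200 },
): Promise<T> {
    for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
        try {
            const result = await call();
            if (validate(result)) return result;
            // Wrong shape: treat it like any other failure and retry.
        } catch {
            // Rate limit, timeout, network error: fall through to the backoff.
        }
        if (attempt < opts.maxAttempts - 1) {
            // Exponential backoff with up to 10% random jitter.
            const delay = opts.baseDelayMs * 2 ** attempt;
            await sleep(delay + Math.random() * delay * 0.1);
        }
    }
    // Every attempt failed or returned an invalid shape.
    return fallback();
}
```

Passing `isAnswer` as `validate` ties this back to the guard above. In a real system you would likely retry only on errors that are worth retrying (for example, not on authentication failures), which this sketch does not distinguish.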

The teams that ship reliable LLM features aren't using better prompts. They're building the same defense-in-depth you'd use for any external dependency.

RAG: retrieval is a system, not a query

Vector search is one component of a retrieval system. In practice, the quality of retrieved context often depends more on your chunking strategy and metadata filtering than on the choice of embedding model or vector database.

Things that actually matter:

  • Chunk size and overlap relative to your expected query pattern
  • Whether you index by semantic meaning or by section structure (sometimes both)
  • Filtering by tenant, permission, and document type before semantic search runs
  • Evaluating retrieval separately from generation — bad retrieval makes good generation impossible
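
Two of the points above, overlapping chunks and filtering before semantic search, can be sketched in a few lines. Everything here is illustrative: the `Chunk` shape, `chunkText`, and `retrieveTopK` are hypothetical, chunking is by character count for simplicity (token or section boundaries are common alternatives), and the in-memory cosine ranking stands in for whatever vector store you actually use.

```typescript
// Split text into fixed-size character windows that overlap by `overlap` characters.
function chunkText(text: string, size: number, overlap: number): string[] {
    if (size <= 0 || overlap < 0 || overlap >= size) {
        throw new Error("require size > 0 and 0 <= overlap < size");
    }
    const chunks: string[] = [];
    const step = size - overlap;
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + size));
        if (start + size >= text.length) break; // this window reached the end
    }
    return chunks;
}

// Hypothetical chunk record: text plus the metadata used for filtering.
type Chunk = { text: string; tenantId: string; docType: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Apply metadata filters first, then rank only the surviving chunks by similarity.
function retrieveTopK(
    query: number[],
    chunks: Chunk[],
    filter: { tenantId: string; docTypes?: string[] },
    k: number,
): Chunk[] {
    return chunks
        .filter(c => c.tenantId === filter.tenantId)
        .filter(c => !filter.docTypes || filter.docTypes.includes(c.docType))
        .map(c => ({ chunk: c, score: cosine(query, c.embedding) }))
        .sort((x, y) => y.score - x.score)
        .slice(0, k)
        .map(x => x.chunk);
}
```

Filtering before ranking means a chunk from another tenant can never appear in the results, no matter how similar its embedding is to the query.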

If you're debugging a RAG pipeline and the answers are wrong, check what's being retrieved before touching the prompt.
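
Checking retrieval on its own usually means a small labeled set of queries and a metric such as recall@k. The following is a minimal sketch, assuming you have hand-labeled which document ids are relevant for each query; `RetrievalCase`, `recallAtK`, and the `retrieve` callback signature are hypothetical.

```typescript
// Hypothetical labeled case: a query and the ids a good retriever should return.
type RetrievalCase = { query: string; relevantIds: string[] };

// Mean recall@k: for each case, the fraction of relevant ids found in the top k results.
// Cases with no labeled relevant ids are skipped.
function recallAtK(
    cases: RetrievalCase[],
    retrieve: (query: string, k: number) => string[],
    k: number,
): number {
    const labeled = cases.filter(c => c.relevantIds.length > 0);
    if (labeled.length === 0) return 0;
    let total = 0;
    for (const c of labeled) {
        const got = new Set(retrieve(c.query, k).slice(0, k));
        const hits = c.relevantIds.filter(id => got.has(id)).length;
        total += hits / c.relevantIds.length;
    }
    return total / labeled.length;
}
```

If this number is low, prompt changes are unlikely to help, because the model never sees the right context.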

Agents: constrain the surface area

Multi-step agents are compelling and easy to over-scope. The failure mode is an agent with too many tools, too many steps allowed, and no meaningful output validation at each step.

The constraint that matters most: separate "plan" from "execute." Let the agent reason about what to do. Confirm or validate the plan before it takes irreversible actions. Log every tool call with enough context to reconstruct what happened.

Keep the step limit low until you understand the failure modes. An agent that can take 50 steps will eventually find a path you didn't anticipate.
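
The plan/execute split, the step limit, and the tool-call log fit together in a small harness. This is a sketch under assumptions, not a description of any agent framework: `Step`, `Tool`, `validatePlan`, and `runPlan` are hypothetical, tools are synchronous to keep the example short, and the plan is assumed to arrive as a list of steps proposed by the model.

```typescript
type Step = { tool: string; input: string };
type Tool = { irreversible: boolean; run: (input: string) => string };
type LogEntry = { step: number; tool: string; input: string; output: string };

// Check the whole plan before anything runs.
// Returns a rejection reason, or null if the plan is acceptable.
function validatePlan(
    plan: Step[],
    tools: Map<string, Tool>,
    maxSteps: number,
    approved: Set<string>, // irreversible tools a human or policy has signed off on
): string | null {
    if (plan.length > maxSteps) {
        return `plan has ${plan.length} steps, limit is ${maxSteps}`;
    }
    for (const step of plan) {
        const tool = tools.get(step.tool);
        if (!tool) return `unknown tool: ${step.tool}`;
        if (tool.irreversible && !approved.has(step.tool)) {
            return `irreversible tool not approved: ${step.tool}`;
        }
    }
    return null;
}

// Execute a validated plan, logging every tool call with its input and output.
function runPlan(
    plan: Step[],
    tools: Map<string, Tool>,
    maxSteps: number,
    approved: Set<string>,
): LogEntry[] {
    const problem = validatePlan(plan, tools, maxSteps, approved);
    if (problem !== null) throw new Error(`plan rejected: ${problem}`);
    const log: LogEntry[] = [];
    plan.forEach((step, i) => {
        const tool = tools.get(step.tool)!; // presence was checked by validatePlan
        const output = tool.run(step.input);
        log.push({ step: i, tool: step.tool, input: step.input, output });
    });
    return log;
}
```

Because validation covers the whole plan up front, a rejected plan runs zero tools, so an unapproved irreversible step cannot fire after earlier steps have already executed.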

Hi, I'm Martin Duchev. You can find more about my projects on my GitHub.