Building Quality in LLM-powered applications

Craig Risi
Oct 3
7 min read

While I’ve written on AI many times, I do often resist writing on the topic because I fear it is often overplayed and somewhat of a potential tech bubble. It won’t be going away, though, and even if its impact wanes, we still need to learn how to test and build quality around it. Especially in the realm of LLMs, which have had the largest impact on software development, both in how we create software and the types of solutions we can solve.

Large language models (LLMs) like GPT have opened the door to a new generation of intelligent applications, from copilots that accelerate software development to chat-based customer service to decision-support tools that assist analysts and researchers. Yet with these opportunities comes a pressing challenge: how do we ensure quality and reliability in systems powered by AI models that are inherently probabilistic, non-deterministic, and constantly evolving?

Unlike traditional software, where logic and behaviour can be exhaustively specified, LLMs generate outputs based on statistical inference. This means they can be powerful problem-solvers, but they can also “hallucinate,” generate biased content, or behave unpredictably under edge conditions. Building quality into LLM-powered applications requires a shift in mindset: we can’t test or govern them in the same way as deterministic software. Instead, we must embed quality at multiple layers, from data and prompts to architecture, to monitoring.

I want to explore these differences a little more before looking at strategies to overcome them:

How LLM-Powered Systems Differ from Traditional Software

Probabilistic, Not Deterministic

Traditional software: Given the same input, you can expect the same output every time. This predictability makes it easier to test, debug, and guarantee correctness.
LLMs: The same input can yield different outputs depending on sampling, temperature settings, or subtle variations in context. This probabilistic nature means we must think in terms of confidence ranges, likelihoods, and acceptable variability, not absolute correctness.

Impact: Testing moves from verifying fixed outputs to evaluating distributions of possible outputs. Success isn’t binary (“pass/fail”), but measured across dimensions like relevance, coherence, safety, or factual accuracy.

Non-Deterministic Behavior Under Context

Traditional software: Edge cases can be documented, reproduced, and patched with predictable fixes.
LLMs: Edge cases often emerge from interactions between prompts, training data, and hidden model behaviors. Identical inputs in slightly different contexts (chat history, user phrasing) can radically change results.

Impact: Debugging becomes about observing patterns of behavior rather than isolating a single defect. It requires more iterative testing, adversarial probing, and human-in-the-loop review.

Constantly Evolving

Traditional software: Versioned releases are stable until a deliberate upgrade. Teams control when and how the system changes.
LLMs: Models evolve continuously—through fine-tuning, retraining, or even changes in third-party APIs. Behavior can shift even without developers making explicit code changes.

Impact: Quality assurance becomes a continuous monitoring discipline. Regression testing isn’t just for new releases—it’s ongoing, because the “same model” today may not behave like it did yesterday.

Emergent and Unpredictable Capabilities

LLMs can exhibit unexpected skills or weaknesses because their knowledge emerges from large-scale statistical training, not explicit programming.
For example, they may suddenly solve problems outside the original design scope—or fail in seemingly trivial scenarios.

Impact: Planning must account for unknown unknowns. Guardrails, fallback logic, and fail-safes become essential architectural components, not optional extras.

So, now that we have a better understanding of the differences that LLMs introduce to our typical software development processes, let’s explore some of the things we can do to better cater to these new methods of software development. We’ll look at some of the data, design, and architectural elements first and then cover the testing and observability side of it next:

Start With the Right Data Foundation

The quality of LLM applications is only as good as the data they’re built on. The phrase “garbage in, garbage out” has never been more relevant. If the foundation is weak, the most advanced models will still deliver flawed or biased results. Building a strong data pipeline and governance model is, therefore, the first step toward reliable AI.

Curated Training and Fine-Tuning Data

Ensure training and fine-tuning sets are representative of the real-world use cases your application will face.
Regularly update datasets to prevent staleness (e.g., outdated knowledge or terminology).
Use annotation and validation pipelines to keep labeling consistent and reduce noise.

Data Provenance and Governance

Track and document the origin of datasets to confirm legality, licensing, and ethical use.
Implement compliance checks for sensitive domains (e.g., GDPR, HIPAA, POPIA, copyright).
Build metadata pipelines that tag datasets by source, timestamp, and validation status to make retraining auditable.

Synthetic Data with Care

Synthetic data can help balance underrepresented classes or test edge cases that don’t appear frequently in natural data.
However, synthetic data must be validated for realism and balance; otherwise, it risks reinforcing the very biases you’re trying to eliminate.
Use it strategically to fill gaps, not as a substitute for real-world data.

Bias and Fairness Checks

Actively monitor datasets for overrepresentation (e.g., certain demographics, geographies, or viewpoints) that could skew outcomes.
Run fairness and bias audits during data curation, not just after model deployment.
Use multiple perspectives in data review to catch blind spots that automated filters may miss.

Continuous Data Refresh

Unlike static software inputs, data relevance degrades quickly.
Establish a continuous refresh cycle (e.g., quarterly or event-driven updates) so the model reflects new knowledge, terminology, and real-world changes.
Couple this with regression testing to ensure new data improves performance without introducing new risks.

Building LLM-powered applications isn’t just about clever prompts or fine-tuning—it starts with data discipline. Curated, governed, fair, and continuously refreshed data creates the foundation for models that are trustworthy, compliant, and capable of adapting to real-world complexity.

Rigorous Prompt and Response Engineering

The way an LLM is instructed dramatically shapes its behavior and output quality. Unlike traditional software, where logic is hard-coded, LLMs rely on carefully crafted prompts to guide reasoning and responses. Designing resilient, reliable prompts isn’t just a convenience; it’s a core quality discipline.

Prompt Standardization

Develop reusable, tested prompt templates rather than relying on one-off phrasing.
Standardization ensures consistency across teams, reduces duplication of effort, and makes it easier to monitor prompt effectiveness over time.
Maintain a prompt library (with version control) that documents best practices, validated templates, and known limitations.

Guardrails via Instructional Prompts

Use clear constraints and rules within prompts to direct the model away from unsafe, biased, or irrelevant output.
Examples: defining response length, requiring citation of sources, or specifying tone (“be concise,” “answer as a technical advisor,” etc.).
Pair instructional prompts with fallback handling (e.g., redirecting unsafe responses to a predefined “safe” answer).

Evaluation Across Variations

Test prompts across multiple phrasings, contexts, and edge cases to ensure stability and reduce sensitivity to wording changes.
Evaluate against adversarial prompts (malicious or confusing inputs) to check for vulnerabilities.
Benchmark results across a range of expected user intents to validate robustness.

Response Validation and Post-Processing

Apply automated validators to check responses for factuality, compliance, or formatting before they reach end-users.
Use structured post-processing (regex, knowledge-base lookups, external APIs) to correct or enrich raw outputs.
Where high stakes are involved, add a human-in-the-loop review as a final quality checkpoint.

Continuous Testing and Refinement

Prompts are not “set and forget”; they need to be monitored and optimized as models evolve or are updated.
Use A/B testing and analytics to measure prompt effectiveness (accuracy, safety, user satisfaction).
Treat prompts as living artifacts within your software lifecycle, just like code modules or test cases.

Prompt engineering isn’t just a creative exercise; it’s about systematic design, testing, and governance. By standardizing prompts, embedding guardrails, and rigorously evaluating responses, you can transform LLM behavior from unpredictable to predictable, laying the foundation for trustworthy AI applications.

Architectural Safeguards and Governance

LLMs should rarely be deployed in a “raw” state. On their own, they are powerful but unpredictable. To achieve consistent quality, they need to be framed within a robust system architecture that enforces safety, security, and accountability. This governance layer ensures that the model’s intelligence is directed productively while minimizing risks.

Moderation Layers

Apply input filtering to prevent harmful, irrelevant, or malformed queries from ever reaching the model.
Add output moderation (via toxicity filters, regex, classifiers, or business rule checks) to block unsafe or non-compliant results before they are returned to users.
Use tiered safeguards: lightweight automated checks for everyday use and escalation to human review for high-risk cases.

Hybrid Approaches

Rarely should the LLM be the single source of truth. Pair probabilistic reasoning with deterministic systems like business rules, knowledge graphs, or rules engines to enforce correctness.
Retrieval-Augmented Generation (RAG): Ground responses in curated, up-to-date data sources to reduce hallucination and improve factual accuracy.
Adopt fallback flows where the LLM handles open-ended reasoning, but deterministic systems validate or supplement final answers (e.g., calculations, compliance checks).

Access Controls and Security

Implement identity-aware access management to control who can query, configure, or modify LLM integrations.
Protect sensitive data by defining strict data handling policies, such as redacting PII before prompts are sent to the model.
Monitor for prompt injection attacks and enforce limits on system prompts to prevent malicious manipulation.

Governance and Observability

Establish governance frameworks with clear ownership: who defines guardrails, who maintains prompts, and who approves new integrations.
Use audit trails and logging for every interaction—tracking prompts, responses, and moderation decisions for compliance and debugging.
Deploy observability dashboards to monitor usage trends, detect anomalies, and surface risks early.

Continuous Policy Evolution

Governance isn’t static. As regulations, business needs, and model behaviors evolve, so too should policies, guardrails, and architectural safeguards.
Create feedback loops where real-world incidents inform stronger controls, ensuring your LLM environment becomes more secure and reliable over time.

Architectural safeguards and governance turn an LLM from a clever but risky tool into a reliable, enterprise-ready system. By wrapping raw models in moderation, hybrid logic, access controls, and governance frameworks, organizations can balance innovation with safety—delivering value without sacrificing trust.

In summary, if we are to truly account for the non-deterministic nature of LLMs, we need to rethink several aspects of how we build, test, and operate these systems. Unlike traditional software, where deterministic inputs and outputs create a predictable testing landscape, LLMs introduce inherent variability—meaning outcomes can shift even without obvious changes. To manage this, organizations must invest more deliberately in the foundations that drive reliability.

First, this means placing greater emphasis on data quality and governance. Biased, incomplete, or outdated data can magnify unpredictability, so curating, labeling, and continuously monitoring data sources becomes a critical discipline. Second, it requires rigorous and standardized prompt engineering, treating prompts not as ad hoc experiments but as reusable, tested components that are validated across variations. Done well, this reduces the model’s sensitivity to phrasing and helps stabilize outputs.

Finally, addressing LLM unpredictability calls for a rethink of system architecture. Instead of exposing raw model outputs, organizations should build guardrails through moderation layers, retrieval-augmented generation, hybrid approaches with rules engines, and identity-aware access controls. This creates an ecosystem where the strengths of LLMs are harnessed but their variability is contained within safe, predictable boundaries.

Taken together, these investments turn the challenge of non-determinism into an opportunity: by designing deliberately for uncertainty, we can build more resilient, adaptable, and trustworthy AI systems that stand the test of real-world complexity.

In my next blog post, I will then look at how we go about testing LLMs in further detail – a critical aspect that is arguably even more important to make a success out of LLM development.

CRAIG RISI