LLM Evaluation Criteria: How to Measure AI Quality
A guide to choosing the right metrics for evaluating large language model outputs

Fabio Di Leta · Jan 23, 2026 · 8 min read
What Are LLM Evaluation Criteria?
LLM evaluation criteria are the specific dimensions used to assess the quality of AI-generated outputs. When you deploy a large language model in production, you need systematic ways to measure whether responses meet your quality standards.
The challenge: not all AI outputs need the same evaluation. A customer support chatbot has different quality requirements than a code generation tool. A medical information system demands different standards than a creative writing assistant.
Choosing the wrong criteria leads to two problems:
False confidence: High scores on irrelevant metrics while real issues go undetected
Alert fatigue: Low scores on metrics that don't matter for your use case
This guide covers the six core evaluation criteria, when to use each, and how to combine them effectively.
The Six Core LLM Evaluation Criteria
These foundational criteria apply across most LLM applications. Understanding what each measures helps you choose the right combination for your use case.
1. Accuracy
What it measures: Factual correctness of the AI response
Accuracy evaluates whether statements in the response are true and verifiable. This is the most critical criterion for knowledge-based systems where users rely on the AI for factual information.
When accuracy matters most:
Medical or health information
Legal guidance
Financial advice
Technical documentation
Educational content
When accuracy is less critical:
Creative writing
Brainstorming sessions
Hypothetical scenarios
A key challenge with accuracy evaluation is that the judge model must itself have reliable knowledge to assess factual claims. For domain-specific accuracy (medical, legal), consider whether general-purpose LLM judges are sufficient or if specialized evaluation is needed.
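To make this concrete, here is a minimal sketch of an accuracy judge, assuming the OpenAI Python SDK as the judge backend; the rubric wording, model name, and bare-integer parsing are illustrative rather than a prescribed implementation.

```python
# Minimal accuracy-judge sketch. The rubric text, model name, and score
# parsing are illustrative; production code should handle parse failures.
from openai import OpenAI  # assumed judge backend

client = OpenAI()

ACCURACY_RUBRIC = """You are evaluating an AI response for factual accuracy.

Question: {question}
Response: {response}

Score 1-10, where 10 means every claim is true and verifiable and 1 means
the response is largely false or unverifiable. Reply with only the integer."""

def judge_accuracy(question: str, response: str, model: str = "gpt-4o") -> int:
    """Ask a judge model to rate factual accuracy on a 1-10 scale."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user", "content": ACCURACY_RUBRIC.format(
            question=question, response=response)}],
    )
    return int(completion.choices[0].message.content.strip())
```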
2. Relevance
What it measures: How directly the response addresses the user's input
Relevance captures whether the AI understood the question and responded appropriately. A response can be factually accurate but completely irrelevant if it answers a different question than what was asked.
Common relevance failures:
Answering a different question than asked
Including excessive background information
Missing the specific context of the question
Providing generic responses to specific queries
Relevance is particularly important for search systems, RAG applications, and chatbots where users expect direct answers.
3. Completeness
What it measures: Whether all parts of a complex question are addressed
Completeness matters when users ask multi-part questions or have compound requirements. A response that thoroughly answers one part while ignoring others scores low on completeness even if what it does cover is accurate.
Example: If a user asks "What are the pros and cons of Python vs JavaScript, and which should I learn first?"—a response covering only Python's advantages would be accurate and relevant but incomplete.
Completeness trades off against conciseness. For simple questions, a complete response might be unnecessarily verbose. Match your completeness expectations to query complexity.
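One way to operationalize this is to split a compound question into its parts and check each one separately. The sketch below assumes a caller-supplied `judge_covers` callable (a hypothetical yes/no judge call, not shown); the aggregation is a plain coverage fraction.

```python
from typing import Callable

def score_completeness(sub_questions: list[str], response: str,
                       judge_covers: Callable[[str, str], bool]) -> float:
    """Fraction of sub-questions the judge says the response addresses.

    judge_covers(sub_question, response) -> bool is supplied by the caller,
    typically a yes/no LLM-judge prompt (hypothetical helper, not shown here).
    """
    if not sub_questions:
        return 1.0
    covered = sum(judge_covers(q, response) for q in sub_questions)
    return covered / len(sub_questions)

# The article's example: ["Pros and cons of Python", "Pros and cons of
# JavaScript", "Which should I learn first?"] -> a response covering only
# Python's advantages scores roughly 1/3.
```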
4. Tone
What it measures: Appropriateness of style, formality, and emotional quality
Tone evaluation assesses whether the response matches the expected communication style for the context. This criterion is inherently subjective and context-dependent.
| Application Context | Expected Tone |
|---|---|
| Customer support | Helpful, empathetic, professional |
| Technical documentation | Clear, precise, neutral |
| Marketing copy | Enthusiastic, persuasive, brand-aligned |
| Medical information | Careful, accurate, appropriately cautious |
Tone criteria require clear specification of what "appropriate" means for your application. Without explicit guidance, LLM judges default to generic professional tone, which may not match your brand voice.
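One lightweight way to provide that guidance is to keep an explicit tone specification per application context and inject it into the judge prompt. The specs and prompt wording below are illustrative placeholders, not a recommended house style.

```python
# Illustrative tone specifications; replace the wording with your own
# brand-voice guidelines.
TONE_SPECS = {
    "customer_support": "Helpful and empathetic; acknowledge frustration; no jargon.",
    "technical_docs": "Clear, precise, neutral; no marketing language.",
    "marketing": "Enthusiastic and persuasive; aligned with the brand style guide.",
    "medical": "Careful and appropriately cautious; never overstate certainty.",
}

TONE_PROMPT = """Rate the response's tone from 1-10 against this specification:
{spec}

Response: {response}
Reply with only the integer score."""

def build_tone_prompt(context: str, response: str) -> str:
    """Fill the tone rubric with the explicit specification for this context."""
    return TONE_PROMPT.format(spec=TONE_SPECS[context], response=response)
```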
5. Consistency
What it measures: Alignment with established patterns and baseline examples
Consistency evaluates whether responses follow the same structure, style, and approach as your approved examples. This criterion is unique because it requires baseline data—you can't measure consistency without something to be consistent with.
Consistency matters most for:
Maintaining brand voice across thousands of responses
Ensuring format adherence in structured outputs
Detecting when model behavior drifts from established patterns
The dbt™-llm-evals package automatically creates baselines from high-quality outputs, enabling consistency scoring without manual curation.
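As a rough illustration of baseline-driven scoring (not how the dbt™-llm-evals package implements it), you can show the judge a handful of approved examples and ask how closely a new response matches them:

```python
# Sketch of a baseline-driven consistency prompt; the wording and the
# five-example cap are arbitrary choices, not the dbt™-llm-evals internals.
CONSISTENCY_PROMPT = """Here are approved baseline responses:
{baselines}

Here is a new response:
{response}

Score 1-10 for how closely the new response matches the structure, style,
and approach of the baselines. Reply with only the integer score."""

def build_consistency_prompt(baseline_examples: list[str], response: str) -> str:
    """Embed a few baseline examples into the consistency rubric."""
    baselines = "\n---\n".join(baseline_examples[:5])  # keep the prompt small
    return CONSISTENCY_PROMPT.format(baselines=baselines, response=response)
```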
6. Safety
What it measures: Absence of harmful, inappropriate, or dangerous content
Safety is non-negotiable for production AI systems. Unlike other criteria where a score of 6/10 might be acceptable, safety failures often require immediate intervention.
Safety concerns include:
Harmful instructions or dangerous advice
Discriminatory or biased content
Privacy violations (PII exposure)
Misinformation that could cause harm
Important: Safety should be a gate, not a gradient. Rather than treating safety as a 1-10 score to average with other criteria, use it as a pass/fail filter where any output below threshold triggers immediate review.
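In code, the gate-not-gradient idea looks something like the sketch below; the threshold value and result fields are assumptions to adapt to your own review workflow.

```python
# Safety as a gate: a failing safety score short-circuits evaluation,
# regardless of how the other criteria scored.
SAFETY_THRESHOLD = 8  # illustrative; tune to your risk tolerance

def apply_safety_gate(safety_score: float, other_scores: dict[str, float]) -> dict:
    """Pass/fail result; other criteria only count when safety passes."""
    if safety_score < SAFETY_THRESHOLD:
        return {"passed": False, "reason": "safety_below_threshold",
                "needs_review": True}
    return {"passed": True, "scores": other_scores}
```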
Domain-Specific Evaluation Criteria
Beyond core criteria, specialized applications require domain-specific evaluation dimensions.
Medical AI Evaluation
Clinical accuracy: Medical facts verified against established standards
Appropriate caution: Doesn't overstate diagnostic certainty
Disclaimer inclusion: Recommends professional consultation
Legal AI Evaluation
Jurisdictional awareness: Notes geographic limitations
Disclaimer compliance: Includes "not legal advice" language
Qualified language: Avoids definitive legal claims
Customer Support AI Evaluation
Resolution focus: Actually solves the customer's problem
Empathy: Acknowledges customer frustration appropriately
Policy compliance: Follows company guidelines and procedures
Code Generation AI Evaluation
Syntactic correctness: Generated code compiles and runs (a quick check is sketched after this list)
Logic accuracy: Code does what was requested
Security: No obvious vulnerabilities introduced
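Syntactic correctness is the one criterion here that doesn't need an LLM judge at all: for Python output, a parse check is enough for a first gate. The sketch below covers syntax only; logic accuracy and security still need tests, linters, or judge review.

```python
import ast

def is_syntactically_valid(generated_code: str) -> bool:
    """True if generated Python parses without a SyntaxError.
    Checks syntax only; says nothing about logic accuracy or security."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False
```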
How to Combine Multiple Evaluation Criteria
Real-world LLM evaluation requires multiple criteria working together. Three main strategies exist for combining scores:
Simple Average
All criteria weighted equally. Best when criteria are genuinely equal in importance—rare in practice but simple to implement and explain.
Weighted Scoring
Different criteria receive different weights based on business importance. For example, customer support AI might weight resolution (30%) and tone (25%) higher than raw accuracy (20%). This approach requires understanding which criteria matter most for your specific application.
Pass/Fail Gates
Certain criteria must meet minimum thresholds before other scoring applies. Safety is the classic gate criterion—no amount of accuracy compensates for unsafe content. This hybrid approach ensures critical requirements are met while allowing nuance in secondary criteria.
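Put together, the hybrid approach might look like the sketch below: safety acts as a gate and the remaining criteria are averaged with business-driven weights. The weights reuse the customer-support example above, with the leftover 25% assigned to completeness purely for illustration.

```python
# Hybrid scoring sketch: gate on safety, then take a weighted average.
# Weights are illustrative; the completeness weight is an assumption that
# fills the remainder of the customer-support example.
WEIGHTS = {"resolution": 0.30, "tone": 0.25, "completeness": 0.25, "accuracy": 0.20}

def combined_score(scores: dict[str, float], safety_score: float,
                   safety_threshold: float = 8.0) -> float | None:
    """Weighted average of criteria scores, or None when the safety gate fails."""
    if safety_score < safety_threshold:
        return None  # gate failure: route to review instead of averaging
    total_weight = sum(WEIGHTS[name] for name in scores)
    return sum(scores[name] * WEIGHTS[name] for name in scores) / total_weight
```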
The LLM Judge Calibration Challenge
A fundamental question underlies all LLM-as-a-judge evaluation: How do you know your judge model is judging correctly?
Research shows LLM judges can exhibit systematic biases:
Position bias: Favoring content presented first (or last) in prompts
Length bias: Preferring longer responses regardless of quality
Style bias: Favoring outputs similar to the judge model's own style
Calibration requires human-labeled examples. Create a small gold standard dataset (50-100 examples) with human scores, then verify your LLM judge correlates strongly (>0.7 correlation) with human judgment. Without this validation step, you're measuring something—but you don't know if it reflects actual quality.
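A minimal calibration check, assuming you have paired human and judge scores for the same gold-standard examples, is a rank correlation against the >0.7 bar mentioned above (Spearman via scipy is one reasonable choice):

```python
from scipy.stats import spearmanr

def is_judge_calibrated(human_scores: list[float], judge_scores: list[float],
                        min_correlation: float = 0.7) -> bool:
    """True when judge scores correlate strongly enough with human labels."""
    correlation, _p_value = spearmanr(human_scores, judge_scores)
    return correlation >= min_correlation
```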
Choosing the Right Criteria for Your Application
| Primary Concern | Prioritize These Criteria |
|---|---|
| Users getting correct information | Accuracy, Completeness |
| Users getting questions answered | Relevance, Completeness |
| Brand consistency at scale | Tone, Consistency |
| Avoiding PR disasters | Safety, Tone |
| Complex multi-step queries | Completeness, Accuracy |
| Creative or generative tasks | Tone, Relevance |
Creative or generative tasks | Tone, Relevance |
Start focused. Begin with 2-3 criteria that directly map to your application's success metrics. Add more only when you have evidence they're needed—over-evaluation creates noise without improving signal.
Key Takeaways
Match criteria to your use case: Customer support needs different evaluation than code generation
Treat safety as a gate, not a score: Use pass/fail filtering, not averaging
Consistency requires baselines: You can't measure drift without knowing what you're drifting from
Calibrate against human judgment: Validate that your LLM judge agrees with human evaluators
Start with fewer criteria: 2-3 well-chosen criteria beat 10 poorly-defined ones
The goal isn't comprehensive evaluation—it's meaningful evaluation that tells you whether your AI is doing its job.
Next Steps
Ready to implement LLM evaluation criteria in your data warehouse? The dbt™-llm-evals package supports all core criteria and runs natively on Snowflake Cortex, BigQuery Vertex AI, and Databricks AI Functions—with zero data egress.
Related reading:
What is LLM-as-a-Judge? - Understanding the evaluation technique
Quick Start: Your First LLM Evaluation - Hands-on tutorial using dbt™





