LLM Evaluation Criteria: How to Measure AI Quality
A guide to choosing the right metrics for evaluating large language model outputs

Fabio Di Leta · Jan 23, 2026 · 8 min read
What Are LLM Evaluation Criteria?
LLM evaluation criteria are the specific dimensions used to assess the quality of AI-generated outputs. When you deploy a large language model in production, you need systematic ways to measure whether responses meet your quality standards.
The challenge: not all AI outputs need the same evaluation. A customer support chatbot has different quality requirements than a code generation tool. A medical information system demands different standards than a creative writing assistant.
Choosing the wrong criteria leads to two problems:
False confidence: High scores on irrelevant metrics while real issues go undetected
Alert fatigue: Low scores on metrics that don't matter for your use case
This guide covers the six core evaluation criteria, when to use each, and how to combine them effectively.
The Six Core LLM Evaluation Criteria
These foundational criteria apply across most LLM applications. Understanding what each measures helps you choose the right combination for your use case.
1. Accuracy
What it measures: Factual correctness of the AI response
Accuracy evaluates whether statements in the response are true and verifiable. This is the most critical criterion for knowledge-based systems where users rely on the AI for factual information.
When accuracy matters most:
Medical or health information
Legal guidance
Financial advice
Technical documentation
Educational content
When accuracy is less critical:
Creative writing
Brainstorming sessions
Hypothetical scenarios
A key challenge with accuracy evaluation is that the judge model must itself have reliable knowledge to assess factual claims. For domain-specific accuracy (medical, legal), consider whether general-purpose LLM judges are sufficient or if specialized evaluation is needed.
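To make this concrete, here is a minimal sketch of an accuracy judge, assuming the OpenAI Python SDK as the judge backend; the rubric wording, model name, and bare-integer parsing are illustrative rather than a prescribed implementation.

```python
# Minimal accuracy-judge sketch. The rubric text, model name, and score
# parsing are illustrative; production code should handle parse failures.
from openai import OpenAI  # assumed judge backend

client = OpenAI()

ACCURACY_RUBRIC = """You are evaluating an AI response for factual accuracy.

Question: {question}
Response: {response}

Score 1-10, where 10 means every claim is true and verifiable and 1 means
the response is largely false or unverifiable. Reply with only the integer."""

def judge_accuracy(question: str, response: str, model: str = "gpt-4o") -> int:
    """Ask a judge model to rate factual accuracy on a 1-10 scale."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user", "content": ACCURACY_RUBRIC.format(
            question=question, response=response)}],
    )
    return int(completion.choices[0].message.content.strip())
```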
2. Relevance
What it measures: How directly the response addresses the user's input
Relevance captures whether the AI understood the question and responded appropriately. A response can be factually accurate but completely irrelevant if it answers a different question than what was asked.
Common relevance failures:
Answering a different question than asked
Including excessive background information
Missing the specific context of the question
Providing generic responses to specific queries
Relevance is particularly important for search systems, RAG applications, and chatbots where users expect direct answers.
3. Completeness
What it measures: Whether all parts of a complex question are addressed
Completeness matters when users ask multi-part questions or have compound requirements. A response that thoroughly answers one part while ignoring others scores low on completeness even if what it does cover is accurate.
Example: If a user asks "What are the pros and cons of Python vs JavaScript, and which should I learn first?"—a response covering only Python's advantages would be accurate and relevant but incomplete.
Completeness trades off against conciseness. For simple questions, a complete response might be unnecessarily verbose. Match your completeness expectations to query complexity.
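One way to operationalize this is to split a compound question into its parts and check each one separately. The sketch below assumes a caller-supplied `judge_covers` callable (a hypothetical yes/no judge call, not shown); the aggregation is a plain coverage fraction.

```python
from typing import Callable

def score_completeness(sub_questions: list[str], response: str,
                       judge_covers: Callable[[str, str], bool]) -> float:
    """Fraction of sub-questions the judge says the response addresses.

    judge_covers(sub_question, response) -> bool is supplied by the caller,
    typically a yes/no LLM-judge prompt (hypothetical helper, not shown here).
    """
    if not sub_questions:
        return 1.0
    covered = sum(judge_covers(q, response) for q in sub_questions)
    return covered / len(sub_questions)

# The article's example: ["Pros and cons of Python", "Pros and cons of
# JavaScript", "Which should I learn first?"] -> a response covering only
# Python's advantages scores roughly 1/3.
```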
4. Tone
What it measures: Appropriateness of style, formality, and emotional quality
Tone evaluation assesses whether the response matches the expected communication style for the context. This criterion is inherently subjective and context-dependent.
| Application Context | Expected Tone |
|---|---|
| Customer support | Helpful, empathetic, professional |
| Technical documentation | Clear, precise, neutral |
| Marketing copy | Enthusiastic, persuasive, brand-aligned |
| Medical information | Careful, accurate, appropriately cautious |
Tone criteria require clear specification of what "appropriate" means for your application. Without explicit guidance, LLM judges default to generic professional tone, which may not match your brand voice.
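One lightweight way to provide that guidance is to keep an explicit tone specification per application context and inject it into the judge prompt. The specs and prompt wording below are illustrative placeholders, not a recommended house style.

```python
# Illustrative tone specifications; replace the wording with your own
# brand-voice guidelines.
TONE_SPECS = {
    "customer_support": "Helpful and empathetic; acknowledge frustration; no jargon.",
    "technical_docs": "Clear, precise, neutral; no marketing language.",
    "marketing": "Enthusiastic and persuasive; aligned with the brand style guide.",
    "medical": "Careful and appropriately cautious; never overstate certainty.",
}

TONE_PROMPT = """Rate the response's tone from 1-10 against this specification:
{spec}

Response: {response}
Reply with only the integer score."""

def build_tone_prompt(context: str, response: str) -> str:
    """Fill the tone rubric with the explicit specification for this context."""
    return TONE_PROMPT.format(spec=TONE_SPECS[context], response=response)
```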
5. Consistency
What it measures: Alignment with established patterns and baseline examples
Consistency evaluates whether responses follow the same structure, style, and approach as your approved examples. This criterion is unique because it requires baseline data—you can't measure consistency without something to be consistent with.
Consistency matters most for:
Maintaining brand voice across thousands of responses
Ensuring format adherence in structured outputs
Detecting when model behavior drifts from established patterns
The dbt™-llm-evals package automatically creates baselines from high-quality outputs, enabling consistency scoring without manual curation.
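As a rough illustration of baseline-driven scoring (not how the dbt™-llm-evals package implements it), you can show the judge a handful of approved examples and ask how closely a new response matches them:

```python
# Sketch of a baseline-driven consistency prompt; the wording and the
# five-example cap are arbitrary choices, not the dbt™-llm-evals internals.
CONSISTENCY_PROMPT = """Here are approved baseline responses:
{baselines}

Here is a new response:
{response}

Score 1-10 for how closely the new response matches the structure, style,
and approach of the baselines. Reply with only the integer score."""

def build_consistency_prompt(baseline_examples: list[str], response: str) -> str:
    """Embed a few baseline examples into the consistency rubric."""
    baselines = "\n---\n".join(baseline_examples[:5])  # keep the prompt small
    return CONSISTENCY_PROMPT.format(baselines=baselines, response=response)
```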
6. Safety
What it measures: Absence of harmful, inappropriate, or dangerous content
Safety is non-negotiable for production AI systems. Unlike other criteria where a score of 6/10 might be acceptable, safety failures often require immediate intervention.
Safety concerns include:
Harmful instructions or dangerous advice
Discriminatory or biased content
Privacy violations (PII exposure)
Misinformation that could cause harm
Important: Safety should be a gate, not a gradient. Rather than treating safety as a 1-10 score to average with other criteria, use it as a pass/fail filter where any output below threshold triggers immediate review.
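In code, the gate-not-gradient idea looks something like the sketch below; the threshold value and result fields are assumptions to adapt to your own review workflow.

```python
# Safety as a gate: a failing safety score short-circuits evaluation,
# regardless of how the other criteria scored.
SAFETY_THRESHOLD = 8  # illustrative; tune to your risk tolerance

def apply_safety_gate(safety_score: float, other_scores: dict[str, float]) -> dict:
    """Pass/fail result; other criteria only count when safety passes."""
    if safety_score < SAFETY_THRESHOLD:
        return {"passed": False, "reason": "safety_below_threshold",
                "needs_review": True}
    return {"passed": True, "scores": other_scores}
```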
Domain-Specific Evaluation Criteria
Beyond core criteria, specialized applications require domain-specific evaluation dimensions.
Medical AI Evaluation
Clinical accuracy: Medical facts verified against established standards
Appropriate caution: Doesn't overstate diagnostic certainty
Disclaimer inclusion: Recommends professional consultation
Legal AI Evaluation
Jurisdictional awareness: Notes geographic limitations
Disclaimer compliance: Includes "not legal advice" language
Qualified language: Avoids definitive legal claims
Customer Support AI Evaluation
Resolution focus: Actually solves the customer's problem
Empathy: Acknowledges customer frustration appropriately
Policy compliance: Follows company guidelines and procedures
Code Generation AI Evaluation
Syntactic correctness: Generated code compiles and runs (a quick check is sketched after this list)
Logic accuracy: Code does what was requested
Security: No obvious vulnerabilities introduced
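Syntactic correctness is the one criterion here that doesn't need an LLM judge at all: for Python output, a parse check is enough for a first gate. The sketch below covers syntax only; logic accuracy and security still need tests, linters, or judge review.

```python
import ast

def is_syntactically_valid(generated_code: str) -> bool:
    """True if generated Python parses without a SyntaxError.
    Checks syntax only; says nothing about logic accuracy or security."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False
```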
How to Combine Multiple Evaluation Criteria
Real-world LLM evaluation requires multiple criteria working together. Three main strategies exist for combining scores:
Simple Average
All criteria weighted equally. Best when criteria are genuinely equal in importance—rare in practice but simple to implement and explain.
Weighted Scoring
Different criteria receive different weights based on business importance. For example, customer support AI might weight resolution (30%) and tone (25%) higher than raw accuracy (20%). This approach requires understanding which criteria matter most for your specific application.
Pass/Fail Gates
Certain criteria must meet minimum thresholds before other scoring applies. Safety is the classic gate criterion—no amount of accuracy compensates for unsafe content. This hybrid approach ensures critical requirements are met while allowing nuance in secondary criteria.
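Put together, the hybrid approach might look like the sketch below: safety acts as a gate and the remaining criteria are averaged with business-driven weights. The weights reuse the customer-support example above, with the leftover 25% assigned to completeness purely for illustration.

```python
# Hybrid scoring sketch: gate on safety, then take a weighted average.
# Weights are illustrative; the completeness weight is an assumption that
# fills the remainder of the customer-support example.
WEIGHTS = {"resolution": 0.30, "tone": 0.25, "completeness": 0.25, "accuracy": 0.20}

def combined_score(scores: dict[str, float], safety_score: float,
                   safety_threshold: float = 8.0) -> float | None:
    """Weighted average of criteria scores, or None when the safety gate fails."""
    if safety_score < safety_threshold:
        return None  # gate failure: route to review instead of averaging
    total_weight = sum(WEIGHTS[name] for name in scores)
    return sum(scores[name] * WEIGHTS[name] for name in scores) / total_weight
```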
The LLM Judge Calibration Challenge
A fundamental question underlies all LLM-as-a-judge evaluation: How do you know your judge model is judging correctly?
Research shows LLM judges can exhibit systematic biases:
Position bias: Favoring content presented first (or last) in prompts
Length bias: Preferring longer responses regardless of quality
Style bias: Favoring outputs similar to the judge model's own style
Calibration requires human-labeled examples. Create a small gold standard dataset (50-100 examples) with human scores, then verify your LLM judge correlates strongly (>0.7 correlation) with human judgment. Without this validation step, you're measuring something—but you don't know if it reflects actual quality.
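A minimal calibration check, assuming you have paired human and judge scores for the same gold-standard examples, is a rank correlation against the >0.7 bar mentioned above (Spearman via scipy is one reasonable choice):

```python
from scipy.stats import spearmanr

def is_judge_calibrated(human_scores: list[float], judge_scores: list[float],
                        min_correlation: float = 0.7) -> bool:
    """True when judge scores correlate strongly enough with human labels."""
    correlation, _p_value = spearmanr(human_scores, judge_scores)
    return correlation >= min_correlation
```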
Choosing the Right Criteria for Your Application
| Primary Concern | Prioritize These Criteria |
|---|---|
| Users getting correct information | Accuracy, Completeness |
| Users getting questions answered | Relevance, Completeness |
| Brand consistency at scale | Tone, Consistency |
| Avoiding PR disasters | Safety, Tone |
| Complex multi-step queries | Completeness, Accuracy |
| Creative or generative tasks | Tone, Relevance |
Creative or generative tasks | Tone, Relevance |
Start focused. Begin with 2-3 criteria that directly map to your application's success metrics. Add more only when you have evidence they're needed—over-evaluation creates noise without improving signal.
Key Takeaways
Match criteria to your use case: Customer support needs different evaluation than code generation
Treat safety as a gate, not a score: Use pass/fail filtering, not averaging
Consistency requires baselines: You can't measure drift without knowing what you're drifting from
Calibrate against human judgment: Validate that your LLM judge agrees with human evaluators
Start with fewer criteria: 2-3 well-chosen criteria beat 10 poorly-defined ones
The goal isn't comprehensive evaluation—it's meaningful evaluation that tells you whether your AI is doing its job.
Next Steps
Ready to implement LLM evaluation criteria in your data warehouse? The dbt™-llm-evals package supports all core criteria and runs natively on Snowflake Cortex, BigQuery Vertex AI, and Databricks AI Functions—with zero data egress.
Related reading:
What is LLM-as-a-Judge? - Understanding the evaluation technique
Quick Start: Your First LLM Evaluation - Hands-on tutorial using dbt™





