LLM Evaluation Criteria: How to Measure AI Quality

A guide to choosing the right metrics for evaluating large language model outputs

Fabio Di Leta · Jan 23, 2026 · 8 min read

What Are LLM Evaluation Criteria?

LLM evaluation criteria are the specific dimensions used to assess the quality of AI-generated outputs. When you deploy a large language model in production, you need systematic ways to measure whether responses meet your quality standards.

The challenge: not all AI outputs need the same evaluation. A customer support chatbot has different quality requirements than a code generation tool. A medical information system demands different standards than a creative writing assistant.

Choosing the wrong criteria leads to two problems:

  1. False confidence: High scores on irrelevant metrics while real issues go undetected

  2. Alert fatigue: Low scores on metrics that don't matter for your use case

This guide covers the six core evaluation criteria, when to use each, and how to combine them effectively.

The Six Core LLM Evaluation Criteria

These foundational criteria apply across most LLM applications. Understanding what each measures helps you choose the right combination for your use case.

1. Accuracy

What it measures: Factual correctness of the AI response

Accuracy evaluates whether statements in the response are true and verifiable. This is the most critical criterion for knowledge-based systems where users rely on the AI for factual information.

When accuracy matters most:

  • Medical or health information

  • Legal guidance

  • Financial advice

  • Technical documentation

  • Educational content

When accuracy is less critical:

  • Creative writing

  • Brainstorming sessions

  • Hypothetical scenarios

A key challenge with accuracy evaluation is that the judge model must itself have reliable knowledge to assess factual claims. For domain-specific accuracy (medical, legal), consider whether general-purpose LLM judges are sufficient or if specialized evaluation is needed.
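
To make this concrete, here is a minimal sketch of what an LLM-as-a-judge accuracy check can look like. The prompt wording and the call_judge function are illustrative placeholders for whichever judge model and client you use; this is not the dbt™-llm-evals implementation.

```python
# Minimal sketch of an LLM-as-a-judge accuracy check.
# `call_judge` is a placeholder for whatever judge client you use
# (Snowflake Cortex, an API client, etc.); it is an assumption, not a real library call.

ACCURACY_PROMPT = """You are grading an AI response for factual accuracy.

Question: {question}
Response: {response}

List any factual claims that are false or unverifiable, then give a
score from 1 (mostly wrong) to 10 (fully accurate and verifiable).
Return only JSON: {{"issues": [...], "score": <int>}}"""


def judge_accuracy(question: str, response: str, call_judge) -> dict:
    """Ask a judge model to score factual accuracy of a response."""
    prompt = ACCURACY_PROMPT.format(question=question, response=response)
    return call_judge(prompt)  # expected to return the parsed JSON dict
```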

2. Relevance

What it measures: How directly the response addresses the user's input

Relevance captures whether the AI understood the question and responded appropriately. A response can be factually accurate but completely irrelevant if it answers a different question than what was asked.

Common relevance failures:

  • Answering a different question than asked

  • Including excessive background information

  • Missing the specific context of the question

  • Providing generic responses to specific queries

Relevance is particularly important for search systems, RAG applications, and chatbots where users expect direct answers.

3. Completeness

What it measures: Whether all parts of a complex question are addressed

Completeness matters when users ask multi-part questions or have compound requirements. A response that thoroughly answers one part while ignoring others scores low on completeness even if what it does cover is accurate.

Example: If a user asks "What are the pros and cons of Python vs JavaScript, and which should I learn first?"—a response covering only Python's advantages would be accurate and relevant but incomplete.

Completeness trades off against conciseness. For simple questions, a complete response might be unnecessarily verbose. Match your completeness expectations to query complexity.
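
As a rough sketch, completeness can be scored as the fraction of expected sub-questions a response covers. The expected_parts list and the part_is_addressed check below are illustrative assumptions; in practice the coverage check is usually an LLM judge rather than a keyword match.

```python
# Toy completeness check: score = fraction of expected sub-questions
# that the response actually addresses. How "addressed" is decided
# (keyword match, an LLM judge, etc.) is left to `part_is_addressed`.

def completeness_score(response: str, expected_parts: list[str],
                       part_is_addressed) -> float:
    """Fraction of expected parts covered by the response (0.0-1.0)."""
    if not expected_parts:
        return 1.0
    covered = sum(1 for part in expected_parts
                  if part_is_addressed(response, part))
    return covered / len(expected_parts)


# Example with a naive keyword check for the Python-vs-JavaScript question:
parts = ["pros of Python", "cons of Python",
         "pros of JavaScript", "cons of JavaScript",
         "which to learn first"]
naive = lambda resp, part: part.split()[-1].lower() in resp.lower()
print(completeness_score("Python is great...", parts, naive))  # low score
```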

4. Tone

What it measures: Appropriateness of style, formality, and emotional quality

Tone evaluation assesses whether the response matches the expected communication style for the context. This criterion is inherently subjective and context-dependent.

Expected tone by application context:

  • Customer support: Helpful, empathetic, professional

  • Technical documentation: Clear, precise, neutral

  • Marketing copy: Enthusiastic, persuasive, brand-aligned

  • Medical information: Careful, accurate, appropriately cautious

Tone criteria require clear specification of what "appropriate" means for your application. Without explicit guidance, LLM judges default to generic professional tone, which may not match your brand voice.
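
For illustration, here is one way to make "appropriate" explicit: hand the judge a concrete rubric instead of a vague instruction. The rubric contents below are an example for a customer support assistant, not a universal standard.

```python
# Illustrative only: an explicit tone rubric passed to the judge,
# instead of relying on the judge's default notion of "professional".

TONE_RUBRIC = {
    "audience": "frustrated customers contacting support",
    "required": ["acknowledges the customer's issue",
                 "plain language, no internal jargon",
                 "offers a concrete next step"],
    "forbidden": ["blaming the customer",
                  "overly casual slang or emojis"],
}

TONE_PROMPT = (
    "Score the response 1-10 for tone, using this rubric:\n"
    f"Audience: {TONE_RUBRIC['audience']}\n"
    f"Required: {'; '.join(TONE_RUBRIC['required'])}\n"
    f"Forbidden: {'; '.join(TONE_RUBRIC['forbidden'])}\n\n"
    "Response to grade: {response}"  # filled in per evaluation
)
```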

5. Consistency

What it measures: Alignment with established patterns and baseline examples

Consistency evaluates whether responses follow the same structure, style, and approach as your approved examples. This criterion is unique because it requires baseline data—you can't measure consistency without something to be consistent with.

Consistency matters most for:

  • Maintaining brand voice across thousands of responses

  • Ensuring format adherence in structured outputs

  • Detecting when model behavior drifts from established patterns

The dbt™-llm-evals package automatically creates baselines from high-quality outputs, enabling consistency scoring without manual curation.
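
As a rough illustration of the idea (not the package's actual method), consistency can be approximated as the similarity between a new response and its nearest approved baseline. The embed function below is a placeholder for any text-embedding function you already have.

```python
import numpy as np

# Toy consistency score: cosine similarity between a new response and
# the closest approved baseline example. `embed` is a placeholder for
# any text-embedding function; this is not how dbt™-llm-evals computes it.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def consistency_score(response: str, baselines: list[str], embed) -> float:
    """Similarity of the response to its nearest baseline (higher = more consistent)."""
    r = embed(response)
    return max(cosine(r, embed(b)) for b in baselines)
```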

6. Safety

What it measures: Absence of harmful, inappropriate, or dangerous content

Safety is non-negotiable for production AI systems. Unlike other criteria where a score of 6/10 might be acceptable, safety failures often require immediate intervention.

Safety concerns include:

  • Harmful instructions or dangerous advice

  • Discriminatory or biased content

  • Privacy violations (PII exposure)

  • Misinformation that could cause harm

Important: Safety should be a gate, not a gradient. Rather than treating safety as a 1-10 score to average with other criteria, use it as a pass/fail filter where any output below threshold triggers immediate review.
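
A minimal sketch of that gating logic, assuming a 1-10 judge scale; the threshold and result fields are illustrative, not prescribed values.

```python
# Safety as a gate, not a gradient: any output below the safety
# threshold is routed to review and excluded from averaged scoring.

SAFETY_THRESHOLD = 9  # on a 1-10 judge scale; tune for your risk tolerance

def passes_safety_gate(result: dict) -> bool:
    return result["safety_score"] >= SAFETY_THRESHOLD

eval_results = [
    {"id": 1, "safety_score": 10, "accuracy_score": 7},
    {"id": 2, "safety_score": 6, "accuracy_score": 9},  # fails the gate
]
flagged = [r for r in eval_results if not passes_safety_gate(r)]
print(flagged)  # the unsafe output is flagged regardless of its accuracy
```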

Domain-Specific Evaluation Criteria

Beyond core criteria, specialized applications require domain-specific evaluation dimensions.

Medical AI Evaluation

  • Clinical accuracy: Medical facts verified against established standards

  • Appropriate caution: Doesn't overstate diagnostic certainty

  • Disclaimer inclusion: Recommends professional consultation

Legal AI Evaluation

  • Jurisdictional awareness: Notes geographic limitations

  • Disclaimer compliance: Includes "not legal advice" language

  • Qualified language: Avoids definitive legal claims

Customer Support AI Evaluation

  • Resolution focus: Actually solves the customer's problem

  • Empathy: Acknowledges customer frustration appropriately

  • Policy compliance: Follows company guidelines and procedures

Code Generation AI Evaluation

  • Syntactic correctness: Generated code compiles and runs (see the sketch after this list)

  • Logic accuracy: Code does what was requested

  • Security: No obvious vulnerabilities introduced
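
As a concrete illustration of the syntactic-correctness check: generated Python can be compiled without being executed. Actually running the code (the "runs" part) requires a sandbox and is out of scope for this sketch.

```python
# Cheap syntactic-correctness check for generated Python code:
# compile() catches syntax errors without executing anything.

def is_valid_python(source: str) -> bool:
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

print(is_valid_python("def add(a, b):\n    return a + b"))  # True
print(is_valid_python("def add(a, b) return a + b"))        # False
```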

How to Combine Multiple Evaluation Criteria

Real-world LLM evaluation requires multiple criteria working together. Three main strategies exist for combining scores:

Simple Average

All criteria weighted equally. Best when criteria are genuinely equal in importance—rare in practice but simple to implement and explain.

Weighted Scoring

Different criteria receive different weights based on business importance. For example, customer support AI might weight resolution (30%) and tone (25%) higher than raw accuracy (20%). This approach requires understanding which criteria matter most for your specific application.

Pass/Fail Gates

Certain criteria must meet minimum thresholds before other scoring applies. Safety is the classic gate criterion—no amount of accuracy compensates for unsafe content. This hybrid approach ensures critical requirements are met while allowing nuance in secondary criteria.
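
Putting weighted scoring and the safety gate together might look like the following sketch. The weights mirror the customer support example above and are illustrative, not a recommendation.

```python
# Combine per-criterion scores: apply the safety gate first, then a
# weighted average over the remaining criteria.

WEIGHTS = {"resolution": 0.30, "tone": 0.25, "accuracy": 0.20,
           "relevance": 0.15, "completeness": 0.10}

def overall_score(scores: dict, safety_pass: bool) -> float | None:
    """Weighted 1-10 score, or None when the safety gate fails."""
    if not safety_pass:
        return None  # route to human review instead of averaging
    return sum(scores[criterion] * weight for criterion, weight in WEIGHTS.items())

print(overall_score({"resolution": 8, "tone": 9, "accuracy": 7,
                     "relevance": 8, "completeness": 6}, safety_pass=True))  # 7.85
```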

The LLM Judge Calibration Challenge

A fundamental question underlies all LLM-as-a-judge evaluation: How do you know your judge model is judging correctly?

Research shows LLM judges can exhibit systematic biases:

  • Position bias: Favoring content presented first (or last) in prompts

  • Length bias: Preferring longer responses regardless of quality

  • Style bias: Favoring outputs similar to the judge model's own style

Calibration requires human-labeled examples. Create a small gold standard dataset (50-100 examples) with human scores, then verify your LLM judge correlates strongly (>0.7 correlation) with human judgment. Without this validation step, you're measuring something—but you don't know if it reflects actual quality.
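
A minimal version of that calibration check uses Spearman rank correlation between human and judge scores. The scores below are placeholders standing in for your gold standard dataset.

```python
from scipy.stats import spearmanr

# Calibration check: does the judge rank outputs the way humans do?
# Replace these placeholder lists with scores from your 50-100 example gold set.
human_scores = [9, 4, 7, 8, 3, 6, 9, 5, 2, 7]
judge_scores = [8, 5, 7, 9, 4, 6, 8, 4, 3, 6]

corr, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman correlation: {corr:.2f} (p={p_value:.3f})")

if corr < 0.7:
    print("Judge is poorly calibrated - revise the prompt or rubric.")
```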

Choosing the Right Criteria for Your Application

Match your primary concern to the criteria to prioritize:

  • Users getting correct information: Accuracy, Completeness

  • Users getting questions answered: Relevance, Completeness

  • Brand consistency at scale: Tone, Consistency

  • Avoiding PR disasters: Safety, Tone

  • Complex multi-step queries: Completeness, Accuracy

  • Creative or generative tasks: Tone, Relevance

Start focused. Begin with 2-3 criteria that directly map to your application's success metrics. Add more only when you have evidence they're needed—over-evaluation creates noise without improving signal.

Key Takeaways

  1. Match criteria to your use case: Customer support needs different evaluation than code generation

  2. Treat safety as a gate, not a score: Use pass/fail filtering, not averaging

  3. Consistency requires baselines: You can't measure drift without knowing what you're drifting from

  4. Calibrate against human judgment: Validate that your LLM judge agrees with human evaluators

  5. Start with fewer criteria: 2-3 well-chosen criteria beat 10 poorly-defined ones

The goal isn't comprehensive evaluation—it's meaningful evaluation that tells you whether your AI is doing its job.

Next Steps

Ready to implement LLM evaluation criteria in your data warehouse? The dbt™-llm-evals package supports all core criteria and runs natively on Snowflake Cortex, BigQuery Vertex AI, and Databricks AI Functions—with zero data egress.

Interested to Learn More?
Try Out the Free 14-Day Trial

Experience Analytics for the AI-Era

Start your 14-day trial today - it's free and no credit card needed


Copyright © 2025 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.
