Get Started with dbt™-llm-evals: Warehouse-Native LLM Evaluation in 15 Minutes
A hands-on tutorial for monitoring AI quality without data egress using dbt™
Fabio Di Leta · Jan 30, 2026 · 10 min read
What is dbt™-llm-evals?
dbt™-llm-evals is an open-source package that brings LLM evaluation directly into your data warehouse. Instead of sending data to external APIs, it uses your warehouse's native AI functions—Snowflake Cortex, BigQuery Vertex AI, or Databricks AI Functions—to evaluate AI outputs where your data already lives.
Why warehouse-native evaluation matters:
Zero data egress: Sensitive data never leaves your environment
No external APIs: One less dependency to manage
Native dbt™ integration: Works with your existing workflows
Automatic baselines: No manual curation required
This tutorial walks through installing the package, configuring your first evaluation, and viewing quality scores—all in about 15 minutes.
Prerequisites
Before starting, you'll need:
A dbt™ project connected to Snowflake, BigQuery, or Databricks
Warehouse AI functions enabled (Cortex, Vertex AI, or AI Functions)
Basic familiarity with dbt™ models and configuration
The package is configured through dbt variables (set in your dbt_project.yml):
llm_evals_judge_model: The AI model that evaluates your outputs
llm_evals_criteria: Quality dimensions to measure (accuracy, relevance, tone, completeness)
llm_evals_sampling_rate: Fraction of outputs to evaluate (0.1 = 10%)
llm_evals_pass_threshold: Minimum score considered "passing" on the 1-10 scale
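These variables typically live under vars in dbt_project.yml. A minimal sketch with illustrative values (the judge model name and defaults here are assumptions; check the package README for the exact supported values):

```yaml
# dbt_project.yml (illustrative values, not the package's defaults)
vars:
  llm_evals_judge_model: 'claude-3-5-sonnet'                        # any model your warehouse AI functions expose
  llm_evals_criteria: ['accuracy', 'relevance', 'tone', 'completeness']
  llm_evals_sampling_rate: 0.2                                      # evaluate 20% of outputs
  llm_evals_pass_threshold: 7                                       # scores below 7 count as failing
```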
Step 4: Configure Your AI Model
The package uses a post-hook to capture AI outputs automatically. Add the configuration to any model that generates AI content.
Create the YAML configuration:
# models/ai_examples/_customer_support_responses.yml
version: 2

models:
  - name: customer_support_responses
    description: "AI-generated customer support responses with automatic quality evaluation"
    config:
      materialized: table
      post_hook: "{{ dbt_llm_evals.capture_and_evaluate() }}"
    meta:
      llm_evals:
        enabled: true
        baseline_version: 'v1.0'
        input_columns:
          - customer_question
          - customer_context
          - ticket_category
        output_column: 'ai_response'
        prompt: |
          You are a helpful customer support agent. Respond to this customer
          question professionally and helpfully.

          Customer Context: {customer_context}
          Category: {ticket_category}
          Question: {customer_question}

          Response:
        sampling_rate: 0.2
Key configuration options:
enabled: Turn evaluation on/off for this model
baseline_version: Version tag for baseline comparison
input_columns: Which columns contain the AI's input
output_column: Which column contains the AI's output
prompt: The prompt template used (helps the judge understand context)
Step 5: Create Your AI Model
Here's the complete example model that generates AI responses. Choose the version matching your warehouse.
Snowflake (using Cortex)
-- models/ai_examples/customer_support_responses.sql
with support_tickets as (
    select
        ticket_id,
        customer_id,
        customer_name,
        customer_question,
        ticket_category,
        ticket_priority,
        customer_tier,
        concat(
            'Customer: ', customer_name,
            ', Tier: ', customer_tier,
            ', Previous interactions: ', previous_interaction_count
        ) as customer_context,
        -- Pre-calculate the prompt
        concat(
            'You are a helpful customer support agent. ',
            'Respond to this customer question professionally and helpfully.\n\n',
            'Customer Context: ',
            concat(
                'Customer: ', customer_name,
                ', Tier: ', customer_tier,
                ', Previous interactions: ', previous_interaction_count
            ), '\n',
            'Category: ', ticket_category, '\n',
            'Question: ', customer_question, '\n\n',
            'Response:'
        ) as ai_prompt
    from {{ ref('support_tickets_seed') }}
    where status = 'pending_ai_response'
)

select
    ticket_id,
    customer_id,
    customer_name,
    customer_question,
    ticket_category,
    customer_context,
    -- Call Snowflake Cortex AI function
    AI_COMPLETE(
        '{{ var("llm_evals_judge_model") }}',
        ai_prompt
    ) as ai_response,
    current_timestamp() as generated_at
from support_tickets
BigQuery (using Vertex AI)
-- models/ai_examples/customer_support_responses.sql
with support_tickets as (
    select
        ticket_id,
        customer_id,
        customer_name,
        customer_question,
        ticket_category,
        ticket_priority,
        customer_tier,
        concat(
            'Customer: ', customer_name,
            ', Tier: ', customer_tier,
            ', Previous interactions: ', previous_interaction_count
        ) as customer_context,
        -- Pre-calculate the prompt
        concat(
            'You are a helpful customer support agent. ',
            'Respond to this customer question professionally and helpfully.\n\n',
            'Customer Context: ',
            concat(
                'Customer: ', customer_name,
                ', Tier: ', customer_tier,
                ', Previous interactions: ', previous_interaction_count
            ), '\n',
            'Category: ', ticket_category, '\n',
            'Question: ', customer_question, '\n\n',
            'Response:'
        ) as ai_prompt
    from {{ ref('support_tickets_seed') }}
    where status = 'pending_ai_response'
)

select
    ticket_id,
    customer_id,
    customer_name,
    customer_question,
    ticket_category,
    customer_context,
    -- Call BigQuery Vertex AI function
    AI.GENERATE(
        ai_prompt,
        connection_id => '{{ var("ai_connection_id") }}',
        endpoint => '{{ var("ai_endpoint") }}'
    ).result as ai_response,
    current_timestamp() as generated_at
from support_tickets
Databricks (using AI Functions)
-- models/ai_examples/customer_support_responses.sql
with support_tickets as (
    select
        ticket_id,
        customer_id,
        customer_name,
        customer_question,
        ticket_category,
        ticket_priority,
        customer_tier,
        concat(
            'Customer: ', customer_name,
            ', Tier: ', customer_tier,
            ', Previous interactions: ', previous_interaction_count
        ) as customer_context,
        -- Pre-calculate the prompt
        concat(
            'You are a helpful customer support agent. ',
            'Respond to this customer question professionally and helpfully.\n\n',
            'Customer Context: ',
            concat(
                'Customer: ', customer_name,
                ', Tier: ', customer_tier,
                ', Previous interactions: ', previous_interaction_count
            ), '\n',
            'Category: ', ticket_category, '\n',
            'Question: ', customer_question, '\n\n',
            'Response:'
        ) as ai_prompt
    from {{ ref('support_tickets_seed') }}
    where status = 'pending_ai_response'
)

select
    ticket_id,
    customer_id,
    customer_name,
    customer_question,
    ticket_category,
    customer_context,
    -- Call Databricks AI function
    ai_query(
        '{{ var("llm_evals_judge_model") }}',
        ai_prompt
    ) as ai_response,
    current_timestamp() as generated_at
from support_tickets
Step 6: Run Your Model
Execute your AI model. The post-hook automatically captures outputs:
dbt run --select customer_support_responses
What happens on first run:
Your AI model generates responses
The post-hook detects no baseline exists
It creates a baseline from the current outputs
Future runs compare against this baseline
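The create-once, compare-thereafter behavior described above can be sketched in a few lines. This is an illustrative Python model of the logic, not the package's actual implementation:

```python
# Illustrative sketch of automatic baseline management (not the package's real code).
def get_or_create_baseline(store: dict, model: str, version: str, outputs: list) -> tuple:
    """Return (baseline, created) for a model/version pair.

    On the first run the current outputs become the baseline; on later runs
    the stored baseline is returned unchanged so new outputs can be compared
    against it.
    """
    key = (model, version)
    if key not in store:
        store[key] = list(outputs)   # first run: snapshot current outputs
        return store[key], True
    return store[key], False         # later runs: keep the existing snapshot

store = {}
baseline, created = get_or_create_baseline(
    store, "customer_support_responses", "v1.0", ["response A"]
)
# created is True; a later run with different outputs still returns ["response A"]
```

Bumping the version key (as with baseline_version: 'v2.0' later in this tutorial) creates a fresh snapshot without touching the old one.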
On the first run, the log output will confirm that a baseline was created.
Step 7: Run Evaluations
Process the captured data through the evaluation engine:
dbt run --select
This runs all evaluation models:
Generates judge prompts with context
Calls your warehouse's AI function to score outputs
Stores scores and reasoning
Step 8: View Results
Query the evaluation results to see how your AI is performing.
Performance summary:
select *
from llm_evals.llm_evals__performance_summary
order by eval_date desc
Individual scores with reasoning:
select
c.input_data,
c.output_data,
e.criterion,
e.score,
e.reasoning
from llm_evals.llm_evals__captures c
join llm_evals.llm_evals__judge_evaluations e
on c.capture_id = e.capture_id
order by e.evaluated_at desc
Find low-scoring outputs:
select
c.input_data,
c.output_data,
e.criterion,
e.score,
e.reasoning
from llm_evals.llm_evals__captures c
join llm_evals.llm_evals__judge_evaluations e
on c.capture_id = e.capture_id
where e.score < 5
order by e.score asc
Monitor for drift:
select *
from llm_evals.llm_evals__drift_detection
where drift_status in ('WARNING', 'ALERT')
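The drift statuses follow an intuitive rule: flag when recent average scores fall too far below the baseline average. A hedged Python sketch of that idea (the package's actual thresholds and statistics may differ):

```python
# Illustrative drift check; the cutoff values here are assumptions, not the
# package's real thresholds.
def drift_status(baseline_avg: float, recent_avg: float,
                 warn_drop: float = 1.0, alert_drop: float = 2.0) -> str:
    """Classify drift by how far recent scores fell below baseline (1-10 scale)."""
    drop = baseline_avg - recent_avg
    if drop >= alert_drop:
        return "ALERT"
    if drop >= warn_drop:
        return "WARNING"
    return "OK"

print(drift_status(8.2, 7.9))  # small dip
print(drift_status(8.2, 6.9))  # noticeable drop
print(drift_status(8.2, 5.5))  # severe drop
```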
What's Being Evaluated?
The package uses an LLM-as-a-Judge approach. For each captured output, a judge model:
Receives the original input, output, and prompt context
Compares against baseline examples
Scores on each criterion (1-10 scale)
Provides reasoning for the score
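Conceptually, each evaluation is just a prompt to the judge model plus score extraction from its reply. A minimal sketch of those two pieces (the prompt wording and reply format here are assumptions, not the package's actual templates):

```python
import re

# Illustrative judge prompt builder and score parser; the exact format the
# package uses is an assumption here.
def build_judge_prompt(criterion: str, input_text: str,
                       output_text: str, baseline_example: str) -> str:
    """Assemble a judge prompt containing input, output, and baseline context."""
    return (
        f"You are evaluating an AI response for {criterion} on a 1-10 scale.\n"
        f"Baseline example of acceptable quality:\n{baseline_example}\n\n"
        f"Input: {input_text}\nOutput: {output_text}\n\n"
        "Reply as: Score: <1-10>\nReasoning: <one sentence>"
    )

def parse_score(judge_reply: str):
    """Pull the 1-10 score out of a 'Score: N' style reply; None if absent or invalid."""
    match = re.search(r"Score:\s*(\d{1,2})", judge_reply)
    if match:
        score = int(match.group(1))
        return score if 1 <= score <= 10 else None
    return None
```

In the package, the prompt-building and parsing happen inside the warehouse via the AI function call, but the shape of the work is the same.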
Default evaluation criteria:
Accuracy: Factual correctness
Relevance: Addresses the input directly
Tone: Appropriate style and professionalism
Completeness: Fully addresses all aspects
Automatic Baseline Management
The package handles baselines automatically:
First run: Creates baseline from initial outputs (no manual setup needed)
Subsequent runs: Compares new outputs against the baseline
New baseline version: Change baseline_version in your config:
meta:
  llm_evals:
    baseline_version: 'v2.0'  # Creates new baseline
Force refresh: Add force_rebaseline: true to recreate the current version.
Scheduling in Production
For ongoing monitoring, schedule evaluation runs after your AI models.
With Paradime Bolt: Create a job that runs:
dbt run --select customer_support_responses
dbt run --select
Troubleshooting
Evaluations not running?
Check llm_evals.enabled: true in your model's meta config
Verify the post-hook is configured
Ensure you ran dbt run --select llm_evals__setup first
No captures appearing?
Check your sampling rate isn't set to 0
Verify the input_columns and output_column match your model's columns
Judge returning errors?
Ensure your warehouse AI functions are enabled
Check the judge model name matches your warehouse's available models
Next Steps
You now have warehouse-native LLM evaluation running. Here's what to explore next:
Add more criteria: Customize llm_evals_criteria for your use case
Adjust sampling: Increase sampling_rate for critical models
Set up alerts: Query llm_evals__drift_detection in your monitoring tools
Create dashboards: Build visualizations from llm_evals__performance_summary
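For a first dashboard, a daily pass rate per criterion is a natural starting point. A sketch, assuming the column names shown in Step 8 and the llm_evals_pass_threshold variable (verify the actual schema in your deployment):

```sql
-- Daily pass rate per criterion (column names assumed from Step 8's examples)
select
    date(e.evaluated_at) as eval_date,
    e.criterion,
    avg(e.score) as avg_score,
    avg(case when e.score >= {{ var("llm_evals_pass_threshold") }}
             then 1.0 else 0.0 end) as pass_rate
from llm_evals.llm_evals__judge_evaluations e
group by 1, 2
order by 1 desc, 2
```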
*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.