Get Started with dbt™-llm-evals: Warehouse-Native LLM Evaluation in 15 Minutes
A hands-on tutorial for monitoring AI quality with dbt™ and zero data egress

Fabio Di Leta
Jan 30, 2026 · 10 min read
What is dbt™-llm-evals?
dbt™-llm-evals is an open-source package that brings LLM evaluation directly into your data warehouse. Instead of sending data to external APIs, it uses your warehouse's native AI functions—Snowflake Cortex, BigQuery Vertex AI, or Databricks AI Functions—to evaluate AI outputs where your data already lives.
Why warehouse-native evaluation matters:
Zero data egress: Sensitive data never leaves your environment
No external APIs: One less dependency to manage
Native dbt™ integration: Works with your existing workflows
Automatic baselines: No manual curation required
This tutorial walks through installing the package, configuring your first evaluation, and viewing quality scores—all in about 15 minutes.
Prerequisites
Before starting, you'll need:
A dbt™ project connected to Snowflake, BigQuery, or Databricks
Warehouse AI functions enabled (Cortex, Vertex AI, or AI Functions)
Basic familiarity with dbt™ models and configuration
Step 1: Install the Package
Add dbt™-llm-evals to your packages.yml:
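A minimal entry might look like the following sketch, assuming you install straight from the GitHub repository listed in the Resources section (pin the revision to a tagged release in practice):

```yaml
packages:
  - git: "https://github.com/paradime-io/dbt-llm-evals.git"
    revision: main  # pin to a tagged release in practice; main is just a placeholder
```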
Then install dependencies:
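This is the standard dbt command that pulls declared packages into your project:

```bash
dbt deps
```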
Step 2: Run Setup
The package needs storage tables for captures and baselines. Create them with:
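The Troubleshooting section below references the setup selector, so the command is:

```bash
dbt run --select llm_evals__setup
```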
This creates two tables in your target schema:
raw_captures: Stores AI inputs, outputs, and prompts
raw_baselines: Stores baseline examples for comparison
Step 3: Configure Global Variables
Add configuration to your dbt_project.yml. The key settings are the judge model and evaluation criteria.
For Snowflake:
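A sketch of the vars block, using the variable names explained below; the judge model shown (mistral-large2) is just one model available to Snowflake Cortex and may differ from the package's default:

```yaml
# dbt_project.yml (Snowflake): example values, not the package's defaults
vars:
  llm_evals_judge_model: 'mistral-large2'   # any model supported by SNOWFLAKE.CORTEX.COMPLETE
  llm_evals_criteria: ['accuracy', 'relevance', 'tone', 'completeness']
  llm_evals_sampling_rate: 0.1              # evaluate 10% of outputs
  llm_evals_pass_threshold: 7               # minimum passing score on the 1-10 scale
```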
For BigQuery:
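The same sketch for BigQuery; the Gemini model name is an assumption:

```yaml
# dbt_project.yml (BigQuery): example values, not the package's defaults
vars:
  llm_evals_judge_model: 'gemini-1.5-pro'   # a Vertex AI model your project can call
  llm_evals_criteria: ['accuracy', 'relevance', 'tone', 'completeness']
  llm_evals_sampling_rate: 0.1
  llm_evals_pass_threshold: 7
```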
For Databricks:
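And for Databricks; the serving endpoint name is an assumption:

```yaml
# dbt_project.yml (Databricks): example values, not the package's defaults
vars:
  llm_evals_judge_model: 'databricks-meta-llama-3-3-70b-instruct'  # an endpoint reachable by ai_query
  llm_evals_criteria: ['accuracy', 'relevance', 'tone', 'completeness']
  llm_evals_sampling_rate: 0.1
  llm_evals_pass_threshold: 7
```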
What these settings mean:
llm_evals_judge_model: The AI model that evaluates your outputs
llm_evals_criteria: Quality dimensions to measure (accuracy, relevance, tone, completeness)
llm_evals_sampling_rate: Percentage of outputs to evaluate (0.1 = 10%)
llm_evals_pass_threshold: Minimum score considered "passing" (1-10 scale)
Step 4: Configure Your AI Model
The package uses a post-hook to capture AI outputs automatically. Add the configuration to any model that generates AI content.
Create the YAML configuration:
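A sketch of a schema.yml entry using the keys explained below; the model name, column names, and prompt are hypothetical, and the capture post-hook itself comes from the package (see its docs for the exact macro name):

```yaml
# models/schema.yml: hypothetical model and columns
models:
  - name: customer_support_responses
    meta:
      llm_evals:
        enabled: true                 # turn evaluation on for this model
        baseline_version: v1          # version tag for baseline comparison
        input_columns: ['ticket_subject', 'ticket_body']
        output_column: ai_response
        prompt: 'You are a support agent. Draft a helpful, professional reply to the ticket.'
```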
Key configuration options:
enabled: Turn evaluation on/off for this model
baseline_version: Version tag for baseline comparison
input_columns: Which columns contain the AI's input
output_column: Which column contains the AI's output
prompt: The prompt template used (helps the judge understand context)
Step 5: Create Your AI Model
Here's a complete example model that generates AI responses. Choose the version matching your warehouse.
Snowflake (using Cortex)
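A sketch using SNOWFLAKE.CORTEX.COMPLETE; the model file name, source table, and columns are hypothetical:

```sql
-- models/customer_support_responses.sql (Snowflake)
select
    ticket_id,
    ticket_subject,
    ticket_body,
    snowflake.cortex.complete(
        'mistral-large2',
        'You are a support agent. Draft a helpful, professional reply to: ' || ticket_body
    ) as ai_response
from {{ ref('support_tickets') }}
```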
BigQuery (using Vertex AI)
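A sketch using ML.GENERATE_TEXT; it assumes a remote model over a Vertex AI endpoint already exists in your dataset, and the table and column names are hypothetical:

```sql
-- models/customer_support_responses.sql (BigQuery)
select
    ticket_id,
    ticket_subject,
    ticket_body,
    ml_generate_text_llm_result as ai_response
from ML.GENERATE_TEXT(
    MODEL `my_project.my_dataset.gemini_remote_model`,
    (
        select
            ticket_id,
            ticket_subject,
            ticket_body,
            concat(
                'You are a support agent. Draft a helpful, professional reply to: ',
                ticket_body
            ) as prompt
        from {{ ref('support_tickets') }}
    ),
    STRUCT(true as flatten_json_output)
)
```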
Databricks (using AI Functions)
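A sketch using the ai_query function; the serving endpoint name, table, and columns are hypothetical:

```sql
-- models/customer_support_responses.sql (Databricks)
select
    ticket_id,
    ticket_subject,
    ticket_body,
    ai_query(
        'databricks-meta-llama-3-3-70b-instruct',
        concat('You are a support agent. Draft a helpful, professional reply to: ', ticket_body)
    ) as ai_response
from {{ ref('support_tickets') }}
```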
Step 6: Run Your Model
Execute your AI model. The post-hook automatically captures outputs:
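Using the hypothetical model name from the earlier sketches:

```bash
dbt run --select customer_support_responses
```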
What happens on first run:
Your AI model generates responses
The post-hook detects no baseline exists
It creates a baseline from the current outputs
Future runs compare against this baseline
You should see output like:
Step 7: Run Evaluations
Process the captured data through the evaluation engine:
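One way to do this is dbt's package selector; the package name shown is an assumption, so check the name declared in the package's dbt_project.yml:

```bash
dbt run --select package:dbt_llm_evals
```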
This runs all evaluation models:
Generates judge prompts with context
Calls your warehouse's AI function to score outputs
Stores scores and reasoning
Step 8: View Results
Query the evaluation results to see how your AI is performing.
Performance summary:
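The performance summary model is referenced in the Next Steps section; the schema name here is a placeholder for wherever your dbt target writes models:

```sql
select *
from analytics.llm_evals__performance_summary;
```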
Individual scores with reasoning:
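A sketch; the results model name (llm_evals__evaluation_results) is an assumption, so check the package's models for the actual name:

```sql
select *
from analytics.llm_evals__evaluation_results
limit 50;
```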
Find low-scoring outputs:
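Building on the same assumed results model, filter an assumed score column against the pass threshold configured in Step 3:

```sql
select *
from analytics.llm_evals__evaluation_results
where score < 7
limit 50;
```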
Monitor for drift:
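The drift detection model is also referenced in Next Steps:

```sql
select *
from analytics.llm_evals__drift_detection;
```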
What's Being Evaluated?
The package uses an LLM-as-a-Judge approach. For each captured output, a judge model:
Receives the original input, output, and prompt context
Compares against baseline examples
Scores on each criterion (1-10 scale)
Provides reasoning for the score
Default evaluation criteria:
| Criterion | What It Measures |
|---|---|
| Accuracy | Factual correctness |
| Relevance | Addresses the input directly |
| Tone | Appropriate style and professionalism |
| Completeness | Fully addresses all aspects |
Automatic Baseline Management
The package handles baselines automatically:
First run: Creates baseline from initial outputs (no manual setup needed)
Subsequent runs: Compares new outputs against the baseline
New baseline version: Change baseline_version in your config (see the snippet after this list)
Force refresh: Add force_rebaseline: true to recreate the current version.
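A sketch of the version bump in the model's meta config, using the keys from Step 4:

```yaml
meta:
  llm_evals:
    baseline_version: v2   # bump from v1; the next run builds a fresh baseline for this version
```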
Scheduling in Production
For ongoing monitoring, schedule evaluation runs after your AI models.
With Paradime Bolt: Create a job that runs:
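A sketch of the job's commands, reusing the hypothetical model name and package selector from earlier steps:

```bash
dbt run --select customer_support_responses   # generate fresh AI outputs (captured by the post-hook)
dbt run --select package:dbt_llm_evals        # then score the captures with the judge model
```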
Troubleshooting
Evaluations not running?
Check llm_evals.enabled: true in your model's meta config
Verify the post-hook is configured
Ensure you ran dbt run --select llm_evals__setup first
No captures appearing?
Check your sampling rate isn't set to 0
Verify the input_columns and output_column match your model's columns
Judge returning errors?
Ensure your warehouse AI functions are enabled
Check the judge model name matches your warehouse's available models
Next Steps
You now have warehouse-native LLM evaluation running. Here's what to explore next:
Add more criteria: Customize llm_evals_criteria for your use case
Adjust sampling: Increase sampling_rate for critical models
Set up alerts: Query llm_evals__drift_detection in your monitoring tools
Create dashboards: Build visualizations from llm_evals__performance_summary
Resources
GitHub: paradime-io/dbt-llm-evals
Full documentation: Package Overview
Architecture deep-dive: Architecture Docs
This post is part of a series on LLM evaluation:
Get Started with dbt™-llm-evals (this post)
Star the dbt™-llm-evals repo if this helped! ⭐





