Get Started with dbt™-llm-evals: Warehouse-Native LLM Evaluation in 15 Minutes

A hands-on tutorial for monitoring AI quality without data egress using dbt™


Fabio Di Leta · Jan 30, 2026 · 10 min read

What is dbt™-llm-evals?

dbt™-llm-evals is an open-source package that brings LLM evaluation directly into your data warehouse. Instead of sending data to external APIs, it uses your warehouse's native AI functions—Snowflake Cortex, BigQuery Vertex AI, or Databricks AI Functions—to evaluate AI outputs where your data already lives.

Why warehouse-native evaluation matters:

  • Zero data egress: Sensitive data never leaves your environment

  • No external APIs: One less dependency to manage

  • Native dbt™ integration: Works with your existing workflows

  • Automatic baselines: No manual curation required

This tutorial walks through installing the package, configuring your first evaluation, and viewing quality scores—all in about 15 minutes.

Prerequisites

Before starting, you'll need:

  • A dbt™ project connected to Snowflake, BigQuery, or Databricks

  • Warehouse AI functions enabled (Cortex, Vertex AI, or AI Functions)

  • Basic familiarity with dbt™ models and configuration

Step 1: Install the Package

Add dbt™-llm-evals to your packages.yml:

# packages.yml
packages:
  - git: "https://github.com/paradime-io/dbt-llm-evals.git"
    revision: 1.0.0

Then install dependencies:
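
dbt deps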

Step 2: Run Setup

The package needs storage tables for captures and baselines. Create them with:

dbt run --select llm_evals__setup

This creates two tables in your target schema:

  • raw_captures: Stores AI inputs, outputs, and prompts

  • raw_baselines: Stores baseline examples for comparison
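
To confirm the setup ran, you can query the new tables directly. The llm_evals schema below matches how later examples in this post qualify the package's tables; adjust it to your target schema:

select count(*) as capture_rows from llm_evals.raw_captures;
select count(*) as baseline_rows from llm_evals.raw_baselines;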

Step 3: Configure Global Variables

Add configuration to your dbt_project.yml. The key settings are the judge model and evaluation criteria.

For Snowflake:

vars:
  llm_evals_judge_model: 'mistral-large2'  # any model available to Cortex AI_COMPLETE in your region
  llm_evals_criteria: '["accuracy", "relevance", "tone", "completeness"]'
  llm_evals_sampling_rate: 0.1
  llm_evals_pass_threshold: 7

For BigQuery:

vars:
  llm_evals_judge_model: 'gemini-2.5-flash'
  llm_evals_criteria: '["accuracy", "relevance", "tone", "completeness"]'
  llm_evals_sampling_rate: 0.1
  llm_evals_pass_threshold: 7
  gcp_project_id: 'your-project'
  gcp_location: 'us-central1'
  ai_connection_id: 'projects/your-project/locations/us-central1/connections/your-connection'
  ai_endpoint: 'gemini-2.5-flash'

For Databricks:

vars:
  llm_evals_judge_model: 'databricks-meta-llama-3-1-8b-instruct'
  llm_evals_criteria: '["accuracy", "relevance", "tone", "completeness"]'
  llm_evals_sampling_rate: 0.1
  llm_evals_pass_threshold: 7

What these settings mean:

  • llm_evals_judge_model: The AI model that evaluates your outputs

  • llm_evals_criteria: Quality dimensions to measure (accuracy, relevance, tone, completeness)

  • llm_evals_sampling_rate: Fraction of outputs to evaluate (0.1 = 10%)

  • llm_evals_pass_threshold: Minimum score considered "passing" (1-10 scale)
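
Once evaluations have run (Step 7), you can see how scores stack up against the threshold of 7 with a query like the one below. The aggregation is illustrative only; it is not how the package computes pass/fail internally:

select
    capture_id,
    avg(score) as avg_score,
    avg(score) >= 7 as meets_threshold  -- compare against llm_evals_pass_threshold
from llm_evals.llm_evals__judge_evaluations
group by capture_id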

Step 4: Configure Your AI Model

The package uses a post-hook to capture AI outputs automatically. Add the configuration to any model that generates AI content.

Create the YAML configuration:

# models/ai_examples/_customer_support_responses.yml
version: 2

models:
  - name: customer_support_responses
    description: "AI-generated customer support responses with automatic quality evaluation"
    config:
      materialized: table
      post_hook: "{{ dbt_llm_evals.capture_and_evaluate() }}"
      meta:
        llm_evals:
          enabled: true
          baseline_version: 'v1.0'
          input_columns:
            - customer_question
            - customer_context
            - ticket_category
          output_column: 'ai_response'
          prompt: |
            You are a helpful customer support agent. 
            Respond to this customer question professionally and helpfully.
            Customer Context: {customer_context}
            Category: {ticket_category}
            Question: {customer_question}
            Response:
          sampling_rate: 0.2

Key configuration options:

  • enabled: Turn evaluation on/off for this model

  • baseline_version: Version tag for baseline comparison

  • input_columns: Which columns contain the AI's input

  • output_column: Which column contains the AI's output

  • prompt: The prompt template used (helps the judge understand context)

Step 5: Create Your AI Model

Here's the complete example model that generates AI responses. Choose the version matching your warehouse.

Snowflake (using Cortex)

-- models/ai_examples/customer_support_responses.sql
WITH support_tickets AS (
    select
        ticket_id,
        customer_id,
        customer_name,
        customer_question,
        ticket_category,
        ticket_priority,
        customer_tier,
        
        concat(
            'Customer: ', customer_name,
            ', Tier: ', customer_tier,
            ', Previous interactions: ', previous_interaction_count
        ) as customer_context,
        
        -- Pre-calculate prompt
        concat(
            'You are a helpful customer support agent. ',
            'Respond to this customer question professionally and helpfully.\n\n',
            'Customer Context: ', 
            concat(
                'Customer: ', customer_name,
                ', Tier: ', customer_tier,
                ', Previous interactions: ', previous_interaction_count
            ), '\n',
            'Category: ', ticket_category, '\n',
            'Question: ', customer_question, '\n\n',
            'Response:'
        ) as ai_prompt
        
    from {{ ref('support_tickets_seed') }}
    where status = 'pending_ai_response'
)

select
    ticket_id,
    customer_id,
    customer_name,
    customer_question,
    ticket_category,
    customer_context,
    
    -- Call Snowflake Cortex AI function
    AI_COMPLETE(
        '{{ var("llm_evals_judge_model") }}',
        ai_prompt
    ) as ai_response,
    
    current_timestamp() as generated_at
    
from support_tickets

BigQuery (using Vertex AI)

-- models/ai_examples/customer_support_responses.sql
WITH support_tickets AS (
    select
        ticket_id,
        customer_id,
        customer_name,
        customer_question,
        ticket_category,
        ticket_priority,
        customer_tier,
        
        concat(
            'Customer: ', customer_name,
            ', Tier: ', customer_tier,
            ', Previous interactions: ', previous_interaction_count
        ) as customer_context,
        
        -- Pre-calculate prompt
        concat(
            'You are a helpful customer support agent. ',
            'Respond to this customer question professionally and helpfully.\n\n',
            'Customer Context: ', 
            concat(
                'Customer: ', customer_name,
                ', Tier: ', customer_tier,
                ', Previous interactions: ', previous_interaction_count
            ), '\n',
            'Category: ', ticket_category, '\n',
            'Question: ', customer_question, '\n\n',
            'Response:'
        ) as ai_prompt
        
    from {{ ref('support_tickets_seed') }}
    where status = 'pending_ai_response'
)

select
    ticket_id,
    customer_id,
    customer_name,
    customer_question,
    ticket_category,
    customer_context,
    
    -- Call BigQuery Vertex AI function
    AI.GENERATE(
        ai_prompt,
        connection_id => '{{ var("ai_connection_id") }}',
        endpoint => '{{ var("ai_endpoint") }}'
    ).result as ai_response,
    
    current_timestamp() as generated_at
    
from support_tickets

Databricks (using AI Functions)

-- models/ai_examples/customer_support_responses.sql
WITH support_tickets AS (
    select
        ticket_id,
        customer_id,
        customer_name,
        customer_question,
        ticket_category,
        ticket_priority,
        customer_tier,
        
        concat(
            'Customer: ', customer_name,
            ', Tier: ', customer_tier,
            ', Previous interactions: ', previous_interaction_count
        ) as customer_context,
        
        -- Pre-calculate prompt
        concat(
            'You are a helpful customer support agent. ',
            'Respond to this customer question professionally and helpfully.\n\n',
            'Customer Context: ', 
            concat(
                'Customer: ', customer_name,
                ', Tier: ', customer_tier,
                ', Previous interactions: ', previous_interaction_count
            ), '\n',
            'Category: ', ticket_category, '\n',
            'Question: ', customer_question, '\n\n',
            'Response:'
        ) as ai_prompt
        
    from {{ ref('support_tickets_seed') }}
    where status = 'pending_ai_response'
)

select
    ticket_id,
    customer_id,
    customer_name,
    customer_question,
    ticket_category,
    customer_context,
    
    -- Call Databricks AI function
    ai_query(
        '{{ var("llm_evals_judge_model") }}',
        ai_prompt
    ) as ai_response,
    
    current_timestamp() as generated_at
    
from support_tickets

Step 6: Run Your Model

Execute your AI model. The post-hook automatically captures outputs:

dbt run --select customer_support_responses

What happens on first run:

  1. Your AI model generates responses

  2. The post-hook detects no baseline exists

  3. It creates a baseline from the current outputs

  4. Future runs compare against this baseline
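
To confirm a baseline was written, query the baselines table created in Step 2 (schema-qualified as in the other examples; adjust to your target schema):

select *
from llm_evals.raw_baselines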

Step 7: Run Evaluations

Process the captured data through the evaluation engine:

dbt run --select

This runs all evaluation models:

  • Generates judge prompts with context

  • Calls your warehouse's AI function to score outputs

  • Stores scores and reasoning

Step 8: View Results

Query the evaluation results to see how your AI is performing.

Performance summary:

select * 
from llm_evals.llm_evals__performance_summary
order by eval_date desc

Individual scores with reasoning:

select 
    c.input_data,
    c.output_data,
    e.criterion,
    e.score,
    e.reasoning
from llm_evals.llm_evals__captures c
join llm_evals.llm_evals__judge_evaluations e 
    on c.capture_id = e.capture_id
order by e.evaluated_at desc

Find low-scoring outputs:

select 
    c.input_data,
    c.output_data,
    e.criterion,
    e.score,
    e.reasoning
from llm_evals.llm_evals__captures c
join llm_evals.llm_evals__judge_evaluations e 
    on c.capture_id = e.capture_id
where e.score < 5
order by e.score asc

Monitor for drift:

select * 
from llm_evals.llm_evals__drift_detection
where drift_status in ('WARNING', 'ALERT')

What's Being Evaluated?

The package uses an LLM-as-a-Judge approach. For each captured output, a judge model:

  1. Receives the original input, output, and prompt context

  2. Compares against baseline examples

  3. Scores on each criterion (1-10 scale)

  4. Provides reasoning for the score
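
As a rough sketch of what such a judge call looks like inside a dbt model (Snowflake syntax, using the same AI_COMPLETE function as above; the prompt shown is illustrative, not the exact prompt the package generates):

-- Illustrative judge call; the package builds its own prompt and parses the result internally
select
    c.capture_id,
    AI_COMPLETE(
        '{{ var("llm_evals_judge_model") }}',
        concat(
            'You are an impartial judge. Score the response from 1 to 10 for accuracy, ',
            'relevance, tone, and completeness, and briefly explain your reasoning.\n\n',
            'Input: ', c.input_data, '\n',
            'Response: ', c.output_data
        )
    ) as judge_output
from llm_evals.llm_evals__captures c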

Default evaluation criteria:

  • Accuracy: Factual correctness

  • Relevance: Addresses the input directly

  • Tone: Appropriate style and professionalism

  • Completeness: Fully addresses all aspects

Automatic Baseline Management

The package handles baselines automatically:

First run: Creates baseline from initial outputs (no manual setup needed)

Subsequent runs: Compares new outputs against the baseline

New baseline version: Change baseline_version in your config:

meta:
  llm_evals:
    baseline_version: 'v2.0'  # Creates new baseline

Force refresh: Add force_rebaseline: true to recreate the current version.
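
For example, assuming it sits alongside the other llm_evals keys shown above:

meta:
  llm_evals:
    baseline_version: 'v1.0'
    force_rebaseline: true  # recreate the baseline for this version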

Scheduling in Production

For ongoing monitoring, schedule evaluation runs after your AI models.

With Paradime Bolt: Create a job that runs:

dbt run --select customer_support_responses
dbt run --select

Troubleshooting

Evaluations not running?

  • Check llm_evals.enabled: true in your model's meta config

  • Verify the post-hook is configured

  • Ensure you ran dbt run --select llm_evals__setup first

No captures appearing?

  • Check your sampling rate isn't set to 0

  • Verify the input_columns and output_column match your model's columns

Judge returning errors?

  • Ensure your warehouse AI functions are enabled

  • Check the judge model name matches your warehouse's available models

Next Steps

You now have warehouse-native LLM evaluation running. Here's what to explore next:

  • Add more criteria: Customize llm_evals_criteria for your use case (see the example after this list)

  • Adjust sampling: Increase sampling_rate for critical models

  • Set up alerts: Query llm_evals__drift_detection in your monitoring tools

  • Create dashboards: Build visualizations from llm_evals__performance_summary
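
For example, adding a criterion is just an edit to the var in dbt_project.yml (the extra "safety" criterion here is illustrative):

vars:
  llm_evals_criteria: '["accuracy", "relevance", "tone", "completeness", "safety"]'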

Resources

This post is part of a series on LLM evaluation:

  1. What is LLM-as-a-Judge? A Guide to AI Quality Evaluation

  2. LLM Evaluation Criteria: How to Measure AI Quality

  3. Get Started with dbt™-llm-evals (this post)

Star the dbt™-llm-evals repo if this helped! ⭐


Copyright © 2025 Paradime Labs, Inc.

Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.
