Optimizing dbt™ Incremental Models for Performance
Feb 26, 2026
When your dbt™ project grows from a handful of models to hundreds—and your source tables balloon from millions to billions of rows—full table rebuilds stop being viable. Pipeline runs stretch from minutes to hours, warehouse bills spike, and stakeholders wait longer for fresh data.
dbt™ incremental models solve this by processing only what's new or changed. But configuring them correctly is the difference between a pipeline that hums along in seconds and one that's slower (and costlier) than the table materialization it replaced.
This guide walks through everything you need to know: how dbt™ incremental models work, which incremental strategy fits your warehouse, how to avoid common pitfalls, and how to monitor performance in production.
What Are dbt™ Incremental Models?
An incremental model is a dbt™ materialization strategy that only transforms and loads new or changed data instead of rebuilding the entire table from scratch on every run. Rather than dropping and recreating a table, dbt™ inserts new rows and optionally updates existing ones—processing a fraction of the data each time.
Think of it like updating a ledger: instead of rewriting every entry from the beginning of time, you simply add today's transactions.
Here's how incremental compares to the standard table materialization:
Table materialization: Rebuilds the entire dataset on every dbt run, regardless of whether the source data changed. Simple and accurate, but expensive at scale.
Incremental materialization: Processes only new or changed rows since the last successful run. More complex to configure, but dramatically faster and cheaper for large datasets.
For small models where a full rebuild takes seconds, table materialization is perfectly fine. But once you're dealing with append-heavy event data, high-frequency pipelines, or tables with hundreds of millions of rows, incremental models become essential.
Why Incremental Models Reduce Pipeline Runtime and Costs
The core value proposition of dbt™ incremental models is efficiency. Instead of scanning and transforming your entire dataset on every run, you're only touching the delta—the new or changed rows since the last execution.
For teams running production pipelines on hourly or sub-hourly schedules, this translates directly into tangible savings:
Faster execution: Only new rows are processed, reducing run times from hours to minutes (or minutes to seconds). A model that scans 500 million rows on a full rebuild might only need to process 50,000 rows incrementally.
Lower compute costs: Less warehouse time means fewer credits on Snowflake, fewer slots on BigQuery, and lower costs on Redshift. For organizations running dozens of models multiple times per day, the savings compound quickly.
Scalability: Incremental models are essential for large, append-heavy datasets like event logs, clickstreams, IoT telemetry, and transaction histories. Without them, pipeline runtime grows linearly with total data volume—an unsustainable trajectory.
The bottom line: if your team has billions of rows and can't afford full rebuilds every hour, incremental models aren't optional—they're a requirement for production-grade dbt™ pipelines.
How to Configure a dbt™ Materialized Incremental Model
Setting up a basic incremental model requires three things: the materialized='incremental' config, a unique_key for upsert behavior, and the is_incremental() Jinja macro to filter for new rows.
Here's a minimal example:
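A minimal sketch, assuming a staging model called `stg_orders` with an `order_id` key and an `updated_at` timestamp (all names here are illustrative):

```sql
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    customer_id,
    order_status,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- On incremental runs, only pull rows newer than what's already loaded
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```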
Let's break down each piece:
materialized='incremental' tells dbt™ to build this model incrementally rather than as a full table rebuild.
unique_key='order_id' defines which column(s) dbt™ uses to match existing rows. When a row with a matching order_id already exists in the target table, dbt™ updates it. New order_id values get inserted. Without this, dbt™ defaults to append-only behavior—which creates duplicates if the same row arrives twice.
{% if is_incremental() %} wraps the filter that limits source data on incremental runs. On the first run (or after a --full-refresh), this block is skipped and the full dataset is loaded.
{{ this }} is a Jinja variable that references the current model's target table in the warehouse, letting you compare incoming data against what's already been loaded.
This pattern forms the foundation. The sections that follow cover how to choose the right incremental strategy, optimize filtering, and handle edge cases.
dbt™ Incremental Strategies Explained
The incremental_strategy config option controls how dbt™ materializes new data into the existing table. Different warehouses support different strategies, and choosing the right one can have a significant impact on performance.
You set the strategy in your model's config block:
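For example, to switch a model to the merge strategy (key name illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}
```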
Here are the four primary strategies:
Append
The simplest strategy. dbt™ inserts all new rows into the target table without checking for duplicates or updating existing records.
This is ideal for immutable, append-only event data—think clickstream events, server logs, or IoT sensor readings—where each row is unique by nature and will never need updating. Because there's no matching or deduplication overhead, append is the fastest strategy.
The trade-off: if the same row arrives twice, you'll get duplicates.
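A sketch of an append-only event model, assuming a `stg_events` staging model (names illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

select event_id, event_type, occurred_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
-- No unique_key: qualifying rows are inserted as-is, so a row that
-- arrives twice will be loaded twice
where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
```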
Merge
The merge strategy uses a SQL MERGE statement (or equivalent) to upsert rows based on the unique_key. If a row with a matching key exists in the target, dbt™ updates it. If it doesn't exist, dbt™ inserts it.
This is the default dbt™ incremental strategy for most warehouses, and it's the right choice when your source data is mutable—records can be updated after initial creation. Order statuses changing from "pending" to "shipped," user profiles being edited, or transaction amounts being corrected are all classic merge use cases.
The trade-off: merge operations can be expensive on very large tables, especially without proper clustering or indexing on the unique_key columns.
Delete+Insert
The delete+insert strategy first deletes all rows in the target table that match the incoming data (based on unique_key or specified conditions), then inserts the new rows. It achieves the same end result as merge but through two separate operations.
This strategy is particularly useful when merge performance is poor—which can happen on some warehouses with very wide tables or complex merge conditions. It's also a good fit when you need to replace entire partitions or date ranges of data rather than updating individual rows.
Insert Overwrite
Insert overwrite replaces entire partitions rather than operating on individual rows. dbt™ identifies which partitions are affected by the incoming data and overwrites those partitions completely.
This strategy is ideal for partitioned tables in BigQuery or Spark/Databricks where partition-level operations are significantly more efficient than row-level updates. If your data is naturally partitioned by date (which it often is), insert overwrite can be dramatically faster than merge for large volumes.
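A BigQuery-flavored sketch, assuming a table partitioned by day on an `occurred_at` timestamp (names and the 3-day window are illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'occurred_at', 'data_type': 'timestamp', 'granularity': 'day'}
) }}

select event_id, user_id, occurred_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
-- Only recent partitions are selected; dbt overwrites each
-- affected partition wholesale rather than merging row by row
where occurred_at >= timestamp_sub(current_timestamp(), interval 3 day)
{% endif %}
```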
Here's a summary of how the strategies compare:
| Strategy | Best For | Warehouse Support |
|---|---|---|
| Append | Immutable event logs, clickstreams | All warehouses |
| Merge | Upserts with unique keys, mutable data | Snowflake, BigQuery, Redshift, Databricks |
| Delete+Insert | Partition replacement, wide tables | Snowflake, Redshift, Databricks |
| Insert Overwrite | Large partitioned tables | BigQuery, Spark, Databricks |
Choosing the Right dbt™ Incremental Strategy for Your Warehouse
The best dbt™ incremental strategy depends on two factors: your warehouse platform and your data characteristics. There's no universal "best" strategy—it's about matching the approach to the environment.
The key trade-off is between accuracy and speed. Merge gives you precise row-level upserts, which is great for correctness but expensive at scale. Insert overwrite is faster for large volumes but only works well with partitioned data and doesn't handle row-level updates.
Here's a practical mapping of warehouse to recommended strategies:
Snowflake: Merge is the default and works well for most use cases. Switch to delete+insert for large batch updates where merge performance degrades—particularly with wide tables or complex join conditions. Use cluster_by on your unique_key columns to speed up merge lookups.
BigQuery: Insert overwrite combined with partition_by is the most efficient approach for partitioned tables. BigQuery's architecture makes partition-level operations extremely fast. Use merge only when you need row-level upserts on non-partitioned tables.
Redshift: Delete+insert often outperforms merge due to Redshift's architecture. The merge operation on Redshift is emulated (it doesn't have native MERGE syntax in older versions), making delete+insert the more performant choice for most incremental workloads.
Databricks/Spark: Insert overwrite with partition pruning is the go-to strategy for large datasets. Databricks also supports merge natively on Delta tables, which works well for smaller-to-medium tables that need upsert behavior.
When in doubt, start with the default strategy for your warehouse, measure execution time, and adjust from there. The dbt™ logs will show you exactly how long each operation takes.
How to Use is_incremental() for Efficient Row Filtering
The is_incremental() Jinja macro is where most performance gains happen in dbt™ incremental models. It controls which rows get processed on incremental runs, and getting this filter right is critical.
is_incremental() returns True only when all three conditions are met:
The model is configured with materialized='incremental'
The target table already exists in the warehouse
The --full-refresh flag was not passed
On the first run, is_incremental() returns False, so dbt™ processes the entire dataset. On subsequent runs, it returns True, and your filter kicks in to limit the data.
Here's the standard pattern using a timestamp-based filter:
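One way to write it, shown here with Snowflake/Postgres-style interval syntax and illustrative names (`stg_events`, `occurred_at`, and the `'1900-01-01'` fallback are assumptions):

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select event_id, event_type, occurred_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
where occurred_at > (
    -- coalesce guards against an empty target table;
    -- the 3-day lookback catches late-arriving rows
    select coalesce(max(occurred_at), '1900-01-01'::timestamp) - interval '3 days'
    from {{ this }}
)
{% endif %}
```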
A few important details:
Use coalesce with a fallback date to handle the edge case where the target table is empty—without it, the subquery could return NULL and filter out all rows.
The lookback window (- interval '3 days') is intentional. It accounts for late-arriving data, which we'll cover in more detail in the pitfalls section. When combined with a unique_key, reprocessing overlapping rows simply updates them without creating duplicates.
Filter on indexed or partitioned columns whenever possible. Filtering on occurred_at when the source table is partitioned by date means the warehouse can prune partitions and avoid scanning the entire source.
Handling Schema Changes with dbt™ on_schema_change
Incremental tables persist across runs, which means they can drift from your model's current schema. If you add a column to your SELECT statement but the target table doesn't have it, or if you remove a column that still exists in the target, things can break—or silently produce incorrect results.
The on_schema_change config option controls how dbt™ handles these situations:
ignore
The default behavior. dbt™ ignores any schema differences between the model's output and the existing target table. New columns in the model won't appear in the target table, and removed columns won't cause errors.
This is the safest option for avoiding disruptions, but it can lead to silent data issues—you might think a new column is being loaded when it's actually being dropped.
fail
The run fails immediately if dbt™ detects a schema difference between the model and the target table. This is the strictest option and is especially useful in CI/CD pipelines where you want to catch unexpected changes before they reach production.
append_new_columns
dbt™ adds any new columns from the model to the target table but does not remove columns that are no longer in the model. This is a good middle ground—your schema evolves forward without breaking existing downstream queries that might depend on older columns.
Note that existing rows will have NULL values for the newly added columns. dbt™ does not backfill.
sync_all_columns
The most aggressive option. dbt™ fully syncs the schema by adding new columns and removing columns that are no longer in the model. This includes handling data type changes.
Use this carefully—removing columns can break downstream models or dashboards that reference them.
Here's a quick summary to guide your choice:
ignore: Safe default, no schema updates—good when schema is stable
fail: Strict mode for CI/CD validation—catch issues before production
append_new_columns: Add-only schema evolution—best for most production pipelines
sync_all_columns: Full schema sync—use carefully, typically paired with a full refresh
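The option sits alongside the other incremental configs. A sketch (key name illustrative):

```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='append_new_columns'
) }}
```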
When to Run a Full Refresh in dbt™
The --full-refresh flag tells dbt™ to drop the existing incremental table and rebuild it entirely from scratch. This temporarily negates the benefits of incremental materialization, but there are legitimate scenarios where it's necessary:
After changing the model's business logic: If you modify a transformation that affects how historical data should be calculated, existing rows in the target table won't reflect the change. A full refresh reprocesses everything with the updated logic.
When backfilling historical data: If your source system retroactively adds or modifies older records beyond your lookback window, a full refresh ensures they're captured.
To fix data quality issues in existing rows: If you discover incorrect data in the target table—perhaps from a bug in a previous model version—a full refresh is the cleanest way to correct it.
After adding new columns with sync_all_columns: While sync_all_columns adds the column to the target table, it doesn't backfill existing rows. A full refresh populates the new column for all historical data.
Run a full refresh with:
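For a single model (the model name here is illustrative):

```shell
# Drop and rebuild one incremental model from scratch
dbt run --select fct_orders --full-refresh
```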
A word of caution: avoid scheduling full refreshes "just in case." Some teams run daily full refreshes as a safety net, which completely defeats the purpose of incremental models. Instead, adopt an incremental-first approach and reserve full refreshes for specific, justified scenarios. A common best practice is to schedule a full refresh weekly or monthly during off-peak hours.
Optimizing Incremental Models with dbt™ partition_by and Clustering
Beyond choosing the right strategy, physical table organization can dramatically improve incremental model performance.
partition_by: Defines how the target table is partitioned, typically by a date column. This is critical for BigQuery and Spark/Databricks, where the insert_overwrite strategy uses partition boundaries to replace only affected partitions. Without proper partitioning, insert_overwrite would replace the entire table—defeating the purpose.
cluster_by: Organizes data within a table (or within partitions) by specified columns, enabling the warehouse to skip irrelevant micro-partitions during queries. For incremental models using the merge strategy, clustering on the unique_key column(s) dramatically speeds up the lookup that determines whether a row should be inserted or updated.
In Snowflake, clustering on the merge key reduces the number of micro-partitions scanned during the MERGE operation—turning what could be a full table scan into a targeted lookup. In BigQuery, clustering works in tandem with partitioning to further reduce bytes scanned.
The general rule: partition by time, cluster by your unique key.
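Both configs together, in a BigQuery-flavored sketch (field and key names are illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    -- Partition by time...
    partition_by={'field': 'occurred_at', 'data_type': 'timestamp', 'granularity': 'day'},
    -- ...cluster by the unique key
    cluster_by=['order_id']
) }}
```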
Common dbt™ Incremental Model Performance Pitfalls
Incremental models are powerful but unforgiving when misconfigured. Here are the most common mistakes that hurt performance or correctness—and how to fix them.
Missing Unique Keys Causing Duplicates
Without a unique_key defined, dbt™ defaults to append behavior on most warehouses. This means every incremental run inserts all qualifying rows—even if they already exist in the target table.
The result? Duplicates that silently compound over time. Downstream aggregations start inflating, dashboards show wrong numbers, and the root cause isn't immediately obvious.
The fix: Always define a unique_key unless you're intentionally using the append strategy with immutable, naturally unique data (like event logs with guaranteed unique event IDs).
Inefficient Filtering Logic in is_incremental()
The is_incremental() filter determines what data your warehouse scans on each run. A poorly written filter can trigger full table scans on the source, negating the performance benefit entirely.
Anti-pattern:
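A filter of this shape forces the warehouse to compare every source row against every target key (a sketch; column names illustrative):

```sql
{% if is_incremental() %}
-- Anti-pattern: scans the full source AND the full target table
where order_id not in (select order_id from {{ this }})
{% endif %}
```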
Recommended pattern:
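A sketch of the preferred shape (column names illustrative):

```sql
{% if is_incremental() %}
-- Fetches a single timestamp, then filters on a column the
-- warehouse can use for partition pruning
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```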
The first example scans every row in both the source and target tables. The second retrieves a single timestamp value and filters the source using it—ideally against a partitioned or indexed column.
Over-Reliance on Full Refreshes
Some teams schedule daily full refreshes alongside their incremental runs "just in case" the data drifts. This completely defeats the purpose of incremental models—you're paying for a full rebuild every day while maintaining the added complexity of incremental logic.
The fix: Trust your incremental logic. Use lookback windows to handle late-arriving data, monitor for drift with data tests, and reserve full refreshes for specific events (logic changes, schema migrations, data quality incidents). A weekly or monthly scheduled full refresh during off-peak hours is a reasonable safety net.
Ignoring Late-Arriving Data
Timestamp-based filters assume data arrives in order. In reality, source systems often deliver records late—a row with updated_at = '2025-01-15' might not land in your raw table until January 18th.
If your incremental filter looks at max(updated_at) from the target table, those late-arriving rows get permanently missed.
The fix: Add a lookback window to your filter:
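For example (Snowflake/Postgres-style interval syntax; tune the window to how late your data arrives):

```sql
{% if is_incremental() %}
where updated_at > (
    -- 3-day lookback: deliberately reprocess a recent window
    -- so late-arriving rows are captured
    select max(updated_at) - interval '3 days' from {{ this }}
)
{% endif %}
```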
Combined with a unique_key, this reprocesses the last three days of data on every run. Rows that already exist get updated (no duplicates), and late arrivals get captured. Tune the window size based on how late your data typically arrives.
How to Monitor and Debug Slow dbt™ Incremental Builds
Getting your incremental models configured correctly is half the battle. The other half is keeping them performant in production over time—tables grow, source schemas evolve, and what worked last quarter may need tuning today.
Real-Time Execution Logging
The first step in debugging a slow incremental model is inspecting the run logs. dbt™ logs show exactly how long each model takes, what SQL was compiled, and how many rows were processed.
In production environments with dozens or hundreds of models, you need centralized log inspection across all runs—not just what's in the terminal. Platforms like Paradime Bolt provide a unified view of all dbt™ run logs, making it easy to spot which models are slowing down over time and drill into specific executions.
Column-Level Lineage for Impact Analysis
When an incremental model starts behaving unexpectedly—wrong row counts, missing data, or performance regressions—understanding which upstream columns flow into the model is critical for diagnosis.
Paradime's column-level lineage diff lets you trace data flow at the column level, so you can quickly identify whether a change in an upstream source or staging model is causing issues downstream. This is especially valuable for incremental models where a subtle change in a timestamp column's behavior can break the entire filtering logic.
Automated Alerting and Failure Triage
Incremental models that fail silently are worse than models that fail loudly. If an incremental run fails at 2 AM and nobody notices until the morning standup, you're hours behind on fresh data.
Setting up automated alerts for failed or slow incremental runs is essential for production pipelines. Paradime Bolt integrates with Slack, JIRA, and DataDog, so your team gets notified immediately when a run fails, exceeds expected duration, or produces unexpected results. Proactive monitoring turns "we found the issue this morning" into "we fixed it before anyone noticed."
Run Faster dbt™ Pipelines with AI-Native Orchestration
If you're serious about incremental model performance, the tooling around your dbt™ project matters just as much as the SQL inside it. Paradime is built as a modern, AI-native alternative to dbt Cloud™ for teams that need faster pipelines, smarter testing, and less operational overhead.
Bolt scheduling: Run incremental models on flexible schedules with deferred runs, so your incremental logic always has the right production state to compare against.
TurboCI: Test only affected models with slim CI—don't waste compute re-running your entire project when one model changes.
DinoAI copilot: Get AI-powered suggestions for optimizing incremental logic, writing efficient is_incremental() filters, and choosing the right strategy for your warehouse.
One-click dbt Cloud™ migration: Import existing dbt Cloud™ jobs into Paradime with zero downtime—no need to rebuild your scheduling from scratch.
Start for free and see how Paradime accelerates your dbt™ pipelines from development through production.
FAQs About dbt™ Incremental Models
Can I convert an existing dbt™ table model to an incremental model?
Yes. Change the materialized config from 'table' to 'incremental', add a unique_key, and implement the is_incremental() filter in your model's SQL. Because the existing table is still in the warehouse, dbt™ may treat the very first run after the change as an incremental run, so run it once with --full-refresh to rebuild the table cleanly under the new configuration; all subsequent runs then process incrementally.
How do I test dbt™ incremental models in CI/CD pipelines?
Use dbt™'s --defer flag with a production manifest to test incremental logic against real production state. This lets your CI pipeline run incrementally against actual data without needing a full copy of the target table. Alternatively, run with --full-refresh in CI to validate the complete build path and ensure the model works end-to-end.
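One common shape for the CI command, assuming production artifacts (including manifest.json) have been downloaded to a local directory (the `prod-artifacts/` path is illustrative):

```shell
# Build only modified models and their children, deferring
# unchanged upstream refs to the production environment
dbt build --select state:modified+ --defer --state prod-artifacts/
```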
What happens if my dbt™ incremental model unique_key changes?
Changing the unique_key can cause duplicates or merge failures because the existing target table was built around the old key. After modifying the unique_key, always run a --full-refresh to rebuild the table with the new key structure. Skipping this step is a common source of subtle data quality issues.
How do I handle late-arriving data in dbt™ incremental models?
Add a lookback window to your is_incremental() filter. Instead of comparing against max(updated_at) directly, subtract a buffer period:
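For example (interval syntax varies by warehouse; the 3-day buffer is illustrative):

```sql
{% if is_incremental() %}
where updated_at > (
    select max(updated_at) - interval '3 days' from {{ this }}
)
{% endif %}
```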
This reprocesses the last three days of data on each run. With a unique_key defined, existing rows get updated (no duplicates), and any late-arriving rows within the window get captured. Adjust the window size based on how late your source data typically arrives.
Should I use incremental models for small dbt™ tables?
Generally no. The added complexity of incremental logic—writing and maintaining the is_incremental() filter, choosing a strategy, handling schema changes, managing lookback windows—isn't worth it for tables under a few million rows where full rebuilds complete in seconds. Start with table materialization, and only switch to incremental when build times or warehouse costs become a problem.


