Migrating Spark or PySpark Workflows to dbt™ Python Models in Paradime
Feb 26, 2026
How to Migrate PySpark Jobs to dbt™ Python Models: A Step-by-Step Tutorial
If your data platform is already running PySpark jobs for transformations that could live inside your dbt™ DAG, you're likely maintaining two orchestration systems, two sets of dependency declarations, and two mental models for how data moves from raw to refined. dbt™ Python models offer a way to collapse that sprawl—without abandoning Python—by running DataFrame-based transformations inside the same lineage, testing, and scheduling framework you already use for SQL.
This tutorial walks you through migrating PySpark transformation jobs into dbt™ Python models on Paradime, covering dependency management, code translation with DinoAI, DAG integration, production scheduling with Bolt, and validation.
What dbt™ Python models are (and aren't): A dbt™ Python model is a `.py` file in your `models/` directory that defines a `model(dbt, session)` function returning a single DataFrame. It participates fully in the dbt™ DAG—with `ref()`, testing, documentation, and lineage—but it is not a general-purpose Python execution environment. You cannot run arbitrary scripts, call external APIs (by default), or import code across models. All computation is pushed down to the data platform's Python runtime (Snowpark, PySpark on Databricks, BigQuery DataFrames). Think of them as a transformation contract, not a notebook.
When dbt™ Python Models Are a Good Replacement for PySpark Jobs
Not every Spark job should become a dbt™ Python model. The sweet spot is PySpark transformations that produce tables consumed by downstream SQL models—feature engineering, statistical enrichments, text parsing, date/time logic, and lightweight ML scoring.
Keeping Transformations in dbt™ DAGs
When a PySpark job writes to a table that a dbt™ SQL model later references via source(), you have a blind spot in lineage. Converting that job into a dbt™ Python model means:
- `ref()` replaces hard-coded table names, so the DAG is complete.
- Tests apply uniformly to Python and SQL model outputs.
- Documentation lives in one place: your `schema.yml` files.
Reducing Orchestration Sprawl
With PySpark jobs living outside dbt™, you often maintain a separate Airflow DAG (or Step Functions, or Glue workflow) just to sequence Spark → dbt™ → Spark. Moving eligible Spark logic into dbt™ Python models lets you eliminate that outer orchestrator for the transformation layer entirely.
Before and after: collapsing PySpark + dbt™ into a single DAG reduces orchestration surface area.
Prerequisites and Constraints
Supported Runtimes / Adapters (Warehouse-Dependent)
dbt™ Python models execute remotely on your data platform, not on your local machine. As of today, the following adapters support Python models:
| Platform | Runtime | DataFrame API |
|---|---|---|
| Snowflake | Snowpark | Snowpark DataFrame |
| Databricks | PySpark | PySpark DataFrame |
| BigQuery | BigQuery DataFrames / Dataproc | BigFrames / PySpark |
⚠️ Redshift, Postgres, and DuckDB do not support dbt™ Python models. If your warehouse isn't in the table above, Python models aren't an option yet.
Dependency and Package Considerations
Python models can import packages that are pre-installed or declared on the data platform's Python runtime. How you install packages differs by platform:
- **Snowflake:** Packages from the Snowflake Anaconda channel. Declare them in `dbt.config(packages=["pandas", "holidays"])`.
- **Databricks:** Packages available on the cluster. Configure via cluster libraries or `%pip install` in init scripts.
- **BigQuery (Dataproc):** Packages installed on the Dataproc cluster.

Key constraint: You cannot `import` a function defined in another dbt™ Python model. Each `.py` model is an isolated execution unit. For shared logic, use packages installed on the platform or duplicate small helper functions.
Step 1: Configure Python Dependencies in Paradime
When running Python scripts and models through Paradime, Poetry manages your dependency environment. This ensures every Bolt run installs the exact same package versions.
pyproject.toml Basics
Initialize your project with Poetry and define the packages your Python models or scripts rely on:
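A minimal sketch of what that file can look like — the package names and version ranges here are illustrative, not prescriptive; declare whatever your own models import:

```toml
[tool.poetry]
name = "analytics-dbt"
version = "0.1.0"
description = "dbt project with Python models"
authors = ["Data Team <data@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"
pandas = "~2.1"
holidays = "~0.40"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```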
In Paradime, run poetry init in the Code IDE terminal to interactively generate this file, or create it manually.
Pinning Versions for Reproducibility
Always pin to exact or narrow ranges. A poetry.lock file locks the resolved dependency tree so every environment—development and production—uses the same versions.
Important for Bolt: When configuring a Bolt schedule that includes Python commands, ensure `poetry install` is the first command in the schedule. This creates the virtual environment before your dbt™ or Python commands execute. (Paradime Docs: Python Scripts)
Dependency flow from development to production: the lock file travels with your code.
Step 2: Translate PySpark Logic into dbt™ Python Models (with DinoAI)
Paradime's DinoAI is a warehouse-aware AI assistant that understands your dbt™ project context. Use it to accelerate PySpark → dbt™ Python model conversion with custom prompts via the .dinoprompts file.
Prompt: Convert Transformations to DataFrame Operations
Create a .dinoprompts file in your repo root with reusable migration prompts:
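As an illustrative sketch (the exact `.dinoprompts` syntax is defined in Paradime's DinoAI docs — verify the file format there), a reusable conversion prompt might read:

```
# Convert PySpark to dbt Python model

Convert the selected PySpark script into a dbt Python model:
- Wrap the logic in a model(dbt, session) function that returns one DataFrame.
- Replace hard-coded table reads with dbt.ref() calls.
- Remove Spark session creation; use the provided session argument instead.
- Declare required packages in dbt.config(packages=[...]).
```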
Open your existing PySpark .py file in the Code IDE, trigger DinoAI with [ → select the prompt, and DinoAI will generate a dbt™-compatible Python model.
Prompt: Keep Naming Consistent and Add Comments
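An illustrative prompt body for this task (the wording is a suggestion, not a fixed template):

```
Rename all intermediate DataFrames to match our dbt naming conventions
(stg_, int_, fct_ prefixes), keep output column names identical to the
legacy PySpark job, and add a brief comment above each transformation
step explaining what it does and why.
```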
Prompt: Identify UDFs and Propose Alternatives
PySpark UDFs are a common migration stumbling block—they serialize data to Python row-by-row, killing performance.
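To make the problem concrete, here is a pandas sketch (column names are illustrative) contrasting row-by-row UDF-style logic with a vectorized replacement. The same principle applies when swapping a PySpark UDF for built-in column functions:

```python
import pandas as pd

df = pd.DataFrame({"phone": ["(555) 123-4567", "555.987.6543"]})

# UDF-style: a Python function invoked once per row (slow at scale)
def clean_phone(value: str) -> str:
    return "".join(ch for ch in value if ch.isdigit())

df["phone_udf"] = df["phone"].apply(clean_phone)

# Vectorized: one regex pass over the whole column, no per-row Python calls
df["phone_vec"] = df["phone"].str.replace(r"\D", "", regex=True)

# Both produce the same result; only the vectorized form scales
assert (df["phone_udf"] == df["phone_vec"]).all()
```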
Tip: Access prompts in DinoAI by pressing `[` in the chat input or clicking the Prompts button. Learn more in the DinoAI `.dinoprompts` docs.
Step 3: Integrate Python Models with SQL Models
Using ref() to Connect Models
The single most important integration point is dbt.ref(). Python models can read from upstream SQL or Python models, and downstream SQL models can reference Python model outputs.
Python model reading from a SQL model:
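A Snowpark-style sketch, assuming illustrative model and column names (`stg_orders`, `customer_id`):

```python
# models/customer_features.py
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() returns a platform DataFrame (Snowpark here)
    orders = dbt.ref("stg_orders")

    # Example enrichment: order counts per customer
    features = orders.group_by("customer_id").count()

    return features
```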
SQL model referencing the Python model downstream:
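A downstream SQL model can then treat the Python model's output like any other relation (names are illustrative and match the sketch above only by convention):

```sql
-- models/customer_summary.sql
select
    customer_id,
    count as order_count
from {{ ref('customer_features') }}
```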
Python models sit naturally inside the DAG alongside SQL models. Yellow indicates the Python model.
Materialization Strategies and Performance
Python models support only two materializations:
| Materialization | When to Use |
|---|---|
| `table` | Full refresh each run. Simplest. Use for small-to-medium datasets or when incremental logic isn't worth the complexity. |
| `incremental` | Process only new/changed rows. Use for large, append-heavy datasets to reduce compute cost and runtime. |

Not supported: `view` and `ephemeral` materializations are unavailable for Python models. Python models always write a physical table.
Incremental Python model example (Databricks/PySpark):
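A hedged sketch of what that can look like — table and column names (`stg_events`, `event_id`, `event_ts`) are illustrative, and the max-timestamp cutoff is one of several valid incremental patterns:

```python
# models/events_enriched.py
import pyspark.sql.functions as F

def model(dbt, session):
    dbt.config(materialized="incremental", unique_key="event_id")

    events = dbt.ref("stg_events")

    if dbt.is_incremental:
        # Only process rows newer than the latest already-materialized event
        max_ts = (
            session.table(f"{dbt.this}")
            .agg(F.max("event_ts"))
            .collect()[0][0]
        )
        if max_ts is not None:
            events = events.filter(F.col("event_ts") > F.lit(max_ts))

    return events.withColumn("processed_at", F.current_timestamp())
```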
Step 4: Schedule and Operate with Bolt
Paradime Bolt orchestrates your dbt™ and Python pipelines from a single interface—no external Airflow or cron required.
Run Order and Job Composition
Bolt lets you chain multiple commands in a single schedule. Commands execute sequentially, and if a command errors, subsequent commands are skipped by default (with specific exceptions for source freshness checks and observability tools).
A typical migration schedule looks like:
| Step | Command | Purpose |
|---|---|---|
| 1 | `poetry install` | Install Python dependencies |
| 2 | `dbt run` | Execute SQL + Python models |
| 3 | `dbt test` | Run data quality tests |
| 4 | Elementary / Lightdash refresh | Post-run observability or BI refresh |
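If you manage schedules as code, the shape is roughly the following — the file name and field names here are assumptions for illustration; consult the Bolt schedules documentation for the exact schema your workspace expects:

```yaml
# paradime_schedules.yml (illustrative; verify fields against Bolt docs)
schedules:
  - name: daily_transformations
    schedule: "0 6 * * *"   # daily at 6 AM
    commands:
      - poetry install
      - dbt run
      - dbt test
```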
Bolt also supports trigger chaining—one schedule can trigger another on completion—so you can separate staging, intermediate, and mart layers into distinct schedules if needed:
- **Scheduled Run:** Cron-based triggers (e.g., `0 6 * * *` for daily at 6 AM).
- **On Run Completion:** Fires after a "parent" schedule finishes.
- **On Merge:** Runs when a PR merges into a branch (CI/CD).
- **Bolt API:** Trigger runs programmatically from external systems.
Retries, Timeouts, and Logging
Bolt provides several operational safeguards:
SLA Thresholds: Set a runtime limit (e.g., 30 minutes). If a run exceeds it, Bolt sends an SLA breach notification via Slack, MS Teams, or email. (Notification Settings docs)
24-Hour Run Timeout: Bolt enforces a system-level 24-hour timeout for any run.
OOM Detection: Out-of-memory runs are flagged with workspace-level system alerts.
Logging: Every run produces detailed logs—view them in real time during execution or review historical runs with execution duration trends, success/error rates, and commit-level traceability. (Run Log History docs)
DinoAI Debugging: Failed runs automatically generate human-readable error summaries with remediation steps and links to the problematic code.
Bolt command execution flow with error handling and SLA alerting.
Validation and Testing
Data Quality Tests for Python Outputs
dbt™ tests work identically on Python and SQL model outputs. Since Python models materialize as tables, you apply generic and singular tests in your schema.yml the same way you always have:
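For example, assuming an illustrative Python model named `customer_features` with `customer_id` and `order_count` columns:

```yaml
# models/schema.yml
version: 2

models:
  - name: customer_features
    description: "Output of the customer_features dbt Python model"
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: order_count
        tests:
          - not_null
```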
For more advanced checks, consider the dbt-expectations package, which provides statistical distribution tests, row-count validations, and column-level range assertions.
Parity Checks vs the Old Spark Job
During migration, you need confidence that the new dbt™ Python model produces results identical to the legacy PySpark job. A practical approach:
- Run both in parallel for a defined period: the old Spark job writes to `legacy_schema.table`, the new dbt™ model writes to `analytics.table`.
- Create a singular test that compares the two:
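A symmetric-difference test is one practical sketch — the table name `customer_features` is illustrative; the test passes when zero rows are returned:

```sql
-- tests/parity_customer_features.sql
with new_minus_old as (
    select * from analytics.customer_features
    except
    select * from legacy_schema.customer_features
),
old_minus_new as (
    select * from legacy_schema.customer_features
    except
    select * from analytics.customer_features
)
select * from new_minus_old
union all
select * from old_minus_new
```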
Include the parity test in your Bolt schedule during the migration window. Once parity holds for multiple runs, decommission the old Spark job.
Run both pipelines in parallel and compare outputs with a singular dbt™ test.
Common Migration Pitfalls
Distributed Compute Assumptions
PySpark jobs often rely on distributed-compute patterns that don't translate directly:
- **Cluster sizing / partitioning hints:** In a standalone Spark cluster, you tune `spark.sql.shuffle.partitions` or call `.repartition()`. In dbt™ Python models, the data platform manages the runtime. Snowpark, for instance, runs on warehouse compute with no exposed partitioning knobs. Remove explicit partition tuning and let the platform optimizer handle it.
- **Broadcast joins:** `F.broadcast()` hints may be ignored or unsupported depending on the platform. Test join performance without hints first.
- **SparkContext / SparkSession:** In dbt™ Python models, the `session` parameter is your session. Do not create a new `SparkSession`; use the one provided.
Serialization and Type Differences
Moving between PySpark and warehouse-native types introduces subtle issues:
- **Decimal precision:** PySpark `DecimalType(38, 18)` may map differently to Snowflake `NUMBER(38, 18)` or BigQuery `NUMERIC`. Validate precision-sensitive columns explicitly.
- **Null handling:** PySpark treats `null` in aggregations differently than some SQL engines. A `count(column)` in PySpark excludes nulls; confirm your warehouse SQL models do the same.
- **Timestamp zones:** PySpark defaults to `TimestampType` (with timezone). Snowflake has `TIMESTAMP_NTZ` vs `TIMESTAMP_LTZ`. Align on a convention early.
- **Column name casing:** PySpark preserves case; Snowflake upper-cases by default. Normalize column names to lowercase in your Python model to avoid downstream `ref()` mismatches.
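The casing normalization from the last point is a one-liner; a pandas sketch with illustrative column names:

```python
import pandas as pd

# Output of a PySpark job often carries mixed-case column names
df = pd.DataFrame({"CUSTOMER_ID": [1, 2], "OrderTotal": [9.99, 4.50]})

# Normalize to lowercase before returning from the model, so downstream
# SQL references don't hit case-sensitivity mismatches
df.columns = [c.lower() for c in df.columns]
```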
Overusing Python Where SQL Is Better
"The vast majority of models should continue to be in SQL." — dbt Labs
A common anti-pattern is converting all Spark jobs to Python models—including simple SELECT, JOIN, and GROUP BY logic that runs faster and is more readable in SQL. Use the decision framework below:
| Task | Recommended | Why |
|---|---|---|
| Joins, aggregations, window functions | SQL | Warehouse optimizers excel here; SQL is more readable for these patterns. |
| Text parsing, regex, NLP | Python | Native string libraries outperform SQL string functions. |
| Statistical modeling / ML scoring | Python | Statistical and ML libraries have no practical SQL equivalent. |
| Date/time manipulation (simple) | SQL | Built-in date functions are concise and well optimized. |
| Date/time manipulation (complex: holidays, business days) | Python | Libraries like `holidays` handle calendar logic SQL cannot express cleanly. |
| Pivoting / reshaping | SQL (usually) | Conditional aggregation or `PIVOT` usually suffices; reach for Python only when the reshaping is dynamic. |
Conclusion
Migrating PySpark jobs to dbt™ Python models isn't about replacing Spark—it's about moving eligible transformations into a unified DAG so you get lineage, testing, and documentation for free. The steps are concrete:
1. **Configure dependencies** with Poetry and `pyproject.toml` in Paradime so environments are reproducible.
2. **Translate** PySpark logic into `model(dbt, session)` functions using DinoAI prompts to accelerate the conversion.
3. **Integrate** with SQL models using `ref()` so the full DAG is visible and testable.
4. **Schedule** with Bolt for production execution with SLA monitoring, failure alerts, and real-time logging.
5. **Validate** with parity tests to confirm output equivalence before decommissioning the legacy Spark job.
The migration pitfalls are real—distributed compute assumptions, type serialization, and the temptation to over-use Python—but they're avoidable with the right decision framework. Keep SQL for what SQL does best, bring Python in for what it uniquely enables, and let the dbt™ DAG be the single source of truth for your transformation layer.
Ready to get started? Try Paradime free for 14 days and run your first dbt™ Python model in production with Bolt.


