Migrating Spark or PySpark Workflows to dbt™ Python Models in Paradime
Feb 26, 2026
How to Migrate PySpark Jobs to dbt™ Python Models: A Step-by-Step Tutorial
If your data platform is already running PySpark jobs for transformations that could live inside your dbt™ DAG, you're likely maintaining two orchestration systems, two sets of dependency declarations, and two mental models for how data moves from raw to refined. dbt™ Python models offer a way to collapse that sprawl—without abandoning Python—by running DataFrame-based transformations inside the same lineage, testing, and scheduling framework you already use for SQL.
This tutorial walks you through migrating PySpark transformation jobs into dbt™ Python models on Paradime, covering dependency management, code translation with DinoAI, DAG integration, production scheduling with Bolt, and validation.
What dbt™ Python models are (and aren't): A dbt™ Python model is a `.py` file in your `models/` directory that defines a `model(dbt, session)` function returning a single DataFrame. It participates fully in the dbt™ DAG—with `ref()`, testing, documentation, and lineage—but it is not a general-purpose Python execution environment. You cannot run arbitrary scripts, call external APIs (by default), or import code across models. All computation is pushed down to the data platform's Python runtime (Snowpark, PySpark on Databricks, BigQuery DataFrames). Think of them as a transformation contract, not a notebook.
When dbt™ Python Models Are a Good Replacement for PySpark Jobs
Not every Spark job should become a dbt™ Python model. The sweet spot is PySpark transformations that produce tables consumed by downstream SQL models—feature engineering, statistical enrichments, text parsing, date/time logic, and lightweight ML scoring.
Keeping Transformations in dbt™ DAGs
When a PySpark job writes to a table that a dbt™ SQL model later references via source(), you have a blind spot in lineage. Converting that job into a dbt™ Python model means:
- `ref()` replaces hard-coded table names, so the DAG is complete.
- Tests apply uniformly to Python and SQL model outputs.
- Documentation lives in one place: your `schema.yml` files.
Reducing Orchestration Sprawl
With PySpark jobs living outside dbt™, you often maintain a separate Airflow DAG (or Step Functions, or Glue workflow) just to sequence Spark → dbt™ → Spark. Moving eligible Spark logic into dbt™ Python models lets you eliminate that outer orchestrator for the transformation layer entirely.
Before and after: collapsing PySpark + dbt™ into a single DAG reduces orchestration surface area.
Prerequisites and Constraints
Supported Runtimes / Adapters (Warehouse-Dependent)
dbt™ Python models execute remotely on your data platform, not on your local machine. As of today, the following adapters support Python models:
| Platform | Runtime | DataFrame API |
|---|---|---|
| Snowflake | Snowpark | Snowpark DataFrame |
| Databricks | PySpark | PySpark DataFrame |
| BigQuery | BigQuery DataFrames / Dataproc | BigFrames / PySpark |
⚠️ Redshift, Postgres, and DuckDB do not support dbt™ Python models. If your warehouse isn't in the table above, Python models aren't an option yet.
Dependency and Package Considerations
Python models can import packages that are pre-installed or declared on the data platform's Python runtime. How you install packages differs by platform:
- **Snowflake:** Packages from the Snowflake Anaconda channel. Declare them in `dbt.config(packages=["pandas", "holidays"])`.
- **Databricks:** Packages available on the cluster. Configure via cluster libraries or `%pip install` in init scripts.
- **BigQuery (Dataproc):** Packages installed on the Dataproc cluster.

Key constraint: You cannot `import` a function defined in another dbt™ Python model. Each `.py` model is an isolated execution unit. For shared logic, use packages installed on the platform or duplicate small helper functions.
Step 1: Configure Python Dependencies in Paradime
When running Python scripts and models through Paradime, Poetry manages your dependency environment. This ensures every Bolt run installs the exact same package versions.
pyproject.toml Basics
Initialize your project with Poetry and define the packages your Python models or scripts rely on:
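A minimal sketch of what that file can look like — the package names and version ranges here are illustrative, not prescriptive; declare whatever your own models import:

```toml
[tool.poetry]
name = "analytics-dbt"
version = "0.1.0"
description = "dbt project with Python models"
authors = ["Data Team <data@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"
pandas = "~2.1"
holidays = "~0.40"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```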
In Paradime, run poetry init in the Code IDE terminal to interactively generate this file, or create it manually.
Pinning Versions for Reproducibility
Always pin to exact or narrow ranges. A poetry.lock file locks the resolved dependency tree so every environment—development and production—uses the same versions.
Important for Bolt: When configuring a Bolt schedule that includes Python commands, ensure `poetry install` is the first command in the schedule. This creates the virtual environment before your dbt™ or Python commands execute. (Paradime Docs: Python Scripts)
Dependency flow from development to production: the lock file travels with your code.
Step 2: Translate PySpark Logic into dbt™ Python Models (with DinoAI)
Paradime's DinoAI is a warehouse-aware AI assistant that understands your dbt™ project context. Use it to accelerate PySpark → dbt™ Python model conversion with custom prompts via the .dinoprompts file.
Prompt: Convert Transformations to DataFrame Operations
Create a .dinoprompts file in your repo root with reusable migration prompts:
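As an illustrative sketch (the exact `.dinoprompts` syntax is defined in Paradime's DinoAI docs — verify the file format there), a reusable conversion prompt might read:

```
# Convert PySpark to dbt Python model

Convert the selected PySpark script into a dbt Python model:
- Wrap the logic in a model(dbt, session) function that returns one DataFrame.
- Replace hard-coded table reads with dbt.ref() calls.
- Remove Spark session creation; use the provided session argument instead.
- Declare required packages in dbt.config(packages=[...]).
```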
Open your existing PySpark .py file in the Code IDE, trigger DinoAI with [ → select the prompt, and DinoAI will generate a dbt™-compatible Python model.
Prompt: Keep Naming Consistent and Add Comments
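An illustrative prompt body for this task (the wording is a suggestion, not a fixed template):

```
Rename all intermediate DataFrames to match our dbt naming conventions
(stg_, int_, fct_ prefixes), keep output column names identical to the
legacy PySpark job, and add a brief comment above each transformation
step explaining what it does and why.
```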
Prompt: Identify UDFs and Propose Alternatives
PySpark UDFs are a common migration stumbling block—they serialize data to Python row-by-row, killing performance.
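To make the problem concrete, here is a pandas sketch (column names are illustrative) contrasting row-by-row UDF-style logic with a vectorized replacement. The same principle applies when swapping a PySpark UDF for built-in column functions:

```python
import pandas as pd

df = pd.DataFrame({"phone": ["(555) 123-4567", "555.987.6543"]})

# UDF-style: a Python function invoked once per row (slow at scale)
def clean_phone(value: str) -> str:
    return "".join(ch for ch in value if ch.isdigit())

df["phone_udf"] = df["phone"].apply(clean_phone)

# Vectorized: one regex pass over the whole column, no per-row Python calls
df["phone_vec"] = df["phone"].str.replace(r"\D", "", regex=True)

# Both produce the same result; only the vectorized form scales
assert (df["phone_udf"] == df["phone_vec"]).all()
```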
Tip: Access prompts in DinoAI by pressing `[` in the chat input or clicking the Prompts button. Learn more in the DinoAI `.dinoprompts` docs.
Step 3: Integrate Python Models with SQL Models
Using ref() to Connect Models
The single most important integration point is dbt.ref(). Python models can read from upstream SQL or Python models, and downstream SQL models can reference Python model outputs.
Python model reading from a SQL model:
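A Snowpark-style sketch, assuming illustrative model and column names (`stg_orders`, `customer_id`):

```python
# models/customer_features.py
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() returns a platform DataFrame (Snowpark here)
    orders = dbt.ref("stg_orders")

    # Example enrichment: order counts per customer
    features = orders.group_by("customer_id").count()

    return features
```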
SQL model referencing the Python model downstream:
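A downstream SQL model can then treat the Python model's output like any other relation (names are illustrative and match the sketch above only by convention):

```sql
-- models/customer_summary.sql
select
    customer_id,
    count as order_count
from {{ ref('customer_features') }}
```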
Python models sit naturally inside the DAG alongside SQL models. Yellow indicates the Python model.
Materialization Strategies and Performance
Python models support only two materializations:
| Materialization | When to Use |
|---|---|
| `table` | Full refresh each run. Simplest. Use for small-to-medium datasets or when incremental logic isn't worth the complexity. |
| `incremental` | Process only new/changed rows. Use for large, append-heavy datasets to reduce compute cost and runtime. |

Not supported: `view` and `ephemeral` materializations are unavailable for Python models. Python models always write a physical table.
Incremental Python model example (Databricks/PySpark):
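A hedged sketch of what that can look like — table and column names (`stg_events`, `event_id`, `event_ts`) are illustrative, and the max-timestamp cutoff is one of several valid incremental patterns:

```python
# models/events_enriched.py
import pyspark.sql.functions as F

def model(dbt, session):
    dbt.config(materialized="incremental", unique_key="event_id")

    events = dbt.ref("stg_events")

    if dbt.is_incremental:
        # Only process rows newer than the latest already-materialized event
        max_ts = (
            session.table(f"{dbt.this}")
            .agg(F.max("event_ts"))
            .collect()[0][0]
        )
        if max_ts is not None:
            events = events.filter(F.col("event_ts") > F.lit(max_ts))

    return events.withColumn("processed_at", F.current_timestamp())
```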
Step 4: Schedule and Operate with Bolt
Paradime Bolt orchestrates your dbt™ and Python pipelines from a single interface—no external Airflow or cron required.
Run Order and Job Composition
Bolt lets you chain multiple commands in a single schedule. Commands execute sequentially, and if a command errors, subsequent commands are skipped by default (with specific exceptions for source freshness checks and observability tools).
A typical migration schedule looks like:
| Step | Command | Purpose |
|---|---|---|
| 1 | `poetry install` | Install Python dependencies |
| 2 | `dbt run` | Execute SQL + Python models |
| 3 | `dbt test` | Run data quality tests |
| 4 | Elementary / Lightdash refresh | Post-run observability or BI refresh |
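If you manage schedules as code, the shape is roughly the following — the file name and field names here are assumptions for illustration; consult the Bolt schedules documentation for the exact schema your workspace expects:

```yaml
# paradime_schedules.yml (illustrative; verify fields against Bolt docs)
schedules:
  - name: daily_transformations
    schedule: "0 6 * * *"   # daily at 6 AM
    commands:
      - poetry install
      - dbt run
      - dbt test
```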
Bolt also supports trigger chaining—one schedule can trigger another on completion—so you can separate staging, intermediate, and mart layers into distinct schedules if needed:
- **Scheduled Run:** Cron-based triggers (e.g., `0 6 * * *` for daily at 6 AM).
- **On Run Completion:** Fires after a "parent" schedule finishes.
- **On Merge:** Runs when a PR merges into a branch (CI/CD).
- **Bolt API:** Trigger runs programmatically from external systems.
Retries, Timeouts, and Logging
Bolt provides several operational safeguards:
SLA Thresholds: Set a runtime limit (e.g., 30 minutes). If a run exceeds it, Bolt sends an SLA breach notification via Slack, MS Teams, or email. (Notification Settings docs)
24-Hour Run Timeout: Bolt enforces a system-level 24-hour timeout for any run.
OOM Detection: Out-of-memory runs are flagged with workspace-level system alerts.
Logging: Every run produces detailed logs—view them in real time during execution or review historical runs with execution duration trends, success/error rates, and commit-level traceability. (Run Log History docs)
DinoAI Debugging: Failed runs automatically generate human-readable error summaries with remediation steps and links to the problematic code.
Bolt command execution flow with error handling and SLA alerting.
Validation and Testing
Data Quality Tests for Python Outputs
dbt™ tests work identically on Python and SQL model outputs. Since Python models materialize as tables, you apply generic and singular tests in your schema.yml the same way you always have:
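For example, assuming an illustrative Python model named `customer_features` with `customer_id` and `order_count` columns:

```yaml
# models/schema.yml
version: 2

models:
  - name: customer_features
    description: "Output of the customer_features dbt Python model"
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: order_count
        tests:
          - not_null
```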
For more advanced checks, consider the dbt-expectations package, which provides statistical distribution tests, row-count validations, and column-level range assertions.
Parity Checks vs the Old Spark Job
During migration, you need confidence that the new dbt™ Python model produces results identical to the legacy PySpark job. A practical approach:
- Run both in parallel for a defined period: the old Spark job writes to `legacy_schema.table`, the new dbt™ model writes to `analytics.table`.
- Create a singular test that compares the two:
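A symmetric-difference test is one practical sketch — the table name `customer_features` is illustrative; the test passes when zero rows are returned:

```sql
-- tests/parity_customer_features.sql
with new_minus_old as (
    select * from analytics.customer_features
    except
    select * from legacy_schema.customer_features
),
old_minus_new as (
    select * from legacy_schema.customer_features
    except
    select * from analytics.customer_features
)
select * from new_minus_old
union all
select * from old_minus_new
```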
Include the parity test in your Bolt schedule during the migration window. Once parity holds for multiple runs, decommission the old Spark job.
Run both pipelines in parallel and compare outputs with a singular dbt™ test.
Common Migration Pitfalls
Distributed Compute Assumptions
PySpark jobs often rely on distributed-compute patterns that don't translate directly:
- **Cluster sizing / partitioning hints:** In a standalone Spark cluster, you tune `spark.sql.shuffle.partitions` or call `.repartition()`. In dbt™ Python models, the data platform manages the runtime. Snowpark, for instance, runs on warehouse compute with no exposed partitioning knobs. Remove explicit partition tuning and let the platform optimizer handle it.
- **Broadcast joins:** `F.broadcast()` hints may be ignored or unsupported depending on the platform. Test join performance without hints first.
- **SparkContext / SparkSession:** In dbt™ Python models, the `session` parameter is your session. Do not create a new `SparkSession`; use the one provided.
Serialization and Type Differences
Moving between PySpark and warehouse-native types introduces subtle issues:
- **Decimal precision:** PySpark `DecimalType(38, 18)` may map differently to Snowflake `NUMBER(38, 18)` or BigQuery `NUMERIC`. Validate precision-sensitive columns explicitly.
- **Null handling:** PySpark treats `null` in aggregations differently than some SQL engines. A `count(column)` in PySpark excludes nulls; confirm your warehouse SQL models do the same.
- **Timestamp zones:** PySpark defaults to `TimestampType` (with timezone). Snowflake has `TIMESTAMP_NTZ` vs `TIMESTAMP_LTZ`. Align on a convention early.
- **Column name casing:** PySpark preserves case; Snowflake upper-cases by default. Normalize column names to lowercase in your Python model to avoid downstream `ref()` mismatches.
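The casing normalization from the last point is a one-liner; a pandas sketch with illustrative column names:

```python
import pandas as pd

# Output of a PySpark job often carries mixed-case column names
df = pd.DataFrame({"CUSTOMER_ID": [1, 2], "OrderTotal": [9.99, 4.50]})

# Normalize to lowercase before returning from the model, so downstream
# SQL references don't hit case-sensitivity mismatches
df.columns = [c.lower() for c in df.columns]
```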
Overusing Python Where SQL Is Better
"The vast majority of models should continue to be in SQL." — dbt Labs
A common anti-pattern is converting all Spark jobs to Python models—including simple SELECT, JOIN, and GROUP BY logic that runs faster and is more readable in SQL. Use the decision framework below:
| Task | Recommended | Why |
|---|---|---|
| Joins, aggregations, window functions | SQL | Warehouse optimizers excel here; SQL is more readable for these patterns. |
| Text parsing, regex, NLP | Python | Native string libraries outperform SQL string functions. |
| Statistical modeling / ML scoring | Python | Statistical and ML libraries have no practical SQL equivalent. |
| Date/time manipulation (simple) | SQL | Built-in date functions are concise and well optimized. |
| Date/time manipulation (complex: holidays, business days) | Python | Libraries like `holidays` handle calendar logic SQL cannot express cleanly. |
| Pivoting / reshaping | SQL (usually) | Conditional aggregation or `PIVOT` usually suffices; reach for Python only when the reshaping is dynamic. |
Conclusion
Migrating PySpark jobs to dbt™ Python models isn't about replacing Spark—it's about moving eligible transformations into a unified DAG so you get lineage, testing, and documentation for free. The steps are concrete:
1. **Configure dependencies** with Poetry and `pyproject.toml` in Paradime so environments are reproducible.
2. **Translate** PySpark logic into `model(dbt, session)` functions using DinoAI prompts to accelerate the conversion.
3. **Integrate** with SQL models using `ref()` so the full DAG is visible and testable.
4. **Schedule** with Bolt for production execution with SLA monitoring, failure alerts, and real-time logging.
5. **Validate** with parity tests to confirm output equivalence before decommissioning the legacy Spark job.
The migration pitfalls are real—distributed compute assumptions, type serialization, and the temptation to over-use Python—but they're avoidable with the right decision framework. Keep SQL for what SQL does best, bring Python in for what it uniquely enables, and let the dbt™ DAG be the single source of truth for your transformation layer.
Ready to get started? Try Paradime free for 14 days and run your first dbt™ Python model in production with Bolt.


