How to Clean and Deduplicate Spreadsheet Data with OpenClaw in Paradime
Feb 26, 2026
Automate Spreadsheet Cleanup with Paradime, OpenClaw, and dbt™: A Repeatable Workflow
Messy spreadsheets cost data teams hours every week. Duplicate rows, inconsistent phone formats, malformed email addresses, and rogue address abbreviations silently corrupt downstream analytics. Most "optimize your data" advice stops at vague platitudes—this guide doesn't.
Below, you'll build a concrete, repeatable workflow: read a Google Sheet, identify duplicates, standardize phone/email/address formats, flag anomalies, schedule the whole pipeline with Paradime Bolt, and validate that your cleanup actually stuck. Every step includes code you can copy, a schedule you can deploy, and monitoring you can rely on.
What is Paradime
Paradime is an all-in-one, AI-native platform purpose-built for analytics and data engineering teams. It replaces dbt Cloud™ with a faster, more flexible stack that covers the full lifecycle—development, orchestration, cost management, and monitoring—in one workspace.
Key capabilities you'll use in this guide
| Capability | Product | What it does |
|---|---|---|
| Code IDE | Paradime Code | AI-assisted dbt™ development with DinoAI; cuts boilerplate by 83%+ |
| Scheduling | Paradime Bolt | Cron and event-driven orchestration for dbt™ jobs—YAML-as-code or UI |
| Monitoring | Paradime Radar | Schedule dashboards, model performance, real-time alerting |
| FinOps | Paradime Radar | Snowflake and BigQuery cost visibility and warehouse credit optimization |
Paradime is SOC 2 Type II certified, GDPR and CCPA compliant, and guarantees 99.9% uptime. Full docs live at docs.paradime.io.
What is OpenClaw
OpenClaw is an open-source, local-first AI assistant and automation platform. It runs on your own hardware—laptop, homelab, or VPS—and connects to chat apps like Slack, Telegram, or Discord to execute real tasks: file management, API calls, web scraping, and data cleaning.
Why OpenClaw for spreadsheet cleanup
| Feature | Benefit for this workflow |
|---|---|
| Extensible skills & plugins | Install or author skills that read Google Sheets, call cleaning logic, and write results back |
| Model flexibility | Use any LLM backend (OpenAI, Claude, Kimi K2.5, local models) for anomaly detection |
| Local execution | Data never leaves your infrastructure—critical when spreadsheets contain PII |
| Plugin registry | |
OpenClaw's architecture centers on a CLI, an agent gateway, and chat-app connectors. Requests route to your chosen model provider, which triggers skill execution through structured agent loops.
Setup: openclaw-sdk + Google Sheets API
Before writing any cleaning logic, you need two things wired up: the OpenClaw Python SDK for orchestrating the cleanup agent, and Google Sheets API credentials for reading and writing spreadsheet data.
Figure 1 — Credential and SDK setup flow: Google service account and OpenClaw API key feed into the Python cleanup script.
Step 1: Install the OpenClaw Python SDK
Verify the install:
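The commands below assume the SDK ships as a pip package named `openclaw-sdk` with import name `openclaw`; check the OpenClaw docs for the canonical names before copying.

```shell
# Package and module names are assumptions — confirm against the OpenClaw docs
pip install openclaw-sdk

# Verify the install:
python -c "import openclaw; print(openclaw.__version__)"
```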
Step 2: Create a Google Cloud service account
Go to the Google Cloud Console.
Create a new project (or select an existing one).
Navigate to APIs & Services → Library and enable both the Google Sheets API and the Google Drive API.
Go to APIs & Services → Credentials → Create Credentials → Service account, then open the new service account and create a JSON key under Keys.
Download the JSON key file.
Share your target spreadsheet with the service account email (the `client_email` field inside the JSON).
Step 3: Set environment variables
Store credentials as environment variables rather than hardcoding them:
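For example, in your shell profile or deployment environment (placeholder values shown; substitute your real key material):

```shell
# Encode the service-account JSON so it survives env-var transport, e.g.:
#   export GOOGLE_CREDENTIALS_JSON="$(base64 < service-account.json)"
export GOOGLE_CREDENTIALS_JSON="<base64-encoded service account JSON>"
export OPENCLAW_API_KEY="<your OpenClaw API key>"
```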
For OpenClaw specifically, you can also configure API keys in ~/.openclaw/openclaw.json:
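A minimal sketch using the `env` block and `${VAR}` substitution the OpenClaw docs describe; the exact schema of `openclaw.json` may include other top-level keys:

```json
{
  "env": {
    "OPENCLAW_API_KEY": "${OPENCLAW_API_KEY}"
  }
}
```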
Tip: OpenClaw resolves environment variables from multiple sources in order of precedence: process environment → `.env` in CWD → global `~/.openclaw/.env` → config `env` block. See the OpenClaw Environment Variables docs for details.
Script: Read Spreadsheet, Clean, Standardize, and Flag
This is the core of the workflow. The Python script follows four discrete steps—measure → identify → fix → validate—so you can audit exactly what changed and by how much.
Figure 2 — The four-phase cleanup loop: measure, identify, fix, validate.
Full cleanup script
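Below is a minimal, testable sketch of the four phases. Sheet and AI I/O are deliberately stubbed out: in the real pipeline you would read rows via the Google Sheets API (for example with the `gspread` library) and route anomalies through `oc.chat()`. The field names (`email`, `phone`, `address`), the abbreviation map, and the regexes are illustrative starting points, not a definitive spec.

```python
import hashlib
import re

# --- format rules (illustrative; tighten for your data) ---
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PHONE_RE = re.compile(r"^\+\d{7,15}$")  # E.164: '+' followed by 7-15 digits
ADDRESS_ABBREVS = {"St.": "Street", "Ave.": "Avenue", "Rd.": "Road"}

def normalize_phone(raw: str, default_country: str = "+1") -> str:
    """Strip punctuation and coerce to E.164; assumes a default country code."""
    digits = re.sub(r"\D", "", raw)
    return ("+" + digits) if raw.strip().startswith("+") else default_country + digits

def normalize_email(raw: str) -> str:
    return raw.strip().lower()

def expand_address(raw: str) -> str:
    for abbrev, full in ADDRESS_ABBREVS.items():
        raw = raw.replace(abbrev, full)
    return raw

def row_hash(row: dict) -> str:
    """Stable fingerprint over all fields, used for duplicate detection."""
    key = "|".join(str(row.get(k, "")).strip().lower() for k in sorted(row))
    return hashlib.sha256(key.encode()).hexdigest()

def clean(rows: list[dict]) -> tuple[list[dict], dict]:
    baseline = len(rows)              # MEASURE: snapshot before touching anything
    seen, cleaned = set(), []
    for row in rows:                  # IDENTIFY + FIX
        h = row_hash(row)
        if h in seen:                 # exact duplicate: keep first occurrence
            continue
        seen.add(h)
        cleaned.append(dict(row,
                            email=normalize_email(row.get("email", "")),
                            phone=normalize_phone(row.get("phone", "")),
                            address=expand_address(row.get("address", ""))))
    anomalies = [r for r in cleaned   # VALIDATE: re-run checks on cleaned data
                 if not EMAIL_RE.match(r["email"]) or not PHONE_RE.match(r["phone"])]
    report = {"before": baseline, "after": len(cleaned),
              "duplicates_removed": baseline - len(cleaned),
              "remaining_anomalies": len(anomalies)}
    return cleaned, report

# In the real pipeline: read rows from the Sheets API, call clean(rows),
# write the cleaned rows back, and send `anomalies` to oc.chat() for triage.
```

Because the cleaning functions are pure, you can unit-test them locally with no credentials, then bolt the Sheets read/write and the OpenClaw call on around `clean()`.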
What each phase does
| Phase | Action | Output |
|---|---|---|
| Measure | Read all rows, snapshot | Baseline row count |
| Identify | Hash-based duplicate detection; regex checks for email, phone | Count of duplicates, list of anomalies |
| Fix | Remove duplicate rows (keep first); normalize phone to E.164, lowercase email, expand address abbreviations | Cleaned record set |
| Validate | Re-run anomaly checks on cleaned data; compare before/after counts; send remaining issues to OpenClaw for AI analysis | Validation report with savings metrics |
Env Vars: GOOGLE_CREDENTIALS_JSON, OPENCLAW_API_KEY
Both secrets must be available at runtime. Here's a quick reference for each deployment context:
| Variable | Description | Where to set |
|---|---|---|
| `GOOGLE_CREDENTIALS_JSON` | Base64-encoded Google Cloud service account JSON | Shell export, `.env` file, or Bolt Scheduler Environment |
| `OPENCLAW_API_KEY` | API key for the OpenClaw SDK client | Shell export, `~/.openclaw/openclaw.json` `env` block, or Bolt Scheduler Environment |
Security best practices:
- Never commit credentials to Git. Use `.gitignore` to exclude `.env` files.
- In OpenClaw, prefer the `env` block in `~/.openclaw/openclaw.json` or the `${VAR}` substitution syntax for secret references. See the OpenClaw secrets docs.
- For Paradime Bolt, store secrets as environment variables in your Scheduler Environment configuration.
Bolt Schedule: Cron Weekly
With the cleanup script working locally, the next step is to schedule it to run automatically every week using Paradime Bolt. Bolt supports schedules-as-code via a paradime_schedules.yml file in your dbt™ project root—right alongside dbt_project.yml.
Figure 3 — Weekly Bolt schedule sequence: cron trigger → cleanup → dbt™ freshness check → notification.
paradime_schedules.yml
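A sketch of the schedule file. The top-level shape (`name`, `schedule`, `environment`, `commands`) follows Bolt's schedules-as-code format, but confirm the exact field names against the Bolt docs before merging; the script path and dbt™ selector are illustrative.

```yaml
schedules:
  - name: weekly_spreadsheet_cleanup
    schedule: '0 2 * * 0'        # Sundays at 02:00
    environment: production       # the only supported environment for Bolt schedules
    commands:
      - python scripts/cleanup_spreadsheet.py   # the cleanup script (path is illustrative)
      - dbt source freshness
      - dbt build --select source:cleaned_sheet+  # hypothetical selector for downstream models
```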
Key configuration details
- `schedule: '0 2 * * 0'` is standard cron: minute 0, hour 2, any day of month, any month, Sunday (day 0). Validate at crontab.guru.
- `environment: production` is the only supported environment for Bolt schedules.
- Deployment: merge `paradime_schedules.yml` to your default branch (`main` or `master`). Bolt auto-refreshes every 10 minutes, or trigger manually via Bolt → Parse Schedules.
- Trigger types: beyond cron, Bolt supports On Run Completion, On Merge, and Bolt API triggers.
Validate your schedule YAML locally
Use the Paradime CLI to catch formatting errors before merging:
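The invocation below is a hypothetical sketch; confirm the actual subcommand in the Paradime CLI reference, as the name may differ:

```shell
# Assumed command name — verify against the Paradime CLI docs
paradime schedule verify
```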
To dry-run without executing dbt™ models:
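Again a hypothetical sketch; the schedule name matches the YAML above and the flag name should be checked against the CLI help:

```shell
# Assumed flag — check the Paradime CLI help for the real dry-run option
paradime schedule run weekly_spreadsheet_cleanup --dry-run
```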
Adding dbt™ tests for ongoing data quality
Once the cleaned data lands in your warehouse (via a source or seed), add dbt™ tests that enforce the same standards your script applies. This way, regressions get caught automatically on every run.
In your schema.yml:
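A sketch assuming the cleaned data lands as a source named `cleaned_sheet` with a `contacts` table; source and column names are illustrative. The regexes mirror the ones the cleanup script enforces:

```yaml
version: 2

sources:
  - name: cleaned_sheet
    tables:
      - name: contacts
        columns:
          - name: email
            tests:
              - unique
              - not_null
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: '^[\w.+-]+@[\w-]+\.[\w.-]+$'
          - name: phone
            tests:
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: '^\+\d{7,15}$'   # E.164
```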
Note: The `dbt_expectations` package provides regex-based validation out of the box. Add it to your `packages.yml`:
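For example (the package is published on the dbt Hub under `calogica/dbt_expectations`; pin a version range that matches your dbt™ version):

```yaml
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]  # adjust the pin to your dbt version
```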
Monitoring and Debugging
Once your weekly schedule is live, you need visibility into whether it's actually running, how long it takes, and what happens when it breaks.
Paradime Radar: Schedule monitoring
Radar's dbt™ Monitoring provides three dashboards relevant to this workflow:
| Dashboard | What you see |
|---|---|
| Schedules Dashboard | Execution frequency, success/error rates, SLA compliance |
| Models Dashboard | Per-model execution time, materialization type, dependency graph |
| Real-time Alerting | Slack/email alerts when schedules fail or breach SLA |
Bolt run logs: Three levels of depth
When a schedule run completes (or fails), Bolt provides three tiers of logs accessible from the Run History view:
Figure 4 — Bolt's three log tiers: summary for quick triage, console for troubleshooting, debug for deep dives.
Summary Logs: DinoAI-generated overview of what ran, what warned, and what to fix. Start here.
Console Logs: Full chronological output—equivalent to what you'd see in your terminal. Use for step-by-step debugging.
Debug Logs: System-level dbt™ internals and performance data. Use when console logs aren't enough.
Access logs by navigating to Bolt → [Your Schedule] → Run History → [Run ID] → Logs and Artifacts.
Artifacts for post-mortem analysis
Each Bolt run produces standard dbt™ artifacts:
| Artifact | Purpose |
|---|---|
| `manifest.json` | Full project graph—models, tests, sources, exposures |
| `run_results.json` | Per-node execution status, timing, and row counts |
| `catalog.json` | Column-level metadata for all materialized objects |
| `sources.json` | Source freshness results |
Download these from the run detail page and use them to compare performance across runs or feed them into external observability tools.
OpenClaw monitoring
For the OpenClaw portion of your pipeline (AI anomaly detection), monitor:
- Gateway logs: check `~/.openclaw/logs/` for skill execution traces.
- Skill errors: if the `oc.chat()` call fails, the cleanup script should catch the exception and log it; don't let a failed AI call block the entire cleanup.
- Token usage: track token consumption per run if you're using a metered LLM provider. Log the response metadata from the OpenClaw client.
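One way to keep a failed AI call from aborting the run is a thin wrapper around the chat call. The sketch below assumes a client object exposing a `chat(prompt)` method, mirroring the `oc.chat()` call used in the cleanup script; the real SDK surface may differ.

```python
import logging

logger = logging.getLogger("cleanup")

def analyze_anomalies(client, anomalies):
    """Send remaining anomalies to the AI agent, but never let an AI
    failure abort the cleanup run. Returns the model's reply, or None
    when there is nothing to analyze or the call fails."""
    if not anomalies:
        return None
    # Cap the payload so a huge sheet can't blow the context window.
    prompt = ("Explain likely causes for these anomalous records:\n"
              + "\n".join(map(str, anomalies[:50])))
    try:
        return client.chat(prompt)
    except Exception as exc:  # any transport/model error: log and move on
        logger.warning("AI anomaly analysis failed, continuing: %s", exc)
        return None
```

The cleanup itself (dedupe, normalize, validate) never depends on the AI reply, so a gateway outage degrades the run to "no triage notes" instead of a failure.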
Troubleshooting Common Issues
Authentication failures
| Symptom | Cause | Fix |
|---|---|---|
| Sheets API returns a permission error (HTTP 403) | Service account email not shared on the spreadsheet | Open the Google Sheet → Share → add the `client_email` from the JSON key |
| Authentication fails when building the Sheets client (e.g. `invalid_grant`) | Expired or invalid service account key | Re-download the JSON key from Google Cloud Console and re-encode it |
| OpenClaw SDK calls rejected as unauthorized | `OPENCLAW_API_KEY` missing or wrong in the runtime environment | Check that the variable is set where the script actually runs (see the env-var precedence order above) |
Schedule not triggering
| Symptom | Cause | Fix |
|---|---|---|
| Schedule shows as "Off" in Bolt UI | Invalid cron expression | Validate cron at crontab.guru; use days 0-6, not 1-7 |
| YAML changes not reflected | `paradime_schedules.yml` not merged to the default branch, or Bolt hasn't re-parsed yet | Merge to `main`/`master`, wait up to 10 minutes, or trigger Bolt → Parse Schedules |
| Schedule runs but cleanup script fails | Missing env vars in production environment | Configure `GOOGLE_CREDENTIALS_JSON` and `OPENCLAW_API_KEY` in the Bolt Scheduler Environment |
Data quality issues persist after cleanup
| Symptom | Cause | Fix |
|---|---|---|
| dbt™ tests still failing on the cleaned source | New rows added between cleanup and dbt™ run | Run the cleanup script immediately before the dbt™ step in the same Bolt schedule |
| Phone numbers still failing regex | International numbers outside E.164 range | Expand the normalization logic to handle country codes beyond the default |
| Duplicates reappearing weekly | Source system re-exports duplicates | Add a deduplication dbt™ model (e.g. `row_number()` over the natural key, keeping the first row) |
OpenClaw skill or SDK errors
| Symptom | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError` when importing the SDK | SDK not installed in the execution environment | Run the SDK install (e.g. `pip install openclaw-sdk`) in the same environment that runs the script |
| OpenClaw AI response is empty or unhelpful | Prompt too vague or model context window exceeded | Reduce the anomaly payload sent to `oc.chat()`; send only flagged rows, not the whole sheet |
| Gateway timeout on `oc.chat()` calls | OpenClaw gateway not running or unreachable | Verify the gateway process is running and check `~/.openclaw/logs/` for errors |
Wrapping Up
You now have a complete, repeatable pipeline that turns a messy Google Sheet into clean, validated data—every week, without manual intervention.
Here's what you built:
Figure 5 — End-to-end architecture: Bolt schedules the cleanup, the script cleans the data, dbt™ validates quality, and Radar monitors everything.
The four-phase loop, summarized
Measure — Read the spreadsheet and capture baseline metrics (row count, anomaly count).
Identify — Hash-based duplicate detection and regex-based format validation for phone, email, and address fields.
Fix — Remove duplicates (keep first occurrence), normalize phone to E.164, lowercase email, expand address abbreviations.
Validate — Re-run all checks on the cleaned dataset, log before/after savings, and route remaining edge cases to OpenClaw for AI-powered triage.
What to do next
Extend the script: Add columns (company name normalization, zip code validation) by following the same measure → identify → fix → validate pattern.
Add more dbt™ tests: Use dbt-expectations for column-level regex checks, uniqueness, and accepted-value constraints.
Scale with OpenClaw skills: Build a custom OpenClaw skill that wraps your cleanup logic and exposes it via Slack commands—let stakeholders trigger ad-hoc cleanups by typing a message.
Monitor costs: Use Paradime Radar to track warehouse credit consumption from your cleanup-related dbt™ runs and optimize where needed.
The hardest part of data quality isn't the first cleanup—it's making sure it sticks. By combining Paradime Bolt for scheduling, OpenClaw for intelligent anomaly detection, and dbt™ tests for automated validation, you've built a system that catches regressions before they reach your dashboards.

