How to Clean and Deduplicate Spreadsheet Data with OpenClaw in Paradime

Feb 26, 2026

Automate Spreadsheet Cleanup with Paradime, OpenClaw, and dbt™: A Repeatable Workflow

Messy spreadsheets cost data teams hours every week. Duplicate rows, inconsistent phone formats, malformed email addresses, and rogue address abbreviations silently corrupt downstream analytics. Most "optimize your data" advice stops at vague platitudes—this guide doesn't.

Below, you'll build a concrete, repeatable workflow: read a Google Sheet, identify duplicates, standardize phone/email/address formats, flag anomalies, schedule the whole pipeline with Paradime Bolt, and validate that your cleanup actually stuck. Every step includes code you can copy, a schedule you can deploy, and monitoring you can rely on.

What is Paradime

Paradime is an all-in-one, AI-native platform purpose-built for analytics and data engineering teams. It replaces dbt Cloud™ with a faster, more flexible stack that covers the full lifecycle—development, orchestration, cost management, and monitoring—in one workspace.

Key capabilities you'll use in this guide

Capability | Product | What it does
Code IDE | Paradime Code | AI-assisted dbt™ development with DinoAI; cuts boilerplate by 83%+
Scheduling | Paradime Bolt | Cron and event-driven orchestration for dbt™ jobs—YAML-as-code or UI
Monitoring | Paradime Radar | Schedule dashboards, model performance, real-time alerting
FinOps | Paradime Radar | Snowflake and BigQuery cost visibility and warehouse credit optimization

Paradime is SOC 2 Type II certified, GDPR and CCPA compliant, and guarantees 99.9% uptime. Full docs live at docs.paradime.io.

What is OpenClaw

OpenClaw is an open-source, local-first AI assistant and automation platform. It runs on your own hardware—laptop, homelab, or VPS—and connects to chat apps like Slack, Telegram, or Discord to execute real tasks: file management, API calls, web scraping, and data cleaning.

Why OpenClaw for spreadsheet cleanup

Feature | Benefit for this workflow
Extensible skills & plugins | Install or author skills that read Google Sheets, call cleaning logic, and write results back
Model flexibility | Use any LLM backend (OpenAI, Claude, Kimi K2.5, local models) for anomaly detection
Local execution | Data never leaves your infrastructure—critical when spreadsheets contain PII
Plugin registry | openclaw plugins install to add capabilities like Google Sheets MCP integration

OpenClaw's architecture centers on a CLI, an agent gateway, and chat-app connectors. Requests route to your chosen model provider, which triggers skill execution through structured agent loops.

Setup: openclaw-sdk + Google Sheets API

Before writing any cleaning logic, you need two things wired up: the OpenClaw Python SDK for orchestrating the cleanup agent, and Google Sheets API credentials for reading and writing spreadsheet data.

Figure 1 — Credential and SDK setup flow: Google service account and OpenClaw API key feed into the Python cleanup script.

Step 1: Install the OpenClaw Python SDK
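
Assuming the SDK ships on PyPI under the package name openclaw (the same name used in the troubleshooting section later in this guide):

```bash
pip install openclaw
```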

Verify the install:
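
A minimal import check; the __version__ attribute is an assumption, so the fallback simply confirms the package is importable:

```bash
python -c "import openclaw; print(getattr(openclaw, '__version__', 'installed'))"
```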

Step 2: Create a Google Cloud service account

  1. Go to the Google Cloud Console.

  2. Create a new project (or select an existing one).

  3. Navigate to APIs & Services → Library and enable both the Google Sheets API and the Google Drive API.

  4. Go to APIs & Services → Credentials → Create Credentials → Service Account, then open the new service account and add a JSON key under Keys → Add Key.

  5. Download the JSON key file.

  6. Share your target spreadsheet with the service account email (the client_email field inside the JSON).

Step 3: Set environment variables

Store credentials as environment variables rather than hardcoding them:
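
For example, in your shell profile or a local .env file (the key value and file path below are placeholders):

```bash
# Base64-encode the downloaded service-account key file and export it
export GOOGLE_CREDENTIALS_JSON="$(base64 -w 0 service-account.json)"   # on macOS: base64 -i service-account.json
export OPENCLAW_API_KEY="oc_xxxxxxxxxxxx"                              # placeholder value
```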

For OpenClaw specifically, you can also configure API keys in ~/.openclaw/openclaw.json:
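
A minimal sketch, assuming the env block referenced in the tip below; other keys in this file are omitted and the value is a placeholder:

```json
{
  "env": {
    "OPENCLAW_API_KEY": "oc_xxxxxxxxxxxx"
  }
}
```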

Tip: OpenClaw resolves environment variables from multiple sources in order of precedence: process environment → .env in CWD → global ~/.openclaw/.env → config env block. See the OpenClaw Environment Variables docs for details.

Script: Read Spreadsheet, Clean, Standardize, and Flag

This is the core of the workflow. The Python script follows four discrete steps—measure → identify → fix → validate—so you can audit exactly what changed and by how much.

Figure 2 — The four-phase cleanup loop: measure, identify, fix, validate.

Full cleanup script
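
The sketch below walks the four phases end to end. It assumes a gspread service-account client, column names email, phone, and address, and an OpenClaw SDK client class named OpenClaw exposing the chat() call referenced later in this guide; adjust those names to your sheet and SDK version.

```python
"""Spreadsheet cleanup: measure -> identify -> fix -> validate.

Sketch only: the spreadsheet name, the column names (email, phone, address),
and the OpenClaw client class are placeholders -- adapt them to your setup.
"""
import base64, hashlib, json, os, re

import gspread

EMAIL_REGEX = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_REGEX = re.compile(r"^\+[1-9]\d{6,14}$")  # E.164: leading + and 7-15 digits
ADDRESS_ABBREVIATIONS = {r"\bst\.?$": "Street", r"\bave\.?$": "Avenue",
                         r"\brd\.?$": "Road", r"\bblvd\.?$": "Boulevard"}


def get_worksheet():
    """Decode the base64 service-account JSON and open the target sheet."""
    info = json.loads(base64.b64decode(os.environ["GOOGLE_CREDENTIALS_JSON"]))
    gc = gspread.service_account_from_dict(info)
    return gc.open("customer_contacts").sheet1  # placeholder spreadsheet name


def normalize_phone(raw, default_country_code="1"):
    """Naive E.164 normalization; the phonenumbers library is more robust for intl. numbers."""
    digits = re.sub(r"\D", "", str(raw))
    if not digits:
        return ""
    return "+" + digits if str(raw).strip().startswith("+") else f"+{default_country_code}{digits}"


def find_anomalies(rows):
    """Flag rows whose email or phone values fail the format checks."""
    issues = []
    for i, row in enumerate(rows, start=2):  # row 1 of the sheet is the header
        email = str(row.get("email") or "").strip()
        phone = str(row.get("phone") or "").strip()
        if email and not EMAIL_REGEX.match(email):
            issues.append({"row": i, "field": "email", "value": email})
        if phone and not PHONE_REGEX.match(phone):
            issues.append({"row": i, "field": "phone", "value": phone})
    return issues


def fix_rows(rows):
    """Drop exact duplicates (keep first) and standardize phone, email, and address."""
    seen, cleaned = set(), []
    for row in rows:
        key = hashlib.sha256(json.dumps(row, sort_keys=True, default=str).encode()).hexdigest()
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        row = dict(row)
        if row.get("email"):
            row["email"] = str(row["email"]).strip().lower()
        if row.get("phone"):
            row["phone"] = normalize_phone(row["phone"])
        if row.get("address"):
            addr = str(row["address"]).strip()
            for pattern, full in ADDRESS_ABBREVIATIONS.items():
                addr = re.sub(pattern, full, addr, flags=re.IGNORECASE)
            row["address"] = addr
        cleaned.append(row)
    return cleaned


def main():
    ws = get_worksheet()

    # 1. Measure: snapshot the baseline before touching anything.
    rows = ws.get_all_records()
    total_before, anomalies_before = len(rows), find_anomalies(rows)

    # 2-3. Identify + fix: dedupe and standardize in one pass.
    cleaned = fix_rows(rows)

    # Write the cleaned rows back: header first, then values in column order.
    header = list(rows[0].keys()) if rows else []
    ws.clear()
    ws.append_rows([header] + [[r.get(col, "") for col in header] for r in cleaned])

    # 4. Validate: re-run the checks and report before/after savings.
    remaining = find_anomalies(cleaned)
    print(f"rows: {total_before} -> {len(cleaned)} "
          f"(removed {total_before - len(cleaned)} duplicates); "
          f"anomalies: {len(anomalies_before)} -> {len(remaining)}")

    if remaining:
        try:
            from openclaw import OpenClaw  # hypothetical client class -- adjust to your SDK
            oc = OpenClaw(api_key=os.environ["OPENCLAW_API_KEY"])
            print(oc.chat("Suggest fixes for these remaining anomalies: " + json.dumps(remaining[:50])))
        except Exception as exc:  # never let a failed AI call block the cleanup itself
            print(f"OpenClaw triage skipped: {exc}")


if __name__ == "__main__":
    main()
```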

What each phase does

Phase | Action | Output
Measure | Read all rows, snapshot total_before | Baseline row count
Identify | Hash-based duplicate detection; regex checks for email, phone | Count of duplicates, list of anomalies
Fix | Remove duplicate rows (keep first); normalize phone to E.164, lowercase email, expand address abbreviations | Cleaned record set
Validate | Re-run anomaly checks on cleaned data; compare before/after counts; send remaining issues to OpenClaw for AI analysis | Validation report with savings metrics

Env Vars: GOOGLE_CREDENTIALS_JSON, OPENCLAW_API_KEY

Both secrets must be available at runtime. Here's a quick reference for each deployment context:

Variable | Description | Where to set
GOOGLE_CREDENTIALS_JSON | Base64-encoded Google Cloud service account JSON | Shell export, .env file, or CI/CD secrets
OPENCLAW_API_KEY | API key for the OpenClaw SDK client | Shell export, ~/.openclaw/openclaw.json, or .env

Security best practices:

  • Never commit credentials to Git. Use .gitignore to exclude .env files.

  • In OpenClaw, prefer the env block in ~/.openclaw/openclaw.json or the ${VAR} substitution syntax for secret references. See OpenClaw secrets docs.

  • For Paradime Bolt, store secrets as environment variables in your Scheduler Environment configuration.

Bolt Schedule: Cron Weekly

With the cleanup script working locally, the next step is to schedule it to run automatically every week using Paradime Bolt. Bolt supports schedules-as-code via a paradime_schedules.yml file in your dbt™ project root—right alongside dbt_project.yml.

Figure 3 — Weekly Bolt schedule sequence: cron trigger → cleanup → dbt™ freshness check → notification.

paradime_schedules.yml
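
A sketch of the schedule file. The schedule and environment keys match the configuration notes below; the top-level layout, schedule name, and commands are illustrative, so check the Bolt schedules-as-code docs for the full set of supported keys (including notification settings):

```yaml
schedules:
  - name: weekly_sheet_cleanup                 # illustrative schedule name
    schedule: '0 2 * * 0'                      # Sundays at 02:00
    environment: production
    commands:
      - python scripts/clean_spreadsheet.py    # the cleanup script from the previous section
      - dbt source freshness
      - dbt build --select source:sheets+      # run models and tests downstream of the cleaned source
```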

Key configuration details

  • schedule: '0 2 * * 0' — Standard cron: minute 0, hour 2, any day of month, any month, Sunday (day 0). Validate at crontab.guru.

  • environment: production — The only supported environment for Bolt schedules.

  • Deployment: Merge paradime_schedules.yml to your default branch (main or master). Bolt auto-refreshes every 10 minutes, or trigger manually via Bolt → Parse Schedules.

  • Trigger types: Beyond cron, Bolt supports On Run Completion, On Merge, and Bolt API triggers.

Validate your schedule YAML locally

Use the Paradime CLI to catch formatting errors before merging, and to dry-run the schedule without executing any dbt™ models; see the Paradime CLI docs for the exact verification commands.
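
If the Paradime CLI isn't available where you're editing, a plain parse with PyYAML at least catches indentation and quoting mistakes before you merge; it does not validate Bolt-specific fields:

```bash
# Requires PyYAML (pip install pyyaml)
python -c "import yaml; yaml.safe_load(open('paradime_schedules.yml')); print('YAML parses cleanly')"
```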

Adding dbt™ tests for ongoing data quality

Once the cleaned data lands in your warehouse (via a source or seed), add dbt™ tests that enforce the same standards your script applies. This way, regressions get caught automatically on every run.

In your schema.yml:
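
A sketch of the tests, assuming the cleaned sheet is exposed as a source named sheets.customer_contacts; the source name, column names, and regex patterns are placeholders to adapt:

```yaml
version: 2

sources:
  - name: sheets
    tables:
      - name: customer_contacts
        columns:
          - name: email
            tests:
              - unique
              - not_null
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: '^[^@\s]+@[^@\s]+\.[^@\s]+$'
          - name: phone
            tests:
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: '^\+[1-9]\d{6,14}$'   # E.164 format
```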

Note: The dbt_expectations package provides regex-based validation out of the box. Add it to your packages.yml:
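
For example (the version range is illustrative; pin to the latest release listed on dbt Hub), then run dbt deps to install it:

```yaml
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
```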

Monitoring and Debugging

Once your weekly schedule is live, you need visibility into whether it's actually running, how long it takes, and what happens when it breaks.

Paradime Radar: Schedule monitoring

Radar's dbt™ Monitoring provides three dashboards relevant to this workflow:

Dashboard | What you see
Schedules Dashboard | Execution frequency, success/error rates, SLA compliance
Models Dashboard | Per-model execution time, materialization type, dependency graph
Real-time Alerting | Slack/email alerts when schedules fail or breach SLA

Bolt run logs: Three levels of depth

When a schedule run completes (or fails), Bolt provides three tiers of logs accessible from the Run History view:

Figure 4 — Bolt's three log tiers: summary for quick triage, console for troubleshooting, debug for deep dives.

  • Summary Logs: DinoAI-generated overview of what ran, what warned, and what to fix. Start here.

  • Console Logs: Full chronological output—equivalent to what you'd see in your terminal. Use for step-by-step debugging.

  • Debug Logs: System-level dbt™ internals and performance data. Use when console logs aren't enough.

Access logs by navigating to Bolt → [Your Schedule] → Run History → [Run ID] → Logs and Artifacts.

Artifacts for post-mortem analysis

Each Bolt run produces standard dbt™ artifacts:

Artifact | Purpose
manifest.json | Full project graph—models, tests, sources, exposures
run_results.json | Per-node execution status, timing, and row counts
catalog.json | Column-level metadata for all materialized objects
sources.json | Source freshness results

Download these from the run detail page and use them to compare performance across runs or feed them into external observability tools.
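
As a starting point, a few lines of Python pull per-node timings out of run_results.json so you can diff slow models across runs; the results list with unique_id, status, and execution_time fields is standard dbt artifact output:

```python
import json

# Load the artifact downloaded from the Bolt run detail page.
with open("run_results.json") as f:
    results = json.load(f)["results"]

# Print the ten slowest nodes for this run (execution_time can be null for skipped nodes).
slowest = sorted(results, key=lambda r: r.get("execution_time") or 0, reverse=True)[:10]
for node in slowest:
    print(f'{node["unique_id"]:<60} {node["status"]:<8} {(node.get("execution_time") or 0):.1f}s')
```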

OpenClaw monitoring

For the OpenClaw portion of your pipeline (AI anomaly detection), monitor:

  • Gateway logs: Check ~/.openclaw/logs/ for skill execution traces.

  • Skill errors: If the oc.chat() call fails, the cleanup script should catch the exception and log it—don't let a failed AI call block the entire cleanup.

  • Token usage: Track token consumption per run if you're using a metered LLM provider. Log the response metadata from the OpenClaw client.

Troubleshooting Common Issues

Authentication failures

Symptom | Cause | Fix
gspread.exceptions.SpreadsheetNotFound | Service account email not shared on the spreadsheet | Open the Google Sheet → Share → add the client_email from your service account JSON
oauth2client.client.HttpAccessTokenRefreshError | Expired or invalid service account key | Re-download the JSON key from Google Cloud Console and re-encode it
openclaw: No API key found | OPENCLAW_API_KEY not set or not in the correct precedence layer | Check echo $OPENCLAW_API_KEY or verify the ~/.openclaw/openclaw.json env block

Schedule not triggering

Symptom | Cause | Fix
Schedule shows as "Off" in Bolt UI | schedule: 'OFF' in YAML or invalid cron expression | Validate cron at crontab.guru — use days 0-6, not 1-7
YAML changes not reflected | paradime_schedules.yml not on default branch | Merge to main/master; Bolt auto-refreshes every 10 min, or manually parse via Bolt → Parse Schedules
Schedule runs but cleanup script fails | Missing env vars in production environment | Configure GOOGLE_CREDENTIALS_JSON and OPENCLAW_API_KEY in Bolt's Scheduler Environment settings

Data quality issues persist after cleanup

Symptom | Cause | Fix
dbt™ tests still failing on email column | New rows added between cleanup and dbt™ run | Add the cleanup script before dbt run in your Bolt commands list
Phone numbers still failing regex | International formats the simple PHONE_REGEX doesn't cover | Expand PHONE_REGEX to handle country-specific formats, or use the phonenumbers Python library for robust parsing
Duplicates reappearing weekly | Source system re-exports duplicates | Add a deduplication dbt™ model with QUALIFY ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) = 1 (see the model sketch after this table)
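
A minimal version of that deduplication model, assuming the cleaned sheet is registered as a dbt™ source named sheets.customer_contacts with an updated_at column (both names are placeholders); QUALIFY is supported on Snowflake and BigQuery:

```sql
-- models/staging/stg_contacts_deduped.sql
select *
from {{ source('sheets', 'customer_contacts') }}
qualify row_number() over (partition by email order by updated_at desc) = 1
```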

OpenClaw skill or SDK errors

Symptom | Cause | Fix
ModuleNotFoundError: No module named 'openclaw' | SDK not installed in the execution environment | Run pip install openclaw in the same virtualenv that Bolt uses
OpenClaw AI response is empty or unhelpful | Prompt too vague or model context window exceeded | Reduce the anomaly payload sent to oc.chat() — batch into smaller chunks
Gateway timeout on oc.chat() | OpenClaw gateway not running or unreachable | Verify with openclaw gateway status; restart with openclaw gateway restart

Wrapping Up

You now have a complete, repeatable pipeline that turns a messy Google Sheet into clean, validated data—every week, without manual intervention.

Here's what you built:

Figure 5 — End-to-end architecture: Bolt schedules the cleanup, the script cleans the data, dbt™ validates quality, and Radar monitors everything.

The four-phase loop, summarized

  1. Measure — Read the spreadsheet and capture baseline metrics (row count, anomaly count).

  2. Identify — Hash-based duplicate detection and regex-based format validation for phone, email, and address fields.

  3. Fix — Remove duplicates (keep first occurrence), normalize phone to E.164, lowercase email, expand address abbreviations.

  4. Validate — Re-run all checks on the cleaned dataset, log before/after savings, and route remaining edge cases to OpenClaw for AI-powered triage.

What to do next

  • Extend the script: Add columns (company name normalization, zip code validation) by following the same measure → identify → fix → validate pattern.

  • Add more dbt™ tests: Use dbt-expectations for column-level regex checks, uniqueness, and accepted-value constraints.

  • Scale with OpenClaw skills: Build a custom OpenClaw skill that wraps your cleanup logic and exposes it via Slack commands—let stakeholders trigger ad-hoc cleanups by typing a message.

  • Monitor costs: Use Paradime Radar to track warehouse credit consumption from your cleanup-related dbt™ runs and optimize where needed.

The hardest part of data quality isn't the first cleanup—it's making sure it sticks. By combining Paradime Bolt for scheduling, OpenClaw for intelligent anomaly detection, and dbt™ tests for automated validation, you've built a system that catches regressions before they reach your dashboards.

Interested in Learning More?
Try Out the Free 14-Day Trial

Stop Managing Pipelines. Start Shipping Them.

Join the teams that replaced manual dbt™ workflows with agentic AI. Free to start, no credit card required.


Copyright © 2026 Paradime Labs, Inc. Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.
