How to Clean and Deduplicate Spreadsheet Data with OpenClaw in Paradime

Feb 26, 2026

Automate Spreadsheet Cleanup with Paradime, OpenClaw, and dbt™: A Repeatable Workflow

Messy spreadsheets cost data teams hours every week. Duplicate rows, inconsistent phone formats, malformed email addresses, and rogue address abbreviations silently corrupt downstream analytics. Most "optimize your data" advice stops at vague platitudes—this guide doesn't.

Below, you'll build a concrete, repeatable workflow: read a Google Sheet, identify duplicates, standardize phone/email/address formats, flag anomalies, schedule the whole pipeline with Paradime Bolt, and validate that your cleanup actually stuck. Every step includes code you can copy, a schedule you can deploy, and monitoring you can rely on.

What is Paradime

Paradime is an all-in-one, AI-native platform purpose-built for analytics and data engineering teams. It replaces dbt Cloud™ with a faster, more flexible stack that covers the full lifecycle—development, orchestration, cost management, and monitoring—in one workspace.

Key capabilities you'll use in this guide

Capability | Product | What it does
Code IDE | Paradime Code | AI-assisted dbt™ development with DinoAI; cuts boilerplate by 83%+
Scheduling | Paradime Bolt | Cron and event-driven orchestration for dbt™ jobs—YAML-as-code or UI
Monitoring | Paradime Radar | Schedule dashboards, model performance, real-time alerting
FinOps | Paradime Radar | Snowflake and BigQuery cost visibility and warehouse credit optimization

Paradime is SOC 2 Type II certified, GDPR and CCPA compliant, and guarantees 99.9% uptime. Full docs live at docs.paradime.io.

What is OpenClaw

OpenClaw is an open-source, local-first AI assistant and automation platform. It runs on your own hardware—laptop, homelab, or VPS—and connects to chat apps like Slack, Telegram, or Discord to execute real tasks: file management, API calls, web scraping, and data cleaning.

Why OpenClaw for spreadsheet cleanup

Feature | Benefit for this workflow
Extensible skills & plugins | Install or author skills that read Google Sheets, call cleaning logic, and write results back
Model flexibility | Use any LLM backend (OpenAI, Claude, Kimi K2.5, local models) for anomaly detection
Local execution | Data never leaves your infrastructure—critical when spreadsheets contain PII
Plugin registry | openclaw plugins install to add capabilities like Google Sheets MCP integration

OpenClaw's architecture centers on a CLI, an agent gateway, and chat-app connectors. Requests route to your chosen model provider, which triggers skill execution through structured agent loops.

Setup: openclaw-sdk + Google Sheets API

Before writing any cleaning logic, you need two things wired up: the OpenClaw Python SDK for orchestrating the cleanup agent, and Google Sheets API credentials for reading and writing spreadsheet data.

Figure 1 — Credential and SDK setup flow: Google service account and OpenClaw API key feed into the Python cleanup script.

Step 1: Install the OpenClaw Python SDK
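
Assuming the SDK ships on PyPI under the package name openclaw (the same name used in the troubleshooting section later in this guide):

```bash
pip install openclaw
```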

Verify the install:
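
A minimal import check; the __version__ attribute is an assumption, so the fallback simply confirms the package is importable:

```bash
python -c "import openclaw; print(getattr(openclaw, '__version__', 'installed'))"
```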

Step 2: Create a Google Cloud service account

  1. Go to the Google Cloud Console.

  2. Create a new project (or select an existing one).

  3. Navigate to APIs & Services → Library and enable both the Google Sheets API and the Google Drive API.

  4. Go to APIs & Services → Credentials → Create Credentials → Service Account, then open the new service account and add a JSON key under Keys → Add Key.

  5. Download the JSON key file.

  6. Share your target spreadsheet with the service account email (the client_email field inside the JSON).

Step 3: Set environment variables

Store credentials as environment variables rather than hardcoding them:
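
For example, in your shell profile or a local .env file (the key value and file path below are placeholders):

```bash
# Base64-encode the downloaded service-account key file and export it
export GOOGLE_CREDENTIALS_JSON="$(base64 -w 0 service-account.json)"   # on macOS: base64 -i service-account.json
export OPENCLAW_API_KEY="oc_xxxxxxxxxxxx"                              # placeholder value
```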

For OpenClaw specifically, you can also configure API keys in ~/.openclaw/openclaw.json:
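
A minimal sketch, assuming the env block referenced in the tip below; other keys in this file are omitted and the value is a placeholder:

```json
{
  "env": {
    "OPENCLAW_API_KEY": "oc_xxxxxxxxxxxx"
  }
}
```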

Tip: OpenClaw resolves environment variables from multiple sources in order of precedence: process environment → .env in CWD → global ~/.openclaw/.env → config env block. See the OpenClaw Environment Variables docs for details.

Script: Read Spreadsheet, Clean, Standardize, and Flag

This is the core of the workflow. The Python script follows four discrete steps—measure → identify → fix → validate—so you can audit exactly what changed and by how much.

Figure 2 — The four-phase cleanup loop: measure, identify, fix, validate.

Full cleanup script
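
The sketch below walks the four phases end to end. It assumes a gspread service-account client, column names email, phone, and address, and an OpenClaw SDK client class named OpenClaw exposing the chat() call referenced later in this guide; adjust those names to your sheet and SDK version.

```python
"""Spreadsheet cleanup: measure -> identify -> fix -> validate.

Sketch only: the spreadsheet name, the column names (email, phone, address),
and the OpenClaw client class are placeholders -- adapt them to your setup.
"""
import base64, hashlib, json, os, re

import gspread

EMAIL_REGEX = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_REGEX = re.compile(r"^\+[1-9]\d{6,14}$")  # E.164: leading + and 7-15 digits
ADDRESS_ABBREVIATIONS = {r"\bst\.?$": "Street", r"\bave\.?$": "Avenue",
                         r"\brd\.?$": "Road", r"\bblvd\.?$": "Boulevard"}


def get_worksheet():
    """Decode the base64 service-account JSON and open the target sheet."""
    info = json.loads(base64.b64decode(os.environ["GOOGLE_CREDENTIALS_JSON"]))
    gc = gspread.service_account_from_dict(info)
    return gc.open("customer_contacts").sheet1  # placeholder spreadsheet name


def normalize_phone(raw, default_country_code="1"):
    """Naive E.164 normalization; the phonenumbers library is more robust for intl. numbers."""
    digits = re.sub(r"\D", "", str(raw))
    if not digits:
        return ""
    return "+" + digits if str(raw).strip().startswith("+") else f"+{default_country_code}{digits}"


def find_anomalies(rows):
    """Flag rows whose email or phone values fail the format checks."""
    issues = []
    for i, row in enumerate(rows, start=2):  # row 1 of the sheet is the header
        email = str(row.get("email") or "").strip()
        phone = str(row.get("phone") or "").strip()
        if email and not EMAIL_REGEX.match(email):
            issues.append({"row": i, "field": "email", "value": email})
        if phone and not PHONE_REGEX.match(phone):
            issues.append({"row": i, "field": "phone", "value": phone})
    return issues


def fix_rows(rows):
    """Drop exact duplicates (keep first) and standardize phone, email, and address."""
    seen, cleaned = set(), []
    for row in rows:
        key = hashlib.sha256(json.dumps(row, sort_keys=True, default=str).encode()).hexdigest()
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        row = dict(row)
        if row.get("email"):
            row["email"] = str(row["email"]).strip().lower()
        if row.get("phone"):
            row["phone"] = normalize_phone(row["phone"])
        if row.get("address"):
            addr = str(row["address"]).strip()
            for pattern, full in ADDRESS_ABBREVIATIONS.items():
                addr = re.sub(pattern, full, addr, flags=re.IGNORECASE)
            row["address"] = addr
        cleaned.append(row)
    return cleaned


def main():
    ws = get_worksheet()

    # 1. Measure: snapshot the baseline before touching anything.
    rows = ws.get_all_records()
    total_before, anomalies_before = len(rows), find_anomalies(rows)

    # 2-3. Identify + fix: dedupe and standardize in one pass.
    cleaned = fix_rows(rows)

    # Write the cleaned rows back: header first, then values in column order.
    header = list(rows[0].keys()) if rows else []
    ws.clear()
    ws.append_rows([header] + [[r.get(col, "") for col in header] for r in cleaned])

    # 4. Validate: re-run the checks and report before/after savings.
    remaining = find_anomalies(cleaned)
    print(f"rows: {total_before} -> {len(cleaned)} "
          f"(removed {total_before - len(cleaned)} duplicates); "
          f"anomalies: {len(anomalies_before)} -> {len(remaining)}")

    if remaining:
        try:
            from openclaw import OpenClaw  # hypothetical client class -- adjust to your SDK
            oc = OpenClaw(api_key=os.environ["OPENCLAW_API_KEY"])
            print(oc.chat("Suggest fixes for these remaining anomalies: " + json.dumps(remaining[:50])))
        except Exception as exc:  # never let a failed AI call block the cleanup itself
            print(f"OpenClaw triage skipped: {exc}")


if __name__ == "__main__":
    main()
```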

What each phase does

Phase | Action | Output
Measure | Read all rows, snapshot total_before | Baseline row count
Identify | Hash-based duplicate detection; regex checks for email, phone | Count of duplicates, list of anomalies
Fix | Remove duplicate rows (keep first); normalize phone to E.164, lowercase email, expand address abbreviations | Cleaned record set
Validate | Re-run anomaly checks on cleaned data; compare before/after counts; send remaining issues to OpenClaw for AI analysis | Validation report with savings metrics

Env Vars: GOOGLE_CREDENTIALS_JSON, OPENCLAW_API_KEY

Both secrets must be available at runtime. Here's a quick reference for each deployment context:

Variable | Description | Where to set
GOOGLE_CREDENTIALS_JSON | Base64-encoded Google Cloud service account JSON | Shell export, .env file, or CI/CD secrets
OPENCLAW_API_KEY | API key for the OpenClaw SDK client | Shell export, ~/.openclaw/openclaw.json, or .env

Security best practices:

  • Never commit credentials to Git. Use .gitignore to exclude .env files.

  • In OpenClaw, prefer the env block in ~/.openclaw/openclaw.json or the ${VAR} substitution syntax for secret references. See OpenClaw secrets docs.

  • For Paradime Bolt, store secrets as environment variables in your Scheduler Environment configuration.

Bolt Schedule: Cron Weekly

With the cleanup script working locally, the next step is to schedule it to run automatically every week using Paradime Bolt. Bolt supports schedules-as-code via a paradime_schedules.yml file in your dbt™ project root—right alongside dbt_project.yml.

Figure 3 — Weekly Bolt schedule sequence: cron trigger → cleanup → dbt™ freshness check → notification.

paradime_schedules.yml
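
A sketch of the schedule file. The schedule and environment keys match the configuration notes below; the top-level layout, schedule name, and commands are illustrative, so check the Bolt schedules-as-code docs for the full set of supported keys (including notification settings):

```yaml
schedules:
  - name: weekly_sheet_cleanup                 # illustrative schedule name
    schedule: '0 2 * * 0'                      # Sundays at 02:00
    environment: production
    commands:
      - python scripts/clean_spreadsheet.py    # the cleanup script from the previous section
      - dbt source freshness
      - dbt build --select source:sheets+      # run models and tests downstream of the cleaned source
```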

Key configuration details

  • schedule: '0 2 * * 0' — Standard cron: minute 0, hour 2, any day of month, any month, Sunday (day 0). Validate at crontab.guru.

  • environment: production — The only supported environment for Bolt schedules.

  • Deployment: Merge paradime_schedules.yml to your default branch (main or master). Bolt auto-refreshes every 10 minutes, or trigger manually via Bolt → Parse Schedules.

  • Trigger types: Beyond cron, Bolt supports On Run Completion, On Merge, and Bolt API triggers.

Validate your schedule YAML locally

Use the Paradime CLI to catch formatting errors before merging, and to dry-run the schedule without executing any dbt™ models; see the Paradime CLI docs for the exact verification commands.
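
If the Paradime CLI isn't available where you're editing, a plain parse with PyYAML at least catches indentation and quoting mistakes before you merge; it does not validate Bolt-specific fields:

```bash
# Requires PyYAML (pip install pyyaml)
python -c "import yaml; yaml.safe_load(open('paradime_schedules.yml')); print('YAML parses cleanly')"
```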

Adding dbt™ tests for ongoing data quality

Once the cleaned data lands in your warehouse (via a source or seed), add dbt™ tests that enforce the same standards your script applies. This way, regressions get caught automatically on every run.

In your schema.yml:
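
A sketch of the tests, assuming the cleaned sheet is exposed as a source named sheets.customer_contacts; the source name, column names, and regex patterns are placeholders to adapt:

```yaml
version: 2

sources:
  - name: sheets
    tables:
      - name: customer_contacts
        columns:
          - name: email
            tests:
              - unique
              - not_null
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: '^[^@\s]+@[^@\s]+\.[^@\s]+$'
          - name: phone
            tests:
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: '^\+[1-9]\d{6,14}$'   # E.164 format
```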

Note: The dbt_expectations package provides regex-based validation out of the box. Add it to your packages.yml:
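
For example (the version range is illustrative; pin to the latest release listed on dbt Hub), then run dbt deps to install it:

```yaml
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
```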

Monitoring and Debugging

Once your weekly schedule is live, you need visibility into whether it's actually running, how long it takes, and what happens when it breaks.

Paradime Radar: Schedule monitoring

Radar's dbt™ Monitoring provides three dashboards relevant to this workflow:

Dashboard | What you see
Schedules Dashboard | Execution frequency, success/error rates, SLA compliance
Models Dashboard | Per-model execution time, materialization type, dependency graph
Real-time Alerting | Slack/email alerts when schedules fail or breach SLA

Bolt run logs: Three levels of depth

When a schedule run completes (or fails), Bolt provides three tiers of logs accessible from the Run History view:

Figure 4 — Bolt's three log tiers: summary for quick triage, console for troubleshooting, debug for deep dives.

  • Summary Logs: DinoAI-generated overview of what ran, what warned, and what to fix. Start here.

  • Console Logs: Full chronological output—equivalent to what you'd see in your terminal. Use for step-by-step debugging.

  • Debug Logs: System-level dbt™ internals and performance data. Use when console logs aren't enough.

Access logs by navigating to Bolt → [Your Schedule] → Run History → [Run ID] → Logs and Artifacts.

Artifacts for post-mortem analysis

Each Bolt run produces standard dbt™ artifacts:

Artifact | Purpose
manifest.json | Full project graph—models, tests, sources, exposures
run_results.json | Per-node execution status, timing, and row counts
catalog.json | Column-level metadata for all materialized objects
sources.json | Source freshness results

Download these from the run detail page and use them to compare performance across runs or feed them into external observability tools.
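
As a starting point, a few lines of Python pull per-node timings out of run_results.json so you can diff slow models across runs; the results list with unique_id, status, and execution_time fields is standard dbt artifact output:

```python
import json

# Load the artifact downloaded from the Bolt run detail page.
with open("run_results.json") as f:
    results = json.load(f)["results"]

# Print the ten slowest nodes for this run (execution_time can be null for skipped nodes).
slowest = sorted(results, key=lambda r: r.get("execution_time") or 0, reverse=True)[:10]
for node in slowest:
    print(f'{node["unique_id"]:<60} {node["status"]:<8} {(node.get("execution_time") or 0):.1f}s')
```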

OpenClaw monitoring

For the OpenClaw portion of your pipeline (AI anomaly detection), monitor:

  • Gateway logs: Check ~/.openclaw/logs/ for skill execution traces.

  • Skill errors: If the oc.chat() call fails, the cleanup script should catch the exception and log it—don't let a failed AI call block the entire cleanup.

  • Token usage: Track token consumption per run if you're using a metered LLM provider. Log the response metadata from the OpenClaw client.

Troubleshooting Common Issues

Authentication failures

Symptom | Cause | Fix
gspread.exceptions.SpreadsheetNotFound | Service account email not shared on the spreadsheet | Open the Google Sheet → Share → add the client_email from your service account JSON
oauth2client.client.HttpAccessTokenRefreshError | Expired or invalid service account key | Re-download the JSON key from Google Cloud Console and re-encode it
openclaw: No API key found | OPENCLAW_API_KEY not set or not in the correct precedence layer | Check echo $OPENCLAW_API_KEY or verify the ~/.openclaw/openclaw.json env block

Schedule not triggering

Symptom | Cause | Fix
Schedule shows as "Off" in Bolt UI | schedule: 'OFF' in YAML or invalid cron expression | Validate cron at crontab.guru — use days 0-6, not 1-7
YAML changes not reflected | paradime_schedules.yml not on default branch | Merge to main/master; Bolt auto-refreshes every 10 min, or manually parse via Bolt → Parse Schedules
Schedule runs but cleanup script fails | Missing env vars in production environment | Configure GOOGLE_CREDENTIALS_JSON and OPENCLAW_API_KEY in Bolt's Scheduler Environment settings

Data quality issues persist after cleanup

Symptom | Cause | Fix
dbt™ tests still failing on email column | New rows added between cleanup and dbt™ run | Add the cleanup script before dbt run in your Bolt commands list
Phone numbers still failing regex | International formats the simple PHONE_REGEX doesn't cover | Expand PHONE_REGEX to handle country-specific formats, or use the phonenumbers Python library for robust parsing
Duplicates reappearing weekly | Source system re-exports duplicates | Add a deduplication dbt™ model with QUALIFY ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) = 1 (see the model sketch after this table)
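
A minimal version of that deduplication model, assuming the cleaned sheet is registered as a dbt™ source named sheets.customer_contacts with an updated_at column (both names are placeholders); QUALIFY is supported on Snowflake and BigQuery:

```sql
-- models/staging/stg_contacts_deduped.sql
select *
from {{ source('sheets', 'customer_contacts') }}
qualify row_number() over (partition by email order by updated_at desc) = 1
```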

OpenClaw skill or SDK errors

Symptom | Cause | Fix
ModuleNotFoundError: No module named 'openclaw' | SDK not installed in the execution environment | Run pip install openclaw in the same virtualenv that Bolt uses
OpenClaw AI response is empty or unhelpful | Prompt too vague or model context window exceeded | Reduce the anomaly payload sent to oc.chat() — batch into smaller chunks
Gateway timeout on oc.chat() | OpenClaw gateway not running or unreachable | Verify with openclaw gateway status; restart with openclaw gateway restart

Wrapping Up

You now have a complete, repeatable pipeline that turns a messy Google Sheet into clean, validated data—every week, without manual intervention.

Here's what you built:

Figure 5 — End-to-end architecture: Bolt schedules the cleanup, the script cleans the data, dbt™ validates quality, and Radar monitors everything.

The four-phase loop, summarized

  1. Measure — Read the spreadsheet and capture baseline metrics (row count, anomaly count).

  2. Identify — Hash-based duplicate detection and regex-based format validation for phone, email, and address fields.

  3. Fix — Remove duplicates (keep first occurrence), normalize phone to E.164, lowercase email, expand address abbreviations.

  4. Validate — Re-run all checks on the cleaned dataset, log before/after savings, and route remaining edge cases to OpenClaw for AI-powered triage.

What to do next

  • Extend the script: Add columns (company name normalization, zip code validation) by following the same measure → identify → fix → validate pattern.

  • Add more dbt™ tests: Use dbt-expectations for column-level regex checks, uniqueness, and accepted-value constraints.

  • Scale with OpenClaw skills: Build a custom OpenClaw skill that wraps your cleanup logic and exposes it via Slack commands—let stakeholders trigger ad-hoc cleanups by typing a message.

  • Monitor costs: Use Paradime Radar to track warehouse credit consumption from your cleanup-related dbt™ runs and optimize where needed.

The hardest part of data quality isn't the first cleanup—it's making sure it sticks. By combining Paradime Bolt for scheduling, OpenClaw for intelligent anomaly detection, and dbt™ tests for automated validation, you've built a system that catches regressions before they reach your dashboards.

Interested in Learning More?
Try Out the Free 14-Day Trial

Stop Managing Pipelines. Start Shipping Them.

Join the teams that replaced manual dbt™ workflows with agentic AI. Free to start, no credit card required.


Copyright © 2026 Paradime Labs, Inc. Made with ❤️ in San Francisco ・ London

*dbt® and dbt Core® are federally registered trademarks of dbt Labs, Inc. in the United States and various jurisdictions around the world. Paradime is not a partner of dbt Labs. All rights therein are reserved to dbt Labs. Paradime is not a product or service of or endorsed by dbt Labs, Inc.
