Can a local-first DuckDB pipeline handle large corporate datasets?

Absolutely. DuckDB is a highly optimized, columnar analytical engine. It handles datasets up to 100GB (hundreds of millions of transaction rows) directly on standard developer laptops and lightweight virtual private servers (VPS) in seconds.

How do we handle real-time streaming data with this batch architecture?

Real-time queries often run into API rate limits and slow down performance. If you need near-instant updates, use Webhook triggers to save incoming events directly to your local file storage (your Bronze layer), and schedule your dbt transformation models to process these new files in small, frequent intervals.

Does using Evidence.dev mean my raw database details are visible on the web?

No. Evidence pre-renders your pages into static HTML and Javascript. Your SQL queries run during the build stage on your secure servers. The only thing sent to the browser is the compiled static files, keeping your credentials, databases, and structural assets fully isolated from public access.

Ditch the SaaS Dashboard: Architect Your Own Custom Command Center

There is a quiet, expensive lie at the heart of the modern tech stack: to understand your own business, you must pay someone else to hold your data. Founders, creators, and engineers routinely lock themselves into pre-built SaaS analytics platforms, paying heavy premiums to view basic charts of their own transactional activities.

When you outgrow default dashboards, the standard recommendation is to purchase a complex, enterprise-grade cloud data warehouse. For a 100-person startup with five core SaaS connectors, hosting a stack centered on Snowflake, BigQuery, Fivetran, dbt Cloud, and Looker easily baselines between $2,000 and $7,400+ per month ($24,000 to $88,000+ annually) in licensing alone, according to Definite.app's infrastructure analysis. That is a significant capital drain before you even pay a single platform engineer’s salary.

This tutorial shows you how to break free of this tax. We will build a private, ultra-fast, local-first analytical database and custom command center using lightweight open-source software and Git-driven BI. By the end of this guide, you will have a production-grade, self-updating data hub running on serverless infrastructure for pennies.

What You'll Build

We will construct a unified, automated pipeline that extracts live transaction records from Stripe, writes them into a secure, local DuckDB analytical lakehouse, processes them into clean data marts using dbt Core, and serves an interactive, code-controlled dashboard via Evidence.dev. Finally, we will orchestrate the entire process as an automated cron loop using Windmill.dev.

Prerequisites

Python 3.10 or higher.
Node.js (v18+) and npm.
A Stripe account with API access.
Comfort using a command-line interface.

1. The Invisible Toll of the Modern Data Stack

Most young businesses default to a multi-SaaS workflow where customer data lives in HubSpot, financial data resides in Stripe, and product metrics are trapped in Mixpanel. To view combined analytics, engineering teams are pushed toward centralized cloud structures that carry exorbitant BI costs.

The cost isn't just financial; it is operational. Modern data professionals spend an average of 37.75% of their time manually wrangling and cleansing fragmented API payloads. For a lean engineering team of four, this overhead represents $94,464 per year in wasted cycles—money spent writing brittle boilerplate instead of driving revenue.

A modular, self-hosted, code-first stack avoids these traps. By utilizing local-first columnar SQL engines like DuckDB, declarative ingestion tools, and static Markdown-based visualization, you bypass expensive cloud middleware. This allows teams to cure metric debt and establish a single source of truth without subscription bloat.

2. Designing a Near-Real-Time Local-First Lakehouse

Before writing code, we must address an architectural anti-pattern: querying production API endpoints on every page load. Live polling on UI render introduces major issues:

Your dashboard loading speed is at the mercy of third-party latency.
You will quickly trigger rate limit locks, such as Stripe’s concurrent limits.
Users suffer through sluggish, frustrating interfaces.

Instead, we adopt a "Near-Real-Time Lakehouse" pattern. In this architecture, raw webhook payloads and API extractions are decoupled from visual display. We stream incoming mutations into a raw storage directory (the Bronze layer), process them into a structured database (the Silver layer), and build aggregated metric views (the Gold layer) on a 10-to-30 minute schedule.

Ditch the SaaS Dashboard: Architect Your Own Custom Command Center contextual illustration — Photo by Nemuel Sereti on Pexels

This design splits your workflow into two clean paths. Historical datasets are extracted in bulk, while daily changes are handled by lightweight delta runs. Our dashboard queries this pre-computed local engine directly. This guarantees sub-second rendering, offline capability, and moves us away from the dynamic dashboard era that constantly pings remote databases.

3. Ingesting SaaS Data via dlt

APIs enforce strict protections to prevent being overwhelmed. For instance, HubSpot enforces burst restrictions of 100 to 200 requests per 10 seconds, as detailed in HubSpot's standards. Stripe limits production calls to 100 requests per second (rps), as outlined in their API guidelines.

Naive custom scripts that poll raw JSON often hit 429 Too Many Requests errors and drop data. We will instead use dlt (data load tool) to handle backoffs and schema evolution automatically.

Creating the Ingestion Module

mkdir -p custom_command_center/dlt_pipeline/.dlt
mkdir -p custom_command_center/dbt_project
mkdir -p custom_command_center/data
cd custom_command_center

Configure your dlt_pipeline/.dlt/secrets.toml with your Stripe API key:

[sources.stripe_source]
api_key = "sk_test_..."

Write the pipeline script at dlt_pipeline/load_stripe.py:

import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import SingleEntityPaginator

@dlt.source
def stripe_source(api_key=dlt.secrets.value):
    client = RESTClient(
        base_url="https://api.stripe.com/v1/",
        auth={"type": "bearer", "token": api_key}
    )
    @dlt.resource(write_disposition="append")
    def charges():
        paginator = SingleEntityPaginator(next_value_path="next_page")
        for page in client.paginate("charges", paginator=paginator):
            yield page

if __name__ == "__main__":
    pipeline = dlt.pipeline(
        pipeline_name="stripe_analytics",
        destination="duckdb",
        credentials="../data/analytics.duckdb",
        dataset_name="raw_stripe"
    )
    load_info = pipeline.run(stripe_source())
    print(load_info)

4. Transforming Data with dbt Core

We use dbt Core to transform records directly within DuckDB. This provides lightning-fast columnar processing without warehouse costs.

Step 1: Configure dbt

Set up dbt_project/profiles.yml to point to your DuckDB file:

stripe_analytics:
  outputs:
    dev:
      type: duckdb
      path: '../data/analytics.duckdb'
      threads: 4
  target: dev

Step 2: Write Transformation Model

Create dbt_project/models/marts/fct_mrr.sql to clean your Stripe data:

with source_charges as (select * from {{ source('raw_stripe', 'charges') }}),
processed_transactions as (
    select
        id as charge_id,
        amount / 100.0 as charge_amount_usd,
        epoch_to_timestamp(created) as charge_timestamp,
        status, refunded
    from source_charges
)
select
    date_trunc('month', charge_timestamp) as charge_month,
    sum(charge_amount_usd) as total_mrr,
    count(distinct customer_id) as paying_customers
from processed_transactions
where status = 'succeeded' and refunded = false
group by 1
order by 1 desc

5. Visualizing with Evidence.dev

Traditional tools create a "Version Control Void." We solve this with Evidence.dev, an open-source "BI-as-Code" tool that treats dashboards as text. It uses WebAssembly to run DuckDB in the browser, ensuring sub-50ms load times.

Dashboard Setup

npx deinit evidence_dashboard
cd evidence_dashboard
npm install @evidence-dev/duckdb

In evidence_dashboard/pages/index.md, define your visualization:

```sql mrr_by_month
select charge_month, total_mrr, paying_customers from stripe_db.fct_mrr
```
<BarChart data={mrr_by_month} x=charge_month y=total_mrr />

6. Orchestrating with Windmill.dev

Automate your stack with run.sh:

#!/bin/bash
set -e
python dlt_pipeline/load_stripe.py
cd dbt_project && dbt run --profiles-dir .
cd ../evidence_dashboard && npm run build

Schedule this script in Windmill.dev to keep metrics fresh. Your final build produces static assets that can be hosted anywhere, ensuring your analytical database remains isolated and secure.

Common Pitfalls

Ignoring Schema Drift: Always use tools like dlt that handle API changes gracefully.
Disk Space: Monitor your local database size during high-volume periods.
Large Datasets: For data exceeding 100GB, consider integrating MotherDuck.

Cover photo by Brett Sayles on Pexels.