Data provisioning pipeline · gpn-yp-site-worker

Overview

The YP site displays businesses that originate from the Google Places API, flow through the scraper worker for enrichment, and land in the gpn-user-d1 D1 database (the shared user-facing database).

Canonical path (R-4 / ADR-0001): gpn-user-d1 now has **exactly one writer**: gpn-data-publisher-worker. It reads gpn-data-d1 and projects the public serving rows via POST /publish?project=&dataset= (baseline + enrichment overlay). yp-site only reads gpn-user-d1. The producers' direct writes were removed: google-places provision-site-db and the interpreter site-projection now return HTTP 410 (Gone).

The npm scripts below are legacy local-dev / bootstrap helpers that write gpn-user-d1 directly. They predate the publisher and are kept for local setup and one-off schema work. Prefer the publisher for any production/provisioning delivery.

Automatic delivery. The publisher runs publishAll on a nightly cron (0 3 * * *, which is 03:00 UTC because Cloudflare cron triggers are evaluated in UTC). It auto-discovers every project (agent_places.project_name) and every dataset with a publication projection, and re-projects them. For immediate freshness after adding data, call the on-demand endpoint:

curl -sS -X POST "https://publisher.mondial-it.nl/publish-all" -H "x-admin-token: $PUBLISHER_TOKEN"
# -> { ok:true, projects:[...], datasets:[...], baselines:[...], enrichments:[...] }

Publishing is delta/idempotent: every row is written with INSERT … ON CONFLICT(id) DO UPDATE … WHERE <content differs>, so a re-run that changes nothing performs zero writes and leaves updated_at untouched (a real content change still refreshes the row).
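The delta behaviour can be tried locally, since D1 is built on SQLite. The following is a minimal sketch, not the publisher's actual statement: the businesses table, its columns, and the sample row are hypothetical, and only the INSERT … ON CONFLICT(id) DO UPDATE … WHERE shape comes from the description above.

```python
import sqlite3

# Hypothetical serving table; the real gpn-user-d1 schema is not shown in this document.
SCHEMA = """
CREATE TABLE businesses (
    id         TEXT PRIMARY KEY,
    name       TEXT NOT NULL,
    city       TEXT,
    updated_at TEXT NOT NULL
)
"""

# Upsert that only touches an existing row when its content differs.
# "IS NOT" is SQLite's NULL-safe inequality, so NULL vs NULL counts as unchanged.
DELTA_UPSERT = """
INSERT INTO businesses (id, name, city, updated_at)
VALUES (:id, :name, :city, :now)
ON CONFLICT(id) DO UPDATE SET
    name = excluded.name,
    city = excluded.city,
    updated_at = excluded.updated_at
WHERE businesses.name IS NOT excluded.name
   OR businesses.city IS NOT excluded.city
"""


def publish_row(conn, row, now):
    """Upsert one row and return how many rows were actually written (0 or 1)."""
    before = conn.total_changes
    conn.execute(DELTA_UPSERT, {**row, "now": now})
    return conn.total_changes - before


def read_stamp(conn, row_id):
    """Return the stored updated_at value for one row."""
    cur = conn.execute("SELECT updated_at FROM businesses WHERE id = ?", (row_id,))
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
row = {"id": "b1", "name": "Example Bakery", "city": "Utrecht"}  # made-up sample data

first = publish_row(conn, row, "2024-01-01T03:00:00Z")   # new row: one write
rerun = publish_row(conn, row, "2024-01-02T03:00:00Z")   # identical content: no write
stamp_after_rerun = read_stamp(conn, "b1")               # still the first timestamp
changed = publish_row(conn, {**row, "city": "Amersfoort"}, "2024-01-03T03:00:00Z")
stamp_after_change = read_stamp(conn, "b1")              # refreshed by the real change
```

The WHERE clause lists every content column but deliberately leaves out updated_at; comparing the timestamp as well would make every re-run look like a change.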

Event-driven delivery. After writing gpn-data-d1 (e.g. a Places search or import), gpn-google-places-worker enqueues a best-effort signal on the gpn-publish-requests queue. The publisher consumes it and coalesces the batch into a single delta re-projection, so the serving DB updates within seconds rather than at the next cron tick. The queue binding is the trust boundary (no admin token needed).

Three delivery paths, in order of immediacy:

Event-driven (queue) — automatic, seconds after data changes.
On-demand — POST /publish-all (admin-gated) for a manual full refresh.
Nightly cron (0 3 * * *) — safety net.

A missed or duplicated signal is harmless because re-projection is idempotent.
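To illustrate why duplicates are harmless, here is a small sketch of batch coalescing. It assumes each signal carries a project and a dataset and that coalescing is keyed on that pair; the PublishRequest shape, the handle_batch helper, and the sample names are hypothetical, and the real consumer may coalesce more aggressively (for example into one run for the whole batch).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublishRequest:
    """Hypothetical message shape; the real queue payload is not documented here."""
    project: str
    dataset: str


def coalesce(batch):
    """Collapse duplicate signals into unique targets, keeping first-seen order."""
    return list(dict.fromkeys(batch))


def handle_batch(batch, reproject):
    """Call reproject once per unique (project, dataset) and return those targets."""
    targets = coalesce(batch)
    for target in targets:
        reproject(target.project, target.dataset)
    return targets


# Three signals arrive, two of them identical; only two re-projections run.
calls = []
batch = [
    PublishRequest("demo-project", "places"),
    PublishRequest("demo-project", "places"),
    PublishRequest("demo-project", "reviews"),
]
targets = handle_batch(batch, lambda project, dataset: calls.append((project, dataset)))
```

Because each re-projection is a delta write, running it once for a duplicated signal or an extra time after a retry leaves the serving rows in the same state.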

Scripts (legacy / local)

db:provision — Create and seed tables

npm run db:provision

Runs scripts/provision-d1.sh, which:

Checks that wrangler is available on PATH
Executes schema.sql against the remote D1 database (creates all tables if they do not exist)
Executes seed.sql against the remote D1 database (inserts baseline data)

The target database name defaults to gpn-user-d1 but can be overridden with SITE_DB_NAME.

Use this script once when setting up a new environment or after a schema reset.
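The steps above can be sketched as follows. This is not the contents of scripts/provision-d1.sh, which are not reproduced in this document; it only assumes the script resolves the database name from SITE_DB_NAME and runs wrangler d1 execute with --remote and --file for each SQL file. The helper names and the relative file paths are illustrative.

```python
import os
import shutil

DEFAULT_DB = "gpn-user-d1"


def wrangler_available():
    """Step 1: check that wrangler can be found on PATH."""
    return shutil.which("wrangler") is not None


def provision_commands(env=None):
    """Steps 2 and 3: build the two wrangler invocations without running them."""
    env = os.environ if env is None else env
    # An unset or empty SITE_DB_NAME falls back to the default database name.
    db = env.get("SITE_DB_NAME") or DEFAULT_DB
    return [
        ["wrangler", "d1", "execute", db, "--remote", "--file=schema.sql"],
        ["wrangler", "d1", "execute", db, "--remote", "--file=seed.sql"],
    ]


default_cmds = provision_commands({})
override_cmds = provision_commands({"SITE_DB_NAME": "my-dev-db"})  # hypothetical name
```

Building the argument lists separately from executing them makes it easy to print the commands for review before anything touches the remote database.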

db:migrate — Run pending migrations

npm run db:migrate

Runs scripts/migrate-site-db.sh, which applies the migration SQL in scripts/site-db-migration.sql. This handles schema evolution — adding new columns or indexes — on the gpn-user-d1 database. (The historical makelaar-db/site-db databases were consolidated into the two function-split databases gpn-data-d1 and gpn-user-d1.)

Use this script when the schema has changed since your last provision.

db:backfill:listings — Compute listing cards

npm run db:backfill:listings

Runs scripts/backfill-business-listings.sh, which executes scripts/backfill-business-listings.sql. This script reads from the businesses and business_profiles tables and computes pre-formatted business_listings rows for the search page.

The business_listings table is denormalized and optimized for fast search display. Each row contains pre-computed labels (title, subtitle, snippet, location, status, category, rating), so the search page loader can return results without joining multiple tables.
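As a rough illustration of what "pre-computed labels" means, the sketch below builds one listing row from a business and its profile. The output keys follow the labels listed above; the input field names (name, city, tagline, description, and so on), the snippet length, and the fallback values are assumptions, since the actual backfill SQL is not shown here.

```python
def build_listing(business, profile=None, snippet_len=120):
    """Flatten one business plus its optional profile into a display-ready row."""
    profile = profile or {}
    # Normalise whitespace, then cut long descriptions down to a fixed-size snippet.
    description = " ".join((profile.get("description") or "").split())
    if len(description) > snippet_len:
        description = description[: snippet_len - 3].rstrip() + "..."
    rating = business.get("rating")
    return {
        "business_id": business["id"],
        "title": business["name"],
        "subtitle": profile.get("tagline") or "",
        "snippet": description,
        "location": business.get("city") or "",
        "status": business.get("status") or "unknown",
        "category": business.get("category") or "",
        "rating": f"{rating:.1f}" if rating is not None else "",
    }


# Made-up sample input.
listing = build_listing(
    {"id": "b1", "name": "Example Bakery", "city": "Utrecht", "category": "Bakery", "rating": 4.5},
    {"tagline": "Fresh bread daily", "description": "Family   bakery\nsince 1980."},
)
long_listing = build_listing({"id": "b2", "name": "Long Co"}, {"description": "x" * 200})
```

Doing this formatting once at backfill time is what lets the search loader read a single table and return rows as they are.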

Use this script after importing new business data or updating profiles.

db:provision:scraper — Import from scraper output

npm run db:provision:scraper

Runs scripts/provision-from-scraper.sh, which imports business data from the scraper worker output into the gpn-user-d1 tables. This was the primary pipeline for getting crawled and enriched business data into the public site before the publisher worker took over that role; it remains as a legacy local helper.

Data flow sequence

1. Google Places API
   -> gpn-google-places-worker resolves and normalizes
   -> hands domains to gpn-domain-scraper-worker

2. Scraper crawls websites
   -> extracts evidence, facts, structured data
   -> stores in gpn-data-d1 and R2

3. Provisioning (canonical: the publisher worker)
   -> gpn-data-publisher-worker reads gpn-data-d1
   -> POST /publish?project=&dataset= projects baseline + enrichment overlay
   -> writes businesses / profiles / listings / field_visibility into gpn-user-d1
   (legacy local: db:provision / db:provision:scraper / db:backfill:listings)

4. Site serves data
   -> Remix loader queries gpn-user-d1
   -> search page returns business_listings
   -> detail page returns businesses + profiles + contents

Local development

For local development, npm run dev uses Wrangler's local D1 simulator. The local database state lives in .wrangler/state/. To populate the local database, run db:provision with the --local flag (or modify the script to target the local database).

Validation

npm run mvp:check

Runs db:provision followed by smoke-check to verify the database is provisioned and the deployed worker responds correctly.