Overview
The YP site displays businesses that originate from the Google Places API, flow through the scraper worker for enrichment, and land in the gpn-user-d1 D1 database (the shared user-facing database).
Canonical path (R-4 / ADR-0001):
gpn-user-d1 now has **exactly one writer**: gpn-data-publisher-worker. It reads gpn-data-d1 and projects the public serving rows via POST /publish?project=&dataset= (baseline + enrichment overlay). yp-site only reads gpn-user-d1. The producer direct-writes were removed (google-places provision-site-db and the interpreter site-projection now return 410).
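One way to picture "baseline + enrichment overlay" is a field-by-field merge. The sketch below is an illustration only: the `Fields` type, the `overlay` function, and the rule that a non-null enrichment value wins over the baseline are all assumptions, not the publisher's documented behavior.

```typescript
// Hypothetical flat field map; the real projection schema is not shown here.
type Fields = Record<string, string | null>;

// Start from the baseline row and let non-null enrichment values win.
// Assumption: a null enrichment value means "no enrichment for this field".
function overlay(baseline: Fields, enrichment: Fields): Fields {
  const out: Fields = { ...baseline };
  for (const [key, value] of Object.entries(enrichment)) {
    if (value !== null) {
      out[key] = value;
    }
  }
  return out;
}
```

Under this assumed rule, a business whose enrichment has no phone number keeps the baseline phone, while an enriched description replaces the baseline one.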
The npm scripts below are legacy local-dev / bootstrap helpers that write gpn-user-d1 directly; they predate the publisher and are kept for local setup and one-off schema work. Prefer the publisher for any production/provisioning delivery.
Automatic delivery. The publisher runs publishAll on a nightly cron (0 3 * * *) — it auto-discovers every project (agent_places.project_name) and every dataset with a publication projection, and re-projects them. For immediate freshness after adding data, call the on-demand endpoint:
curl -sS -X POST "https://publisher.mondial-it.nl/publish-all" -H "x-admin-token: $PUBLISHER_TOKEN"
# -> { ok:true, projects:[...], datasets:[...], baselines:[...], enrichments:[...] }
Publishing is delta/idempotent: every row is written with INSERT … ON CONFLICT(id) DO UPDATE … WHERE <content differs>, so a re-run that changes nothing performs zero writes and leaves updated_at untouched (a real content change still refreshes the row).
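The delta behavior can be modeled in memory to make the guarantee concrete. This is a minimal sketch, not the publisher's code: the `Row` shape, `deltaUpsert`, and `publish` are hypothetical names, and a single `content` string stands in for all projected columns.

```typescript
// Hypothetical row shape; the real serving schema is not shown in this document.
interface Row {
  id: string;
  content: string;    // stands in for all projected columns
  updated_at: number; // timestamp of the last real content change
}

// Upsert one row. Returns true only when a write actually happened,
// mirroring INSERT ... ON CONFLICT(id) DO UPDATE ... WHERE <content differs>.
function deltaUpsert(
  table: Map<string, Row>,
  id: string,
  content: string,
  now: number,
): boolean {
  const existing = table.get(id);
  if (existing !== undefined && existing.content === content) {
    return false; // unchanged: no write, updated_at stays as it was
  }
  table.set(id, { id, content, updated_at: now });
  return true;
}

// Publish a batch of [id, content] pairs and report how many rows were written.
function publish(
  table: Map<string, Row>,
  rows: Array<[string, string]>,
  now: number,
): number {
  let writes = 0;
  for (const [id, content] of rows) {
    if (deltaUpsert(table, id, content, now)) {
      writes++;
    }
  }
  return writes;
}
```

Calling `publish` twice with the same rows returns a write count of 0 on the second call, which is the property that makes repeated runs safe.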
Event-driven delivery. After writing gpn-data-d1 (e.g. a Places search or import), gpn-google-places-worker enqueues a best-effort signal on the gpn-publish-requests queue. The publisher consumes it and coalesces the batch into a single delta re-projection — so the serving DB updates within seconds, not at the next cron tick. The queue binding is the trust boundary (no admin token needed). Three delivery paths, in order of immediacy:
- Event-driven (queue): automatic, seconds after data changes.
- On-demand: POST /publish-all (admin-gated) for a manual full refresh.
- Nightly cron (0 3 * * *): safety net.
A missed or duplicated signal is harmless because re-projection is idempotent.
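The coalescing of a queue batch described above could look roughly like the following. It is a sketch under assumptions: the `PublishSignal` payload shape and the `coalesce` helper are hypothetical, since the real message format on gpn-publish-requests is not documented here.

```typescript
// Hypothetical signal payload; the real queue message format is an assumption.
interface PublishSignal {
  project: string;
  dataset: string;
}

// Collapse a batch of signals into unique (project, dataset) targets,
// preserving first-seen order, so each target is re-projected once per batch.
function coalesce(batch: PublishSignal[]): PublishSignal[] {
  const seen = new Set<string>();
  const targets: PublishSignal[] = [];
  for (const { project, dataset } of batch) {
    // JSON-encode the pair so that ("a", "b-c") and ("a-b", "c") never collide.
    const key = JSON.stringify([project, dataset]);
    if (!seen.has(key)) {
      seen.add(key);
      targets.push({ project, dataset });
    }
  }
  return targets;
}
```

Because re-projection is idempotent, dropping duplicates here is purely an optimization: processing every signal individually would give the same end state, only with more work.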
Scripts (legacy / local)
db:provision — Create and seed tables
npm run db:provision
Runs scripts/provision-d1.sh, which:
- Checks that wrangler is available on PATH
- Executes schema.sql against the remote D1 database (creates all tables if they do not exist)
- Executes seed.sql against the remote D1 database (inserts baseline data)
The target database name defaults to gpn-user-d1 but can be overridden with SITE_DB_NAME.
Use this script once when setting up a new environment or after a schema reset.
db:migrate — Run pending migrations
npm run db:migrate
Runs scripts/migrate-site-db.sh, which applies the migration SQL in scripts/site-db-migration.sql. This handles schema evolution — adding new columns or indexes — on the gpn-user-d1 database. (The historical makelaar-db/site-db databases were consolidated into the two function-split databases gpn-data-d1 and gpn-user-d1.)
Use this script when the schema has changed since your last provision.
db:backfill:listings — Compute listing cards
npm run db:backfill:listings
Runs scripts/backfill-business-listings.sh, which executes scripts/backfill-business-listings.sql. This script reads from the businesses and business_profiles tables and computes pre-formatted business_listings rows for the search page.
The business_listings table is a denormalized view optimized for fast search display. Each row contains pre-computed labels (title, subtitle, snippet, location, status, category, rating) so the search page loader can return results without joining multiple tables.
Use this script after importing new business data or updating profiles.
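To illustrate what "pre-computed labels" means, here is a sketch of deriving one listing card from a business and its optional profile. The label names (title, subtitle, snippet, location, status, category, rating) come from the description above; the input shapes, the 80-character snippet limit, and the label wording are assumptions, and the real logic lives in scripts/backfill-business-listings.sql.

```typescript
// Hypothetical input shapes; the real businesses / business_profiles columns may differ.
interface Business {
  id: string;
  name: string;
  city: string;
  category: string;
  rating: number | null;
  open: boolean;
}

interface Profile {
  tagline: string;
  description: string;
}

// One denormalized card, ready for display without further joins.
interface Listing {
  id: string;
  title: string;
  subtitle: string;
  snippet: string;
  location: string;
  status: string;
  category: string;
  rating: string;
}

const SNIPPET_MAX = 80; // assumed display limit

function toListing(b: Business, p?: Profile): Listing {
  const description = p?.description ?? "";
  return {
    id: b.id,
    title: b.name,
    subtitle: p?.tagline ?? b.category, // fall back to the category when no profile exists
    snippet:
      description.length > SNIPPET_MAX
        ? description.slice(0, SNIPPET_MAX - 3) + "..."
        : description,
    location: b.city,
    status: b.open ? "Open" : "Closed",
    category: b.category,
    rating: b.rating === null ? "No rating" : b.rating.toFixed(1),
  };
}
```

Doing this formatting once at backfill time, rather than on every request, is what lets the search page loader read a single table.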
provision-from-scraper — Import from scraper output
npm run db:provision:scraper
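Runs scripts/provision-from-scraper.sh, which imports business data from the scraper worker output into the gpn-user-d1 tables. This was the primary pipeline for getting crawled and enriched business data into the public site before the publisher existed; it is now a legacy helper for local setup, and production delivery goes through gpn-data-publisher-worker.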
Data flow sequence
1. Google Places API
-> gpn-google-places-worker resolves and normalizes
-> hands domains to gpn-domain-scraper-worker
2. Scraper crawls websites
-> extracts evidence, facts, structured data
-> stores in gpn-data-d1 and R2
3. Provisioning (canonical: the publisher worker)
-> gpn-data-publisher-worker reads gpn-data-d1
-> POST /publish?project=&dataset= projects baseline + enrichment overlay
-> writes businesses / profiles / listings / field_visibility into gpn-user-d1
(legacy local: db:provision / db:provision:scraper / db:backfill:listings)
4. Site serves data
-> Remix loader queries gpn-user-d1
-> search page returns business_listings
-> detail page returns businesses + profiles + contents
Local development
For local development, npm run dev uses Wrangler's local D1 simulator. The local database state lives in .wrangler/state/. To populate the local database, run db:provision with the --local flag (or modify the script).
Validation
npm run mvp:check
Runs db:provision followed by smoke-check to verify the database is provisioned and the deployed worker responds correctly.