Bring up the enrichment pipeline (live) · gpn-openai-interpreter-worker

What this brings up

The interpreter enrichment pipeline is built and deployed across four workers. This runbook lights it up end to end on real data:

google-places (domains)            POST /api/dispatch-scrape
  -> domain-scraper (evidence)     scrape-tasks queue -> broker_contents
    -> openai-interpreter (facts   /admin/api/migration/scraper-import-d1
       + display/filter metadata)  enrichment workflow (OpenAI via AI Gateway)
      -> yp-site (renders)         field_visibility + filter_signals_json

Verified live 2026-06-25 (R-10). Gotchas the first real run surfaced —
read before re-running:
- ADMIN_AUTH_TOKENS_JSON shape: each token object must include
principalId and a non-empty scopes array, e.g.
{"realm":"gpn-interpreter","tokens":[{"token":"…","principalId":"operator","scopes":["admin:read","maintenance:write"]}]}.
Set it via stdin pipe, not the interactive paste:
printf '%s' '<json>' | wrangler secret put ADMIN_AUTH_TOKENS_JSON.
- A default dataset must exist (scraper-import-d1 resolves it via the
dashboard payload, which prefers slug default) and an **active
guidance_bundles row** must exist (the enrichment_runs.guidance_bundle_id
FK requires one).
- Brokers are PENDING until a full crawl finalizes, so import with
{"acceptStatuses":["PENDING","SUCCESS"]} to ingest pre-scraped evidence.
- Manual per-entity enrichment:
wrangler workflows trigger gpn-openai-interpreter-business-enrichment '{"scheduleId":"manual","datasetId":"dataset-default","rootEntityId":"<entityId>","jobType":"stale_business_enrichment","queuedAt":"<iso>","cronExpression":"manual"}'.
- The enrichment workflow now loads the entity's real R2 evidence; runs land
in waiting_review (human-accept gate). The discovery cron (0 */6 * * *)
does not yet fan out per business (R-11).
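
The manual trigger in the list above needs a fresh ISO timestamp for queuedAt on every run. A minimal sketch of a payload builder, assuming a POSIX `date` and entity IDs that need no JSON escaping; the function name `enrichment_trigger_payload` is hypothetical and not part of the repo:

```shell
# Hypothetical helper: prints the manual-trigger payload for one entity.
# $1 = entity ID, $2 = dataset ID (defaults to dataset-default).
# Assumes both values contain no characters that need JSON escaping.
enrichment_trigger_payload() (
  entity_id=$1
  dataset_id=${2:-dataset-default}
  queued_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)   # current time, UTC, ISO 8601
  printf '{"scheduleId":"manual","datasetId":"%s","rootEntityId":"%s","jobType":"stale_business_enrichment","queuedAt":"%s","cronExpression":"manual"}\n' \
    "$dataset_id" "$entity_id" "$queued_at"
)

# Usage (starts a real workflow run, so left commented out):
# wrangler workflows trigger gpn-openai-interpreter-business-enrichment \
#   "$(enrichment_trigger_payload '<entityId>')"
```

The body runs in a subshell, so the helper leaves no variables behind in the operator's session.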

All code is committed and deployed; the steps below are operator actions (secrets and triggers) that cannot live in the repo. Plan of record: .project/plans/INTERPRETER-ENRICHMENT-PLAN.md.

Prerequisites (one-time)

Set these secrets/vars. Run each from the worker's app directory.

# 1. google-places admin token (gates /api/dispatch-scrape and provisioning)
cd apps/gpn-google-places-worker
wrangler secret put ADMIN_TOKEN            # choose a strong value; you send it as x-admin-token

# 2. interpreter admin auth (gates /admin/api/*). JSON of tokens + scopes:
cd ../gpn-openai-interpreter-worker
wrangler secret put ADMIN_AUTH_TOKENS_JSON
# value, single line, e.g.:
# {"realm":"gpn-openai-interpreter-admin","tokens":[{"token":"<bearer>","principalId":"ops","scopes":["admin:read","guidance:write","evaluation:run","review:write","maintenance:write","run:write"]}]}

# 3. interpreter OpenAI access — CENTRAL Cloudflare AI Gateway (authenticated + BYOK).
# The gateway stores the OpenAI key; the worker authenticates to the gateway with a
# gateway token and never sends the OpenAI key itself.
#
#  a. Store the OpenAI key in the gateway (BYOK): AI Gateway -> <gateway> ->
#     Provider keys -> add your OpenAI key. One central key serves every caller.
#
#  b. Create the gateway token — this becomes CF_AIG_TOKEN. In the Cloudflare AI
#     Gateway screen click "Create token" and fill in the permissions as:
#        Account  ->  AI Gateway  ->  Run         (the ONLY required permission)
#     Account Resources: Include -> your account.
#     Notes: AI Gateway Read/Run/Edit cannot be scoped to a single gateway, and
#     the token is shown only once. Workers AI Read/Edit are NOT needed here.
#
#  c. Put the token on the interpreter worker (worker-specific secret):
wrangler secret put CF_AIG_TOKEN            # paste the gateway token at the prompt
#
# OPENAI_BASE_URL is already set in wrangler.jsonc to the gateway OpenAI endpoint
# (.../v1/<account_id>/<gateway_id>/openai); the worker sends CF_AIG_TOKEN as the
# `cf-aig-authorization` header. Fallback: the OPENAI_API_KEY Secrets Store binding
# still exists — blank OPENAI_BASE_URL to call api.openai.com directly with it.
#   "OPENAI_MODEL": "gpt-4.1-mini"   (per-stage overridable in guidance)
npm run deploy
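
For prerequisite 2, the token JSON can be generated rather than typed by hand, then piped into wrangler as the gotchas recommend. A minimal sketch, assuming `/dev/urandom`, `head -c`, and `od` are available; the helper names are hypothetical, and the realm and scopes are copied from the example value above:

```shell
# Hypothetical helpers; names are illustrative, not part of the repo.

# Prints 64 hex characters (32 random bytes) with no trailing newline.
random_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Prints a single-line ADMIN_AUTH_TOKENS_JSON value for one principal.
# $1 = bearer token, $2 = principalId (defaults to ops).
# Assumes both values contain no characters that need JSON escaping.
admin_tokens_json() (
  token=$1
  principal=${2:-ops}
  printf '{"realm":"gpn-openai-interpreter-admin","tokens":[{"token":"%s","principalId":"%s","scopes":["admin:read","guidance:write","evaluation:run","review:write","maintenance:write","run:write"]}]}' \
    "$token" "$principal"
)

# Usage (writes a real secret, so left commented out):
# INTERP_TOKEN=$(random_token)
# admin_tokens_json "$INTERP_TOKEN" | wrangler secret put ADMIN_AUTH_TOKENS_JSON
```

Keep the generated token in a password manager; the later steps send it as the bearer value.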

Interpreter route is live. The interpreter is reachable at
interpreter.mondial-it.nl (custom domain). /health is open; /admin/api/*
is bearer-gated by ADMIN_AUTH_TOKENS_JSON (401 until that secret is set). The
google-places and yp-site workers are also reachable
(google-places.mondial-it.nl, gpn-yp-site-worker.mondial-it.workers.dev).
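
A quick way to confirm the bearer gate is configured is to look at the status code of one admin call. The sketch below only interprets the two outcomes this runbook describes (success, and 401 while the secret is missing or the token is wrong); the helper name `admin_status_hint` is hypothetical:

```shell
# Hypothetical helper: maps the HTTP status of an /admin/api/* call to a hint.
admin_status_hint() {
  case $1 in
    200) echo "ok: bearer token accepted" ;;
    401) echo "unauthorized: ADMIN_AUTH_TOKENS_JSON is unset or the token does not match" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Usage (calls the live worker, so left commented out):
# code=$(curl -sS -o /dev/null -w '%{http_code}' \
#   "https://interpreter.mondial-it.nl/admin/api/display-policy" \
#   -H "authorization: Bearer $INTERP_TOKEN")
# admin_status_hint "$code"
```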

Step 1 — Seed discovered businesses (google-places)

Run a Places search so agent_places is populated in gpn-data-d1, either from the workspace UI (https://google-places.mondial-it.nl/) or via the API:

curl -sS -X POST https://google-places.mondial-it.nl/api/places/search \
  -H "x-admin-token: $ADMIN_TOKEN" -H 'content-type: application/json' \
  -d '{"project_name":"makelaars-groningen","queries":["makelaar Groningen"]}'

A Places search consumes Google Places API quota. Keep the query list small
for a first run.

Step 2 — Hand domains to the scraper

Enqueues one scrape-tasks message per resolved domain and seeds brokers.

curl -sS -X POST https://google-places.mondial-it.nl/api/dispatch-scrape \
  -H "x-admin-token: $ADMIN_TOKEN" -H 'content-type: application/json' \
  -d '{"project_name":"makelaars-groningen","limit":10}'
# -> { ok:true, dispatched:N, skipped:M, upload_id:"places-..." }

The scraper consumes the queue (Cloudflare Browser) and writes broker_contents (+ broker_info/broker_listings when extraction succeeds). Verify:

cd apps/gpn-domain-scraper-worker
wrangler d1 execute gpn-data-d1 --remote \
  --command "SELECT status, COUNT(*) FROM brokers GROUP BY status"
wrangler d1 execute gpn-data-d1 --remote \
  --command "SELECT COUNT(*) FROM broker_contents"

Step 3 — Ingest scraper evidence into the interpreter

Reads the broker_* tables straight from gpn-data-d1 (no manifest needed) and creates source snapshots + candidate entities. acceptStatuses lets you import crawled-but-unfinalized brokers.

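INTERP="https://interpreter.mondial-it.nl"   # custom domain; or https://gpn-openai-interpreter-worker.<acct>.workers.dev if workers_dev is enabled
INTERP_TOKEN="<bearer>"                      # a token from ADMIN_AUTH_TOKENS_JSON that carries the maintenance:write scope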
curl -sS -X POST "$INTERP/admin/api/migration/scraper-import-d1" \
  -H "authorization: Bearer $INTERP_TOKEN" -H 'content-type: application/json' \
  -d '{"limit":50,"requireContents":true,"acceptStatuses":["SUCCESS","PENDING"]}'
# -> { totalProcessed, snapshotsCreated, entitiesCreated, skipped, errors }
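
The accepted statuses are likely to change between runs, for example PENDING plus SUCCESS on a first run and SUCCESS only once crawls finalize. A minimal sketch of a request-body builder, assuming status names need no JSON escaping; `import_payload` is a hypothetical name:

```shell
# Hypothetical helper: prints the scraper-import-d1 request body.
# $1 = limit; every remaining argument is a broker status to accept.
import_payload() (
  limit=$1
  shift
  statuses=
  for s in "$@"; do
    # Append "STATUS", adding a comma before every entry except the first.
    statuses="${statuses:+$statuses,}\"$s\""
  done
  printf '{"limit":%s,"requireContents":true,"acceptStatuses":[%s]}\n' "$limit" "$statuses"
)

# Usage (calls the live worker, so left commented out):
# curl -sS -X POST "$INTERP/admin/api/migration/scraper-import-d1" \
#   -H "authorization: Bearer $INTERP_TOKEN" -H 'content-type: application/json' \
#   -d "$(import_payload 50 SUCCESS PENDING)"
```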

Step 4 — Run enrichment (OpenAI interpretation)

The BusinessEnrichmentWorkflow runs from the scheduled-discovery path: the cron scheduled handler enqueues work for enabled rows in processing_schedules, and the queue handler starts a workflow per business. Until the discovery cron fans out per business (R-11, see the gotchas at the top), trigger runs per entity with the manual wrangler workflows trigger command listed there. With CF_AIG_TOKEN set, the model cascade calls OpenAI through the central AI Gateway (BYOK) and validates output against the pilot schema. Without it (or the fallback: OPENAI_API_KEY plus a blank OPENAI_BASE_URL), the deterministic heuristic provider runs. Each run also:

- attaches displayMetadata (policy fields + derived filterSignals) to the canonical projection, and
- writes a publication projection ({domain, filterSignals}) into gpn-data-d1 (projection_versions, kind publication).

R-4 / ADR-0001: the interpreter no longer writes gpn-user-d1 directly.
Per-business filter_signals_json reaches the site only when
gpn-data-publisher-worker reads that publication projection and projects it
(see Step 5). The old workflow direct-write was removed.

Confirm a run landed:

curl -sS "$INTERP/admin/api/runs/<runId>" -H "authorization: Bearer $INTERP_TOKEN"
# or inspect projection_versions / enrichment_runs in gpn-data-d1
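
To inspect the tables directly, a row count before and after a run is the least schema-dependent check. The sketch below only builds the query text and assumes nothing about column names; `count_sql` is a hypothetical helper:

```shell
# Hypothetical helper: prints a row-count query for one table.
# Rejects anything that is not a plain identifier so the SQL stays well-formed.
count_sql() (
  case $1 in
    ''|*[!A-Za-z0-9_]*)
      echo "invalid table name: $1" >&2
      exit 1
      ;;
  esac
  printf 'SELECT COUNT(*) AS n FROM %s' "$1"
)

# Usage (queries the remote database, so left commented out):
# wrangler d1 execute gpn-data-d1 --remote --command "$(count_sql enrichment_runs)"
# wrangler d1 execute gpn-data-d1 --remote --command "$(count_sql projection_versions)"
```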

Step 5 — Configure & deliver the display/filter policy

The policy is the dashboard-configurable contract for what the site shows and how it filters (using yp-site shadcn filter templates). View / edit it:

# current policy (stored or default)
curl -sS "$INTERP/admin/api/display-policy" -H "authorization: Bearer $INTERP_TOKEN"

# override it (operator edit)
curl -sS -X PUT "$INTERP/admin/api/display-policy" \
  -H "authorization: Bearer $INTERP_TOKEN" -H 'content-type: application/json' \
  -d '{"policy":{"version":"v2","fields":[
        {"fieldKey":"displayName","label":"Name","surfaces":["list","detail"],"displayComponent":"field.display.name","filterTemplate":"filter.template.text-search","visible":true,"sortOrder":10},
        {"fieldKey":"rating","label":"Rating","surfaces":["list","filter"],"displayComponent":"field.display.rating","filterTemplate":"filter.template.numeric-range","visible":true,"sortOrder":60}
      ]}}'

Deliver to the public site via the publisher (the sole gpn-user-d1 writer). One call projects both the baseline rows and the enrichment overlay — field_visibility from the stored policy and business_listings.filter_signals_json from the publication projections:

PUBLISHER="https://publisher.mondial-it.nl"
curl -sS -X POST "$PUBLISHER/publish?project=<projectName>&dataset=<datasetId>&category=*" \
  -H "x-admin-token: $PUBLISHER_TOKEN"
# -> { ok:true, baseline:{...}, enrichment:{ fieldVisibilityRows:N, filterSignalBusinesses:M, policyFound:true } }
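
A publish call can return ok:true while policyFound is false, in which case no field_visibility rows come from a stored policy. A minimal sketch of a check on the response body, assuming the response is JSON with the field names shown in the comment above; `publish_succeeded` is a hypothetical helper, and a real JSON parser is the better choice for anything stricter:

```shell
# Hypothetical helper: reads a /publish response body on stdin and succeeds
# only if it contains both "ok": true and "policyFound": true.
# This is a text match, not a JSON parse.
publish_succeeded() (
  body=$(cat)
  printf '%s' "$body" | grep -Eq '"ok"[[:space:]]*:[[:space:]]*true' &&
    printf '%s' "$body" | grep -Eq '"policyFound"[[:space:]]*:[[:space:]]*true'
)

# Usage (calls the live publisher, so left commented out):
# curl -sS -X POST "$PUBLISHER/publish?project=<projectName>&dataset=<datasetId>&category=*" \
#   -H "x-admin-token: $PUBLISHER_TOKEN" | publish_succeeded \
#   && echo "published with policy" || echo "check the response"
```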

The interpreter's old POST /admin/api/project-to-site is retired (returns
410). The interpreter still authors the policy (PUT /admin/api/display-policy
above); the publisher delivers it.

Step 6 — Verify on the site

curl -s "https://gpn-yp-site-worker.mondial-it.workers.dev/?q=" | grep -oE 'Business status|Rating and reviews|Category and type'

The directory sidebar renders only the filter groups the policy marks visible on the filter surface; per-business filter_signals_json is available on each listing row for downstream filtering. Removing the field_visibility rows restores the default "show all filters" behavior.
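
To compare the rendered sidebar against the policy, it helps to list which of the three filter-group labels appear in the page. A minimal sketch, assuming the labels appear verbatim in the HTML as in the grep above; `visible_filter_groups` is a hypothetical helper:

```shell
# Hypothetical helper: reads HTML on stdin and prints each known filter-group
# label found in it, one per line, in the fixed order listed below.
visible_filter_groups() (
  html=$(cat)
  for label in 'Business status' 'Rating and reviews' 'Category and type'; do
    case $html in
      *"$label"*) printf '%s\n' "$label" ;;
    esac
  done
)

# Usage (fetches the live site, so left commented out):
# curl -s "https://gpn-yp-site-worker.mondial-it.workers.dev/?q=" | visible_filter_groups
```

Unlike `grep -o`, this prints each label at most once, which makes the output easy to diff between publishes.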

Cleanup

If you enabled workers_dev on the interpreter for the admin calls, set it back to false and redeploy.

Treat tokens as secrets; never paste them into committed files or logs.

Endpoint reference

| Worker | Endpoint | Auth | Purpose |
| --- | --- | --- | --- |
| google-places | `POST /api/places/search` | `x-admin-token` | seed `agent_places` |
| google-places | `POST /api/dispatch-scrape` | `x-admin-token` | enqueue domains to scraper |
| interpreter | `POST /admin/api/migration/scraper-import-d1` | Bearer (`maintenance:write`) | ingest broker_* evidence |
| interpreter | `GET/PUT /admin/api/display-policy` | Bearer (`admin:read` / `maintenance:write`) | view/edit display+filter policy |
| interpreter | `GET /admin/api/runs/{id}` | Bearer (`admin:read`) | inspect an enrichment run |
| publisher | `POST /publish?project=&dataset=&category=` | `x-admin-token` | project baseline + enrichment into `gpn-user-d1` (sole writer) |
| yp-site | `GET /` | public | renders filters from `field_visibility` |