Explains submit verbs, durable ops, workers, runners, and how JS fits into the execution model.
The scraper runtime is intentionally split into two JavaScript environments that do different jobs. Submit verbs create initial durable work. Op scripts execute durable work later. New contributors often blur those together at first, and that confusion makes the code harder to reason about. This page is the short version of the real execution model.
At the highest level, the CLI submits workflows into the engine DB and the worker process executes them later. The worker uses runners such as js and http/fetch, reads site definitions from the registry, opens the right site DB, and writes back results plus any child ops emitted during execution.
Submission-time JS lives under sites/<site>/verbs/. These files expose top-level functions annotated with __verb__. The submit-verb host scans those files, builds Glazed/Cobra commands, and runs the selected function exactly once when the operator invokes a command such as scraper --sites-manifest-dir ./sites site js-demo run seed.
CLI flags are defined in the __verb__ metadata, not in Go code. For example, sites/hackernews/verbs/seed.js declares --base-url and --max-pages as fields, and the host wires them into Cobra automatically. The parsed values are available in the verb function as ctx.values.
The important constraint is that a submit verb is not a worker. It does not crawl pages for minutes and it does not keep an in-process scheduler alive by default. Its job is to describe or emit the initial durable work graph.
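A submit verb file can be pictured roughly as follows. This is a minimal sketch that runs under plain Node.js, not the real host: the `__verb__` stub, the metadata schema (`short`, `fields`, `type`, `default`), the keys used on `ctx.values`, and the shape of the returned work description are all assumptions made for illustration. Only the `__verb__` annotation, the `--base-url` and `--max-pages` flags, and `ctx.values` come from the text above.

```javascript
// Stand-in for the host-provided annotation helper. The real submit-verb host
// supplies __verb__; this stub only records metadata so the sketch can run alone.
const registeredVerbs = {};
function __verb__(name, meta) {
  registeredVerbs[name] = meta;
}

// Hypothetical metadata shape: the field names mirror the --base-url and
// --max-pages flags, but the exact schema is an assumption.
__verb__("seed", {
  short: "Seed the initial workflow",
  fields: {
    "base-url": { type: "string", default: "https://example.com" },
    "max-pages": { type: "int", default: 1 },
  },
});

// The verb runs once per CLI invocation. It only describes the first op;
// it does not fetch or parse anything itself.
function seed(ctx) {
  const baseUrl = ctx.values["base-url"];
  const maxPages = ctx.values["max-pages"];
  return {
    workflow: "demo-seed", // hypothetical workflow name
    ops: [
      {
        kind: "js",
        metadata: { script: "seed.js" }, // hypothetical script name
        input: { baseUrl, maxPages },
      },
    ],
  };
}
```

The point of the sketch is the division of labor: the verb turns parsed flags into a description of durable work and then returns, leaving all crawling to the worker.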
The submit-verb host is implemented in:
- pkg/sites/submitverbs/host.go
- pkg/sites/submitverbs/runtime.go

Execution-time JS lives under sites/<site>/scripts/. These files run as durable ops through the js runner. Each op is persisted in the engine DB and references the script to run through op metadata, usually metadata.script.
Scripts can be synchronous or async. The runtime supports async function exports and will await the returned Promise before persisting results.
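The reason both styles can share one code path is that `await` accepts plain values as well as Promises. The sketch below models that in plain Node.js; `runExport`, `syncOp`, and `asyncOp` are hypothetical names, and the real runtime is implemented in Go, so this only illustrates the behavior, not the implementation.

```javascript
// Hypothetical host-side helper: run an exported script function and wait for
// its result, whether the function is synchronous or async.
async function runExport(fn, ctx) {
  // Awaiting a non-Promise value yields that value unchanged, so both
  // styles go through the same code path.
  const result = await fn(ctx);
  return result;
}

// A synchronous export.
function syncOp(ctx) {
  return { count: ctx.input.items.length };
}

// An async export that resolves a little later.
async function asyncOp(ctx) {
  await new Promise((resolve) => setTimeout(resolve, 5));
  return { count: ctx.input.items.length * 2 };
}
```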
The execution context is intentionally narrow. A script can:
- read its input from ctx.input
- inspect the workflow via ctx.workflow and op metadata via ctx.op
- read dependency results with ctx.dep(opID)
- emit child ops with ctx.emit(spec)
- write records with ctx.writeRecord(collection, key, data) and artifacts with ctx.writeArtifact(spec)
- use require("site-db") and require("scraper-db") for direct SQL access

For the complete API with type signatures, see scraper help scraper-js-api-reference.
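To make the narrow context concrete, here is a small op script exercised against an in-memory stand-in for `ctx`. The method names `ctx.input`, `ctx.dep`, `ctx.emit`, and `ctx.writeRecord` are the ones listed above. Everything else is assumed for illustration: the `fetchOpID` input field, the shape of the dependency result, the emitted spec, the `stories` collection, and the `makeFakeCtx` helper.

```javascript
// Op script sketch: reads a fetched page from a dependency, writes one record
// per link, and emits a child fetch op for the next page if there is one.
async function extractList(ctx) {
  const page = ctx.dep(ctx.input.fetchOpID); // result of an earlier op (assumed shape)
  for (const link of page.links) {
    ctx.writeRecord("stories", link.id, { title: link.title, url: link.url });
  }
  if (page.nextUrl) {
    // Hypothetical spec shape; only the ctx.emit(spec) call is documented above.
    ctx.emit({ kind: "http/fetch", input: { url: page.nextUrl } });
  }
  return { written: page.links.length };
}

// In-memory stand-in for the runtime-provided context, for local experiments only.
function makeFakeCtx(input, depResults) {
  const records = [];
  const emitted = [];
  return {
    input,
    dep: (opID) => {
      if (!(opID in depResults)) throw new Error(`unknown dependency: ${opID}`);
      return depResults[opID];
    },
    emit: (spec) => emitted.push(spec),
    writeRecord: (collection, key, data) => records.push({ collection, key, data }),
    records,
    emitted,
  };
}
```

A fake context like this also gives a quick way to reproduce the "cannot find dependency output" failure: pass an op ID that is missing from `depResults` and the script throws immediately.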
The execution path is implemented in:
- pkg/engine/runner/js.go
- pkg/js/runtime/executor.go
- pkg/js/runtime/databases.go

The worker process runs scraper worker run. It opens the engine store, builds the runner registry, opens site DBs on demand, and loops over ready queues. The scheduler is responsible for dependency refresh, lease recovery, queue policy resolution, and calling the appropriate runner.
The runner registry (pkg/engine/runner/runner.go) maps op kinds to runner implementations. The two built-in runners are registered at startup:
- js runner: loads site scripts, builds a goja runtime per execution, and runs the script function
- http/fetch runner: performs HTTP requests using pkg/engine/config/ settings (user agent, timeout)

The main files are:
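- pkg/cmd/worker.go
- pkg/engine/scheduler/scheduler.go
- pkg/engine/runner/http.go
- pkg/engine/store/sqlite/store.go

The scheduler does not know site-specific parsing logic. It only knows how to refresh dependencies, recover leases, resolve queue policy, and call the appropriate runner.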
That separation is what lets js-demo, hackernews, slashdot, and nereval all use the same engine, even though their manifests are loaded from the filesystem during bootstrap rather than being compiled into the binary.
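The kind-to-runner lookup behind that separation can be modeled in a few lines. The real registry is Go code in pkg/engine/runner/runner.go; the sketch below is a JavaScript model with hypothetical names (`RunnerRegistry`, `dispatch`) and assumed op fields and kind strings, intended only to show that dispatch depends on the op kind and nothing site-specific.

```javascript
// Toy runner registry: op kind -> runner function.
class RunnerRegistry {
  constructor() {
    this.runners = new Map();
  }
  register(kind, runner) {
    this.runners.set(kind, runner);
  }
  resolve(kind) {
    const runner = this.runners.get(kind);
    if (!runner) throw new Error(`no runner registered for kind "${kind}"`);
    return runner;
  }
}

// Two stand-in runners, registered up front like the built-in ones.
const registry = new RunnerRegistry();
registry.register("js", (op) => ({ ran: "js", script: op.metadata.script }));
registry.register("http/fetch", (op) => ({ ran: "http/fetch", url: op.input.url }));

// The scheduler side only dispatches by kind; it never inspects site logic.
function dispatch(op) {
  return registry.resolve(op.kind)(op);
}
```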
The runtime model is easiest to understand as a durable graph:
submit verb
-> create workflow
-> emit initial op(s)
worker
-> lease op
-> run js or http/fetch
-> persist result/artifacts
-> persist emitted child ops
-> later child ops become ready
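The flow above can be sketched as a toy in-memory simulation. It is a deliberately simplified model, not the engine: there is no persistence, no lease expiry, and no queue policy, and the names `runWorkflow`, `deps`, and `status` are assumptions chosen for the example. What it does show is the core loop: pick an op whose dependencies are done, run it, store its result, and add any emitted children to the graph.

```javascript
// Toy model of the durable graph: ops wait until all dependencies are done,
// a worker loop picks ready ops, and emitted children join the graph.
function runWorkflow(initialOps, runners) {
  const ops = new Map(initialOps.map((op) => [op.id, { ...op, status: "pending" }]));
  const order = [];
  for (;;) {
    // "Lease" the first pending op whose dependencies have all finished.
    const ready = [...ops.values()].find(
      (op) =>
        op.status === "pending" &&
        (op.deps || []).every((id) => ops.get(id)?.status === "done")
    );
    if (!ready) break; // nothing runnable is left

    const emitted = [];
    ready.result = runners[ready.kind](ready, (child) => emitted.push(child));
    ready.status = "done";
    order.push(ready.id);

    // Store emitted child ops; they become ready once their deps are done.
    for (const child of emitted) {
      ops.set(child.id, { ...child, status: "pending" });
    }
  }
  return { order, ops };
}
```

An op whose dependency was never emitted stays pending forever in this model, which mirrors the "dependency was never emitted" failure listed in the troubleshooting table.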
For example, the NEREVAL workflow looks like this:
scraper --sites-manifest-dir ./sites site nereval run seed
-> verb emits js seed op
worker run
-> js seed emits list fetch + list extract
-> list extract emits detail fetches and page-2 fetch
-> detail extractors write normalized property tables into nereval.db
The engine DB is the durable workflow runtime database. It stores workflows, ops, dependencies, leases, results, artifacts, and queue limiter state.
The site DB is the query-facing projection database for one site. It stores the records an operator or downstream tool actually wants to inspect. Keeping those separate matters because engine state is generic runtime infrastructure, while site schema is allowed to be specific and evolve differently.
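If a contributor is unsure where a new table should go, the rule of thumb is: generic workflow runtime state belongs in the engine DB, and site-specific records that an operator or downstream tool will query belong in the site DB.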
Read these in order if you want the shortest path through the real runtime:
1. pkg/cmd/bootstrap.go
2. pkg/cmd/app_config.go
3. pkg/cmd/root.go
4. pkg/cmd/site.go
5. pkg/sites/submitverbs/host.go
6. pkg/sites/submitverbs/runtime.go
7. pkg/cmd/worker.go
8. pkg/engine/scheduler/scheduler.go
9. pkg/engine/store/sqlite/store.go

Then read one site directory under sites/ end to end.
| Problem | Cause | Solution |
|---|---|---|
| You expected site <site> run <verb> to do the whole scrape | Submit verbs only seed work | Run scraper --sites-manifest-dir ./sites worker run against the same DBs |
| JS script cannot find dependency output | The dependency op ID is wrong or the dependency was never emitted | Check the workflow graph in the emitting script and the dependency read in the consumer script |
| A site writes nothing to its DB | The worker never opened that site DB or the script returned an error early | Start with the command-path tests in pkg/cmd/site_test.go |
| site <site> run <verb> is missing entirely | The site manifests were not resolved before the command tree was built | Pass --sites-manifest-dir, set SCRAPER_SITES_MANIFEST_DIRS, or configure ~/.scraper/config.yaml so bootstrap can load the site directories |
| It is unclear whether logic belongs in a verb or a script | Submission and execution responsibilities are mixed | Keep workflow seeding in verbs/ and durable parsing/fan-out in scripts/ |
- scraper help scraper-architecture-overview: Broader system map and command overview
- scraper help scraper-js-api-reference: Complete JavaScript API reference for both verb and script contexts
- scraper help scraper-queue-policies-and-rate-limiting: Queue policy and token-bucket behavior in the worker
- scraper help scraper-new-developer-onboarding: First-day path through the repo and smoke tests
- scraper help scraper-bootstrap-config-and-site-manifest-loading: How scraper resolves site directories before building dynamic site commands