---
title: Scraper Runtime Model
description: Explains submit verbs, durable ops, workers, runners, and how JS fits into the execution model.
doc_version: 1
last_updated: 2026-07-02
---


The scraper runtime is intentionally split into two JavaScript environments that do different jobs. Submit verbs create initial durable work. Op scripts execute durable work later. New contributors often blur those together at first, and that confusion makes the code harder to reason about. This page is the short version of the real execution model.

At the highest level, the CLI submits workflows into the engine DB and the worker process executes them later. The worker uses runners such as `js` and `http/fetch`, reads site definitions from the registry, opens the right site DB, and writes back results plus any child ops emitted during execution.

## Submission-Time JS

Submission-time JS lives under `sites/<site>/verbs/`. These files expose top-level functions annotated with `__verb__`. The submit-verb host scans those files, builds Glazed/Cobra commands, and runs the selected function exactly once when the operator invokes a command such as `scraper --sites-manifest-dir ./sites site js-demo run seed`.

CLI flags are defined in the `__verb__` metadata, not in Go code. For example, `sites/hackernews/verbs/seed.js` declares `--base-url` and `--max-pages` as fields, and the host wires them into Cobra automatically. The parsed values are available in the verb function as `ctx.values`.

The important constraint is that a submit verb is not a worker. It does not crawl pages for minutes and it does not keep an in-process scheduler alive by default. Its job is to describe or emit the initial durable work graph.

The submit-verb host is implemented in:

- `pkg/sites/submitverbs/host.go`
- `pkg/sites/submitverbs/runtime.go`

## Execution-Time JS

Execution-time JS lives under `sites/<site>/scripts/`. These files run as durable ops through the `js` runner. Each op is persisted in the engine DB and references the script to run through op metadata, usually `metadata.script`.

Scripts can be synchronous or async. The runtime supports `async function` exports and will await the returned Promise before persisting results.

The execution context is intentionally narrow. A script can:

- read `ctx.input`
- inspect workflow metadata via `ctx.workflow` and op metadata via `ctx.op`
- read dependency results with `ctx.dep(opID)`
- emit child ops with `ctx.emit(spec)`
- write records with `ctx.writeRecord(collection, key, data)` and artifacts with `ctx.writeArtifact(spec)`
- use `require("site-db")` and `require("scraper-db")` for direct SQL access

For the complete API with type signatures, see `scraper help scraper-js-api-reference`.

The execution path is implemented in:

- `pkg/engine/runner/js.go`
- `pkg/js/runtime/executor.go`
- `pkg/js/runtime/databases.go`

## Workers, Runners, and Scheduling

The worker process runs `scraper worker run`. It opens the engine store, builds the runner registry, opens site DBs on demand, and loops over ready queues. The scheduler is responsible for dependency refresh, lease recovery, queue policy resolution, and calling the appropriate runner.

The runner registry (`pkg/engine/runner/runner.go`) maps op kinds to runner implementations. The two built-in runners are registered at startup:

- `js` runner — loads site scripts, builds a goja runtime per execution, and runs the script function
- `http/fetch` runner — performs HTTP requests using `pkg/engine/config/` settings (user agent, timeout)

The main files are:

- `pkg/cmd/worker.go`
- `pkg/engine/scheduler/scheduler.go`
- `pkg/engine/runner/http.go`
- `pkg/engine/store/sqlite/store.go`

The scheduler does not know site-specific parsing logic. It only knows how to:

- find ready queues
- resolve queue policy
- lease work safely
- invoke a runner
- write back results, emitted ops, artifacts, and errors

That separation is what lets `js-demo`, `hackernews`, `slashdot`, and `nereval` all use the same engine, even though their manifests are loaded from the filesystem during bootstrap rather than being compiled into the binary.

## Data Flow

The runtime model is easiest to understand as a durable graph:

```text
submit verb
  -> create workflow
  -> emit initial op(s)

worker
  -> lease op
  -> run js or http/fetch
  -> persist result/artifacts
  -> persist emitted child ops
  -> later child ops become ready
```

For example, the NEREVAL workflow looks like this:

```text
scraper --sites-manifest-dir ./sites site nereval run seed
  -> verb emits js seed op

worker run
  -> js seed emits list fetch + list extract
  -> list extract emits detail fetches and page-2 fetch
  -> detail extractors write normalized property tables into nereval.db
```

## Engine DB vs Site DB

The engine DB is the durable workflow runtime database. It stores workflows, ops, dependencies, leases, results, artifacts, and queue limiter state.

The site DB is the query-facing projection database for one site. It stores the records an operator or downstream tool actually wants to inspect. Keeping those separate matters because engine state is generic runtime infrastructure, while site schema is allowed to be specific and evolve differently.

If a contributor is unsure where a new table should go, the rule of thumb is:

- runtime/scheduling/lease/result table: engine DB
- query-oriented site projection: site DB

## What To Read In Code

Read these in order if you want the shortest path through the real runtime:

1. `pkg/cmd/bootstrap.go`
2. `pkg/cmd/app_config.go`
3. `pkg/cmd/root.go`
4. `pkg/cmd/site.go`
5. `pkg/sites/submitverbs/host.go`
6. `pkg/sites/submitverbs/runtime.go`
7. `pkg/cmd/worker.go`
8. `pkg/engine/scheduler/scheduler.go`
9. `pkg/engine/store/sqlite/store.go`

Then read one site directory under `sites/` end to end.

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| You expected `site <site> run <verb>` to do the whole scrape | Submit verbs only seed work | Run `scraper --sites-manifest-dir ./sites worker run` against the same DBs |
| JS script cannot find dependency output | The dependency op ID is wrong or the dependency was never emitted | Check the workflow graph in the emitting script and the dependency read in the consumer script |
| A site writes nothing to its DB | The worker never opened that site DB or the script returned an error early | Start with the command-path tests in `pkg/cmd/site_test.go` |
| `site <site> run <verb>` is missing entirely | The site manifests were not resolved before the command tree was built | Pass `--sites-manifest-dir`, set `SCRAPER_SITES_MANIFEST_DIRS`, or configure `~/.scraper/config.yaml` so bootstrap can load the site directories |
| It is unclear whether logic belongs in a verb or a script | Submission and execution responsibilities are mixed | Keep workflow seeding in `verbs/` and durable parsing/fan-out in `scripts/` |

## See Also

- `scraper help scraper-architecture-overview` — Broader system map and command overview
- `scraper help scraper-js-api-reference` — Complete JavaScript API reference for both verb and script contexts
- `scraper help scraper-queue-policies-and-rate-limiting` — Queue policy and token-bucket behavior in the worker
- `scraper help scraper-new-developer-onboarding` — First-day path through the repo and smoke tests
- `scraper help scraper-bootstrap-config-and-site-manifest-loading` — How scraper resolves site directories before building dynamic site commands
