High-level map of the durable engine, JS site layer, and filesystem-loaded site manifests.
The scraper repository is a durable workflow engine for scraping tasks. Go owns persistence, scheduling, HTTP execution, leases, retries, queue policy, and CLI ergonomics. JavaScript owns most site-specific behavior: parsing HTML, deciding what work to emit next, and writing site-specific projections into each site database. That split is the main thing a new contributor needs to understand before reading code.
The current system is built around a small set of stable primitives. A workflow contains ops. Ops are persisted in the engine SQLite database. Workers poll for ready ops, lease them, execute them through a runner such as js or http/fetch, and persist results plus artifacts. Site manifests such as js-demo, hackernews, slashdot, and nereval live under the repo-level sites/ directory and provide the JS scripts, submit verbs, fixtures, and per-site schema that sit on top of that engine.
The engine layer is the durable runtime. It lives mainly in pkg/engine/model/types.go, pkg/engine/scheduler/scheduler.go, and pkg/engine/store/sqlite/store.go. It is responsible for turning "a graph of durable work" into repeatable execution with leases, retries, dependency tracking, queue policies, artifacts, and workflow state.
The runner layer maps op kinds to execution logic. A runner registry (pkg/engine/runner/runner.go) holds the active runners. The two built-in runners are:
- js: executes site scripts through the embedded goja JS runtime (pkg/engine/runner/js.go)
- http/fetch: performs HTTP requests and persists response bodies as artifacts (pkg/engine/runner/http.go)

HTTP runner behavior (user agent, timeout) is configured through pkg/engine/config/config.go.
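To make the idea of a runner registry concrete, here is a small sketch of a map from op kind to runner. The `Runner` interface, `Registry` type, and `Dispatch` method are assumptions made for illustration; the real interface in pkg/engine/runner/runner.go may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// Runner is a hypothetical interface: it executes one op payload
// and returns a result.
type Runner interface {
	Run(payload string) (string, error)
}

// RunnerFunc adapts a plain function to the Runner interface.
type RunnerFunc func(payload string) (string, error)

// Run calls the wrapped function.
func (f RunnerFunc) Run(payload string) (string, error) { return f(payload) }

// ErrUnknownKind is returned when no runner is registered for an op kind.
var ErrUnknownKind = errors.New("no runner registered for op kind")

// Registry maps op kinds such as "js" or "http/fetch" to runners.
type Registry struct {
	runners map[string]Runner
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{runners: map[string]Runner{}}
}

// Register associates a runner with an op kind, replacing any earlier one.
func (r *Registry) Register(kind string, runner Runner) {
	r.runners[kind] = runner
}

// Dispatch looks up the runner for kind and executes the payload with it.
func (r *Registry) Dispatch(kind, payload string) (string, error) {
	runner, ok := r.runners[kind]
	if !ok {
		return "", fmt.Errorf("%w: %q", ErrUnknownKind, kind)
	}
	return runner.Run(payload)
}

func main() {
	reg := NewRegistry()
	// Stub runners: the real ones would call goja or perform an HTTP request.
	reg.Register("js", RunnerFunc(func(p string) (string, error) {
		return "ran script " + p, nil
	}))
	reg.Register("http/fetch", RunnerFunc(func(p string) (string, error) {
		return "fetched " + p, nil
	}))

	out, err := reg.Dispatch("js", "seed.js")
	fmt.Println(out, err)

	_, err = reg.Dispatch("shell", "ls")
	fmt.Println(errors.Is(err, ErrUnknownKind))
}
```

Keeping dispatch behind a registry means a new op kind only needs a new `Register` call, while the scheduler and store stay unchanged.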
The site layer is the programmable behavior layer. Each site definition under sites/<site>/ contributes:
- site.yaml manifest
- verbs/ (define CLI commands via __verb__ metadata)
- scripts/
- migrations/

The CLI layer is the operator shell. It wires logging, help, migrations, engine inspection, site submission commands, and the background worker loop. The main entrypoints are pkg/cmd/root.go, pkg/cmd/site.go, pkg/cmd/worker.go, and pkg/cmd/engine.go.
The most important runtime distinction is between submission-time JS and execution-time JS.
- Submission-time JS lives under verbs/ and is discovered from __verb__ metadata. The submit-verb host scans verb files, builds Cobra CLI commands automatically, and runs the selected function when the operator invokes a command. CLI flags like --max-pages or --base-url are defined in the JS __verb__ metadata, not in Go code.
- Execution-time JS lives under scripts/ and runs as durable js ops through the worker.

This means a typical workflow looks like this:
scraper site <site> run <verb>
-> Go host loads JS submit verb
-> JS submit verb inserts initial durable ops
-> CLI exits
scraper worker run
-> polls engine DB
-> leases ready ops
-> runs http/fetch or js runners
-> persists results, artifacts, emitted child ops
The submit verb does not run the whole scrape inline. It only seeds the first durable work. The worker is the process that actually executes queued ops.
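The separation between seeding and executing can be illustrated with a toy in-memory queue. Everything here is hypothetical (the `Queue`, `Submit`, `Drain`, and `demoHandler` names, and the shape of `Op`); it only shows that submission enqueues a single op and returns, while a separate drain loop executes ops and enqueues whatever children they emit.

```go
package main

import "fmt"

// Op is a hypothetical, simplified op: a runner kind plus a label.
type Op struct {
	Kind  string // "js" or "http/fetch"
	Name  string
	Pages int // only meaningful for the seed op
}

// Queue stands in for the durable op table in the engine database.
type Queue struct {
	pending []Op
}

// Enqueue appends an op to the queue.
func (q *Queue) Enqueue(op Op) { q.pending = append(q.pending, op) }

// Submit plays the role of a submit verb: it seeds one op and returns.
func Submit(q *Queue, maxPages int) {
	q.Enqueue(Op{Kind: "js", Name: "seed", Pages: maxPages})
}

// Handler executes an op and returns the child ops it emits.
type Handler func(op Op) []Op

// Drain plays the role of the worker: it executes queued ops, enqueues
// their children, and returns how many ops it executed.
func Drain(q *Queue, handle Handler, maxCycles int) int {
	executed := 0
	for executed < maxCycles && len(q.pending) > 0 {
		op := q.pending[0]
		q.pending = q.pending[1:]
		for _, child := range handle(op) {
			q.Enqueue(child)
		}
		executed++
	}
	return executed
}

// demoHandler mimics a js -> http/fetch -> js pipeline.
func demoHandler(op Op) []Op {
	switch {
	case op.Kind == "js" && op.Name == "seed":
		// The seed script fans out into one fetch per page.
		var children []Op
		for i := 1; i <= op.Pages; i++ {
			children = append(children, Op{Kind: "http/fetch", Name: fmt.Sprintf("page-%d", i)})
		}
		return children
	case op.Kind == "http/fetch":
		// Each fetched page is handed to a parse script.
		return []Op{{Kind: "js", Name: "parse " + op.Name}}
	default:
		// Parse scripts write projections and emit nothing further.
		return nil
	}
}

func main() {
	q := &Queue{}
	Submit(q, 2)
	fmt.Println("queued after submit:", len(q.pending))
	fmt.Println("executed by worker:", Drain(q, demoHandler, 16))
}
```

With two pages, the worker in this sketch executes five ops in total (one seed, two fetches, two parses), even though the submit step enqueued only one.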
The engine database stores cross-site runtime state. It contains workflows, ops, leases, dependencies, results, artifacts, and queue limiter state. The schema is managed in pkg/engine/store/sqlite/migrations/.
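Dependency tracking boils down to one question per op: have all of its dependencies succeeded? The sketch below answers that question over an in-memory slice. The `Op` fields, the status names, and `ReadyOps` are illustrative assumptions and do not reflect the actual schema in pkg/engine/store/sqlite/migrations/.

```go
package main

import "fmt"

// Status is a hypothetical op lifecycle state.
type Status string

const (
	Pending   Status = "pending"
	Succeeded Status = "succeeded"
	Failed    Status = "failed"
)

// Op is a simplified op row with its dependency edges.
type Op struct {
	ID        string
	Status    Status
	DependsOn []string
}

// ReadyOps returns the IDs of pending ops whose dependencies have all
// succeeded, in input order. An op with a failed or unfinished dependency
// is never reported as ready.
func ReadyOps(ops []Op) []string {
	status := make(map[string]Status, len(ops))
	for _, op := range ops {
		status[op.ID] = op.Status
	}
	var ready []string
	for _, op := range ops {
		if op.Status != Pending {
			continue
		}
		allDone := true
		for _, dep := range op.DependsOn {
			if status[dep] != Succeeded {
				allDone = false
				break
			}
		}
		if allDone {
			ready = append(ready, op.ID)
		}
	}
	return ready
}

func main() {
	ops := []Op{
		{ID: "fetch", Status: Succeeded},
		{ID: "parse", Status: Pending, DependsOn: []string{"fetch"}},
		{ID: "store", Status: Pending, DependsOn: []string{"parse"}},
		{ID: "broken-fetch", Status: Failed},
		{ID: "blocked", Status: Pending, DependsOn: []string{"broken-fetch"}},
	}
	fmt.Println(ReadyOps(ops))
}
```

In this example only `parse` is ready: `store` waits on `parse`, and `blocked` can never run because its dependency failed, which is one way a workflow can appear stuck.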
Each site gets its own SQLite database under the sites directory. A site DB stores query-oriented read models and projections that are specific to one site. For example, nereval.db contains normalized property assessment tables, while js-demo.db contains demo rows used to prove the JS runtime and durable worker path.
This split matters because it keeps engine correctness and site-specific schema evolution separate. If a site needs new projection tables, it changes its own migrations without forcing a top-level engine redesign.
The current default site set is intentionally progressive:
- js-demo proves pure js -> js -> site-db execution without HTTP.
- hackernews proves js -> http/fetch -> js -> site-db.
- slashdot proves the same path on a different HTML shape and multipage fan-out.
- nereval is the first complex site. It adds ASP.NET form-state pagination, detail-page fan-out, and normalized site projections.

All four sites use JS submit verbs for CLI entrypoints and are loaded from the repo-level sites/ directory during bootstrap.
The fastest way to get oriented is to use the CLI against the engine visibility commands and simple sites.
- scraper engine status
- scraper engine migrations status
- scraper --sites-manifest-dir ./sites site migrate js-demo
- scraper --sites-manifest-dir ./sites site js-demo run seed --workflow-id demo-1
- scraper --sites-manifest-dir ./sites worker run --max-cycles 16 --poll-interval 5ms
- scraper --sites-manifest-dir ./sites site hackernews run seed --max-pages 2
- scraper --sites-manifest-dir ./sites site slashdot run seed --max-pages 2
- scraper --sites-manifest-dir ./sites site nereval run seed --workflow-id nereval-test --max-pages 2

Use scraper help <slug> for the detailed pages added in this help set.
| Problem | Cause | Solution |
|---|---|---|
| site <name> run <verb> submits work but nothing happens | The worker is not polling the engine DB | Run scraper worker run against the same --engine-db and --sites-dir |
| JS script cannot see a database | The runtime was not given site-db or scraper-db | Start by reviewing pkg/js/runtime/databases.go and the worker setup in pkg/cmd/worker.go |
| Workflow looks stuck | Ready ops are not being leased or a dependency failed | Check scraper engine status, then inspect the scheduler/store path in pkg/engine/scheduler/scheduler.go |
| A site parser seems wrong | The HTML fixture does not match the parser assumptions or the live site changed | Start with the fixture-backed tests for that site before changing runtime code |

- scraper help scraper-runtime-model: Deeper explanation of submit verbs, workers, op JS, and durable execution
- scraper help scraper-js-api-reference: Complete JavaScript API reference for both verb and script contexts
- scraper help scraper-queue-policies-and-rate-limiting: How queue policies and durable token-bucket pacing work
- scraper help scraper-new-developer-onboarding: Step-by-step onboarding path for a new contributor
- scraper help scraper-adding-a-site: How to add a Go-native site when declarative manifests are not enough
- scraper help scraper-bootstrap-config-and-site-manifest-loading: How scraper finds site manifests before building dynamic site commands