Scraper Architecture Overview

High-level map of the durable engine, JS site layer, and filesystem-loaded site manifests.

Topics: scraper, architecture, engine, javascript, sites, worker, site

The scraper repository is a durable workflow engine for scraping tasks. Go owns persistence, scheduling, HTTP execution, leases, retries, queue policy, and CLI ergonomics. JavaScript owns most site-specific behavior: parsing HTML, deciding what work to emit next, and writing site-specific projections into each site database. That split is the main thing a new contributor needs to understand before reading code.

The current system is built around a small set of stable primitives. A workflow contains ops. Ops are persisted in the engine SQLite database. Workers poll for ready ops, lease them, execute them through a runner such as js or http/fetch, and persist results plus artifacts. Site manifests such as js-demo, hackernews, slashdot, and nereval live under the repo-level sites/ directory and provide the JS scripts, submit verbs, fixtures, and per-site schema that sit on top of that engine.

Core Layers

The engine layer is the durable runtime. It lives mainly in pkg/engine/model/types.go, pkg/engine/scheduler/scheduler.go, and pkg/engine/store/sqlite/store.go. It is responsible for turning "a graph of durable work" into repeatable execution with leases, retries, dependency tracking, queue policies, artifacts, and workflow state.

The runner layer maps op kinds to execution logic. A runner registry (pkg/engine/runner/runner.go) holds the active runners. The two built-in runners are:

  • js — executes site scripts through the embedded goja JS runtime (pkg/engine/runner/js.go)
  • http/fetch — performs HTTP requests and persists response bodies as artifacts (pkg/engine/runner/http.go)

HTTP runner behavior is configured through pkg/engine/config/config.go (user agent, timeout).

The site layer is the programmable behavior layer. Each site definition under sites/<site>/ contributes:

  • a site.yaml manifest
  • submit verbs under verbs/ (define CLI commands via __verb__ metadata)
  • op execution scripts under scripts/
  • site DB migrations under migrations/
  • optional fixtures used by tests

The CLI layer is the operator shell. It wires logging, help, migrations, engine inspection, site submission commands, and the background worker loop. The main entrypoints are pkg/cmd/root.go, pkg/cmd/site.go, pkg/cmd/worker.go, and pkg/cmd/engine.go.

Runtime Model

The most important runtime distinction is between submission-time JS and execution-time JS.

  • Submission-time JS lives in verbs/ and is discovered from __verb__ metadata. The submit-verb host scans verb files, builds Cobra CLI commands automatically, and runs the selected function when the operator invokes a command. CLI flags like --max-pages or --base-url are defined in the JS __verb__ metadata, not in Go code.
  • Execution-time JS lives in scripts/ and runs as durable js ops through the worker.

This means a typical workflow looks like this:

scraper site <site> run <verb>
  -> Go host loads JS submit verb
  -> JS submit verb inserts initial durable ops
  -> CLI exits

scraper worker run
  -> polls engine DB
  -> leases ready ops
  -> runs http/fetch or js runners
  -> persists results, artifacts, emitted child ops

The submit verb does not run the whole scrape inline. It only seeds the first durable work. The worker is the process that actually executes queued ops.
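The separation between seeding and executing can be modeled with an in-memory queue. Everything below is a simplified sketch under stated assumptions: the `queue` type, the `handler` signature, and the op IDs are hypothetical. The `maxCycles` parameter is only loosely inspired by the `--max-cycles` flag and here counts executed ops rather than poll cycles. The real engine persists ops in SQLite rather than a slice.

```go
package main

import "fmt"

// Op is a minimal stand-in for a durable op.
type Op struct {
	ID   string
	Kind string
}

// handler runs one op and returns the child ops it emits.
type handler func(op Op) []Op

// queue is an in-memory substitute for the engine database.
type queue struct {
	ready  []Op
	done   []string
	failed []string
}

// submit models a submit verb: it only enqueues initial work and returns.
func (q *queue) submit(op Op) {
	q.ready = append(q.ready, op)
}

// runWorker executes ready ops in FIFO order until the queue is empty or
// maxCycles ops have been processed. It returns the number processed.
func (q *queue) runWorker(handlers map[string]handler, maxCycles int) int {
	cycles := 0
	for cycles < maxCycles && len(q.ready) > 0 {
		op := q.ready[0]
		q.ready = q.ready[1:]
		if h, ok := handlers[op.Kind]; ok {
			q.ready = append(q.ready, h(op)...) // enqueue emitted children
			q.done = append(q.done, op.ID)
		} else {
			q.failed = append(q.failed, op.ID) // no runner for this kind
		}
		cycles++
	}
	return cycles
}

// demoHandlers wires a js -> http/fetch -> js chain with stub behavior.
func demoHandlers() map[string]handler {
	return map[string]handler{
		"js": func(op Op) []Op {
			if op.ID == "seed" {
				return []Op{{ID: "fetch-1", Kind: "http/fetch"}}
			}
			return nil // a parse op would write to the site DB here
		},
		"http/fetch": func(op Op) []Op {
			return []Op{{ID: "parse-1", Kind: "js"}}
		},
	}
}

func main() {
	q := &queue{}
	q.submit(Op{ID: "seed", Kind: "js"})
	fmt.Println("after submit, done:", q.done)

	n := q.runWorker(demoHandlers(), 16)
	fmt.Println("ops processed:", n, "order:", q.done)
}
```

Nothing runs at submit time in this model. Work only progresses when `runWorker` is called, which mirrors why a submitted workflow appears idle until a worker is started.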

Databases

The engine database stores cross-site runtime state. It contains workflows, ops, leases, dependencies, results, artifacts, and queue limiter state. The schema is managed in pkg/engine/store/sqlite/migrations/.

Each site gets its own SQLite database under the sites directory. A site DB stores query-oriented read models and projections that are specific to one site. For example, nereval.db contains normalized property assessment tables, while js-demo.db contains demo rows used to prove the JS runtime and durable worker path.

This split matters because it keeps engine correctness and site-specific schema evolution separate. If a site needs new projection tables, it changes its own migrations without forcing a top-level engine redesign.

Default Site Set

The current default site set is intentionally progressive:

  • js-demo proves pure js -> js -> site-db execution without HTTP.
  • hackernews proves js -> http/fetch -> js -> site-db.
  • slashdot proves the same path on a different HTML shape and multipage fan-out.
  • nereval is the first complex site. It adds ASP.NET form-state pagination, detail-page fan-out, and normalized site projections.

All four sites use JS submit verbs for CLI entrypoints and are loaded from the repo-level sites/ directory during bootstrap.

Commands You Will Use First

The fastest way to get oriented is to run the engine visibility commands first, then submit and execute the simplest sites from the CLI.

  • scraper engine status
  • scraper engine migrations status
  • scraper --sites-manifest-dir ./sites site migrate js-demo
  • scraper --sites-manifest-dir ./sites site js-demo run seed --workflow-id demo-1
  • scraper --sites-manifest-dir ./sites worker run --max-cycles 16 --poll-interval 5ms
  • scraper --sites-manifest-dir ./sites site hackernews run seed --max-pages 2
  • scraper --sites-manifest-dir ./sites site slashdot run seed --max-pages 2
  • scraper --sites-manifest-dir ./sites site nereval run seed --workflow-id nereval-test --max-pages 2

Run scraper help <slug> to open any of the detailed pages in this help set.

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| site <name> run <verb> submits work but nothing happens | The worker is not polling the engine DB | Run scraper worker run against the same --engine-db and --sites-dir |
| JS script cannot see a database | The runtime was not given site-db or scraper-db | Start by reviewing pkg/js/runtime/databases.go and the worker setup in pkg/cmd/worker.go |
| Workflow looks stuck | Ready ops are not being leased or a dependency failed | Check scraper engine status, then inspect the scheduler/store path in pkg/engine/scheduler/scheduler.go |
| A site parser seems wrong | The HTML fixture does not match the parser assumptions or the live site changed | Start with the fixture-backed tests for that site before changing runtime code |

See Also

  • scraper help scraper-runtime-model — Deeper explanation of submit verbs, workers, op JS, and durable execution
  • scraper help scraper-js-api-reference — Complete JavaScript API reference for both verb and script contexts
  • scraper help scraper-queue-policies-and-rate-limiting — How queue policies and durable token-bucket pacing work
  • scraper help scraper-new-developer-onboarding — Step-by-step onboarding path for a new contributor
  • scraper help scraper-adding-a-site — How to add a Go-native site when declarative manifests are not enough
  • scraper help scraper-bootstrap-config-and-site-manifest-loading — How scraper finds site manifests before building dynamic site commands