---
title: Scraper Architecture Overview
description: High-level map of the durable engine, JS site layer, and filesystem-loaded site manifests.
doc_version: 1
last_updated: 2026-07-02
---


The `scraper` repository is a durable workflow engine for scraping tasks. Go owns persistence, scheduling, HTTP execution, leases, retries, queue policy, and CLI ergonomics. JavaScript owns most site-specific behavior: parsing HTML, deciding what work to emit next, and writing site-specific projections into each site database. That split is the main thing a new contributor needs to understand before reading code.

The current system is built around a small set of stable primitives. A workflow contains ops. Ops are persisted in the engine SQLite database. Workers poll for ready ops, lease them, execute them through a runner such as `js` or `http/fetch`, and persist results plus artifacts. Site manifests such as `js-demo`, `hackernews`, `slashdot`, and `nereval` live under the repo-level `sites/` directory and provide the JS scripts, submit verbs, fixtures, and per-site schema that sit on top of that engine.

## Core Layers

The engine layer is the durable runtime. It lives mainly in `pkg/engine/model/types.go`, `pkg/engine/scheduler/scheduler.go`, and `pkg/engine/store/sqlite/store.go`. It is responsible for turning "a graph of durable work" into repeatable execution with leases, retries, dependency tracking, queue policies, artifacts, and workflow state.

The runner layer maps op kinds to execution logic. A runner registry (`pkg/engine/runner/runner.go`) holds the active runners. The two built-in runners are:

- `js` — executes site scripts through the embedded goja JS runtime (`pkg/engine/runner/js.go`)
- `http/fetch` — performs HTTP requests and persists response bodies as artifacts (`pkg/engine/runner/http.go`)

HTTP runner behavior is configured through `pkg/engine/config/config.go` (user agent, timeout).

The site layer is the programmable behavior layer. Each site definition under `sites/<site>/` contributes:

- a `site.yaml` manifest
- submit verbs under `verbs/` (define CLI commands via `__verb__` metadata)
- op execution scripts under `scripts/`
- site DB migrations under `migrations/`
- optional fixtures used by tests

The CLI layer is the operator shell. It wires logging, help, migrations, engine inspection, site submission commands, and the background worker loop. The main entrypoints are `pkg/cmd/root.go`, `pkg/cmd/site.go`, `pkg/cmd/worker.go`, and `pkg/cmd/engine.go`.

## Runtime Model

The most important runtime distinction is between submission-time JS and execution-time JS.

- Submission-time JS lives in `verbs/` and is discovered from `__verb__` metadata. The submit-verb host scans verb files, builds Cobra CLI commands automatically, and runs the selected function when the operator invokes a command. CLI flags like `--max-pages` or `--base-url` are defined in the JS `__verb__` metadata, not in Go code.
- Execution-time JS lives in `scripts/` and runs as durable `js` ops through the worker.

This means a typical workflow looks like this:

```text
scraper site <site> run <verb>
  -> Go host loads JS submit verb
  -> JS submit verb inserts initial durable ops
  -> CLI exits

scraper worker run
  -> polls engine DB
  -> leases ready ops
  -> runs http/fetch or js runners
  -> persists results, artifacts, emitted child ops
```

The submit verb does not run the whole scrape inline. It only seeds the first durable work. The worker is the process that actually executes queued ops.

## Databases

The engine database stores cross-site runtime state. It contains workflows, ops, leases, dependencies, results, artifacts, and queue limiter state. The schema is managed in `pkg/engine/store/sqlite/migrations/`.

Each site gets its own SQLite database under the sites directory. A site DB stores query-oriented read models and projections that are specific to one site. For example, `nereval.db` contains normalized property assessment tables, while `js-demo.db` contains demo rows used to prove the JS runtime and durable worker path.

This split matters because it keeps engine correctness and site-specific schema evolution separate. If a site needs new projection tables, it changes its own migrations without forcing a top-level engine redesign.

## Default Site Set

The current default site set is intentionally progressive:

- `js-demo` proves pure `js -> js -> site-db` execution without HTTP.
- `hackernews` proves `js -> http/fetch -> js -> site-db`.
- `slashdot` proves the same path on a different HTML shape and multipage fan-out.
- `nereval` is the first complex site. It adds ASP.NET form-state pagination, detail-page fan-out, and normalized site projections.

All four sites use JS submit verbs for CLI entrypoints and are loaded from the repo-level `sites/` directory during bootstrap.

## Commands You Will Use First

The fastest way to get oriented is to use the CLI against the engine visibility commands and simple sites.

- `scraper engine status`
- `scraper engine migrations status`
- `scraper --sites-manifest-dir ./sites site migrate js-demo`
- `scraper --sites-manifest-dir ./sites site js-demo run seed --workflow-id demo-1`
- `scraper --sites-manifest-dir ./sites worker run --max-cycles 16 --poll-interval 5ms`
- `scraper --sites-manifest-dir ./sites site hackernews run seed --max-pages 2`
- `scraper --sites-manifest-dir ./sites site slashdot run seed --max-pages 2`
- `scraper --sites-manifest-dir ./sites site nereval run seed --workflow-id nereval-test --max-pages 2`

Use `scraper help <slug>` for the detailed pages added in this help set.

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `site <name> run <verb>` submits work but nothing happens | The worker is not polling the engine DB | Run `scraper worker run` against the same `--engine-db` and `--sites-dir` |
| JS script cannot see a database | The runtime was not given `site-db` or `scraper-db` | Start by reviewing `pkg/js/runtime/databases.go` and the worker setup in `pkg/cmd/worker.go` |
| Workflow looks stuck | Ready ops are not being leased or a dependency failed | Check `scraper engine status`, then inspect the scheduler/store path in `pkg/engine/scheduler/scheduler.go` |
| A site parser seems wrong | The HTML fixture does not match the parser assumptions or the live site changed | Start with the fixture-backed tests for that site before changing runtime code |

## See Also

- `scraper help scraper-runtime-model` — Deeper explanation of submit verbs, workers, op JS, and durable execution
- `scraper help scraper-js-api-reference` — Complete JavaScript API reference for both verb and script contexts
- `scraper help scraper-queue-policies-and-rate-limiting` — How queue policies and durable token-bucket pacing work
- `scraper help scraper-new-developer-onboarding` — Step-by-step onboarding path for a new contributor
- `scraper help scraper-adding-a-site` — How to add a Go-native site when declarative manifests are not enough
- `scraper help scraper-bootstrap-config-and-site-manifest-loading` — How scraper finds site manifests before building dynamic site commands
