📄 New Developer Onboarding — glaze help scraper-new-developer-onboarding

New Developer Onboarding

Step-by-step onboarding path for a new contributor using the current filesystem-loaded sites and engine commands.

Tutorial, scraper, onboarding, tutorial, sites, testing, scraper, engine, worker, site, engine-db, sites-dir, max-pages

This tutorial gets a new contributor from zero context to a working mental model and a set of successful smoke tests. It deliberately starts with fixture-backed or pure-JS paths so the reader can validate the engine without depending on live websites. By the end, the reader will have seen the submit-verb flow, the worker loop, the engine visibility commands, and one complex site package.

The goal is not to memorize every file. The goal is to establish a safe first-day path through the codebase and the CLI.

Prerequisites

Before starting, the reader should have:

a working Go toolchain
the repo checked out with local workspace dependencies available
the ability to run go test ./...
basic familiarity with Go, Cobra, and JavaScript

All commands below assume the current working directory is the scraper/ repository root.

Step 1 — Read The Architecture Pages

Start with the shortest conceptual pages before reading implementation code. This gives names to the moving parts and reduces confusion when you see submit verbs, scheduler code, and JS op scripts later.

Read these pages in order:

scraper help scraper-architecture-overview
scraper help scraper-runtime-model
scraper help scraper-queue-policies-and-rate-limiting

Then skim these code files:

pkg/cmd/root.go
pkg/cmd/site.go
pkg/cmd/worker.go

Step 2 — Run The Full Test Suite

The fastest sanity check is the whole Go test suite. This confirms that the engine, site manifests, embedded help pages, and fixture-backed workflows all load correctly in the current environment.

go test ./... -count=1

If this fails, stop and debug the environment before moving on.
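When the full suite fails, re-running a single package verbosely can narrow things down before any deeper debugging. The following is a minimal sketch, not something that ships with the repository: retest is a hypothetical helper name, and the package path and test-name pattern are whatever you pass in. It relies only on standard go test flags (-count=1, -v, -run).

```shell
# retest PKG [PATTERN]
# Re-run one package's tests verbosely with the test cache bypassed.
# PATTERN, when given, is passed to -run to select tests by name.
retest() {
  local pkg="${1:?usage: retest PKG [PATTERN]}"
  local pattern="${2:-}"
  if [ -n "$pattern" ]; then
    go test "$pkg" -count=1 -v -run "$pattern"
  else
    go test "$pkg" -count=1 -v
  fi
}
```

For example, retest ./pkg/cmd/... re-runs the command package, which is where pkg/cmd/site_test.go lives.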

Step 3 — Smoke-Test The Pure-JS Path

js-demo is the smallest useful site. It proves the split between submit verbs and the worker without involving any HTTP.

First submit work:

tmpdir=$(mktemp -d)

go run ./cmd/scraper \
  --sites-manifest-dir ./sites \
  site js-demo run seed \
  --sites-dir "$tmpdir/sites" \
  --engine-db "$tmpdir/engine.db" \
  --workflow-id demo-1 \
  --count 3 \
  --multiplier 4 \
  --prefix smoke

The flags --count, --multiplier, and --prefix are defined in sites/jsdemo/verbs/seed.js using __verb__ metadata. The submit-verb host discovers these JS declarations and wires them into Cobra CLI flags automatically. This pattern is used by all default sites.

Then inspect the engine DB:

go run ./cmd/scraper engine status --engine-db "$tmpdir/engine.db"

You should see one workflow and ready work, but not a completed workflow yet. Now run the worker:

go run ./cmd/scraper \
  --sites-manifest-dir ./sites \
  worker run \
  --sites-dir "$tmpdir/sites" \
  --engine-db "$tmpdir/engine.db" \
  --max-cycles 16 \
  --poll-interval 5ms

Re-run engine status after that. The workflow status should now be succeeded, and the result and artifact counts should be non-zero.
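If you expect to repeat this loop often, the three commands can be wrapped in one shell function. This is only a sketch: smoke_jsdemo is a hypothetical helper name, and every flag in it is copied from the commands in this step rather than taken from any additional CLI reference.

```shell
# smoke_jsdemo DIR
# Submit js-demo work into DIR, drain it with the worker, then print engine status.
# Stops at the first failing command so a broken submit is not masked by later steps.
smoke_jsdemo() {
  local dir="${1:?usage: smoke_jsdemo DIR}"

  go run ./cmd/scraper --sites-manifest-dir ./sites \
    site js-demo run seed \
    --sites-dir "$dir/sites" --engine-db "$dir/engine.db" \
    --workflow-id demo-1 --count 3 --multiplier 4 --prefix smoke || return 1

  go run ./cmd/scraper --sites-manifest-dir ./sites \
    worker run \
    --sites-dir "$dir/sites" --engine-db "$dir/engine.db" \
    --max-cycles 16 --poll-interval 5ms || return 1

  go run ./cmd/scraper engine status --engine-db "$dir/engine.db"
}
```

A typical call is smoke_jsdemo "$(mktemp -d)", run from the scraper/ repository root so that ./cmd/scraper and ./sites resolve.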

Step 4 — Smoke-Test An HTTP Site

Now move to a site that uses the full js -> http/fetch -> js -> site-db path. Hacker News is the simplest HTTP site.

All sites use the same two-step pattern: submit work with a verb, then run the worker. The hackernews verb defines --base-url and --max-pages flags in sites/hackernews/verbs/seed.js.

tmpdir=$(mktemp -d)

go run ./cmd/scraper \
  --sites-manifest-dir ./sites \
  site hackernews run seed \
  --sites-dir "$tmpdir/sites" \
  --engine-db "$tmpdir/engine.db" \
  --workflow-id hn-test \
  --base-url "https://news.ycombinator.com/" \
  --max-pages 1

Then run the worker to execute the queued ops:

go run ./cmd/scraper \
  --sites-manifest-dir ./sites \
  worker run \
  --sites-dir "$tmpdir/sites" \
  --engine-db "$tmpdir/engine.db" \
  --max-cycles 32 \
  --poll-interval 25ms

This path proves that JS emits HTTP work, the HTTP runner persists artifacts, and the follow-up JS extractor writes rows into the site DB.

Note that the commands above fetch from the live news.ycombinator.com. For fully offline testing, rely on go test ./... instead: its fixture-backed tests serve embedded HTML from local HTTP test servers.
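To check that the extractor actually wrote something, one rough option is to look at what the worker run left under --sites-dir. The sketch below assumes that site databases are ordinary files stored somewhere under that directory, which this page does not state explicitly. list_site_files is a hypothetical helper name, and it uses only standard find and sort.

```shell
# list_site_files DIR
# Print every non-empty regular file under DIR, sorted by path.
# Assumption: site DBs are plain files below the --sites-dir directory.
list_site_files() {
  local dir="${1:?usage: list_site_files DIR}"
  find "$dir" -type f -size +0c | sort
}
```

Calling list_site_files "$tmpdir/sites" after the worker run should print at least one path if a site DB was created. An empty result suggests the extractor never ran or wrote elsewhere.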

Step 5 — Inspect A Complex Site Without Going Live

The first complex site is nereval. Its value is not just parsing HTML. It proves:

submit-verb driven workflow creation
ASP.NET list-page pagination with explicit form state
detail-page fan-out
normalized site DB writes

Do not run it live as part of onboarding. Instead, study these files:

sites/nereval/site.yaml
sites/nereval/verbs/seed.js
sites/nereval/scripts/seed.js
sites/nereval/scripts/extract_list.js
sites/nereval/scripts/extract_detail.js
sites/nereval/migrations/001_init.sql

Then read the fixture-backed test:

pkg/cmd/site_test.go

Step 6 — Learn The Debugging Commands

The minimum useful operator debugging set is:

scraper engine status
scraper engine migrations status
scraper site migrate <site>
scraper worker run --max-cycles 1
scraper help <slug>

These commands are enough to answer:

was the engine DB created?
are migrations applied?
did a site DB get created?
is the worker leasing anything?
where is the missing conceptual documentation?
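The leasing question in particular can be approached by comparing engine status before and after a single worker cycle. The following is a minimal sketch, assuming the temp directory layout from Steps 3 and 4 (engine.db and sites/ under one directory). one_cycle is a hypothetical helper name, and the flags are the ones already used on this page.

```shell
# one_cycle DIR
# Show engine status, run the worker for exactly one cycle, then show status again.
# Comparing the two status outputs indicates whether the worker leased any work.
one_cycle() {
  local dir="${1:?usage: one_cycle DIR}"

  echo "--- before ---"
  go run ./cmd/scraper engine status --engine-db "$dir/engine.db" || return 1

  go run ./cmd/scraper --sites-manifest-dir ./sites \
    worker run \
    --sites-dir "$dir/sites" --engine-db "$dir/engine.db" \
    --max-cycles 1 || return 1

  echo "--- after ---"
  go run ./cmd/scraper engine status --engine-db "$dir/engine.db"
}
```

If the two status outputs are identical while ready work exists, the problem is more likely in the worker or its site loading than in the submit verb.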

Step 7 — Read The Tickets Only After The Embedded Docs

The ticket docs are still valuable, but they should now be second-pass reading rather than the only onboarding path.

Read these if you need deeper implementation history (search for these ticket IDs in the ttmp/ directory):

SCRAPER-DESIGN — initial design guide and investigation diary
SCRAPER-RATE-LIMITER — queue rate limiter analysis and implementation guide

Troubleshooting

Problem	Cause	Solution
`go test ./...` fails immediately	Workspace dependencies or generated docs are not loading	Fix the environment before debugging scraper logic
`site js-demo run seed` works but nothing completes	The worker was never run	Use `worker run` against the same temp DBs
`site js-demo run seed` is missing entirely	scraper did not load the site manifests during bootstrap	Pass `--sites-manifest-dir ./sites`, set `SCRAPER_SITES_MANIFEST_DIRS`, or configure `~/.scraper/config.yaml`
You do not know whether a bug is engine or site specific	Too many layers are being changed at once	Reproduce first on `js-demo`, then on `hackernews` or `slashdot`, then on `nereval`
`nereval` feels too big to start with	You skipped the simpler sites	Go back to `js-demo` and one HTTP site first
