A guided introduction to finding structured-data candidates inside larger text.
The extract module is for text that contains structured data but is not itself just structured data. A Markdown document may start with YAML frontmatter and later contain a JSON code fence. A model response may wrap YAML in <yaml>...</yaml>. A pasted snippet may be a raw JSON object with no wrapper at all. require("extract") finds those candidates and preserves where they came from.
The module deliberately returns candidates rather than parsed values. A candidate is evidence: it records the raw wrapper, the payload, source positions, format guesses, and confidence. That evidence lets a script decide which candidate to trust, show a reviewer where it came from, or validate it before parsing.
When you do not yet know the document shape, call extract.all(input). It runs the available extractors and returns the candidates in source order.
const extract = require("extract");
const input = `---
title: Demo
---
~~~json
{"ok": true}
~~~
<yaml>name: Alice\n</yaml>
`;
const candidates = extract.all(input);
for (const candidate of candidates) {
  // One line per candidate: where it came from, what it looks like, how sure the guess is.
  console.log(candidate.Kind, candidate.Format, candidate.StartRow, candidate.Confidence);
}
The result is intentionally descriptive. Kind tells you whether the candidate came from frontmatter, a Markdown code block, an XML-like tag, or raw text. Format records the guessed data format. The row and column fields let a tool point back to the original source.
Once a script knows the source shape, call a specific helper. This makes command behavior easier to explain and tests easier to read.
const frontmatter = extract.frontmatter(markdownText);
const blocks = extract.markdownCodeBlocks(markdownText);
const tagged = extract.xmlTagged(modelResponse);
const raw = extract.rawStructured(possibleJsonOrYaml);
The XML helper is XML-like rather than a full XML parser. It recognizes simple same-name wrappers such as <json>...</json> and <yaml>...</yaml>, which are common in model responses and prompt protocols.
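To make "simple same-name wrappers" concrete, here is a minimal sketch of that style of matching, using a regular expression with a backreference. It is an illustration only and not the module's implementation; the function name findTagged and the returned field names (tag, payload, offset) are hypothetical.

```javascript
// Hypothetical illustration: match <name>...</name> pairs whose opening and
// closing names are identical. This is not how extract.xmlTagged is implemented.
function findTagged(text) {
  const pattern = /<([A-Za-z][\w-]*)>([\s\S]*?)<\/\1>/g;
  const found = [];
  for (const match of text.matchAll(pattern)) {
    found.push({
      tag: match[1],       // wrapper name, e.g. "json" or "yaml"
      payload: match[2],   // text between the opening and closing tags
      offset: match.index, // character offset of the opening tag
    });
  }
  return found;
}

const response = 'Here you go:\n<json>{"ok": true}</json>\n<yaml>name: Alice\n</yaml>';
console.log(findTagged(response));
```

A pattern like this ignores attributes, same-name nesting, and malformed markup, which is the kind of trade-off "XML-like" implies.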
Candidates are not trusted parsed values. Validate a candidate before handing its payload to JSON.parse() or to a YAML parser.
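const candidates = extract.all(input);
if (candidates.length === 0) {
  throw new Error("no candidates found");
}
const validation = extract.validate(candidates[0]);
if (!validation.Valid) {
  throw new Error("candidate did not validate");
}
// Sanitized is the text to hand to the parser.
console.log(validation.Sanitized);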
For JSON and YAML, validation delegates to the same repair semantics exposed by require("sanitize"). That means extraction and repair compose: first find the likely payload, then ask the format-specific sanitizer whether it can be made parseable.
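The find, validate, then parse sequence can be wrapped in a small helper. The sketch below is self-contained: it takes a validation result shaped like the one used in this guide (Valid and Sanitized) and parses the sanitized text with JSON.parse. The helper name parseValidatedJson and the sample objects are hypothetical, and the sketch assumes Sanitized is a string.

```javascript
// Hypothetical helper: parse only what validation accepted.
// Assumes `validation` has a boolean Valid and a string Sanitized.
function parseValidatedJson(validation) {
  if (!validation || !validation.Valid) {
    return { ok: false, error: "candidate did not validate" };
  }
  try {
    return { ok: true, value: JSON.parse(validation.Sanitized) };
  } catch (err) {
    // In this sketch a passed validation does not rule out a parse failure.
    return { ok: false, error: err.message };
  }
}

// Stand-in objects for what extract.validate(candidate) might return.
console.log(parseValidatedJson({ Valid: true, Sanitized: '{"ok": true}' }));
console.log(parseValidatedJson({ Valid: false, Sanitized: "" }));
```

Returning a result object instead of throwing lets a caller process many candidates in one pass and report every failure together.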
A command-line tool may need to return only the payload text, but a review tool should preserve positions. StartRow, StartColumn, PayloadStartRow, and PayloadStartColumn make it possible to highlight the wrapper and explain the extraction result to a user.
const rows = extract.all(input).map((candidate) => ({
  kind: candidate.Kind,
  format: candidate.Format,
  startsAt: `${candidate.StartRow}:${candidate.StartColumn}`,
  payloadStartsAt: `${candidate.PayloadStartRow}:${candidate.PayloadStartColumn}`,
}));
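A review tool could turn those positions into a visible pointer. The following is a minimal sketch, assuming StartRow and StartColumn are 1-based (this guide does not say which base the module uses); the helper name pointAt and the sample candidate are hypothetical.

```javascript
// Hypothetical helper: show the source line a candidate starts on, with a caret
// under its start column. Assumes 1-based StartRow and StartColumn; adjust the
// arithmetic if the module reports 0-based positions.
function pointAt(source, candidate) {
  const lines = source.split("\n");
  const line = lines[candidate.StartRow - 1] || "";
  const caret = " ".repeat(Math.max(candidate.StartColumn - 1, 0)) + "^";
  return `${candidate.StartRow}:${candidate.StartColumn}\n${line}\n${caret}`;
}

const source = "intro text\n<yaml>name: Alice</yaml>\n";
// Stand-in candidate; a real one would come from extract.all(source).
console.log(pointAt(source, { StartRow: 2, StartColumn: 1 }));
```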
all() currently keeps overlapping candidates. That is a feature of the evidence-first design. A document can have YAML frontmatter and also look like a raw YAML document; a Markdown code block can appear inside a larger wrapper. The module shows the evidence and leaves the policy to the caller.
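As one example of a caller-side policy, the sketch below keeps the highest-confidence candidate for a wanted format. It relies only on the Format and Confidence fields described above. The helper name pickBest, the sample Kind and Format strings, and the assumption that Confidence is a number where higher means more certain are all illustrative.

```javascript
// Hypothetical policy: among candidates of the wanted format, keep the one
// with the highest Confidence. Ties go to the earliest candidate in the array.
function pickBest(candidates, format) {
  let best = null;
  for (const candidate of candidates) {
    if (candidate.Format !== format) continue;
    if (best === null || candidate.Confidence > best.Confidence) {
      best = candidate;
    }
  }
  return best;
}

// Stand-in candidates; the Kind and Format strings are made up for this example.
const sample = [
  { Kind: "frontmatter", Format: "yaml", Confidence: 0.9 },
  { Kind: "raw", Format: "yaml", Confidence: 0.4 },
  { Kind: "code-block", Format: "json", Confidence: 0.8 },
];
console.log(pickBest(sample, "yaml").Kind);
```

Because all() keeps overlapping candidates, a policy like this runs after extraction and can be swapped without touching the extraction step.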
Common policies are:
The demo script reads a sample Markdown document and prints the discovered candidates and validation results.
./dist/goja-text run examples/js/extract-demo.js
The bundled root-mounted JavaScript verbs turn extraction into reusable commands:
./dist/goja-text extract list examples/text/structured-data-sample.md
./dist/goja-text extract validate examples/text/structured-data-sample.md
These verbs are a useful starting point for automation because they return structured rows rather than prose output.
Use all() when exploring mixed documents.