Part 3 of 3 · the build

The chunk→fact bridge

qmd’s retrieval feeding Wax’s structured memory

A tool that reads an org-mode corpus (devlogs, notes, session journals), chunks it along its own structure, and distills durable, bitemporal, citable EAV facts back into Wax. Grounded in a real tree: ~/dev/org/{devlogs,notes} and ~/dev/my/claude-journal.

Citation colours: green = the new bridge · amber = Wax · blue = qmd.

What the corpus actually is

All org. Markdown is vestigial here. Three shapes, each already semantically segmented.

Sub-corpus	Path shape	Structure the bridge exploits
Devlogs (481)	`org/devlogs/<project>/<yyyy>/<mm>/<date>_<project>.org`	`#+SOURCE-FINGERPRINT:` (a SHA — free content hash), `#+TITLE`, `#+FILETAGS: :daily:summary:<proj>:`, then `* Summary` → `* Journal` → ` HH:MM —` → `** Key Decisions` with dense `[[orgit-rev:…::<sha>]]` commit links
Notes (93)	`org/notes/<yyyy>/<slug>.org`	`:PROPERTIES:` drawer (`:ID:`, `:CREATED:`), `Affects:`/`Severity:`, `* Root cause` → `** Layer N`, `#+begin_src`/`#+begin_example` blocks
claude-journal (249)	`my/claude-journal/<project>/<yyyy>/<mm>/<ts>_<project>.org`	`#+PROJECT:`, `* Context` / `* User Context` / `* Reflections` / `* Observations` → `** <obs>`

The author already split this into Decisions / Observations / Root-cause / Reflections — high signal-to-noise. And project + date + kind live in the path and headers: free entity and time metadata, no inference needed.
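Because project, date, and kind sit in the path, they can be pulled out with no model in the loop. A minimal sketch in Python (chosen only for illustration), assuming the three path shapes in the table above. The regexes, the function name `path_metadata`, and the decision to keep the filename prefix as an opaque `stamp` are assumptions, not part of any existing tool.

```python
import re
from typing import Optional

# One pattern per sub-corpus, following the path shapes in the table.
# The filename prefix (<date> or <ts>) is kept as an opaque string because
# its exact format is not specified here.
PATTERNS = [
    ("devlog", re.compile(
        r"org/devlogs/(?P<project>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/"
        r"(?P<stamp>[^/]+?)_(?P=project)\.org$")),
    ("note", re.compile(
        r"org/notes/(?P<year>\d{4})/(?P<slug>[^/]+)\.org$")),
    ("journal", re.compile(
        r"my/claude-journal/(?P<project>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/"
        r"(?P<stamp>[^/]+?)_(?P=project)\.org$")),
]


def path_metadata(path: str) -> Optional[dict]:
    """Return kind plus whatever the path encodes, or None for unknown shapes."""
    for kind, pattern in PATTERNS:
        match = pattern.search(path)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None
```

Notes carry only a year and a slug in the path, so their finer-grained date would have to come from the `:CREATED:` property instead.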

The insight: org structure is the chunker

qmd guesses structure from # density. Your org tree hands it over.

qmd’s job (Markdown)

Score break points by regex (H1=100, code-fence=80, HR=60, paragraph=20…) and search a ~200-token window for the best cut with squared-distance decay (store.ts:106–242). It is reconstructing structure the author left implicit.
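To make the scoring idea concrete, here is a rough sketch of a weighted break-point search. It is not a port of store.ts: the four weights come from the sentence above, but the regexes, the character-based window (800 characters standing in for roughly 200 tokens), and the decay formula `weight * (1 - d²)` are assumptions.

```python
import re

# Weights quoted in the text; the patterns themselves are illustrative.
BREAK_PATTERNS = [
    (re.compile(r"^# ", re.M), 100),       # H1
    (re.compile(r"^`{3}", re.M), 80),      # code fence
    (re.compile(r"^---\s*$", re.M), 60),   # horizontal rule
    (re.compile(r"\n\n"), 20),             # paragraph break
]


def best_cut(text: str, target: int, window: int = 800) -> int:
    """Pick the best break offset within `window` characters before `target`.

    Each candidate's weight is scaled by a squared-distance decay, so a
    strong break far away can still lose to a weaker one close to target.
    Returns `target` itself when no candidate is found.
    """
    lo = max(0, target - window)
    best_pos, best_score = target, 0.0
    for pattern, weight in BREAK_PATTERNS:
        for match in pattern.finditer(text, lo, target):
            distance = (target - match.start()) / window  # 0 near target, 1 at edge
            score = weight * (1.0 - distance ** 2)
            if score > best_score:
                best_pos, best_score = match.start(), score
    return best_pos
```

The bridge would only need something like this as a fallback inside a subtree that is too large to keep whole.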

The bridge’s job (org)

Prefer whole heading subtrees: a `** Key Decisions` block is a chunk. Only fall back to a qmd-style windowed cut inside an oversized subtree. Never split inside #+begin_…/#+end_… or :PROPERTIES:…:END: (the org analogue of qmd’s code-fence guard).

Two consequences worth their own line: the heading path tells you which section a chunk is in — so you distill only the high-signal ones and skip Conversation Excerpts for free; and the directory names are your entity table (every project is a top-level dir).

The org-aware chunker

Subtree-first, block-aware. The demo chunks a (faithful) slice of a real invoicekit devlog, with two tunable settings: break at heading level ≤ 3, and max chars/chunk 1400.

Each chunk is tagged:

distill: high-signal section
skip: routed out by heading
block: contains a no-split region
oversize: would get a windowed sub-cut

Each chunk’s provenance (file, project, date, heading path, char span) rides into Wax as metadata.
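A minimal sketch of the subtree-first chunker and its four tags, assuming plain org headings without trailing tags. The names `chunk_org` and `classify`, and the section names in `HIGH_SIGNAL` and `SKIP`, are illustrative choices drawn from the corpus description. The windowed sub-cut for oversized chunks is only flagged here, not performed.

```python
import re
from dataclasses import dataclass, field

HEADING = re.compile(r"^(\*+) (.+)$")
BLOCK_BEGIN = re.compile(r"^\s*#\+begin_", re.IGNORECASE)
BLOCK_END = re.compile(r"^\s*#\+end_", re.IGNORECASE)

# Section names taken from the corpus description; adjust to taste.
HIGH_SIGNAL = {"Key Decisions", "Observations", "Root cause", "Reflections"}
SKIP = {"Conversation Excerpts"}


@dataclass
class Chunk:
    heading_path: tuple
    lines: list = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines)


def chunk_org(source: str, max_level: int = 3) -> list:
    """Start a new chunk at every heading of level <= max_level."""
    chunks, path, in_block = [], [], False
    current = Chunk(heading_path=())
    for line in source.splitlines():
        if BLOCK_BEGIN.match(line):
            in_block = True
        elif BLOCK_END.match(line):
            in_block = False
        # A starred line inside a #+begin_/#+end_ block is not a heading.
        match = None if in_block else HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            if current.lines:
                chunks.append(current)
            level = len(match.group(1))
            path = path[: level - 1] + [match.group(2).strip()]
            current = Chunk(heading_path=tuple(path))
        current.lines.append(line)
    if current.lines:
        chunks.append(current)
    return chunks


def classify(chunk: Chunk, max_chars: int = 1400) -> set:
    """Tag a chunk with distill/skip plus optional block and oversize flags."""
    routed_out = any(h in SKIP for h in chunk.heading_path)
    high_signal = any(h in HIGH_SIGNAL for h in chunk.heading_path)
    tags = {"distill" if high_signal and not routed_out else "skip"}
    if any(BLOCK_BEGIN.match(l) or l.strip() == ":PROPERTIES:" for l in chunk.lines):
        tags.add("block")
    if len(chunk.text) > max_chars:
        tags.add("oversize")
    return tags
```

Because cuts happen only at headings, a `#+begin_…` block or a `:PROPERTIES:` drawer is never split by this pass. The guard matters once a fallback cut is added for oversized chunks.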

The bridge, end to end


Chunk → fact: the mapping

Each extracted triple lands as a fact_assert with an sm_evidence row pointing back at the exact source span.

Extraction output	→ Wax	Source field
`subject` (canonical key)	`fact_assert.subject`	resolved, see §06
`predicate` (controlled vocab)	`fact_assert.predicate`	—
`object` (typed)	`fact_assert.object`	int / string / entity-ref
entry date	`valid_from_ms`	`#+DATE` / path
extraction time	`system_from_ms`	now
chunk frame id	`sm_evidence.source_frame_id`	StructuredMemorySchema.swift:94
char span in chunk	`span_start_utf8`/`_end_utf8`	…:95–97
`"org-bridge"` + model id	`extractor_id`/`_version`	…:99–100
LLM confidence	`confidence`	…:102

Worked example — the `Key Decisions` chunk above
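The devlog chunk itself is not reproduced here, so the sketch below runs the mapping table on a placeholder chunk and triple. The field names come from the table. The nested dict shape, the helper names, midnight UTC for `valid_from_ms`, and reading the `_utf8` suffix as UTF-8 byte offsets are assumptions. The object text, frame id, and model id are invented placeholders.

```python
from datetime import datetime, timezone


def to_ms(iso_date: str) -> int:
    """Midnight UTC of an ISO date, in epoch milliseconds (UTC is an assumption)."""
    day = datetime.strptime(iso_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(day.timestamp() * 1000)


def utf8_span(text: str, start: int, end: int) -> tuple:
    """Convert character offsets into UTF-8 byte offsets."""
    return len(text[:start].encode("utf-8")), len(text[:end].encode("utf-8"))


def to_wax_rows(triple: dict, chunk: dict, model_id: str, now_ms: int) -> dict:
    """Build one fact_assert payload and its sm_evidence row from a triple."""
    start, end = utf8_span(chunk["text"], *triple["char_span"])
    return {
        "fact_assert": {
            "subject": triple["subject"],
            "predicate": triple["predicate"],
            "object": triple["object"],
            "valid_from_ms": to_ms(chunk["entry_date"]),  # entry date
            "system_from_ms": now_ms,                     # extraction time
        },
        "sm_evidence": {
            "source_frame_id": chunk["frame_id"],
            "span_start_utf8": start,
            "span_end_utf8": end,
            "extractor_id": "org-bridge",
            "extractor_version": model_id,
            "confidence": triple["confidence"],
        },
    }


# Placeholder inputs: the decision text, frame id and model id are invented.
text = "** Key Decisions\n- placeholder decision text"
start = text.index("placeholder")
chunk = {"frame_id": 42, "entry_date": "2026-05-01", "text": text}
triple = {
    "subject": "project:invoicekit",
    "predicate": "decided",
    "object": "placeholder decision text",
    "char_span": (start, len(text)),
    "confidence": 0.9,
}
rows = to_wax_rows(triple, chunk, model_id="example-model", now_ms=1_800_000_000_000)
```

The evidence span points back at the bullet text only, not the heading, so a reviewer can jump straight to the sentence that produced the fact.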

The bitemporal payoff for this corpus

valid_time = the day you wrote the devlog (“what was decided on 2026-05-01”); system_time = when the bridge learned it. So facts(about: project:invoicekit, asOf: <date>) reconstructs what you knew then, and re-running with a better extractor opens new system-time spans without losing history.
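The two time axes can be sketched with a small in-memory model. This is not Wax's query engine: `FactRow`, `facts_as_of`, and the explicit `system_to_ms` column are assumptions, and how Wax's own `asOf` argument maps onto the two axes is not specified here, so both are exposed separately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FactRow:
    subject: str
    predicate: str
    obj: str
    valid_from_ms: int                    # when the devlog entry was written
    system_from_ms: int                   # when the bridge asserted the fact
    system_to_ms: Optional[int] = None    # None means the span is still open


def facts_as_of(rows, about, valid_as_of_ms, system_as_of_ms):
    """Facts about `about` that were valid by one date and known by another."""
    return [
        r for r in rows
        if r.subject == about
        and r.valid_from_ms <= valid_as_of_ms
        and r.system_from_ms <= system_as_of_ms
        and (r.system_to_ms is None or system_as_of_ms < r.system_to_ms)
    ]


# A re-run with a better extractor closes the old span and opens a new one.
rows = [
    FactRow("project:invoicekit", "decided", "old wording", 1000, 100, 200),
    FactRow("project:invoicekit", "decided", "new wording", 1000, 200),
]
before_rerun = facts_as_of(rows, "project:invoicekit", 1000, 150)
after_rerun = facts_as_of(rows, "project:invoicekit", 1000, 250)
```

Querying with a system time before the re-run still returns the old wording, which is the "without losing history" property.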

Identity — the part Wax won’t do for you

Wax does zero co-reference; assertFact binds subjects by exact key. So the bridge owns identity — but your corpus makes it easy.

Seed canonical project entities from the directory names, deterministically, before any LLM: entity_upsert(key:"project:invoicekit", kind:"Project", aliases:["invoicekit","ik"]). The dir tree is your entity table.
Resolve-then-assert for everything else: entity_resolve(surface) → reuse the rank-0 match, else entity_upsert. Commits become entities keyed by short SHA (commit:876d114), so a decision can point at the commit that implemented it — exactly what your orgit-rev links already encode.
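The seed-then-resolve flow above can be sketched against an in-memory stand-in. `EntityTable` is hypothetical and only mimics the shape of `entity_upsert` / `entity_resolve` as named in the text. The exact-match resolution rule, the slug-based key format for new entities, and the 7-character short SHA (matching `commit:876d114`) are assumptions.

```python
import re


class EntityTable:
    """In-memory stand-in for Wax's entity store (hypothetical, for illustration)."""

    def __init__(self):
        self.entities = {}  # key -> {"kind": str, "aliases": set}

    def upsert(self, key, kind, aliases=()):
        entity = self.entities.setdefault(key, {"kind": kind, "aliases": set()})
        entity["aliases"].update(a.casefold() for a in aliases)
        return key

    def resolve(self, surface):
        """Return matching keys, best first. Here: exact key or alias match only."""
        needle = surface.strip().casefold()
        return [k for k, e in self.entities.items()
                if needle == k or needle in e["aliases"]]


def seed_projects(table, project_dirs):
    """Deterministic seeding: one Project entity per top-level directory name."""
    for name in project_dirs:
        table.upsert(f"project:{name}", "Project", [name])


def resolve_or_upsert(table, surface, kind):
    """Reuse the rank-0 match if there is one, otherwise create a new entity."""
    matches = table.resolve(surface)
    if matches:
        return matches[0]
    slug = re.sub(r"[^a-z0-9]+", "-", surface.casefold()).strip("-")
    return table.upsert(f"{kind.casefold()}:{slug}", kind, [surface])


def commit_key(sha):
    """Key a commit entity by its short SHA."""
    return f"commit:{sha[:7]}"
```

In practice `project_dirs` would come from listing the devlogs directory, and hand-added aliases such as `ik` would be upserted once after seeding.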

Controlled predicate vocabulary (mandatory)

Wax interns predicates by exact key, so decided and made_decision would be stored as two distinct predicates. Lock a small set tuned to these logs; the extraction prompt maps prose onto it or drops the candidate.
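A post-extraction guard can enforce the vocabulary even when the prompt slips. In this sketch only `decided` and `made_decision` come from the text; every other vocabulary entry and synonym is a hypothetical placeholder, as is the function name.

```python
import re

# Closed vocabulary. Only "decided" is named in the text; the rest are
# hypothetical placeholders standing in for the real, log-tuned set.
VOCAB = {"decided", "fixed", "observed", "depends_on"}

# Known variants folded onto a canonical predicate (illustrative).
SYNONYMS = {"made_decision": "decided", "resolved": "fixed", "noticed": "observed"}


def normalize_predicate(raw: str):
    """Map a raw predicate onto the vocabulary, or return None to drop it."""
    key = re.sub(r"[^a-z0-9]+", "_", raw.strip().lower()).strip("_")
    key = SYNONYMS.get(key, key)
    return key if key in VOCAB else None
```

Returning `None` rather than guessing keeps off-vocabulary candidates out of the store; they can be logged and used later to decide whether the vocabulary should grow.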

Idempotency, the review gate, and the honest hard parts

What comes free

Re-runs are safe: Wax dedups triples by SHA-256(S,P,O) (UNIQUE(fact_hash)), so re-extracting an unchanged chunk is a no-op.
Skip unchanged files via #+SOURCE-FINGERPRINT / content hash.
Edits supersede: changed facts assert with version_relation:updates → the old span closes, history preserved.
Review gate: route sub-threshold facts to a checklist before they go durable. Wax’s DREAMS.md promotion flow is the model (AgentBrokerService+Markdown.swift:168).
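The re-run safety described in the first item can be reproduced with a unique hash column. This sketch uses SQLite purely as a stand-in: the table layout and the way subject, predicate, and object are joined before hashing are assumptions, not Wax's actual serialization.

```python
import hashlib
import sqlite3


def fact_hash(subject: str, predicate: str, obj: str) -> str:
    """SHA-256 over the triple; the unit-separator join is an assumption."""
    payload = "\x1f".join((subject, predicate, obj)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE facts ("
        "subject TEXT, predicate TEXT, object TEXT, fact_hash TEXT UNIQUE)"
    )
    return db


def assert_fact(db, subject: str, predicate: str, obj: str) -> bool:
    """Insert the triple; return False when the hash already exists."""
    cursor = db.execute(
        "INSERT OR IGNORE INTO facts VALUES (?, ?, ?, ?)",
        (subject, predicate, obj, fact_hash(subject, predicate, obj)),
    )
    return cursor.rowcount == 1


db = open_store()
first = assert_fact(db, "project:invoicekit", "decided", "placeholder decision")
second = assert_fact(db, "project:invoicekit", "decided", "placeholder decision")
```

The second call is a no-op, which is what makes re-extracting an unchanged chunk harmless. Any change in the object text produces a different hash and therefore a new row.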

Decisions for you

Near-dup objects: if the LLM rewords an object on re-run, the triple hash differs → near-duplicate. Mitigate by canonicalizing objects, or dedup on (subject, predicate, source-span) instead of object text.
Predicate governance: closed vocab (~15) vs open-world. Start closed, grow deliberately.
Extractor model: qmd’s qwen3 models are retrieval-tuned, not extraction-tuned. A small instruct model (or your gptel setup) prompted with heading context + the vocab is the realistic path.
Review gate is Markdown-only in Wax — for an org-first flow you’d accept a Markdown review file, or have the bridge manage its own org/.fact-review.org checklist.
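Both mitigations for the near-duplicate problem in the first item can be sketched briefly. `canonical_object` and `SpanDedup` are hypothetical helpers. The normalization only handles case, punctuation, and whitespace, while the span key catches genuine rewording, provided the extractor reports the same source span on each run.

```python
import re
import unicodedata


def canonical_object(text: str) -> str:
    """Fold case, punctuation and whitespace so trivial rewordings hash alike."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


class SpanDedup:
    """Deduplicate on (subject, predicate, source span) instead of object text."""

    def __init__(self):
        self.seen = {}  # key -> first object text seen for that span

    def offer(self, subject, predicate, frame_id, span, obj) -> bool:
        """Return True for a new fact, False if this span already yielded one."""
        key = (subject, predicate, frame_id, span)
        if key in self.seen:
            return False
        self.seen[key] = obj
        return True


dedup = SpanDedup()
first_run = dedup.offer("project:invoicekit", "decided", 42, (19, 44), "wording A")
second_run = dedup.offer("project:invoicekit", "decided", 42, (19, 44), "wording B")
```

The trade-off is that span-keyed dedup silently keeps the first wording. If the re-run's wording is better, that case is closer to an edit and belongs on the `version_relation:updates` path instead.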

Pragmatic build

Given org-first + your Emacs/Babashka stack, the realistic shape is a Babashka/Clojure (or TS) orchestrator: org-parse → subtree chunk → extract → drive Wax via its CLI/daemon (fact_assert already takes an evidence arg; see AgentBrokerService.swift:1187). No Swift needed unless you want the custom-embedder path from Part 2. First step: a 50-file dry run over org/devlogs/invoicekit. The extracted facts will tell you fast whether they’re worth keeping.