Day 38 - June 8, 2026: Governance Before Motion

Turning Thai tone support into an auditable validation path, grounding derived linguistic artifact provenance, cleaning project state, and reassessing UseThai friction readiness.

Day 38 was a governance-before-motion day.

The product desire is easy to state: UseThai should eventually provide tone-aware pronunciation information that helps learners understand how a Thai word should sound.

The engineering responsibility behind that desire is more demanding. Tone data cannot simply appear beside a dictionary entry. If it is generated, the platform needs to know which generator produced it, which version was used, which headword it came from, how its quality was evaluated, and what remains unresolved.

The shape of the day was:

Confirm tone as a product requirement -> govern generated-data lineage
-> record provisional product rulings -> persist a validation protocol
-> clean project state -> reassess whether UseThai can gather useful friction

No tone-generation pipeline was implemented. No generated tone data was ingested. No API integration, production dictionary expansion, Phase 16 work, or Phase 17 work was authorized.

Instead, the day made the future path more auditable before making it visible.

Starting With The Product Requirement

Day 37 established that tone-aware pronunciation is not merely a decorative enhancement for UseThai.

For a learner, romanization without tone can leave out one of the most important parts of pronunciation. That makes tone support a real product requirement.

The first temptation would be to move directly from requirement to implementation:

Find a generator -> produce tone values -> display them in the app

That path would create visible progress quickly, but it would also create unanswered questions inside the data model.

What produced the value? Which generator version was used? Can the result be reproduced? Did the generator receive the exact source headword or a transformed input? What happens when a future generator produces a different result? How was accuracy evaluated before the value reached the product?

The better first step was to make those questions architectural rather than incidental.

Grounding Derived Linguistic Artifact Provenance

The first major outcome was a narrow architectural concept:

derived linguistic artifact provenance

This concept describes the lineage of linguistic values that are generated or precomputed rather than copied directly from a dictionary source.

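The grounded use case is intentionally limited to machine-generated pronunciation and tone surfaces. The required lineage includes, at minimum, the identity of the generator, the generator version, and the input headword the value was derived from.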

The decision also established an important runtime boundary:

Generated tone must be precomputed offline and stored as static data.
Generated tone must not be inferred at runtime.

That distinction supports reproducibility and review. A stored result can be versioned, compared, validated, and replaced through a governed process. Runtime inference would make dictionary behavior depend on an opaque operation performed during a user request.

The concept was documented in ARCHITECTURE.md and accepted through ADR-0017: Derived Linguistic Artifact Provenance.

The scope stayed narrow. Derived linguistic artifact provenance does not replace source provenance, and it does not overload the existing generatedFrom artifact-classification marker. Human-curated override and exception-table provenance also remains deferred.

This was architecture work in service of a future product feature, not the feature itself.

Separating Provenance Concepts

One of the day’s most useful clarifications was that nearby provenance concepts should not be collapsed into one field merely because they all describe where something came from.

DictionarySourceProvenance describes source-shaped lineage. It can explain which dictionary or dataset supplied a record.

A generatedFrom marker classifies an internal artifact relationship.

Derived linguistic artifact provenance answers a different question:

Which governed generator, at which version, transformed which input headword
into this stored linguistic value?

Keeping those concepts separate prevents a future tone value from looking like source-authored dictionary data. It also prevents a general artifact marker from being asked to carry generator identity and reproducibility details it was not designed to represent.

The architecture now has a name for the responsibility without pretending that the final storage shape is already authorized.
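To make the separation concrete, here is a minimal sketch of how the two lineage concepts could stay distinct in types. The interface names for derived provenance, every field name, and the helper function are hypothetical illustrations; no storage shape has been authorized, and the real DictionarySourceProvenance almost certainly carries more than the single assumed field shown here.

```typescript
// Hypothetical shapes for illustration only. No storage shape is authorized.

// Source-shaped lineage: which dictionary or dataset supplied a record.
// The single field below is an assumed placeholder.
interface DictionarySourceProvenance {
  sourceId: string;
}

// Lineage of a generated value: which generator, at which version,
// transformed which input headword.
interface DerivedLinguisticArtifactProvenance {
  generatorId: string;
  generatorVersion: string;
  inputHeadword: string;
}

// A stored tone surface carries derived provenance, so it cannot be
// mistaken for source-authored dictionary data.
interface StoredToneSurface {
  headword: string;
  toneSurface: string;
  derivedProvenance: DerivedLinguisticArtifactProvenance;
}

// Returns the names of required lineage fields that are missing or blank.
function missingLineageFields(
  provenance: Partial<DerivedLinguisticArtifactProvenance>,
): string[] {
  const required: (keyof DerivedLinguisticArtifactProvenance)[] = [
    "generatorId",
    "generatorVersion",
    "inputHeadword",
  ];
  return required.filter((key) => {
    const value = provenance[key];
    return typeof value !== "string" || value.trim() === "";
  });
}

// A record that names the generator and version but not the input headword.
const incompleteLineage = missingLineageFields({
  generatorId: "tltk",
  generatorVersion: "1.10",
});
// incompleteLineage is ["inputHeadword"]
```

A check like this would belong in an offline precompute or review step, not in request handling, which matches the rule that generated tone is stored as static data.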

Recording Provisional Tone Product Rulings

After the provenance concept was grounded, the next work recorded provisional product rulings for a future tone-generation path.

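The current direction is:

Baseline generator candidate: tltk 1.10
Notation candidate: IPA with tone digits
Validation source: a human-annotated gold set
Future product-quality gate: whole-word exact match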

These are provisional rulings, not implementation authorization.

The posture is:

baseline, then validate

It is not:

ship because a generator exists

No dependency was added. No generator was integrated. No validation run was performed. No generated pronunciation records, provenance schema, override table, or precompute pipeline was created.

Recording those boundaries is what keeps an exploratory direction from quietly becoming a production commitment.

Designing A Human-Only Validation Path

The next question was how to learn whether the provisional baseline is good enough.

The validation approach keeps ground truth human-only. A licensed pronunciation-dictionary comparison path remains closed for this track.

That decision avoids turning a quality test into an unclear licensing dependency. It also makes the evaluation question more direct: how closely does the generated learner-facing pronunciation match carefully annotated human judgments?

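The planning work selected a composite future quality structure built around whole-word exact match as the product-quality gate and an independent floor for the irregular stratum.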

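Whole-word exact match matters because a learner sees and relies on the whole pronunciation surface. A generator that is usually correct at the syllable level can still get one syllable wrong in many multi-syllable words, and each of those words is wrong as a whole for the learner. For example, if syllable errors were independent, 95 percent per-syllable accuracy would yield only about 86 percent whole-word accuracy on three-syllable words.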

At the same time, an overall score can hide weaknesses concentrated in difficult categories. The independent irregular-stratum floor is intended to make those weaknesses visible.

No numeric acceptance threshold was set. Thresholds should follow pilot evidence rather than be invented before the project understands the error distribution.
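The way an overall score can hide a weak stratum is easy to see in a small sketch. The types, function names, and the ten synthetic items below are all assumptions made for illustration; they are not pilot data, and no threshold is implied.

```typescript
type Stratum = "regular" | "irregular";

// One scored pilot item: its stratum and whether the generated surface
// matched the human gold annotation as a whole word.
interface ScoredItem {
  stratum: Stratum;
  wholeWordMatch: boolean;
}

interface MatchReport {
  total: number;
  matched: number;
  rate: number | null; // null when there are no items to score
}

function matchReport(items: readonly ScoredItem[]): MatchReport {
  const matched = items.filter((item) => item.wholeWordMatch).length;
  const rate = items.length === 0 ? null : matched / items.length;
  return { total: items.length, matched, rate };
}

// Reports the overall rate alongside each stratum so that a weak
// irregular stratum stays visible next to a healthy overall number.
function summarizeByStratum(items: readonly ScoredItem[]) {
  return {
    overall: matchReport(items),
    regular: matchReport(items.filter((item) => item.stratum === "regular")),
    irregular: matchReport(items.filter((item) => item.stratum === "irregular")),
  };
}

// Synthetic example: eight regular items all match, two irregular items do not.
const syntheticItems: ScoredItem[] = [
  ...Array.from({ length: 8 }, (): ScoredItem => ({ stratum: "regular", wholeWordMatch: true })),
  ...Array.from({ length: 2 }, (): ScoredItem => ({ stratum: "irregular", wholeWordMatch: false })),
];

const syntheticReport = summarizeByStratum(syntheticItems);
// syntheticReport.overall.rate is 0.8 while syntheticReport.irregular.rate is 0
```

Reporting the irregular stratum separately is what lets a future floor apply to it independently of the overall gate.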

Persisting The Pilot Protocol Without Running It

The most concrete planning artifact was:

docs/validation/tone-validation-pilot-protocol.md

The protocol records how a future pilot should be performed without authorizing the pilot itself.

The planned pilot target is 240 items. Approximately 70 to 75 percent of the sample should come from irregular or failure-prone categories, with the remaining 25 to 30 percent providing a regular contrast stratum.
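As a quick check on what those percentages mean in item counts, the arithmetic can be sketched as follows. The function name is hypothetical, and the rounding rule is an assumption; the protocol only says "approximately".

```typescript
// Splits a sample into irregular and regular strata from an integer percentage.
function allocateStrata(
  totalItems: number,
  irregularPercent: number,
): { irregular: number; regular: number } {
  const irregular = Math.round((totalItems * irregularPercent) / 100);
  return { irregular, regular: totalItems - irregular };
}

const lowerBound = allocateStrata(240, 70); // { irregular: 168, regular: 72 }
const upperBound = allocateStrata(240, 75); // { irregular: 180, regular: 60 }
```

So a 240-item pilot would hold roughly 168 to 180 irregular or failure-prone items and 60 to 72 regular contrast items.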

The six planned irregular or failure categories are:

Items should use a primary stratum label with lightweight overlap flags. A subset should be double-annotated, prioritizing difficult categories, and a single independent adjudicator should resolve disagreements under a written segmentation-versus-tone rule.

The tltk comparison would use th2ipa IPA-with-tone-digit output. Syllable separators would be normalized out for whole-word exact match. Per-syllable tone accuracy would only be calculated where segmentation agrees.
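The comparison rules above can be sketched in a few functions. This is a sketch under stated assumptions, not the protocol's implementation: it assumes syllables are separated by "." and that each syllable ends in a single tone digit, and it treats "segmentation agrees" as "same syllable count", which is a simplified stand-in for the written segmentation-versus-tone rule. The strings in the example are synthetic placeholders, not real th2ipa output.

```typescript
// Assumed separator between syllables within one word.
const SYLLABLE_SEPARATOR = ".";

function splitSyllables(surface: string): string[] {
  return surface
    .trim()
    .split(SYLLABLE_SEPARATOR)
    .filter((syllable) => syllable.length > 0);
}

// Whole-word comparison ignores syllable separators entirely.
function normalizeWholeWord(surface: string): string {
  return splitSyllables(surface).join("");
}

function wholeWordExactMatch(generated: string, gold: string): boolean {
  return normalizeWholeWord(generated) === normalizeWholeWord(gold);
}

// Reads the assumed trailing tone digit of a syllable, if present.
function toneOf(syllable: string): string | null {
  const match = /(\d)$/.exec(syllable);
  return match?.[1] ?? null;
}

// Returns null when segmentation disagrees, so that case is reported
// as a divergence instead of being scored as tone errors.
function perSyllableToneAccuracy(
  generated: string,
  gold: string,
): { correct: number; total: number } | null {
  const generatedSyllables = splitSyllables(generated);
  const goldSyllables = splitSyllables(gold);
  if (generatedSyllables.length !== goldSyllables.length) {
    return null;
  }
  let correct = 0;
  for (let i = 0; i < goldSyllables.length; i++) {
    const generatedTone = toneOf(generatedSyllables[i] ?? "");
    const goldTone = toneOf(goldSyllables[i] ?? "");
    if (goldTone !== null && generatedTone === goldTone) {
      correct++;
    }
  }
  return { correct, total: goldSyllables.length };
}

// Synthetic placeholder strings.
const sameWord = wholeWordExactMatch("ka1.ta2", "ka1ta2"); // true
const oneToneWrong = perSyllableToneAccuracy("ka1.ta2", "ka1.ta3"); // { correct: 1, total: 2 }
const divergent = perSyllableToneAccuracy("ka1.ta2", "kata2"); // null
```

Returning null for divergent segmentation keeps the per-syllable figure from silently absorbing segmentation disagreements, which the protocol reports separately.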

The protocol also preserves a deliberate decision boundary around machine learning alternatives. ML comparison remains deferred unless the pilot shows that tltk is clearly weak and a candidate can satisfy runnable-environment, determinism-at-pin, and model-version-pinning requirements.
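The gate reads as a conjunction: every condition has to hold before an ML comparison opens. A minimal sketch, with hypothetical field and function names, might list whichever conditions are still unmet:

```typescript
// Hypothetical evidence record for one ML candidate.
interface MlComparisonEvidence {
  tltkClearlyWeak: boolean; // established by the pilot, not assumed up front
  runnableEnvironment: boolean;
  deterministicAtPin: boolean;
  modelVersionPinned: boolean;
}

// Returns the names of gates that are not yet satisfied.
// ML comparison stays deferred unless this list is empty.
function unmetMlGates(evidence: MlComparisonEvidence): string[] {
  const gates: [string, boolean][] = [
    ["tltkClearlyWeak", evidence.tltkClearlyWeak],
    ["runnableEnvironment", evidence.runnableEnvironment],
    ["deterministicAtPin", evidence.deterministicAtPin],
    ["modelVersionPinned", evidence.modelVersionPinned],
  ];
  return gates.filter(([, satisfied]) => !satisfied).map(([name]) => name);
}

const stillDeferred = unmetMlGates({
  tltkClearlyWeak: false,
  runnableEnvironment: true,
  deterministicAtPin: true,
  modelVersionPinned: true,
});
// stillDeferred is ["tltkClearlyWeak"]
```

Listing the unmet gates, instead of returning a bare boolean, keeps the reason for deferral on record.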

The pilot remains calibration-only and unauthorized. Persisting a protocol is not the same as executing it.

Keeping Local Spike Work Out Of Validation

The day also resolved a small but practical validation problem.

The repository uses .spike-local/ for throwaway local spike artifacts. Git already ignored that directory, but ESLint did not. A local virtual environment under .spike-local/venv/ could therefore cause lint to inspect third-party files and fail for reasons unrelated to the repository’s authored source.

The ESLint ignore block was aligned with the existing Git ignore behavior by excluding:

.spike-local/**

A temporary lint probe confirmed the exclusion worked, and the probe was removed before the tooling change was committed.
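For readers unfamiliar with this kind of exclusion, a sketch of what it can look like follows. It assumes the repository uses ESLint's flat config format and a TypeScript config file; the post names neither, so both the filename and the surrounding structure are assumptions. Only the .spike-local/** pattern comes from the change itself.

```typescript
// eslint.config.ts (assumed filename and flat-config format)
export default [
  {
    // In flat config, an object that contains only `ignores` applies
    // globally, so nothing under .spike-local/ is linted.
    ignores: [".spike-local/**"],
  },
];
```

Keeping the lint exclusion identical to the Git ignore pattern means one rule describes what counts as disposable local work.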

This was a small cleanup, but it protects the quality signal. Validation should report problems in the project, not in disposable local spike dependencies.

Making Session State Operative Again

The governance work created another kind of friction: the Next action field in .claude/SESSION_STATE.md had grown too large to function as a next action.

It was carrying the actual next step alongside durable tone rulings, derived-provenance history, validation methodology, prior spike context, fixture triage, deferred scope, and future authorization gates.

A docs-only cleanup reduced that field to an operative statement:

The durable context was preserved in the appropriate completed and deferred sections. The immediate next action became readable again.

This kind of cleanup is easy to dismiss because it does not change product behavior. In an agent-assisted project, however, session state is part of the operating system. A bloated next action increases the chance that a future session confuses historical context with present authorization.

Returning To UseThai Friction Readiness

With the governance path clearer, the day returned to the current safe product lane: gathering UseThai lookup friction.

That lane has a real limitation.

The application currently has a very small seed bank and no API integrations. It can support constrained testing of:

It cannot yet support strong conclusions about:

That changes the meaning of the next action.

The next friction pass should not pretend to be full dictionary-product testing. It should be a constrained pass against the current tiny seed set, with a second question running alongside it:

Can the current fixture coverage produce useful evidence, or is the evidence
pipeline itself blocked by insufficient realistic lookup coverage?

That is friction-readiness work. It helps determine whether the application can keep learning from its current data or whether a small, governed app-tier demo-data strategy needs to be considered before deeper UX conclusions are possible.

Why The Day Mattered

Day 38 did not produce a flashy feature.

It produced a trustworthy path toward one.

Tone support moved from a valid product desire into a set of governed questions:

What generated this value?
Which version generated it?
Which input did it derive from?
How will humans evaluate it?
What product-quality measure matters?
Which difficult categories need independent scrutiny?
What evidence must exist before implementation is authorized?

ADR-0017 gave generated linguistic data an architectural lineage concept. The provisional product rulings established a baseline-and-validate posture. The pilot protocol made the future evaluation reproducible without pretending that it had already been run.

Project-state cleanup kept those decisions from obscuring the actual next action. The return to UseThai friction testing then exposed a practical truth: evidence gathering is possible, but the current seed data limits what the evidence can prove.

The work made future motion safer because it made today’s boundaries clearer.

Outcome

Day 38 turned tone support from a vague future feature into an auditable, governed path.

Derived linguistic artifact provenance was grounded through architecture documentation and ADR-0017, with generator identity, generator version, and input headword lineage established as the minimum conceptual requirements. Generated tone remains an offline, stored-data concern rather than a runtime inference behavior.

Provisional product rulings recorded tltk 1.10 as the baseline generator candidate, IPA with tone digits as the current notation candidate, a human-annotated gold set as the validation source, and whole-word exact match as the future product-quality gate.

The tone-validation pilot protocol persisted an exploratory 240-item methodology with irregular-category emphasis, regular contrast, subset double-annotation, independent adjudication, segmentation-divergence reporting, and explicit ML-comparison gates.

Local spike workspace hygiene improved the reliability of lint validation. Session-state cleanup made the immediate next action operative again.

The day ended back at the UseThai app tier, with a clearer understanding that friction testing can continue but must remain honest about the limitations of the current tiny seed set.

No tone generation, validation execution, generated data, ingestion, API integration, production dictionary expansion, Phase 16 work, or Phase 17 work was implemented.

Definition Of Done

Day 38 reached a tone-governance and friction-readiness checkpoint:

The day closed with fewer visible features and substantially better footing for the feature work that may eventually follow.