Hallucination Check

Methodology and results of the verification passes on the Japan Year Book and Japan Magazine corpora.

Why this check?

The timeline entries on this site were transcribed from scanned PDFs using AI vision models. Vision models do a good job with nineteenth- and early-twentieth-century English typefaces, but they are also prone to hallucination: emitting plausible-looking text that is not on the page. A dated event that never appeared in the original source is the worst kind of error, because it reads like a fact and can't be caught by spell-check or OCR-confidence scoring.

This page describes how this class of error was checked in two of the four sources — The Japan Year Book (1905–1931, 23 volumes, 119 diary-section pages) and The Japan Magazine (1917–1929, 99 issues, 305 diary-column pages) — and what the check found. The same methodology will be applied to the remaining two sources (Contemporary Japan and Nippon Times Weekly) in a later pass.

Methodology

1. Re-transcribe every page with three independent tools

For each diary/chronicle page, three fresh, fully independent transcriptions were produced using:

- Apple Vision (pure OCR)
- Gemini 2.5 Flash (multimodal vision LLM)
- GPT-5 Mini (multimodal vision LLM)

Running three distinct systems (one pure OCR engine, two multimodal LLMs) guards against shared hallucination modes. If a line really is on the page, at least one of the three tools is almost certain to see it.

2. Fuzzy-match every existing entry against the three transcriptions

For every dated event already in the site's timeline, the event text was compared against the full concatenated output of each tool for the matching volume or issue, using rapidfuzz.fuzz.token_set_ratio with a sliding-window search. The best match score was recorded against each of the three tools, and the highest across the three taken as the event's confidence.

A real entry that is on the page (even if one tool misreads it) typically scores 85–100 against at least one tool. Entries scoring below a threshold were flagged as hallucination candidates. Thresholds in the 60–90 range were piloted, and the manual pass was done at threshold 80, which yielded a candidate list large enough to surface real problems across each corpus and small enough to inspect every one by hand against the source image.
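A minimal sketch of this matching step. The window and step sizes are assumptions, and a difflib-based scorer stands in for rapidfuzz.fuzz.token_set_ratio (the scorer actually used) so the snippet has no third-party dependency:

```python
from difflib import SequenceMatcher


def token_set_ratio(a: str, b: str) -> float:
    """Stand-in for rapidfuzz.fuzz.token_set_ratio: compares sorted token
    sets so word order and duplicated words do not hurt the score.
    Returns 0-100."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    inter = " ".join(sorted(ta & tb))
    sa, sb = " ".join(sorted(ta)), " ".join(sorted(tb))
    r = lambda x, y: 100.0 * SequenceMatcher(None, x, y).ratio()
    return max(r(inter, sa), r(inter, sb), r(sa, sb))


def best_match_score(event_text: str, transcript: str,
                     window: int = 300, step: int = 50) -> float:
    """Slide a character window over one tool's concatenated transcript
    and keep the best score for this event."""
    return max(
        token_set_ratio(event_text, transcript[i:i + window])
        for i in range(0, max(1, len(transcript) - window + 1), step)
    )


# An event's confidence is the best score across the three tools:
# confidence = max(best_match_score(event, t) for t in transcripts)
```

An entry whose best score across all three tools fell below the threshold became a manual-review candidate.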

3. Manual review of every candidate

Each flagged candidate was reviewed by hand against the image of the original page and all three tool transcriptions, and classified as one of: REAL (the entry is on the page; the low score is matching noise, not an error), PARTIAL ERROR (a real entry marred by a typo, wrong date, or dropped word), or HALLUCINATION (the entry does not appear on the page at all).

4. Reverse-direction check for missing entries

The three steps above only catch errors in entries that already exist in the timeline. To find entries missed during the original transcription, the comparison was run in the other direction: every dated entry present in both Gemini 2.5 Flash and GPT-5 Mini (pair similarity ≥ 70) was checked against the timeline’s <year>.md file. When the two independent vision LLMs both saw an event and the timeline did not contain it, the entry was flagged as a candidate missing entry.

Candidates were manually reviewed against the page image. Real misses were added to the transcription; false positives (typically column-bleed artefacts where an OCR tool ran two adjacent column entries together) were discarded.
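Under the same assumptions as before (a plain difflib ratio standing in for the fuzzy scorer; entry lists as plain strings), the reverse-direction check can be sketched as:

```python
from difflib import SequenceMatcher


def sim(a: str, b: str) -> float:
    """Rough 0-100 similarity between two entry strings."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def missing_candidates(gemini_entries, gpt_entries, timeline_entries,
                       pair_threshold=70.0, found_threshold=70.0):
    """Entries that both independent vision LLMs saw but the timeline
    lacks. All three inputs are lists of entry strings."""
    flagged = []
    for entry in gemini_entries:
        # Paired: GPT-5 Mini also saw (approximately) this entry.
        if not any(sim(entry, p) >= pair_threshold for p in gpt_entries):
            continue
        # Missing: no timeline entry matches it.
        if not any(sim(entry, t) >= found_threshold for t in timeline_entries):
            flagged.append(entry)
    return flagged
```

Requiring agreement between the two LLMs before flagging is what keeps single-tool OCR noise out of the candidate list.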

5. Cross-tool date-agreement check

A second kind of error is harder to catch: an event that is in the timeline but under the wrong date. To find these, each GPT-5 Mini transcription was parsed into YYYY.MM.DD — text form (matching the format of the timeline’s .md files) using the per-volume cover range to infer the year, and every GPT entry was compared against the timeline. Where the same event text matched an entry in the .md under a different date, the row was flagged as a date discrepancy.

Each flagged row was then cross-verified against Gemini 2.5 Flash and Apple Vision. When all three tools agreed on one date that differed from the one in the timeline, the correction was applied in bulk. Close calls were reviewed manually against the source page image; a fourth-opinion tiebreak using Gemini 3 Flash was also applied to borderline cases. In every tiebreak sampled, Gemini 3 Flash sided with the three-tool consensus rather than the existing date in the timeline.
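A sketch of the parse-and-compare step, assuming entry lines follow the `YYYY.MM.DD — text` shape described above (the exact line grammar and the text-similarity threshold are assumptions):

```python
import re
from difflib import SequenceMatcher

# One timeline/transcription line: "YYYY.MM.DD — event text"
LINE_RE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})\s+—\s+(.*)$")


def parse(line):
    """Split a line into (date, text), or None if it is not a dated entry."""
    m = LINE_RE.match(line)
    if not m:
        return None
    return f"{m.group(1)}.{m.group(2)}.{m.group(3)}", m.group(4)


def sim(a: str, b: str) -> float:
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def date_discrepancies(tool_lines, timeline_lines, text_threshold=85.0):
    """Rows where a tool transcription and the timeline agree on the
    event text but disagree on the date."""
    timeline = [p for p in map(parse, timeline_lines) if p]
    flagged = []
    for line in tool_lines:
        parsed = parse(line)
        if not parsed:
            continue
        date, text = parsed
        for t_date, t_text in timeline:
            if sim(text, t_text) >= text_threshold and date != t_date:
                flagged.append((text, date, t_date))
    return flagged
```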

Results · The Japan Year Book

Scope: 2,439 dated events across 23 volumes (1905–1931).

| Metric | Count | Detail |
| --- | --- | --- |
| Events checked | 2,439 | 23 volumes, 119 pages |
| Candidates flagged | 51 | below similarity threshold |
| After manual review | 3 | outright hallucinations |

| Verdict | Count | % of candidates | % of all events |
| --- | --- | --- | --- |
| REAL (noise, not an error) | 32 | 62.7% | 1.31% |
| PARTIAL ERROR (typo, wrong date, dropped word) | 16 | 31.4% | 0.66% |
| HALLUCINATION | 3 | 5.9% | 0.12% |
| Total real problems (partial + hallucination) | 19 | 37.3% | 0.78% |

In the Japan Year Book about 1 in 128 entries (0.78%) has a real error of any kind that the check could detect, and about 1 in 813 entries (0.12%) is an outright fabrication. The large majority of errors are single-word or single-digit corruptions of a real entry rather than fabrications. The three hallucinations were scattered, not concentrated in any single volume.

Missing-entry pass

The reverse-direction check surfaced 76 candidate missing entries (1,371 Gemini∩GPT paired entries scanned; 5.5% flag rate). After manual review against the page image and the three tool outputs, the great majority were false positives — typically entries that were in the timeline but whose text had been mangled by the original OCR to the point that the fuzzy match couldn’t link them. Six genuine missing events were added to the timeline; the rest were either already present or turned out to be column-bleed artefacts. All of volume 1905 was excluded from this check because that volume uses a narrative, non-paragraph-per-event structure incompatible with the reverse extractor.

Date-discrepancy pass

The date-discrepancy check surfaced 259 rows where GPT-5 Mini placed a matching event under a different date (from 2,113 GPT entries, after excluding 1905). Cross-verification against Gemini 2.5 Flash and Apple Vision split these rows into three populations: rows where all three tools agreed on a date different from the timeline’s (corrected in bulk), rows where the other two tools sided with the existing timeline date (GPT-5 Mini misreads; left unchanged), and genuinely ambiguous rows (reviewed manually against the page image, with the Gemini 3 Flash tiebreak on borderline cases).

The discrepancies were scattered across the corpus, not concentrated in any one volume. The per-volume rate of corrected date errors ran between 0 and about 3% of events; the corpus-wide rate after correction is substantially lower than before.

Results · The Japan Magazine

Scope: 2,572 dated events across 99 issues (1917–1929). The Japan Magazine diary column (“Monthly Record of Events”, “Month in Progress”, “Editor’s Diary”) was a much shorter column than the Japan Year Book diary, with terser, one-line entries.

| Metric | Count | Detail |
| --- | --- | --- |
| Events checked | 2,572 | 99 issues, 305 pages |
| Candidates flagged | 54 | below similarity threshold |
| After manual review | 25 | outright hallucinations |

| Verdict | Count | % of candidates | % of all events |
| --- | --- | --- | --- |
| REAL (noise, not an error) | 27 | 50.0% | 1.05% |
| PARTIAL ERROR (typo, wrong date, dropped word, swapped event) | 2 | 3.7% | 0.08% |
| HALLUCINATION | 25 | 46.3% | 0.97% |
| Total real problems (partial + hallucination) | 27 | 50.0% | 1.05% |

The Japan Magazine hallucination rate (0.97%, about 1 in 103) is nearly an order of magnitude higher than in the Japan Year Book, but the distribution tells a more specific story: all 25 hallucinated entries come from a single issue (jm-052, December 1923–January 1924) whose original transcription appears to have been produced against the wrong pages of the source PDF, yielding text that the three independent tools do not find anywhere in that issue’s actual diary column. Once that one issue is re-transcribed, the Japan Magazine hallucination rate drops to essentially zero in the sample.

The higher fraction of “REAL” candidates among the flagged items also reflects the nature of the Japan Magazine column: entries are very short (“Jan. 2.—Snowing.”), so a single OCR-substituted word tips the fuzzy-match score below threshold more easily than in the longer year-book entries.

Aggregate across both sources

Combined, the two sources contribute 5,011 dated events to the timeline; 105 of these surfaced as candidates, and 46 were confirmed as real problems of some kind (18 partial errors and 28 outright hallucinations).

| Source | Events | Candidates | Real (noise) | Partial errors | Hallucinations |
| --- | --- | --- | --- | --- | --- |
| Japan Year Book | 2,439 | 51 | 32 (1.31%) | 16 (0.66%) | 3 (0.12%) |
| Japan Magazine | 2,572 | 54 | 27 (1.05%) | 2 (0.08%) | 25 (0.97%) |
| Aggregate | 5,011 | 105 | 59 (1.18%) | 18 (0.36%) | 28 (0.56%) |

About 1 in 180 entries across both sources (0.56%) is a full hallucination, and about 1 in 109 (0.92%) is either a hallucination or a partial error of some kind. Because the Japan Magazine hallucinations are so concentrated in a single issue, the practical fidelity of the corpus after correcting that one issue is closer to the Japan Year Book rate (roughly 0.1% hallucination, 0.8% total errors).

Caveats