Hallucination Check

Methodology and results of the verification passes on the Japan Year Book and Japan Magazine corpora.

Why this check?

The timeline entries on this site were transcribed from scanned PDFs using AI vision models. Vision models do a good job with nineteenth- and early-twentieth-century English typefaces, but they are also prone to hallucination: emitting plausible-looking text that is not on the page. A dated event that never appeared in the original source is the worst kind of error, because it reads like a fact and can't be caught by spell-check or OCR-confidence scoring.

This page describes how this class of error was checked in two of the four sources — The Japan Year Book (1905–1931, 23 volumes, 119 diary-section pages) and The Japan Magazine (1917–1929, 99 issues, 305 diary-column pages) — and what the check found. The same methodology will be applied to the remaining two sources (Contemporary Japan and Nippon Times Weekly) in a later pass.

Methodology

1. Re-transcribe every page with three independent tools

For each diary/chronicle page, three fresh, fully independent transcriptions were produced using:

- Apple Vision (pure OCR)
- Gemini 2.5 Flash (multimodal vision LLM)
- GPT-5 Mini (multimodal vision LLM)

Running three distinct systems (one pure OCR engine, two multimodal LLMs) guards against shared hallucination modes. If a line really is on the page, at least one of the three tools is almost certain to see it.

2. Fuzzy-match every existing entry against the three transcriptions

For every dated event already in the site's timeline, the event text was compared against the full concatenated output of each tool for the matching volume or issue, using rapidfuzz.fuzz.token_set_ratio with a sliding-window search. The best match score was recorded against each of the three tools, and the highest across the three taken as the event's confidence.

A real entry that is on the page (even if one tool misreads it) typically scores 85–100 against at least one tool. Entries scoring below a threshold were flagged as hallucination candidates. Thresholds in the 60–90 range were piloted, and the manual pass was done at threshold 80, which yielded a candidate list large enough to surface real problems across each corpus and small enough to inspect every one by hand against the source image.
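A minimal sketch of this matching step. The window and step sizes are assumptions, and a difflib-based scorer stands in for rapidfuzz.fuzz.token_set_ratio (the scorer actually used) so the snippet has no third-party dependency:

```python
from difflib import SequenceMatcher


def token_set_ratio(a: str, b: str) -> float:
    """Stand-in for rapidfuzz.fuzz.token_set_ratio: compares sorted token
    sets so word order and duplicated words do not hurt the score.
    Returns 0-100."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    inter = " ".join(sorted(ta & tb))
    sa, sb = " ".join(sorted(ta)), " ".join(sorted(tb))
    r = lambda x, y: 100.0 * SequenceMatcher(None, x, y).ratio()
    return max(r(inter, sa), r(inter, sb), r(sa, sb))


def best_match_score(event_text: str, transcript: str,
                     window: int = 300, step: int = 50) -> float:
    """Slide a character window over one tool's concatenated transcript
    and keep the best score for this event."""
    return max(
        token_set_ratio(event_text, transcript[i:i + window])
        for i in range(0, max(1, len(transcript) - window + 1), step)
    )


# An event's confidence is the best score across the three tools:
# confidence = max(best_match_score(event, t) for t in transcripts)
```

An entry whose best score across all three tools fell below the threshold became a manual-review candidate.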

3. Manual review of every candidate

Each flagged candidate was reviewed by hand against the image of the original page and all three tool transcriptions, and classified as one of: REAL (the entry is on the page; the low score is matching noise, not an error), PARTIAL ERROR (a real entry marred by a typo, wrong date, or dropped word), or HALLUCINATION (the entry does not appear on the page at all).

4. Reverse-direction check for missing entries

The three steps above only catch errors in entries that already exist in the timeline. To find entries missed during the original transcription, the comparison was run in the other direction: every dated entry present in both Gemini 2.5 Flash and GPT-5 Mini (pair similarity ≥ 70) was checked against the timeline’s <year>.md file. When the two independent vision LLMs both saw an event and the timeline did not contain it, the entry was flagged as a candidate missing entry.

Candidates were manually reviewed against the page image. Real misses were added to the transcription; false positives (typically column-bleed artefacts where an OCR tool ran two adjacent column entries together) were discarded.
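Under the same assumptions as before (a plain difflib ratio standing in for the fuzzy scorer; entry lists as plain strings), the reverse-direction check can be sketched as:

```python
from difflib import SequenceMatcher


def sim(a: str, b: str) -> float:
    """Rough 0-100 similarity between two entry strings."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def missing_candidates(gemini_entries, gpt_entries, timeline_entries,
                       pair_threshold=70.0, found_threshold=70.0):
    """Entries that both independent vision LLMs saw but the timeline
    lacks. All three inputs are lists of entry strings."""
    flagged = []
    for entry in gemini_entries:
        # Paired: GPT-5 Mini also saw (approximately) this entry.
        if not any(sim(entry, p) >= pair_threshold for p in gpt_entries):
            continue
        # Missing: no timeline entry matches it.
        if not any(sim(entry, t) >= found_threshold for t in timeline_entries):
            flagged.append(entry)
    return flagged
```

Requiring agreement between the two LLMs before flagging is what keeps single-tool OCR noise out of the candidate list.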

5. Cross-tool date-agreement check

A second kind of error is harder to catch: an event that is in the timeline but under the wrong date. To find these, each GPT-5 Mini transcription was parsed into YYYY.MM.DD — text form (matching the format of the timeline’s .md files) using the per-volume cover range to infer the year, and every GPT entry was compared against the timeline. Where the same event text matched an entry in the .md under a different date, the row was flagged as a date discrepancy.

Each flagged row was then cross-verified against Gemini 2.5 Flash and Apple Vision. When all three tools agreed on one date that differed from the one in the timeline, the correction was applied in bulk. Close calls were reviewed manually against the source page image; a fourth-opinion tiebreak using Gemini 3 Flash was also applied to borderline cases. In every tiebreak sampled, Gemini 3 Flash sided with the three-tool consensus rather than the existing date in the timeline.
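A sketch of the parse-and-compare step, assuming entry lines follow the `YYYY.MM.DD — text` shape described above (the exact line grammar and the text-similarity threshold are assumptions):

```python
import re
from difflib import SequenceMatcher

# One timeline/transcription line: "YYYY.MM.DD — event text"
LINE_RE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})\s+—\s+(.*)$")


def parse(line):
    """Split a line into (date, text), or None if it is not a dated entry."""
    m = LINE_RE.match(line)
    if not m:
        return None
    return f"{m.group(1)}.{m.group(2)}.{m.group(3)}", m.group(4)


def sim(a: str, b: str) -> float:
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def date_discrepancies(tool_lines, timeline_lines, text_threshold=85.0):
    """Rows where a tool transcription and the timeline agree on the
    event text but disagree on the date."""
    timeline = [p for p in map(parse, timeline_lines) if p]
    flagged = []
    for line in tool_lines:
        parsed = parse(line)
        if not parsed:
            continue
        date, text = parsed
        for t_date, t_text in timeline:
            if sim(text, t_text) >= text_threshold and date != t_date:
                flagged.append((text, date, t_date))
    return flagged
```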

Results · The Japan Year Book

Scope: 2,439 dated events across 23 volumes (1905–1931).

| Metric | Count | Detail |
| --- | --- | --- |
| Events checked | 2,439 | 23 volumes, 119 pages |
| Candidates flagged | 51 | below similarity threshold |
| After manual review | 3 | outright hallucinations |

| Verdict | Count | % of candidates | % of all events |
| --- | --- | --- | --- |
| REAL (noise, not an error) | 32 | 62.7% | 1.31% |
| PARTIAL ERROR (typo, wrong date, dropped word) | 16 | 31.4% | 0.66% |
| HALLUCINATION | 3 | 5.9% | 0.12% |
| Total real problems (partial + hallucination) | 19 | 37.3% | 0.78% |

In the Japan Year Book about 1 in 128 entries (0.78%) has a real error of any kind that the check could detect, and about 1 in 813 entries (0.12%) is an outright fabrication. The large majority of errors are single-word or single-digit corruptions of a real entry rather than fabrications. The three hallucinations were scattered, not concentrated in any single volume.

Missing-entry pass

The reverse-direction check surfaced 76 candidate missing entries (1,371 Gemini∩GPT paired entries scanned; 5.5% flag rate). After manual review against the page image and the three tool outputs, the great majority were false positives — typically entries that were in the timeline but whose text had been mangled by the original OCR to the point that the fuzzy match couldn’t link them. Six genuine missing events were added to the timeline; the rest were either already present or turned out to be column-bleed artefacts. All of volume 1905 was excluded from this check because that volume uses a narrative, non-paragraph-per-event structure incompatible with the reverse extractor.

Date-discrepancy pass

The date-discrepancy check surfaced 259 rows where GPT-5 Mini placed a matching event under a different date (from 2,113 GPT entries, after excluding 1905). Cross-verification against Gemini 2.5 Flash and Apple Vision split these rows into three populations: rows where all three tools agreed on a date different from the timeline’s (corrected in bulk), rows where the other two tools sided with the existing timeline date (GPT-5 Mini misreads; left unchanged), and genuinely ambiguous rows (reviewed manually against the page image, with the Gemini 3 Flash tiebreak on borderline cases).

The discrepancies were scattered across the corpus, not concentrated in any one volume. The per-volume rate of corrected date errors ran between 0 and about 3% of events; the corpus-wide rate after correction is substantially lower than before.

Results · The Japan Magazine

Scope: 2,572 dated events across 99 issues (1917–1929). The Japan Magazine diary column (“Monthly Record of Events”, “Month in Progress”, “Editor’s Diary”) was a much shorter column than the Japan Year Book diary, with terser, one-line entries.

| Metric | Count | Detail |
| --- | --- | --- |
| Events checked | 2,572 | 99 issues, 305 pages |
| Candidates flagged | 54 | below similarity threshold |
| After manual review | 25 | outright hallucinations |

| Verdict | Count | % of candidates | % of all events |
| --- | --- | --- | --- |
| REAL (noise, not an error) | 27 | 50.0% | 1.05% |
| PARTIAL ERROR (typo, wrong date, dropped word, swapped event) | 2 | 3.7% | 0.08% |
| HALLUCINATION | 25 | 46.3% | 0.97% |
| Total real problems (partial + hallucination) | 27 | 50.0% | 1.05% |

The Japan Magazine hallucination rate (0.97%, about 1 in 103) is nearly an order of magnitude higher than in the Japan Year Book, but the distribution tells a more specific story: all 25 hallucinated entries come from a single issue (jm-052, December 1923–January 1924) whose original transcription appears to have been produced against the wrong pages of the source PDF, yielding text that the three independent tools do not find anywhere in that issue’s actual diary column. Once that one issue is re-transcribed, the Japan Magazine hallucination rate drops to essentially zero in the sample.

The higher fraction of “REAL” candidates among the flagged items also reflects the nature of the Japan Magazine column: entries are very short (“Jan. 2.—Snowing.”), so a single OCR-substituted word tips the fuzzy-match score below threshold more easily than in the longer year-book entries.

Aggregate across both sources

Combined, the two sources contribute 5,011 dated events to the timeline; 105 of these surfaced as candidates, and 46 were confirmed as real problems of some kind (18 partial errors and 28 outright hallucinations).

| Source | Events | Candidates | Real (noise) | Partial errors | Hallucinations |
| --- | --- | --- | --- | --- | --- |
| Japan Year Book | 2,439 | 51 | 32 (1.31%) | 16 (0.66%) | 3 (0.12%) |
| Japan Magazine | 2,572 | 54 | 27 (1.05%) | 2 (0.08%) | 25 (0.97%) |
| Aggregate | 5,011 | 105 | 59 (1.18%) | 18 (0.36%) | 28 (0.56%) |

About 1 in 180 entries across both sources (0.56%) is a full hallucination, and about 1 in 109 (0.92%) is either a hallucination or a partial error of some kind. Because the Japan Magazine hallucinations are so concentrated in a single issue, the practical fidelity of the corpus after correcting that one issue is closer to the Japan Year Book rate (roughly 0.1% hallucination, 0.8% total errors).

Caveats