Methodology and results of the verification passes on the Japan Year Book and Japan Magazine corpora.
The timeline entries on this site were transcribed from scanned PDFs using AI vision models. Vision models do a good job with nineteenth- and early-twentieth-century English typefaces, but they are also prone to hallucination: emitting plausible-looking text that is not on the page. A dated event that never appeared in the original source is the worst kind of error, because it reads like a fact and can't be caught by spell-check or OCR-confidence scoring.
This page describes how this class of error was checked in two of the four sources — The Japan Year Book (1905–1931, 23 volumes, 119 diary-section pages) and The Japan Magazine (1917–1929, 99 issues, 305 diary-column pages) — and what the check found. The same methodology will be applied to the remaining two sources (Contemporary Japan and Nippon Times Weekly) in a later pass.
For each diary/chronicle page, three fresh, fully independent transcriptions were produced using:

- Apple Vision (VNRecognizeTextRequest, accurate mode)
- Gemini 2.5 Flash
- GPT-5 Mini

Running three distinct systems, one pure OCR and two multimodal LLMs, guards against shared hallucination modes. If a line really is on the page, at least one of the three tools is almost certain to see it.
For every dated event already in the site's timeline, the event text was compared against the full concatenated output of each tool for the matching volume or issue, using `rapidfuzz.fuzz.token_set_ratio` with a sliding-window search. The best match score was recorded against each of the three tools, and the highest of the three was taken as the event's confidence.
A real entry that is on the page (even if one tool misreads it) typically scores 85–100 against at least one tool. Entries scoring below a threshold were flagged as hallucination candidates. Thresholds in the 60–90 range were piloted; the manual pass used threshold 80, which yielded a candidate list large enough to surface real problems in each corpus yet small enough to inspect every candidate by hand against the source image.
Each flagged candidate was reviewed by hand against the image of the original page and all three tool transcriptions, and classified as one of:

- REAL: the entry is on the page; the low score was fuzzy-match noise, not an error
- PARTIAL ERROR: the entry is on the page but corrupted (typo, wrong date, dropped word)
- HALLUCINATION: the entry does not appear on the page at all
The three steps above only catch errors in entries that already exist in the timeline. To find entries missed during the original transcription, the comparison was run in the other direction: every dated entry present in both Gemini 2.5 Flash and GPT-5 Mini (pair similarity ≥ 70) was checked against the timeline's `<year>.md` file. When the two independent vision LLMs both saw an event and the timeline did not contain it, the entry was flagged as a candidate missing entry.
Candidates were manually reviewed against the page image. Real misses were added to the transcription; false positives (typically column-bleed artefacts where an OCR tool ran two adjacent column entries together) were discarded.
A second kind of error is harder to catch: an event that is in the timeline but under the wrong date. To find these, each GPT-5 Mini transcription was parsed into `YYYY.MM.DD` + text form (matching the format of the timeline's `.md` files), using the per-volume cover range to infer the year, and every GPT entry was compared against the timeline. Where the same event text matched an entry in the `.md` under a different date, the row was flagged as a date discrepancy.
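The parsing step can be illustrated as below. The diary-column line format (`Jan. 21.—…`) and the single-year inference are simplifying assumptions; real volumes spanning a year boundary would need the full cover range:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

def parse_diary(text: str, year: int):
    """Parse lines like 'Jan. 21.—The Diet reopens.' into
    ('YYYY.MM.DD', text) pairs, with the year inferred from the
    volume cover (a simplification of the per-volume range)."""
    entries = []
    for line in text.splitlines():
        m = re.match(
            r"\s*([A-Z][a-z]{2})[a-z]*\.?\s+(\d{1,2})\.?\s*[—–-]*\s*(.+)",
            line)
        if m and m.group(1) in MONTHS:
            date = f"{year}.{MONTHS[m.group(1)]:02d}.{int(m.group(2)):02d}"
            entries.append((date, m.group(3).strip()))
    return entries
```

Lines without a recognizable month-day prefix (running heads, narrative text) are simply skipped rather than guessed at.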
Each flagged row was then cross-verified against Gemini 2.5 Flash and Apple Vision. When all three tools agreed on one date that differed from the one in the timeline, the correction was applied in bulk. Close calls were reviewed manually against the source page image; a fourth-opinion tiebreaker using Gemini 3 Flash was also applied to borderline cases. In every tiebreak sampled, Gemini 3 Flash sided with the three-tool consensus rather than the existing date in the timeline.
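The bulk-correction rule amounts to a small decision function. A sketch under stated assumptions (the function shape and return convention are illustrative, not the project's code):

```python
def apply_consensus(timeline_date, gpt, gemini, apple):
    """Decide a flagged date-discrepancy row: correct in bulk only
    when all three tools agree on one date that differs from the
    timeline; send everything ambiguous to manual review."""
    if gpt == gemini == apple:
        if gpt != timeline_date:
            return ("correct", gpt)      # unanimous tools vs. the timeline
        return ("keep", timeline_date)   # tools confirm the timeline
    return ("review", None)              # close call: manual check / tiebreaker
```

Requiring unanimity before a bulk edit means a single tool's misread digit can never overwrite a correct timeline date on its own.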
Scope: 2,439 dated events across 23 volumes (1905–1931).

- Events checked: 2,439
- Candidates flagged: 51
- Confirmed hallucinations: 3
| Verdict | Count | % of candidates | % of all events |
|---|---|---|---|
| REAL (noise, not an error) | 32 | 62.7% | 1.31% |
| PARTIAL ERROR (typo, wrong date, dropped word) | 16 | 31.4% | 0.66% |
| HALLUCINATION | 3 | 5.9% | 0.12% |
| Total real problems (partial + hallucination) | 19 | 37.3% | 0.78% |
In the Japan Year Book, about 1 in 128 entries (0.78%) has a real error of any kind that the check could detect, and about 1 in 813 entries (0.12%) is an outright fabrication. The large majority of errors are single-word or single-digit corruptions of a real entry rather than fabrications. The three hallucinations were scattered, not concentrated in any single volume.
The reverse-direction check surfaced 76 candidate missing entries (1,371 Gemini∩GPT paired entries scanned; 5.5% flag rate). After manual review against the page image and the three tool outputs, the great majority were false positives — typically entries that were in the timeline but whose text had been mangled by the original OCR to the point that the fuzzy match couldn’t link them. Six genuine missing events were added to the timeline; the rest were either already present or turned out to be column-bleed artefacts. All of volume 1905 was excluded from this check because that volume uses a narrative, non-paragraph-per-event structure incompatible with the reverse extractor.
The date-discrepancy check surfaced 259 rows where GPT-5 Mini placed a matching event under a different date (from 2,113 GPT entries, after excluding 1905). Cross-verification against Gemini and Apple Vision sorted these rows into three populations: rows where all three tools agreed on a date that differed from the timeline's (corrected in bulk), rows where the tools backed the date already in the timeline (left unchanged), and close calls (reviewed manually against the source page image).

The discrepancies were scattered across the corpus, not concentrated in any one volume. The per-volume rate of corrected date errors ran between 0 and about 3% of events; the corpus-wide rate after correction is substantially lower than before.
Scope: 2,572 dated events across 99 issues (1917–1929). The Japan Magazine diary column (“Monthly Record of Events”, “Month in Progress”, “Editor’s Diary”) was a much shorter column than the Japan Year Book diary, with terser, one-line entries.

- Events checked: 2,572
- Candidates flagged: 54
- Confirmed hallucinations: 25
| Verdict | Count | % of candidates | % of all events |
|---|---|---|---|
| REAL (noise, not an error) | 27 | 50.0% | 1.05% |
| PARTIAL ERROR (typo, wrong date, dropped word, swapped event) | 2 | 3.7% | 0.08% |
| HALLUCINATION | 25 | 46.3% | 0.97% |
| Total real problems (partial + hallucination) | 27 | 50.0% | 1.05% |
The Japan Magazine hallucination rate (0.97%, about 1 in 103) is nearly an order of magnitude higher than in the Japan Year Book, but the distribution tells a more specific story: all 25 hallucinated entries come from a single issue (jm-052, December 1923–January 1924) whose original transcription appears to have been produced against the wrong pages of the source PDF, yielding text that the three independent tools do not find anywhere in that issue’s actual diary column. Once that one issue is re-transcribed, the Japan Magazine hallucination rate drops to essentially zero in the sample.
The higher fraction of “REAL” candidates among the flagged items also reflects the nature of the Japan Magazine column: entries are very short (“Jan. 2.—Snowing.”), so a single OCR-substituted word tips the fuzzy-match score below threshold more easily than in the longer year-book entries.
Combined, the two sources contribute 5,011 dated events to the timeline; 105 of these surfaced as candidates, and 46 were confirmed as real problems of some kind (18 partial errors and 28 outright hallucinations).
| Source | Events | Candidates | Real (noise) | Partial errors | Hallucinations |
|---|---|---|---|---|---|
| Japan Year Book | 2,439 | 51 | 32 (1.31%) | 16 (0.66%) | 3 (0.12%) |
| Japan Magazine | 2,572 | 54 | 27 (1.05%) | 2 (0.08%) | 25 (0.97%) |
| Aggregate | 5,011 | 105 | 59 (1.18%) | 18 (0.36%) | 28 (0.56%) |
About 1 in 180 entries across both sources (0.56%) is a full hallucination, and about 1 in 109 (0.92%) is either a hallucination or a partial error of some kind. Because the Japan Magazine hallucinations are so concentrated in a single issue, the practical fidelity of the corpus after correcting that one issue is closer to the Japan Year Book rate (roughly 0.1% hallucination, 0.8% total errors).