Want to AI-proof your degree? Study History!

Can AI do history?

Someone sent me a link to the paper “Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination.” Gao et al. are trying to get LLMs to demonstrate higher-order skills in historical reasoning, using a new benchmark, ProHist-Bench. They determine that no, they can’t. LLMs still hallucinate, and more importantly, they answer questions wrong. My problem is not that they get questions wrong, but that the people doing this don’t seem to know what doing history is.

I suppose that part of the problem is defining what “doing history” actually is. AI can make music, if you define music as orderly sounds coming out of a box. If you define music as a form of art created by people, then obviously it can’t. What is doing history? Gao et al. set their tasks as answering questions (some “easy” and some “hard”) about the exam system in Imperial China and writing exam essays in the proper baguwen 八股文 style.

Asking the AI to write a baguwen essay is actually kind of interesting. On the one hand, it is absolutely not the type of thing a real historian would want to do. On the other hand, it does seem like something an AI could do. More importantly, the civil service exam essay is a good example of what a “school” question is. The purpose of the civil service exams in Late Imperial China was to staff the bureaucracy with classically educated men who had demonstrated their literary skill and moral character through the exams, most notably through the eight-legged essay, which became the official format for exam essays in 1487.1 The exams also did other things, like creating a common culture among those who took them, defining orthodox ideas, and creating a class of exam “failures” looking for work, etc.

The exams, and the baguwen in particular, were criticized by those who failed them in part because they did not work. Was this really the ideal way (or even an effective way) for men to prove that they had the knowledge of classical ideals and the moral character needed to serve the Son of Heaven in bringing order to the world? Was this not forcing the intellectual world in the direction of answering pointless questions in a poor format? This is actually a problem for any form of testing in education. You are trying to find something that is an efficient proxy for what you want. Answering multiple choice exam questions is not really a “task” that you will need to do in the REAL WORLD (as opposed to the unreal world of books and understanding in school) but if you can answer all those questions maybe you actually do understand whatever it is we want you to understand. Multiple choice questions are easy to grade, and it is easy to claim that the grading is objective (just like the baguwen). Exams work in the sense of providing quality control (they answered the multiple choice questions, they probably know something) and in the sense of encouraging students to study. The easiest way to be able to answer questions that make it seem like you have read and thought about Harrison’s Man Awakened from Dreams MIGHT be to read it and think about it.

Exams don’t work, however, if someone else takes the exam for you. This is why people cheated on the civil service exams, and this was why there was a whole industry of creating model exam essays so that people could give the impression of understanding without actually having to understand.

That is what AI does, of course, and there are various ways to avoid that, just as the civil service exams tried to prevent simple cheating and tried to come up with sets of questions and tasks that would select for the people they wanted. If you have students do an oral exam, or write the exam in class, then it is harder (but not impossible) to cheat, in the sense of giving the impression that you know something that you in fact do not. There are drawbacks to this, of course, such as it being a lot more work for the faculty, and work that neither your bosses nor your students want you to do. Still, there is nothing really revolutionary about AI cheating. It is no different from downloading an essay from a cheating site, it’s just that now it is a curated, artisanal cheating essay that fits whatever prompt you were given. You can even get the AI to fix it in response to your commands.

But is answering exam questions really doing history, rather than an easily assessed proxy for it? The answer would seem to be “no”.

First, sources. Historians spend a lot of time thinking about sources. The authors give us a list of the things they have dumped into their LLM. If you look at the list, they have a bunch of “ancient” things from the Qing (that may just be a translation error), which seem rather random to me (this is not my field). Then they have authoritative monographs and top-tier academic papers. These also seem pretty random. They are also all Chinese secondary sources, which probably skews them towards answering the very nuts-and-bolts sort of technical questions they are about to ask. The two English-language sources are Elman’s A Cultural History of Civil Examinations in Late Imperial China (2000). Ok… And then Harrison’s The Man Awakened from Dreams (2005). Why???? How are they looking for sources that will answer their questions, or adjusting their questions to what their sources can show? Got me there. There is really nothing here on the relationship between sources and interpretation, or even any understanding of what those things mean.

The questions themselves are also not great. Defining terms like “Zhujuan” 朱卷 (p. 30) or giving a short answer to “What were the characteristics of the imperial examination system in the Yuan Dynasty?” (p. 27). These seem like tasks for an undergraduate exam written by a rather dull teacher, rather than a test for something with a brain the size of a planet. I think you could answer most of these with Wilkinson. They are absolutely not something you could publish in a journal or put in your dissertation or your undergraduate honors thesis. I mention those things since those are the sort of goals you have for “doing history” when you get past the taking-exams phase and get into actually doing history.

This is not really doing history. This is trying to get AI to answer exam questions. And failing. Why do that? I just don’t see how this is a first step toward AI replacing real history.

I also looked at some of the other studies they cite, and they are not much better. Hauser et al. have created HiST-LLM. Their “benchmark shows that while LLMs possess some expert-level historical knowledge, there is considerable room for improvement.”

Here the only goal is answering multiple-choice questions. There is a much larger “data set,” but they seem to have very different things dumped into it with no idea what they are or what questions they could be used for. They are aware that

compiling such data poses several challenges. First, given the wide range of theoretical questions and approaches in history and archaeology, deciding which variables to record and whether they are best conceptualized as attributes [17] or events [18] and whether they should be recorded as numerical or categorical already poses a challenge. Second, covering more than one or a few regions across different time periods is challenging since experts in history and archaeology specialize on particular regions and time periods. Therefore a comprehensive regional and temporal coverage requires engaging with multiple experts and academic resources. Third, most history and archaeology research focuses on recording well-established facts, but scarcely records that a vast amount of historical knowledge is inferred.

Again, this does not seem connected to what historians actually do. History “focuses on recording well-established facts”? That is not even wrong. The questions are bad, and the answers often wrong, in the sense that the AI sometimes gets it wrong and they sometimes grade it wrong.

Question:
During the time frame from 1568 CE to 1603 CE, was the characteristic ’Mutilation’, associated with the cults and rituals held by people of the ’Japan – Azuchi-Momoyama’ polity, present, inferred present, inferred absent, or absent?
Options:
A: Present, B: Inferred Present, C: Inferred Absent, D: Absent
Reasoning and evidence:
The high casualty rate of the Joseon and Ming forces, and the large number of ears collected during the campaign was enough to build a large mound near Hideyoshi’s Great Buddha, called the Mimizuka (“Mound of Ears”).
Answer:
A

Or

Question:
The characteristic ’Shields’ is categorized under ’Armor’. Was it present, inferred present, inferred absent, or absent for the polity called ’Latium – Bronze Age’, during the time frame from 1800 BCE to 900 BCE?
Options:
A: Present, B: Inferred Present, C: Inferred Absent, D: Absent
Reasoning and evidence:
Weapons, statuettes, and “double shields” found in male burials suspected to infer elite military or religious status.
Answer:
A

I think the first answer is wrong if they mean that “mutilation” was part of religious rituals in Azuchi-Momoyama based on the ear mound. The second one is, I think, correct, but why is that a question worth asking and answering? If the goal of AI is to help students cheat on tests written by an AI, this may be of some use, but I can’t see any way that it can lead to doing history. You can’t get the right answers if you ask the wrong questions.


  1. Elman, Civil Examinations and Meritocracy, p. 57.
