{"id":9169,"date":"2023-03-04T16:01:05","date_gmt":"2023-03-04T16:01:05","guid":{"rendered":"https:\/\/froginawell.net\/frog\/?p=9169"},"modified":"2023-03-04T17:27:55","modified_gmt":"2023-03-04T17:27:55","slug":"cleaning-up-tables-from-primary-sources-in-chatgpt","status":"publish","type":"post","link":"https:\/\/froginawell.net\/frog\/2023\/03\/cleaning-up-tables-from-primary-sources-in-chatgpt\/","title":{"rendered":"Cleaning Up Tables from Primary Sources in ChatGPT"},"content":{"rendered":"<p>I&#8217;ve been following with interest the debates around the rapid emergence of powerful large language models such as OpenAI&#8217;s ChatGPT, its Bing sibling Sydney, Meta&#8217;s Galactica, and Google&#8217;s Bard. One important recent discussion of this can be found <a href=\"https:\/\/nymag.com\/intelligencer\/article\/ai-artificial-intelligence-chatbots-emily-m-bender.html\">here<\/a>. My current status: deep concern mixed with pragmatic curiosity.<\/p>\n<p><a href=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090436-scaled.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft wp-image-9170 size-medium\" src=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090436-182x300.jpg\" alt=\"Summation of United States Army Military Government Activities in Korea (March, 1946)\" width=\"182\" height=\"300\" srcset=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090436-182x300.jpg 182w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090436-700x1153.jpg 700w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090436-768x1264.jpg 768w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090436-933x1536.jpg 933w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090436-1244x2048.jpg 1244w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090436-800x1317.jpg 800w, 
https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090436-scaled.jpg 1555w\" sizes=\"auto, (max-width: 182px) 100vw, 182px\" \/><\/a><\/p>\n<p>Given the propensity of ChatGPT (mid-February, 2023 version) to happily invent facts, people, nonexistent citations, and quotations, I&#8217;m not yet too worried about how this impacts historical essays produced by students. However, while this shortcoming may offer only temporary relief as these models evolve, it also limits their usefulness for quick information lookups on topics you do not already know well enough to call bullshit on. So are there any current use cases for historians? I stumbled on one potential use through <a href=\"https:\/\/sigmoid.social\/@dhoe\/109964995289479309\">a post<\/a> on Mastodon: apparently, ChatGPT is not bad at cleaning up and formatting tables from raw text.<\/p>\n<p>To test this, I took some very badly formatted data from a single table randomly chosen from my photo of a March, 1946 issue of a <em>Summation of United States Army Military Government Activities in Korea. 
<\/em>Here is a view of the original table:<\/p>\n<p><a href=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090443-1.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-9172 aligncenter\" src=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090443-1-700x445.jpg\" alt=\"\" width=\"700\" height=\"445\" srcset=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090443-1-700x445.jpg 700w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090443-1-300x191.jpg 300w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090443-1-768x488.jpg 768w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090443-1-1536x976.jpg 1536w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090443-1-800x508.jpg 800w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/P1090443-1.jpg 1602w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/a><\/p>\n<p>Here is the poorly formatted text extracted from this:<\/p>\n<pre>From   To    Quantity (in suk)\nCholla Pukto  Seoul  32,000\n             Inchon       6,000\n                           Mukko       8,000\n                       Chechon  2,000\n                       Wonju  2,000\n                       Chunchon    8,000\n                __________\n                    58,000\nCholla Namdo    Mukko              8,000\nChung Chong Pukto       Seoul 4,500\nChung Chong Namdo   Seoul 36,000\n                Inchon          6,000\n                    ______\n                         42,000\nKyong Sang Pukto   Chechon       5,000\n                 Mukko    7,0000\n                    _______________\n                    12,000\n\n<\/pre>\n<p>Now give ChatGPT the instructions to clean up the table:<\/p>\n<p><a href=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.24.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft size-large 
wp-image-9176\" src=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.24-700x424.png\" alt=\"\" width=\"700\" height=\"424\" srcset=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.24-700x424.png 700w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.24-300x182.png 300w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.24-768x466.png 768w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.24-1536x931.png 1536w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.24-800x485.png 800w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.24.png 1874w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/a><\/p>\n<p>This produced the following:<\/p>\n<p><a href=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.53.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft size-large wp-image-9177\" src=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.53-700x410.png\" alt=\"\" width=\"700\" height=\"410\" srcset=\"https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.53-700x410.png 700w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.53-300x176.png 300w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.53-768x450.png 768w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.53-1536x900.png 1536w, https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.53-800x469.png 800w, 
https:\/\/froginawell.net\/frog\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-04-at-15.55.53.png 1952w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/a><\/p>\n<p>Selecting the text from the entire table, copying (from Firefox browser, in my case) and pasting it directly into (Mac) TextEdit or Microsoft Word retained the table formatting perfectly. But here it is again below. I asked ChatGPT to reproduce the table as HTML for easy copy\/paste here with prompt: &#8220;reproduce this table as html&#8221;:<\/p>\n<table>\n<tbody>\n<tr>\n<th>From<\/th>\n<th>To<\/th>\n<th>Quantity (in suk)<\/th>\n<\/tr>\n<tr>\n<td>Cholla Pukto<\/td>\n<td>Seoul<\/td>\n<td>32,000<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Inchon<\/td>\n<td>6,000<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Mukko<\/td>\n<td>8,000<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Chechon<\/td>\n<td>2,000<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Wonju<\/td>\n<td>2,000<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Chunchon<\/td>\n<td>8,000<\/td>\n<\/tr>\n<tr>\n<td colspan=\"2\" align=\"right\"><strong>Total<\/strong><\/td>\n<td><strong>58,000<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Cholla Namdo<\/td>\n<td>Mukko<\/td>\n<td>8,000<\/td>\n<\/tr>\n<tr>\n<td>Chung Chong Pukto<\/td>\n<td>Seoul<\/td>\n<td>4,500<\/td>\n<\/tr>\n<tr>\n<td>Chung Chong Namdo<\/td>\n<td>Seoul<\/td>\n<td>36,000<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Inchon<\/td>\n<td>6,000<\/td>\n<\/tr>\n<tr>\n<td colspan=\"2\" align=\"right\"><strong>Total<\/strong><\/td>\n<td><strong>42,000<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Kyong Sang Pukto<\/td>\n<td>Chechon<\/td>\n<td>5,000<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>Mukko<\/td>\n<td>7,0000<\/td>\n<\/tr>\n<tr>\n<td colspan=\"2\" align=\"right\"><strong>Total<\/strong><\/td>\n<td><strong>12,000<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>You might notice that the model added &#8220;total&#8221; where it detected sub-totals for some sections. 
Also notice that the mistakenly transcribed 7,0000 is converted to 7,000 in the reply, but back to 7,0000 in the HTML table when I requested it in the next prompt (h\/t to <a href=\"https:\/\/norden.social\/@lobidu\">Janis<\/a> for noticing this). Clearly a reminder to check the results as carefully as with OCR outputs.<\/p>\n<p>There are lots of other places online that offer services for cleaning up messy data, but I have had mixed results with them. This worked quite well and can potentially save a lot of time cleaning up tabular data in OCRs of historical documents.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve been following with interest the debates around the rapid emergence of powerful large language models such as OpenAI&#8217;s ChatGPT, its Bing sibling Sydney, Meta&#8217;s Galactica, and Google&#8217;s Bard. One important recent discussion of this can be found here. My current status: deep concern mixed with pragmatic curiosity. Given the propensity of ChatGPT (mid-February, 
2023&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[280],"tags":[],"class_list":["post-9169","post","type-post","status-publish","format-standard","hentry","category-posts"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9yoH3-2nT","_links":{"self":[{"href":"https:\/\/froginawell.net\/frog\/wp-json\/wp\/v2\/posts\/9169","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/froginawell.net\/frog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/froginawell.net\/frog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/froginawell.net\/frog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/froginawell.net\/frog\/wp-json\/wp\/v2\/comments?post=9169"}],"version-history":[{"count":8,"href":"https:\/\/froginawell.net\/frog\/wp-json\/wp\/v2\/posts\/9169\/revisions"}],"predecessor-version":[{"id":9184,"href":"https:\/\/froginawell.net\/frog\/wp-json\/wp\/v2\/posts\/9169\/revisions\/9184"}],"wp:attachment":[{"href":"https:\/\/froginawell.net\/frog\/wp-json\/wp\/v2\/media?parent=9169"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/froginawell.net\/frog\/wp-json\/wp\/v2\/categories?post=9169"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fr
oginawell.net\/frog\/wp-json\/wp\/v2\/tags?post=9169"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
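The post's closing advice — check the model's output as carefully as any OCR result — can be partly automated when a table carries its own subtotals, as this one does. Below is a minimal sketch in Python of such a consistency check, assuming the cleaned table has been keyed into (province, destination, quantity) tuples; the `rows`, `subtotals`, and `check_subtotals` names are illustrative and not part of the original post.

```python
# Verify that each province's line items add up to the stated subtotal.
# A slip like the "7,0000" the post describes (read here as 70000)
# makes the affected section fail the check.
rows = [
    # (province, destination, quantity in suk)
    ("Cholla Pukto", "Seoul", 32000),
    ("Cholla Pukto", "Inchon", 6000),
    ("Cholla Pukto", "Mukko", 8000),
    ("Cholla Pukto", "Chechon", 2000),
    ("Cholla Pukto", "Wonju", 2000),
    ("Cholla Pukto", "Chunchon", 8000),
    ("Kyong Sang Pukto", "Chechon", 5000),
    ("Kyong Sang Pukto", "Mukko", 70000),  # transcription error: should be 7000
]
subtotals = {"Cholla Pukto": 58000, "Kyong Sang Pukto": 12000}

def check_subtotals(rows, subtotals):
    """Return the provinces whose line items do not sum to the stated subtotal."""
    sums = {}
    for province, _dest, qty in rows:
        sums[province] = sums.get(province, 0) + qty
    return [p for p, expected in subtotals.items() if sums.get(p) != expected]

print(check_subtotals(rows, subtotals))  # flags "Kyong Sang Pukto"
```

With the erroneous 70000 in place, the Kyong Sang Pukto section is flagged (5,000 + 70,000 ≠ 12,000); correcting the entry to 7000 makes every section pass.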