Launched in 2016, the Transcribathon platform has been further developed by two Generic Services projects: Enrich Europeana (2018-2020) and Enrich Europeana Plus (2021-2023). The platform allows volunteers to transcribe handwritten historical texts in different languages and from different historical periods, using nothing more than their computer. Since the projects started, over 372,000 documents have been transcribed by volunteers and turned into digital text files, helping to expand and enrich Europeana’s vast collections of digital cultural heritage items.
In 2021, the Enrich Europeana Plus project began to update the Transcribathon platform with advanced handwriting recognition technology, which uses artificial intelligence to provide automatic transcriptions that can then be checked by volunteers. One of the biggest providers of such technology is READ-COOP, a European Cooperative Society that manages the popular Transkribus software. Enrich Europeana Plus spent several months working with READ-COOP and incorporating their technology into the Transcribathon platform.
Linking Transcribathon with the ‘metagrapho’ API
Developed as part of an EU-funded project led by the University of Innsbruck, Transkribus software enables historical handwritten documents to be automatically transcribed on a mass scale. The technology uses AI to ‘learn’ how to read specific types of handwriting, and then applies this knowledge to create automatic transcriptions of texts. This dramatically speeds up the transcription process: the transcriber no longer needs to spend hours writing a transcription from scratch, as they can proofread the automatic transcription instead.
Handwriting recognition technology like Transkribus is particularly well suited to citizen science projects. The easier it is to transcribe documents, the more of them volunteers can process in a given timeframe, and the faster the Europeana website can be enriched. The Transcribathon team were therefore keen to integrate this technology into the platform.
To do this, they decided to use READ-COOP’s metagrapho API to enable Transcribathon to access the Transkribus technology. An API is a piece of software that acts as a messenger between two different platforms. Someone requests information on one platform, and the platform sends this request to the API of another platform. Once this second platform has a response to the request, the API brings it back to the first platform and the person gets the information they need.
The Transcribathon platform uses the metagrapho API in exactly this way. When a volunteer wants to get an automatic transcription of a text, they request this on the Transcribathon platform. Transcribathon then sends this request to the metagrapho API, which uses handwriting recognition technology to process the image and generate an automatic transcription. Finally, once the processing is complete, the Transcribathon platform can access the transcription and show it to the volunteer, again via the metagrapho API.
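The request, wait and fetch pattern described above can be sketched in a few lines of Python. This is only an illustration: the class FakeRecognitionService, its methods submit, status and result, and the status strings are all hypothetical stand-ins that run in memory. They are not the real metagrapho API, whose endpoints and response formats are not described in this article.

```python
import itertools


class FakeRecognitionService:
    """In-memory stand-in for a remote handwriting recognition API.

    Hypothetical: it only mimics the submit / poll / fetch pattern and
    does not reflect the real metagrapho API.
    """

    def __init__(self, polls_until_done=2):
        self._next_id = itertools.count(1)
        self._jobs = {}
        self._polls_until_done = polls_until_done

    def submit(self, image_name):
        """Accept an image and return a job id straight away."""
        job_id = next(self._next_id)
        self._jobs[job_id] = {
            "image": image_name,
            "polls_left": self._polls_until_done,
        }
        return job_id

    def status(self, job_id):
        """Report 'PROCESSING' until the simulated work is finished."""
        job = self._jobs[job_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1
            return "PROCESSING"
        return "FINISHED"

    def result(self, job_id):
        """Return the transcription once the job has finished."""
        job = self._jobs[job_id]
        if job["polls_left"] > 0:
            raise RuntimeError("transcription not ready yet")
        return f"[automatic transcription of {job['image']}]"


def request_transcription(service, image_name, max_polls=10):
    """Play the role of the first platform: send the request, wait, fetch."""
    job_id = service.submit(image_name)
    for _ in range(max_polls):
        if service.status(job_id) == "FINISHED":
            return service.result(job_id)
    raise TimeoutError(f"job {job_id} did not finish in {max_polls} polls")


if __name__ == "__main__":
    print(request_transcription(FakeRecognitionService(), "page_001.jpg"))
```

A real integration would make network calls and pause between status checks, but the division of roles is the same: the platform asks, the recognition service works, and the platform collects the result when it is ready.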
The metagrapho API provides not only the transcription but also the coordinates of each line, or even each word, found in the image, something that was not possible in the old version of Transcribathon. This feature makes it possible to then use the transcriptions for further applications, such as highlighting matching keywords in the text during a full-text search.
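To show why coordinates matter for search, here is a minimal Python sketch of keyword highlighting. The Line structure, the (x, y, width, height) box layout and the sample lines are assumptions made up for this example; the actual data format returned by the metagrapho API is not shown in this article and may well differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """One transcribed line plus its position in the scanned image."""
    text: str
    box: tuple  # assumed layout: (x, y, width, height) in pixels


def find_highlights(lines, keyword):
    """Return the boxes of all lines whose text contains the keyword."""
    needle = keyword.casefold()  # case-insensitive comparison
    return [line.box for line in lines if needle in line.text.casefold()]


# Invented sample data, for illustration only.
SAMPLE_LINES = [
    Line("Ration card no. 12", (40, 30, 500, 28)),
    Line("Bread, 2 kg per week", (40, 70, 520, 28)),
    Line("Issued in Zagreb", (40, 110, 410, 28)),
]

if __name__ == "__main__":
    # Each returned box could be drawn as a highlight over the image.
    print(find_highlights(SAMPLE_LINES, "bread"))
```

Because each line carries its own box, a search result can point to the exact region of the scan instead of only saying that the page contains the word.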
An enhanced transcription editor
Updating the technology behind Transcribathon meant that the transcription editor - the part a volunteer uses to input their transcriptions - was no longer able to cope with the richer data format that it was receiving back from the metagrapho API. Therefore, READ-COOP built a custom transcription editor for Transcribathon. This allows people to click on a line of the transcription, and see the corresponding line in the image of the text.
To speed up the process, READ-COOP took the existing editor in the Transkribus software, modified it to fit the requirements of Transcribathon, and turned it into a widget. The widget was then simply inserted into the Transcribathon platform, making it possible for users to access and edit the transcriptions generated by the metagrapho API. Using the existing Transkribus editor and simply modifying it also saved precious development time and costs.
The power of collaboration
These technological updates take Transcribathon to the next level. Instead of creating time-consuming transcriptions from scratch, volunteers can now simply correct automatically generated transcriptions in the new transcription editor, helping them to process many more documents during a run.
READ-COOP is currently training the handwritten text recognition AI models on the basis of material already transcribed, or for material soon to be transcribed, in Transcribathon. The better the AI model is adapted to the material in focus, the more accurate the automatic transcriptions will be.
For instance, one upcoming Transcribathon Run will feature scans of ration cards from the State Archives in Zagreb, which were used during WW2 (from 1941 to 1945) to ration food and other resources. The cards contain demographic and socioeconomic indicators for individuals and/or households, such as titles and jobs, and are therefore a rich source of research material.
As preparation for this run, READ-COOP held a webinar with employees of the archive, to show them how to prepare training data. This training data will then be used to train a handwriting model or ‘teach’ the engine how to read documents of this type, so that it can provide more accurate transcriptions during the run. This, combined with the proofreading skills of the volunteers, should enable the Zagreb archive to digitise a larger number of documents than ever before.
Find out more
You can watch the webinar on how to prepare training data in this video. The editor for automatic Handwritten Text Recognition is integrated into the Transcribathon platform, where you can also check out the first results from the Dublin papers.