A Focus on Transcription Tools

DRI’s Software Engineer, Dr Kathryn Cassidy, was recently joined by Dr Ciarán Wallace of Beyond 2022 to present on the topic of ‘transcription tools’ at our monthly members’ Virtual Coffee Morning. This virtual get-together was a great success, so we asked them to share a few more thoughts with us.

An interview with DRI Software Engineer, Dr Kathryn Cassidy, and Deputy Director of Beyond 2022: Ireland’s Virtual Record Treasury, Dr Ciarán Wallace

DRI: Thank you both for joining us for our Coffee Morning. Your talks on the topic of ‘transcription tools’ attracted an impressive audience. Why do you think that transcription tools are particularly interesting to people?

KC: I think when you transcribe handwritten material you are, in a way, unlocking the information hidden within it. To catalogue these materials and make them findable, you need to add some metadata about the content of the document. What’s it about? Who is mentioned in the text? Does it relate to a particular place? And so on. If you have a series of handwritten letters, for example, you might already know the recipient and an approximate date range when the letters must have been written, but that’s rarely going to be enough for a researcher or other user to find letters that are of interest to them. So, traditionally someone would have had to read those letters, manually transcribe them, and add the keywords from the letters to the metadata. That’s a very labour-intensive process, but now we have a range of new technologies and initiatives that help to reduce the labour involved in transcribing handwritten texts. The same applies to the transcription of audio and video recordings. Having a transcription in an electronic text format allows for the content of these artefacts to be searched directly, or for other computational processes to be run on the texts.

It also makes the objects more accessible for humans. Once a transcription exists, screen reader applications can read the content of handwritten documents aloud for people who have trouble reading the text, and someone who has trouble hearing or understanding an audio recording might find the transcription more accessible. There are really a lot of benefits to providing transcriptions for digital objects, including handwritten text, audio, and video.

DRI: Thanks, Kathryn. Ciarán, you work on the Beyond 2022 project. Could you tell us a bit about the project and the importance of transcription in the project?

CW: Yes, the Beyond 2022 project is working to recreate – virtually – the collections belonging to the Public Record Office of Ireland (PROI), and the building that housed them, before the destruction of the Four Courts complex in 1922. The PROI was a modern, purpose-built archive that opened in 1867. By 1922 it had gathered thousands of records from older, smaller archives into this one central location. These records dealt with the English state in Ireland right back to the 1300s. Practically everything was destroyed in the opening engagement of the Irish Civil War, but now, 100 years later, we are tracking down any copies or transcripts of originals lost in that fire. There are a surprising number of these, made by officials, court clerks, local and family historians as well as academic researchers, stored in archives and libraries all around the world. When we locate a replacement record we include its metadata and – wherever possible – a digital image in our database. But just as Kathryn says, it is the content, the words on the page, that is important. Manual transcription is very time-consuming and we would never have the resources to transcribe everything. So we needed tools to tackle the problem.

DRI: What are the tools involved in the Beyond 2022 project?

CW: Machine reading for modern printed text, normal Optical Character Recognition (OCR), is fine for published works. But OCR is much less reliable for reading typewritten text or older printed text. It cannot do anything for handwritten documents, and we have ended up with digital images of tens of millions of handwritten words in our archival searches.

Happily, just as we started work, an EU-funded project had come up with a solution. The READ-COOP created an artificial intelligence application called Transkribus which you can train to read any given handwriting. It takes a bit of old-fashioned manual transcription work to begin with, but once the AI has a model of the handwriting it can transform the process. In general, to get the best value for your start-up effort you need large amounts of text in the same handwriting. But we have loaded so many early modern English handwriting samples into Transkribus that we can get excellent results even on new samples that it has never seen before. The accuracy can be over 85% at a first attempt. This is not meant to be publication-quality transcription, but it is reliable enough to allow users to search for person and place names, offices of state and so on, which is where most research begins.
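A note on the accuracy figure: the interview does not say how it is measured. One common way to score automated transcription is character-level accuracy, which compares the machine output against a manual transcription and counts the single-character edits needed to turn one into the other. The sketch below assumes that definition. The function names are illustrative and are not part of Transkribus.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # drop a reference character
                    current[j - 1] + 1,                   # add a hypothesis character
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 minus the character error rate, floored at zero.

    Assumption: the manual transcription is treated as the ground truth.
    """
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this measure, a machine transcription with one wrong character in every four would score 0.75, which is still often good enough for a name or place search to succeed.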

Our Computer Science colleagues in the ADAPT Centre are developing another tool to handle all this transcribed text. The Knowledge Graph for Irish History allows us to connect ‘entities’ in different documents and archives: an ‘entity’ is a person, place or title. So if you were interested in the Fitzgeralds of Kildare, say the 5th Earl, who was Justiciar of Ireland in the early 1400s, the Knowledge Graph would show you his family connections; who else was Justiciar around that time – were they always Fitzgeralds or was he a rarity? – and it would link you to relevant documents about him, the family and Maynooth Castle in multiple archives. A much deeper search.
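The idea of connecting entities across documents can be pictured as a small graph of subject, predicate and object statements. The following is only a minimal sketch of that general idea, not the Knowledge Graph for Irish History itself: every identifier, document reference and archive name below is a hypothetical placeholder, and the predicate names are assumptions made for illustration.

```python
from collections import defaultdict

# (subject, predicate, object) statements. All identifiers are illustrative
# placeholders, not real records from any archive.
TRIPLES = [
    ("person:5th-earl-of-kildare", "member_of", "family:fitzgerald-of-kildare"),
    ("person:5th-earl-of-kildare", "held_office", "office:justiciar-of-ireland"),
    ("person:placeholder-x", "held_office", "office:justiciar-of-ireland"),
    ("family:fitzgerald-of-kildare", "associated_with", "place:maynooth-castle"),
    ("doc:archive-a/0001", "mentions", "person:5th-earl-of-kildare"),
    ("doc:archive-b/0042", "mentions", "place:maynooth-castle"),
    ("doc:archive-b/0042", "mentions", "family:fitzgerald-of-kildare"),
]


def build_index(triples):
    """Index statements by subject (outgoing) and by object (incoming)."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for subject, predicate, obj in triples:
        outgoing[subject].append((predicate, obj))
        incoming[obj].append((predicate, subject))
    return outgoing, incoming


def office_holders(incoming, office):
    """Everyone recorded as holding the given office."""
    return sorted(s for p, s in incoming[office] if p == "held_office")


def related_documents(outgoing, incoming, entity):
    """Documents mentioning the entity or anything one step away from it."""
    neighbours = {entity}
    neighbours.update(obj for _, obj in outgoing[entity])
    neighbours.update(s for p, s in incoming[entity] if p != "mentions")
    documents = set()
    for node in neighbours:
        documents.update(s for p, s in incoming[node] if p == "mentions")
    return sorted(documents)


outgoing, incoming = build_index(TRIPLES)
holders = office_holders(incoming, "office:justiciar-of-ireland")
documents = related_documents(outgoing, incoming, "person:5th-earl-of-kildare")
```

Here `holders` lists both placeholder office holders, and `documents` includes a record from a second archive that never names the Earl directly but mentions his family, which is the kind of deeper search described above.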

[Image: Transcription screen shot, courtesy of Beyond 2022]

DRI: Kathryn, you gave a talk about Transcribathon at the Coffee Morning. What is it, and why did you think it would be of interest to DRI’s members?

KC: Transcribathon is a tool that takes a different approach to lessening the workload of creating transcriptions. It leverages crowdsourcing to let members of the public help transcribe handwritten documents. Obviously, this takes some of the burden away from the staff of the institute holding the documents, but it also provides a really exciting way to get the public involved with your collections in a deep and meaningful way. Crowdsourcing campaigns like this can bring a lot of publicity to your institute and the various collections in your custody. Users who participate in crowdsourcing campaigns gain a deeper understanding and appreciation of the materials.

The Transcribathon tool itself works with content in the Europeana platform, which is a central portal bringing together thousands of European archives, libraries and museums to share their cultural heritage materials with the world.

DRI: That’s really interesting. So, when a member of the public transcribes a document, they form a connection with that document, and that encourages them to engage with the institute that holds it – and its other materials – in a deeper and more meaningful way. We like that idea! Leading on from that last question, then, could you tell us a little about EnrichEuropeana+?

KC: This is a really exciting new project that we’re very happy to be involved in. It aims to enhance the Transcribathon Platform as a service for cultural heritage institutions. So it will be integrating AI tools such as Transkribus, automated translation tools and natural language processing to automatically transcribe, translate, and identify metadata enrichments. It will merge these tools with the existing crowdsourcing platform, so users won’t just be transcribing; they’ll also be correcting automated transcriptions and translations. We hope that it will really improve the results that institutes can get out of crowdsourcing campaigns, and it will also give the users a better understanding of these technologies and their strengths and weaknesses!

DRI: So humans and computers are working together to make each more knowledgeable, better, and more efficient transcribers! What should we be looking out for from EnrichEuropeana+ in the near future?

KC: Well, the project just started in April this year and it will be running for 18 months. Dublin City Library and Archive, who are also a partner in the project, will be running a crowdsourcing campaign with some of their content, and the Library of Trinity College Dublin is also working hard on getting some of their content ready to send to Europeana with a view to doing the same. I think we’ll see a lot of activity in this space in the coming year! We’re also keen to hear from other institutes that have handwritten content from the 19th century and might be interested in depositing this data and using it in crowdsourcing campaigns. You can find out more about the project on the Europeana website.

DRI: What are the next steps for Beyond 2022? When will we be able to virtually browse the lost archives?

CW: We will launch the Virtual Record Treasury on the centenary of the destruction – 30 June 2022. There will be thousands of documents, maps, and images from over fifty archives and research libraries around the world. The Treasury will be free to use and it will continue to grow – it has to! We know already that we have found more material than we could possibly process and put up online by next summer. But you can get a taster right now by going to our brochure website, www.beyond2022.ie. You can check out some of our collections on the Explore page. Follow us on Twitter @VirtualTreasury to see plans for upcoming releases.

DRI: Thank you for answering those questions. There’s so much to look forward to! We are sure that our readers will be keeping a lookout for the next stages of both projects.

DRI is a participatory member of the Beyond 2022 project – you can read more about the agreement here. Our Virtual Coffee Mornings are a forum where DRI Members can start a conversation about digital preservation topics, challenges, or projects in a relaxed environment. We welcome all members to join us for these meetings, which are advertised in advance by the DRI team. To find out more about DRI’s membership model, visit our Membership page. Watch this space for future blog posts on the topics shared in our Coffee Mornings!
