
Contextualising Collections with ‘Datasheets for Digital Cultural Heritage Datasets’

Submitted on 21st May 2024

This blog is part of a series of posts related to the Cultural Heritage Image Sharing Recommendations produced by the WorldFAIR Project’s Cultural Heritage Image Sharing Working Group. Learn more about DRI’s role in the WorldFAIR Project.

A conversation between Steven Claeyssens (Curator of Digital Collections at the KB, National Library of the Netherlands, and Chair of the Europeana Research Community Steering Group) and Beth Knazook (DRI's Research Data Project Manager).

BK: Thank you for joining our blog series Steven! It’s really exciting to be able to talk with you more about datasheets for cultural heritage collections, which became an important part of the discussion on data documentation that you contributed to as part of the WorldFAIR Project Cultural Heritage Image Sharing Working Group. You’ve also been leading a Datasheets for Digital Cultural Heritage Working Group with Europeana that recently published guidance in the Journal of Open Humanities Data on how these datasheets can be responsibly created, ‘Datasheets for Digital Cultural Heritage Datasets’ (DOI: 10.5334/johd.124). Can you tell us a little bit about what inspired the work at Europeana?

SC: My pleasure! Glad to have the opportunity to talk with you.

If I recall correctly, the idea for the Datasheets for Digital Cultural Heritage Working Group was born during a breakout session of the Members Council of the Europeana Network Association a few years ago. Clemens Neudecker (Berlin State Library) and I were asking ourselves how the EuropeanaTech Community and the Europeana Research Community could collaborate more often. Around that same time, I became increasingly convinced that we urgently needed to learn how to think about describing and offering our digital collections as data. Since 2012, the KB has been providing researchers with access to our digital collections for text and data mining (TDM). Initially, those researchers were just happy it was possible, but increasingly they were asking legitimate questions. What exactly is in these collections, and what is not? How does the proportion of digitised publications relate to the proportion of published works over time, for instance? What is the quality of the OCR (Optical Character Recognition)? Which voices are represented and which are silenced? And who made the decisions in compiling the collection? These are very important questions that we can often answer only in a very limited way. A lot of the information lives in the heads of the people involved in the digitisation process. Other details we need to extract from the ‘black box’ of our digitisation pipelines and the selection criteria of past decades.

This preoccupied me, especially in light of the rapid growth of machine learning and the rise of Large Language Models (LLMs). These models are trained on large amounts of text. But what texts are these? And is it done in a responsible manner? When I discussed this with Clemens, I had just read the article ‘Datasheets for Datasets’ by Timnit Gebru et al. (https://doi.org/10.1145/3458723). Was this not the direction we should take in the heritage world, I wondered? Clemens did not disappoint: he was familiar with both the issue and the article. We knew how we wanted to collaborate.

BK: In our previous blog with Josiline Chigwada, she highlighted how documenting the scope and provenance of memory collections makes those collections more useful to both humans and machines. This is probably an important point to come back to in this conversation, as the cultural heritage field has typically been good at documenting where artefacts or records come from, who owned them, and when they were acquired, conserved, exhibited, and even digitised. It might therefore seem to some readers that datasheets are not really a new or necessary idea. Would I be correct in saying that datasheets take what is typically messy behind-the-scenes or administrative information and organise it in predictable ways?

SC: Especially in the library world, we have a long tradition of meticulously describing each object, providing it with bibliographic metadata, information about its provenance, links to thesauri, etc. However, unlike the archival world, we have comparatively little experience in describing collections of objects. There is a simple logic behind this: people consume text one piece at a time. We read a book, a magazine, or a newspaper, not an entire library. Librarians have therefore become adept at making individual publications findable, first as objects, and increasingly as text through digitisation. Machines, on the other hand, can handle much more. Indeed, their greatest strength compared to humans lies in the volume of data they can process in a short time. We are still figuring out how to aggregate or bundle that information about collections in a comprehensible way, and how to add broader context so that anyone working with a heritage collection as data (both the artefacts and their associated metadata) can do so responsibly, aware of potential gaps, underrepresented or overrepresented voices, historical pitfalls, and the biases of the collection creators.

Datasheets consist of little more than a series of questions that can help gather and document that knowledge. It is a template, a guide for a README file, modular and certainly not exhaustive. So far, it mainly emphasises the R in FAIR, Responsible (re-)use. We are now also considering the step towards the F, Findable, by translating the template into a structured, machine-readable form, so that the information can also come to life in data registries and catalogues.
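To give a sense of what that machine-readable step could look like: one plausible target is a schema.org Dataset description in JSON-LD, the markup that many data registries and catalogue harvesters already index. The sketch below is purely illustrative; the collection, values, and URL are invented, and this is not a schema the Working Group has adopted:

```
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Digitised Newspapers, 1850-1899 (illustrative example)",
  "description": "OCRed text and page images of a hypothetical newspaper collection.",
  "temporalCoverage": "1850/1899",
  "inLanguage": "nl",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "subjectOf": {
    "@type": "CreativeWork",
    "name": "Datasheet (README)",
    "description": "Answers to the datasheet questions: selection criteria, known gaps, OCR quality, and who made which digitisation decisions.",
    "url": "https://example.org/collections/newspapers/datasheet.md"
  }
}
```

The point is simply that the human-readable README and the catalogue record can reference each other, so the contextual knowledge travels with the data into registries and catalogues.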


BK: I really liked that ‘responsibility’ was highlighted in your recommendations for developing datasheets. Transparency is so important that we made it a recommendation all on its own in the WorldFAIR report! Although we used the word in Recommendation #2 mainly to encourage transparency around the creation and management of digital files, it really does run through all the recommendations. What are some of the issues that the Europeana group explored in terms of transparency and ethics in creating datasheets?

SC: Yes, for me personally, the most important lesson I took away from the sessions with the working group was the enormous importance of that tricky notion of ‘responsibility.’ Unfortunately, I don’t have hands-on experience with machine learning myself, but in our group there were people who do have that experience and who can reflect on it well. It goes beyond ‘garbage in, garbage out’; it’s also ‘bias in, bias out.’ Moreover, the data hunger of these models is so immense that there is hardly any time or space left for thorough selection and curation, or even for something as basic as asking for permission. Over the years, we in the heritage world have learned to think about licences, public domain, and openness, but these developments are putting everything back on shaky ground. Shouldn’t we think more, and better, about who exactly is allowed to do what with our heritage? However, the datasheet paper does not go so far as to ask these kinds of fundamental questions. (As an aside, for anyone interested in these questions, I have been recommending for the past year that people follow Open Future.)

In the group we considered, for example, the specific properties of heritage data compared to other data. Translated to my world: how a collection of digitised books is always only a part of all preserved books. How that collection of preserved books is always only a part of all published books. And how that national book production is in turn always only a part of all books ever written. I like to compare this layering to a ziggurat structure: each new layer is by definition smaller than the underlying, half-buried layer, but we often remain in the dark about the precise difference and how that difference came about.

The biggest discussion was about metrics and measures. There was broad support for the desire to make possible biases, the ziggurat problem included, statistically visible, but equally broad scepticism about whether this can be done in a responsible way.

BK: We’ve asked some of the other Working Group members in this blog series if there was a particular recommendation that made them think about their work differently, or that they think could have a big impact in their region and/or field. Can I ask you the same about the impact the WorldFAIR recommendations might have on your work at Koninklijke Bibliotheek?

SC: Ah, I fear we still have work to do on all fronts: we do not yet have a clear citation model (Recommendation 1); we are, of course, working on datasheets (data documentation, Recommendation 3); and we are still grappling with rights, licences, and labels (Recommendation 4), particularly the TDM opt-out, which is difficult to implement in practice and forces us to resort to excluding certain crawlers via robots.txt. (See also our AI Statement: www.kb.nl/en/news/kb-restricts-access-to-collections-for-training-commercial-ai.) But what I took most from the WorldFAIR sessions was the consensus I encountered when we talked about the status of digitised objects. I think I will quote this sentence more often: ‘The Working Group agreed broadly that the GLAM sector has a conceptual problem to overcome in the assumption that digital representations of images are mere surrogates for original objects.’ Exactly! Even at the KB, this has not fully sunk in. We need to realise that as (national) libraries, we have become (re)publishers of the enormous, historical long tail of publications. This brings with it other responsibilities. We have always partly shaped the image of history by what we preserve and make findable, but now also by what we make publicly available and searchable online, for both humans and machines. We have been transformed into very active gatekeepers of the past, with a reach greater than ever before. The machine is effectively turning libraries, archives, and museums into large publishing houses. Let us ensure we become and remain responsible publishers, protecting those who need protection as well as the digital commons. We are, after all, not the owners of our collections.
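As an aside on that TDM opt-out: the crawler exclusion Steven describes is typically done with a robots.txt file along the lines below. This is an illustrative sketch, not the KB’s actual configuration; GPTBot and CCBot are real user agents (OpenAI’s and Common Crawl’s crawlers, respectively):

```
# Illustrative robots.txt: opt out of some known AI-training crawlers.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Leave ordinary search-engine indexing untouched.
User-agent: *
Disallow:
```

Note that robots.txt is advisory: a crawler must choose to honour it, which is part of why the TDM opt-out is so hard to enforce in practice.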

BK: I think that’s a really powerful statement to conclude this conversation, Steven. Thank you so much for sharing your thoughts on the DRI blog!

[Image: rainbow pie chart with segments representing the disciplinary Work Packages in the WorldFAIR Project]

DRI is funded by the Department of Further and Higher Education, Research, Innovation and Science (DFHERIS) via the Higher Education Authority (HEA) and the Irish Research Council (IRC).

