Web Archiving for DRI Members

In this blog post, DRI’s Training and Engagement Manager Lorraine Marrey talks to Senior Software Engineer Kathryn Cassidy and Archivist Kevin Long about the web archiving service currently being offered to DRI members.

LM: Hello both, thank you for taking the time to explain in a bit more detail the website archiving option open to DRI members. I’m going to ask a few questions that will hopefully clarify what the service is and how it operates.

LM: What is web archiving?

KL: According to the Digital Preservation Coalition, preserving the web, or ‘web archiving’, refers to the practice of taking a copy of a website or of particular content published on the web to act as a record. A web record might consist of an entire website or only the text from a few pages. Web records require urgent attention because the web is by nature ephemeral. (Digital Preservation Coalition)

LM: Kevin, can you tell us how this works in a DRI context?

KL: Within the DRI, a digital object consists of the file or asset to be preserved and the metadata provided to describe it. In the context of web archiving, it is the captured website itself, rather than the individual files hosted on it, that forms the asset of a DRI digital object. The descriptive metadata will describe the website and the relevant project.

To effectively preserve the material on your website you need to consider whether it is better to archive the website or the files that are hosted on it (or a combination of both).

The format used by the DRI to capture and preserve websites is WARC (Web ARChive). A WARC capture can be replayed in a web browser, allowing end users to interact with the captured website as they would with a live site.

LM: If a DRI member is unsure whether to archive their website or the resources within it, does DRI have any resources or a checklist to help them make this decision?

KL: There are a few things for DRI members to consider:

Reasons for preserving a website would include:

The website contains a large volume of text which does not exist as standalone files.
The content on the website makes most sense to the end user in the context of the larger browsing experience.
Images and other files hosted on the website are not of a preservation quality and do not have metadata to describe them.
The primary content of the website is its indexing features (breadcrumb trails, resource lists).

Reasons to preserve the discrete files hosted on a website would include:

The files hosted on the website represent the important content.
These files are significant pieces of standalone content and have different metadata to describe them (e.g. reports, photographs, essays, audio recordings).
You have preservation quality versions of these files stored outside of the website environment (e.g. .tiff and .wav files).

In addition, the following factors should be considered when deciding whether to create an archive of your website:

DRI Policies: All websites preserved and published on the DRI must comply with our Collection Policy. Archived websites must also comply with DRI’s Deposit Terms and Conditions, and depositors must abide by the copyright and personal data policies contained within.

The DRI will generally only archive at-risk, end-of-life websites as opposed to project sites which are still being updated and maintained.

LM: Kathryn, if a member has decided that their website is the resource that needs archiving, what do they need to do to prepare it for ingest to the repository?

KC: Basically, what we ingest is a .wacz web archive file. Various crawler tools can create a .wacz file. We’ve been using one called Browsertrix, which works quite well, but there are many others to choose from. Typically, you just enter the URL of your website into the crawler application, and it crawls the site for you, downloading each page and packaging everything up into a .wacz file.

You might also choose to download certain specific elements from the website to archive as individual objects alongside the .wacz file. These might be images, audio-visual components, documents and so on. Some of these could also be available to browse within the web archive, but in some cases such downloadable files will not be accessible inside the web archive. Adding them as separate objects in their own right ensures that they are discoverable and accessible within the DRI.

To test your .wacz file after creating it, use a tool such as https://replayweb.page/. This will let you preview the web archive and give you a good idea of what it will look like within the DRI web archive viewer.
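As a rough illustration of a quick check you could run before opening the file in a viewer, the sketch below looks inside a .wacz file using only the Python standard library. It is not a DRI tool, and the function name summarise_wacz is hypothetical. It assumes the usual WACZ layout, in which the file is a ZIP package with a datapackage.json at the top level, the captured WARC data under archive/, and a page list at pages/pages.jsonl. It only confirms that these parts are present and does not validate the capture itself.

```python
import zipfile


def summarise_wacz(path):
    """Summarise the contents of a .wacz package (hypothetical helper).

    Assumes the usual WACZ layout: datapackage.json at the top level,
    WARC data under archive/, and a page list at pages/pages.jsonl.
    """
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a ZIP file, so it cannot be a .wacz package")

    # A .wacz file is a ZIP archive, so its members can be listed directly.
    with zipfile.ZipFile(path) as package:
        names = package.namelist()

    # Collect the WARC files that hold the captured pages.
    warc_files = [
        name
        for name in names
        if name.startswith("archive/") and name.endswith((".warc", ".warc.gz"))
    ]

    return {
        "has_datapackage": "datapackage.json" in names,
        "has_page_list": "pages/pages.jsonl" in names,
        "warc_files": warc_files,
    }
```

Calling summarise_wacz("my-site.wacz"), where my-site.wacz stands in for your own file, returns a small dictionary. An empty "warc_files" list or a missing datapackage.json suggests the crawl did not complete properly and is worth repeating before you go any further.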

We’re currently working with our member organisations to help them create these .wacz files, so if any of the above sounds daunting, please don’t be put off. Get in touch with our Digital Archivist and we can offer technical assistance through the process!

LM: Is the archived website suitable for ingest to Europeana?

KC: Unfortunately, at the moment Europeana doesn’t handle archived websites. If a website includes content that might be suitable for Europeana aggregation, we can certainly discuss ingesting these elements as individual digital objects so that they can be accessed directly in the DRI and aggregated to Europeana. Aside from Europeana, we’re also looking into other aggregation initiatives, such as the European Data Space for Cultural Heritage, or Irish platforms such as HeritageMaps.ie. These sorts of platforms might provide opportunities for aggregating web archives or other novel formats. It’s a developing area.
