Data rescue for World Digital Preservation Day 2025

Today, Thursday 6 November 2025 (if I actually manage to finish and publish this today), is World Digital Preservation Day, so I thought I would try and get a blog post out about some work I’ve been doing to rescue at-risk data. I’ve briefly mentioned this in my post about Library of Congress Subject Headings, but not in much detail.

The project is Safeguarding Research & Culture, and I got involved back in March or April when Henrik reached out on social media looking for someone with library & metadata experience to contribute. I said that I wasn’t a Real Librarian but I’d love to help if I could, and now here we are.

The concept is simple: download public datasets that are at risk of being lost, and replicate them as widely as possible to make them hard to destroy, though obviously there’s a lot of complexity buried in that statement. When the Trump administration first took power, there were a lot of people around the world worried about this issue and wanting to help, so while there are a number of institutions & better-resourced groups doing similar things, we aim to complement them by mobilising grassroots volunteers.

Downloading data isn’t always straightforward. It may be necessary to crawl an entire website, or query a poorly-documented API, or work within the constraints of rate-limiting so as not to overload an under-resourced server. That takes knowledge and skill, so part of the work is guiding and mentoring new contributors and fostering a community that can share what they learn and proactively find and try out new tools.
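
To give a flavour of the sort of care involved, here’s a minimal sketch of a polite, rate-limited download loop in Python. Everything in it is made up for illustration: the URLs, the delay and the contact address are placeholders, and a real rescue job would also handle retries, resumption and robots.txt.

```python
import time
import requests

# Hypothetical list of files to rescue -- in practice this would come from a
# crawl or an API listing, not a hard-coded list.
urls = [
    "https://data.example.gov/files/report-2023.csv",
    "https://data.example.gov/files/report-2024.csv",
]

DELAY_SECONDS = 5  # be gentle with under-resourced servers

with requests.Session() as session:
    # Identify yourself so server admins can get in touch (placeholder address)
    session.headers["User-Agent"] = "data-rescue-bot/0.1 (contact@example.org)"
    for url in urls:
        response = session.get(url, timeout=60)
        if response.status_code == 429:  # the server asked us to slow down
            wait = int(response.headers.get("Retry-After", "60"))
            time.sleep(wait)
            response = session.get(url, timeout=60)
        response.raise_for_status()
        filename = url.rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(response.content)
        time.sleep(DELAY_SECONDS)  # simple rate limit between requests
```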

We also need people to be able to find and access the data, and volunteers to be able to contribute their storage to the network. We distribute data via the venerable BitTorrent protocol, which is very good at defeating censorship and getting data out to as many peers as possible as quickly as possible. To make those torrents discoverable, our dev team led by the incredible Jonny have built a catalogue of dataset torrents, playfully named SciOp. That’s built on well-established linked data standards like DCAT, the Data Catalogue Vocabulary, so the metadata is standardised and interoperable, and there’s a public API and a developing command-line client to make it even easier to process and upload datasets. There are even RSS and RDF feeds of datasets by tag, size, threat status or number of seeds (copies) in the network that you can plug into your favourite BitTorrent client to automatically start downloading newly published datasets. And there are exciting plans in the works to make it federated via ActivityPub, to give us a network of catalogues instead of just a single one.
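
If you haven’t met DCAT before, here’s a rough sketch of what describing a dataset and its torrent distribution looks like, using the rdflib Python library. The dataset, URIs and torrent URL are invented for illustration and this isn’t the actual SciOp schema, just the general shape of DCAT metadata.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()

# Hypothetical dataset and torrent URLs, purely for illustration.
dataset = URIRef("https://example.org/datasets/storm-events")
torrent = URIRef("https://example.org/torrents/storm-events.torrent")

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Storm events archive (example)")))
g.add((dataset, DCTERMS.description,
       Literal("An at-risk public dataset, mirrored as a torrent.")))

distribution = URIRef("https://example.org/datasets/storm-events#torrent")
g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.downloadURL, torrent))
g.add((distribution, DCAT.mediaType, Literal("application/x-bittorrent")))
g.add((dataset, DCAT.distribution, distribution))

print(g.serialize(format="turtle"))
```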

We’re accidentally finding ourselves needing to push the state of the art in BitTorrent client implementations. If you’re familiar with the history of BitTorrent as a favoured tool for ahem less-than-legal media sharing, it probably won’t surprise you that most current BitTorrent clients are optimised for working with single audio-visual streams of about 1 to 2½ hours in length. Our scientific & cultural data is much more diverse than that, and the most popular clients can struggle for various reasons. In many cases there are BEPs (BitTorrent Enhancement Proposals) to extend the protocol to improve things, but these are optional features that most clients don’t implement. The collection of BEPs that make up “BitTorrent v2” is a good example: most clients don’t support v2 well, so most people don’t bother making v2-compatible torrents, but that means there’s no demand to implement v2 in the clients. We are planning to make a scientific-grade BitTorrent client as a test-bed for these and other new ideas.
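
For a sense of where v2 fits in, here’s a rough sketch of creating a hybrid v1+v2 torrent with the libtorrent Python bindings, which do support v2 in the 2.x series. The dataset path and tracker URL are placeholders, not anything we actually run.

```python
import libtorrent as lt  # needs libtorrent 2.x for BitTorrent v2 support

fs = lt.file_storage()
lt.add_files(fs, "dataset")  # add every file under ./dataset (placeholder path)

# libtorrent 2.x creates hybrid v1+v2 torrents by default, so older clients
# can still join the swarm while v2-aware clients get the newer hashing.
t = lt.create_torrent(fs)
t.add_tracker("udp://tracker.example.org:6969/announce")  # placeholder tracker
lt.set_piece_hashes(t, ".")  # hash the pieces, relative to the parent directory

with open("dataset.torrent", "wb") as f:
    f.write(lt.bencode(t.generate()))
```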

Myself, I’m running one of a small number of “super” nodes in the swarm, with much more storage available than the average laptop or desktop, and often much better bandwidth too. That’s good, because some of our datasets run to multiple terabytes, and to ensure new nodes can get started quickly we need some always-on nodes with most of the data available to others. Since BitTorrent is truly peer-to-peer, it doesn’t matter how many people have a copy of a given dataset: if none of them are online, no-one else can access it.

This is all very technically interesting, but communications, community, governance, policy, documentation and funding are also vitally important, and for us these are all works in progress. We need volunteers to help with all of this, but especially those less-technical aspects. If you’re interested in helping, please drop us a line at contact@safeguar.de, or join our community forum and introduce yourself and your interests.

If you want to contribute but don’t feel you have the time or skills, well, to start with we’re more than happy to show you the ropes and help you get started. But as an alternative, I’m running one of those “super” nodes and you can contribute to my storage costs via GoFundMe: even a few quid helps. I currently have 3× 6TB hard drives with no space to mount them, so I’m in need of a drive cage to hold them and plug them into my server.

Special shout-out also to our sibling project, the Data Rescue Project, who are doing amazing work on this and often send us requests for websites or complex datasets for our community to save.

I’ve barely scratched the surface here, but I really want to actually get this post out for WDPD so I’m going to stop here and hopefully continue soon!

If you enjoy what I create, or find it useful, you can support me at ko-fi.com/cipherrot. Every little helps!

