On fooling around with triples

Sunday 6 April 2025, Jez Cope. Tags: Metadata, RDF, SPARQL, Data rescue

Content Note

For reasons that will become apparent as you read the introduction, this post makes tangential reference to a number of oppressed groups, and to fascist newspeak, as encoded in Library of Congress metadata standards.

If you've been at all aware of the current political situation in the US, you'll know that the dude in the Oval Office has been issuing executive orders left, right and centre. Much damage is being done to the country, its people and its infrastructure that will take generations to repair, but as someone watching from an ocean away it would be easy to believe it won't affect me.

It will.

Not only will any success experienced by fascism in the US embolden fascists around the world, but the US is so culturally and economically dominant that anything happening there will inevitably impact the rest of us. "When America sneezes, the rest of the world catches a cold."

If this all seems like a weird way to introduce a post about me figuring out how to query and manipulate RDF metadata, well, it is. I do intend to write more about this soon, but for now I'll just say that the question I'm interested in is this:

Is the linked data published by the Library of Congress being materially changed as a result of the aforementioned flurry of executive orders?

If you think this sounds hypothetical, well, the US National Cancer Institute has already begun removing certain gender- and sexuality-related terms from its linked data thesaurus. I like to believe that, for reasons of both professional integrity and institutional inertia, it will take much longer before similar changes are made to vocabularies like the Library of Congress Subject Headings (LCSH). These are resources used in cultural heritage institutions all over the world, though, so if & when that happens the impact will be widespread.

The setup

I eventually want to be able to do this kind of analysis on "the big one", LCSH. But that really is big and queries against it will take … a while … so I've chosen to start with something smaller, the Library of Congress Demographic Group Terms. I chose this in particular because it's a reasonable size but also contains terms that appear on a list of those that government departments are being instructed to remove, so it seems likely to be a target of that censorship.

This data is all in the form of RDF triples. This post isn't the place for a full explanation of what that means, but in brief, a triple is a machine-readable statement of some property of a subject in the form:

<subject> <predicate> <object>

So I have two versions of the same RDF ontology, and I want to know what statements have been removed or altered from one to the other. I could process the raw files myself as text but: 1) I would eventually end up writing the bits of a triple-parser and store I need myself, from scratch, which would be a waste of time; and 2) I actually would like to learn more about the key technologies involved.

To work with RDF data, I needed a specialised database called a triplestore, which is optimised for running queries against this type of graph data using a query language whimsically named SPARQL. I didn't give this too much thought, and picked Apache Jena Fuseki because it was open source, was available as a package in my Linux distribution and worked first time when I tried starting it. Other triplestores are available.

The basic unit of a triplestore is a "graph", equivalent to a "database" in a database management system. Because these are two versions of the same data, with significant overlap, I can't just load them both into the same graph: the second load would merge with or overwrite the first, leaving me no way to compare them. What I can do is load them into separate "named graphs", which I can then refer to explicitly in my queries. I've called the graphs for the two versions urn:demographicTerms/20250314 and urn:demographicTerms/20250321, for the dates I downloaded them.
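How you load the files will depend on your triplestore. As a minimal sketch, a SPARQL Update request along these lines should do it, assuming the server is permitted to read local files; the file paths are hypothetical placeholders:

```sparql
# Load each snapshot into its own named graph.
# The file:// paths are hypothetical; substitute wherever you saved the downloads.
LOAD <file:///data/lcdgt/demographicTerms-20250314.nt>
  INTO GRAPH <urn:demographicTerms/20250314> ;

LOAD <file:///data/lcdgt/demographicTerms-20250321.nt>
  INTO GRAPH <urn:demographicTerms/20250321>
```

Keeping the download date in the graph name means further snapshots can be added later without disturbing the existing ones.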

RDF triples use URIs (technically IRIs) all over the place, and these tend to be quite verbose, so SPARQL lets you define prefixes to keep queries concise. Here are a few that might be useful:

  PREFIX owl: <http://www.w3.org/2002/07/owl#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX mads: <http://www.loc.gov/mads/rdf/v1#>
  PREFIX ri: <http://id.loc.gov/ontologies/RecordInfo#>

I'll also define my own prefix, dt: to make it easier to refer to the two named graphs:

  PREFIX dt: <urn:demographicTerms/>

You'll see these abbreviated to <<common-prefixes>> in the code below. With that set up, let's dive into the data.
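To make the prefix mechanism concrete, here's a small sketch that looks up the label of a single term. The term IRI is one that shows up in the results later in this post; the comments spell out what each prefixed name expands to.

```sparql
PREFIX mads: <http://www.loc.gov/mads/rdf/v1#>
PREFIX dt: <urn:demographicTerms/>

# dt:20250321 expands to <urn:demographicTerms/20250321>
# mads:authoritativeLabel expands to
#   <http://www.loc.gov/mads/rdf/v1#authoritativeLabel>
SELECT ?label
FROM dt:20250321
WHERE {
  # subject (a full IRI), predicate (a prefixed name), object (a variable)
  <http://id.loc.gov/authorities/demographicTerms/dg2015060230> mads:authoritativeLabel ?label
}
```

The pattern inside WHERE is itself shaped like a triple: anything written as a variable is what the triplestore fills in.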

Getting situated

To make sure that I'm successfully querying the right place, and get an idea of what data is in there, let's pull out a list of all the unique predicates: the things that are used to make statements about other things.

  <<common-prefixes>>

  SELECT DISTINCT ?predicate
  FROM dt:20250314
  WHERE {
    [] ?predicate []
  }
  ORDER BY ?predicate

predicate
http://id.loc.gov/ontologies/RecordInfo#languageOfCataloging
http://id.loc.gov/ontologies/RecordInfo#recordChangeDate
http://id.loc.gov/ontologies/RecordInfo#recordContentSource
http://id.loc.gov/ontologies/RecordInfo#recordStatus
http://id.loc.gov/ontologies/bflc/marcKey
http://id.loc.gov/vocabulary/identifiers/lccn
http://www.loc.gov/mads/rdf/v1#adminMetadata
http://www.loc.gov/mads/rdf/v1#authoritativeLabel
http://www.loc.gov/mads/rdf/v1#citationNote
http://www.loc.gov/mads/rdf/v1#citationSource
http://www.loc.gov/mads/rdf/v1#citationStatus
http://www.loc.gov/mads/rdf/v1#deletionNote
http://www.loc.gov/mads/rdf/v1#elementList
http://www.loc.gov/mads/rdf/v1#elementValue
http://www.loc.gov/mads/rdf/v1#exampleNote
http://www.loc.gov/mads/rdf/v1#hasBroaderAuthority
http://www.loc.gov/mads/rdf/v1#hasEarlierEstablishedForm
http://www.loc.gov/mads/rdf/v1#hasMADSCollectionMember
http://www.loc.gov/mads/rdf/v1#hasNarrowerAuthority
http://www.loc.gov/mads/rdf/v1#hasReciprocalAuthority
http://www.loc.gov/mads/rdf/v1#hasSource
http://www.loc.gov/mads/rdf/v1#hasVariant
http://www.loc.gov/mads/rdf/v1#historyNote
http://www.loc.gov/mads/rdf/v1#isMemberOfMADSCollection
http://www.loc.gov/mads/rdf/v1#isMemberOfMADSScheme
http://www.loc.gov/mads/rdf/v1#note
http://www.loc.gov/mads/rdf/v1#variantLabel
http://www.w3.org/1999/02/22-rdf-syntax-ns#first
http://www.w3.org/1999/02/22-rdf-syntax-ns#rest
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2000/01/rdf-schema#comment
http://www.w3.org/2000/01/rdf-schema#label

This gives me a little confidence that we are at least looking at the right sort of data, and also gives hints about the structure of the dataset. Now we can start digging in a bit more deeply.
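Since the comparison later on relies on both snapshots being present, it's also worth confirming that each one actually loaded. A rough sketch that counts the triples in every named graph in the dataset; two snapshots taken a week apart should come out at broadly similar sizes:

```sparql
# Count triples per named graph; no prefixes needed.
SELECT ?g (COUNT(*) AS ?triples)
WHERE {
  GRAPH ?g { ?s ?p ?o }
}
GROUP BY ?g
ORDER BY ?g
```

A graph that is missing from the results, or that is wildly smaller than its neighbour, points to a failed or partial load rather than to censorship.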

Deprecated records

In common with a lot of such vocabularies, those produced by LoC are generally quite scrupulous about retaining deleted entries but clearly marking them as such, rather than simply dropping them from the dataset. This is important: if you make use of a term that is subsequently deprecated, you need to know when that happened and why to understand how to appropriately update your own records, and historical records will likely still reference terms that have since been deleted but still need to be interpreted.

Let's take a look at what's been officially deprecated from this vocabulary:

  <<common-prefixes>>

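  SELECT ?label ?delDate ?note FROM dt:20250321
  WHERE {
    # Required patterns: deprecated authority records and their metadata
    ?x rdf:type mads:Authority .
    ?x mads:deletionNote ?delNote .
    ?x mads:adminMetadata ?md .
    ?md ri:recordChangeDate ?delDate .
    ?md ri:recordStatus "deprecated" .
    # The label is optional, so it must follow the required patterns:
    # OPTIONAL only extends solutions already found above it
    OPTIONAL { ?x mads:variantLabel ?label } .
    # Flatten multi-line notes so each result fits on one row
    BIND(REPLACE(?delNote, "\n", " ") AS ?note)
  }
  ORDER BY ?delDate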

label	delDate	note
Prisoners of war	2021-01-22T15:39:00	This authority record was deleted because it was created in error.
Concentration camp inmates	2022-11-09T12:41:19	This authority record has been deleted because the demographic group term is covered by the demographic group terms {Internment camp inmates} (DLC)dg2022060247 and {Nazi concentration camp inmates} (DLC)dg2022060248
Parents of autistics	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Family members of autistics} (DLC)dg2024060011 and {Parents} (DLC)dg2015060230
Parents of mass murderers	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Family members of mass murderers} (DLC)dg2024060075 and {Parents} (DLC)dg2015060230
Politicians' partners	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Family members of politicians} (DLC)dg2024060009 {Spouses} (DLC)dg2024060004 and {Unmarried partners} (DLC)dg2024060005
Cancer patients' partners	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Family members of cancer patients} (DLC)dg2024060007 {Spouses} (DLC)dg2024060004 and {Unmarried partners} (DLC)dg2024060005
Parkinson's disease patients' partners	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Family members of Parkinson's disease patients} (DLC)dg2024060008 {Spouses} (DLC)dg2024060004 and {Unmarried partners} (DLC)dg2024060005
Parents of transgender people	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Family members of transgender people} (DLC)dg2024060076 and {Parents} (DLC)dg2015060230
Parents of dyslexics	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Family members of dyslexics} (DLC)dg2024060073 and {Parents} (DLC)dg2015060230
Partners (Spouses)	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Spouses} (DLC)dg2024060004 and {Unmarried partners} (DLC)dg2024060005
Parents of alcoholics	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Family members of alcoholics} (DLC)dg2024060010 and {Parents} (DLC)dg2015060230
Parents of gays	2024-06-06T12:09:08	This authority record has been deleted because the heading is covered by the headings {Family members of gay people} (DLC)dg2024060074 and {Parents} (DLC)dg2015060230
United States Air Force officers	2024-08-21T13:21:46	This authority record has been deleted because it is not a valid construction.
United States Coast Guard officers	2024-08-21T13:21:46	This authority record has been deleted because it is not a valid construction.
Junior high school students	2025-01-24T13:29:48	This authority record has been deleted because the heading is covered by the heading {Middle school students} (DLC)dg2015060024

There's actually nothing problematic here, as far as I can tell. Yes, some terms have been removed over time but this is because they are covered by some other combination (e.g. Parents of dyslexics), weren't correctly constructed to start with (e.g. United States Coast Guard officers), or were duplicates (e.g. Junior high school students).
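The same deprecation metadata can also be compared across the two snapshots. As a sketch, assuming the record status is recorded the same way in both graphs, this lists records marked as deprecated in the later snapshot that weren't marked so in the earlier one:

```sparql
PREFIX mads: <http://www.loc.gov/mads/rdf/v1#>
PREFIX ri: <http://id.loc.gov/ontologies/RecordInfo#>
PREFIX dt: <urn:demographicTerms/>

SELECT ?x ?delDate
FROM NAMED dt:20250314
FROM NAMED dt:20250321
WHERE {
  # Deprecated in the later snapshot...
  GRAPH dt:20250321 {
    ?x mads:adminMetadata ?md .
    ?md ri:recordStatus "deprecated" ;
        ri:recordChangeDate ?delDate .
  }
  # ...but not already deprecated in the earlier one.
  # The metadata node gets its own variable because it may be a
  # blank node, and blank nodes aren't shared between graphs.
  FILTER NOT EXISTS {
    GRAPH dt:20250314 {
      ?x mads:adminMetadata ?oldMd .
      ?oldMd ri:recordStatus "deprecated" .
    }
  }
}
ORDER BY ?delDate
```

Anything this returns is a deprecation made between the two downloads, and its deletion note is the place to look for the stated reason.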

Let's move on…

Social terms (includes gender and sexuality)

There are several subsets defined within the full set of terms. One in particular, http://id.loc.gov/authorities/demographicTerms/collection_LCDGT_Social, contains terms referring to social groups, which includes various categories which are actively threatened by the current regime. These seem worth looking over to see if there's anything amiss.

  <<common-prefixes>>

  SELECT * FROM dt:20250321
  WHERE {
    ?term mads:authoritativeLabel ?label .
    ?term mads:isMemberOfMADSCollection <http://id.loc.gov/authorities/demographicTerms/collection_LCDGT_Social>
  }
  LIMIT 20

label	term
Labor union members	http://id.loc.gov/authorities/demographicTerms/dg2023060130
Family members of Alzheimer's patients	http://id.loc.gov/authorities/demographicTerms/dg2015060782
Clavichord students	http://id.loc.gov/authorities/demographicTerms/dg2016060518
Males	http://id.loc.gov/authorities/demographicTerms/dg2015060003
Iraq War veterans	http://id.loc.gov/authorities/demographicTerms/dg2024060120
Chemistry students	http://id.loc.gov/authorities/demographicTerms/dg2017060075
Greek, Students of	http://id.loc.gov/authorities/demographicTerms/dg2023060042
Family members of mass murderers	http://id.loc.gov/authorities/demographicTerms/dg2024060075
Quechua, Students of	http://id.loc.gov/authorities/demographicTerms/dg2023060041
Orphans	http://id.loc.gov/authorities/demographicTerms/dg2022060429
Czech, Students of	http://id.loc.gov/authorities/demographicTerms/dg2022060118
Navajo, Students of	http://id.loc.gov/authorities/demographicTerms/dg2023060063
Children of politicians	http://id.loc.gov/authorities/demographicTerms/dg2017060126
Expatriates	http://id.loc.gov/authorities/demographicTerms/dg2015060864
Holocaust victims	http://id.loc.gov/authorities/demographicTerms/dg2015060152
Two-spirit people	http://id.loc.gov/authorities/demographicTerms/dg2022060056
Booker Prize winners	http://id.loc.gov/authorities/demographicTerms/dg2022060316
Prix Carbet de la Caraïbe et du Tout-Monde winners	http://id.loc.gov/authorities/demographicTerms/dg2024060112
Stepbrothers	http://id.loc.gov/authorities/demographicTerms/dg2015060238
Parents	http://id.loc.gov/authorities/demographicTerms/dg2015060230

I've limited the number of results here to 20, but I've scanned through the full collection of 325 and all the terms I'd expect to see removed (relating to sexuality & gender identity, for example) are still present. For example:

  <<common-prefixes>>

  SELECT * FROM dt:20250321
  WHERE {
    ?term mads:authoritativeLabel ?label .
    FILTER(CONTAINS(LCASE(?label), "gender"))
  }

label	term
Family members of transgender people	http://id.loc.gov/authorities/demographicTerms/dg2024060076
Gender minorities	http://id.loc.gov/authorities/demographicTerms/dg2015060398
Transgender people	http://id.loc.gov/authorities/demographicTerms/dg2015060006
Cisgender people	http://id.loc.gov/authorities/demographicTerms/dg2017060283
Gender studies teachers	http://id.loc.gov/authorities/demographicTerms/dg2017060049
Genderqueer people	http://id.loc.gov/authorities/demographicTerms/dg2022060046
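Checking one keyword at a time gets tedious. As a sketch, the same check can be restricted to the Social collection and widened to several keywords with REGEX; the keyword list here is purely illustrative, not the list of targeted terms.

```sparql
PREFIX mads: <http://www.loc.gov/mads/rdf/v1#>
PREFIX dt: <urn:demographicTerms/>

SELECT ?label ?term
FROM dt:20250321
WHERE {
  ?term mads:authoritativeLabel ?label ;
        mads:isMemberOfMADSCollection <http://id.loc.gov/authorities/demographicTerms/collection_LCDGT_Social> .
  # Illustrative keywords only; the "i" flag makes the match case-insensitive.
  # STR() strips any language tag before matching.
  FILTER(REGEX(STR(?label), "gender|gay|two-spirit", "i"))
}
ORDER BY ?label
```

Dropping the isMemberOfMADSCollection line widens the search back out to the whole vocabulary.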

What's changed?

The check I've been building up to, though, is to directly compare snapshots of the vocabulary taken one week apart. This is why I loaded the two versions into separate named graphs, and it requires slightly more verbose queries to compare them.

I thought about a few ways of doing this, but I've landed on this simple option which should nonetheless identify the most obvious kinds of vandalism: what (textual) labels are present in the earlier snapshot but absent in the later? This should catch both deleted and modified terms, which we can then potentially inspect further.

In SPARQL we can do this by querying the earlier graph for all its term labels, then keeping only those that do not exist in the later graph:

  <<common-prefixes>>

  SELECT ?term ?label
  FROM NAMED dt:20250314
  FROM NAMED dt:20250321
  WHERE {
    GRAPH dt:20250314 {
      ?term mads:authoritativeLabel ?label
    }
    FILTER NOT EXISTS {
      GRAPH dt:20250321 { ?x mads:authoritativeLabel ?label }
    }
  }
  LIMIT 200

In this case, this returns only one result:

term	label
http://id.loc.gov/authorities/demographicTerms/dg2022060384	Porto-Alegrenses

Running the same query reversed (what is present in the later but not the earlier graph?) also shows only one result, and it's for the same ID.

  <<common-prefixes>>

  SELECT ?term ?label
  FROM NAMED dt:20250314
  FROM NAMED dt:20250321
  WHERE {
    GRAPH dt:20250321 {
      ?term mads:authoritativeLabel ?label
    }
    FILTER NOT EXISTS {
      GRAPH dt:20250314 { ?x mads:authoritativeLabel ?label }
    }
  }
  LIMIT 200

term	label
http://id.loc.gov/authorities/demographicTerms/dg2022060384	Porto-alegrenses

This differs only in the case of the single letter "A", so we can reasonably assume that this is just a typo being corrected or something of that nature.
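A complementary sketch is to key the comparison on the term's IRI rather than its label, which pairs up the old and new labels directly and also surfaces terms whose label has vanished altogether. This is a variation on the queries above rather than a replacement, and the ?caseOnly column is an added convenience for flagging changes like this one:

```sparql
PREFIX mads: <http://www.loc.gov/mads/rdf/v1#>
PREFIX dt: <urn:demographicTerms/>

SELECT ?term ?oldLabel ?newLabel
       (LCASE(STR(?oldLabel)) = LCASE(STR(?newLabel)) AS ?caseOnly)
FROM NAMED dt:20250314
FROM NAMED dt:20250321
WHERE {
  GRAPH dt:20250314 { ?term mads:authoritativeLabel ?oldLabel }
  # Look for the same term in the later snapshot; it may be missing.
  OPTIONAL {
    GRAPH dt:20250321 { ?term mads:authoritativeLabel ?newLabel }
  }
  # Keep terms whose label is gone or whose text has changed.
  FILTER(!BOUND(?newLabel) || STR(?oldLabel) != STR(?newLabel))
}
ORDER BY ?term
```

A row with an empty ?newLabel means the term no longer has an authoritative label in the later snapshot, which is the case most worth a closer look.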

What's next?

So far, it looks like this particular vocabulary has not yet been defaced. That doesn't mean it won't be though, so I'll need to continue taking snapshots at regular intervals and repeat these tests to see if anything else changes. One follow-up task, then, is to better automate this so I can't forget to do it (I've already missed the 2025-03-28 update).

There are other vocabularies that we do know have already been changed. Library of Congress Subject Headings (LCSH) are used in English-language library catalogues around the world, so the recent widely-discussed changes of the terms for "Mexico, Gulf of" and "Mount Denali" to "America, Gulf of" and "Mount McKinley" respectively will have global impact.

I'm aware that these changes do not go unnoticed in the GLAM community, since many cataloguing professionals will routinely check updates to LCSH and friends before applying them to their own catalogue. I'm still doing this for my own interest, and also because a growing number of organisations simply don't have the capacity to do such checks.

LCSH is a much bigger dataset, and I plan to dig into that for my next post on this subject.

