On fooling around with triples
Content Note
If you've been following the current political situation in the US at all, you'll be aware that the dude in the Oval Office has been issuing executive orders left, right and centre. Much damage is being done to the country, its people and its infrastructure that will take generations to repair, but as someone watching from an ocean away it would be easy to believe it won't affect me.
It will.
Not only will any success experienced by fascism in the US embolden fascists around the world, but the US is so culturally and economically dominant that anything happening there will inevitably impact the rest of us. "When America sneezes, the rest of the world catches a cold."
If this all seems like a weird way to introduce a post about me figuring out how to query and manipulate RDF metadata, well, it is. I do intend to write more about this soon, but for now I'll just say that the question I'm interested in is this:
Is the linked data published by the Library of Congress being materially changed as a result of the aforementioned flurry of executive orders?
If you think this sounds hypothetical, well, the US National Cancer Institute has already begun removing certain gender- and sexuality-related terms from its linked data thesaurus. I like to believe that, for reasons of both professional integrity and institutional inertia, it will take much longer before similar changes are made to vocabularies like the Library of Congress Subject Headings (LCSH). These are resources used in cultural heritage institutions all over the world, though, so if & when that happens the impact will be widespread.
The setup
I eventually want to be able to do this kind of analysis on "the big one", LCSH. But that really is big and queries against it will take … a while … so I've chosen to start with something smaller, the Library of Congress Demographic Group Terms. I chose this in particular because it's a reasonable size but also contains terms that appear on a list of those that government departments are being instructed to remove, so it seems likely to be a target of that censorship.
This data is all in the form of RDF triples. This post isn't the place for a full explanation of what that means, but in brief, a triple is a machine-readable statement of some property of a subject in the form:
<subject> <predicate> <object>
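To make that shape concrete, here is a minimal sketch in plain Python rather than a real RDF library. The `Triple` class and `to_ntriples` helper are names invented for this illustration, and the sketch assumes the object is always a plain string, ignoring language tags and datatypes. The statement itself (the label for dg2015060006) is one that turns up in the query results later in this post.

```python
from typing import NamedTuple

MADS = "http://www.loc.gov/mads/rdf/v1#"
TERMS = "http://id.loc.gov/authorities/demographicTerms/"


class Triple(NamedTuple):
    """One statement: <subject> <predicate> <object>."""
    subject: str    # IRI of the thing being described
    predicate: str  # IRI of the property being stated
    obj: str        # assumption of this sketch: always a plain string literal


def to_ntriples(t: Triple) -> str:
    """Render the triple in N-Triples-like syntax, escaping backslashes and quotes."""
    literal = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{literal}" .'


statement = Triple(TERMS + "dg2015060006", MADS + "authoritativeLabel", "Transgender people")
print(to_ntriples(statement))
```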
So I have two versions of the same RDF ontology, and I want to know what statements have been removed or altered from one to the other. I could process the raw files myself as text but: 1) I would eventually end up writing the bits of a triple-parser and store I need myself, from scratch, which would be a waste of time; and 2) I actually would like to learn more about the key technologies involved.
To work with RDF data, I needed a specialised database called a triplestore, which is optimised for running queries against this type of graph data using a query language whimsically named SPARQL. I didn't give this too much thought, and picked Apache Jena Fuseki because it was open source, was available as a package in my Linux distribution and worked first time when I tried starting it. Other triplestores are available.
The basic unit of a triplestore is a "graph", equivalent to a "database" in a database management system. Because these are two versions of the same data, I can't just load them both into the same graph: they overlap so heavily that the second would largely replace the first, leaving me no way to compare them. What I can do is load them into separate "named graphs", which I can then refer to explicitly in my queries. I've called the graphs for the two versions urn:demographicTerms/20250314 and urn:demographicTerms/20250321, for the dates I downloaded them.
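As a rough illustration of why the separation matters, the sketch below models a graph as a plain Python set of tuples. This is a toy model, not how Fuseki stores data, and the `urn:example:` terms and their labels are made up for the example. Only the two graph names come from the setup above.

```python
LABEL = "http://www.loc.gov/mads/rdf/v1#authoritativeLabel"

# Hypothetical stand-ins for the two snapshots (made-up terms and labels).
snapshot_0314 = {
    ("urn:example:term1", LABEL, "Unchanged label"),
    ("urn:example:term2", LABEL, "Old label"),
}
snapshot_0321 = {
    ("urn:example:term1", LABEL, "Unchanged label"),
    ("urn:example:term2", LABEL, "New label"),
}

# One shared graph: statements pile into a single set and lose their origin.
single_graph = snapshot_0314 | snapshot_0321

# Named graphs: each snapshot stays addressable under its own name.
named_graphs = {
    "urn:demographicTerms/20250314": snapshot_0314,
    "urn:demographicTerms/20250321": snapshot_0321,
}


def graphs_containing(triple):
    """Return the sorted names of every named graph that holds the given triple."""
    return sorted(name for name, graph in named_graphs.items() if triple in graph)


print(len(single_graph))
print(graphs_containing(("urn:example:term2", LABEL, "Old label")))
```

In the shared set there is no longer any way to tell which snapshot a statement came from, whereas the named graphs can still answer that question.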
Since RDF triples tend to use URIs (technically IRIs) all over the place, which tend to be quite verbose, SPARQL allows you to define prefixes to make queries less verbose. Here are a few that might be useful:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX mads: <http://www.loc.gov/mads/rdf/v1#>
PREFIX ri: <http://id.loc.gov/ontologies/RecordInfo#>
I'll also define my own prefix, dt:, to make it easier to refer to the two named graphs:
PREFIX dt: <urn:demographicTerms/>
You'll see these abbreviated to <<common-prefixes>> in the code below. With that set up, let's dive into the data.
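For anyone unsure what a prefix actually does, it is plain string substitution. The following Python sketch mimics it using the prefixes defined above. The `expand` function is a name invented here, and it skips the finer points of SPARQL's prefixed-name grammar.

```python
# The prefix table from this post, as a plain dictionary.
PREFIXES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "mads": "http://www.loc.gov/mads/rdf/v1#",
    "ri": "http://id.loc.gov/ontologies/RecordInfo#",
    "dt": "urn:demographicTerms/",
}


def expand(prefixed_name: str) -> str:
    """Turn 'prefix:local' into a full IRI by swapping the prefix for its namespace."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


print(expand("mads:authoritativeLabel"))
print(expand("dt:20250314"))
```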
Getting situated
To make sure that I'm successfully querying the right place, and get an idea of what data is in there, let's pull out a list of all the unique predicates: the things that are used to make statements about other things.
<<common-prefixes>>
SELECT DISTINCT ?predicate
FROM dt:20250314
WHERE {
[] ?predicate []
}
ORDER BY ?predicate
predicate |
---|
http://id.loc.gov/ontologies/RecordInfo#languageOfCataloging |
http://id.loc.gov/ontologies/RecordInfo#recordChangeDate |
http://id.loc.gov/ontologies/RecordInfo#recordContentSource |
http://id.loc.gov/ontologies/RecordInfo#recordStatus |
http://id.loc.gov/ontologies/bflc/marcKey |
http://id.loc.gov/vocabulary/identifiers/lccn |
http://www.loc.gov/mads/rdf/v1#adminMetadata |
http://www.loc.gov/mads/rdf/v1#authoritativeLabel |
http://www.loc.gov/mads/rdf/v1#citationNote |
http://www.loc.gov/mads/rdf/v1#citationSource |
http://www.loc.gov/mads/rdf/v1#citationStatus |
http://www.loc.gov/mads/rdf/v1#deletionNote |
http://www.loc.gov/mads/rdf/v1#elementList |
http://www.loc.gov/mads/rdf/v1#elementValue |
http://www.loc.gov/mads/rdf/v1#exampleNote |
http://www.loc.gov/mads/rdf/v1#hasBroaderAuthority |
http://www.loc.gov/mads/rdf/v1#hasEarlierEstablishedForm |
http://www.loc.gov/mads/rdf/v1#hasMADSCollectionMember |
http://www.loc.gov/mads/rdf/v1#hasNarrowerAuthority |
http://www.loc.gov/mads/rdf/v1#hasReciprocalAuthority |
http://www.loc.gov/mads/rdf/v1#hasSource |
http://www.loc.gov/mads/rdf/v1#hasVariant |
http://www.loc.gov/mads/rdf/v1#historyNote |
http://www.loc.gov/mads/rdf/v1#isMemberOfMADSCollection |
http://www.loc.gov/mads/rdf/v1#isMemberOfMADSScheme |
http://www.loc.gov/mads/rdf/v1#note |
http://www.loc.gov/mads/rdf/v1#variantLabel |
http://www.w3.org/1999/02/22-rdf-syntax-ns#first |
http://www.w3.org/1999/02/22-rdf-syntax-ns#rest |
http://www.w3.org/1999/02/22-rdf-syntax-ns#type |
http://www.w3.org/2000/01/rdf-schema#comment |
http://www.w3.org/2000/01/rdf-schema#label |
This gives me a little confidence that we are at least looking at the right sort of data, and also gives hints about the structure of the dataset. Now we can start digging in a bit more deeply.
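If the SPARQL above looks opaque, the same idea can be sketched in a few lines of Python over an in-memory list of tuples: ignore subject and object, collect the middle element, deduplicate and sort. The sample triples and the `urn:example:` subjects are made up for illustration. Only the predicate IRIs are taken from the list above.

```python
MADS = "http://www.loc.gov/mads/rdf/v1#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample data: (subject, predicate, object) tuples.
triples = [
    ("urn:example:term1", RDF + "type", MADS + "Authority"),
    ("urn:example:term1", MADS + "authoritativeLabel", "First label"),
    ("urn:example:term2", RDF + "type", MADS + "Authority"),
    ("urn:example:term2", MADS + "authoritativeLabel", "Second label"),
    ("urn:example:term2", MADS + "note", "A note"),
]


def distinct_predicates(triples):
    """Rough equivalent of SELECT DISTINCT ?predicate ... ORDER BY ?predicate."""
    return sorted({predicate for _subject, predicate, _obj in triples})


for predicate in distinct_predicates(triples):
    print(predicate)
```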
Deprecated records
In common with a lot of such vocabularies, those produced by LoC are generally quite scrupulous about retaining deleted entries but clearly marking them as such, rather than simply dropping them from the dataset. This is important: if you make use of a term that is subsequently deprecated, you need to know when that happened and why to understand how to appropriately update your own records, and historical records will likely still reference terms that have since been deleted but still need to be interpreted.
Let's take a look at what's been officially deprecated from this vocabulary:
<<common-prefixes>>
SELECT ?label ?delDate ?note FROM dt:20250321
WHERE {
?x rdf:type mads:Authority .
?x mads:deletionNote ?delNote .
?x mads:adminMetadata ?md .
?md ri:recordChangeDate ?delDate .
?md ri:recordStatus "deprecated" .
# OPTIONAL follows the required patterns so it can only add a label, never restrict the matches
OPTIONAL { ?x mads:variantLabel ?label } .
BIND(REPLACE(?delNote, "\n", " ") AS ?note)
}
ORDER BY ?delDate
label | delDate | note |
---|---|---|
Prisoners of war | 2021-01-22T15:39:00 | This authority record was deleted because it was created in error. |
Concentration camp inmates | 2022-11-09T12:41:19 | This authority record has been deleted because the demographic group term is covered by the demographic group terms {Internment camp inmates} (DLC)dg2022060247 and {Nazi concentration camp inmates} (DLC)dg2022060248 |
Parents of autistics | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Family members of autistics} (DLC)dg2024060011 and {Parents} (DLC)dg2015060230 |
Parents of mass murderers | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Family members of mass murderers} (DLC)dg2024060075 and {Parents} (DLC)dg2015060230 |
Politicians' partners | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Family members of politicians} (DLC)dg2024060009 {Spouses} (DLC)dg2024060004 and {Unmarried partners} (DLC)dg2024060005 |
Cancer patients' partners | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Family members of cancer patients} (DLC)dg2024060007 {Spouses} (DLC)dg2024060004 and {Unmarried partners} (DLC)dg2024060005 |
Parkinson's disease patients' partners | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Family members of Parkinson's disease patients} (DLC)dg2024060008 {Spouses} (DLC)dg2024060004 and {Unmarried partners} (DLC)dg2024060005 |
Parents of transgender people | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Family members of transgender people} (DLC)dg2024060076 and {Parents} (DLC)dg2015060230 |
Parents of dyslexics | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Family members of dyslexics} (DLC)dg2024060073 and {Parents} (DLC)dg2015060230 |
Partners (Spouses) | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Spouses} (DLC)dg2024060004 and {Unmarried partners} (DLC)dg2024060005 |
Parents of alcoholics | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Family members of alcoholics} (DLC)dg2024060010 and {Parents} (DLC)dg2015060230 |
Parents of gays | 2024-06-06T12:09:08 | This authority record has been deleted because the heading is covered by the headings {Family members of gay people} (DLC)dg2024060074 and {Parents} (DLC)dg2015060230 |
United States Air Force officers | 2024-08-21T13:21:46 | This authority record has been deleted because it is not a valid construction. |
United States Coast Guard officers | 2024-08-21T13:21:46 | This authority record has been deleted because it is not a valid construction. |
Junior high school students | 2025-01-24T13:29:48 | This authority record has been deleted because the heading is covered by the heading {Middle school students} (DLC)dg2015060024 |
There's actually nothing problematic here, as far as I can tell. Yes, some terms have been removed over time but this is because they are covered by some other combination (e.g. Parents of dyslexics), weren't correctly constructed to start with (e.g. United States Coast Guard officers), or were duplicates (e.g. Junior high school students).
Let's move on…
Social terms (includes gender and sexuality)
There are several subsets defined within the full set of terms. One in particular, http://id.loc.gov/authorities/demographicTerms/collection_LCDGT_Social, contains terms referring to social groups, which includes various categories which are actively threatened by the current regime. These seem worth looking over to see if there's anything amiss.
<<common-prefixes>>
SELECT * FROM dt:20250321
WHERE {
?term mads:authoritativeLabel ?label .
?term mads:isMemberOfMADSCollection <http://id.loc.gov/authorities/demographicTerms/collection_LCDGT_Social>
}
LIMIT 20
I've limited the number of results here to 20, but I've scanned through the full collection of 325 and all the terms I'd expect to see removed (relating to sexuality & gender identity, for example) are still present. For example:
<<common-prefixes>>
SELECT * FROM dt:20250321
WHERE {
?term mads:authoritativeLabel ?label .
FILTER(CONTAINS(LCASE(?label), "gender"))
}
label | term |
---|---|
Family members of transgender people | http://id.loc.gov/authorities/demographicTerms/dg2024060076 |
Gender minorities | http://id.loc.gov/authorities/demographicTerms/dg2015060398 |
Transgender people | http://id.loc.gov/authorities/demographicTerms/dg2015060006 |
Cisgender people | http://id.loc.gov/authorities/demographicTerms/dg2017060283 |
Gender studies teachers | http://id.loc.gov/authorities/demographicTerms/dg2017060049 |
Genderqueer people | http://id.loc.gov/authorities/demographicTerms/dg2022060046 |
What's changed?
The check I've been building up to, though, is to directly compare snapshots of the vocabulary taken one week apart. This is why I uploaded two versions into named graphs, though it does require slightly more verbose queries.
I thought about a few ways of doing this, but I've landed on this simple option which should nonetheless identify the most obvious kinds of vandalism: what (textual) labels are present in the earlier snapshot but absent in the later? This should catch both deleted and modified terms, which we can then potentially inspect further.
In SPARQL we can do this by querying the earlier graph for all its term labels, then keeping only those that do not exist in the later graph:
<<common-prefixes>>
SELECT ?term ?label
FROM NAMED dt:20250314
FROM NAMED dt:20250321
WHERE {
GRAPH dt:20250314 {
?term mads:authoritativeLabel ?label
}
FILTER NOT EXISTS {
GRAPH dt:20250321 { ?x mads:authoritativeLabel ?label }
}
}
LIMIT 200
In this case, this returns only one result:
term | label |
---|---|
http://id.loc.gov/authorities/demographicTerms/dg2022060384 | Porto-Alegrenses |
Running the same query reversed (what is present in the later but not the earlier graph?) also shows only one result, and it's for the same ID.
<<common-prefixes>>
SELECT ?term ?label
FROM NAMED dt:20250314
FROM NAMED dt:20250321
WHERE {
GRAPH dt:20250321 {
?term mads:authoritativeLabel ?label
}
FILTER NOT EXISTS {
GRAPH dt:20250314 { ?x mads:authoritativeLabel ?label }
}
}
LIMIT 200
term | label |
---|---|
http://id.loc.gov/authorities/demographicTerms/dg2022060384 | Porto-alegrenses |
This differs only in the case of the single letter "A", so we can reasonably assume that this is just a typo being corrected or something of that nature.
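The logic of the two queries is an "anti-join": keep each label from one snapshot whose exact text appears nowhere in the other. Here is a minimal Python sketch of that logic, plus a helper for spotting case-only changes. It assumes each snapshot has already been reduced to a set of (term, label) pairs, and the function names are invented for this example. The single changed term is the one reported above. The second pair stands in for everything that stayed the same.

```python
TERMS = "http://id.loc.gov/authorities/demographicTerms/"

# (term, label) pairs standing in for the two snapshots.
earlier = {
    (TERMS + "dg2022060384", "Porto-Alegrenses"),
    (TERMS + "dg2015060006", "Transgender people"),
}
later = {
    (TERMS + "dg2022060384", "Porto-alegrenses"),
    (TERMS + "dg2015060006", "Transgender people"),
}


def labels_missing_from(source, target):
    """Pairs in source whose label text appears on no term at all in target."""
    target_labels = {label for _term, label in target}
    return {(term, label) for term, label in source if label not in target_labels}


def only_case_differs(a, b):
    """True when two labels are different strings but equal ignoring case."""
    return a != b and a.casefold() == b.casefold()


removed = labels_missing_from(earlier, later)  # first query
added = labels_missing_from(later, earlier)    # reversed query
print(removed)
print(added)
print(only_case_differs("Porto-Alegrenses", "Porto-alegrenses"))
```

Like the queries, this compares label text regardless of which term carries it, so a label that moved from one term to another would not show up.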
What's next?
So far, it looks like this particular vocabulary has not yet been defaced. That doesn't mean it won't be though, so I'll need to continue taking snapshots at regular intervals and repeat these tests to see if anything else changes. One follow-up task, then, is to better automate this so I can't forget to do it (I've already missed the 2025-03-28 update).
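As a starting point for that automation, here is a hedged Python sketch of two small building blocks: naming a graph after its snapshot date, and working out which snapshots have been missed. It assumes a weekly cadence, and the function names are invented here. The `upload_url` helper further assumes a local Fuseki on port 3030, a dataset called `lcdgt` (hypothetical) and the usual `/data` Graph Store Protocol endpoint, none of which comes from the setup described in this post. Nothing here actually downloads or uploads anything.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

GRAPH_PREFIX = "urn:demographicTerms/"


def graph_name(snapshot_date):
    """Name a graph after its snapshot date, e.g. urn:demographicTerms/20250321."""
    return GRAPH_PREFIX + snapshot_date.strftime("%Y%m%d")


def upload_url(base, dataset, snapshot_date):
    """Graph Store Protocol address for one named graph (endpoint layout is an assumption)."""
    return f"{base}/{dataset}/data?" + urlencode({"graph": graph_name(snapshot_date)})


def missed_snapshots(taken, today, interval=timedelta(days=7)):
    """Expected snapshot dates after the latest one taken, up to and including today."""
    missed = []
    due = max(taken) + interval
    while due <= today:
        missed.append(due)
        due += interval
    return missed


taken = [date(2025, 3, 14), date(2025, 3, 21)]
for day in missed_snapshots(taken, today=date(2025, 4, 1)):
    print(graph_name(day), upload_url("http://localhost:3030", "lcdgt", day))
```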
There are other vocabularies that we do know have already been changed. Library of Congress Subject Headings (LCSH) are used in English-language library catalogues around the world, so the recent widely-discussed changes of the terms for "Mexico, Gulf of" and "Mount Denali" to "America, Gulf of" and "Mount McKinley" respectively will have global impact.
I'm aware that these changes do not go unnoticed in the GLAM community, since many cataloguing professionals will routinely check updates to LCSH and friends before applying them to their own catalogue. I'm still doing this for my own interest, and also because a growing number of organisations simply don't have the capacity to do such checks.
LCSH is a much bigger dataset, and I plan to dig into that for my next post on this subject.