Olaf Janssen, 8 June 2020
Dutch Wikipedia contains some 20,000 citations to Delpher, a digital historical newspaper archive in the Netherlands. Many of those URLs are sustainable, but until recently Dutch Wikipedia also contained thousands of non-future-proof links to Delpher. In this article I show how I found these links and replaced them with persistent URLs, and what the benefits of this search-and-replace operation are.
To help you follow the rest of this post, I’ll start with a short introduction to the two key players:
Delpher is a website containing over 100 million full-text digitized pages not only from Dutch historical newspapers, but also from books and periodicals. In this article I’ll focus on the newspapers in this archive.
The Koninklijke Bibliotheek (KB) is the national library of the Netherlands. Delpher is a service operated by the KB.
*****
In my previous article Detecting Wikipedia articles strongly based on single library collections, I explained how you can find Dutch Wikipedia articles that are based entirely or largely on content from Delpher.
I used the Massviews Analysis tool (Dutch: Analyse verzamelde weergaven), which helps you to find articles that contain links to (in this case) newspapers in Delpher. It takes a URL (or rather: a URL pattern, or base URL) as input, and returns a list of articles containing that base URL. In the example below, that base URL is https://www.delpher.nl/nl/kranten/view?query= (in that URL, ‘kranten’ is the Dutch word for newspapers).
This tool has one drawback, however: it tells you that an article contains links starting with that base URL, but not exactly which links these are, in other words what the full URLs look like.
Fortunately, there is another Wiki tool for this: External links search (Dutch: Externe koppelingen zoeken). If you enter the above base URL in that tool, you will get a list of all URLs starting with https://www.delpher.nl/nl/kranten/view?query=, and in which Dutch Wikipedia article each link appears.
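The same kind of lookup can also be scripted. The sketch below is not the tool itself but an illustration of how such a query might be sent to the MediaWiki API, assuming its `exturlusage` list module behaves like the External links search page. The function names and the User-Agent string are my own placeholders, and whether the API matches the `?query=` part of the base URL exactly as the web tool does is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# API endpoint of Dutch Wikipedia
API_ENDPOINT = "https://nl.wikipedia.org/w/api.php"


def build_exturlusage_url(base_url: str, limit: int = 50) -> str:
    """Build an API request URL listing pages that link to base_url.

    base_url must include its protocol, e.g. "https://www.delpher.nl/...".
    """
    # The API expects the protocol and the rest of the URL separately
    protocol, _, rest = base_url.partition("://")
    params = {
        "action": "query",
        "list": "exturlusage",
        "euprotocol": protocol,
        "euquery": rest,
        "euprop": "title|url",
        "eulimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def fetch_links(base_url: str, limit: int = 50):
    """Return (article title, full URL) pairs for one page of results."""
    request = Request(
        build_exturlusage_url(base_url, limit),
        # Placeholder User-Agent; replace with your own contact details
        headers={"User-Agent": "delpher-link-sketch/0.1 (example)"},
    )
    with urlopen(request) as response:
        data = json.load(response)
    return [(hit["title"], hit["url"]) for hit in data["query"]["exturlusage"]]
```

Calling `fetch_links("https://www.delpher.nl/nl/kranten/view?query=")` would return the first page of results only; a complete harvest would also have to follow the continuation values the API returns.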
To give an example: in line 2 about the Afro-Surinamers you see that this article contains the link https://www.delpher.nl/nl/kranten/view?query=%22Afro+Surinamers%22&coll=ddd&identifier=ddd:011187267:mpeg21:a0003&resultsidentifier=ddd:011187267:mpeg21:a0003. That link refers to the newspaper article ‘Black Power’ by Cyriel R. Karg in the Vrije Stem: onafhankelijk weekblad voor Suriname from 27-07-1970.
And, indeed, in the Wikipedia article about the Afro-Surinamers you see that newspaper article and URL listed at the bottom. Check!
What struck me when working with this tool was that there are many Delpher links in Wikipedia that are non-persistent. The above link is an example of such an unsustainable link. Fortunately, you can often find the sustainable (= persistent, permanent, future-proof) link from the Delpher interface. In this example, it is https://resolver.kb.nl/resolve?urn=ddd:011187267:mpeg21:a0003, as shown in the screenshot below.
Not only is this so-called resolver link (named after its host, resolver.kb.nl) more future-proof, it is also a lot shorter, more readable and ‘cleaner’ than the long, messy, non-permanent URL.
For these four reasons, it was a good idea to replace as many of those unsustainable URLs as possible with persistent resolver links. Since Wikipedia currently contains some 20,000 Delpher links, a completely manual approach was not an option, so some automation was necessary.
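The mapping from a non-persistent Delpher URL to its resolver link can be sketched in a few lines of Python. This is an illustration based on the single example above, not the actual script used: it assumes that the `identifier` query parameter always carries the value that goes after `urn=`, and the function name is hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

RESOLVER_BASE = "https://resolver.kb.nl/resolve?urn="


def to_resolver_url(delpher_url: str) -> Optional[str]:
    """Return the resolver link for a Delpher view URL, or None if it has no identifier."""
    parsed = urlparse(delpher_url)
    # Only touch links that actually point to Delpher
    if not parsed.netloc.endswith("delpher.nl"):
        return None
    # parse_qs also decodes percent-encoded values such as ddd%3A011187267...
    identifiers = parse_qs(parsed.query).get("identifier")
    if not identifiers:
        return None
    return RESOLVER_BASE + identifiers[0]


long_url = (
    "https://www.delpher.nl/nl/kranten/view?query=%22Afro+Surinamers%22"
    "&coll=ddd&identifier=ddd:011187267:mpeg21:a0003"
    "&resultsidentifier=ddd:011187267:mpeg21:a0003"
)
print(to_resolver_url(long_url))
# https://resolver.kb.nl/resolve?urn=ddd:011187267:mpeg21:a0003
```

Returning `None` for URLs without an identifier keeps such links out of any automatic replacement, so they can be reviewed by hand instead.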
Fortunately, back in 2015, following a request from the KB, the Wikipedian Merlijn van Deen wrote a blog post on how to semi-automatically replace old, dead links to KB sites with new, working ones in Wikipedia. The approach and scripts he shared back then (after being shelved for 5 years) suddenly came in very handy to get my job done!
I converted these scripts into two Jupyter notebooks and, using some Excel work and the replace.py routine from the Pywikibot framework, I managed to make hundreds of Delpher links more sustainable, as evidenced by the screenshot below.
The effects of these actions can be clearly seen in the source code of the affected articles, e.g. in the one about the Dutch actress Marijke Bakker. That article's source has now become a lot shorter, more readable and tidier (left column = source code before the action, right column = source code after):
Or in the article about Europees clubhonkbal (European club baseball). That cleans up nicely!
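The kind of rewrite visible in these before/after comparisons can be approximated with a regular-expression substitution over an article's wikitext. The sketch below is not the notebooks or the replace.py configuration I used; it is a simplified stand-in that assumes every replaceable link is a `/nl/kranten/view?` URL carrying an `identifier` parameter. The function name and the sample wikitext are made up for illustration.

```python
import re
from urllib.parse import unquote

RESOLVER_BASE = "https://resolver.kb.nl/resolve?urn="

# Characters that cannot be part of a bare URL in wikitext
URL_CHAR = r"[^\s\]|<>}]"

DELPHER_LINK = re.compile(
    r"https?://www\.delpher\.nl/nl/kranten/view\?"
    rf"(?:{URL_CHAR}*?&)?"           # optional earlier parameters, ending in '&'
    r"identifier=([^&\s\]|<>}]+)"    # the identifier itself (not 'resultsidentifier')
    rf"{URL_CHAR}*"                  # any remaining parameters
)


def rewrite_delpher_links(wikitext: str) -> str:
    """Replace Delpher newspaper view URLs in wikitext with resolver links."""
    return DELPHER_LINK.sub(
        lambda match: RESOLVER_BASE + unquote(match.group(1)), wikitext
    )


before = (
    "<ref>[https://www.delpher.nl/nl/kranten/view?query=%22Afro+Surinamers%22"
    "&coll=ddd&identifier=ddd:011187267:mpeg21:a0003"
    "&resultsidentifier=ddd:011187267:mpeg21:a0003 Black Power]</ref>"
)
print(rewrite_delpher_links(before))
# <ref>[https://resolver.kb.nl/resolve?urn=ddd:011187267:mpeg21:a0003 Black Power]</ref>
```

Links without an `identifier` parameter are left untouched, which matches the semi-automatic spirit of the operation: anything the pattern cannot handle stays as it is for manual inspection.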
To summarize, replacing non-persistent Delpher URLs with future-proof resolver links benefits both Wikipedia and the KB: the links are more sustainable, and the source code of the articles becomes shorter, more readable and tidier.
So far I’ve mainly worked on replacing links to Delpher (and specifically to newspapers). Once I’ve replaced all non-persistent newspaper links with their persistent equivalents, I’ll start working on the books and magazines in Delpher. After that I will continue to replace non-durable links to other KB services with their resolver equivalents.
A nice current example of the necessity of this is the recent change of the base URL of the KB catalogue. In June 2020 it changed from http://opc4.kb.nl to https://opc-kb.oclc.org/. For now, the KB offers a temporary redirect from the old to the new base URL, but once that stops, hundreds of KB catalogue links in Wikipedia (and other websites) are in danger of becoming unreachable.
To make those KB catalogue links really sustainable - after all, who will guarantee that https://opc-kb.oclc.org will still be available in 5 or 10 years’ time? - it is best to replace them as much as possible with their durable resolver counterparts. These always take formats like https://resolver.kb.nl/resolve?urn=PPN:376299290, with the PPN:number uniquely identifying the publication in the catalogue.
This has already been done for about 1200 such links (see screenshot below), but for the other links I will do this in collaboration with the Wikipedia community.
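For catalogue links the same idea applies: take the PPN out of the old URL and append it to the resolver base. The sketch below rests on an assumption that is not confirmed in this article, namely that old catalogue URLs contain the PPN as `PPN=<number>`. The sample URL shape and the function name are hypothetical; only the resolver format is taken from the text above.

```python
import re
from typing import Optional

PPN_RESOLVER_BASE = "https://resolver.kb.nl/resolve?urn=PPN:"

# Assumption: the PPN appears as 'PPN=' followed by digits, optionally ending in X
PPN_PATTERN = re.compile(r"PPN=(\d+[Xx]?)")


def catalogue_to_resolver(catalogue_url: str) -> Optional[str]:
    """Return the resolver link for a catalogue URL, or None if no PPN is found."""
    match = PPN_PATTERN.search(catalogue_url)
    if match is None:
        return None
    return PPN_RESOLVER_BASE + match.group(1).upper()


# Hypothetical old-style catalogue URL, used only to exercise the function
print(catalogue_to_resolver("http://opc4.kb.nl/DB=1/PPN?PPN=376299290"))
# https://resolver.kb.nl/resolve?urn=PPN:376299290
```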
Olaf Janssen is the Wikimedia coordinator of the KB, the national library of the Netherlands. He contributes to Wikipedia, Wikimedia Commons and Wikidata as User:OlafJanssen.
This article is available at Zenodo, Github and Wikimedia Commons. The text is available under the Creative Commons Attribution CC-BY 4.0 License.