KB-Wiki-Stats-Graphs

Detecting Wikipedia articles strongly based on single library collections

247 Dutch Wikipedia articles that wouldn’t be here without Delpher and DBNL, with 33,000 views each month

Olaf Janssen, 21 May 2020

In this post I will illustrate an approach to detect Wikipedia articles whose contents are fully or largely based on content from a single online source, such as a full-text digitized newspaper archive or a digital text library. Using Dutch Wikipedia I’ll track down 247 articles that owe their existence to Delpher and DBNL, two full-text collections operated by the KB, the national library of the Netherlands.

This approach might be relevant for GLAMs that have digital text collections used by the Wikipedia community for writing articles.


Three key players: Delpher, DBNL and KB

To understand the rest of this post, I’ll start with a short introduction of three key players:

Delpher is a website containing over 100 million full-text digitized pages from Dutch historical newspapers, books and periodicals.

DBNL is the Digital Library for Dutch Literature (Dutch: Digitale Bibliotheek voor de Nederlandse Letteren, DBNL), a website about Dutch language and Dutch literature. It contains thousands of literary texts, secondary literature and additional information, like biographies, portrayals etcetera, and hyperlinks.

The Koninklijke Bibliotheek (KB) is the national library of the Netherlands. Both Delpher and DBNL are services operated by the KB.


OK, let’s go: Quiz time!

What is the connection between a garbage man, a garbage bag and a garbage truck?

Or between the Dutch soccer players Cor van der Gijp, Gerrie ter Horst and Joop van Daele?

Or between Hotel Des Indes and the International Press Museum, both located in The Hague, The Netherlands?

Or between a children’s song book and the literary magazine ‘Forum’ (1932-1935)?

The answer:

The Dutch Wikipedia articles about these things probably wouldn’t be there without Delpher or DBNL. In other words: the content of these articles is fully or largely based on the contents of Delpher and/or DBNL. These articles owe their existence to the KB as the content supplier and to the Wikipedia community, which pieced together all those bits of Delpher/DBNL content into Wikipedia articles for millions of potential readers.

A more detailed look

Every two years I measure a number of indicators about the reach and reuse of KB collections via the Wikimedia platforms, most recently in February 2020. I would like to share one of the insights I gained from that analysis: Dutch Wikipedia contains dozens of articles that would not exist today without Delpher and/or DBNL.

To be more specific, last February I determined

Approach in 4 steps

During this measurement process I started to notice that there are quite a few Hotel Des Indes-like articles: articles containing a striking number of links to Delpher and/or DBNL. That triggered my curiosity, so I dug deeper and more systematically, in 4 steps.

Step 1: article lists

I started out by making an overview of all articles on Dutch Wikipedia containing one or more links to Delpher or DBNL. I did this using the Massviews Analysis tool, which takes a URL (or rather: a URL pattern, or base-URL) as input, and returns a list of articles containing that URL pattern. The screenshot below is based on the URL https://www.delpher.nl (click for live tool, might take some time)

I used this tool for all Delpher URLs (don’t forget the persistent KB-resolver base-URLs such as http://resolver.kb.nl/resolve?urn=ddd; see column 3 of this table for all base-URLs). I merged and de-duplicated the resulting article lists and converted the outcome to Excel. The final result is a list of approx. 6,800 articles containing one or more Delpher URLs.
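The merge and de-duplication step can be sketched in a few lines of Python. This is only an illustration, not the workflow used for this post (which relied on the Massviews Analysis tool). The helper names are made up, and the assumption is that the MediaWiki API module list=exturlusage is an acceptable alternative way to list articles that contain a given base-URL.

```python
from urllib.parse import urlencode

# Dutch Wikipedia API endpoint
API = "https://nl.wikipedia.org/w/api.php"


def exturlusage_url(pattern, protocol="https"):
    """Build a list=exturlusage query URL for one base-URL pattern.

    Hypothetical helper: it only builds the request URL, it does not
    fetch anything or handle result continuation.
    """
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": pattern,       # base-URL without the protocol part
        "euprotocol": protocol,
        "eunamespace": 0,         # articles only
        "eulimit": "max",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def merge_article_lists(*lists):
    """Merge several title lists, dropping duplicates, keeping first-seen order."""
    seen = {}
    for titles in lists:
        for title in titles:
            seen.setdefault(title, None)
    return list(seen)


# Made-up result lists for two base-URLs, for illustration only
from_delpher_nl = ["Hotel Des Indes", "Joop van Daele"]
from_resolver = ["Joop van Daele", "Cor van der Gijp"]

print(exturlusage_url("www.delpher.nl"))
print(merge_article_lists(from_delpher_nl, from_resolver))
```

An article that links to Delpher through more than one base-URL shows up in several result lists, which is why the merged list needs de-duplication before counting.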

I used a similar workflow for DBNL (URL patterns http(s)://*.dbnl.org), resulting in a list of just over 7,600 unique Wikipedia articles.

Step 2: external links

Once I had those article lists, for each article I determined which (and how many) external links it contains, and which of those links point to Delpher (or DBNL). I did this using the MediaWiki API and a Python script (for Delpher and for DBNL). In the screenshot below of the Delpher script you can see that filtering is done on the resolver base-URLs of the Delpher Newspapers subset.

This step eventually yields an Excel that (for Delpher) looks like this:

For example, the first article “…die_Revolutie_niet_begrepen!…” contains 16 external links, 9 of which point to Delpher.
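The counting itself can be sketched as follows. This is not the script used for this post, just a minimal illustration that assumes the external links of an article have already been retrieved as a list of URL strings (for instance via prop=extlinks of the MediaWiki API). The function name is made up, the tuple of base-URLs is deliberately incomplete, and the sample URLs are invented placeholders.

```python
# Two Delpher base-URLs mentioned in this post; the full list is longer
DELPHER_BASE_URLS = (
    "https://www.delpher.nl",
    "http://resolver.kb.nl/resolve?urn=ddd",
)


def count_links(external_links, base_urls=DELPHER_BASE_URLS):
    """Return (total number of external links, number that start with a base-URL)."""
    total = len(external_links)
    # str.startswith accepts a tuple of prefixes
    matching = sum(1 for url in external_links if url.startswith(base_urls))
    return total, matching


# Invented sample: 9 resolver links and 7 other links (placeholders, not real URLs)
sample_links = (
    [f"http://resolver.kb.nl/resolve?urn=ddd:placeholder-{i}" for i in range(9)]
    + [f"https://example.org/source-{i}" for i in range(7)]
)

print(count_links(sample_links))  # (16, 9)
```

Matching on URL prefixes mirrors the filtering on base-URLs described above; a real script would need the complete set of base-URLs, including any http and https variants.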

Step 3: link ratio

Because we are looking for articles that are entirely or largely based on contents from Delpher (or DBNL), it is useful to look at the so-called link ratio. That is the number of external links pointing to Delpher divided by the total number of external links in the article. For the example above, that is 9/16 ≈ 0.56. A link ratio of 1.00 means that all external links in an article are Delpher links. The lower the link ratio, the smaller the relative number of Delpher links in the article.
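As a small worked example, the link ratio can be computed like this, reading it as the number of Delpher links divided by the total number of external links (the reading under which 1.00 means that every external link is a Delpher link). The function name is made up, and returning 0.0 for an article without any external links is an assumption of this sketch.

```python
def link_ratio(total_links, source_links):
    """Share of an article's external links that point to the source collection."""
    if total_links == 0:
        # Assumption of this sketch: no external links means a ratio of 0.0
        return 0.0
    return source_links / total_links


# The example article: 16 external links, 9 of which point to Delpher
print(link_ratio(16, 9))    # 0.5625
print(link_ratio(12, 12))   # 1.0
```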

Step 4: threshold criteria

Next, to determine whether an article owes its existence largely to Delpher (or DBNL), I use two threshold criteria:

  1. The article must contain a minimum number of external links, as its content must be sufficiently based on external sources.
  2. The link ratio must exceed a certain threshold in order to have Delpher (or DBNL) as an external source sufficiently often.
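The two criteria can be combined into one filter, sketched below. The threshold values in this sketch are arbitrary placeholders chosen for illustration; they are not the values used in this post. The function name is also made up.

```python
# Placeholder thresholds, NOT the values used in this post
MIN_EXTERNAL_LINKS = 5
MIN_LINK_RATIO = 0.8


def is_aggregation_article(total_links, source_links,
                           min_links=MIN_EXTERNAL_LINKS,
                           min_ratio=MIN_LINK_RATIO):
    """Apply both criteria: enough external links, and a high enough link ratio."""
    if total_links == 0 or total_links < min_links:
        return False  # criterion 1: too few external links
    # criterion 2: the link ratio must exceed the threshold
    return source_links / total_links > min_ratio


# (total external links, links to the source collection), made-up rows
rows = [(16, 9), (10, 10), (2, 2)]
print([is_aggregation_article(total, source) for total, source in rows])
# [False, True, False]
```

The third row shows why the first criterion matters: a link ratio of 1.00 says little when an article has only two external links.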

There is some freedom in the choice of both thresholds. I have used the following:

This results in the following table for Delpher

Analysis

The articles found in this way are places where strong aggregation and republication of Delpher content takes place. In other words: These articles bring together information from Delpher related to people, places, events and other topics for a wide audience, as 80% of the Netherlands reads Wikipedia. The same goes for DBNL.

If you look at the lists of the ‘aggregation articles’ obtained in this way, you see

For Delpher

For DBNL

33,000 views every month

All very well, these Wikipedia articles heavily based on Delpher and/or DBNL, but are they actually read by the public? I also looked into that.

For each article, the Massviews Analysis tool mentioned above also gives the number of requests (see the Pageviews column) during a certain period. In this case that period is almost 2 years, from 21 February 2018 to 5 February 2020.

This allows us to determine the total number of requests for these 193 Delpher and 54 DBNL aggregation articles during those two years.

In total, this amounts to 789,534 page views in 2 years, or an average of about 33,000 requests per month.
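The arithmetic behind the monthly average can be checked with a short calculation. Treating the period as a round 24 months is an assumption of this sketch; the second calculation uses the exact number of days between the two dates and an average month length of 365.25 / 12 days.

```python
from datetime import date

total_views = 789_534
delpher_articles, dbnl_articles = 193, 54

# Number of aggregation articles in total
print(delpher_articles + dbnl_articles)  # 247

# Simplification: treat the period as 2 full years
print(round(total_views / 24))  # 32897

# Alternative: use the exact length of the measurement period
start, end = date(2018, 2, 21), date(2020, 2, 5)
days = (end - start).days
months_exact = days / (365.25 / 12)
print(days, round(total_views / months_exact))
```

The 24-month simplification reproduces the rounded monthly figure quoted above; the exact period is slightly shorter than 24 months, so that variant comes out a little higher.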

Raw data

The approach described above is also explained on Dutch Wikipedia. The Excels from which the above screenshots were created are available here on GitHub:

About the author

Olaf Janssen is the Wikimedia coordinator of the KB, the national library of the Netherlands. He contributes to Wikipedia, Wikimedia Commons and Wikidata as User:OlafJanssen

Reusing this article

The text of this article is available under the Creative Commons Attribution CC-BY 4.0 License.