Technical notes (under construction)
Latest update: xx May 2025
Work on this:
This page gives more information about:
- The scripts ‘extract_copyright_templates.py’ and ‘template_usage_summary.py’
- The data files in the data folder
- The data visualisation from Datawrapper, created via the Datawrapper API
- …xxx
Key Features of script ‘template_usage_summary.py’:
TO ADD
- Uses the MediaWiki API to search for Commons files in the desired category.
- Fetches the raw wikitext of each file page.
- Isolates wrapper templates like , , , and .
- Extracts relevant templates from top-level usage or embedded fields like:
- Handles multiline and nested template values reliably.
- Extracts a simplified creation date from various formats:
- Supports date formats: YYYY, YYYY-MM, YYYY-MM-DD
- Returns the most recent valid year if multiple are present.
- Excludes known irrelevant templates via a robust filtering system.
- Outputs results to:
- Console (one line per file with all extracted info)
- Excel file (*_commons_templates_output_<date>.xlsx) with URLs and linked templates
- Excel file (*_commons_templates_output_<date>-cleaned.xlsx): a manually cleaned version of the first file. Non-copyright templates, incorrect dates and other ‘noise’ that the Python script did not filter out have been removed by hand as a post-processing step.
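The nested-template handling and date simplification described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names, the exclusion set and the rule for what counts as a valid year are all assumptions made for the example.

```python
import re
from datetime import date

# Hypothetical exclusion list; the script's real filter is not shown in these notes.
EXCLUDED_TEMPLATES = {"information", "artwork"}

# YYYY, optionally followed by -MM and -DD, not embedded in a longer digit run.
DATE_PATTERN = re.compile(r"(?<!\d)(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?(?!\d)")


def iter_template_bodies(wikitext):
    """Yield the inner text of every {{...}} pair, innermost templates first."""
    open_positions = []
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            open_positions.append(i + 2)
            i += 2
        elif pair == "}}" and open_positions:
            yield wikitext[open_positions.pop():i]
            i += 2
        else:
            i += 1


def template_names(wikitext):
    """Return unique template names, skipping the (assumed) excluded ones."""
    names = []
    for body in iter_template_bodies(wikitext):
        # The name is everything before the first pipe, also across line breaks.
        name = body.split("|", 1)[0].strip()
        if name and name.lower() not in EXCLUDED_TEMPLATES and name not in names:
            names.append(name)
    return names


def simplify_date(raw_value, max_year=None):
    """Return the most recent plausible year in raw_value, or None."""
    if max_year is None:
        max_year = date.today().year
    years = []
    for match in DATE_PATTERN.finditer(raw_value):
        year = int(match.group(1))
        month = int(match.group(2)) if match.group(2) else 1
        day = int(match.group(3)) if match.group(3) else 1
        # Assumed validity rule: plausible year, month 1-12, day 1-31.
        if 1000 <= year <= max_year and 1 <= month <= 12 and 1 <= day <= 31:
            years.append(year)
    return max(years) if years else None
```

The scanner tracks opening braces on a stack, so a template nested inside a field value (for example a licence template inside a permission field) is found as well as the wrapper around it. It does not treat triple-brace parameters specially, which real wikitext may require.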
Output:
- File URL
- Number of detected templates
- Simplified creation or publication date
- Template names and links to their Commons documentation pages
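As an illustration of the output fields listed above, one row per file could be assembled roughly like this. The column names and the helper functions are hypothetical; only the general shape (file URL, template count, date, template names with links to their Commons pages) follows the list.

```python
from urllib.parse import quote

COMMONS_WIKI = "https://commons.wikimedia.org/wiki/"


def commons_url(title):
    """Build a Commons page URL from a page title such as 'File:Example.jpg'."""
    return COMMONS_WIKI + quote(title.strip().replace(" ", "_"), safe=":/(),")


def template_link(name):
    """Link to the Commons page of a template."""
    return commons_url("Template:" + name)


def build_row(file_title, year, templates):
    """Assemble one output row; the column names are assumptions."""
    row = {
        "File URL": commons_url(file_title),
        "Number of templates": len(templates),
        "Date": year,
    }
    for index, name in enumerate(templates, start=1):
        row[f"Template {index}"] = name
        row[f"Template {index} link"] = template_link(name)
    return row
```

A list of such rows can be passed to pandas.DataFrame(rows) and written with DataFrame.to_excel(path, index=False), which uses openpyxl for .xlsx files.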
Dependencies:
- Python 3.7+
- requests, pandas, openpyxl (re is part of the Python standard library)
See also
- The same code as a notebook on PAWS: https://hub-paws.wmcloud.org/user/OlafJanssen/lab/tree/MediaFromDelpher_ExtractCopyrightTemplates/extract_copyright_templates.ipynb
Author:
- Olaf Janssen, Wikimedia coordinator at KB, national library of the Netherlands (via ChatGPT)
- Last updated: 9 April 2025
- User-Agent: OlafJanssenBot/1.0
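The User-Agent above identifies the script to the MediaWiki API. A minimal sketch of how the two kinds of requests (listing the files in a category and fetching the raw wikitext of a file page) might be built is given below. It uses only the standard library so that it runs without extra packages, whereas the script itself depends on requests. The choice of list=categorymembers and the function names are assumptions; the script may query the API differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://commons.wikimedia.org/w/api.php"
USER_AGENT = "OlafJanssenBot/1.0"


def api_request(params):
    """Build a GET request to the Commons API with the bot's User-Agent."""
    return Request(f"{API_URL}?{urlencode(params)}",
                   headers={"User-Agent": USER_AGENT})


def category_files_request(category, limit=50):
    """Request the file pages in a category (assumed: list=categorymembers)."""
    return api_request({
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",
        "cmlimit": limit,
        "format": "json",
    })


def wikitext_request(file_title):
    """Request the raw wikitext of the latest revision of one page."""
    return api_request({
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": file_title,
        "format": "json",
    })


def fetch_json(request):
    """Send a prepared request and decode the JSON response."""
    with urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling fetch_json(category_files_request("Example category")) would perform the actual network call; "Example category" is a placeholder, not a category named in these notes.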
License:
This script is released under CC0, which dedicates it to the public domain. You may freely use, adapt, and redistribute it.