Projekt:Kunskap i krissituationer 2021/SCB

Background

In July 2021, Statistics Sweden (SCB) released their open data under the license CC0. This is the most permissive of the Creative Commons licenses, as it allows anyone to spread and re-use the data without having to attribute the source. Most importantly for us, it makes SCB's data compatible with Wikidata. As Wikidata itself is licensed CC0, only data with a compatible license can be imported into it.

Wikimedia Sverige has worked with the SCB team for Agenda 2030 and SDG data to see which data on SDG indicators could be transferred to Wikidata.

In order to investigate the available data and test a workflow to upload it to the Wikimedia platforms, we selected the dataset Population aged 15-74 (LFS) by sex, age and labour status. Year 1970 - 2020. This is one of the indicators under SDG 8, "Promote sustained, inclusive and sustainable economic growth, full and productive employment and decent work for all".

Implementation

The Wikimedia platforms offer two ways of storing data:

  • statements on Wikidata items
  • tabular data on Wikimedia Commons

We tested both of them in order to investigate and compare the advantages and drawbacks of each approach.

We downloaded Sweden's unemployment rate from 1970 to 2020. The data was available in several formats, of which we used CSV.

The layout of the source data was an issue: years were used as column headers, which made the table awkward to work with. We transposed the data using Google Sheets, so that the columns switched places with the rows, as demonstrated in this example:

Before:

        2020   2019   2018…
women   xx     xx     xx
men     xx     xx     xx
total   xx     xx     xx

After:

        women  men    total
2020    xx     xx     xx
2019    xx     xx     xx
2018…   xx     xx     xx
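
The same transposition can also be done in a few lines of code instead of a spreadsheet. The following is only a minimal sketch, not the workflow we used: the function name `transpose_csv`, the header label `sex`, and the cell values are placeholders, and it assumes every row has the same number of cells.

```python
import csv
import io


def transpose_csv(text: str) -> str:
    """Swap the rows and columns of a small, rectangular CSV table."""
    rows = list(csv.reader(io.StringIO(text)))
    # zip(*rows) groups the i-th cell of every row, i.e. it reads the
    # table column by column. Ragged rows would be silently truncated.
    transposed = zip(*rows)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(transposed)
    return out.getvalue()


# Placeholder cells (a..f) stand in for the real percentages.
sample = "sex,2020,2019\nwomen,a,b\nmen,c,d\ntotal,e,f\n"
print(transpose_csv(sample))
```

After the call, the years form the first column and each of women, men, and total gets a column of its own.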

Wikidata

Wikidata has a number of properties related to statistics and demography, including P1198 unemployment rate. We used it to add data about the percentage of unemployed people (from 1970 to 2020) to the item of Sweden, Q34. Each claim included the qualifier P585 point in time.
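
To make the shape of these statements concrete, the sketch below models one claim per year as a plain Python object. It does not talk to Wikidata and is not the tool we used for the upload; the class `Claim`, the function `build_claims`, and the rates passed in are hypothetical placeholders. Only the identifiers Q34, P1198, and P585 come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    item: str           # item the statement is attached to, e.g. "Q34" (Sweden)
    prop: str           # main property, here "P1198" (unemployment rate)
    value: float        # the rate, in percent
    qualifier: str      # qualifier property, here "P585" (point in time)
    point_in_time: int  # the year the value refers to


def build_claims(rates_by_year: dict[int, float]) -> list[Claim]:
    """Create one claim per year, ordered chronologically."""
    return [
        Claim("Q34", "P1198", rate, "P585", year)
        for year, rate in sorted(rates_by_year.items())
    ]


# Placeholder rates, not SCB figures.
for claim in build_claims({2020: 2.0, 2019: 1.0}):
    print(claim)
```

Each yearly data point becomes a separate statement on the same item, which is why the number of statements grows with the length of the time series.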

Tabular data on Wikimedia Commons

Even though Wikimedia Commons is used mostly to store media files, such as images and video clips (there are over 70 million files there), it also has a separate namespace for tabular data. Tabular data allows users to create CSV-like tables of data and use them from other wikis to create automatic tables, lists, and graphs.

The tabular data namespace is not heavily utilized. As of now, many of the datasets are related to the Covid pandemic, but it has also been used for population data, as well as miscellaneous data such as results of election opinion polls and numbers of visitors to national parks.

We uploaded the unemployment data divided by gender as tabular data on Wikimedia Commons. This is a larger dataset than the one we uploaded to Wikidata, which contains only the numbers for the whole population.

In order to upload tabular data on Commons, it has to be converted from a table to a JSON file that includes fields like copyright license, source, and definitions of the columns. The conversion of the actual data from the source CSV file (transposed as described above) to JSON lists was made in a text editor by replacing some characters in every row with JSON syntax characters, as shown in this example:

CSV row             JSON row
2020,8.6,8.4,8.5    ["2020",8.6,8.4,8.5],
2019,7.6,7.4,7.5    ["2019",7.6,7.4,7.5]
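
The manual search-and-replace can be scripted as well. The sketch below is an illustration under stated assumptions, not the procedure we followed: the function name `csv_to_tabular_json`, the field names, the description, and the source string are all made up for the example, and the top-level keys (`license`, `description`, `sources`, `schema`, `data`) reflect our understanding of the tabular data format and should be checked against the Commons documentation before use. The column order (women, men, total) follows the transposed table above.

```python
import csv
import io
import json


def csv_to_tabular_json(csv_text: str) -> str:
    """Wrap transposed CSV rows (year, women, men, total) in a JSON page."""
    rows = []
    for year, women, men, total in csv.reader(io.StringIO(csv_text)):
        # The year stays a string, the rates become JSON numbers.
        rows.append([year, float(women), float(men), float(total)])

    def field(name: str, kind: str, title: str) -> dict:
        return {"name": name, "type": kind, "title": {"en": title}}

    page = {
        "license": "CC0-1.0",                                # assumed identifier
        "description": {"en": "Unemployment in Sweden"},     # placeholder text
        "sources": "Statistics Sweden (SCB)",                # placeholder text
        "schema": {
            "fields": [
                field("year", "string", "Year"),
                field("women", "number", "Women"),
                field("men", "number", "Men"),
                field("total", "number", "Total"),
            ]
        },
        "data": rows,
    }
    return json.dumps(page, indent=2)


print(csv_to_tabular_json("2019,7.6,7.4,7.5\n"))
```

Letting `json.dumps` produce the brackets, quotes, and commas avoids the stray or missing trailing comma that is easy to introduce when editing rows by hand.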

Result

The two uploads resulted in the following datasets:

  • Statements about the unemployment rate from 1970 to 2020, for the whole population, were added to Sweden's item on Wikidata, Q34. They can be retrieved using Wikidata Query Service.
  • The tabular data file Data:Unemployment in Sweden.tab was created on Wikimedia Commons, containing the numbers for both the entire population and divided by gender.
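
As an illustration of the first point, the sketch below builds a SPARQL query that lists the unemployment rate statements of an item together with the year from their P585 qualifier. It only constructs the query text, which can then be pasted into the Wikidata Query Service. The function name `unemployment_query` and the variable names are our own choices for this example, and the query relies on the prefixes (`wd:`, `p:`, `ps:`, `pq:`) that the query service predefines.

```python
def unemployment_query(item: str = "Q34") -> str:
    """Return a SPARQL query for the P1198 statements of one item."""
    return f"""
SELECT ?year ?rate WHERE {{
  wd:{item} p:P1198 ?statement .
  ?statement ps:P1198 ?rate ;
             pq:P585 ?date .
  BIND(YEAR(?date) AS ?year)
}}
ORDER BY ?year
""".strip()


print(unemployment_query())
```

Passing a different item identifier to the function gives the same query for another country, provided that item carries P1198 statements with a P585 qualifier.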

Discussion

The reason why we worked with both Wikidata and Wikimedia Commons was to investigate which of them is better suited for this type of statistical data with many (yearly) data points.

One of the strengths of Wikidata is that it combines a lot of information on one platform. The point of Linked Open Data is that the whole is greater than the sum of the parts. Using the Wikidata Query Service, the data can be queried using various criteria, for example to plot the unemployment rate against other economic indicators, or to make comparisons between different countries.

On the other hand, it is debatable whether adding about 50 statements to one item, one for every year from 1970 to 2020, is really the best way to go. Q34, the item of Sweden, is already quite large, as are the items of most countries. If we also added the data divided by gender, that would make about 150 statements about unemployment in total (for the whole population, for women and for men). And once you think of other indicators that are valid for the whole country, like homelessness rate, mortality rate, etc., each with a similar number of data points, it becomes clear that overloading the country's item is not viable.

Another aspect of Wikidata is that it's not always easy to express what we want with statements. In our case, the property P1198 unemployment rate already existed. But if a property that fully matches your needs does not exist, and you think it should be created, there's a process where the community discusses it and together decides whether it provides value to the platform.

On the other hand, Wikimedia Commons is much more flexible in this regard. You don't have to limit your data to whatever is available in a limited list of properties. Furthermore, the tabular data namespace was created specifically to store large lists of data like ours. From a user's point of view, a table like COVID-19 Sweden daily cases hospitalisations deaths is much easier to read and analyze than data on Wikidata.

It should also be noted that both options, Wikidata and tabular data, are compatible with the template Graph:Lines, which can be used to display a graph of the data on Wikipedia. See for example this graph of unemployment in Sweden, divided by gender, rendered from the data on Wikimedia Commons.

In conclusion, the tabular data namespace on Wikimedia Commons is a great alternative to Wikidata when it comes to data with many data points, such as values tracked over time. It doesn't have the modelling challenges that Wikidata has, with its limited number of properties. At the same time, adding some of the data to Wikidata is also valuable. In each case, we should determine which subset of the data will provide the most benefit to Wikidata users. In this case, that meant adding only the unemployment rate for the whole population to Wikidata and storing the numbers divided by gender on Wikimedia Commons.