Projekt:Öppen databas för offentlig konst 2013/Wikimania-presentation

Slide 1 – Title

Hi everybody,

My name is André Costa and I've been employed as a GLAM technician by Wikimedia Sverige since earlier this year. I'm primarily working on our open database of public art, which I'm going to tell you about today.

First though some background...

Slide 2 – Background

The idea for creating our own database came from the preparations for another project, the Wiki Loves Public Art (WLPA) competition.

Wiki Loves Public Art is an attempt to port the highly successful Wiki Loves Monuments competition to the context of public art. I won't say any more on this since there is a separate presentation about this tomorrow in the "Wiki loves..." track [1]. I'll also leave the motivation for why public art was chosen to that session.

Slide 3 – WLPA problems

Anyhow during the preparations for Wiki Loves Public Art we discovered that in Sweden there is no central registry of artworks. Instead the artworks are administered by a variety of public and private entities. Primarily this is done by the municipalities (roughly towns in a more densely populated country than Sweden).

This means a lot of different data providers without a standardised format for the information or even any guarantee that they hold any information at all.

Slide 4 – The plan

We therefore thought: why don't we collect all of this data into one single place where it's stored in the same structure irrespective of the original source?

We can then make the data accessible for any downstream use (including WLPA). By integrating it with the Wikimedia projects we also provide a natural way of enhancing the data.

At this point we decided that we needed to do things slightly differently from how they are done in Wiki Loves Monuments.

Slides 5-8 – Wiki Loves Monuments

In Wiki Loves Monuments the data is provided by one or a few central registrars...

[next slide]: ...and put into lists on Wikipedia.
[next slide]: These lists are then harvested to a database...
[next slide]: ...which in turn is made available for downstream use via an API.

Slides 9-12 – Our system

By comparison in our system...

[next slide]: The provided data is put into the database.
[next slide]: From there it is pushed out into the lists on Wikipedia. Updates to the lists are imported back into the database.
[next slide]: Again downstream access is provided through an API with the possibility of importing their data back into the database.

Slide 13 – Why?

Why would we want to do things differently?

The difference allows us to store official data differently from the enhanced data we get back from Wikipedia. This in turn allows downstream users to decide if they want to simply use us as an aggregator of official data or if they want the full wiki-grade data.

It also makes it easier for us to include improvements from other sources.

And it allows us to store more information than the wiki might want in their lists, even as hidden parameters.

Slide 14 – Targets

So to summarise, the target audiences for our open database are:

Researchers: Wanting to access all of the (official) information in a single place in order to be able to draw conclusions from a larger dataset.
Schools: Where the students want to write about the local artwork or about local artists and their artwork wherever it's situated in the country.
Journalists: Wanting to examine what the taxpayers' money is being used for.
Tourism: Wanting to inform visitors about the local culture.
...and more

And of course the database will be of use for Wikimedians as well. Both in providing metadata about the artworks and as a source of information for generated lists etc.

Slide 15 – Overview

As may be imagined each part of this process comes with its own challenges and problems.

In this presentation I'll therefore give you a guided tour through the different steps of the process to show you what we've done and to highlight any challenges/problems we've encountered.

It should be noted that this is still a work in progress and so not all of the pieces that I'll show you today are in place yet.

Slide 16 – Data providers

To get the data we need to contact each provider individually.

Luckily Sweden has something called the Public Sector Information directive (PSI) meaning that, with some caveats, public bodies need to provide us with the data we ask for, assuming that they have it to start with. This is roughly similar to Freedom of Information (FoI) requests in other countries.

Even though we have the PSI to fall back on we worded the letters as a friendly request motivating the benefits of sharing the information. The letters however had a few PSI related keywords sprinkled through them to ensure they weren't disregarded.

We started by contacting the larger municipalities as well as those we knew to have had some previous involvement with open data issues. This was primarily done as we expected that these were the players who best could give us the feedback needed to improve our communications with the later municipalities.

We specifically asked for the data shown on the slide but we also mentioned that we gladly accepted any other data they had and that we were interested in their information even if they didn't have all of the requested data.

We did however make it very clear that we did not want any of their images. This is due to a variety of copyright and related complications.

Slide 17 – Data providers: The Problems

There are however several complications and difficulties with the collected data.

First there is the quality of the data. The reason for the poor quality is two-fold. Firstly there is no legal responsibility to store the data and thus no standard for what to store or how to store it. Secondly the data is often only stored digitally as a way of more easily updating paper files; as a result only the output is taken into account, not the data itself. Some common issues were:

Lack of unique identifiers for the objects
Complete lack of data or very little data
Non-digitised data
Incoherent data (i.e. where the data is structured differently for each object)
Inappropriately structured or unstructured data (e.g. data stored in a way which visually looks structured but lacks a structure on a data level)
Tourism brochures rather than formal data

[click on images for examples]

Problems with the quality of the data are easy to spot. More problematic are systematic problems related to the selection of the data or the definition of Public Art. Questions which are often unclear are:

Did the municipality include all of their data or did they only include:
- artworks which they paid for,
- artworks which they care for,
- artworks which are deemed interesting from a tourism point of view,
- only indoor/outdoor art.

We tried to document their selection process but it is still often very unclear.

Unawareness was another big issue.

Unawareness of open data and why anyone would have an interest in reusing their data.
Unawareness of their own internal structure. Very often the contact person is a curator and not responsible for running the internal database. Often they don't even know that there is a database (when there is one) or how to get data out of it in any format other than the standard one (a PDF ready for archival). Often the database has been built by subcontractors, meaning that there is no one available on-site who can answer these questions either.
Unawareness of re-usability. Apart from the previously mentioned problem with data formats there was also a frequent view of "if it is available in format A, why would you want it in format B?". This relates to the previous point about the contact person not being aware of the technical differences underlying different formats. A frequent reply was simply "What do you need from us? The information is already available on our web-page", referring to a web-page where all of the information is embedded into a Flash animation.

Lastly there is nervousness about what the reuse of the data will lead to. Remember that this is data which up to now has primarily been of internal use and which is suddenly used in a new setting. The worries included:

Risk of theft: If the information about the artworks and their placement is publicly accessible then what is there to stop someone from targeting the artwork for e.g. its metal value.
Privacy issues: The public sector has been drilled to be careful whenever disclosing information regarding a private person. In our case this meant uncertainties regarding what the municipalities could tell us about the creators (e.g. name, year of birth/death etc.).
A spotlight on mistakes: Any large dataset will contain mistakes. This is something that both we and the municipalities are aware of. Still it was often felt by the municipalities that the data needed to be improved/cleaned before being released publicly. A very common reply was "We are happy to give you our data, please just wait until we've touched it up, this will take ... weeks/months".

Slide 18 – Data providers: Results and conclusion

Our initial goal for this project is to get the data from 25 data providers into the database. As you can see we are doing quite well with respect to this goal. However it is also very clear that this is only a small part of all the main data providers.

The main conclusion is that the data collection takes up an unexpected amount of time. This is something worth stressing and definitely something which should be taken into account during the planning stages.

Secondly the data that you do get will often be in a much worse state than you might have expected, especially if you are used to the data for Wiki Loves Monuments, which is often provided by large governmental bodies and which has a certain structure since it has often served as a basis for deciding on the legal protection of the object in question.

There are however certain positive side-effects as well!

Whilst communicating with the municipalities we got in contact with lots of motivated people who started thinking about both open data and the role of Wikipedia. We know that this resulted in many internal discussions and policy decisions regarding open data. It is also very rewarding to know that by the end of this project every municipality in Sweden will have had to deal with a request for open data, hopefully making it easier for future requests to be processed.

We have also been lucky in that we've teamed up with two partners who have helped us increase the impact of our work.

The first is an umbrella organisation for the municipalities of Sweden. They provided us with the contact details for the person within each municipality most likely to be responsible for public art. This meant that our letter got to the right person quicker.
The second is a company charged with improving how municipalities deal with open data requests. By sharing our experiences with them the impact of our work is drastically increased. They also use us as an example for why data is requested and how it can be used.

Slides 19-20 – Database<->Data providers

Switching now to how the data is imported into the database
[next slide]: Ideally we would be provided with API access to the databases of each municipality, meaning that after some initial mapping the data would be synced automatically... In practice each data set must be "massaged" into a form where it can later be imported into the database. Since the datasets vary drastically each one usually requires its own "massaging technique" (i.e. script/workflow).
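
To give a feel for what such a "massaging" script might look like, here is a minimal Python sketch. It assumes a provider that exports a semicolon-separated file with Swedish column headings; the headings, the common field names and the function name are all made up for this illustration and are not taken from the actual ÖDOK code.

```python
import csv
import io

# Hypothetical mapping from one provider's column headings to a common schema.
# Every provider would need its own mapping, and often extra clean-up steps.
EXAMPLE_MAPPING = {
    "Titel": "title",
    "Konstnär": "artist",
    "År": "year",
    "Plats": "address",
}


def massage(raw_csv, mapping, provider):
    """Convert one provider's CSV export into rows using the common schema."""
    rows = []
    reader = csv.DictReader(io.StringIO(raw_csv), delimiter=";")
    for record in reader:
        row = {"source": provider}
        for their_name, our_name in mapping.items():
            value = (record.get(their_name) or "").strip()
            # Store missing values as None rather than as empty strings.
            row[our_name] = value or None
        rows.append(row)
    return rows


# Invented sample data, only there to show the shape of the input.
sample = "Titel;Konstnär;År;Plats\nFontän; Anna Exempel ;1965;\n"
print(massage(sample, EXAMPLE_MAPPING, "example-municipality"))
```

A real workflow would need more than a column mapping (splitting combined fields, fixing encodings, dealing with missing identifiers), which is why each dataset tends to end up with its own script.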

If we want to act as an aggregator we now need some way for downstream users to decide if they want the raw or "enhanced" information.

If we additionally want to be able to export improved data back to the data providers then we additionally need some means of tracking changes/enhancements.

Slide 21 – The Database

Which brings us to the database itself

Slide 22 – ÖDOK

Like any other project this one has a name which was instantly disregarded in favour of an abbreviation.

In this case that is ÖDOK which stands for Öppen Databas för Offentlig Konst. Unsurprisingly this is Swedish for Open Database of Public Art. The naming does not make it explicit that this is a Sweden specific database. I'll return to how the database could be used for other countries at the end of this talk.

Slide 23 – The views

We decided to differentiate between the official data and enhancements by storing an audit version of the original data when it is changed and setting a "changed" flag accordingly.

This way if a change is reverted (through e.g. patrolling on Wikipedia) then the data now matches the audit version again and timestamps, flags etc. can be reset. Similarly the official data could be updated which in turn would affect the audit version.

A user querying the database can then choose one of three views through which to retrieve the data.

Strict: This only returns the original (audit) data
Enhanced: This returns the original data plus any non-conflicting additions. E.g. an added coordinate would be returned but not a corrected misspelling.
Normal: The topmost/latest data is returned, i.e. the full wiki glory.
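
The audit mechanism and the three views can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each object is a plain dictionary, the audit copy holds the official value of every field that has been changed, and "field is present in the audit copy" plays the role of the "changed" flag. The function names are hypothetical and the real database is not implemented this way.

```python
def apply_edit(current, audit, field, new_value):
    """Record an edit, keeping the official value in the audit copy."""
    if field not in audit:
        # First change to this field: remember the official value.
        audit[field] = current.get(field)
    current[field] = new_value
    if audit[field] == new_value:
        # The edit was reverted, so the "changed" flag can be reset.
        del audit[field]


def resolve(current, audit, view="normal"):
    """Return the object as seen through one of the three views."""
    if view not in ("strict", "enhanced", "normal"):
        raise ValueError("unknown view: " + view)
    if view == "normal":
        return dict(current)
    result = {}
    for field, value in current.items():
        if field not in audit:
            # Never changed, so the current value is the official one.
            result[field] = value
        elif view == "enhanced" and audit[field] is None:
            # The official value was empty: a non-conflicting addition.
            result[field] = value
        else:
            # The official value was overwritten: return the original.
            result[field] = audit[field]
    return result


# Invented example: a misspelt title is corrected and a coordinate is added.
current = {"title": "Fountian", "lat": None}
audit = {}
apply_edit(current, audit, "title", "Fountain")
apply_edit(current, audit, "lat", 59.33)
for view in ("strict", "enhanced", "normal"):
    print(view, resolve(current, audit, view))
```

In this sketch the enhanced view returns the added coordinate but keeps the misspelt official title, matching the example given for that view.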

Slide 24 – Additional parameters

To the original data we also add a few Wikimedia specific fields such as

A sample image on Commons
A subject category on Commons
A Wikipedia article about the object in question (via Wikidata)
A Wikipedia article about the artist(s) in question (via Wikidata)

Additionally we also add:

A unique identifier if one was not provided by the data provider.
A same_as parameter which allows us to identify objects that exist at more than one data provider (whilst maintaining persistent identifiers).

Also by keeping a separate table for the artists we can connect artworks by artist and do some basic copyright calculations for the individual artworks.
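
As an illustration of what a "basic copyright calculation" could look like, the Python sketch below assumes the general rule of protection lasting until 70 years after the end of the year in which the (last surviving) artist died. It deliberately ignores everything else that matters in practice, such as freedom of panorama, anonymous works or works made for hire, and the function name is hypothetical.

```python
from datetime import date

TERM_YEARS = 70  # assumption: protection lasts 70 years after the artist's death


def is_public_domain(death_years, today=None):
    """Return True or False, or None when the answer cannot be determined.

    death_years holds one entry per artist; None means the year of death
    is not known to us (the artist may be alive or the data may be missing).
    """
    if not death_years or any(year is None for year in death_years):
        return None
    today = today or date.today()
    # The term runs from the end of the year in which the last artist died.
    return today.year > max(death_years) + TERM_YEARS


# An artwork by an artist who died in 1943, checked on two different dates.
print(is_public_domain([1943], date(2013, 8, 1)))
print(is_public_domain([1943], date(2014, 1, 1)))
```

Returning None rather than False for unknown cases keeps "still protected" apart from "we cannot tell", which matters given how incomplete the artist data often is.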

Slides 25-26 – Wikipedia

Switching to how the information is stored on Wikipedia
[next slide]: This is largely done in the same way as for Wiki Loves Monuments, i.e. by template based lists. This gives an easy overview of the lists and encourages the upload of free images (which is the tie-in to Wiki Loves Public Art).

The thinking is that Wikipedia is already a known platform with built in abuse/conflict resolution.

On the Swedish language Wikipedia there is already a precedent for these types of lists with artworks per municipality, although not all such lists exist yet.

Many of these parameters (but possibly not all) could be moved to Wikidata once Phase 3 is completed.

By contrast to Wiki Loves Monuments there is no problem with having the same object appear in different lists. As a result we can also use the database to generate other types of lists, e.g. artworks by a specific artist to add to the artist's biography.

Slides 27-28 – Database<->Wikipedia

Moving on to how the Wikipedia lists communicate with the database
[next slide]: Again this is largely similar to how Wiki Loves Monuments works with bots syncing the different sources.

Where it differs from Wiki Loves Monuments is in that updates to the database (from municipalities or other sources) can be pushed to the lists, and that new lists can be generated directly from the database.

Here too it is worth noting that if a list is already in existence then changing it to be compatible with the database can be time consuming as it requires matching the existing entries to the database entries. This however only needs to be done once for each list.
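
A first pass at that matching could be automated along the lines of the Python sketch below. It assumes that both the list rows and the database rows carry a title and an artist, and that a row counts as matched only when exactly one database entry has the same normalised title and artist. The field names and functions are made up for this example; everything the sketch cannot resolve is left for manual matching.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivially different strings match."""
    return " ".join((text or "").lower().split())


def match_entries(list_rows, db_rows):
    """Pair list rows with database ids where exactly one candidate fits."""
    index = {}
    for row in db_rows:
        key = (normalise(row["title"]), normalise(row["artist"]))
        index.setdefault(key, []).append(row["id"])

    matched, unresolved = {}, []
    for position, row in enumerate(list_rows):
        key = (normalise(row["title"]), normalise(row["artist"]))
        candidates = index.get(key, [])
        if len(candidates) == 1:
            matched[position] = candidates[0]
        else:
            # No candidate or several candidates: leave it for a human.
            unresolved.append(position)
    return matched, unresolved


# Invented rows, only there to show the expected shape of the input.
db_rows = [
    {"id": "A1", "title": "Fountain", "artist": "Anna Exempel"},
    {"id": "A2", "title": "Untitled", "artist": "Anna Exempel"},
    {"id": "A3", "title": "Untitled", "artist": "Anna Exempel"},
]
list_rows = [
    {"title": "fountain ", "artist": "Anna  Exempel"},
    {"title": "Untitled", "artist": "Anna Exempel"},
]
print(match_entries(list_rows, db_rows))
```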

Slides 29-30 – Database<->Downstream

And finally looking at how the database communicates with downstream users.
[next slide]: This is done using an open API. This allows queries through any of the three views, filtered on several of the parameters.
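
To make this concrete, here is a Python sketch of how a downstream user might put together such a query. Note that the base URL and every parameter name below (view, format, municipality, has_coords) are placeholders invented for the example and do not describe the real API; the project page and code repository have the actual details.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.org/api"  # placeholder, not the real endpoint
VIEWS = ("strict", "enhanced", "normal")


def build_query(view="normal", **filters):
    """Build a query URL for one of the three views plus optional filters."""
    if view not in VIEWS:
        raise ValueError("unknown view: " + view)
    params = {"view": view, "format": "json"}
    # Skip filters that were not given a value.
    params.update({k: v for k, v in filters.items() if v is not None})
    # Sort the parameters so the same query always gives the same URL.
    return BASE_URL + "?" + urlencode(sorted(params.items()))


print(build_query("enhanced", municipality="Stockholm", has_coords="true"))
```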

For the data providers there is a function which exports all crowd-sourced additions/corrections to their dataset.
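
Such an export could be as simple as listing every field where the current value differs from the official one. The Python sketch below assumes each object is stored as a dictionary of current values plus a dictionary holding the official value of every changed field; this data layout, the function name and the field names are assumptions made for the illustration.

```python
def export_changes(object_id, current, audit):
    """List every field whose current value differs from the official one."""
    changes = []
    for field, official in sorted(audit.items()):
        suggested = current.get(field)
        if suggested == official:
            continue  # nothing to report for this field
        changes.append({
            "id": object_id,
            "field": field,
            "official": official,
            "suggested": suggested,
            # An empty official value means the wiki added something new.
            "kind": "addition" if official is None else "correction",
        })
    return changes


# Invented example: one added coordinate and one corrected title.
current = {"title": "Fountain", "lat": 59.33, "year": "1965"}
audit = {"title": "Fountian", "lat": None}
for change in export_changes("X1", current, audit):
    print(change)
```

Separating additions from corrections lets a data provider accept new information quickly while reviewing changes to their existing records more carefully.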

The data can be exported in various formats and the intention is to expand it with greater support for queries based on the table of artists.

The API may also be adapted to allow writing to the database from trusted applications.

Slide 31 – Showcase and links

To illustrate how the database can be used we put together a map showing all of the artworks in Stockholm for which coordinates have been added by Wikipedia users. Clicking on any of these brings up a window showing (some of) the information held on the artwork as well as a thumbnail of any sample image on Commons and the introduction to a Wikipedia article about the artwork.

If you want to find out more then visit one of these links:

se.wikimedia.org/wiki/Projekt:ODOK - project page
github.com/lokal-profil/ODOK - code repository
offentligkonst.se - the showcase above

Any questions?

The missing slide – Reuse for other countries

Some of the design choices for the database are slightly Sweden-centric. This is largely due to the fact that administrative subdivisions are set up in a certain way etc.

If you are intending to reuse this or something similar in your country then there are two paths to follow.

Copy the code. Do some modifications to make it fit your country and then run your own database.
Develop the underlying code so that it deals with different countries. Then we can build one database containing the artworks in both countries.

Slide 32 – Used images

list of images/licenses/creators for non-own artwork used