Projekt:CommonsDB registry/Implementation
On this page we collect ideas, thoughts and considerations on how the potential could be realized and what requirements that would place on the Wikimedia infrastructure. Some of the ideas are visualized at a high level in the video to the right. For now, all information goes on this page, but eventually some topics may need separate subpages.
Checking if a new upload already is in CommonsDB
Note that the implementation may vary depending on the method used to upload an image. The UploadWizard may be the main place to implement it, as most of the other tools either have very experienced users (like OpenRefine users) with better than average knowledge about copyright, or (like the Wikimedia Commons app) mainly feature uploading of original content. However, many tools can be used to upload to Wikimedia Commons, and some tool developers want to benefit from more features and make their uploads as high quality as possible too. It might therefore be worth designing the feature to be generic and reusable, through an API or library that lets tools use this functionality as a service.
UploadWizard
Check the CommonsDB register
- After the image has been uploaded in the browser, an ISCC code can be generated.
- Use the ISCC to query the CommonsDB register to see if it is known.
- If assets with an exact match or very high similarity are found, suggest a license template/statement based on the result (see below), possibly with a link so that the user can explore/verify.
- If assets with high similarity are found, provide a link so that the user can explore/verify and, if it turns out to be a match, suggest a license template/statement based on the result.
- If no match is found, do nothing.
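The decision logic in the list above could be sketched roughly as follows. This is only an illustration: the `Match` structure, the similarity scale from 0.0 to 1.0 and both threshold values are assumptions, not something the CommonsDB API is known to provide in this form.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; real values depend on how the registry reports similarity.
EXACT_THRESHOLD = 0.98
SIMILAR_THRESHOLD = 0.85


@dataclass
class Match:
    declaration_url: str   # link the user can follow to explore/verify
    rights_statement: str  # rights statement recorded in the registry
    similarity: float      # assumed to be in the range 0.0 to 1.0


@dataclass
class Suggestion:
    action: str            # "suggest_license", "ask_user_to_verify" or "none"
    match: Optional[Match] = None


def decide(matches: list[Match]) -> Suggestion:
    """Pick what the upload form should do with the registry results."""
    if not matches:
        return Suggestion("none")
    best = max(matches, key=lambda m: m.similarity)
    if best.similarity >= EXACT_THRESHOLD:
        # Exact or very high similarity: suggest a license right away.
        return Suggestion("suggest_license", best)
    if best.similarity >= SIMILAR_THRESHOLD:
        # High similarity: let the user verify before suggesting anything.
        return Suggestion("ask_user_to_verify", best)
    return Suggestion("none")
```

Keeping this as a pure function, separate from the ISCC generation and the registry query, would make it reusable from other upload tools than the UploadWizard.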
Selecting license templates
When checking for similar matches as described above, if a match is found, the rights statement of that asset can be used to select the right license on Wikimedia Commons. One way could be to match the rights statement string to an item on Wikidata with that value in official website (P856), then from that item follow topic has template (P1424) and take the Commons sitelink of that value, which should be the correct template. Of course, this needs to be verified for all allowed rights statements in the registry, taking particular care with public domain, which may be modeled differently on Commons. For performance reasons, we could keep this as a cached table rather than querying each time, as the values are unlikely to change very often, if at all.
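A minimal sketch of the lookup and the cache, assuming the rights statement is a URL and that the query is run against the Wikidata Query Service. The function and class names are hypothetical, and the actual network call is left to a `fetch` function supplied by the caller.

```python
from typing import Callable, Optional


def build_template_query(rights_statement_url: str) -> str:
    """Build a SPARQL query: rights statement URL -> Commons template page."""
    if any(ch in rights_statement_url for ch in '<>" \n'):
        raise ValueError("unexpected character in rights statement URL")
    return f"""
SELECT ?item ?template ?commonsPage WHERE {{
  ?item wdt:P856 <{rights_statement_url}> .
  ?item wdt:P1424 ?template .
  ?commonsPage schema:about ?template ;
               schema:isPartOf <https://commons.wikimedia.org/> .
}}
"""


class TemplateCache:
    """Remember lookups, since the mapping is expected to change rarely."""

    def __init__(self, fetch: Callable[[str], Optional[str]]):
        self._fetch = fetch  # hypothetical: runs the query, returns a template title
        self._cache: dict[str, Optional[str]] = {}

    def lookup(self, rights_statement_url: str) -> Optional[str]:
        if rights_statement_url not in self._cache:
            self._cache[rights_statement_url] = self._fetch(rights_statement_url)
        return self._cache[rights_statement_url]
```

Whether the query returns exactly one template per rights statement is part of what would need to be verified manually, in particular for public domain statements.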
Store ISCC
The ISCC code should be stored as Structured Data: there is already a property for this, ISCC code (P13150). The statement should possibly be qualified with Wikimedia page-version URL (P7569) so that it is clear which uploaded file version the ISCC is for.
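As an illustration, such a statement could look roughly like this in the general Wikibase statement JSON shape. The sketch assumes that both values are serialized as plain strings; the helper name and the example values are placeholders, not real data.

```python
import json


def build_iscc_statement(iscc: str, page_version_url: str) -> dict:
    """Build a statement with ISCC code (P13150) qualified by P7569."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": "P13150",
            "datavalue": {"value": iscc, "type": "string"},
        },
        "qualifiers": {
            "P7569": [
                {
                    "snaktype": "value",
                    "property": "P7569",
                    "datavalue": {"value": page_version_url, "type": "string"},
                }
            ]
        },
    }


# Placeholder values for illustration only.
statement = build_iscc_statement(
    "ISCC:PLACEHOLDER",
    "https://commons.wikimedia.org/w/index.php?title=File:Example.jpg&oldid=1",
)
print(json.dumps(statement, indent=2))
```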
Prototyping options
An alternate workflow where a user uploads to a tool on Toolforge could be created, where that tool generates the ISCC, does the check against the CommonsDB API and recommends a licensing template based on the result. Then it either makes the upload completely separately or forwards the information to the UploadWizard.
A somewhat more integrated workflow would be to create a user script. It could then "embed" itself in the regular workflow and not feel that different for the user. Parts of the work behind the scenes, such as generating the ISCC, could possibly still benefit from a tool on Toolforge.
The first option is likely easier to build, whereas the second one would be a more convincing demo for the end user. It might also make it clearer to WMF developers exactly where in the upload process the tool could be integrated.
Declaring new uploads to the registry
We probably want some kind of "grace period" before we declare a new upload to the registry. This period allows poor uploads to be speedily deleted and gives some of the most common bots time to add metadata in structured form to the file (examples: SchlurcherBot, BotMultichillT). The length of this period should be checked with the community. It should be long enough that the bots are likely to have added their metadata, which minimizes the need for updating declarations.
After the period has passed an upload can be considered as stable enough and a declaration should be made.
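The eligibility check itself is simple. A sketch, where the seven-day default is only a placeholder until the community has been consulted:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical default; the real length should be agreed with the community.
GRACE_PERIOD = timedelta(days=7)


def ready_to_declare(
    uploaded_at: datetime,
    now: datetime,
    deleted: bool = False,
    grace_period: timedelta = GRACE_PERIOD,
) -> bool:
    """True if the upload survived the grace period and still exists."""
    if deleted:
        return False
    return now - uploaded_at >= grace_period


uploaded = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(ready_to_declare(uploaded, datetime(2025, 1, 5, tzinfo=timezone.utc)))
print(ready_to_declare(uploaded, datetime(2025, 1, 9, tzinfo=timezone.utc)))
```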
Store declaration status
That a declaration has been made should be stored somewhere, possibly as Structured Data. Storing the declaration ID as a main statement will make it easy to query for, and also makes sense as it is a kind of external identifier. Optionally, there could also be a qualifier on either the ISCC statement or this, connecting the two to each other (so that it is clear that the ISCC gave cause to the declaration ID).
Prototyping options
This could be built as a prototype on Toolforge, in a few steps. The first step is to identify suitable images to declare, generate their ISCCs and put them in a queue. The second step, which we can enable when the first step feels stable enough, is to make the declaration. The third step is to store the declaration status and ISCC on Toolforge in a way that is publicly visible, and to show the edit that would be made to Wikimedia Commons. When the needed properties have been created, those edits could be made to the file.
Updating metadata in the registry
Sometimes the information on the file page or in its structured data is updated in a way that should also be reflected in the original declaration. In particular, changes of the license are important to propagate.
Detecting changes like this seems nontrivial, and some service that looks for such changes may be needed.
Prototyping options
Here we could experiment with a Toolforge tool that scans the recent changes on Commons and tries to detect relevant changes to already declared images. In a first step, the goal is just to find the relevant changes and make it easy for us to see how well the detection works. A new declaration can then be made either after human review, or automatically once we are confident that the detection is good enough.
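For changes made in structured data, the detection could start from a comparison like the one below. It is a sketch under the assumption that the license is expressed with copyright license (P275) and copyright status (P6216), and that the values for the old and new revision have already been extracted into sets. Changes made only in wikitext templates would need separate handling.

```python
# Properties assumed to be relevant for the declaration.
RELEVANT_PROPERTIES = ("P275", "P6216")  # copyright license, copyright status


def relevant_changes(
    old: dict[str, set[str]], new: dict[str, set[str]]
) -> dict[str, tuple[set[str], set[str]]]:
    """Return {property: (values before, values after)} for changed properties."""
    changes = {}
    for prop in RELEVANT_PROPERTIES:
        before = old.get(prop, set())
        after = new.get(prop, set())
        if before != after:
            changes[prop] = (before, after)
    return changes


# Placeholder item IDs, not real Wikidata items.
old_rev = {"P275": {"Q_LICENSE_A"}, "P6216": {"Q_STATUS"}}
new_rev = {"P275": {"Q_LICENSE_B"}, "P6216": {"Q_STATUS"}}
print(relevant_changes(old_rev, new_rev))
```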
Removing(?) declarations of deleted files from the registry
If a file gets removed from Wikimedia Commons, the registry should be updated or possibly have the entry removed.
Find duplicates among existing images
This would be a one-time job of checking generated ISCCs against each other to see if there are duplicates. It is likely that we do want some images that are very similar to each other, whereas others may have been mistakes of different kinds. For the mistakes, the community can decide if they want to delete the "duplicate" images (perhaps including redirects). For the ones that should be kept, a way to mark this as intentional would be good. Possibly this could be done with the other versions field in the Information template.
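A naive version of the comparison could look like this. It assumes that the similarity-preserving part of each ISCC has already been decoded to an integer bit string, and that a small Hamming distance between two such values indicates similar content. The decoding step and the distance threshold are not covered here and are assumptions.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def candidate_duplicates(
    codes: dict[str, int], max_distance: int = 4
) -> list[tuple[str, str, int]]:
    """Return (file 1, file 2, distance) for every pair within max_distance."""
    pairs = []
    for (f1, c1), (f2, c2) in combinations(sorted(codes.items()), 2):
        distance = hamming(c1, c2)
        if distance <= max_distance:
            pairs.append((f1, f2, distance))
    return pairs
```

Comparing every pair grows quadratically with the number of files, so a job covering all of Commons would need some form of index or bucketing instead. The sketch only shows what counts as a candidate.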
Suggesting category improvements
This is a similar one-time job for all of the images that are kept. It is likely that similar images should have at least one category in common, so a tool to easily make these edits could be made. This is likely a good "game" where users just approve or reject suggestions. If the user decides there should be no category in common, this may be odd to store on the files, which also suggests that an external tool can be useful.
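The suggestions for such a game could be generated along these lines. This is a sketch with hypothetical names, where the category sets are assumed to have been fetched beforehand.

```python
from typing import Optional


def suggest_categories(
    cats_a: set[str], cats_b: set[str]
) -> Optional[dict[str, list[str]]]:
    """Suggest categories to copy between two similar files.

    Returns None when the files already share at least one category.
    """
    if cats_a & cats_b:
        return None
    return {
        "add_to_a": sorted(cats_b - cats_a),
        "add_to_b": sorted(cats_a - cats_b),
    }


print(suggest_categories({"Category:X"}, {"Category:Y", "Category:Z"}))
```

Rejected suggestions would be stored by the tool itself rather than on the files, so that the same pair is not shown again.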
Find media that has entered the public domain since the upload
This is not really related to the ISCC as such, but as we have declared and signed the metadata, we should make an effort to keep that data accurate. In the best of worlds, the community is already on top of this and we would just follow the process for updating the metadata in the registry as described above. However, we are not aware of any tools that assist such workflows, and such tools may be helpful for the community.
Resolve externally found "conflicts"
Here there are two scenarios, an initial check and then a continuous queue.
Initial check
This would be a one-time job of checking all images on Commons against other files in the registry. Files with an exact match or very high similarity where the rights statement in the registry does not match the license on the file should be put in a queue for human review. Some manual investigation is likely needed to understand why the conflict occurred. Then either the license is changed, or the file is marked as having the right one, possibly triggering a notification to the other data provider.
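Building the review queue could be sketched as follows. The record layout, the similarity scale and the threshold are assumptions, and the Commons license is assumed to have been mapped to the same rights statement vocabulary that the registry uses.

```python
from dataclasses import dataclass

VERY_HIGH_SIMILARITY = 0.98  # hypothetical threshold


@dataclass
class RegistryMatch:
    commons_file: str
    commons_rights: str    # Commons license mapped to a rights statement
    registry_rights: str   # rights statement declared by the other provider
    similarity: float      # assumed to be in the range 0.0 to 1.0


def build_review_queue(matches: list[RegistryMatch]) -> list[RegistryMatch]:
    """Keep only near-exact matches whose rights statements disagree."""
    queue = [
        m
        for m in matches
        if m.similarity >= VERY_HIGH_SIMILARITY
        and m.commons_rights != m.registry_rights
    ]
    # Most similar first, so reviewers start with the clearest cases.
    return sorted(queue, key=lambda m: m.similarity, reverse=True)
```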
Continuous queue
Whenever other data providers declare new files in the registry that have exact or high similarity with files on Commons, but where the rights statements are not the same, we should have a way to receive a notification and investigate the file. This could be a VRT queue if it needs to be email, but a public on-wiki list that anyone could help check would be better.