Projekt:Identifiera kvalitetsbrister/delprojekt 1/proposal v3
3rd revised version following Skype discussion with Holger Motzkau and email feedback from Lev Muchnik
- Main questions
- To what extent and why do people from non-English language communities use the English Wikipedia instead of the one in their local language? A particular focus will be put on the Swedish Wikipedia.
Using Wikipedia can refer to accessing (reading) articles, editing articles, and contributing to discussions, which corresponds to editing discussion pages. These activities can be carried out by users inside the particular country under consideration, but also by people living outside the country who are associated with it in some way (by nationality, interest, etc.). Here we want to study the relations between the English Wikipedia and the Swedish Wikipedia. To obtain a basis for comparison and to identify special properties of the Swedish Wikipedia and its users, we will also choose three to five additional local Wikipedias comparable to the Swedish one with respect to (i) access volume, (ii) edit volume, and (iii) number of speakers. Particular emphasis should be placed on choosing language communities in different time zones. Natural candidates are the Dutch, Hebrew, and Korean Wikipedias, possibly also the Arabic, Turkish, and/or Japanese ones.
Part I: Literature review (including online material); questions for a survey
- Retrieve studies on similar or related research questions regarding the usage of Wikipedias or other online media published in research papers, conference proceedings or online forums. Summarize the content with focus on the main research question of this project.
- Find statistical tools, data, and/or analysis pages available on the web.
- Collect relevant questions for online surveys addressing users (readers) and/or editors from the Swedish language community. The (future) online survey shall mainly start from banner calls that appear to users from Sweden who look at or edit English Wikipedia. Although running the survey is not part of this project, the collection and selection of relevant and important questions will continue throughout the project.
Part II: Selecting data to be studied; descriptive and comparative statistics
We will study and compare topical articles and discussion pages for (i) articles present in all considered Wikipedia versions (general topics), (ii) articles regarding Sweden present in both the English Wikipedia and the Swedish Wikipedia (e.g., Swedish cities, Swedish artists), and (iii) topics regarding Sweden present only in the Swedish Wikipedia. The following databases are available to the group in Halle: (a) hourly access rates for all Wikipedia articles from 1.1.2009 to 21.10.2009, (b) all article edit events up to 21.10.2009, and (c) a database containing full information on the categories of all articles and all links (and redirects) between them. Additional data required or desired: access rates to all or selected English articles with resolved geo-location of the reader, the number of reference links in all selected articles as a proxy quality index, and a distinction between human and “bot” edit events.
- Choose the additional language communities. Identify the lists of normal Wikipedia articles (no files, redirection lists, etc.) and discussion pages that will be studied in detail. A focus shall be placed on articles with a large total access volume (high interest) and/or edit volume (possibly controversial). The lists should contain approximately 500 articles present in all Wikipedias under study (item (i)), 50-100 articles regarding Sweden present in both the English and the Swedish Wikipedia (item (ii)), and 50-100 articles present only in the Swedish Wikipedia (item (iii)).
- Retrieve hourly article access rates and individual article edit event time series for all articles, discussion pages, and languages according to the lists from the first item above.
- Calculate quantitative comparative statistics of total access and edit volumes for all considered articles, and compare rank-order statistics between language communities. E.g., check the hypothesis that articles of high interest in the English Wikipedia are also prominent in the considered local Wikipedias. How similar are the rankings for different language communities? Check possible relations of access and edit activity with quality index proxies (number of links, in particular number of reference links). In particular, check the hypothesis that the quality of articles is correlated with greater interest (at least in some range) and compare language communities.
- Try to identify articles where either the Swedish or the English version was created during the considered time interval, while the other language version already existed. What effect did this have on the access volume of the pre-existing article?
- As working time and computational power permit, the initial subset of articles can be increased and/or split into articles regarding specific topic groups (e.g., physics, movies, large cities, large companies, etc.), which would allow a more detailed (topical) comparison.
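The rank-order comparison of access volumes between two language editions could, for instance, be based on Spearman's rank correlation. The following is only a minimal sketch: the function names, the dictionary layout (article title mapped to total access count), and all numbers are illustrative assumptions, not part of the available databases.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(volumes_a: Dict[str, float], volumes_b: Dict[str, float]) -> float:
    """Spearman rank correlation over the articles present in both editions."""
    shared = sorted(set(volumes_a) & set(volumes_b))
    ranks_a = average_ranks([volumes_a[title] for title in shared])
    ranks_b = average_ranks([volumes_b[title] for title in shared])
    return pearson(ranks_a, ranks_b)


# Made-up total access counts, for illustration only.
english = {"Stockholm": 900, "ABBA": 700, "Physics": 1500, "Volvo": 300}
swedish = {"Stockholm": 400, "ABBA": 350, "Physics": 120, "Volvo": 200}
rho = spearman(english, swedish)
```

A value of `rho` close to 1 would support the hypothesis that articles of high interest in one edition are also prominent in the other; values near 0 or below would speak against it.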
Part III: Detailed analysis of article access-rate data
Article access activity is recorded as a rate per hour and article. The location or identity of the readers (users) cannot be identified unless additional data become available. However, the temporal distribution of hourly access rates and hourly edit rates during the day and during the week (daily and weekly patterns) can be traced back to time zones and can thus allow an approximate attribution to the location of the users. Although such an attribution is not very accurate, it can yield information independent of geo-localization (which has turned out to be not very reliable in several cases). Our goal is to obtain information on the location of users that may support data from geo-localization or quantify possible problems in geo-localization.
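The attribution of a daily pattern to a time zone can be illustrated by a circular shift search: a reference daily profile in local time is shifted hour by hour until it matches the profile observed in UTC best. This is only a sketch under strong assumptions: a single reference profile (the values below are made up) is taken to be valid for all readers, daylight saving time is ignored, and offsets are resolved only modulo 24 hours. The function name `estimate_utc_offset` is hypothetical.

```python
from typing import Sequence


def estimate_utc_offset(observed_utc: Sequence[float],
                        reference_local: Sequence[float]) -> int:
    """Return the offset (local time minus UTC, in hours) that best aligns the profiles.

    Both profiles hold one value per hour of the day. Since local hour equals
    UTC hour plus offset, the predicted UTC-indexed profile for a given offset
    is reference_local[(h + offset) % n]. All shifted copies of the reference
    have the same norm, so maximising the dot product is equivalent to
    maximising the correlation with the observed profile.
    """
    n = len(reference_local)

    def score(offset: int) -> float:
        return sum(observed_utc[h] * reference_local[(h + offset) % n]
                   for h in range(n))

    return max(range(-11, 13), key=score)


# Made-up reference profile (relative activity per local hour, 0..23).
reference = [2, 1, 1, 1, 1, 1, 2, 4, 6, 7, 8, 8,
             9, 9, 8, 8, 9, 10, 12, 14, 15, 12, 8, 4]

# Synthetic observation: the same readers, one hour ahead of UTC.
observed = [reference[(h + 1) % 24] for h in range(24)]
offset = estimate_utc_offset(observed, reference)
```

For real data the observed profile is a mixture of readers from several time zones, so the best single shift only indicates the dominant contribution.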
- Calculate typical daily access rate and edit rate cycles for the articles from each of the language communities and article groups as retrieved in part II. Identify and discuss differences, possibly also regarding topical groups.
- Try to identify by principal component analysis or a similar technique: (i) a local component (according to the main time zone of the language community) and (ii) one or several world-wide components. Possibly the daily cycle in the English Wikipedia can serve as an appropriate model for a world-wide component (strong assumption). The statistics over the 500+ articles can be used for significance testing.
- Compare daily access rate cycles and edit rate cycles to see whether readers and editors typically come from the same places or from different places. A restriction to weekdays and a comparison of weekly cycles and trends may be used if there are significant differences between weekdays and weekends.
- As working time permits, compare the structures and statistics of the direct link networks (and possibly also access link networks) within each language community to identify differences in the embedding of the considered articles.
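As a simpler stand-in for the principal component analysis mentioned above (this is not PCA itself), an observed daily cycle can be fitted as a weighted sum of two fixed templates: a local component and a world-wide component, the latter possibly taken from the English Wikipedia under the strong assumption named in the list. The sketch below solves the corresponding 2x2 least-squares problem; the function name and the template values are illustrative assumptions.

```python
from typing import Sequence, Tuple


def fit_two_components(observed: Sequence[float],
                       local: Sequence[float],
                       world: Sequence[float]) -> Tuple[float, float]:
    """Least-squares weights (a, b) minimising sum((observed - a*local - b*world)**2)."""
    ll = sum(x * x for x in local)
    ww = sum(y * y for y in world)
    lw = sum(x * y for x, y in zip(local, world))
    ol = sum(o * x for o, x in zip(observed, local))
    ow = sum(o * y for o, y in zip(observed, world))
    # Normal equations:  a*ll + b*lw = ol  and  a*lw + b*ww = ow
    det = ll * ww - lw * lw
    if det <= 1e-12 * ll * ww:
        raise ValueError("templates are (nearly) collinear; weights are not identifiable")
    a = (ol * ww - ow * lw) / det
    b = (ow * ll - ol * lw) / det
    return a, b


# Made-up hourly templates: activity only during 12 daytime hours (local)
# and a flat profile (world-wide).
local_template = [0.0] * 8 + [1.0] * 12 + [0.0] * 4
world_template = [1.0] * 24

# Synthetic observation built from known weights 2.0 and 0.5.
observed = [2.0 * l + 0.5 * w for l, w in zip(local_template, world_template)]
a, b = fit_two_components(observed, local_template, world_template)

# Share of the total access volume attributed to the local component.
local_share = a * sum(local_template) / sum(observed)
```

Repeating such a fit over the 500+ articles would give a distribution of weights per language community, which could then serve as input for significance testing.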
Part IV: Detailed analysis of article edit event data
Editorial contributions are recorded with an exact time stamp and an editor code. They can thus be analyzed for co-incidence (i.e., the same editor changes both the English and a local Wikipedia practically simultaneously) and traced back to the identity (possibly also the nationality) of the editor. In this analysis we will exclude edit events by “bots” and focus only on human editors, preferably those who edit both language versions. In accordance with the Wikipedia guidelines, we will not publish any information that might reveal the behaviour of individual editors.
- Identify events of co-incidence in editing the English Wikipedia and one local Wikipedia by the same editor. This requires empirically determining a maximum time-interval for co-incidence. Study frequency and distribution of co-incident edits among all edits regarding the different language communities.
- Identify events of co-incidence in editing the English Wikipedia and one local Wikipedia by different editors. Compare with editing by the same editor.
- Statistics of editors: study and compare how many editors are involved in each of the Wikipedias, how frequently they work, etc. Relate the activity of editors to their nationality if possible.
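The search for co-incident edits by the same editor can be sketched as follows. The record layout (editor code, Unix time stamp), the function name, and the sample data are hypothetical; the sketch also assumes that the same editor code identifies the same person in both Wikipedias and that “bot” edits have already been removed.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edit = Tuple[str, float]  # (editor code, Unix time stamp); assumed layout


def coincident_edits(english: Iterable[Edit],
                     local: Iterable[Edit],
                     max_gap_seconds: float) -> List[Tuple[str, float, float]]:
    """Return (editor, t_english, t_local) for all edit pairs by the same
    editor whose time stamps differ by at most max_gap_seconds."""
    local_by_editor: Dict[str, List[float]] = defaultdict(list)
    for editor, ts in local:
        local_by_editor[editor].append(ts)
    for times in local_by_editor.values():
        times.sort()

    pairs: List[Tuple[str, float, float]] = []
    for editor, t_en in english:
        times = local_by_editor.get(editor, [])
        # Binary search for the window [t_en - gap, t_en + gap] in the sorted list.
        lo = bisect_left(times, t_en - max_gap_seconds)
        hi = bisect_right(times, t_en + max_gap_seconds)
        pairs.extend((editor, t_en, t_loc) for t_loc in times[lo:hi])
    return pairs


# Placeholder editor codes and time stamps, for illustration only.
english_edits = [("alice", 1000.0), ("bob", 5000.0), ("alice", 90000.0)]
local_edits = [("alice", 1200.0), ("alice", 50000.0),
               ("carol", 1000.0), ("bob", 9000.0)]

within_10_min = coincident_edits(english_edits, local_edits, 600.0)
```

Counting the pairs for a range of `max_gap_seconds` values is one possible way to determine the maximum co-incidence interval empirically: the count should grow quickly up to a typical gap and more slowly beyond it.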