Preregistration: Comparing the use of Wikidata and Wikipedia by open-source software programmers on GitHub repositories

Abstract

Wikipedia and Wikidata are socio-technical systems driven by collaborative communities, open content, and open-source infrastructure. Some of the open-source software development around them takes place on code-sharing sites like GitHub. Analyzing GitHub repositories related to Wikipedia and Wikidata can thus provide insights into multiple dimensions of how tools for the two projects are developed. We plan to perform such an analysis and, to test our workflows, ran a preliminary study on a sample of 1000 GitHub repositories each for Wikidata and Wikipedia. We are preregistering our workflows here as a transparent basis for documenting and reporting on the full analysis later. Based on the preliminary data, we expect the following kinds of insights about open-source GitHub repositories related to Wikidata and Wikipedia: (i) statistical information about these repositories; (ii) computational information, e.g. the programming languages used; (iii) demographic information about the contributors to such open-source projects; (iv) legal information about the choice of licenses; (v) linguistic information about the natural language used in the context of these repositories; (vi) trends over time. By applying these preliminary workflows to the full dataset of GitHub repositories related to Wikipedia and Wikidata, we hope to gain additional insights into the community dynamics at play in volunteer software development around Wikimedia projects, as well as into the process and merits of preregistration for studies of this kind. We welcome community feedback on this approach, suggestions for additional aspects to include in the full study, and collaboration on the actual implementation.
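To make dimensions (ii), (iv) and (vi) more concrete, the following is a minimal sketch of how repository metadata could be tallied. It is not the preregistered workflow itself. It assumes each repository is available as a dictionary shaped like a repository object from the GitHub REST API, with `language`, `license` and `created_at` fields. The function name `summarize_repositories` and the three sample records are hypothetical and serve only as illustration.

```python
from collections import Counter


def summarize_repositories(repos):
    """Tally primary language, license, and creation year.

    `repos` is assumed to be a list of dicts shaped like GitHub REST API
    repository objects; this helper is hypothetical, not the authors' code.
    """
    languages = Counter()
    licenses = Counter()
    years = Counter()
    for repo in repos:
        # `language` is null when GitHub detects no primary language.
        languages[repo.get("language") or "unknown"] += 1
        # `license` is null when no license file is detected.
        license_info = repo.get("license") or {}
        licenses[license_info.get("spdx_id") or "none"] += 1
        # `created_at` is an ISO 8601 timestamp; the first four characters are the year.
        created = repo.get("created_at") or ""
        years[created[:4] or "unknown"] += 1
    return {"languages": languages, "licenses": licenses, "years": years}


# Made-up records for illustration only, not data from the preliminary study.
sample = [
    {"language": "Python", "license": {"spdx_id": "MIT"}, "created_at": "2019-05-01T12:00:00Z"},
    {"language": "Python", "license": None, "created_at": "2020-01-15T08:30:00Z"},
    {"language": "JavaScript", "license": {"spdx_id": "GPL-3.0"}, "created_at": "2020-07-09T17:45:00Z"},
]

summary = summarize_repositories(sample)
print(summary["languages"].most_common())
print(summary["licenses"].most_common())
print(sorted(summary["years"].items()))
```

Running the same tally separately on the Wikidata sample and the Wikipedia sample would give directly comparable counts for the two projects. Demographic and linguistic dimensions would need additional data sources and are not covered by this sketch.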

Houcemeddine Turki
Medical student

My research interests include the development of a large-scale framework for using open resources and semantic technologies for driving biomedical informatics and research evaluation at a low cost.

Mohamed Ali Hadj Taieb
Assistant professor

My research interests include semantic similarity, semantic relatedness, knowledge representation, Big Data, social media, data management systems and graph embedding.

Mohamed Ben Aouicha
Associate professor

My research interests concern information retrieval, semantic technologies, social media analytics, knowledge representation, Big Data and graph embedding.