Background

This project is focussed on uploading metadata for New Zealand academic theses to Wikidata, in order for them to be more openly citable and accessible. We believe this is the first attempt to upload a national dataset of theses.

The project came about while Giantflightlessbirds was a Wikipedian in Residence at Lincoln University. During that short residency, librarian Zeborah raised the possibility of adding Lincoln University's theses to Wikidata. She had an opportunity to present to her academic librarian colleagues on adding thesis metadata to Wikidata at the online conference Aotearoa Institutional Repositories Community Days, 30 September – 1 October 2021. In preparation for this presentation she reached out to Giantflightlessbirds, who in turn invited Ambrosia10 and DrThneed to join the discussion. This group met several times to discuss the proposal of uploading all New Zealand academic theses to Wikidata and to prepare for the conference presentation.

Discussion documents, slides and other project documentation are being collated in Google Docs folders, as some of the participating academic librarians are not Wikimedians. Some of this documentation is linked from the Documentation section of this page.

Scope

The intention is to collect metadata for theses from New Zealand universities and polytechnics, and upload a core set of statements for each thesis in the first instance. After this core set of statements has been uploaded, there is potential for further work to increase the findability and linkage of the theses. For example, the data includes keywords, often in controlled vocabularies such as ANZSRC, which could be mapped to main subject statements. We also have data connecting theses to degree programmes and advisors.

A dataset of approximately 66,500 theses has been compiled from 13 New Zealand institutions. The theses range from diploma and bachelor's theses through to Doctor of Science, and span the period 1907 to 2022. Whilst many of the theses are digitised and available through an institutional repository, others are represented only by their metadata. Because the data varies both within and between institutions, a lot of clean-up and standardisation is required. Deborah Fitchett has done significant work aggregating and collating the data in Excel, and DrThneed will clean it up in OpenRefine and upload it to Wikidata. There will likely be some problems to resolve with institutions where, for example, a thesis is held in more than one library and has been modelled differently by each.

Funding

The thesis dataset is large and complex, with approximately 66,500 items in several languages. It includes some apparent duplicate items within and between institutions that need to be clarified with the academic librarians involved, and some incomplete data that may need follow-up. The inconsistencies in data format between institutions will require a lot of time to standardise and clean up. For instance, we have counted more than 50 ways of indicating in a title that a work is a thesis, and we need to remove these additions so that the title of each thesis is as the author intended and makes a good citation.
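The title clean-up described above could be sketched in Python as follows. This is only an illustration, not the project's actual OpenRefine workflow: the four patterns in `THESIS_MARKERS` are hypothetical stand-ins for the 50-plus variants found in the real data, and the function name `clean_title` is an assumption.

```python
import re

# Hypothetical examples of trailing thesis designations. The real dataset
# contains more than 50 variants, which are not reproduced here.
THESIS_MARKERS = [
    r"a thesis (?:submitted|presented)\b.*",
    r"a dissertation (?:submitted|presented)\b.*",
    r"\[(?:thesis|dissertation)\]",
    r"\((?:thesis|dissertation)\)",
]

# Match any separator characters, then one marker, anchored at the end of the title.
_MARKER_RE = re.compile(
    r"[\s:;,.\-/]*(?:" + "|".join(THESIS_MARKERS) + r")\s*$",
    re.IGNORECASE,
)


def clean_title(raw_title: str) -> str:
    """Collapse whitespace and strip a trailing thesis designation, if any."""
    title = " ".join(raw_title.split())
    return _MARKER_RE.sub("", title).strip()


print(clean_title("Kiwi   behaviour [Thesis]"))  # prints "Kiwi behaviour"
```

Anchoring the patterns to the end of the title means a phrase such as "a thesis" in the middle of a genuine title is left untouched. Any automated pass like this would still need manual review of the titles it changes.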

We estimate the data cleaning, checking and upload to Wikidata to take approximately 200 hours of work by an experienced data wrangler. At an hourly rate of $NZ25 this amounts to $NZ5000.

We are approaching Wikimedia Aotearoa New Zealand to support obtaining a contractor to complete this work.

Progress

Ambrosia10 and DrThneed used a small sample dataset to work on mapping the thesis data to Wikidata properties, and Ambrosia10 developed a Wikidata Cradle schema for an academic thesis in consultation with the other members of the group as well as the academic librarians contributing the data. This ontology will likely need to be modified during the project.

Zeborah undertook significant work collating and aggregating the data and was able to pass the dataset on to DrThneed at the beginning of March. DrThneed then spent time exploring the dataset and began a small trial upload of 116 theses into Wikidata to test both the proposed workflow and the schema that had been previously created.

Feedback is being gathered from the participating institutions and, as at April 2022, DrThneed is continuing to prepare the dataset for upload to Wikidata. It is anticipated that the upload of a core set of statements for the full thesis dataset will be complete in May/June 2022.

A small team met before Christmas to work on ANZSRC vocabularies in Wikidata, which would be a useful prelude to uploading keywords to the thesis items. Progress on the ANZSRC Mix'n'match catalogues has been slow, but we intend to return to this work after upload of the core statements for the main dataset.

DrThneed has created a dashboard that measures edits to Wikidata items with the statement "on focus list of NZThesisProject".

Events

  • First meeting with librarians 2021
  • Second update with project members & librarians 25 March 2022 (slides from 25 March): DrThneed presented her findings to the project participants and contributors and requested feedback from the contributing libraries on issues the trial upload raised.
  • Third update 28 July 2022 (slides from 28 July), showing how theses are connected in Wikidata and cited in Wikipedia, some of the data visualisations now possible, and tools to improve the data.
  • Presentation at Wikimania in Singapore on 20 August 2023 (see Documentation for recording)
  • Presentation to LIANZA conference in Christchurch 31 October 2023
  • Presentation to Christchurch librarians 22 April 2024

Documentation

Tools

DrThneed has made some Wikidata property dashboards to track progress on the project. They are linked from the Wikidata project page. One table shows properties for theses, and another shows properties for people (thesis authors). A third table shows some properties we don't expect to find, like "volume number" and "published in"; this helps check that our thesis items haven't been inappropriately merged with other types of publications.

The Wikidata project page also contains a link to some Histropedia timelines, and some SPARQL queries to visualise the data, e.g. a map of where authors have been educated or employed, bubble charts of main subjects or author occupations, and links between advisors and students.
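As a rough illustration of what a main-subject bubble chart query might look like, the Python sketch below builds a SPARQL query string for the Wikidata Query Service. It is not one of the project's own queries. It assumes the thesis items carry the "on focus list of Wikimedia project" statement (P5008) and main subject statements (P921); the function name is hypothetical, and the QID of the project item is not given on this page, so it must be looked up and passed in.

```python
import re


def main_subject_bubble_query(project_qid: str) -> str:
    """Build a SPARQL query counting main subjects of items on a focus list.

    project_qid is the Wikidata QID of the project item. It is supplied by
    the caller because it is not stated on this page.
    """
    if not re.fullmatch(r"Q[1-9]\d*", project_qid):
        raise ValueError(f"not a Wikidata QID: {project_qid!r}")
    return (
        "#defaultView:BubbleChart\n"
        "SELECT ?subject ?subjectLabel (COUNT(?thesis) AS ?count) WHERE {\n"
        "  ?thesis wdt:P5008 wd:" + project_qid + " ;\n"  # on focus list of Wikimedia project
        "          wdt:P921 ?subject .\n"                  # main subject
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        "GROUP BY ?subject ?subjectLabel\n"
        "ORDER BY DESC(?count)"
    )
```

The returned string can be pasted into the query service. The `#defaultView:BubbleChart` comment asks the service to render the result as a bubble chart rather than a table.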

Tasks

If you would like to help, some easy tasks are making sure the theses are cited on relevant author Wikipedia pages, or matching authors to author name strings in the Mix'n'match tool.

Citing theses on Wikipedia

This Google Sheet shows theses by people who have Wikipedia pages (updated 23 March 2023). Unfortunately we have discovered that CiteQ is not currently helpful for citing theses, as the citations are not tracked by Altmetric. That means the impact of all the work is harder to see. We are currently replacing CiteQ citations with the "cite thesis" template. To make this easier, the Google Sheet now contains the citation with ref tags, ready to paste into the Wikipedia page without any need for source editing. A 4-minute "how to" video has been uploaded to YouTube showing how to create a new citation or replace an existing one.
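For anyone curious about what such a ready-to-paste citation looks like, here is a minimal Python sketch that wraps a "cite thesis" template in ref tags. It does not describe how the Google Sheet itself is generated: the function name and the choice of fields are assumptions, and the values in the usage line are invented placeholders.

```python
from typing import Optional


def cite_thesis_ref(last: str, first: str, year: str, title: str,
                    thesis_type: str, publisher: str,
                    url: Optional[str] = None) -> str:
    """Return a {{cite thesis}} citation wrapped in <ref> tags.

    Values are inserted as-is, so a value containing "|" or "}}" would
    break the template and should be fixed by hand.
    """
    fields = [
        ("last", last),
        ("first", first),
        ("date", year),
        ("title", title),
        ("type", thesis_type),   # e.g. "PhD thesis"
        ("publisher", publisher),  # the degree-granting institution
    ]
    if url:
        fields.append(("url", url))
    body = " ".join(f"|{name}={value}" for name, value in fields)
    return "<ref>{{cite thesis " + body + "}}</ref>"


# Placeholder values, not a real thesis.
print(cite_thesis_ref("Smith", "Jane", "1999", "An example title",
                      "PhD thesis", "Example University"))
```

With the placeholder values above, this prints `<ref>{{cite thesis |last=Smith |first=Jane |date=1999 |title=An example title |type=PhD thesis |publisher=Example University}}</ref>`.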

For reference purposes, here is the old Google Sheet.

Do you like working in other language Wikipedias?

This Google Sheet has a short list of thesis authors who do not have an English Wikipedia page, but do have one in another language (languages are shown in the last column). It would be great to cite the theses on those pages, so that non-English speakers can see the work exists. The first sheet in the file contains some instructions for how to go about this if you don't speak the language concerned. If you are fluent, you will of course find it much faster!

Mix'n'match

The Mix'n'match tool is a way to match the author name strings from the thesis project to authors on Wikidata. If you search Wikidata and do not find the author, try removing middle names, initials etc. If you are sure the person is not in Wikidata, click the 'new' button to create an item for them. You may be able to find other identifiers to add to the new record, e.g. ORCID or ResearchGate. If they have a university profile page, you can add the university as an 'employer' statement and use their profile URL as the reference URL for the statement. You do NOT need to link the author and the thesis item. DrThneed will periodically download matches from the Mix'n'match catalogue and match the authors and theses, and also add other information such as advisors.
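The search tip above (dropping middle names and initials) can be sketched as a small Python helper that lists progressively shorter forms of a name to try. This is a hypothetical aid, not part of Mix'n'match, and it is deliberately naive: it assumes the first word is a given name and the last word is the surname, so it will mishandle multi-word surnames.

```python
def name_search_variants(full_name: str) -> list[str]:
    """Return search strings from most to least specific, without duplicates."""
    parts = full_name.replace(".", " ").split()
    if len(parts) < 2:
        return [" ".join(parts)] if parts else []
    first, *middle, last = parts
    # Middle parts longer than one character are treated as names, not initials.
    middle_names = [p for p in middle if len(p) > 1]
    candidates = [
        parts,                          # full name, periods removed
        [first, *middle_names, last],   # initials dropped
        [first, last],                  # all middle names dropped
    ]
    variants: list[str] = []
    for candidate in candidates:
        text = " ".join(candidate)
        if text not in variants:
            variants.append(text)
    return variants


# Invented example name.
print(name_search_variants("Jane Q. Mary Smith"))
# prints ['Jane Q Mary Smith', 'Jane Mary Smith', 'Jane Smith']
```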

If you are not familiar with the Mix'n'match tool, this screen capture shows how to match items, using the Alexander Turnbull Library catalogue as an example.

Participants

Outcomes and impact

  • July 2022: After DrThneed's July presentation to the librarian community who provided the thesis data, Ambrosia10 posted a Twitter thread explaining the project and its progress to the wider Wikidata community and others on Twitter. Dr Amanda Whitmire, librarian at Stanford's Hopkins Marine Station, responded by expressing a desire for the theses from that station to be added to Wikidata. This led to an exchange in which DrThneed and Ambrosia10 offered encouragement and support to Dr Whitmire in preparing the thesis data for upload to Wikidata. As at 9 August 2022 Dr Whitmire has made over 1000 edits to Wikidata, including adding 353 Stanford theses by people who worked at Hopkins Marine Station. She has also created numerous items for the authors of those theses, and has learned how to cite them on Wikipedia.
  • August 2022: As a result of creating a YouTube video about the project and her OpenRefine workflow, DrThneed was contacted by a PhD student in Leipzig who is doing a PhD on dissertations.
  • September 2022: DrThneed's Twitter and Wikidata outreach raised awareness of the NZThesisProject, and the London School of Economics Wikidata Thesis project was able to adapt queries and visualisations used in the NZThesisProject for its own use (and vice versa).
  • September 2022: User:Schwede66 wanted to work on New Zealand Rhodes scholars. DrThneed scraped a list and imported it into OpenRefine, matched it to existing Wikidata items, and then created a Mix'n'match catalogue for the remaining scholars to be matched or created. As most of the scholars completed a degree at a New Zealand university and many returned to teach in New Zealand institutions, there is a large overlap between Rhodes scholars and the thesis project. Additionally, we have been able to match some scholars to their Oxford theses.
  • October 2022: DrThneed presented on the project to the Australia Wikimedia Community Meeting (October 2022 slides). DrThneed encouraged anyone who knows an institution keen to put thesis data into Wikidata to contact her.