Wikipedia:WikiProject Newspapers/Knight proposal round 2

In late 2018, members of the Newspapers on Wikipedia campaign (NOW) were invited to submit a second-round application for a project to be funded by the Knight Foundation's Artificial Intelligences Ethics Challenge, a program that will distribute $750,000 to several grantees. Lane Rasberry is the primary author of the grant proposal, and submitted it under the auspices of his employer, the University of Virginia Center for Data Ethics and Justice. Below are the responses to the Round Two questions, as prepared by Lane Rasberry, Pete Forsyth, and Daniel Mietchen. As you can see, the project's goals are closely aligned with those of NOW, and we expect that if the grant is awarded, the work will closely involve members of the campaign. We release the text of the proposal under the Creative Commons Attribution-ShareAlike 4.0 license, with an eye toward the principles espoused by proponents of Open Philanthropy.

Who is your intended audience for this project?

Our audience is people and machines who seek basic information about news sources for further analysis and research. This includes academic researchers, journalists, media critics, funders, and consumers evaluating news sources. Basic factual information about news media - such as ownership, region of coverage, founding dates, and sister publications - is universal and perpetually relevant, but currently absent from our public data commons.

We will publish information about media and their owners on Wikipedia as prose and on Wikidata as structured data. Our audience will readily find our open linked data in wiki sites, as Wikipedia has been a top-10 global website since 2008 and Wikidata has emerged as a leading source of open data in its quality and scope of coverage. Tech services, including Google, Amazon, Apple, Facebook, and many smaller ones, increasingly report using Wikipedia as source data and for quality control. Through our original curation and the widespread off-wiki reuse that follows from it, media and the public will become better informed as they evaluate or contextualize news sources.

Any human or machine that currently attempts news media research is impeded by fundamental data in this space that is incomplete, inaccessible, or inadequately curated. That is the audience we target.

How will your target audience be impacted by your proposal? What does the world look like if you are successful, and what do you do next?

When our audience searches anywhere for general reference information on news publishers and publications, they will readily find the information they need to evaluate the news source.

A core advantage of our proposal is its readiness to build our data into Wikidata, an open linked data platform. Queries, whether from layperson, expert, or machine, draw from multiple existing datasets and include data from diverse fields. Identifying news media is the first step toward evaluating news media. Once we identify this sector in Wikidata, we and everyone else can proceed with evaluating it in the public commons.
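To illustrate the kind of query this would support, here is a minimal Python sketch that builds a SPARQL query for the public Wikidata Query Service. It is an illustration under stated assumptions, not part of the proposal itself: the function names are hypothetical, and the choice of identifiers (P31 "instance of", Q11032 "newspaper", P127 "owned by", P571 "inception") is one plausible way to model the ownership and founding-date data described above. The sketch only constructs the request URL and does not contact the network.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_newspaper_query(limit=10):
    """Return a SPARQL query listing newspapers with optional owner and founding date.

    Hypothetical helper for illustration; the modeling choices are assumptions.
    """
    return f"""
SELECT ?paper ?paperLabel ?ownerLabel ?inception WHERE {{
  ?paper wdt:P31 wd:Q11032 .                   # instance of: newspaper
  OPTIONAL {{ ?paper wdt:P127 ?owner . }}      # owned by
  OPTIONAL {{ ?paper wdt:P571 ?inception . }}  # inception (founding date)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
""".strip()


def build_request_url(query):
    """Return a GET URL that asks the query service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

The URL returned by `build_request_url(build_newspaper_query(5))` could be fetched with any HTTP client. The `OPTIONAL` clauses keep newspapers in the result even when owner or founding date is missing, which is the gap this project aims to close.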

By merging and reconciling multiple datasets of news media, we will assemble a set that is more complete and accurate than its constituent parts. All parts of the collection benefit from the routine improvement that all Wikipedia and Wikidata content enjoys. Tools we build to leverage Wikidata's visualization capabilities could then match that data with any other open linked data in Wikimedia projects.
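As a rough sketch of what merging and reconciling could look like, the following Python code combines records from several source lists, filling gaps in one source with values from another and recording disagreements for human review. Everything here is an assumption made for illustration: the field names, the `normalize` matching rule, and the function names are hypothetical, and real reconciliation against Wikidata would need far more careful matching than a normalized title.

```python
def normalize(title):
    """Crude matching key: lowercase, drop a leading 'the', keep only letters and digits."""
    words = title.lower().split()
    if words and words[0] == "the":
        words = words[1:]
    return "".join(ch for ch in " ".join(words) if ch.isalnum())


def merge_datasets(*datasets):
    """Merge lists of record dicts keyed on normalized title.

    Returns (merged, conflicts). The first non-missing value seen for a field
    wins; later values that disagree are logged in conflicts as
    (key, field, kept_value, other_value) for manual review.
    """
    merged = {}
    conflicts = []
    for records in datasets:
        for record in records:
            key = normalize(record["title"])
            target = merged.setdefault(key, {})
            for field, value in record.items():
                if value is None:
                    continue  # a missing value never overwrites anything
                if field not in target:
                    target[field] = value
                elif field != "title" and target[field] != value:
                    conflicts.append((key, field, target[field], value))
    return merged, conflicts
```

For example, if a library catalogue lists "The Example Gazette" with a founding year but no owner, and a press directory lists "Example Gazette" with an owner, the merged record carries both facts. This is the sense in which the combined set is more complete than its constituent parts.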

This project is finite and has an endpoint. There is a certain amount of general reference information required to support a host of worthwhile queries about news media. Once this data is in the public commons, maintenance is relatively trivial, and efforts can turn to formulating interesting research questions which Wikidata can readily answer.

What are the tangible deliverables of this project?

The deliverables will be five working prototypes, each including a data corpus, queries and visualizations, documentation, and an impact report. In more detail, here is what the team will present for each of the five use cases:

A complete corpus of news sources published as an open linked dataset in Wikidata, for example modeling a particular community demographic, region, topic, library collection, time, or language translation
Sample queries and Wikidata visualizations which illustrate that dataset to make it relevant to researchers, students, and news consumers
Documentation characterizing the social context of the corpus including the community it serves and the extent of their engagement with curating the collection
A measurement of the impact of presenting the prototype, including a characterization of the audience and extent of their use of the prototype

The intent of each prototype is to model the scaling up of curation and presentation of open linked data into the public commons with more inclusive engagement in its management.

How diverse is your team?

The base of the core team is the Center for Data Ethics and Justice in the Data Science Institute at the University of Virginia. The role of this organization includes guiding the progress of this project through best practices in diversity and inclusion and in keeping focus on other ethical objectives, such as the particular values of the Wikimedia community and the Open Movement.

Team lead Lane Rasberry is a cofounder of Wikimedia LGBT+ and routinely engages in strategic planning for diversity. In this project as well as in general in Wikimedia programs, the participation of particular organizations and individuals is less important than the continual practice and public discussion of diversity and inclusion at decision and reporting points.

The primary resource allocation decision in this project will be selecting the institutional partners with whom we can pilot and showcase news media metadata to serve their community and particular need. The inclusion requests which we have collected include representation in this project for a non-English language community, women, the African diaspora, a smaller university, a demographic in the Global South, youth, and people who need accessibility in interface design. The goal of including all of these will figure into our outcome report.

We are ready to pass on collaboration opportunities with better-resourced conventional partners in order to establish Wikimedia and data science relationships with underrepresented communities.

Who are your competitors or others doing similar work?

This is an open linked data project with free and open licenses. We seek to integrate news source metadata into Wikidata. There is no competing alternative on par with the Wikimedia ecosystem. There are other organizations developing source metadata in Wikidata, and we already collaborate with many of them through the WikiCite project.

Right now, the curation of news source metadata is a tense competitive space. We will attempt to relieve this pressure by transforming it into a pre-competitive space where competitors become collaborators in sharing the same entry-level linked open data. The change we want is to elevate competition into specialization beyond the reach of crowdsourcing and the needs of general consumer interest.

Libraries are competitors in that they hold open data which remains inaccessible because it is not mapped and linked into the open data commons. Big tech companies are competitors in sharing Wikimedia information, to the extent that they rely on Wikipedia and Wikidata for quality control. One example of this is YouTube's statement at SXSW 2018 that Wikipedia counters fake news in its video offerings. Although YouTube and similar services have committed to use Wikipedia's news metadata, they have not yet committed to share back with the Wikimedia public data commons. https://www.wired.co.uk/article/wikipedia-google-youtube-facebook-support

Our project seeks to develop the norms of both contributing to Wikimedia data and reusing it.

Has your project or your thinking about this project changed since you submitted it? How so?

Since submitting the project we have surveyed potential collaborators and considered pilot subprojects to showcase.

We intend to establish a community-based news media program for integrating data, tools, and user communities into the Wikidata and Wikimedia sphere of influence. Now we have identified some classes of collaborators with whom we might partner and the ways in which they could be useful.

The Wikimedia Foundation's ORES is its highest profile AI service and has features for anyone to use. Our data science institute will guide the application of its quality control and community management service to the intersection of journalism and Wikimedia projects.

Libraries are obvious partners because they hold catalogues of media. The national libraries of Sweden and Wales are potential pilot cases outside the English language, because they have complete data corpora of source metadata in Swedish and Welsh and already mirror this information to Wikidata. This project could develop the news media segment of that collection. Other library projects like "Chronicling America" in the United States may be useful to pilot links between publisher profiles in Wikidata and the text of news articles in library systems.

Wikipedia community groups, including WikiProject Newspapers, have already organized university departments and classes to develop Wikimedia content for their geographical regions. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Newspapers Tools like PaceTrack measure the progress of such campaigns. https://github.com/hatnote/pacetrack/ Our project would encourage such community participation through university partnerships.

Please submit any bio link. If you have any specific experience developing, applying, or covering AI, please describe it here.

Lane Rasberry, Wikimedian in Residence at the Data Science Institute at the University of Virginia, is the project lead. https://en.wikipedia.org/wiki/User:Bluerasberry

Lane curates open linked data for Wikidata, contributes to the WikiCite and Scholia projects to profile source metadata in academic publications, advises graduate student research in machine learning on Wikipedia as a data corpus, promotes institutional partnerships between universities and Wikimedia projects, and has piloted a university project to surface journalism from segregation-era African American newspapers into Wikidata.

Beyond Lane or any project lead, Wikimedia projects are a social machine where multiple community representatives and hundreds of participants each merit credit for their contributions toward the success of projects. Whatever the crowd might contribute, Lane will keep the focus of the project on applications of AI.

WikiProject Newspapers is the most obvious community and project which already has convened subject matter expertise and interest in overseeing prose and structured data describing news sources in Wikimedia projects. WikiCite has already piloted analogous metadata curation practices for academic publications. Scholia is the WikiCite visualization suite which models consumer-targeted presentations of academic publication data, which this proposal seeks to adapt for news publication data. ORES is the highest profile Wikimedia-based AI service. Lane has experience developing, applying, and covering all these projects.