Wikipedia:Wikipedia Signpost/2023-10-03/Recent research

Recent research

Readers prefer ChatGPT over Wikipedia; concerns about limiting "anyone can edit" principle "may be overstated"

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

In blind test, readers prefer ChatGPT output over Wikipedia articles in terms of clarity, and see both as equally credible

A preprint titled "Do You Trust ChatGPT? -- Perceived Credibility of Human and AI-Generated Content"^[1] presents what the authors (four researchers from Mainz, Germany) call surprising and troubling findings:

"We conduct an extensive online survey with overall 606 English speaking participants and ask for their perceived credibility of text excerpts in different UI [user interface] settings (ChatGPT UI, Raw Text UI, Wikipedia UI) while also manipulating the origin of the text: either human-generated or generated by [a large language model] ("LLM-generated"). Surprisingly, our results demonstrate that regardless of the UI presentation, participants tend to attribute similar levels of credibility to the content. Furthermore, our study reveals an unsettling finding: participants perceive LLM-generated content as clearer and more engaging while on the other hand they are not identifying any differences with regards to message’s competence and trustworthiness."

The human-generated texts were taken from the lead sections of four English Wikipedia articles (Academy Awards, Canada, malware and United States Senate). The LLM-generated versions were obtained from ChatGPT using the prompt: 'Write a dictionary article on the topic "[TITLE]". The article should have about [WORDS] words.'

The researchers report that

"[...] even if the participants know that the texts are from ChatGPT, they consider them to be as credible as human-generated and curated texts [from Wikipedia]. Furthermore, we found that the texts generated by ChatGPT are perceived as more clear and captivating by the participants than the human-generated texts. This perception was further supported by the finding that participants spent less time reading LLM-generated content while achieving comparable comprehension levels."

One caveat about these results (which is only indirectly acknowledged in the paper's "Limitations" section) is that the study focused on four quite popular (i.e. non-obscure) topics – Academy Awards, Canada, malware and US Senate. Also, it sought to present only the most important information about each of these, in the form of a dictionary entry (as per the ChatGPT prompt) or the lead section of a Wikipedia article. It is well known that the output of LLMs tends to have fewer errors when it draws from information that is amply present in their training data (see e.g. our previous coverage of a paper that, for this reason, called for assessing the factual accuracy of LLM output on a benchmark that specifically includes lesser-known "tail topics"). Indeed, the authors of the present paper "manually checked the LLM-generated texts for factual errors and did not find any major mistakes," something that is well documented not to be the case for ChatGPT output in general. That said, it has similarly been claimed that Wikipedia, too, is less reliable on obscure topics. Also, the paper used the freely available version of ChatGPT (in its 23 March 2023 revision), which is based on the GPT 3.5 model, rather than the premium "ChatGPT Plus" version, which since March 2023 has been using the more powerful GPT-4 model (as does Microsoft's free Bing chatbot). GPT-4 has been found to have a significantly lower hallucination rate than GPT 3.5.

FlaggedRevs study finds that concerns about limiting Wikipedia's "anyone can edit" principle "may be overstated"

A paper titled "The Risks, Benefits, and Consequences of Prepublication Moderation: Evidence from 17 Wikipedia Language Editions",^[2] from last year's CSCW conference, addresses a longstanding open question in Wikipedia research, with important implications for some current issues.

Wikipedia famously allows anyone to edit, which generally means that even unregistered editors can make changes to content that go live immediately – only subject to "postpublication moderation" by other editors afterwards. Less well known is that on many Wikipedia language versions, this principle has long been limited by a software feature called Flagged Revisions (FlaggedRevs), which was developed and designed at the request of the German Wikipedia community and deployed there first in 2008, and has since been adopted by various other Wikimedia projects. (These do not include the English Wikipedia, which after much discussion implemented a system called "Pending Changes" that is very similar, but is only applied on a case-by-case basis to a small percentage of pages.) As summarized by the authors:

FlaggedRevs is a prepublication content moderation system in that it will display the most recent “flagged” revision of any page for which FlaggedRevs is enabled instead of the most recent revision in general. FlaggedRevs is designed to “give additional information regarding quality,” by ensuring that revisions from less-trusted users are vetted for vandalism or substandard content (e.g., obvious mistakes because of sloppy editing) before being flagged and made public. The FlaggedRevs system also displays the moderation status of the contribution to readers. [...] Although there are many details that can vary based on the way that the system is configured, FlaggedRevs has typically been deployed in the following way on Wikipedia language editions. First, users are divided into groups of trusted and untrusted users. Untrusted users typically include all users without accounts as well as users who have created accounts recently and/or contributed very little. Although editors without accounts remain untrusted indefinitely, editors with accounts are automatically promoted to trusted status when they clear certain thresholds determined by each language community. For example, German Wikipedia automatically promotes editors with accounts who have contributed at least 300 revisions accompanied by at least 30 comments.
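
The mechanism described in this quote can be sketched in a few lines of Python. This is a hedged illustration only, not the FlaggedRevs extension's actual code: the names `Editor`, `Revision`, `is_trusted`, `displayed_revision` and `save_edit` are hypothetical, the thresholds are the German Wikipedia figures quoted above, and the sketch assumes for simplicity that edits by trusted users are flagged automatically while all other edits wait for review.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds from the German Wikipedia example quoted above;
# other wikis configure different values.
MIN_REVISIONS = 300
MIN_COMMENTED_REVISIONS = 30


@dataclass
class Editor:
    has_account: bool
    revisions: int = 0
    commented_revisions: int = 0


@dataclass
class Revision:
    rev_id: int
    flagged: bool


def is_trusted(editor: Editor) -> bool:
    """Editors without accounts stay untrusted; others are promoted at the thresholds."""
    return (
        editor.has_account
        and editor.revisions >= MIN_REVISIONS
        and editor.commented_revisions >= MIN_COMMENTED_REVISIONS
    )


def save_edit(history: List[Revision], editor: Editor, rev_id: int) -> None:
    """Append a revision (oldest first). Assumption: trusted edits are auto-flagged."""
    history.append(Revision(rev_id, flagged=is_trusted(editor)))


def displayed_revision(history: List[Revision]) -> Optional[Revision]:
    """Return the most recent flagged revision, which is what readers would see."""
    for rev in reversed(history):
        if rev.flagged:
            return rev
    return None  # behaviour for pages with no flagged revision is not modelled here
```

In this sketch, an edit saved by an editor without an account is stored in the history, but `displayed_revision` keeps returning the older flagged revision until a reviewer flags the new one.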

The paper studies the impact of the introduction of FlaggedRevs "on 17 Wikipedia language communities: Albanian, Arabic, Belarusian, Bengali, Bosnian, Esperanto, Persian, Finnish, Georgian, German, Hungarian, Indonesian, Interlingua, Macedonian, Polish, Russian, and Turkish" (leaving out a few non-Wikipedia sister projects that also use the system). The overall findings are that

"the system is very effective at blocking low-quality contributions from ever being visible. In analyzing its side effects, we found, contrary to expectations and most of our hypotheses, little evidence that the system [...] raises transaction costs sufficiently to inhibit participation by the community as a whole, nor [that it] measurably improves the quality of contributions."

In the "Discussion" section, the authors write

Our results suggest that prepublication moderation systems like FlaggedRevs may have a substantial upside with relatively little downside. If this is true, why are a tiny proportion of Wikipedia language editions using it? Were they just waiting for an analysis like ours? In designing this study, we carefully read the Talk page of FlaggedRevs.^{[supp 1]} Community members commenting in the discussion agreed that prepublication review significantly reduces the chance of letting harmful content slip through and being displayed to the public. Certainly, many agreed that the implementation of prepublication review was a success story in general—especially on German Wikipedia. [...]
However, the same discussion also reveals that the success of German Wikipedia is not enough to convince more wikis to follow in their footsteps. From a technical perspective, FlaggedRevs’ source code appears poorly maintained.^{[supp 2]} [...] FlaggedRevs itself suffers from a range of specific limitations. For example, the FlaggedRevs system does not notify editors that their contribution has been rejected or approved. [...] Since April 2017, requests for deployment of the system by other wikis have been paused by the Wikimedia Foundation indefinitely.^{[supp 3]} Despite these problems, our findings suggest that the system kept low-quality contributions out of the public eye and did not deter contributions from the majority of new and existing users. Our work suggests that systems like FlaggedRevs deserve more attention.

(This reviewer agrees in particular regarding the lack of notifications for new and unregistered editors that their edit has been approved – having filed, in vain, a proposal to implement this uncontroversially beneficial and already designed software feature to the annual "Community Wishlist", in 2023, 2022, and 2019.)

Interestingly, while the FlaggedRevs feature was (as summarized by the authors) developed by the Wikimedia Foundation and the German Wikimedia chapter (Wikimedia Deutschland), community complaints about a lack of support from the Foundation for the system were present even then, e.g. in a talk at Wikimania 2008 (notes, video recording) by User:P. Birken, a main driving force behind the project. Perhaps relatedly, the authors of the present study highlight a lack of researcher attention:

"Despite its importance and deployment in a number of large Wikipedia communities, very little is known regarding the effectiveness of the system and its impact. A report made by the members of the Wikimedia Foundation in 2008 gave a brief overview of the extension, its capabilities and deployment status at the time, but acknowledged that “it is not yet fully understood what the impact of the implementation of FlaggedRevs has been on the number of contributions by new users.”^{[supp 4]} Our work seeks to address this empirical gap."

Still, it may be worth mentioning that there have been at least two preceding attempts to study this question (neither of these has been published in peer-reviewed form, thus their omission from the present study is understandable). They likewise don't seem to have identified major concerns that FlaggedRevs might contribute to community decline:

A talk at Wikimania 2010 presented preliminary results from a study commissioned by Wikimedia Germany, e.g. that on German Wikipedia, "In general, flagged revisions did not [affect] anonymous editing" and that "most revisions got approved very rapidly" (the latter result surely doesn't hold everywhere; e.g. on Russian Wikipedia, the median time for an unregistered editor's edit to get reviewed is over 13 days at the time of writing). It also found, unsurprisingly, a "reduced impact of vandalism", consistent with the present study.
An informal investigation of an experiment conducted by the Hungarian Wikipedia in 2018/19 similarly found that FlaggedRevs had "little impact on the growth of the editor community" overall. The experiment consisted of deactivating the feature of FlaggedRevs that hides unreviewed revisions from readers. As a second question, the Hungarian Wikipedians asked "How much extra load does [deactivating FlaggedRevs] put on patrollers?" They found that "[t]he ratio of bad faith or damaging edits grew minimally (2-3 percentage points); presumably it is a positive feedback for vandals that they see their edits show up publicly. The absolute number of such edits grew significantly more than that, since the number of anonymous edits grew [...]."

In any case, the CSCW paper reviewed here takes a much more comprehensive and methodical approach: it examines the impact of FlaggedRevs across multiple wikis, formalizes its research hypotheses, and uses more reliable statistical techniques.

The findings in detail

In more detail, the researchers formalized four groups of research hypotheses about the impact of FlaggedRevs:

First, the study assessed whether the "system is indeed functioning as intended", by hypothesizing that it reduces the "number of visible rejected contributions" (i.e. edits that were reverted after having been approved and thus having become visible to the general reader), both from users affected by the restriction (H1a) and from all editors (H1c), but not from those editors not affected (H1b). All three of these sub-hypotheses were confirmed in an interrupted time series (ITS) analysis of the monthly counts of such reverts (aggregated for all users in each group, over the entire wiki), covering the timespan from 12 months before to 12 months after the date (or month) on which FlaggedRevs was activated on a particular Wikipedia version. The researchers conclude that "In general, the results we see are in line with our expectations and provide strong evidence that FlaggedRevs achieves its primary goal of hiding low-quality contributions from the general public."
Secondly, "Our H2 hypotheses suggest that prepublication review will affect the quality of contributions overall. We operationalize quality in two ways. First, we use the number of rejected contributions that we operationalize as the number of reverts. [...] We also test our second hypotheses using average quality that we operationalize as revert rate [...] measured as the proportion of contributions that are eventually [reverted]." Like the first hypothesis, this is separately assessed for affected users, non-affected users and all users. The authors anticipated a rise in the quality of contributions overall (H2c) and of contributions from affected users such as IP editors (H2a), reasoning that "proactive measures of content moderation and production control can play an important role in encouraging prosocial behavior." For unaffected users, they again hypothesized a null effect (H2b). This set of hypotheses was again examined in an ITS analysis of the time series of monthly aggregate counts of such edits. Here, the authors "find little evidence of the prepublication moderation system having a major impact on the quality of contributions. Thus, we cannot conclude that FlaggedRevs alters the quantity or quality of newcomers’ contributions."
The third group of hypotheses was motivated by existing "research that has shown that additional quality control policies may negatively affect the growth of a peer production community" (citing several papers which have been covered here before, see e.g. "'The rise and decline' of the English Wikipedia"). Again, this is split into three sub-hypotheses for affected users (H3a), unaffected users (H3b) and the community overall (H3c). The authors chose the aggregate number of mainspace (article) edits as their measure of productivity, and hypothesized that FlaggedRevs would decrease it in all three cases – for affected (non-trusted) editors because of a "reduced sense of self-efficacy" (i.e. the lack of satisfaction that comes with seeing one's change immediately being shown to the public), but also for unaffected (trusted) editors, because "prepublication [review] systems require effort from experienced contributors and may result in a net increase in the demands on these users’ time". These hypotheses are again tested using an ITS analysis of aggregate (per-wiki) monthly numbers. Regarding H3a, this confirms a significant decrease for IP editors as one group of affected users, but not for newly registered editors as the other group. (Unfortunately, the analysis appears to treat these as static groups, without examining the possibility that FlaggedRevs may have motivated at least some people who had habitually contributed without logging in to do so under an account instead, in anticipation of becoming a trusted/unaffected user after passing the applicable threshold.) The study finds that "the deployment of the prepublication [review] discouraged the participation of the group of editors with the lowest commitment and most targeted by the additional safeguard [i.e. IP editors], but not the other groups."
In particular, FlaggedRevs did not cause a significant decline in article edits overall, contradicting the expectations formed based on the aforementioned previous research.
The fourth hypothesis was similarly motivated by previous research that had found that the "barrier to entry posed by prepublication review, combined with the delayed intrinsic reward, might be disheartening enough to drive newcomers away" (in the case of the creation of new articles on English Wikipedia, see our previous coverage: "Newcomer productivity and pre-publication review"). Here, the authors "hypothesize that the deployment of [FlaggedRevs] will negatively affect the return rate of newcomers (H4)." Unlike the previous three hypotheses, this effect on retention rate is tested using a per-user (instead of aggregate) dataset. The study finds "that although FlaggedRevs did negatively affect the return rate of newcomers in a way that was statistically significant, the size of this effect is extremely small." Again though, the analysis is limited by treating this group as static, without being able to consider the possibility that FlaggedRevs may motivate more people to create an account instead of contributing under an IP. What's more, the authors caution that their analysis was limited by the fact that "we do not have access to wiki-level configuration data on FlaggedRev" (referring to settings such as the edit number threshold at which an editor will be automatically promoted to trusted status). However, the Wikimedia Foundation does in fact publish this information, so there might be opportunities for future research to examine this research question more thoroughly. Relatedly, while the paper promises that "[a] replication dataset including data, code, and other supplementary material has been placed in the Harvard Dataverse archive and is available at: https://doi.org/10.7910/DVN/G1YFLE", that URL does not (yet) contain such material for most of the paper's results. (In March 2023, the authors acknowledged this issue and planned to remedy it, but at the time of writing the data repository appears unchanged.)
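
For readers unfamiliar with the interrupted time series (ITS) design used for the first three groups of hypotheses, the following is a minimal sketch of a generic segmented regression on monthly counts, written with NumPy. It is not the authors' model specification (which this review does not reproduce); the function name `fit_its` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def fit_its(counts, n_pre):
    """Fit y = b0 + b1*t + b2*post + b3*t_since by ordinary least squares.

    counts: monthly values, oldest first
    n_pre:  number of months before the intervention
    Returns (b0, b1, b2, b3): baseline level, baseline trend,
    level change at the intervention, and change in trend afterwards.
    """
    y = np.asarray(counts, dtype=float)
    t = np.arange(len(y), dtype=float)
    post = (t >= n_pre).astype(float)          # 1 from the intervention month on
    t_since = np.where(post == 1.0, t - n_pre, 0.0)  # months since intervention
    X = np.column_stack([np.ones_like(t), t, post, t_since])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return tuple(coef)


# Synthetic, made-up series: 12 months before and 12 months after an
# intervention that lowers the monthly count by 40 without changing the trend.
months = np.arange(24)
synthetic_counts = 100 + 2 * months - 40 * (months >= 12)
b0, b1, b2, b3 = fit_its(synthetic_counts, n_pre=12)
```

In such a model, the level-change coefficient (`b2` here) is the quantity of interest: a clearly negative value for visible reverted edits would correspond to the kind of drop hypothesized in H1a and H1c, while a value near zero would correspond to the null effects hypothesized for unaffected users.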

(Disclosure: This reviewer provided some input to the authors at the beginning of their research project, as acknowledged in the paper, but was not involved in it otherwise.)

See also related earlier coverage: "Sociological analysis of debates about flagged revisions in the English, German and French Wikipedias" (2012)

Briefly

Wikimania, the annual global conference of the Wikimedia movement, took place in Singapore in August (as an in-person event again for the first time since 2019). Its research track included the by now traditional "State of Wikimedia Research" presentation highlighting research trends from the past year (with involvement by members of this research newsletter), see our blog post with videos and slides. Videos and slides from other presentations are being uploaded, too.
See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"Wikidata as Semantic Infrastructure: Knowledge Representation, Data Labor, and Truth in a More-Than-Technical Project"

From the abstract:^[3]

"Various Wikipedia researchers have commended Wikidata for its collaborative nature and liberatory potential, yet less attention has been paid to the social and political implications of Wikidata. This article aims to advance work in this context by introducing the concept of semantic infrastructure and outlining how Wikidata’s role as semantic infrastructure is the primary vehicle by which Wikipedia has become infrastructural for digital platforms. We develop two key themes that build on questions of power that arise in infrastructure studies and apply to Wikidata: knowledge representation and data labor."

"Naked data: curating Wikidata as an artistic medium to interpret prehistoric figurines"

From the abstract:^[4]

"In 2019, Digital Curation Lab Director Toni Sant and the artist Enrique Tabone started collaborating on a research project exploring the visualization of specific data sets through Wikidata for artistic practice. An art installation called Naked Data was developed from this collaboration and exhibited at the Stanley Picker Gallery in Kingson, London, during the DRHA 2022 conference. [...] This article outlines the key elements involved in this practice-based research work and shares the artistic process involving the visualizing of the scientific data with special attention to the aesthetic qualities afforded by this technological engagement."

"The Wikipedia Republic of Literary Characters"

From the abstract:^[5]

"We [...] explore a user-oriented notion of World Literature according to the collaborative encyclopedia Wikipedia. Based on its language-independent taxonomy Wikidata, we collect data from 321 Wikipedia editions on more than 7000 characters presented on more than 19000 independent character pages across the various language editions. We use this data to build a network that represents affiliations of characters to Wikipedia languages, which leads us to question some of the established presumptions towards key-concepts in World Literature studies such as the notion of major and minor, the center-periphery opposition or the canon."

"What makes Individual I's a Collective We; Coordination mechanisms & costs"

From the abstract:^[6]

"Diving into the Wikipedia ecosystem [...] we identified and quantified three fundamental coordination mechanisms and found they scale with an influx of contributors in a remarkably systemic way over three order of magnitudes. Firstly, we have found a super-linear growth in mutual adjustments (scaling exponent: 1.3), manifested through extensive discussions and activity reversals. Secondly, the increase in direct supervision (scaling exponent: 0.9), as represented by the administrators’ activities, is disproportionately limited. Finally, the rate of rule enforcement exhibits the slowest escalation (scaling exponent 0.7), reflected by automated bots. The observed scaling exponents are notably robust across topical categories with minor variations attributed to the topic complication. Our findings suggest that as more people contribute to a project, a self-regulating ecosystem incurs faster mutual adjustments than direct supervision and rule enforcement."

"Wikidata Research Articles Dataset"

From the abstract:^[7]

"The "Wikidata Research Articles Dataset" comprises peer-reviewed full research papers about Wikidata from its first decade of existence (2012-2022). This dataset was curated to provide insights into the research focus of Wikidata, identify any gaps, and highlight the institutions actively involved in researching Wikidata."

"Speech Wikimedia: A 77 Language Multilingual Speech Dataset"

From the abstract:^[8]

"The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models."

15 years later, repetition of philosophy vandalism experiment yields "surprisingly similar results"

From the paper:^[9]

"Fifteen years ago, I conducted a small study testing the error-correction tendency of Wikipedia. [...] I repeated the earlier study and found surprisingly similar results. [...] Between July and November 2022, I made 33 changes to Wikipedia: one at a time, anonymously, and from various IP addresses. [...] Each change consisted of a one or two sentence fib inserted into the Wikipedia entry on a notable, deceased philosopher. The fibs were about biographical or factual matters, rather than philosophical content or interpretive questions. Although some of the fibs mention “sources”, no citations were provided. If the fibs were not corrected within 48 hours, they were removed by the experimenter. The fibs were all, verbatim, ones that I used in Magnus (2008). [...] Thirty-six percent (12/33) of changes were corrected within 48 hours. Rounded to the nearest percentage point, this is the same as the adjusted result in Magnus (2008)."

References

^ Huschens, Martin; Briesch, Martin; Sobania, Dominik; Rothlauf, Franz (2023-09-05). "Do You Trust ChatGPT? -- Perceived Credibility of Human and AI-Generated Content". arXiv:2309.02524 [cs.HC].
^ Tran, Chau; Champion, Kaylea; Hill, Benjamin Mako; Greenstadt, Rachel (2022-11-11). "The Risks, Benefits, and Consequences of Prepublication Moderation: Evidence from 17 Wikipedia Language Editions". Proceedings of the ACM on Human-Computer Interaction. 6 (CSCW2): 333:1–333:25. arXiv:2202.05548. doi:10.1145/3555225. ISSN 2573-0142. S2CID 246823933.
^ Ford, Heather; Iliadis, Andrew (2023-07-01). "Wikidata as Semantic Infrastructure: Knowledge Representation, Data Labor, and Truth in a More-Than-Technical Project". Social Media + Society. 9 (3): 20563051231195552. doi:10.1177/20563051231195552. ISSN 2056-3051.
^ Sant, Toni; Tabone, Enrique (2023). "Naked data: curating Wikidata as an artistic medium to interpret prehistoric figurines". International Journal of Performance Arts and Digital Media: 1–18. doi:10.1080/14794713.2023.2253335. ISSN 1479-4713.
^ Wojcik, Paula; Bunzeck, Bastian; Zarrieß, Sina (2023-05-11). "The Wikipedia Republic of Literary Characters". Journal of Cultural Analytics. 8 (2). doi:10.22148/001c.70251.
^ Yoon, Jisung; Kempes, Chris; Yang, Vicky Chuqiao; West, Geoffrey; Youn, Hyejin (2023-06-03). "What makes Individual I's a Collective We; Coordination mechanisms & costs". arXiv:2306.02113 [physics.soc-ph].
^ Farda-Sarbas, Mariam (2023), Wikidata Research Articles Dataset, Freie Universität Berlin, doi:10.17169/refubium-40231
^ Gómez, Rafael Mosquera; Eusse, Julian; Ciro, Juan; Galvez, Daniel; Hileman, Ryan; Bollacker, Kurt; Kanter, David (2023). "Speech Wikimedia: A 77 Language Multilingual Speech Dataset". arXiv:2308.15710 [cs.AI].
^ Magnus, P. D. (2023-09-12). "Early response to false claims in Wikipedia, 15 years later". First Monday. doi:10.5210/fm.v28i9.12912. ISSN 1396-0466.

Supplementary references and notes:

Discuss this story

FlaggedRevs

There seem to be some errors in the assumptions that the paper "The Risks, Benefits, and Consequences of Prepublication Moderation: Evidence from 17 Wikipedia Language Editions" (https://arxiv.org/abs/2202.05548) makes about how FlaggedRevs works. For example:

... FlaggedRevs is a prepublication content moderation system in that it will display the most recent “flagged” revision of any page for which FlaggedRevs is enabled instead of the most recent revision in general... - This depends on the wiki. For example, German Wikipedia uses pre-moderation, whereas Finnish Wikipedia is set up differently and uses post-moderation.
... Although editors without accounts remain untrusted indefinitely, editors with accounts are automatically promoted to trusted status when they clear certain thresholds determined by each language community.... - This also depends on the wiki. For example, dewiki promotes automatically, but huwiki has a rather small number of manually selected trusted reviewers, and fiwiki has a large number of manually selected trusted reviewers.

--Zache (talk) 09:11, 4 October 2023 (UTC)

In Russian Wikipedia, as well as in Russian Wikinews, FlaggedRevs is a disaster. You say Germans are guilty of that? --ssr (talk) 06:09, 27 October 2023 (UTC)

Wikidata

Wikidata as Semantic Infrastructure: Knowledge Representation, Data Labor, and Truth in a More-Than-Technical Project: The article offers many critiques of Wikidata, some interesting, some less relevant in my opinion. Interestingly, some of them were a big part of the discussions around adopting Wikidata on frwiki, so they might be a little out of date. The copyright issue, for example, was and I guess still is a show stopper for some. I think this point is significantly weakened by the emergence of LLMs like ChatGPT as major products of the GAFAMs: they don’t really care about copyright or about attributing the data they learn from, and they can’t trace the origin of the factoids they generate or reliably source anything … so far without any restraint from copyright law. Copyright does not seem to slow automated fact extraction from any public source in any way … (The rest of the arguments covered for or against Wikidata on Wikipedias have already been extensively discussed and argued on the projects; I won’t return to them because I don’t really feel like it.) I don’t quite understand the upper ontology part, as I think Wikidata as it stands is difficult to use as an upper ontology because of its lack of formalisation. There are no reasoning rules in Wikidata to capture some of the meaning, and this is a crucial part of how an upper ontology is supposed to be used. The upper tree is indeed, or has long been (there is an ongoing effort to clear this up), a globally hard-to-manage mess. But I doubt anyone has actually used the classes they present as bad in a meaningful way, or actually used Wikidata as an upper ontology. I’d like this claim of the article to be given more substance or proof. It’s a criticism I would actually make of the whole article: it is a discussion of the critiques, but there is not really a lot of data to give substance; it is mostly qualitative, and as such arguable.
TomT0m (talk) 11:38, 4 October 2023 (UTC)

ChatGPT v. Wikipedia

The study authors comment on prose quality. I happened to ask ChatGPT yesterday to explain what government shutdowns in the U.S. are and what effects they have. I got the following answer:

Government shutdowns in the United States occur when Congress fails to pass a budget or a continuing resolution to fund the government. This can happen due to disagreements between political parties or issues related to specific policies. The shutdown results in the temporary suspension of non-essential government services and the furlough of federal employees.

During a government shutdown, essential services such as national security, law enforcement, and emergency activities continue to operate. However, many government functions, agencies, and national parks may be affected. Federal employees who are deemed non-essential may be furloughed, meaning they are placed on temporary unpaid leave.

The economic impact can be significant, affecting various sectors, including tourism, government contractors, and the overall economy. Additionally, government shutdowns can lead to a loss of public confidence in the government's ability to function effectively.

It's worth noting that my information is up to September 2021, and I recommend checking the latest sources for updates on this topic.

I then compared that to the lead of Government shutdowns in the United States:

In the United States, government shutdowns occur when funding legislation required to finance the federal government is not enacted before the next fiscal year begins. In a shutdown, the federal government curtails agency activities and services, ceases non-essential operations, furloughs non-essential workers, and retains only essential employees in departments that protect human life or property. Shutdowns can also disrupt state, territorial, and local levels of government.

Funding gaps began to lead to shutdowns in 1980, when Attorney General Benjamin Civiletti issued a legal opinion requiring it. This opinion was not consistently adhered to through the 1980s, but since 1990 all funding gaps lasting longer than a few hours have led to a shutdown. As of September 2023, ten funding shutdowns have led to federal employees being furloughed.

The most significant shutdowns have included the 21-day shutdown of 1995–1996, during the Bill Clinton administration, over opposition to major spending cuts; the 16-day shutdown in 2013, during the Barack Obama administration, caused by a dispute over implementation of the Affordable Care Act (ACA); and the longest, the 35-day shutdown of 2018–2019, during the Donald Trump administration, caused by a dispute over funding an expansion of barriers on the U.S.–Mexico border.

Shutdowns disrupt government services and programs; they close national parks and institutions. They reduce government revenue because fees are lost while at least some furloughed employees receive back pay. They reduce economic growth. During the 2013 shutdown, Standard & Poor's, the financial ratings agency, said on October 16 that the shutdown had "to date taken $24 billion out of the economy", and "shaved at least 0.6 percent off annualized fourth-quarter 2013 GDP growth".

Personally I found ChatGPT's output a lot more readable than the Wikipedia lead – it is just better written. The English Wikipedia text often required me to go back and read the sentence again.

Take the first sentence: In the United States, government shutdowns occur when funding legislation ... At first I parsed "when funding legislation" as an indication of when shutdowns occur (i.e. "when you are funding legislation"). I needed to read on to realise that this wasn't where the sentence was going.

Next, Wikipedia uses the rather technical expression "when funding legislation ... is not enacted" (which is also passive voice) where ChatGPT uses the much easier-to-understand "when Congress fails to pass a budget" (active voice).

Where ChatGPT speaks of a "temporary suspension of non-essential government services", Wikipedia says the federal government "curtails agency activities and services, ceases non-essential operations", etc. I find the ChatGPT phrase easier to understand and faster to read while providing much the same information as the quoted Wikipedia passage (a point the study authors commented on specifically).

The Wikipedia sentence Funding gaps began to lead to shutdowns in 1980, when Attorney General Benjamin Civiletti issued a legal opinion requiring it. leaves me wondering even now what the word "it" at the end of the sentence is meant to refer to.

I suspect our sentence construction and word use are not helping us win friends. It's one thing when we are the only service available; it's another when there is a new kid on the block. Andreas JN 466 13:56, 4 October 2023 (UTC)

I agree that ChatGPT produces better prose than most humans do (myself included), and as far as reliability is concerned, I'd honestly trust it to give me a general overview on any non-controversial topic, no matter how obscure (I tested it earlier on "the Black sermonic tradition of whooping" and it passed with flying colours). I think where it fails and starts to hallucinate is when you ask it to drill down into the detail. This is Wikipedia's strength – we may not be good at summarizing complex topics, but we're very good at detail.

So I don't think it has to be "machines vs. Wikipedians". I used ChatGPT this morning to write a lead for the article Khlysts. As leads go, it's maybe not perfect, but it's decent. But when I ask the bot for more in-depth information, like "Who founded the Khlysts?", it starts to get vague, and if I press it for specific answers, that's when it starts lying rather than admit it doesn't know. Chatbots have got a long way to go before they can write entire articles (at least without a lot of coaching and cross-checking, at which point you're not really saving any time), and I personally don't believe that AI will ever reach that level. But I can easily foresee a not-too-distant-future in which Wikipedia has an integrated LLM interface that allows editors to auto-generate summary-style content, or perhaps provides it to readers on the fly via a "Summarize this article/section" button.

In my view, the alarm of the study authors is misplaced. They seem to agree that the AI-generated content they used in the study is, in fact, trustworthy and competent, so there's no reason to be "unsettled" that people rate it as such. I suspect the participants would give a different rating to the vague, generic answers that ChatGPT gives you when you try to go beyond a summary. I don't think there's any competition here: when people want pedantic, granular detail, they'll always know where to find it. Sojourner in the earth (talk) 21:52, 4 October 2023 (UTC)

The fact that ChatGPT doesn't know when it doesn't know something means it's "intelligent" only in a restricted sense. To go on and on about a topic, all the while presenting itself as an "intelligence", also does not seem ethical. CurryCity (talk) 08:29, 5 October 2023 (UTC)

Quoted from above, "I'd honestly trust it to give me a general overview on any non-controversial topic, no matter how obscure": that's extremely problematic. Recently I read about someone who is a critic with a dim view of a certain thing, a fact that is well known in his field (albeit not by the general public); when he asked ChatGPT who he is, it smoothly and confidently claimed that he is a proponent of it. The devil is in the details of the black box of "where it fails and starts to hallucinate is when you ask it to drill down into the detail", but clearly the upshot is that you absolutely cannot trust GPT not to mix bits of plausible-sounding-but-definitely-incorrect hallucination into any kind of "explanation" that it gives you about what something or someone is or isn't, when that something or someone is any more obscure than extremely widely known and understood. When I read in this Signpost article the sentence, "[…] even if the participants know that the texts are from ChatGPT, they consider them to be as credible as human-generated and curated texts [from Wikipedia]," the first thought that popped into my head was, "Well, that's it, we're boned." Duck and cover and kiss our present standard of living goodbye, because some bad shit will be happening if that shit continues very long in its current configuration. You know how McNamara said that "the indefinite combination of human fallibility and nuclear weapons will destroy nations"? Like it doesn't even matter if the annualized risk is somewhat low, if you run the runtime for enough years? That's most likely what's going to happen with the combination of human [epistemologic] fallibility [incompetence] plus LLMs, unless the "indefinite" part can be massively shortened in duration by chaining today's type of LLMs in series with some kind of downstream rapid-acting bullshit-detector software that can act like a filter.
If that shit doesn't get invented within the next N years, where time N is pretty fucking short, then there is going to come a time in the future where some Homer Simpson who works at a nuclear plant is going to ask an LLM how to run the reactor properly and the LLM is going to reply with some super-smoothly-worded and speciously/superficially-plausible-but-yet-idiotically-wrong answer, and there's going to be another Chernobyl-type steam explosion of a reactor, because even though the plant has a strict rule against asking any LLMs for any cheatsheet help, Homer is going to carry his smartphone into the bathroom and ask it while inside the toilet stall because he thinks he's getting away with a clever clandestine lifehack and sticking it to The Man. Or some librarian is going to ask an LLM which library books to ban for age-inappropriate content and it's totally going to confabulate a bunch of imaginary but plausible-sounding horseshit and get books banned for containing things that they don't even contain at all (that one already happened, by the way). Or a lawyer is going to ask an LLM to write a legal argument and cite a bunch of 'definitely real and please not fake' references that support it, and he's going to get fired because those smoothly cited plausible-sounding references are all made-up imaginary horseshit anyway (that one already happened, by the way). "Oh, no, don't be silly, Quercus, that reactor-explosion thing's not gonna happen." Yeah. We all sure better just hope not, right? Running on hope and faith but no guardrails at all, what could go wrong, right? This is why the likes of Sam Altman are begging governments to regulate their industry (albeit begging with crocodile tears in their eyes while they shovel wads of money into their bank accounts). 
Also, just to clear up a misconception implied on this page, among some of these other comments, regarding the notion that, "Well, how can you blame people, because a supposed-to-be-nonfiction answer that's factually wrong but super-smoothly worded and easy to read is preferable to one that's factually correct but somewhat clunkily worded." Jesus, Lord help us with the stupidity. If you have to ask what's wrong with that notion, then goto Homer's steam explosion above. Quercus solaris (talk) 23:37, 6 October 2023 (UTC)

Thing is, there are plenty of Wikipedia articles that also contain "plausible-sounding-but-definitely-incorrect" information, so for LLMs and Wikipedia to be rated as equally credible sounds about right. Of course I wouldn't use either as a source if I were writing my own article, but for satisfaction of personal curiosity, sure. As for the rest of your comment, I'm not convinced that "human fallibility plus LLMs" is a bigger danger to society than, say, human fallibility plus the invention of the printing press, or human fallibility plus firelighters. Pretty much anything that benefits humanity can be misused, and every Next Big Thing gets people prophesying the imminent collapse of civilization, but somehow we keep trucking along. Sojourner in the earth (talk) 05:40, 7 October 2023 (UTC)

In this case success and failure are asymmetric. It takes only one bad enough catastrophe once to wipe out civilisation, if not humanity, even one that is "too small" to destroy the rest of the biosphere or planet with us. CurryCity (talk) 06:33, 10 October 2023 (UTC)

Even if ChatGPT or its successor becomes the predominant internet search tool, that doesn't mean Wikipedia will be obsolete. It likely means that Wikipedia will go back to its theoretical origin as a reference work rather than the internet search tool many readers use it as. Thebiguglyalien (talk) 16:11, 4 October 2023 (UTC)

Ah, the rise of AI. I've used it to get ideas for small projects in the past, but people prefer LLMs over Wikipedia? That's, just... sad. The Master of Hedgehogs is back again! 22:09, 4 October 2023 (UTC)

It's quite understandable because articles often have a ponderous, pedantic style. As a fresh example, consider the current FA, ministerial by-election. Its first sentence is
From 1708 to 1926, members of parliament (MPs) of the House of Commons of Great Britain (and later the United Kingdom) automatically vacated their seats when made ministers in government and had to successfully contest a by-election in order to rejoin the House; such ministerial by-elections were imported into the constitutions of several colonies of the British Empire, where they were likewise all abolished by the mid-20th century.
When this was a candidate, a reviewer observed that
... you have a tendency to write very long sentences, exemplified by the lead sentence, which is 67 words long. A quick google search tells me 25 words is the optimal sentence length (and 30 the maximum), but here you can routinely see double that.
By my count, this sentence now has 68 words so it's only getting longer!

Andrew🐉(talk) 11:47, 5 October 2023 (UTC)

I really hope we can get a guideline on readability. My young sister-in-law complains that we adults "Explain things like Wikipedia", rather than in a way that a teenager can understand. I started to draft a readability guideline once, but never found time to get it past the first couple of sentences. —Femke 🐦 (talk) 19:59, 6 October 2023 (UTC)

There are plenty of guidelines such as WP:NOTJARGON, WP:NOTJOURNAL and MOS:LEAD. The challenge is getting them followed. There are also plenty of metrics for this such as the automated readability index. We just need a bot or tool to assess all our articles and so highlight cases in need of attention. Andrew🐉(talk) 22:46, 6 October 2023 (UTC)
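
As a rough illustration of the kind of tool suggested above, the following Python sketch scores lead sections with the automated readability index, ARI = 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43, and lists the ones above a cutoff. The function names, the cutoff of 14 and the naive sentence splitting are all assumptions made for this example; it is not an existing bot or tool, and fetching article text is left out entirely.

```python
import re


def automated_readability_index(text: str) -> float:
    """Return the ARI score: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    words = text.split()
    # Count only letters and digits as characters, ignoring punctuation.
    characters = sum(ch.isalnum() for word in words for ch in word)
    # Naive assumption: any run of '.', '!' or '?' ends a sentence.
    # Abbreviations such as "U.S." and decimals will be over-split.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


def flag_hard_leads(leads: dict[str, str], threshold: float = 14.0) -> list[tuple[str, float]]:
    """Return (title, score) pairs scoring above the threshold, hardest first.

    `leads` maps a page title to its lead text; the default threshold is an
    arbitrary, hypothetical cutoff chosen only for illustration.
    """
    scored = [(title, automated_readability_index(text)) for title, text in leads.items()]
    flagged = [(title, score) for title, score in scored if score > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)
```

Calling `flag_hard_leads({"Some article": lead_text})` would return the titles whose leads score above the cutoff. ARI only looks at word and sentence length, so a high score is at most a hint that a human should take a look, not a verdict on the prose.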

What is missing from these guidelines is a) the need to make non-technical text easy to understand and b) more practical advice on what sentence and paragraph lengths make text difficult. I've been playing with the idea of creating a contest, similar to WP:the core contest, to tackle lead difficulty. Not sure yet what the best set-up is, as misguided attempts at simplifying can be quite bad. —Femke 🐦 (talk) 06:50, 7 October 2023 (UTC)

You have your work cut out for you as TFA is now featuring another good example: Affine symmetric group. ChatGPT struggles with the topic too. I reckon that it's possible to do better but it would be a thankless task currently. Andrew🐉(talk) 08:55, 8 October 2023 (UTC)

@Femke and Andrew Davidson: For what it's worth, you can ask ChatGPT or DeepL (which has a beta feature for this) to rewrite texts to make them easier to understand. Fed the Wikipedia lead section of Government shutdowns in the United States mentioned above and asked to improve its readability, ChatGPT produced the following:

In the U.S., government shutdowns happen when the funding needed to run the federal government isn't approved before the new fiscal year begins. During a shutdown, the government cuts back on agency activities and services, stops non-essential operations, puts non-essential workers on leave, and keeps only essential employees in departments that protect people or property. Shutdowns can also cause disruptions at the state, territorial, and local government levels.

Funding gaps started causing shutdowns in 1980, following a legal opinion from Attorney General Benjamin Civiletti. While this opinion wasn't consistently followed in the 1980s, since 1990, any funding gap lasting more than a few hours has led to a shutdown. As of September 2023, ten funding shutdowns have resulted in federal employees being placed on leave.

Major shutdowns include the 21-day shutdown in 1995–1996, during the Bill Clinton administration, due to disagreements over significant spending cuts; the 16-day shutdown in 2013, during the Barack Obama administration, caused by a dispute over implementing the Affordable Care Act (ACA); and the longest one, the 35-day shutdown in 2018–2019, during the Donald Trump administration, triggered by a disagreement over funding for expanding barriers on the U.S.–Mexico border.

Shutdowns disrupt government services and programs, leading to the closure of national parks and institutions. They also result in reduced government revenue as fees are lost, even though some furloughed employees eventually receive back pay. Furthermore, they negatively impact economic growth. For instance, during the 2013 shutdown, financial ratings agency Standard & Poor's reported on October 16 that the shutdown had already taken "$24 billion out of the economy" and "reduced annualized fourth-quarter 2013 GDP growth by at least 0.6 percent."

I wouldn't endorse every change, but would argue that the overall result is indeed improved readability. Andreas JN 466 11:51, 8 October 2023 (UTC)
