Template:Did you know nominations/GPT-2

The following is an archived discussion of the DYK nomination of the article below. Please do not modify this page. Subsequent comments should be made on the appropriate discussion page (such as this nomination's talk page, the article's talk page or Wikipedia talk:Did you know), unless there is consensus to re-open the discussion at this page. No further edits should be made to this page.

The result was: promoted by Amakuru (talk) 10:09, 26 February 2021 (UTC)

GPT-2

... that the artificial intelligence program GPT-2 can summarize, respond to, generate, and even translate human-level writing, despite being trained to do nothing more than predict the next word in a sequence? Source: OpenAI paper, ref in article
- ALT1: ... that the GPT-2 artificial intelligence can summarize, respond to, generate, and translate text on a human level, despite being trained to do nothing more than predict the next word in a sequence? Source: OpenAI paper, ref in article
- ALT2: ... that the GPT-2 artificial intelligence can summarize, respond to, generate, and translate text, despite being trained to do nothing more than predict the next word in a sequence? Source: OpenAI paper, ref in article

Reviewed: Södermanland runic inscription 140

5x expanded by JPxG (talk). Self-nominated at 17:25, 26 December 2020 (UTC).

Going to be a harsh review because I think it's important we get this topic right.

General: Article is new enough and long enough
New enough: Long enough:

Policy compliance:

Adequate sourcing: - Under "Architecture", the last paragraph is lacking citations, as is the last sentence of the first paragraph.
Neutral: - I have some concerns. The claim that GPT-2 "often" passes the Turing test (implied with an Easter egg link) is not implausible, but it's such a high-impact claim that I think it needs secondary sources to show that this is accepted within the field. Throughout the article I have concerns that the prose is parroting primary claims a little without the appropriate in-text attribution of viewpoint, or sounding a bit too much like a pitch to investors ("While its objective is simple", "GPT-2 became capable of performing well"). A copyediting run with this in mind should solve it: most claims could be toned down or attributed, the alternative being more academic secondary sources.
Free of copyright violations, plagiarism, and close paraphrasing:
Other problems: - "Scale-up" and "tokenization" are dab links. When it comes to the lead image, can you explain to me why it's freely licensed? Screenshots are not in general, of course, and while lots of OpenAI content might be open-source, all I see on this specific website is a "© 2020 InferKit".

Hook: Hook has been verified by provided inline citation
Cited: Interesting:

QPQ: Done.

Overall: Hook is interesting and its claims are uncontroversial enough for the primary source to be fine. It would be good if the article could mention the stages of source code release—am I right that all the code is now released? Or just some? But of course, the researchers initially had concerns. Possibly the topic is not quite D7 "complete" without some description of the source code releases. — Bilorv (talk) 23:02, 26 December 2020 (UTC)

I appreciate the brutality. Truth be told, I was planning to write an article at least twice this size (if not more); the technical background took me longer than I anticipated, and I ended up getting buttonholed by IRL goings-on halfway through. There is definitely a lot of stuff that ought to be in there, and isn't; I was contemplating just doing it when I got back home, but I was running out of time to submit a DYK! OpenAI's claims are, indeed, wild and outrageous, but there are a lot of secondary sources to back them up (and, for a while at least, it was possible to go try it out yourself on a few websites and get your mind blown in real time). I don't know when I will have time to go through and put those sources in, but I can try to carve out some time in the next couple days. As for the image, well, TalkToTransformer is currently part of some gee-whiz startup, but prior to that I believe it had different licensing information (will try to find it for you). Would be an interesting quandary to figure out whether GPT-2 holds copyright to its own works, huh? Anyway, I have to do some stuff tonight but I will try to get started on all this crap tomorrow. And thanks for the review! jp×g 00:37, 27 December 2020 (UTC)

No worries, often getting a foothold can be the hardest part and I'll give you a few days, know it's a busy time of year for many. — Bilorv (talk) 14:47, 27 December 2020 (UTC)

@JPxG: at the one-week mark, I see you've not really been active since my last comment, but as the problems could take a while to fix, I think I'll have to fail this in a few days unless you can find a time in your schedule to commit to resolving the above. Either way I hope the comments are useful for the article's future progress. — Bilorv (talk) 01:34, 3 January 2021 (UTC)

Have got some free time today, will finish it off. jp×g 15:28, 5 January 2021 (UTC)

@JPxG: I'll give you 24 hours but after that I think I'll have to fail this, sorry. — Bilorv (talk) 23:47, 6 January 2021 (UTC)

Okay, adding the relevant sections now. jp×g 23:48, 7 January 2021 (UTC)

Unfortunately, I'm going to have to fail this. It's been an additional 24 hours and improvements have been made, but I believe there are still neutrality issues that are a barrier to showing this on the main page. I hope the feedback is useful and would look forward to future development of the topic. — Bilorv (talk) 23:37, 8 January 2021 (UTC)

@Bilorv: I've added some more citations to the claims you mentioned above (like its output being plausibly interpreted as human, which most of the sources support, and which I've clarified in the lede suffers on longer passages); I'm not sure what action can be taken to give it more neutrality. In your initial review you mentioned phrases like "its objective is simple" sounding like an investor pitch. The reason for this specific phrasing is that, well, its objective was simple: unlike previous ML models measured on the same benchmarks (which often involved extensive task-specific fine-tuning), GPT-2 was not reinforced on its performance on any task other than text prediction. That is to say, during its training, it was not assessed on any metrics for machine translation or summarization. Similarly, "perform well" is based on things like its performance on the WMT-14 French-English test set, on which it achieved 11.5 BLEU (comparable or superior to other unsupervised translation models, but unlike them, it contained only 10 MB of untranslated French in its training corpus). jp×g 02:05, 9 January 2021 (UTC)
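
For readers who want to see what "trained on nothing other than text prediction" means in practice, here is a minimal sketch of a next-word-prediction objective. It is only an illustration under loud assumptions: it uses bigram counts rather than GPT-2's neural network, the toy corpus is invented, and the function names (`train_bigram`, `next_word_probs`, `average_next_word_loss`) are hypothetical and do not come from OpenAI's code. The point it tries to show is that the only quantity being scored is how well the next word is predicted; no translation or summarization metric appears anywhere.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word follows each other word (toy stand-in for training)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Relative-frequency estimate of P(next word | previous word)."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}


def average_next_word_loss(counts, tokens):
    """Mean negative log-likelihood of each observed next word.

    This is the only thing the objective measures: next-word prediction.
    """
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_word_probs(counts, prev).get(nxt, 0.0)
        losses.append(-math.log(p) if p > 0 else float("inf"))
    return sum(losses) / len(losses)


# Invented toy corpus, for illustration only.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(next_word_probs(model, "the"))
print(average_next_word_loss(model, corpus))
```

A benchmark such as BLEU on a translation test set would be computed separately, after training, on the model's outputs; nothing like it feeds back into the loss above.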

We've still got the Easter egg link asserting that the model "sometimes" passes the Turing test, which would need explanation in prose in the body with attribution of this view. I didn't hear back on that licensing point. At the time I wrote the above, there were still uncited parts and none of this WMT-14 evidence in prose (which is absolutely a great improvement). I'm not happy to extend this review indefinitely, after setting a hard time limit that was not met after quite some leeway. I will reluctantly call for a new review if you insist on this. — Bilorv (talk) 16:41, 9 January 2021 (UTC)

Okay, back again today. I've never had a DYK fail before so I am going to do my best on this one. jp×g 00:13, 12 January 2021 (UTC)

So I went ahead and added a sentence with a ref to a study in the last section which I believe qualifies to support the "sometime passes Turing" statement in the lede (I accidentally marked it as a minor edit; it was not). @JPxG, I strongly suggest you simply remove the image for now unless/until you can get confirmation on its free usage. It is a nice illustration, but not worth failing the DYK over. --LordPeterII (talk) 15:21, 25 January 2021 (UTC)

Yeah, I've commented out the image for now (I am going to be adding some more stuff to the article later, and verifying the status of the image is up there in my list). This DYK has been hanging for quite some time, so I think it is ready to go (once I'm finished with my expansion I am thinking of going for GA or FA because why not). jp×g 23:26, 26 January 2021 (UTC)

I'm not sure if I've been clear enough but I'm not going to approve this hook as issues were not resolved by the deadline I set. The article is now much lengthier and more detailed than what I reviewed (a very good and welcome improvement, don't get me wrong) so this needs a thorough fresh review by an uninvolved reviewer if it is to be approved. I'll ask for a new reviewer and wish you luck with the article in the future. — Bilorv (talk) 23:48, 26 January 2021 (UTC)

Hi there! First of all, thanks for your patience. I know this has been sitting in the DYK queue for a while. I'm a data scientist by profession, and I'm thrilled to see content like this on Wikipedia. I may be inspired to edit in this topic area shortly. :-)

The article is new enough, certainly long enough. Earwig flagged a few areas, but they were basic phrases or quotations. I'm going to take a closer look at the article's neutrality and sourcing, based on the above feedback from Bilorv. However, I'm a bit concerned about WP:citation overkill primarily in the lead, but also in other parts of the article. Also, I see terms like "state of the art", which to me sounds like MOS:WEASEL.

Hook is verifiable; however, it is 205 characters long. The maximum is 200, so please shorten it. Edge3 (talk) 04:50, 15 February 2021 (UTC)

@JPxG: You also have a large blockquote in the Architecture section. Do you need to share all of that info in this way? It would also be helpful to link the highly technical terms. Further, within that blockquote, you seem to have retained the reference "[53]", which is irrelevant to Wikipedia. Edge3 (talk) 01:45, 16 February 2021 (UTC)

@Edge3: It lives!!!

So, about these things, I'll be honest: that enormous blockquote is mostly an item on my to-do list. My approach to summarizing papers and results so far has been to explain them in a way that provides appropriate background (i.e. in other parts I have explained what the WMT'14 benchmarks are and what BLEU means); with that quote, the things that need explanation or context are barely within my own understanding (wew lad!), and they're packed pretty densely. For now, I would be willing to comment it out or heavily abridge it until I can get home and do this (I have not had a lot of time for editing lately, and am temporarily relegated to a computer that can't view PDFs properly).

As regards "state of the art", I will grant that it's the most bullshitty sounding phrase in the world, but at least in the NLP papers I've been reading, it's used solely to convey a specific objective definition, i.e. whatever model or approach, at the time of any given publication, had been documented as achieving the highest performance on whatever metric. For example, in this sense, we might say that Charles Lindbergh's solo flight in 1927 outperformed the previous state of the art on transatlantic flight pilot-count minimization, previously 2. If this sounds too much like a buzzword, I would be willing to give an explicit definition of the term somewhere explaining that it's being used in a specific objective sense that isn't "really cool". Failing that, I could attempt to write it out and replace every instance with some synonym, or "the then-current best result achieved by previous applications of NLP models to the benchmark", or whatever.
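
To make that objective sense concrete, the definition can be sketched as "the best documented result on a metric as of a given date". The snippet below is a rough illustration only: `Result` and `state_of_the_art` are hypothetical names, and the 1919 Alcock and Brown entry is an assumption about which earlier two-person flight "previously 2" refers to.

```python
from dataclasses import dataclass


@dataclass
class Result:
    year: int
    system: str
    score: float


def state_of_the_art(results, as_of_year, higher_is_better=True):
    """Return the best documented result published on or before as_of_year, or None."""
    eligible = [r for r in results if r.year <= as_of_year]
    if not eligible:
        return None
    pick = max if higher_is_better else min
    return pick(eligible, key=lambda r: r.score)


# Metric: number of pilots on a nonstop transatlantic flight (lower is better).
# The 1919 entry is an assumed reading of "previously 2".
flights = [
    Result(1919, "Alcock and Brown", 2),
    Result(1927, "Lindbergh", 1),
]

print(state_of_the_art(flights, 1926, higher_is_better=False))
print(state_of_the_art(flights, 1927, higher_is_better=False))
```

Read this way, the phrase is always relative to a metric and a date, which is what separates the technical usage from puffery.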

For the hook, I offer one that's 195 characters, let me know if this is crap or what.

Overall, I am willing to stand by it as it is, or make the modifications mentioned above. That said, I haven't finished doing everything I want to do in this article; I have some more to say about reception and critical response, the drama of OpenAI's break from tradition followed by reversal of policy and incremental releasing of larger models, and subsequent research that's been done using GPT-2 (although this is probably more of a GAN issue than a DYK issue). jp×g 03:07, 16 February 2021 (UTC)

Thanks for the prompt reply! Here are my comments:

A concern I have about the blockquote is that it's very long, and could potentially be a WP:copyvio issue. You'd have to demonstrate that the quote itself is worthy of direct copying and meets the WP:fair use guidelines. You might be fine with commenting it out or heavily summarizing it for now, as you suggest. Let me know if you need help summarizing it. I took an NLP course in grad school (though I'm certainly not an expert!).
As for "state of the art", I do think it sounds MOS:WEASEL-y as it's currently used. And I agree with your assessment. (Believe me... I roll my eyes every time I see something like that in the literature. It gets worse in the corporate world.) I just realized that we have a wiki article for State of the art. Maybe use that link? However, as the article itself notes, a layperson would view it as puffery rather than a technical term, so I'd still recommend avoiding it. As you suggest, you could either explain the term in a footnote, or you could rephrase it as "benchmark" or "best performing model at the time".
Hook length is fine, as long as it's below 200 characters. What does it mean for text to be "on a human level"? Would it be better, and more accurate, to say "natural language"?

I know you're hoping to do more with the article, potentially taking it to GAN. But in my opinion, you're not that far from DYK. The remainder of your changes can be done at GAN, unless you really want to incorporate it into this review process. Edge3 (talk) 05:56, 16 February 2021 (UTC)

I've read this and I'll get more into it tomorrow. jp×g 10:33, 16 February 2021 (UTC)

@Edge3: "Tomorrow" is a loose phrase, but I have done the modifications. Thoughts? jp×g 14:37, 24 February 2021 (UTC)

Looks good to me! ALT2 approved. Thanks! Edge3 (talk) 03:02, 25 February 2021 (UTC)