Wikipedia talk:Wikipedia Signpost/2008-08-11/Growth study

Discussion moved from tip line

In a study published in the August issue of Communications of the ACM entitled "The Collaborative Organization of Knowledge," computer scientists Diomidis Spinellis and Panagiotis Louridas analyze the relationship between references to non-existent articles (redlinks) and the creation of new articles. The study, based on the February 2006 dump of English Wikipedia because more recent dumps were unavailable (for shame!), finds that the ratio of complete (i.e., non-stub) articles to incomplete (non-existent or stub) articles remained nearly constant between 2003 and 2006 (about 1.8 incomplete articles per complete article). A trend in either direction, according to the authors, would indicate an unsustainable growth pattern. If the average number of redlinks per article is increasing, it means that Wikipedia is becoming diffuse and will become less useful as more and more of the terms in the average article are not covered. If the average number is decreasing, it suggests that Wikipedia's growth will slow or stop as the number of links to uncreated articles approaches zero. The stable redlink ratio suggests that Wikipedia is a scale-free network, in principle capable of unlimited growth.

The study also notes that most new articles were created within the first month that they were referenced in another article. Furthermore, only 3% of new articles were created by the same user who created the first link to that article (whether as a redlink or a bluelink). This implies that the connection between redlinks and new articles is a collaborative one, and that adding redlinks actually spurs others to create new articles.--ragesoss (talk) 00:20, 1 August 2008 (UTC)

This is a fascinating study but the conclusion is completely wrong. In one direction (diffusion), it omits the possibility that editors will simply remove redlinks if they feel an article has too many of them. This could be primarily an aesthetic decision, rather than an aggregate indicator of potential articles. In the other direction, it omits the possibility that growth could slow from one equilibrium to another equilibrium (lower than current level of growth, still positive, flat with second derivative=0). --JayHenry (talk) 00:38, 1 August 2008 (UTC)
PS: My comments are based on a fast read of the working draft, which is freely available and which I assume is the same in substance as the published version? --JayHenry (talk) 00:43, 1 August 2008 (UTC)
Yes, at first glance it looks very close to the published version. In any case, since they find a stable redlink ratio, the conclusion of a scale-free network pattern of growth seems to stand. To me, though, the most interesting aspect of it was that most articles (right around 50%, it looks to me from the graph) are created within a month after the first time they are redlinked, and 97% of the time by a different user than the one who linked it. It seems like a strong argument against removing redlinks for aesthetic reasons (e.g., in FA and GA candidates). It also makes me think adding a redlinks section (e.g., a line from Wikipedia:Recent changes article requests) to the Main Page might be a good idea.--ragesoss (talk) 00:59, 1 August 2008 (UTC)
(I know Ral hates it when we do this on the suggestion line but I think it's interesting!) I don't think the conclusion does stand, because the correlation between redlinks and actual "missing articles/undefined concepts" may be weaker than they considered. The ratio of missing articles could be constantly falling, and the aesthetic preference for articles to contain 1.8 redlinks and stubs is only sustainable and constant because the true ratio of missing articles was higher than 1.8. If it fell below 1.8 that artifice would crumble. --JayHenry (talk) 01:15, 1 August 2008 (UTC)
You point out one of many ways that Wikipedia is a really complex place. They acknowledge as much in the conclusion, noting several factors unrelated to network structure that could affect the growth potential of Wikipedia. Still, the conclusion that Wikipedia will not be limited by network structure passes the smell test, in my view. Of course, being based only on February 2006 data means that it missed the growth-mode transition that happened around September 2006, when the exponential phase of growth ended.--ragesoss (talk) 01:52, 1 August 2008 (UTC)
You are talking about how the current thought is that its growth is logarithmic, correct? Wikipedia:Modelling_Wikipedia's_growth -Ravedave (talk) 04:11, 1 August 2008 (UTC)
Speaking of newspapers (this being the Signpost page) and logs... At Purdue University, a school with huge engineering and math programs, the student paper is called The Exponent. They briefly had a rival paper which was, yep, The Log. Sadly a single clever pun (actually several puns-in-one, as it was also a log of university life, as well as printed on paper, which is derived from logs) was not enough to sustain an otherwise poorly-produced publication. --JayHenry (talk) 04:42, 1 August 2008 (UTC)
On the topic of logarithms, the above comment made me think that the Logarhythms would be a good band name at a school like Purdue. As with most great ideas, somebody beat me to it. Okay, sorry Ral, enough with logs... if I keep it up my block log is where it will end :) --JayHenry (talk) 05:18, 1 August 2008 (UTC)
I don't mind interesting discussion on an issue, it's mainly when people argue on a controversial subject that I get annoyed. Ral315 (talk) 20:56, 2 August 2008 (UTC)
It's tough to say. Growth was roughly exponential until September 2006. Since that time, it's been roughly linear. The total size graph can be fit to a logistic curve pretty well, suggesting that the growth rate will gradually decline. However, it seems probable that changes in inclusion standards and deletion practices (and the effect those changes had on the size and structure of the community) were more of a factor than lack of potential topics (as the logistic model implies). Growth rate may be declining, though; since January, Wikipedia has added about 1560 new articles per day, down from 1625 in 2007 and 1822 in 2006. It's unclear whether this is a trend, or just fluctuation, as month-to-month growth rates vary considerably.--ragesoss (talk) 04:45, 1 August 2008 (UTC)
I reran the study on a more recent data set, which included the years 2006 and 2007. The new results I obtained from this 2001-2007 data set don't appear to differ from the ones based on the study's 2001-2005 data set. --Diomidis Spinellis (talk) 20:47, 8 August 2008 (UTC)
Thanks! Very interesting study, and the update is much appreciated. It looks like there is a monotonic decline in the incomplete/complete ratio since mid-2006, and it's now less than 1.4. That does seem to suggest a different conclusion. It might be useful to put total articles on the X-axis instead of time (which would make the recent downward trend much more apparent).--ragesoss (talk) 21:20, 8 August 2008 (UTC)
I think it's too early to draw a conclusion from this downward trend. We've seen a more dramatic downward change from 2001 to mid 2002, and then an upward swing to 2004. The recent trend is more significant, because larger numbers of articles are involved; it might therefore be more difficult to reverse. As long as the ratio is above 1.0, growth as we know it should continue. A fall below 1.0 (which if the current trend continues could happen around 2011-2012) would indeed be worrying. --Diomidis Spinellis (talk) 14:47, 9 August 2008 (UTC)

Yes, I think we are experiencing a slowdown in article growth. I can tell this from my experience, and I'm sure many other experienced users would concur. What is interesting is that, according to History of Wikipedia, article creation by anonymous users was disabled in 2005, but that apparently didn't affect growth at that point. Also, since the growth rate is the number of articles created minus the number deleted, I suspect the notability policy, rather than a dearth of uncovered topics, is quite possibly the culprit behind the declining growth. -- Taku (talk) 05:35, 1 August 2008 (UTC)

A while ago (probably more than a year), I looked at this and found there was one article deletion for every three articles created (i.e. a net creation of two). Dragons flight (talk) 19:28, 1 August 2008 (UTC)
I wonder what the number is now.--ragesoss (talk) 19:31, 1 August 2008 (UTC)
IMHO, notability & the presence of links are indirectly related. That is, the greater the possibility that an article is notable, the greater the likelihood that it will have one or more links to it. (And I hesitate a little in stating this, since some Wikipedians will misuse this observation to promote a mechanistic definition of notability.) -- llywrch (talk) 15:01, 4 August 2008 (UTC)
Somewhat related discussion at http://lists.wikimedia.org/pipermail/wikien-l/2008-August/095097.html 86.44.19.24 (talk) 04:02, 19 August 2008 (UTC)