Wikipedia talk:Modelling Wikipedia's growth/Archive 4

Archive 1 Archive 2 Archive 3 Archive 4

I'm extremely sceptical about the claim of logistic growth. Just because the growth is now sub-exponential does not mean it's logistic. If the project was in an exponential regime, and is now in a more or less linear regime, that does not mean its growth is going to fall off to zero. Perhaps the exponential regime was as Wikipedia was being discovered, and now we're in a linear regime, where everyone knows about it and pretty much anyone who wants to edit already knows about it and is editing? In any case, the logistic growth claims are an overzealous projection from limited data. -Oreo Priest talk 03:07, 7 July 2009 (UTC)

'extended growth model'

Exactly what is it supposed to predict? So far as I can tell, it:

a) doesn't predict the shape of the initial growth
b) doesn't predict the shape or height of the peak
c) has some half-assed prediction that the growth will decay exponentially

Exactly why are we supposed to take this seriously? Why is it even in the introduction? For that matter, why is it in the article at all?

Meanwhile the logistic curve fits the whole data series to very good accuracy, based on just 3 parameters, and has done so for about a year now.

Given this, I'm removing references to the extended growth 'model' from the introduction. It does not seem notable enough to be there; and to be honest, it doesn't really deserve to be in the article; it makes completely unjustified predictions based on very little data, and doesn't explain most of the data at all.- (User) Wolfkeeper (Talk) 00:04, 2 November 2009 (UTC)
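(For concreteness, a three-parameter logistic fit of the kind described above can be sketched as below; the article counts are rough placeholders, and this is not the script used to produce the published graphs.)

 import numpy as np
 from scipy.optimize import curve_fit

 def logistic(t, K, r, t0):
     # carrying capacity K, growth rate r, inflection date t0
     return K / (1.0 + np.exp(-r * (t - t0)))

 # Hypothetical article counts by year, for illustration only
 t = np.array([2002., 2003., 2004., 2005., 2006., 2007., 2008., 2009.])
 n = np.array([0.10, 0.19, 0.45, 0.90, 1.55, 2.15, 2.68, 3.14]) * 1e6

 (K, r, t0), _ = curve_fit(logistic, t, n, p0=[3.5e6, 1.0, 2006.0])
 print("fitted asymptote: %.2f million articles" % (K / 1e6))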

I've put it back. There is no reason that the growth of a user-generated encyclopedia needs to be modelled from start to finish by a single mathematical function; different functions will most likely be appropriate in different regimes. The growth curve in the early sections reflects more and more people discovering WP; now everyone knows about it and we've entered a qualitatively different regime.
I for one am extremely sceptical of the logistic growth model. There's little to suggest, short of your hunch that there's not much more to write about, that Wikipedia's growth will stop anywhere within the next million articles. Growth is only slightly sub-linear, and seems well-poised to keep on going. Essentially, any modelling should be phenomenological, not a question of using the fewest parameters possible, especially if that gives strange predictions.
Most relevant here, however, is that some people support the logistic growth model, while others don't. Rather than claiming that there's only one prediction, both should be presented in the introduction as alternatives. Oreo Priest talk 06:30, 2 November 2009 (UTC)
Oh, and you were right to remove what you did from the criticisms of the logistic model section. Oreo Priest talk 06:31, 2 November 2009 (UTC)
 
Growth is down to 1000 articles per day, and falling
Screw that. I don't care what you think is right or not; a mathematical model has to fit the data. This model doesn't. The current article growth rate hit 30,000 per month; what exactly was the prediction of the extended growth model? Over 40,000, wasn't it? If a model fails to match the data, it's wrong. I don't care how nice you think it is; the only thing we want from our models is prediction. If it doesn't predict... and this one doesn't... it's binned.- Wolfkeeper 15:10, 17 November 2009 (UTC)
First of all, you can't throw out the entire model just because one set of parameters gave an instantaneous prediction that was off by 20%. Take a look at the graph you just posted! In the past year alone, all 3 curves for the logistic growth model have been off by more than 10,000 articles per month. If you extend that to the past two years, the top and bottom curves have both been off by a whopping 20,000 articles per month! And you now say we should throw out the extended growth model completely just because a crude graph with one set of parameters was off by 10,000 at one point in time? That doesn't make sense.
If you could convince the creator of that graph to try fitting Wikipedia's growth curve to something like this [1], giving a derivative something like this [2], then it would be a lot easier to compare apples to apples. I suspect you will find that with appropriate parameters, it fits the data just fine. Oreo Priest talk 03:36, 18 November 2009 (UTC)
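(For concreteness, the kind of fit being asked for might look something like the sketch below; the functional form and the numbers are illustrative placeholders, not the curves behind the links above.)

 import numpy as np
 from scipy.optimize import curve_fit

 def decaying_rate(t, R0, d):
     # growth rate that shrinks by a fraction d each year, relative to 2006
     return R0 * (1.0 - d) ** (t - 2006.0)

 # Illustrative monthly growth figures (articles/month), not the real series
 years = np.array([2007.0, 2007.5, 2008.0, 2008.5, 2009.0, 2009.5])
 rate = np.array([55000., 52000., 48000., 46000., 43000., 41000.])

 (R0, d), _ = curve_fit(decaying_rate, years, rate, p0=[60000.0, 0.1])
 print("fitted yearly shrinkage: %.0f%%" % (100 * d))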
 
The averaged data shows fewer fluctuations, and puts us close to a growth rate of 40,000 articles per month.
Also, if you look at the much less noisy 6 month averaged data, we are indeed quite close to 40,000 articles per month. Oreo Priest talk 03:41, 18 November 2009 (UTC)
Why 10%? Why not 20%? Why not 5%? The logistic curve model arrived at 3.5 million by plotting 3 million and 4 million curves, finding they were too low and too high respectively, then binary-chopping down to 3.5 million and discovering that it was close enough. Where did you get 10% from? Does 10% perform better than 5%?
It seems to me this is not science at all; your model was just plucked out of the air. Where's the working?- Wolfkeeper 18:47, 19 November 2009 (UTC)
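(The "bracket and refine" procedure described above can be sketched as below, assuming a logistic form and placeholder data: for each candidate asymptote K the remaining parameters are fitted and the residuals compared.)

 import numpy as np
 from scipy.optimize import curve_fit

 def logistic(t, r, t0, K):
     return K / (1.0 + np.exp(-r * (t - t0)))

 # Illustrative article counts; the real fit would use the monthly data
 t = np.array([2002., 2003., 2004., 2005., 2006., 2007., 2008., 2009.])
 n = np.array([0.10, 0.19, 0.45, 0.90, 1.55, 2.15, 2.68, 3.14]) * 1e6

 for K in (3.0e6, 3.25e6, 3.5e6, 3.75e6, 4.0e6):          # candidate asymptotes
     fixed = lambda t, r, t0, K=K: logistic(t, r, t0, K)  # freeze K for this trial
     p, _ = curve_fit(fixed, t, n, p0=[1.0, 2006.0])
     sse = np.sum((fixed(t, *p) - n) ** 2)
     print("K = %.2fM  sum of squared residuals = %.3g" % (K / 1e6, sse))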
It's a phenomenological model designed to fit the data. It notes that the growth rate is shrinking, and extrapolates from that. Does that clear things up? Oreo Priest talk 20:39, 19 November 2009 (UTC)
No, show me how 10% wasn't pulled out of something you sit on. 10% is a weirdly round number, where did it come from?- Wolfkeeper 21:04, 19 November 2009 (UTC)

(Undent) 10% is an arbitrary example. Phenomenological means we pick whatever best fits the data. Just like the logistic curves plotted were those that best fit the data. Oreo Priest talk 05:32, 20 November 2009 (UTC)

Currently there are 3 models

As creator of the logistic model graphs I can see good reasons for each of them.

  • Logistic: explained in this article
  • Extended growth: The decline in measured growth is less than predicted in the logistic model.
  • The PARC monotonic growth model has a sound theoretical background.

What in my opinion is lacking in the last 2 models is a realistic prediction and/or a comparison with the measured growth. For instance, the PARC model currently stands at 6M articles (as it did at the moment of publication), while we actually have only about 3M articles. HenkvD (talk) 11:00, 21 November 2009 (UTC)

Exactly, so the claim in the lead that Wikipedia may reach 10 million is just pulled out of thin air. Even assuming the model is at all accurate, the growth rate could turn out to be 5%, in which case Wikipedia would be half as big, or 20% (unlikely from the data), in which case it would be twice as big. In the absence of any attempt to compare the predictions with what actually happened historically, this is not currently a valid model or prediction; sorry.- Wolfkeeper 16:31, 21 November 2009 (UTC)
Hello HenkvD, good to have you participate. Could you please use the data and graphing software to try fitting Wikipedia's growth curve to something like this [3], giving a derivative something like this [4]? It would be great if you could make 3 slightly spaced curves just as with the logistic model, and you could update it with the others. Thanks! Oreo Priest talk 16:02, 22 November 2009 (UTC)

Two phase exponential model

What do you think of the new two phase exponential model recently added to the page? I think it looks fantastic, and the graphs fit the data exceptionally well, without making any sort of unreasonable predictions. (The rationale behind the prose added with it is a bit more dubious). I'm sure you can agree that this model is one of the best yet, if not the best.

Just for the record, the model (sinusoidal variations excluded) is in fact the same as the "extended growth" model, just much, much better done, and with a much clearer name. This is the kind of thing I've been defending all along, not those sketchy bar graphs, which give good reason to be skeptical. Oreo Priest talk 02:59, 1 December 2009 (UTC)

I think it's dubious. Just fitting a curve to data isn't enough; you need to make usable predictions. And the curve has several parameters that are largely arbitrary; fitting a curve with arbitrary parameters isn't difficult- getting it to predict usefully is.- Wolfkeeper 14:12, 2 April 2010 (UTC)
And all of the graphs that have been produced stop with historical data; you're supposed to predict forwards for at least a year or two. There's no point otherwise.- Wolfkeeper 14:12, 2 April 2010 (UTC)
Once again, because this is phenomenological, fitting a curve with arbitrary parameters is the only thing that makes sense. The logistic growth model also uses arbitrary parameters. As for future predictions, it does make them, they're just not plotted yet. I'll ask the creator of those graphs to project it into the future a bit. Oreo Priest talk 16:36, 3 April 2010 (UTC)
There's a problem with more parameters, because the more parameters you have the more chance that you're "over-fitting" the curve; with enough parameters you can fit any curve exactly... but it will have poor predictive power: it will look great but just won't track properly as time goes on. Simpler curves with fewer parameters are nearly always much better.- Wolfkeeper 16:44, 3 April 2010 (UTC)
Yeah, but there's pretty much only two parameters: amplitude and the factor in the exponential (these are different for each regime). The rest are just to make it smooth at the transition. Oreo Priest talk 21:42, 4 April 2010 (UTC)
'just to make it smooth at the transition'. That's just fooling yourself. Count the parameters properly.- Wolfkeeper 15:29, 5 June 2010 (UTC)
Fooling myself? I'm not sure you understand the model... Oreo Priest talk 18:18, 5 June 2010 (UTC)
The logistic model has 3 free parameters, as does the Gompertz model. How many parameters?- Wolfkeeper 18:42, 5 June 2010 (UTC)

[unindent] All three models are phenomenological, in that they fit a mathematical function to the data. The theories come afterwards, as attempts to justify the function.
The logistic curve (LC) model assumes that Wikipedia's evolution has had the same dynamics since 2001, and implies that the gradual slowdown is a consequence of gradual 'saturation' of some 'potential article space'. The Gompertz curve (GC) model too assumes a single process, but with gradually changing parameters. In contrast, the two-phase (2P) model assumes that the dynamics changed suddenly and drastically sometime in 2005-2006, from a regime of nearly exponential growth to one of apparent exponential decline. As Oreo says, this is similar to the extended-growth (XG) model, only more emphatic about the break in 2005-2006.
The first question is which model provides a better fit to the data. Indeed one can fit any set of data given enough parameters, and the more parameters the better the fit. A model is "remarkably well-fitting" only if the fit is better than what one would expect if the data were random. If we ignore the transition period, the 2P model has basically five parameters (the date of the break, and the amplitude and rate of the two exponentials). The LC model has three (the date of the inflection, the derivative at the inflection, and the limiting value), as does the GC model (a, b, c). So, indeed, it is not surprising that 2P provides a better fit. Yet, I may be partial, but I believe that the 2P fit is still more "remarkable" than the other two, because it fits the data in each of the two periods very closely -- more closely than one would expect from a five-parameter model; and because it provides a sharp kink where the data seems to have a sharp kink too.
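For concreteness, the three fitted functions and their free parameters can be written out as below (my own notation, and not necessarily the exact parameterisation used for the published graphs):

 import numpy as np

 def logistic(t, K, r, t0):                  # 3 free parameters
     return K / (1.0 + np.exp(-r * (t - t0)))

 def gompertz(t, a, b, c):                   # 3 free parameters
     return a * np.exp(-b * np.exp(-c * t))

 def two_phase_rate(t, A1, k1, A2, k2, tb):  # 5 free parameters (growth rate, not total)
     return np.where(t < tb,
                     A1 * np.exp(k1 * (t - tb)),   # near-exponential growth before the break
                     A2 * np.exp(-k2 * (t - tb)))  # exponential decline after it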
On the other hand, if we look only at the period after 2006, the data is basically a straight descending line with a slight upwards curvature, modulated by an erratic seasonal variation. In that time frame, all four models have three fitting parameters, and they all can be tuned to fit that data quite accurately, within the range of measurement error. That is not surprising either, since the seasonal and random variation essentially forces us to look at yearly averages, so there are only three good data points in that period (2007, 2008, 2009). Indeed, if we look only at those recent years, the four models are just complicated ways of writing a parabolic curve (PC), A t^2 + B t + C. Moreover, all five models, including PC, make basically the same prediction for the next year or two. They differ substantially in their long-term predictions (consider the PC prediction, for instance!); but the data by itself cannot tell us which model makes the right prediction.
To make longer-term predictions we should "open the black box" and get data on the internal processes that determine the growth rate: namely, the evolution of the number, profile, and behavior of editors over all these years, and the reasons why they evolve as they do. Unfortunately that data is harder to get. Eric Zachte has statistics on the number of editors, but those numbers include a lot of "chaff" --- such as robot-wielding admins, editors who do lots of cosmetic editing, single-article editors, vandals, etc. Basically, the data that would be needed to decide which model is correct is not available.
The only thing that seems clear from Eric's statistics is that the number of active editors (under most reasonable definitions) has generally followed the same trend as the growth rate N'(t): steady and clearly exponential growth from 2001 through 2005, and slow but steady decline since 2007. Now, 50% of the articles are still stubs, and perhaps 98% of the articles have serious problems that have not been fixed for many years. Therefore, the implicit assumption/deduction of the LC model --- that N' is falling because we have run out of suitable topics --- is hard to sustain. Instead, it would seem that the growth rate is falling because the number of productive editors is falling; and this "thinning of the ranks" is not due to lack of work to do.
I have my own theories about the cause of this phenomenon. I have tried to explain them in my user page, but that material is still in a terribly messed-up state, a big pile of drivel --- and I may never get around to cleaning it up.
Indeed, you may count me as another casualty of those same circumstances. Last February, after some exasperating discussions with deletionists and arrogant admins, I took a few months off Wikipedia editing to see whether the climate would improve and the former joy of editing would return. Well, I logged in again yesterday and immediately ran into the same problems, or worse. (Now the deletionists have started deleting even unlinked draft articles that I kept as subpages of my user page, only because, in their view, "they have been sitting there for too long"!)
That is perhaps the biggest fault of the LC and GC models: they imply that all is well and there is nothing to worry about. Wikipedia was wonderful in 2002, the models "prove" that nothing has changed since then, ergo Wikipedia is wonderful today. We might call them Panglossian models. The 2P model, in contrast, implies that some terrible decision or event in 2006 threw a monkey wrench into the works, so that Wikipedia is now a plane up in the sky with all its engines dead --- It can fly for quite a while still, but it will hit the ground well before reaching its intended destination.
Wikipedia can have editors who enjoy writing detailed articles on topics that are dear to them, spending a hundred hours polishing a single article, fixing factual errors, providing references, or keeping watch on the factual accuracy of thousands of articles. Or it can have editors who get their kicks by adding disparaging tags to articles, managing Wikiprojects and portals, writing templates and rulebooks, giving orders, setting deadlines, deleting articles, awarding medals to each other, and generally playing boss over other editors. But it cannot have both. One visible change since 2006 is that the bossy editors --- who have been aptly called "wikibullies" in the media --- have taken the upper ground, thanks to robots and the apparent favor of the Foundation; and since then they have been keeping new good editors from joining the team, and slowly driving the veteran ones away. Sic transit Wikipedia...
All the best (if that is still possible), --Jorge Stolfi (talk) 04:57, 26 June 2010 (UTC)

The Wikipedia has inevitably transitioned from a period of explosive growth to consolidation. I mean, it's far more than ten times bigger than the Encyclopedia Britannica, and that was regarded as a pretty comprehensive work. Just think how hard it is now to find a new article to start, although you may be able to think of one or two. Most of the important articles now have fairly sensible content, so it's going to be a question of bringing the quality up. That fairly inevitably involves having standards and trying to apply them to articles, tagging, etc. This will cause friction, and make no mistake, by and large people will not be as interested in it.- Wolfkeeper 13:17, 26 June 2010 (UTC)
Well, it is not true that "most of the important articles now have fairly sensible content". Most articles are stubs, and most of the remainder, including many important ones, are little more than rough drafts. The "consolidation" is not "bringing the quality up": except for cosmetic robot edits, most stubs and drafts have been sitting there unedited for years. The amount of productive improvement editing is decreasing, perhaps even faster than article creation, and is minuscule compared to the amount of work that still needs to be done to bring the existing articles to near Britannica quality. Fewer editors means fewer people available to fix vandalism and factual errors, remove POV and advert material, and even to correct bad English.
Standards, tagging, and robotized format edits contribute absolutely nothing to Wikipedia's value, and are in fact killing it. As noted in an article in The Economist, the overzealous managers who have taken over Wikipedia have only made it thoroughly unmanageable for content editors.
Indeed, Wikipedia could be immensely and easily improved, in a single night, by just deleting the entire Wikipedia:* namespace and all article-side tags, navboxes, category tags, and Wikiproject tags. You can be sure that no reader will miss them, and that the overall productivity of the corps of editors will only increase --- even (or especially) if that simple remedy causes all the standards-writers and article-taggers to leave en masse.
You do not have to take my word for it. The data is there, and anyone who looks closely at it can see that wikipedia is not "consolidating" but falling apart. But, as another standards-breaking and much-bullied editor once observed, there is no worse blind man than the one who refuses to see.
All the best, --Jorge Stolfi (talk) 14:43, 26 June 2010 (UTC)
"The data is there, and anyone who looks closely at it can see that wikipedia is not 'consolidating' but falling apart." And this data is where, then?- Wolfkeeper 15:10, 26 June 2010 (UTC)
Eric Zachte has statistics on new articles, number of edits, new users, users by number of edits, etc. The links are in the main page, and someone posted a recent summary below. --Jorge Stolfi (talk) 16:19, 26 June 2010 (UTC)
PS. Just to cite a few examples of important articles that are still half-baked drafts: Thebes, Egypt, Subroutine, Volvox. I recently created Lacrymaria olor and Dileptus anser, which are school-grade biological celebrities with substantial fan clubs in Googleland. (I am a computer scientist, as you well know. Where are the biology experts who should be creating and expanding those articles?) Most articles on genera are still full of red links. (Indeed, one of the things that "consolidators" have been busily doing in recent years is to unlink red links, and delete red links from disamb pages. That activity makes Wikipedia look better, but of course does not contribute anything to its function and closes one of the main sources of new articles.) --Jorge Stolfi (talk) 16:19, 26 June 2010 (UTC)
PPS. I strongly recommend the WikiSym 2009 paper cited above. The authors analyzed data up to the end of 2008 and found the same abrupt change of regime around January 2007 in the number of editors, across all categories (from those who make 1 edit per month (epm) to those who make over 1000 epm). In all categories the number of editors has been steadily decreasing since 2007. The total work (number of edits) done per month has also been decreasing for all categories of editors except the 1000+ category, whose total work per month kept growing from 2007 to the end of 2008, but at a much reduced rate. By the end of 2008 that class accounted for 25% of the total edits, up from 20% in 2005. The same ~2007 shift to a downward trend is seen in all kinds of edit activities --- except robot edits, which steadily grew from near zero in 2005 to 800,000 epm by the end of 2008. The paper does not give a breakdown of robot edits per user class, but it is hard to imagine them coming from other than the 1000+ class. And, by the way, the paper also gives evidence of increased unfriendliness to newbies since 2007. All the best, --Jorge Stolfi (talk) 23:24, 26 June 2010 (UTC)

Deleted v nondeleted articles

These charts are great but they only show the net change. I would be very interested to see how article deletions and article creations have changed over time. If we had charts of deletions and creations as well, it would give us some indication as to whether growth is slowing because we are getting fewer articles or because we are deleting more of what comes in. ϢereSpielChequers 23:16, 11 December 2009 (UTC)

You can see some graphs (of 2007) on User:Dragons flight/Log analysis and especially a comment on Article creation vs Deletions. HenkvD (talk) 12:17, 12 December 2009 (UTC)
Thanks, I was kind of hoping for something more recent - particularly covering the last few months. ϢereSpielChequers 15:53, 2 April 2010 (UTC)

News versus legacy topics

I had a look at the new article feed. It was somewhat interesting. I was trying to work out what proportion of current articles are 'news'. Straight away I hit a problem trying to define what 'news' means in this context; an article could be written months or years after an event that made it notable, so I tentatively defined it as anything that happened since 2001, the year that the wiki was established; anything since then is presumably 'news' as far as the wiki is concerned.

I did a quick straw poll through the hits looking for new stuff. It looked like (extraordinarily roughly) 40% of the articles likely to survive prods are currently news in that sense. We're looking at about 1000 articles a day overall growth right now, so 400 might be about the underlying growth rate, once the wiki has more or less caught up with 2001.

Then again my sample size was tiny, so I'd need to do this again with a much bigger one.- Wolfkeeper 15:32, 25 March 2010 (UTC)

I think that would vary sharply from day to day - sometimes you get editors creating articles on large numbers of say Roman Senators, Mosses or 1960s footballers. But if you ignore the prepatrolled articles you probably aren't far out. We've covered a very high proportion of the "obvious" topics in the English speaking world, so new articles tend to either be from outside that or new subjects. ϢereSpielChequers 16:07, 2 April 2010 (UTC)
I think this is starting to cause an upwards kink in our percentage growth curve; as the number of remaining pre-2001 topics falls, we're left with just the new articles, and they have different statistics- they're more like constant-rate growth, rather than percentage growth. So the percentage growth will start to level out over the next few years to about half what we have at the moment, and then go down much more slowly from there; it will be much flatter than right now. That's probably why we're slightly above the 3.5M logistic curve at the moment- we're moving onto a different straight line on the log graph.- Wolfkeeper 15:26, 5 June 2010 (UTC)
I just did another quick sample. I found 13 articles on topics that would be unlikely to have been written prior to 2001, in a sample of 27 from the oldest parts of the new page log. If I remember my statistics well enough, that means about 48% +- 9% of the articles are on new topics.- Wolfkeeper 16:07, 5 June 2010 (UTC)
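(The quoted uncertainty is roughly the usual binomial standard error; a quick check, assuming a simple random sample:)

 from math import sqrt
 p, n = 13.0 / 27.0, 27
 se = sqrt(p * (1.0 - p) / n)
 print("p = %.1f%%, standard error = %.1f%%" % (100 * p, 100 * se))
 # prints roughly: p = 48.1%, standard error = 9.6%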
This is an interesting point. With some work we could get a more precise value for the number of 'news' articles that are created per day or per month. But it will be very difficult to get similar values for the past in order to predict the future. I can't model it either, as there are too many variables in the equation. HenkvD (talk) 18:35, 7 June 2010 (UTC)
Actually, I don't see any particular problems with doing this. Worst case, we could randomly sample the database of articles using the random key, find out what percentage are about topics newer than 2001, and draw graphs over time of when they were written. It's a bit laborious, but not stupidly so. There may be other ways as well; I couldn't off-hand find a way to get lists of articles created between arbitrary times, say sometime back in 2006, but it's almost certainly possible to get an SQL query to give us that.- Wolfkeeper 18:55, 7 June 2010 (UTC)
 
Growth at German wikipedia
I made a similar investigation on the German Wikipedia: of 40 new articles, 14 were 'news' articles. This gives a percentage of 35 +- 9%, so a bit less than the English Wikipedia. I think the German Wikipedia is particularly interesting as its growth rate has been flat for a few years already, with around 400 new articles a day, as shown on the graph. This fits neither logistic growth nor Gompertz growth. It has a flat growth rate, but contrary to expectation the percentage of 'news' is small. HenkvD (talk) 18:06, 8 June 2010 (UTC)
I would guess that the German wiki is probably volunteer-limited. They're possibly suffering from the existence of the English Wikipedia; presumably a lot of German-speaking people can use and contribute to the English Wikipedia instead; but I don't really know.- Wolfkeeper 18:27, 8 June 2010 (UTC)
FWIW I don't think there's sufficient evidence from such a limited trial to say that there's less news on the German Wikipedia; the numbers are only a bit more than one SD apart, so from eyeballing them, they won't reach statistical significance; but there may be differences nonetheless.- Wolfkeeper 18:27, 8 June 2010 (UTC)
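(For anyone who wants the formal version of that eyeball check: a standard two-proportion z-test on the two samples quoted above gives roughly z = 1.1, p = 0.28, i.e. not significant. A sketch:)

 from math import sqrt, erf

 x1, n1 = 13, 27    # English sample: 'news' topics / total
 x2, n2 = 14, 40    # German sample
 p1, p2 = x1 / n1, x2 / n2
 pooled = (x1 + x2) / (n1 + n2)
 se = sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
 z = (p1 - p2) / se
 p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
 print("z = %.2f, two-sided p = %.2f" % (z, p_two_sided))  # about z = 1.1, p = 0.28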
By the way: Last year at WikiSym 2009 this paper was published. See especially Figure 7, and the text "... we expect that the capacity depends also on external factors such as the amount of public knowledge (available and relevant) that editors can easily forage and report on (e.g., content that are searchable on the web) ...". HenkvD (talk) 18:51, 9 June 2010 (UTC)
Actually yes, that's where I got the idea from; I suspect the Lotka-Volterra population growth model is going to be a better model for the growth than our current one. I also agree that public knowledge is probably limiting the Wikipedia right now.- Wolfkeeper 15:11, 13 June 2010 (UTC)
I had a quick go with it, but not terribly successfully so far; I'm not sure whether I've got the right equation. It may be that the growth rate for new articles is different to that of established ones or something; the logistic curve seemed to fit better, if anything.- Wolfkeeper 15:24, 26 June 2010 (UTC)
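(One crude way to experiment with that idea, purely as a toy and not the equation from the PARC paper or the one tried above: logistic-style growth towards a carrying capacity that itself grows as new, reportable knowledge appears. All parameter values below are invented.)

 import numpy as np
 from scipy.integrate import odeint

 def model(y, t, r, K0, c):
     N = y[0]
     K = K0 + c * t                    # capacity grows as new topics appear
     return [r * N * (1.0 - N / K)]

 t = np.linspace(0.0, 15.0, 300)       # years since 2001
 sol = odeint(model, [20e3], t, args=(1.5, 2.5e6, 150e3))
 # sol[:, 0] is the modelled article count; once N catches up with the moving
 # capacity, its slope tends towards the capacity growth c (here 150e3/year).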
We are still getting articles on "small villages" in India with a population of 3,000. I think in some subject areas we have barely begun, whilst others that are of great interest to the technosenti have long been <s>completed</s> initiated. ϢereSpielChequers 21:19, 26 June 2010 (UTC)
Yes, we're still getting small villages, and at least 1/7 of the articles that we'll end up with are missing entirely. Unfinished Wikipedia is unfinished. Early on, a clever vandalism was still an improvement ("Did you know that there's no article on gullibility in the Wikipedia?" Many lols were had). Now, not so much. As we push up the quality level, fewer and fewer people are going to be capable of or willing to add more quality and follow the processes. But the Wikipedia has to improve quality. Sometimes people add stuff, and it would take more effort than it took to write to square it away. What do you do? Revert, or go forwards? If you go forwards, it costs your time; if you revert, the user's feelings are hurt. But if you leave it, quality goes down. What do you do? In most cases people revert. Quality is painful.- Wolfkeeper 02:56, 27 June 2010 (UTC)
[articles] that are of great interest to the technosenti have long been completed: Most definitely not. There is no article that could not be substantially improved by an expert on the topic. Almost every article I check suffers from obviously incomplete or unbalanced coverage, bad organization, poor writing, repetitions, etc. The drop in article creation is just the visible tip of the iceberg: the real problem is that quality (=contents) editing has all but stopped --- and that is most definitely not because there is no more work to do. Quality was improving much more prior to 2006, when most edits were about contents. Today the "technosenti" have been pushed away by the "bureaucretins", and all I see in my watchlist are robot-assisted format edits that, I repeat, contribute absolutely nothing (not "little", not "very little", but really nothing) to Wikipedia's quality. Many of our most prolific editors are not helping Wikipedia at all, they are only massively wasting other editors' time. In particular, reverting or deleting a contents contribution just because "it would take more effort than it took to write to square it away" could be a good definition of "wikibullying" --- the sort of arrogant destructive behavior by the self-proclaimed Guardians of Wikipedia that is preventing experts from contributing. Editors who did that used to be considered vandals, and treated as such. Now they call themselves heroes and award barnstars to each other. Sigh... All the best, --Jorge Stolfi (talk) 13:28, 27 June 2010 (UTC)
Apologies, we were talking about numbers of articles and I should have said started not completed. ϢereSpielChequers 17:49, 27 June 2010 (UTC)
It is undeniable that new topics will occur that can be the subjects of new articles, but I am not sure that we can yet know how they will affect the total number of articles. A good estimate for the rate of creation may be 400 a day, but what about the longer-term processes of merging, deletion, unmerging, and undeletion? Also the rates might not be constant, as it is possible for notability criteria to change. For example, at present a football player is deemed to be notable if they play just one match for a major team, but I have seen recent discussion that this is wrong. If a stricter test is applied in future, the rate of such new articles will drop.
There has been a recent rise in the graph of article growth per month, but to me it looks no different to earlier variations. I will not be convinced it is levelling out until I can see a flatter trend that lasts for 12 months. JonH (talk) 21:59, 29 June 2010 (UTC)

Seem to be under 1000/day at the moment

Here's the live averages since various dates:

696 /day since 15:00, 31 March 2010 (UTC)

693 /day since 15:17, 26 April 2010 (UTC)

690 /day since 19:05, 31 May 2010 (UTC)

690 /day since 05:14, 8 June 2010 (UTC)

690 /day since 17:51, 10 June 2010 (UTC)

690 /day since Fri, 11 Jun 2010 18:09:55 (UTC)

We're currently in a growth slump. Presumably everyone's gone on holiday ;-) It may pick up at the end of the month.- Wolfkeeper 03:13, 13 June 2010 (UTC)

I understand that the German Wikipedia requires new articles to start with a source, and for the last few months we have required new BLPs to have a source; I expect that this damps down the growth in articles - so it may not pick up. Also, are these figures net growth or new articles that have survived for x minutes? If it's net growth then what we may be seeing is an increase in deletions rather than a reduction in new articles. We've had a lot of work going on recently to tackle the backlog of unreferenced BLPs, and that has included a certain amount of prodding of old unnotable articles. ϢereSpielChequers 21:33, 26 June 2010 (UTC)
I believe that for several months the underlying rate has been 900 a day, and that if on one day the increase was much greater, it was because a user (such as Ganeshbot or Starzynka) was using a list of topics to create a lot of stubs. By the way, it was a good move by Wolfkeeper to fix the year-to-date calculation at WP:size. JonH (talk) 22:01, 29 June 2010 (UTC)

Gompertz model

First, let me say that I am a fan of User:HenkvD, both for his dedication to updating the size graphs every month, and for the predictive power of his models (e.g. the 2 June 2006 image in the file history of File:Enwikipediagrowth6.PNG). Also, I noticed that his growth rate graph was selected by Dan Frommer as the "Chart of the Day" on 6 Aug 2009.

The new Gompertz model looks very good. My only criticism concerns the smoothing of the data at Oct 2002. I understand why this is done, to spread out the Rambot contributions, but similar bulk changes continue to happen. For example, about 20,000 stubs on tiny Polish settlements were created in 2009, followed by about 3,000 stubs on German politicians which were created and then deleted ([5]). These are the main cause of the zigzag appearance of the growth rate charts, and it would be very hard to remove all these effects. I think it is best to see them all as part of Wikipedia development, and to just model the general trend. Selectively excluding some changes could be seen as manipulation of the statistics.

As the Gompertz model seems to be the best we have, I intend to change WP:size to use the new graphs (without any emphasis on the predictions) unless there is some objection. JonH (talk) 22:01, 29 June 2010 (UTC)

I have no objection to changing WP:size.
I introduced smoothing of the Oct 2002 data to be able to show the trend better. By now the trend is clearer, so I think I will remove this smoothing from now on. HenkvD (talk) 10:31, 30 June 2010 (UTC)
Let me join in the applause for HenkvD, both for his work and for providing the data which I used in my analysis.
However... I must say that I am not impressed with the Gompertz model. Looking at this plot, all three curves provide a poor fit to the period 2001-2005; and this poor fit seems necessary in order to get the 2006 rate in the right ballpark. If we look at the period 2006-2010, Gompertz does look much better than the other two --- but they could fit just as well if their parameters were adjusted for that purpose.
As I have pointed out above, given the large seasonal fluctuations after 2006, any smooth model has only three data points to fit --- the yearly averages for 2007, 2008, and 2009. Therefore any three-parameter model --- even a simple parabola --- can be adjusted to fit that data as well as Gompertz does; and its near-term forecasts will be just as accurate as Gompertz's.
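(To see how little constraint three averaged points provide, here is a small sketch: a parabola through three yearly values, which are placeholders of mine rather than the real figures, is determined exactly, with zero residual.)

 import numpy as np
 t = np.array([0.0, 1.0, 2.0])                      # years since 2007: 2007, 2008, 2009
 avg_rate = np.array([54000.0, 45000.0, 39000.0])   # articles/month, placeholders
 A, B, C = np.polyfit(t, avg_rate, 2)               # three points, three parameters
 print(np.allclose(np.polyval([A, B, C], t), avg_rate))  # True: the fit is exact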
However, my biggest complaint is that single-formula models like Gompertz and logistic implicitly assume that Wikipedia was a single process throughout the 2001-2010 period. But it is very clear now from the raw data plots that there was a drastic change around 2006. "Wikipedia before 2006" and "Wikipedia after 2006" are clearly two different projects; even though the article base and editor community were largely the same before and after the transition, the editor in/out rate changed drastically at that time.
Thus, trying to find a single simple formula that fits the article data over the whole 2001-2010 period makes as much sense as trying to find such a formula for, say, New Orleans's resident population over that same period. In this example, it should be obvious that the population growth data before Katrina is completely irrelevant for modeling or forecasting the growth after that event. Why is it so hard to admit that Wikipedia too had a Katrina in 2006?
Note also that one can always get a good fit to any dataset with a three- or even zero-parameter formula, if the formula does not have to be justified by a realistic model of editor behavior. The two-phase exponential formula assumes a simple and realistic model of editor recruitment and retirement, which agrees with other observed data (such as editor counts). The logistic model has been justified in terms of a supposed saturation of "article topic space"; but that saturation does not seem to exist, and it does not explain the editor count data. As for the Gompertz model, I cannot think of a reasonable explanation for the double exponential. All the best, --Jorge Stolfi (talk) 04:26, 2 July 2010 (UTC)
So you're saying that you don't think that the Wikipedia has any emergent properties, and that the only reason the growth levelled off in 2006 is because something changed?- Wolfkeeper 14:33, 2 July 2010 (UTC)
Like what?- Wolfkeeper 14:33, 2 July 2010 (UTC)
Jorge Stolfi is absolutely correct. Oreo Priest talk 15:32, 4 July 2010 (UTC)
I must admit the poor fit to the period 2001-2005, but I did NOT fit it predominantly on 2006-2009; all data points were weighted equally. I dispute that any 3-parameter model can fit the data, as not only the values for 2007, 2008 and 2009 need to be correct, but also 2001 (zero articles) and all years in between. Furthermore, Jorge's double exponential model has 2 times 2 parameters, and a smoothing factor as well. There was a transition around 2006, but it is much smoother; it is not a single incident like "Katrina".
I still do assume that Wikipedia is a single process throughout the 2001-2010 period (and beyond), and that a decline of growth is inevitable. Whether or not it will decline to zero is open, but I don't see proof of that yet.
The logistic is not really based on saturation of "article topic space", but on what the editors together will be able to contribute (the combined expertise of the possible participants). Two people know more than one, but not twice as much. The one millionth contributor first needs to look very hard and very long to find where to contribute, and even then he/she can only add a few articles.
@Jorge personally: could you give us your prediction for the years 2010-2015? HenkvD (talk) 19:01, 6 July 2010 (UTC)