Talk:Maximum likelihood estimation



Wiki Education Foundation-supported course assignment

  This article was the subject of a Wiki Education Foundation-supported course assignment, between 27 August 2021 and 19 December 2021. Further details are available on the course page. Student editor(s): Jiang1725. Peer reviewers: EyeOfTheUniverse, 5740 Grant L, Jimyzhu.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 03:39, 17 January 2022 (UTC)

removal

I removed this from the article, until it can be made more NPOV and more encyclopedic. Currently reads more like a list of observations than true encyclopedic content and needs more explanation. --Lexor|Talk 07:25, 5 Aug 2004 (UTC)

Maximum likelihood is one of the main methods used by frequentist (i.e. non-Bayesian) statisticians. Bayesian arguments against the ML and other point estimation methods are that
  • all the information contained in the data is in the likelihood function, so why use just the maximum? Bayesian methods use ALL of the likelihood function and this is why they are optimal.
  • ML methods have good asymptotic properties (consistency and attainment of the Cramér-Rao lower bound) but there is nothing to recommend them for analysis of small samples
  • the method doesn't work so well with distributions that have many modes or unusual shapes. Apart from the practical difficulties of getting stuck in local modes, there is the difficulty of interpreting the output, which consists of a point estimate plus standard error. Suppose you have a distribution for a quantity that can only take positive values, and your ML estimate for the mean comes out at 1.0 with a standard error of 3? Bayesian methods give you the entire posterior distribution as output, so you can make sense of it and then decide what summaries are appropriate.

Style

This whole article reads like an 80's high school textbook. As a matter of fact, a lot of Wikipedia's articles on difficult-to-understand subjects read like they're out of an 80's high school textbook, making them only useful to people who already know the subject back to front, making the entire Wikipedia project a failure. —Preceding unsigned comment added by 4.232.174.170 (talk • contribs)

Clarity issues in a statistics article (a subject that is less than clear) should not be used to make the inference that "the entire wikipedia project a failure" Rschulz 23:56, 1 Mar 2005 (UTC)
This article was never intended to be accessible to secondary school students generally (although certainly there are those among them who would understand it). I would consider this article successful if mathematicians who know probability theory but do not know statistics can understand it. And I think by that standard it is fairly successful, although more examples and more theory could certainly be added. If someone can make this particular topic comprehensible to most people who know high-school mathematics, through first-year calculus or perhaps through the prerequisites to first-year calculus, I would consider that a substantially greater achievement. But that would take more work. Michael Hardy 00:40, 10 Jan 2005 (UTC)
I would consider this article a failure if only "mathematicians who know probability theory" can understand it. I got to the article via a link from Constellation diagram, and learned nothing from it that I didn't already know from the other article, not even what it is used for. 121a0012 05:41, 14 June 2006 (UTC)
I think there is kind of a style struggle in mathematics where some people prefer math text to say what it means and discuss itself introspectively, and others prefer a terse no-nonsense style. I find the discussion-based approach to be healthier. One of the problems though is it isn't necessarily encyclopedic to use that tone. I mean most important math can be stated in one or two sentences. Doesn't mean that the sentence will be approachable. But it will be factually complete, saying all there is to say about the subject. So then an encyclopedia editor struggles with the fact that you have to say more to make it approachable, but at the same time you can say less and say all there is to say. It is a problem unique to mathematics. Jeremiahrounds 13:01, 20 June 2007 (UTC)
But most mathematicians who know probability theory do not know this material, so it should not be considered a failure if they understand it. That is not to say it should not be made accessible to a broader audience. But that will take more work, so be patient. Michael Hardy 17:42, 14 June 2006 (UTC)
I totally agree with the original complaint that this along with many other Wikipedia math articles are too heavy going. For that reason I replaced the first paragraph with something more digestible. I see no reason to dive into using math symbols right in the first paragraph. --Julian Brown 02:50, 30 August 2007 (UTC)
As a university student learning statistics, I think this article needs improvement. It would be good if a graph of the likelihood of different parameter values for p was added (with the maximum pointed out) to the example. This addition would require adding some specific data to the example. Also, the example should be separated from the discussion about MLE, to make sure people understand that the binomial distribution is only used for this case. The reasons why it is good to take the log of likelihood are not discussed. Further, the discussion about what makes a good estimator (and how MLE is related to other estimators) could be expanded. Rschulz 23:56, 1 Mar 2005 (UTC)
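On the point above that the reasons for taking the log of the likelihood are not discussed: one practical reason is numerical. The following is a minimal Python sketch, assuming a standard normal model and a made-up sample of 2000 identical observations (both are illustrative assumptions, not taken from the article). It shows the raw product of densities underflowing to zero while the sum of log-densities stays finite.

```python
import math

def normal_density(x, mean=0.0, sd=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Hypothetical sample: 2000 observations, all equal to 0.5 (illustration only).
sample = [0.5] * 2000

# Raw likelihood: a product of 2000 numbers below 1 underflows double precision.
likelihood = math.prod(normal_density(x) for x in sample)

# Log-likelihood: the same information as a sum, which stays representable.
log_likelihood = sum(math.log(normal_density(x)) for x in sample)

print(likelihood)       # 0.0 because of underflow
print(log_likelihood)   # a finite negative number, about -2087.9
```

Because the logarithm is strictly increasing, the parameter value that maximizes the log-likelihood also maximizes the likelihood, and the sum is usually easier to differentiate than the product.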
The "left as an exercise to the reader" part is definitely gratuitous and needs to go. I came to this page to learn about the subject, not for homework problems.
a user in CA: Good article, don't be so hard on the author(s), of course could be better, but most of us have day jobs, but I would change the notation, as this was confusing: "The value (lower-case) x/n observed in a particular case is an estimate; the random variable (Capital) X/n is an estimator." seems to conflict with the excellent example at the end for finding maximum likelihood x/n in a binomial distribution of x voters in a sample of n (without replacement). Now, next question, can anybody explain the Viterbi algorithm to a high-schooler? 01 March 2004
I don't see the conflict. The lower-case x in the example at the end is not a random variable, but an observed realization. Michael Hardy 22:50, 2 Mar 2005 (UTC)
To chime in on this discussion: reading this article felt too much like a high school text book, and too little like an encyclopedia. Though I agree that this article should be (in part) accessible to laymen, elaborating to this amount of detail makes it hard for non-laymen to get to what they want to know. I believe that laymen are more than adequately catered to in the premier section of this article. The rest of the article should highlight important properties and results. Editors, remember, this is an encyclopedia, a reference book, not a text book, and not Wikiversity. The article can be shortened significantly by: (1) Trimming the huge (seemingly repetitive) list of "It has widespread applications in various fields, including:..." list. (2) Assuming the reader already has a basic knowledge of what statistics is or at least that he has read/understood the statistics article of Wikipedia. (3) Cutting the hand-holding mathematics in the "Examples" section, which should be moved to Wikiversity. Solace1 —Preceding unsigned comment added by 207.237.81.59 (talk) 13:34, 11 November 2008 (UTC)

Difference between likelihood function and probability density function

I guess the line "we may compute the probability associated with our observed data" is not correct, because the probability of a continuous variable at any given point is always zero. The correct statement would be "we may compute the likelihood associated with our observed data". For more argument please see [1]

Your assertion is correct, but your header is not. That is NOT the difference between the likelihood function and the density function. One is a function of the parameter, with the data fixed; the other is a function of the data with the parameter fixed. Michael Hardy 02:16, 13 May 2006 (UTC)
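The distinction drawn above can be checked numerically. The following is a minimal Python sketch, assuming an exponential model f(x; rate) = rate · exp(−rate · x) and a hand-written trapezoidal rule (the model, the fixed values 1.5 and 2.0, and the function names are my own choices for illustration). The same formula integrates to 1 over the data with the parameter fixed, but not over the parameter with the data fixed.

```python
import math

def density(x, rate):
    """Exponential density f(x; rate) = rate * exp(-rate * x) for x >= 0."""
    return rate * math.exp(-rate * x)

def integrate(func, lower, upper, steps=100_000):
    """Plain trapezoidal rule; adequate for these smooth integrands."""
    width = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * width)
    return total * width

# Function of the data x, parameter fixed at 1.5: a density, area close to 1.
area_over_data = integrate(lambda x: density(x, 1.5), 0.0, 60.0)

# Function of the parameter, data fixed at x = 2.0: a likelihood.
# Its area is 1 / x**2 = 0.25 here, so it is not a density in the parameter.
area_over_rate = integrate(lambda r: density(2.0, r), 0.0, 60.0)

print(area_over_data, area_over_rate)
```

The upper limit 60.0 stands in for infinity; the neglected tails are negligible for these values.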

Where is the Spanish version?


Is it possible to get the text shown as it should read by people who don't know LaTeX code, instead of symbols such as x_1 for subindexes or x^1 for superindexes, etc.? I understand those but not other symbols which are used in this article, and in any case they hamper reading. Thanks! Xinelo 14:47, 21 September 2006 (UTC)

Maybe it was a temporary problem; they look OK to me now. Michael Hardy 15:06, 21 September 2006 (UTC)

Untitled

Great article! Very, very clear. 18.243.6.178 04:08, 13 November 2006 (UTC)

Thank You!!

This article is fantastic. It is more understandable than my class notes and has helped me greatly for my class. I also really appreciated the use of examples. Many many thanks to the people who wrote it! Poyan 8:09, 6 December 2006 —The preceding unsigned comment was added by 128.100.36.147 (talk) 13:09, 6 December 2006

Sloppiness

You really should say something about the second derivative test. You nonchalantly claimed you reached a maximum. You could very well have found a minimum. Don't reinforce bad habits. This is especially true for the Normal case. Just because the gradient is zero does not mean you have a local maximum... and it's not especially trivial to just brush off. (ZioX 18:06, 19 March 2007 (UTC))

I don't know which "you" is addressed, but I agree with the spirit of the comment. But the details of the comment fall short. The second-derivative test may prove a local maximum, but here we need a global maximum. There are various ways to prove there is a global maximum, not all of them involving second derivatives. For example, suppose you show that L(θ) increases as θ goes from 0 to 3, and decreases as θ goes from 3 to ∞, and the parameter space is the interval from 0 to ∞. Then you've got a global maximum at 3, without benefit of second derivatives. Or suppose you've shown that L(θ) is differentiable everywhere, and because the parameter space is compact, there must be a global maximum somewhere, and furthermore L(θ) is 0 on the boundary and positive in the interior. Then the global maximum must be reached at a critical point in the interior. If next it turns out that there is only one critical point in the interior, then you've got it again, and again without second derivatives. This is not at all an unusual situation in elementary MLE problems. Michael Hardy 18:30, 19 March 2007 (UTC)
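The compactness argument above can be made concrete with a worked example. This is a sketch under an assumed binomial model with n trials and x observed successes, 0 < x < n, with the constant binomial coefficient dropped; the choice of model is mine for illustration and is not something the comment itself specifies.

```latex
\begin{align*}
L(p) &= p^{x}(1-p)^{n-x}, \qquad p \in [0,1], \quad 0 < x < n, \\
L(0) &= L(1) = 0, \qquad L(p) > 0 \ \text{for } 0 < p < 1, \\
L'(p) &= p^{x-1}(1-p)^{n-x-1}\,(x - np), \\
L'(p) &= 0 \ \text{on } (0,1) \iff p = x/n .
\end{align*}
```

L is continuous on the compact interval [0, 1], so it attains a global maximum; since L vanishes on the boundary and is positive inside, that maximum lies in the interior and is therefore a critical point; the only one is p = x/n. The sign of the factor (x − np) also shows L increasing before x/n and decreasing after it, which is the first kind of argument described above. No second derivative is needed in either case.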

Is this true

From the bias section: "we can only be certain that it is greater than or equal to the drawn ticket number." Is this true? Wouldn't it be less than or equal to? --Vince |Talk| 05:00, 15 May 2007 (UTC)

disregard: It was the wording that confused me. The article Bias of an estimator phrases the problem in the manner I was thinking. I may try to make it clearer. --Vince |Talk| 06:37, 15 May 2007 (UTC)
It was true but not relevant, so I dropped it when I expanded this section. RVS (talk) 00:17, 25 December 2008 (UTC)

Some more applications please

Hi all, The article is very well written from a statistics POV.

MLE is used ubiquitously in phylogenetic analysis and cladistics in genetics and evolutionary biology. It would be great if someone could include a section on how MLE can actually be applied to such studies, with some examples.

Also, I, as an amateur biologist, know where MLE is used, but do not have an intuitive understanding of the technique. A section that would provide the layman with such a perspective (of what's actually happening) would be great...

Indiaman1 19:30, 30 June 2007 (UTC)

I think the article is not "well written". "MLE is a popular statistical method": I think such comments should be left to established textbooks. In Germany it could be Fahrmeir (et al.), maybe others as well. These professors with the 3rd + x editions of basic statistics. There should be mention of competing measurements (confidence intervals, potential misuse of the p-value (deciding after the fact which of the confidence levels to use etc.)). For a non-English (non-American) reader it would be interesting to know who are considered the leading statistics profs in the UK etc. [Gaschroeder] -- Gaschroeder (talk) 14:47, 3 October 2010 (UTC) -- Gaschroeder (talk) 14:51, 3 October 2010 (UTC)

Non-independent variables

I have added a section on non-independent variables. I hope this proves useful to someone :) Velocidex (talk) 04:44, 19 March 2008 (UTC)

I added a tie back to article topic by mentioning the likelihood function, which should possibly be the main thing being specifically discussed rather than the density function. Melcombe (talk) 10:45, 19 March 2008 (UTC)

For generality this section really needs something said about mixed discrete-continuous distributions. Melcombe (talk) 10:45, 19 March 2008 (UTC)

Mathematical precision

Someone wrote above that this article reads (read) like a 1980's textbook. In my opinion that was the case. For instance, the discussion of the asymptotic properties of maximum likelihood estimation could have been taken straight out of many standard textbooks, but an intelligent person can realise that the authors of those books either don't know what they are talking about or are hiding things from the reader. I added a sentence referring to modern mathematical results on the maximum likelihood estimator (modern: these results have been known since the 60's but still did not permeate into standard textbooks). I hope the result still makes sense to the non-expert. Gill110951 (talk) 07:12, 5 May 2008 (UTC)

Typography

I see mixed use of   and   for the likelihood function. The likelihood function page itself only uses  . Which is it? (Or do they mean slightly different things?) —Ben FrantzDale (talk) 12:56, 14 August 2008 (UTC)

link has pw

there is a pw requirement on the link to the tutorial —Preceding unsigned comment added by 193.157.202.64 (talk) 13:04, 20 August 2008 (UTC)

applications list

The current article lists a mish-mash of applications. I think a distinction needs to be made between application areas that are statistical methods (univariate models, structural equation models, etc.) vs. application areas that are different fields (agriculture, communications, business research, etc.). Either type is perhaps too broad to list in this article, but I don't think the types should be mixed.

The article currently mentions "...applications in various fields, including:

Hope this comment is helpful. doncram (talk) 21:28, 30 December 2008 (UTC)

about which distribution

(discussion section separated out under this title later)

I think the second sentence in the "Principles" section is ambiguous. The sentence currently reads, in part, "We draw a sample x_1,x_2,\dots,x_n of n values from this distribution [...]". I think it would be much clearer to specify exactly which "distribution" is being referred to by "this distribution". The current wording makes it unclear whether we're looking at one member of the family of distributions, or instead the PDF itself. —Preceding unsigned comment added by Diracula (talk • contribs) 21:08, 27 January 2009 (UTC)

Your point escapes me. It is of course referring to one member of the family, and that one member does have a pdf. And that one could be ANY of the members of the family. "Ambiguous" normally means there are two ways to construe the sentence. Just what those two ways are is unclear. Your way of phrasing it seems to suggest that the two ways are (1) We're looking at one member of the family of distributions; and (2) We're looking at the pdf. I don't see how the sentence can be read as giving us that choice. Before that, you say the ambiguity is about the question of WHICH DISTRIBUTION it is. That's quite a different thing from choosing between the statements I labeled (1) and (2) above, since the choice between (1) and (2) is not a choice of which distribution it is. Nor can I find any ambiguity about which distribution it is. So I find your comments completely cryptic. Michael Hardy (talk) 22:31, 27 January 2009 (UTC)

Consistency

Why is it listed that "The MLE is asymptotically unbiased, i.e., its bias tends to zero as the sample size increases to infinity" under "Under certain (fairly weak) regularity conditions" when one of the conditions listed is "The maximum likelihood estimator is consistent." Isn't being consistent the same as being asymptotically unbiased? cancan101 (talk) 00:11, 3 March 2009 (UTC)

Technically they are not the same thing; consistency includes the spread of the distribution reducing, while the other only needs the mean to approach the right value. Your point however is broadly correct, as consistency almost implies asymptotic unbiasedness (one exclusion being that the mean still need not exist for consistency). I think what is meant is really that unbiasedness usually holds under similar conditions to those needed for consistency. The level of technical detail in the article does not naturally incline to including more detail. I think it is meant to be a general warning not to expect that maximum likelihood will always "work".
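To make the two notions above concrete, here is a small Python sketch using the MLE of the variance of a normal sample, (1/n) Σ (x_i − x̄)². The choice of example and the function name are mine, for illustration only. The closed forms used are the standard ones: bias −σ²/n and variance 2(n − 1)σ⁴/n².

```python
def variance_mle_bias_and_spread(n, sigma2=1.0):
    """Exact bias and variance of the normal-variance MLE for sample size n.

    Uses the fact that n * estimate / sigma2 follows a chi-squared
    distribution with n - 1 degrees of freedom.
    """
    expected_value = (n - 1) / n * sigma2
    bias = expected_value - sigma2                 # equals -sigma2 / n
    variance = 2 * (n - 1) * sigma2 ** 2 / n ** 2  # shrinks roughly like 2 * sigma2**2 / n
    return bias, variance

# Both columns tend to zero: the mean approaches the right value
# (asymptotic unbiasedness) and the spread shrinks as well.
for n in (10, 100, 1000, 10000):
    bias, variance = variance_mle_bias_and_spread(n)
    print(n, bias, variance)
```

In this example the two properties hold together. A case where they come apart needs a different construction, for instance an estimator whose mean does not exist, as noted above.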

Article title

Why was this article renamed "Maximum likelihood" instead of "Maximum likelihood estimation"? The latter seems more appropriate and I suggest renaming it back. -Roger (talk) 15:52, 25 March 2009 (UTC)

I agree - why the change? The article even starts "Maximum likelihood estimation (MLE) ..." 128.61.125.131 (talk) 14:55, 28 January 2010 (UTC)
Notionally, maximum likelihood is more general, with some things worth saying that are not about "maximum likelihood estimation", but there would probably not be very much. However, none of this appears (yet). There is also a question of whether it should be "estimate" or "estimation" .... this would preferably agree with other article titles, such as "minimum distance". Melcombe (talk) 10:51, 29 January 2010 (UTC)
We have M-estimator, Extremum estimator, Bayes estimator, but Maximum spacing estimation, Maximum a posteriori estimation, Minimum distance estimation, and then there is also Generalized method of moments. I’d vote for “Maximum likelihood estimation”, because that’s how it is usually defined in econometric textbooks (MLE). Although the abbreviation “ML” can also be seen occasionally.  … stpasha »  18:42, 29 January 2010 (UTC)

Error in pdf formula?

Shouldn't this:

The joint probability density function of these n random variables is then given by:
 

be this?:

The joint probability density function of these n random variables is then given by:
 

as per Multivariate_normal_distribution#General_case? Aaron McDaid (talk - contribs) 22:10, 2 April 2009 (UTC)

I've changed it to this:
 
(Alternatively, one could put (2π)^n under the radical.) Michael Hardy (talk) 01:42, 3 April 2009 (UTC)

See also

This is one of the rare articles that gives synopses of the related articles: Isn't it better just to give the links?

I humbly suggest these alternatives, which are usually shorter and which in some cases (e.g. abduction) avoid objectionable statements.

I put ?? after synopses that should be removed imho. Kiefer.Wolfowitz (talk) 18:59, 24 May 2009 (UTC)

"Isn't it better just to give the links?" No. It is at least helpful to have some indication of relevance. For example, why is there a link to censoring? But it might be better to reduce the number of "see alsos" by creating brief subsections with headings like "alternative estimation methods", so as to better group the topics. Melcombe (talk) 09:42, 27 May 2009 (UTC)

Weak consistency or strong consistency?

As I understand, consistency is actually weak consistency. I would like to know if there is a proof of strong consistency of the maximum likelihood estimator? —Preceding unsigned comment added by 132.72.51.67 (talk) 16:38, 25 August 2009 (UTC)

Insufficient content

The article provides a great layman description of the concept of maximum likelihood; however, the rigorous mathematical treatment of the same topic is also necessary. What use is the claim “under certain fairly weak conditions MLE is consistent” if those conditions are never spelled out? What use is the “asymptotic normality” property if the article doesn’t even state the asymptotic variance of the estimator?

The “Maximum spacing estimation” article lists several examples when the MLE fails: certain heavy-tailed distributions, certain mixture distributions, certain 3-parameter models. It would be nice to know why those examples don’t work (at least they are not the case of boundary parameters). ... stpasha » talk » 10:31, 2 September 2009 (UTC)

(Dis?)motivational example

Motivational example

As a motivational example, let X be a binomial random variable with n = 10 and an unknown parameter p. Further suppose that x = 3, a realization of the random variable X, is observed. As in all estimation problems, the goal is to estimate the unknown parameter p. The fundamental idea in maximum likelihood estimation is to calculate the probability of generating the observed value x = 3, that is, P(X = 3), under the different possible values that p can take. By construction of the binomial distribution, p is a probability and therefore p ∈ [0, 1]. In order to calculate these probabilities, we view the binomial distribution as a function of p, holding x fixed at 3:

$L(p) = \Pr[X = 3;\, p] = \binom{10}{3}\, p^{3} (1 - p)^{7}$

In this example, L is the likelihood function. Here is a plot of L:

[Plot: the likelihood L(p) against p ∈ [0, 1]]

As can be seen on the graph, it turns out that the probability of generating x = 3 is the largest when p = 0.3. Hence $\hat{p} = 0.3$ will be the maximum likelihood estimate of p. Of course, in general, inference will not be based on a single observed value x but on a random sample $X_1, \dots, X_n$ yielding observations $x_1, \dots, x_n$. Supposing that the random variables $X_1, \dots, X_n$ are independent and identically distributed, the likelihood function is a product of binomial densities, regarded as a function of p.
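The calculation in the quoted example can be reproduced numerically. The following is a minimal Python sketch, assuming a plain grid search over p in steps of 0.001 (my choice for illustration; the example itself only refers to a plot, and the function name is hypothetical).

```python
import math

def binomial_likelihood(p, n=10, x=3):
    """Probability of observing x successes in n trials, viewed as a function of p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Evaluate the likelihood on a grid over the parameter space [0, 1].
grid = [i / 1000 for i in range(1001)]

# The grid point with the largest likelihood is the (approximate) MLE.
p_hat = max(grid, key=binomial_likelihood)

print(p_hat)                        # 0.3
print(binomial_likelihood(p_hat))   # about 0.2668
```

The grid maximum coincides with the closed-form answer x/n = 3/10 because 0.3 happens to lie on the grid; in general a grid search only locates the maximum to within the grid spacing.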

So there used to be a section called “Motivational example”. I've looked at it for a long time, and finally concluded that the best way to improve it would be to remove it altogether. This example simply breaks the flow. It starts using the MLE method without explaining how and why it is used so. It uses an example with 1 observation, which, although legal, is quite counterintuitive since no properties of such an estimator can be stated. I think we have enough examples in the “Examples” section, so that we don't need this one too.  … stpasha »  07:50, 24 January 2010 (UTC)

“Convergence of maximum likelihood estimator” gone

This section ([2]) has been an ugly duckling from the beginning, but despite the selfless attempts of Michael Hardy, it can never grow into a swan… There is just so much wrong with that piece, for example:

  1. It silently assumes that θ is a scalar, although the article states otherwise;
  2. It concludes the “proof” with Pr[Sn(θ)>0] → 0, while in fact this probability is equal to 1 for all n.
  3. The general idea to prove a theorem without stating it, and without specifying the assumptions needed for it, is faulty.

Sentence: removed.  … stpasha »  08:11, 24 January 2010 (UTC)

Uniqueness?

This article made a lot of statements about "the" mle, when this seems contrary to the literature. Old examples of Cramér show the importance of including local minima (and the problems of restricting attention to only maxima).

Second, this article stated a lot of nice properties of the mle, and included interpretations that the (?) MLE method was the best. But many other estimators have those properties, which may be proved under wider conditions---see the maximum probability estimator of Wolfowitz (no relation!!), and Le Cam on the superior properties of Bayes methods and a lot of examples where the MLE has trouble. I tried to soften uniqueness claims ("the"), where this was possible. No doubt others can improve my edits. Thanks! Kiefer.Wolfowitz (talk) 00:36, 25 January 2010 (UTC)

I'm not sure what it is that you're trying to say. The MLE is one of the generic estimation techniques, applicable to a wide range of problems. There are typically other methods which can be applied to same problems, and some of those methods are as good as MLE. But they can't be better than MLE (at least up to the second-order), because MLE is already efficient — at least as long as we're talking about the MSE loss function.
The maximum probability estimator: what is it? As for the “Bayes estimator”, there is no such thing. There is a Bayesian school of thought for estimation, and within that school many estimators can be constructed. None of those estimators is “superior” to MLE, because they work with a different model: they require that additional prior knowledge about the value of the parameter be available.
I'm not saying that MLE is like “best from the best” or anything like that, but the article should be kept encyclopedic, and any comparison to other estimation methods be done in a separate section (or maybe even a separate article).  … stpasha »  08:18, 25 January 2010 (UTC)
On the first point, the article does have "For certain problems the maximum likelihood estimates may not be unique, or even may not exist" fairly early on. There does seem scope to add some new sections to cover the non-uniqueness case from both a theoretical-asymptotic viewpoint and a practical one. Another thing possibly not mentioned fully enough is the case where estimates are on the boundary of the parameter space (asymptotic properties). Melcombe (talk) 10:48, 25 January 2010 (UTC)

Proposed restructure

Earlier versions of this article were fairly readable (for example this version of early January), but the recent addition of so-called "proofs" has ruined this. Can we improve things by just moving most of this over-technical stuff towards the end of the article (after the existing "see also" section, which could be renamed) under a heading like "Theoretical details"? This would be in line with the guidance for maths articles, that it is OK to start simple and become more technical later, even at the risk of some repetition. Melcombe (talk) 12:39, 26 January 2010 (UTC)

This is a good idea. Another idea would be to have a self-standing article entitled "Proofs of properties of MLE". Some articles on technical subjects have informal and formal versions; for example, see the gentle Introduction to M-theory and the high fallutin' M-theory. Kiefer.Wolfowitz (talk) 15:42, 26 January 2010 (UTC)
I agree that the subsections “Consistency” and “Asymptotic normality” still require lots of work. And that’s why I put the tags {{cleanup}} there. In particular, the material from the “Asymptotics” subsection should be distributed between the first two. But I’m not sure if moving the material all the way out of the article is a good idea… Perhaps that guidance for mathematical articles can be interpreted section-wise, like start the section simple, and then proceed to more advanced theory.  … stpasha »  18:52, 26 January 2010 (UTC)
The previous structure of the model was that there was a readable outline of all the properties of MLE, which must be retained but which is being destroyed by these supposed "proof" intrusions. My own view is that "proofs" should typically not appear on Wikipedia, particularly when they are text book stuff, but that the theory necessary for a proper description of what is being discussed should appear. Thus I am against "Proofs of ... articles" when they have no encyclopedic content (ie no proofs for the sake of having proofs), as they would here. Proofs would be much better dealt with by just giving appropriate citations. Melcombe (talk) 10:23, 27 January 2010 (UTC)

Discrete case

So, does anybody know if there is a justification for the use of MLE in the case when either the parameters or the data belong to a discrete set?  … stpasha »  19:48, 26 January 2010 (UTC)

MLE can be proved to have good properties for one-parameter exponential families, e.g. binomial, negative binomial, Poisson, etc. A good reference is the work of Erling Andersen, who especially emphasized the importance of conditional maximum likelihood estimation, which avoids the inconsistency of the MLE in many cases. (For a counter-example to conditional mle, see Deb. Basu's selected writings (Ghosh, ed.).) MLE has trouble with multiple parameters, and the bestiary of likelihoods (partial, profile, etc.) have worse properties than MLE with one parameter. Kiefer.Wolfowitz (talk) 22:49, 26 January 2010 (UTC)
Cox's Principles of Statistical Inference (ISBN 0-521-68576-2) has (page 143) a discussion of a case where the mean of a normal distribution is known to be an integer... but this is more in an "inference" context than "estimation". Page 144 in the same book has the example of a mixture of two normals being a case where standard asymptotic theory doesn't apply, which relates to another earlier question. Melcombe (talk) 10:32, 27 January 2010 (UTC)
Perhaps the most important condition for the success of MLE in irregular cases is the following:
* MLE was used on this irregular problem by David Cox!
Cox's recent Principles mentions a practical example where the mle has substantial bias, making it dangerous in application, for optimal design of experiments for logistic regression, citing Rolf Sundberg (EM method) and Vågero, I believe. (Cox and Sundberg are usually fond of MLE.) Kiefer.Wolfowitz (talk) 14:17, 27 January 2010 (UTC)
On discrete parameters, the case where the mean of a normal distribution is known to be an integer is also used for two Examples by Kendall & Stuart (Advanced Th. of Statistics, Vol 2, 3rd Edition, Examples 18.21-2). They indicate that the second provides an example where the MLE is not consistent. Melcombe (talk) 14:37, 1 February 2010 (UTC)
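For the integer-mean case mentioned above, here is a minimal Python sketch, assuming a normal model with known variance 1 and a made-up sample (both are my assumptions for illustration and are not taken from Cox or Kendall & Stuart). With the mean restricted to integers, the log-likelihood can simply be compared across candidates. Since it is a downward parabola in the mean, centred at the sample average, the winner is the integer nearest to that average.

```python
import math

def log_likelihood(mean, data, variance=1.0):
    """Normal log-likelihood for a candidate mean, variance assumed known."""
    n = len(data)
    squared_error = sum((x - mean) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)

# Hypothetical observations (illustration only).
sample = [2.1, 3.4, 1.9, 4.2, 3.0, 2.8]

# Discrete parameter space: integers in an arbitrary search window.
candidates = range(-10, 11)
integer_mle = max(candidates, key=lambda m: log_likelihood(m, sample))

sample_mean = sum(sample) / len(sample)
print(integer_mle, sample_mean)   # the integer nearest to the sample mean
```

No derivative is involved here, since the parameter space is discrete. The sketch says nothing about consistency, which is the point at issue in the examples cited above.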

I object

I object to the blatantly sexist and ageist example being used as an example of a normal distribution in nature, namely the height of adult female giraffes. I would like to suggest a less controversial example: the mass of mouse turds divided by the mass of the mouse producing them. This will be approximately normally distributed, and has the added advantage of being independent of age and (very probably) sex, thus avoiding an unhealthy obsession with age and sex. It does however suffer from a particularly mouse-o-centric view of the universe, and is therefore not perfect. I am wracked with guilt over this flaw, but lacking a better example, I offer it as perhaps a temporary improvement. PAR (talk) 22:50, 24 March 2010 (UTC)

Who is Pfanzagl?

One of the references is to Pfanzagl. Is this a made-up name or something? It shows up once in the entire page, so it's not a valid reference/citation. —Preceding unsigned comment added by 130.58.92.245 (talk) 03:04, 29 June 2010 (UTC)

The inadequate reference has been improved, now reading "Pages 207-208 in Pfanzagl, Johann; with the assistance of R. Hamböker (1994). Parametric Statistical Theory. Berlin: Walter de Gruyter. ISBN 3-11-01-3863-8, 3-11-014030-6. MR 1291393." Kiefer.Wolfowitz (talk) 11:37, 29 June 2010 (UTC)
Who is Pfanzagl? Professor Dr. Johann Pfanzagl may have been the leading mathematical statistician in Austria and perhaps all of the German speaking countries from 1970-1995, say. In his early work, Pfanzagl solved the outstanding problem of von Neumann and Morgenstern of showing that expected utility theory could be axiomatized with subjective probability (essentially compatible with von Neumann's approach), and wrote a related monograph on measurement theory. He then worked on statistical inference, with deep results on median-unbiased estimation and exponential families. Following Le Cam and Hajek, Pfanzagl has been one of the architects of asymptotic theory, including both parametric and semiparametric approaches; Le Cam credits Pfanzagl with introducing tangent cones and spaces, and these objects are now standard in advanced graduate books. Although Pfanzagl clearly states that MLE has no good finite-sample properties, he has contributed one of the best convergence analyses of maximizing the conditional likelihood; Pfanzagl also includes simulation studies for assessing the performance of MLE and other methods, which often show that the MLE behaves quite well in moderately large samples. Kiefer.Wolfowitz (talk) 11:37, 29 June 2010 (UTC)

Vandalism: Administrators, consider raising the protection level edit

The vandalism intensity has increased lately. Could the level of protection be increased to prevent damage by IP editors? Thanks!  Kiefer.Wolfowitz  (Discussion) 13:11, 13 March 2011 (UTC)Reply

Revert edit

I'm reverting the last edit [3] by Kiefer.Wolfowitz for several reasons:
(1) "For many purposes, the maximum-likelihood estimator has poor theoretical properties": this is not explained in the following text (what poor properties, which purposes?), and the sentence didn't fit into the paragraph anyway.
(2) "... mathematical statisticians consider a related estimator that finds critical points of the log-likelihood function.": not true, since right after you find the MLE you compute the information matrix (−Hessian of the objective function) and verify that this matrix is positive definite, which means that you have the maximum.
(3) "When evaluated on a sequence of observations from the true distribution, a subsequence of the estimator-evaluations ...": this says the same as before, only in far more complicated language.
(4) "... , for some (unspecified) sample size n": stated as such the property becomes false. For a given n (even unspecified) you cannot estimate with arbitrary precision; however, for arbitrary precision you can find the right sample size n, as was stated initially (and why is n suddenly bold?).
(5) "if the limiting likelihood function ℓ(θ|·) has any global maximum at θ0 then it is unique.": again not true. The identification condition is necessary and sufficient for this function to have the global maximum (assuming the function is well-defined); for example, see Lemma 2.2 in Handbook of Econometrics 36. // stpasha » 18:43, 16 March 2011 (UTC)Reply

I am on the road and so I cannot reference everything. I shall reply more fully later. Considering zero-score estimators is standard: See Ferguson's article,
  • Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the American Statistical Association. 77 (380): 831–834. JSTOR 2287314.
which I believe is cited in the enthusiastic book of Pawitan.  Kiefer.Wolfowitz  (Discussion) 08:17, 17 March 2011 (UTC)Reply
BTW, it is possible to link to Handbook of Economics chapters:

Are there other ways of estimating? edit

It basically starts something like "MLE is a method for making estimates of ...". Well, are there other ways of doing so? Other generic ways? Do these methods all give the same estimates, or is making estimates not an exact science? 80.162.194.33 (talk) 09:22, 4 August 2011 (UTC)Reply

Please ask such questions at the help desk or at an internet chat site. Talk pages are for discussing improvements to the article.  Kiefer.Wolfowitz 10:55, 4 August 2011 (UTC)Reply
The anon user might have meant that we should address these concerns in the article itself. If so, that's a legitimate use of a talk page. - dcljr (talk) 00:53, 3 April 2013 (UTC)Reply

Kolmogorov in the introduction edit

The introduction section provides a very useful starting point for non-experts reading the page, with the exception of the following sentence: "In the Kolmogorov structure function one deals with individual strings." I personally find it a bit strange that this appears here. If I understood what it is dealing with, I'd relegate it to further down in the article, but I don't. Would anyone care to move this sentence or improve it to make it more congruent with the rest of the introduction? Jimjamjak (talk) 15:15, 25 January 2012 (UTC)Reply

Good point. I've just removed that sentence. It was added last November by Vitanyi (talk · contribs), who heavily edited 'Kolmogorov structure function'. Its relationship to maximum likelihood isn't clear, and it's certainly not suitable for the lead. If someone else understands it better and thinks it deserves a mention further down the article, that's fine with me. Qwfp (talk) 15:39, 25 January 2012 (UTC)Reply

Subsequence of the sequence converges in probability edit

In the article it says: 'Consistency: a subsequence of the sequence of MLEs converges in probability to the value being estimated.' I don't agree with it. The sequence converges in probability to the value. Why do you have to take the subsequence? 95.113.187.130 (talk) 21:05, 29 October 2012 (UTC)Reply

Agreed. I have changed the article. As discussed in the Consistent estimator article, it is the sequence of estimators that converges in probability to the parameter. If only "a subsequence" did so, that would be a hideously weak result. Now, if it said "all subsequences", that would be different: that would imply that the sequence itself converges... but there's no reason to say it that way, AFAIK. - dcljr (talk) 00:49, 3 April 2013 (UTC)Reply

Error in the article: Consistency edit

Hey!

There is an error in the conditions that are demanded for the MLE to be consistent. The second condition states that the parameter space must be compact, which is not true. The parameter space must be an open subset of ℝ^k, which implies completeness, but not compactness.

For example the parameter space of σ² in the normal distribution is (0, ∞). That subset is open, and thus complete but not compact.

Repairing this would call for replacing the whole part about consistency. Suggested sources are Lehmann (1999) Elements of Large Sample Theory, or Knight (2000) Mathematical Statistics. — Preceding unsigned comment added by Emogstad (talk · contribs) 10:09, 25 May 2013 (UTC)Reply

Problem in "Higher-order properties" section edit

I believe that the bias calculation has a typo. The tensor   should actually be  . The cited source is equation (20) from Cox and Snell (1968) (http://www.jstor.org/stable/2984505). This problem is there as well.

I checked two other sources, which comply with the correction I propose:

  • Equation (16) from Shenton, L. R., & Wallington, P. A. (1962). The Bias of Moment Estimators with an Application to the Negative Binomial Distribution. Biometrika, 49(1/2), 193. http://doi.org/10.2307/2333481
  • 7th equation in page 29 from Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38. http://doi.org/10.1093/biomet/80.1.27

I'm not an expert on this topic, but rather an advanced user. If somebody confirms this, I would make the changes. — Preceding unsigned comment added by Fbalzarotti (talk · contribs) 21:55, 28 February 2016 (UTC)Reply

Wikipedia:Be bold (but not reckless): You've done some good homework. Make the change. I don't have time to check the math nor the sources you cite.
I suggest including a single footnote citing all three sources, with a brief comment listing first Shenton and Wallington (1962) and Firth (1993), then noting that Cox and Snell (1968) write   where the other sources write  , and that you believe the other sources (because ...???).
Do you have an example where using   would give a different answer from using  ? If you do, then you might do a simulation to confirm -- and it would be worth mentioning that example in the text and your simulation in a footnote.
For the normal distribution,   = 0 =  , for all  ,  , and  . For Poisson regression and logistic regression,   = 0; all the bias comes from  . These three examples don't expose the difference.
(I'm not an expert in this either.) DavidMCEddy (talk)
Thanks for the advice, I'll be bold and go ahead. I believe the other authors and not Cox because I went through the calculations myself, and I'm quite confident given that I found the other articles confirming them.
As for adding an example, I'm not sure. The situation with which I'm working is a bit odd.
I'm using a set   of random variables following a multinomial distribution with parameters   where   and  . When working with maximum likelihood estimators for the parameters   the necessary tensors for the bias estimation are   and  , where the indices   run from 1 to m-1. They're both symmetric in   and the problem doesn't show. So far, so good.
I'm actually working in a situation in which the parameters   depend on a set of variables  . I'm using a maximum likelihood estimator for these variables   and I need to calculate an approximation of their bias. In this context the result for   is
 
which is not symmetric. By interchanging   by   the result completely changes. In general, the functional dependence   makes this tensor non-zero and non-symmetric, which are the cases with which I'm working.
I have calculated and simulated all this, but I'm afraid that posting this whole thing as an example may not be exactly informative, but quite disruptive. Fbalzarotti (talk) 15:42, 29 February 2016 (UTC)Reply
I'm impressed.
After you get your work published -- at least as a tech report on the web -- I might like to see a line added saying something like what I wrote above: "For the multivariate normal distribution,   = 0 =  , for all  ,  , and  . For Poisson regression and logistic regression,   = 0; all the bias comes from  . For an example in which   is not zero, see" your tech report.
Wikipedia:Conflict of interest policy "strongly discourages" "contributing to Wikipedia about yourself, family, friends, clients, employers, or your financial or other relationships." For examples like this, they suggest you use the {{request edit}} template on a talk page like this: Provide the suggested change with a link to your paper, noting that it's your work. Then hope that someone else will make the actual change.
And good luck with your research. DavidMCEddy (talk) 17:40, 29 February 2016 (UTC)Reply

Move to hyphenated title edit

@Fgnievinski: All the literature calls it "maximum likelihood" without a hyphen. Why have you moved it? It is not a compound adjective requiring hyphenation (we don't say this estimation is maximum-likelihood). It is a compound noun and therefore should not be hyphenated. Tayste (edits) 02:30, 20 June 2016 (UTC)Reply

@Tayste: I moved maximum likelihood to maximum-likelihood estimation, because estimation was part of the acronym (MLE) mentioned in the lead. The hyphen was not intentional. I've now requested a technical move to maximum likelihood estimation. fgnievinski (talk) 02:50, 20 June 2016 (UTC)Reply
Oh I see, thank you. Tayste (edits) 20:09, 20 June 2016 (UTC)Reply

Maximum a posteriori estimation edit

A recent edit [4] suggests that ML is a "special case" of maximum a posteriori estimation (MAP) for a uniform prior. Is this true? If so, it would be good to have a citation to a source. Thanks. Isambard Kingdom (talk) 13:24, 19 September 2016 (UTC)Reply

This is not true. MLEs enjoy several properties such as invariance under measurable transformations (i.e., g(MLE) is the MLE for g(θ)). This property does not hold for all measurable transformations g for the corresponding Bayesian MAP estimator. Perhaps we could say the value of the MLE will match the value of the MAP estimator after evaluation, assuming one does not change the underlying measure. Chjacamp (talk) 20:29, 6 May 2023 (UTC)Reply

According to the French page, this is completely wrong (and it is also present in the introduction, which is terribly misleading since it's a direct contradiction):

https://fr.wikipedia.org/wiki/Maximum_de_vraisemblance#Histoire

It roughly says that it was a misconception, and Fisher disproved it in 1922.

"In 1912, a misunderstanding led to the belief that the absolute criterion could be interpreted as a Bayesian estimator with a uniform prior distribution. Fisher refuted this interpretation in 1921." — Preceding unsigned comment added by 143.121.239.90 (talk) 13:14, 5 March 2018 (UTC)Reply
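
The invariance point made in this thread can be illustrated with a small numerical sketch. Everything specific here is an assumption chosen for illustration and not taken from the discussion: the data (7 successes in 10 Bernoulli trials), the logit reparametrization φ = log(p / (1 − p)), and the crude grid search. The sketch suggests that the MLE of φ is the logit of the MLE of p, while the posterior mode under a prior that is uniform in p moves once the density is re-expressed in φ.

```python
import math

k, n = 7, 10  # hypothetical data: 7 successes in 10 Bernoulli trials


def log_lik(p):
    """Bernoulli log-likelihood as a function of the success probability p."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def sigmoid(phi):
    """Inverse of the logit map phi = log(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-phi))


def log_post_phi(phi):
    """Log posterior density of phi when the prior is uniform in p.

    Changing variables adds the log-Jacobian log(dp/dphi) = log(p * (1 - p)).
    """
    p = sigmoid(phi)
    return log_lik(p) + math.log(p * (1.0 - p))


def grid_argmax(f, lo, hi, steps=20_000):
    """Crude maximizer over the open interval (lo, hi); endpoints are excluded."""
    grid = (lo + (hi - lo) * i / steps for i in range(1, steps))
    return max(grid, key=f)


# In the p parametrization a flat prior only adds a constant, so MLE and MAP coincide.
p_mle = grid_argmax(log_lik, 0.0, 1.0)
p_map = grid_argmax(lambda p: log_lik(p) + 0.0, 0.0, 1.0)

# In the phi parametrization the likelihood is unchanged but the posterior density is not.
phi_mle = grid_argmax(lambda phi: log_lik(sigmoid(phi)), -10.0, 10.0)
phi_map = grid_argmax(log_post_phi, -10.0, 10.0)

print("MLE of p:", p_mle, " MLE of phi mapped back to p:", sigmoid(phi_mle))
print("MAP of p:", p_map, " MAP of phi mapped back to p:", sigmoid(phi_map))
```

Under these assumptions the posterior mode in φ corresponds to p = (k + 1) / (n + 2) and not to k / n, which is one concrete sense in which "MLE equals MAP with a uniform prior" depends on the parametrization in which the prior is taken to be uniform.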

Section "Principles" edit

This section began with {{main|Likelihood principle}}, even though MLE does not depend upon the likelihood principle. The section then made the following claim.

Suppose there is a sample x1, x2, …, xn of n independent and identically distributed observations, coming from a distribution with an unknown probability density function f0(·). It is however surmised that the function f0 belongs to a certain family of distributions { f(·| θ), θ ∈ Θ } (where θ is a vector of parameters for this family), called the parametric model, so that f0 = f(·| θ0). The value θ0 is unknown and is referred to as the true value of the parameter vector.

The claim about a "surmise" and a true value is false: all models are wrong and so the "true model" does not exist. Additionally, θ could be a scalar, matrix, or something else.

The section then talked a lot about the special case of when the observations are iid, even though it is supposed to be a general exposition of principles. The section then made the following claim.

In the exposition above, it is assumed that the data are independent and identically distributed. The method can be applied however to a broader setting, as long as it is possible to write the joint density function f(x1, …, xn | θ), and its parameter θ has a finite dimension which does not depend on the sample size n.

That is erroneous: consider the likelihood function for an arbitrary autoregressive model and for a proportional hazards model.

There were other technical problems too. As well, much of the rest of the section seemed like blather.

I have removed most of the section, and revised much of the remainder. It could benefit from further improvement though. For the Bayesian aspects, I moved the discussion to a new subsection.  BetterMath (talk) 19:00, 6 January 2018 (UTC)Reply

Subsection "Asymptotic normality" edit

What is the purpose of the subsection "Asymptotic normality"? The subsection defines asymptotic normality as follows.

maximum likelihood parameter estimates exhibit asymptotic normality – that is, they are equal to the true parameters plus a random error that is approximately normal (given sufficient data)

The "true parameter", however, does not exist with real-world data, because all models are wrong. Thus, this subsection seems to describe something that is relevant only for simulated data. Such a long and complicated subsection that is relevant only for simulated data would not appear to benefit the article.  BetterMath (talk) 00:56, 13 January 2018 (UTC)Reply

Furthermore, the subsection begins with this claim: "In a wide range of situations, maximum likelihood parameter estimates exhibit asymptotic normality". The claim is false: the only situations are those in which the data is simulated. Thus it appears that the subsection was written under a misconception.

For the above reasons, I have removed almost all of the subsection. BetterMath (talk) 20:30, 29 January 2018 (UTC)Reply

The level of pedantry leveled here is absurd. By this reasoning, asymptotic consistency should be removed as well, along with second-order efficiency, ... and pretty much everything on this page. Assuming there exists a true model is completely standard within the context of parameter estimation. If you want to argue about model selection or model uncertainty, then go somewhere else for that. (204.111.241.221 (talk) 21:37, 28 July 2018 (UTC))Reply

revising the lede 2019-04-10 edit

@Bender235: On 2019-04-10 the lead paragraph of this article began as follows:

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model such that having obtained a given set of observed data was most probable. Specifically, this is done by expressing the joint probability density function of the observed data   as a likelihood function   of the parameter vector   and finding its maximum value over a parameter space  .

I have problems with this:

  • The first "sentence" is difficult to parse.
  • The likelihood is not always the probability density, e.g., with Censoring (statistics).

I therefore propose the following:

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model so the observed data is most probable. Specifically, this is done by finding the value of the parameter (or parameter vector)   that maximizes the likelihood function  , which is the joint probability (or probability density) of the observed data  , over a parameter space  .

I've made this change, hoping that it's clearer, more concise and more compelling, while still being accurate. DavidMCEddy (talk) 01:49, 11 April 2019 (UTC)Reply

I like your changes, except for "...parameter (or parameter vector)...". I would prefer if we could avoid using parentheses in the lede. --bender235 (talk) 01:55, 11 April 2019 (UTC)Reply
OK: Then how about "parameter" without the "(or parameter vector)"?
Most people who have studied this would understand that a "single value" may not be a single number but a point in a vector space.
Change this if you wish. I think it's clearer for people who have not studied this extensively with the parenthetical "(or parameter vector)" than without, but I can live with this deletion.
If you really insist on "parameter vector", I could live with that also, recognizing that the real line is a vector space of one dimension. DavidMCEddy (talk) 02:51, 11 April 2019 (UTC)Reply

injective is not enough edit

@Bender235: For an MLE to be "the same regardless of whether we maximize the likelihood or the log-likelihood," we need the log to be "a strictly increasing function." Injective is not enough. Monotonic is not enough. If I replace the log(likelihood) with [-log(likelihood)], then I'd have to minimize [-log(likelihood)] to maximize the likelihood.

Injective is even worse. Suppose I have an observation from a Poisson distribution with a parameter space Θ restricted to {1, 2, 3}. This is a contrived example, but with more work I believe I could find one more realistic. Besides, this is mathematics, and we want principles that will work based only on the assumptions.

Suppose further my observation is 2. Then the likelihood L(λ) = λ² e^(−λ) / 2!. For λ in Θ = {1, 2, 3}, we get L(λ) = e^(−1)/2 ≈ 0.184, 2e^(−2) ≈ 0.271, and 4.5e^(−3) ≈ 0.224, respectively.

On this restricted parameter space, −log L(λ) is injective and is maximized when λ = 1, not 2.

Conclusion: I don't see a way to allow the use of only injective transformations and still get a transformation that is necessarily maximized in the same place. Monotonically increasing seems to be necessary and sufficient.

Accordingly, I've reverted "injective" to "monotonically increasing". DavidMCEddy (talk) 02:55, 11 April 2019 (UTC)Reply

Correct. No objection from my side. --bender235 (talk) 03:54, 11 April 2019 (UTC)Reply
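
The Poisson example in this thread can be checked numerically. The following is a minimal sketch; the variable names are illustrative, and the negated log is used as one assumed example of a transformation that is injective but not increasing.

```python
import math

theta_space = [1, 2, 3]  # restricted parameter space from the example
x = 2                    # the single observed count


def likelihood(lam):
    """Poisson probability of observing x when the rate is lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


# Strictly increasing transformation (log): the maximizer is unchanged.
argmax_lik = max(theta_space, key=likelihood)
argmax_log = max(theta_space, key=lambda lam: math.log(likelihood(lam)))

# Injective but decreasing transformation (-log): the maximizer moves.
argmax_neglog = max(theta_space, key=lambda lam: -math.log(likelihood(lam)))

print(argmax_lik, argmax_log, argmax_neglog)
```

Maximizing the likelihood or its log picks the same rate, while maximizing the negated log picks the rate with the smallest likelihood, which is why "injective" alone does not guarantee the same maximizer.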

Possibly unnecessary algebraic complexity edit

The section on Continuous distribution, continuous parameter space [5] contains the following pair of formulae:

 

or more conveniently,

 

As far as I can see, in working through the derivations that follow, the second formula (described as "convenient") doesn't actually add anything. Am I missing the illumination that this second formula is supposed to provide? Attic Salt (talk) 15:11, 2 July 2019 (UTC)Reply

It's "convenient", because it isolates the MLEs for the mean,  , and standard deviation,  .
If you want to add this, I would support that. However, I'm not going to take the time to do it myself. DavidMCEddy (talk) 12:49, 3 July 2019 (UTC)Reply
Yes, I see that, but this doesn't affect the derivations of the MLE that follows. The MLE is algebraically easier to obtain without that expansion. Attic Salt (talk) 12:55, 3 July 2019 (UTC)Reply
I went ahead and simplified the derivation. Attic Salt (talk) 17:10, 11 July 2019 (UTC)Reply

Unclear conclusion on continuous parameter space edit

Hi, the section https://en.wikipedia.org/wiki/Maximum_likelihood_estimation#Discrete_distribution,_continuous_parameter_space states that the derivative of the distribution when set to 0, "has solutions p = 0, p = 1, and p = 49/80". It's not clear to me where "p = 49/80" comes from. Perhaps I'm missing something obvious, but if not, maybe a clarifying comment would be a helpful addition to the page. Thanks. — Preceding unsigned comment added by 146.66.57.203 (talk) 05:17, 10 October 2019 (UTC)Reply

I just changed the verbiage to make that clearer. Thanks, DavidMCEddy (talk) 07:32, 10 October 2019 (UTC)Reply
Got it now, many thanks. — Preceding unsigned comment added by 146.66.57.203 (talk) 15:21, 10 October 2019 (UTC)Reply
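
For anyone else with the same question, here is a minimal sketch of where the three solutions come from. It assumes the article's example is a binomial likelihood proportional to p^49 (1 − p)^31, that is, 49 successes in 80 trials; those counts are an assumption inferred from the quoted value 49/80 and are not stated in this thread. The derivative factors as p^48 (1 − p)^30 (49 − 80p), which vanishes at p = 0, p = 1 and p = 49/80.

```python
from fractions import Fraction

heads, tails = 49, 31  # assumed counts: 49 successes in 80 trials


def likelihood(p):
    """Binomial likelihood up to the constant binomial coefficient."""
    return p ** heads * (1 - p) ** tails


def d_likelihood(p):
    """Derivative of the likelihood with respect to p (product rule)."""
    return (heads * p ** (heads - 1) * (1 - p) ** tails
            - tails * p ** heads * (1 - p) ** (tails - 1))


# Exact rational arithmetic, so the derivative is exactly zero at each candidate.
candidates = [Fraction(0), Fraction(1), Fraction(heads, heads + tails)]
roots = [p for p in candidates if d_likelihood(p) == 0]

# The likelihood is zero at p = 0 and p = 1, so the interior root is the maximizer.
p_hat = max(candidates, key=likelihood)

print(roots, p_hat)
```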


typo in asymptotic behavior? edit

In the section "consistency", look at the expression after the sentence "it can also be shown that the maximum likelihood estimator converges in distribution to a normal distribution". I think that the √n should not be there, or at least it depends on the definition of the log-likelihood that enters the information matrix I. If I is based on the log-likelihood of one datum, then we need the √n. If I is based on the log-likelihood of n data, then the √n should not be explicit in the formula. Indeed, the German page does not report the √n, and I think that it is reported here in the English one because whoever wrote it took the formula from the cited source without realizing that in the source the log-likelihood was defined for one datum only. — Preceding unsigned comment added by 176.242.186.224 (talk) 28 May 2020 (UTC)
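
The two conventions described in this comment can be compared with a small simulation. This is a sketch under assumed settings that are not from the article: a Bernoulli model with true p = 0.3, n = 200 observations per sample, and 4000 replications. For a Bernoulli observation the Fisher information of one datum is 1 / (p(1 − p)), and the information of n data is n times that.

```python
import random
import statistics

random.seed(0)
p0, n, reps = 0.3, 200, 4000  # assumed true parameter, sample size, replications

info_one = 1.0 / (p0 * (1.0 - p0))  # Fisher information of a single observation
info_n = n * info_one               # Fisher information of the whole sample

# The MLE of p is the sample proportion; simulate its sampling distribution.
estimates = [
    sum(random.random() < p0 for _ in range(n)) / n
    for _ in range(reps)
]
var_hat = statistics.pvariance(estimates)

# Convention A: sqrt(n) * (p_hat - p0) ~ N(0, 1 / info_one), so Var(p_hat) = 1 / (n * info_one).
var_a = 1.0 / (n * info_one)
# Convention B: (p_hat - p0) ~ N(0, 1 / info_n), with no explicit sqrt(n).
var_b = 1.0 / info_n
# Mixing the two (explicit sqrt(n) together with the n-sample information) is off by a factor n.
var_mixed = 1.0 / (n * info_n)

print(var_hat, var_a, var_b, var_mixed)
```

Both consistent conventions give the same approximate variance for the estimator and agree with the simulated value, whereas combining an explicit √n with an information matrix already defined for n data understates the variance by a factor of n.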

Efficiency subsections X_t subscript edit

I'm simply wondering the meaning of the t subscript on the random variable in these sections. If it isn't meaningful, it should be removed for consistency, otherwise it should be explained briefly. Cheers. Moo (talk) 19:45, 29 May 2022 (UTC)Reply

Adding Expectation-Maximization to Methods or Link edit

Several methods of calculating the MLE are given (Newton-Raphson, Fisher Scoring, etc.) but no mention of the EM algorithm. It is used so often in actually calculating the MLE that it should either be part of the Iterative Procedures list or part of the See Also. 108.18.248.126 (talk) 13:32, 5 January 2023 (UTC)Reply