Talk:Softmax function


Probability distribution

"to normalize the output of a network to a probability distribution over predicted output classes. " Maybe I am misunderstanding something but you cannot just normalize something to a probability distribution. Not everything that takes a set and assigns values in [0,1] which sum to 1 is a probability distribution. It is not even clear what the probability space would be. — Preceding unsigned comment added by SmnFx (talkcontribs) 07:56, 22 October 2020 (UTC)Reply

Origin

To my knowledge, the softmax function was first proposed in

J. S. Bridle, “Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition,” in Neurocomputing, F. F. Soulié and J. Hérault, Eds. Springer Berlin Heidelberg, 1990, pp. 227–236.

and

J. S. Bridle, “Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters,” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. Morgan-Kaufmann, 1990, pp. 211–217.

Would it be appropriate to mention these publications, or at least their author, in the article? --131.152.137.39 (talk) 13:40, 13 November 2015 (UTC)

Yes, appropriate to mention the publications, certainly. (No point mentioning the author without the publications.) --mcld (talk) 15:12, 8 March 2017 (UTC)

Family of functions

From reading Serre 2005, it sounds as though there are multiple definitions of softmax. Is this the case? What other representations are there? It looks like Riesenhuber and Poggio, 1999b, and Yu et al., 2002, as referenced in Serre 2005, might give clues. JonathanWilliford (talk) 23:53, 11 August 2009 (UTC)

There may have been multiple definitions, but this is the only one I encounter in computer science literature, so perhaps it has achieved consensus name-wise. --mcld (talk) 15:13, 8 March 2017 (UTC)

Content

Is this acceptable content? 95% of the article is a direct copy of the content from [1]. Jludwig (talk) 06:23, 10 May 2008 (UTC)

I agree with this concern. I am deleting most of the content of the article, per the concern of copyright violation. 128.197.81.32 (talk) 22:00, 30 July 2008 (UTC)

Furthermore, most of the terms are not defined. You can't just copy a bunch of equations into a document and not define the terms; it's terrible. People with a background in stats can guess what most of the variables mean, but regardless, I hope whoever wrote this will return and define every variable that appears in an expression. Chafe66 (talk) 22:40, 29 October 2015 (UTC)

Derivation

What about the derivation of the softmax function? —Preceding unsigned comment added by 78.34.250.44 (talk) 21:28, 13 March 2009 (UTC)

John D. Cook's definition is different from all of these

I followed the external link to the description of softmax as a substitute for maximum by John D. Cook. There, the softmax is described as

     \log\left(e^{z_1} + e^{z_2} + \cdots + e^{z_K}\right)

not

     \sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}

as in Wikipedia. His version makes more sense to me. Can anyone corroborate this? I think the article needs fixing. But since there seem to be multiple definitions, it's hard to be clear. --mcld (talk) 15:31, 2 January 2014 (UTC)

I saw this on Cook's blog and I was highly surprised. Apparently this is a completely different function that is also called the softmax; I've never seen it in use. QVVERTYVS (hm?) 15:28, 5 February 2014 (UTC)
I checked Cook's blog post again, and found that he doesn't even call his function softmax; he calls it "soft maximum". Removed the link as it's quite unrelated. QVVERTYVS (hm?) 17:36, 5 February 2014 (UTC)

Cook's post is very informative on the smooth maximum. There seems to be no natural place for a smooth-maximum subsection within the softmax article. Moving it to a new page. — Preceding unsigned comment added by Yodamaster1 (talk | contribs) 16:50, 20 February 2015 (UTC)

Thanks for moving that content to Smooth maximum. I also discovered the page LogSumExp, and I think those two should be merged. --mcld (talk) 15:27, 8 March 2017 (UTC)
Should someone mention that the gradient of the LogSumExp is the softmax?
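For reference, the identity is straightforward to verify by differentiating the LogSumExp directly (the notation here is chosen to match the equations above, with input vector \mathbf{z} = (z_1, \ldots, z_K)):

     \frac{\partial}{\partial z_j} \log\left(\sum_{k=1}^{K} e^{z_k}\right) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} = \sigma(\mathbf{z})_j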

Possible?

Could the explanation of how the function works possibly be any more incomprehensible?

Apparently, there's an extremely well-developed culture on Wikipedia in which everyone is expected to know a bunch of inscrutable variable-name conventions. Either that, or writers really are convinced those conventions are as solidly established as "+", "-", "×", etc. I mean, not even "n" (usually used to mean "number of elements") is conventional enough in many cases, especially considering how often it is used with other meanings.

I'm really fed up with this, and this article is among the poorest examples of this that I've found so far. — Preceding unsigned comment added by 151.227.23.87 (talk) 17:27, 1 August 2015 (UTC)

The definition of the softmax as provided on Wikipedia simply does not make sense. The output of the softmax cannot possibly be the cube (0,1)^k, as (0.8, 0.8, 0.8, 0.8, 0.8, ..., 0.8) is in the cube but is not the output of the softmax. Someone fix this.

The definition has been changed to an even more wrong version. How can the \sigma be a function as well as a vector in R^N at the same time? This is the worst article on Wikipedia by far. — Preceding unsigned comment added by 111.224.214.171 (talk) 03:28, 28 May 2018 (UTC)

The hyperbolic tangent function is almost linear near the mean, but has a slope of half that of the sigmoid function.

The subject sentence does not appear to be correct. Near x = 0, tanh(x) has a derivative of 1.0, and the sigmoid 1/(1+exp(-x)) has a derivative of approximately 0.25. — Preceding unsigned comment added by 129.34.20.23 (talk) 19:43, 14 June 2017 (UTC)
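A quick check with the standard derivative identities (writing \sigma(x) = 1/(1+e^{-x}) for the logistic sigmoid) bears this out:

     \frac{d}{dx}\tanh(x)\Big|_{x=0} = 1 - \tanh^2(0) = 1, \qquad \frac{d}{dx}\sigma(x)\Big|_{x=0} = \sigma(0)\,(1 - \sigma(0)) = \tfrac{1}{4},

so near zero the slope of tanh is four times that of the logistic sigmoid, not half.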

Possible Errors in the Second Equation on the Page

As of 2017-01-20, the page has:

     \sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \ldots, K.

On the left side, why is j outside the parenthesis?

On the right side, underneath, why is the counter k instead of j?

Maybe the equation should read:

     \sigma(z_j) = \frac{e^{z_j}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } j = 1, \ldots, K. — Preceding unsigned comment added by 216.10.188.57 (talk) 07:48, 20 January 2018 (UTC)
The original is correct. The left side means that the softmax function takes a vector as input and returns a vector, the jth component of this output being \sigma(\mathbf{z})_j. And the right side means to take e to the power of the jth component of the input and divide it by the sum of e to the power of each input component in turn (including the jth component and all others). Hozelda (talk) 09:50, 8 September 2020 (UTC)
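To make the vector-in, vector-out reading concrete, here is a minimal sketch in Python (NumPy); the function and variable names are mine, and the max-subtraction is only for numerical stability, it does not change the result:

    import numpy as np

    def softmax(z):
        """Map a vector z = (z_1, ..., z_K) to a vector of the same length whose
        jth component is exp(z_j) divided by the sum over all k of exp(z_k)."""
        z = np.asarray(z, dtype=float)
        shifted = z - z.max()       # subtract the maximum for numerical stability
        exps = np.exp(shifted)      # e to the power of each component
        return exps / exps.sum()    # each term divided by the sum over all components

    # The output is again a vector, and its components sum to 1:
    print(softmax([1.0, 2.0, 3.0]))        # approximately [0.09003057 0.24472847 0.66524096]
    print(softmax([1.0, 2.0, 3.0]).sum())  # 1.0 (up to floating-point rounding)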

Flagged as too technical and lacking context

I agree with other readers who have noted that this article is close to incomprehensible. The references

https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d 

and

https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax

are far more comprehensible. Until I paraphrase and integrate the content there with that on this page (with appropriate citations), I have flagged the problems with this article as a warning to those who actually hope to learn something from it. - Prakash Nadkarni (talk) 06:00, 10 December 2018 (UTC)

A function-weighted average, using exp

So, why does YOUR field need a whole journal? — Preceding unsigned comment added by 129.93.68.165 (talk) 18:42, 30 November 2021 (UTC)

Hierarchical softmax?

I realise that "hierarchical softmax" isn't actually softmax, but it is used to replace softmax for efficiency in machine learning contexts. Maybe there should be some clarification. I know I'm confused! akay (talk) 16:31, 23 December 2021 (UTC)Reply
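For what it's worth, the usual construction (e.g. Morin and Bengio 2005, and the word2vec papers) arranges the K classes as the leaves of a binary tree and replaces the single K-way normalisation with a product of binary sigmoid decisions along the path from the root to the class's leaf, roughly

     P(c \mid h) \approx \prod_{n \in \operatorname{path}(c)} \sigma\!\left(\pm\, v_n^{\top} h\right),

where the sign at each internal node n records whether the path branches left or right and v_n is that node's parameter vector. This reduces the per-example cost from O(K) to O(\log K), which is why it is used as a drop-in replacement for softmax even though it is a different function. A sentence along these lines, with the proper citations, might be the clarification the article needs.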