Wikipedia:Reference desk/Archives/Mathematics/2009 July 7

Mathematics desk
Welcome to the Wikipedia Mathematics Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


July 7

Take pity on the statistics illiterate

I have performed a linear regression on the dataset I collected experimentally (see below) in Excel, which spat out the following statistics:
slope (mn) = -1.720582243
y intercept (b) = 9.918110731
standard error for slope (SEn) = 0.284159964
standard error for y intercept (SEb) = 0.988586199
coefficient of determination (r2) = 0.148034996
standard error for y estimate (SEy) = 6.635638125
F statistic / F observed value (F) = 36.66275495
degrees of freedom (df) = 211
regression sum of squares (ssreg) = 1614.323183
residual sum of squares (ssresid) = 9290.687293

I have 2 hypotheses (at least I think I do):

  1. The independent variable and the dependent variable have some positive or negative correlation (slope !=0)
  2. (null) The independent variable and the dependent variable are not correlated (slope = 0)

For α = .05 I would like to know if I can reject my null hypothesis. I also want to know the likelihood of making a type II error given the below dataset (β = .05? what do you think is appropriate?), but I am not sure if I am asking the question correctly. I have a table here which might be helpful if I knew what to do with it. While it's one thing to get an answer, what I am really interested in is understanding the theory behind it, so that, given the appropriate statistics for any similar fit-to-a-line case, I can know confidently whether to accept or reject the null hypothesis. (I'm not so interested in how to calculate the numbers as much as how to apply them.) If anyone can direct me to resources that will help me in this endeavor I would appreciate it. I tried reading the WP articles, but I couldn't figure out if and how they apply to my case, and Simple Wikipedia didn't have all the articles. Also, there might be things I don't know about yet, like distribution type(?), that I have likely failed to take into account.
One thing about the data set I gave below is that there is a hidden variable (right term?) which is independent of the current independent variable. The dependent variable may also be correlated with this hidden variable. The problem is that the hidden variable has a non-constant signal-to-noise ratio, which is proportional to another factor (an experimental error which unfortunately cannot be eliminated). What this means is that the signal-to-noise ratio varies from point to point, but the points are roughly evenly distributed in the data set below. Will this affect anything? As a side question, when I have a data set where the dependent variable has a signal-to-noise ratio proportional to the independent variable, what should I do? Is there anything I can do? I appreciate any help. 152.16.15.144 (talk) 02:28, 7 July 2009 (UTC)[reply]

Lengthy table of data ...
Independent variable Dependent variable
1 25.29725806
1 25.19327419
1 24.89158065
1 38.37648387
1 43.24045161
1 2.672193548
1 1.080225806
1 6.004677419
1 0.991322581
1 22.55169355
1 24.70856452
1 50.8436129
1 3.249870968
1 15.8688172
1 16.6713871
1 17.48212903
1 13.4363871
1 10.32890323
1 7.838086022
1 6.430483871
1 2.504774194
1 11.89580645
1 14.13787097
1 9.653354839
1 5.079769585
1 7.854914611
1 4.517450572
1 8.758021505
1 5.535249267
1 5.52836129
1 3.04296371
1 5.572699491
1 5.208481183
1 4.733164363
1 3.121344086
1 3.796580645
1 4.292361751
1 2.484580645
1 3.350993548
1 3.446548387
1 5.803467742
1 6.386572581
1 3.54125
2 1.534629032
2 0.114096774
2 12.69280645
2 24.59641935
2 19.31422581
2 6.393193548
2 -0.471645161
2 1.459354839
2 -0.400741935
2 1.429693548
2 7.022032258
2 2.890951613
2 10.23980645
2 13.70490323
2 3.858548387
2 4.442193548
2 10.66690323
2 7.737096774
2 1.286
2 1.301612903
2 2.211794355
2 1.884
2 1.136105991
2 3.449322581
2 1.903182796
2 1.196236559
2 3.046072106
2 4.296149844
2 5.391397849
2 1.94026393
2 4.501032258
2 1.413891129
2 7.305297114
2 2.544771505
2 3.618387097
2 1.834032258
2 3.124204301
2 2.959930876
2 0.796467742
2 0.959883871
2 1.34433871
2 5.019489247
2 3.893682796
2 2.597610887
3 16.34480645
3 0.629580645
3 1.067532258
3 28.70645161
3 0.398258065
3 0.113
3 0.454064516
3 1.075096774
3 0.860919355
3 5.275290323
3 5.302580645
3 10.04935484
3 2.613419355
3 1.395483871
3 2.874096774
3 2.744516129
3 1.996516129
3 1.063516129
3 0.785913978
3 0.612479839
3 0.501387097
3 0.494193548
3 1.653935484
3 1.316580645
3 1.122396313
3 1.871954459
3 2.647991675
3 2.802688172
3 1.325630499
3 3.578619355
3 2.20890121
3 2.052003396
3 3.66438172
3 2.255084485
3 1.416462366
3 2.288365591
3 3.04906682
3 0.893790323
3 0.556735484
3 0.863870968
3 2.55391129
3 1.700349462
3 1.069405242
4 14.23806452
4 0.657870968
4 -0.110032258
4 -0.245322581
4 0.582370968
4 9.728467742
4 3.702290323
4 0.858516129
4 0.522688172
4 1.476096774
4 1.709806452
4 0.40883871
4 0.558032258
4 0.931935484
4 0.47640553
4 1.337709677
4 0.699612903
4 1.503563748
4 1.905180266
4 2.2289282
4 2.988903226
4 1.814428152
4 2.907045161
4 1.17172379
4 1.383106961
4 1.688239247
4 2.451136713
4 1.517903226
4 1.823483871
4 2.026693548
4 0.405435484
4 0.760954839
4 1.009725806
4 0.976827957
4 1.04469086
4 1.079899194
5 11.59672581
5 -0.581967742
5 0.082258065
5 1.232306452
5 5.375903226
5 4.49766129
5 5.74083871
5 0.425096774
5 0.667634409
5 0.71083871
5 0.37163871
5 2.051096774
5 0.717806452
5 0.574602151
5 0.964979839
5 3.095552995
5 1.451516129
5 1.398924731
5 2.345578748
5 2.306580645
5 1.026510264
5 1.158669355
5 2.976706989
5 2.725729647
5 0.975075269
5 1.706139785
5 1.394781106
5 0.439758065
5 0.622812903
5 0.860209677
5 1.160564516
5 1.003333333
5 0.938316532
6 11.6323871
6 0.313129032
6 0.734387097
6 4.214887097
6 3.015870968
6 3.516967742
6 0.530752688
6 1.127258065
6 1.198878648
6 1.397887097
7 0.7575
7 4.220516129
7 1.431096774
8 4.253354839
The low value of your coefficient of determination, 0.148034996, tells you that the dependent variable and the independent variable have very little correlation i.e. the data is a very poor fit to its regression line. You would need a coefficient of determination much closer to 1 before you could conclude that there was any significant correlation.
NONSENSE!!! See my remarks below. Michael Hardy (talk) 22:31, 8 July 2009 (UTC)[reply]
Actually, if you plot the data you can tell by eye that there is almost no correlation - the statistics just give a quantitative confirmation of this. Without a coefficient of determination close to 1, the slope and y intercept values are meaningless - a linear regression line can be calculated for any set of data, but that does not mean that the data points will lie close to the line.
It appears that the "noise" from experimental error is swamping any effect that you are trying to detect. You could perhaps try to filter out this noise by looking at the mean of the dependent variable values for each value of the independent variable, 1 to 8, and see if there is any correlation in this set of mean values. You probably need to get some more data points for values 7 and 8. Gandalf61 (talk) 09:10, 7 July 2009 (UTC)[reply]
The F-statistic of about 36 settles the matter; the correlation is significant with even a far lower value of alpha than the one you mention. How much correlation is significant depends on the sample size; even a very small correlation would be highly significant if the sample were large enough. Michael Hardy (talk) 22:31, 8 July 2009 (UTC)[reply]
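For anyone who wants to check this numerically, here is a minimal sketch (Python with SciPy; any statistics package would do) that turns the summary statistics quoted at the top into a p-value for the slope:

from scipy import stats

# Summary statistics reported by the original poster (Excel regression output)
slope = -1.720582243
se_slope = 0.284159964
df = 211                       # n - 2, so n = 213 observations

t = slope / se_slope           # about -6.05; note t**2 is the reported F of 36.66
p_two_sided = 2 * stats.t.sf(abs(t), df)

F = 36.66275495                # equivalently, use the F statistic on (1, 211) df
p_from_F = stats.f.sf(F, 1, df)

print(t, p_two_sided, p_from_F)   # p is on the order of 1e-8

Since the p-value is far below α = .05, the null hypothesis of zero slope is rejected, which is the point made just above.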
Michael - I am confused by your interpretation of the F-statistic value. Lack-of-fit sum of squares says "One uses this F-statistic to test the null hypothesis that the straight-line model is right ... one rejects the null hypothesis if the F-statistic is too big ...". Seems to me that the high value of the F-statistic supports my assertion that the data is a very poor fit to its regression line and there is very little evidence of a linear correlation here (I accept that my original statement that there is no correlation at all is too general - I should have been more specific). Am I wrong? Gandalf61 (talk) 08:19, 9 July 2009 (UTC)[reply]
This F-statistic has nothing at all to do with the other F-statistic treated in lack-of-fit sum of squares. In particular, this F-statistic has nothing to say about whether the model fits well or not. You said the correlation would have to be much higher to be statistically significant. That was completely wrong, as shown by this F-statistic. On a different issue, you said the line doesn't fit well at all. That is correct for reasons having nothing to do with those you cited. The summary statistics mentioned above do not tell you whether fitting a line makes sense or not; they are completely silent on that issue, and you are wrong to try to use them for that purpose. But instead of those summary statistics, if you look at the actual scatterplot, you will see instantly that fitting a line is the wrong thing to do in this case. Michael Hardy (talk) 18:55, 9 July 2009 (UTC)[reply]
I'm not sure a correlation of -0.385 is that low; it depends entirely on context. Spearman's rank correlation is -0.515. The p-value for testing either against the null hypothesis of zero correlation is <0.0001. But I'd agree it's useful (essential, even) to plot the data, and plotting the mean, or the median, of the y values for each value of x is a good idea. Both the mean and the median clearly decrease over the first 3 or 4 values of x, but for larger x it's pretty flat (though it's hard to be sure for x>5 as the number of observations decreases rapidly). That shows that the straight-line fit from linear regression is unlikely to be appropriate. Also the variance decreases strongly with increasing x, which again means linear regression is unlikely to be appropriate (at least not without some transformation). What is appropriate depends a lot on your scientific objectives, which you don't tell us.
Wikipedia isn't really designed for learning basic statistics, and there's not much on Wikibooks or Wikiversity about regression as yet. There are a few external links at the end of the statistics article that might be useful. Neither would I recommend doing statistical analysis in Excel; there are plenty of statistical packages available, quite a few at no cost. Qwfp (talk) 17:23, 7 July 2009 (UTC)[reply]
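To make those checks concrete, here is a sketch along the lines Qwfp describes (Python with NumPy/SciPy; the arrays x and y are assumed to hold the 213 data points listed above, and the function name is mine):

import numpy as np
from scipy import stats

def summarize(x, y):
    """Pearson and Spearman correlations, plus the median of y
    at each distinct value of x, as suggested above."""
    r, p_r = stats.pearsonr(x, y)        # should come out near -0.385
    rho, p_rho = stats.spearmanr(x, y)   # should come out near -0.515
    medians = {int(v): float(np.median(y[x == v])) for v in np.unique(x)}
    return r, p_r, rho, p_rho, medians

Plotting the medians returned here against x shows the decrease over the first few values of x and the flattening afterwards.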

That this simple least-squares fit is not appropriate can be seen by looking at the scatterplot. Just eyeballing it suggests maybe taking logarithms of the y-values first, and regressing that on x. But five of the 213 y-values are negative, so that has its drawbacks. Michael Hardy (talk) 22:45, 8 July 2009 (UTC)[reply]

Gandalf wrote:

You could perhaps try to filter out this noise by looking at the mean of the dependent variable values for each value of the independent variable, 1 to 8, and see if there is any correlation in this set of mean values.

That is an extraordinarily bad idea. See ecological fallacy. Michael Hardy (talk) 22:53, 8 July 2009 (UTC)[reply]

OK, looking at this again after a couple of days: I've pushed this through a standard statistical software package and I've done the lack-of-fit F-test mentioned above. The numbers given by the original poster agree with what I'm getting. I've also looked at the scatterplot of the original data and of the data with logarithms of the y-values in place of the y-values given. I am a bit suspicious of the five small negative y-values (and those are of course excluded from the data set with logarithms of y-values).

Some conclusions:

  • Just glancing at the scatterplot, it is obvious that the correlation is overwhelmingly statistically significant.
  • Just glancing at the scatterplot, it is obvious that fitting a straight line is not a good way to model this.
  • The lack-of-fit F-test gives an F value of about 4.19, with 6 numerator degrees of freedom and 205 denominator degrees of freedom. The p-value is about 5 × 10⁻⁴, so this is highly significant—no surprises there. In other words, this agrees with the naked eye: the straight-line model doesn't fit. (A sketch of this computation appears after this comment.)
  • With logarithms of the y-values, fitting a straight line, the residuals look more-or-less normally distributed (I did a rankit plot). That doesn't mean that model is really good, but it's far better than with the raw y-values.

Michael Hardy (talk) 01:52, 11 July 2009 (UTC)[reply]
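For the record, here is a sketch of the lack-of-fit computation from the third bullet above (Python with NumPy/SciPy; x and y are again assumed to hold the data listed earlier):

import numpy as np
from scipy import stats

def lack_of_fit_test(x, y):
    """Pure-error lack-of-fit F test for a straight-line fit.
    With the data above this should give F of about 4.19 on (6, 205) df."""
    slope, intercept = np.polyfit(x, y, 1)
    ss_resid = np.sum((y - (intercept + slope * x)) ** 2)
    # Pure error: variation of y within each repeated x level
    levels = np.unique(x)
    ss_pe = sum(np.sum((y[x == v] - y[x == v].mean()) ** 2) for v in levels)
    ss_lof = ss_resid - ss_pe            # lack-of-fit sum of squares
    df_lof = len(levels) - 2             # 8 levels - 2 fitted parameters = 6
    df_pe = len(x) - len(levels)         # 213 - 8 = 205
    F = (ss_lof / df_lof) / (ss_pe / df_pe)
    return F, stats.f.sf(F, df_lof, df_pe)   # p of about 5e-4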

OK, one more point: Use logarithms of the y-values and fit a parabola. Then the lack-of-fit test doesn't reject the hypothesis that the model fits. You might want to think about outliers—there are four or five exceptionally small values.

So if the original poster is still here, I can elaborate further if you have questions. I'm being terse because I'm not sure anyone's paying much attention. Michael Hardy (talk) 01:59, 11 July 2009 (UTC)[reply]

Probability in card games

[Image: a game of Klondike solitaire in progress]

What is the probability that one will win "money" in Microsoft's basic Solitaire program, a version of the Klondike card game? In this game, one must pay $52 at the beginning of a game, and one earns $5 per card stacked in the four piles in the row on the uppermost level, so to win "money", one must put up at least eleven cards. I realize that the odds of winning the kind of game I describe outright are altogether unknown, so I'm not asking for that. Nyttend (talk) 04:18, 7 July 2009 (UTC)[reply]

I have no idea how to find that (without prohibitive computational cost), but just to clarify - are you assuming perfect play or something else? Is the strategy optimized for winning the game or for having a positive balance? -- Meni Rosenfeld (talk) 09:35, 7 July 2009 (UTC)[reply]
Sorry, but I'm not familiar with "perfect play" or the strategy details that you ask. I mean playing it strictly as the computer game is generally set up: one card is drawn at a time, and the player knows only the cards that are face-up in this picture. Unlike in this picture, which depicts something similar to Microsoft's "Standard" mode that counts points and permits reshuffling, I'm asking about Microsoft's "Vegas" mode, which does not permit reshuffling. Nyttend (talk) 13:13, 7 July 2009 (UTC)[reply]
The game involves a player who needs to decide which moves to play. The outcome of the game depends on the choices the player makes. You can't ask "what is the probability of such-and-such outcome" without any reference to how the player chooses his moves.
"Perfect play" means just that - the player chooses, at every step, the best move; the one that maximizes whatever it is the player wishes to maximize. The player can choose to shoot for having the highest probability possible of winning the game fully. The best move would then be the one that maximizes the probability of winning. Alternatively, the player can choose to maximize the probability of finishing the game with a positive amount of cash.
Of course, unless the game is trivial, humans can't play perfectly, and neither can contemporary computers. So you'll have to be clear about how exactly the player should act. A systematic way to choose a move in any given situation is called a strategy.
You can also ask what is the probability that for a random game, there will be some sequence of moves which results in positive cash. Note that this sequence of moves absolutely cannot be found in actual play, because the player knows the location of only some of the cards.
For more background information about this you should read up on game theory, noting that you have here a single-player, non-deterministic game with imperfect information, in extensive form.
If you have an algorithm which passes for a strategy, the best way to estimate the probability you want is by running a simulation. -- Meni Rosenfeld (talk) 16:32, 7 July 2009 (UTC)[reply]
From my OR with that game, I would say that you cannot give a probability that you will win $ without giving a number of games you will be playing. Your odds of a positive balance after 1 game will be radically different from the odds of a positive balance after 100 games. I strongly suspect that this game has a losing expectation, and your best bet is to not play. 65.121.141.34 (talk) 13:57, 7 July 2009 (UTC)[reply]
Nyttend probably means that one game is played. -- Meni Rosenfeld (talk) 16:32, 7 July 2009 (UTC)[reply]
It is going to depend on your skill level, though. I just played 3 games and made a profit in one of them (and a net loss overall). That suggests (rather imprecisely) that the probability of me making a profit is 33%. I am quite certain I am not the best solitaire player in the world, so I expect better players have a significantly higher probability, and very likely a positive expectation (I made a net loss of about $40, so just three more cards per game would give me a net profit). --Tango (talk) 18:33, 7 July 2009 (UTC)[reply]
Thanks for the input! I failed to say 'assuming that I make what is always the best decision, given what I know', and simply trying to get as much "money" as possible. Of course, decisions could ultimately be bad despite looking good (getting $5 for a ♠4 could be bad because it doesn't give me anywhere to put a later-appearing ♥3), so I just meant 'as good as I could know'. Can't find a source for this right now, but I believe that it's named Klondike for an enterprising storekeeper during the Klondike Gold Rush, who discovered that the average card game loses money. Thus I play on the computer, not with real money :-) Nyttend (talk) 20:14, 7 July 2009 (UTC)[reply]
By the way, I should note — I left this comment here because I remember reading in a junior high school textbook that probability theory was first developed because of card games. How is it possible to have probability for some card games but not for this? I can't imagine a card game where human decision-making isn't a significant part of the game. Nyttend (talk) 20:16, 7 July 2009 (UTC)[reply]
Ok, so we're talking about perfect play, but "trying to get as much 'money' as possible" is still not well-defined enough. Are you trying to maximize the expectation of the money, the probability of having positive money after one game, or something else? Note that the expected gain being positive has little to do with the probability of positive balance being greater than half (mean vs. median).
I find it unlikely that the enterprising storekeeper would have found a perfect strategy and calculated its expectation, so I'm guessing he measured the average of a typical player. Of course, "a typical player" is far from being rigorously defined, so not much can be said about it math-wise.
I can't answer the last question fully, but I think it mostly has to do with the number of decisions involved and the choice of which probability to try computing. -- Meni Rosenfeld (talk) 20:56, 7 July 2009 (UTC)[reply]
Of course on the storekeeper; I'm sure that it was just quickly obvious that most games got less than eleven cards put back up. As far as my goal — I want to get at least eleven cards up, so that I "make money": of course, more is better, but this question is meant to ask "what percentage [or proportion] of games could be expected to result in any gain?" Nyttend (talk) 22:42, 7 July 2009 (UTC)[reply]
You're still not very clear. The goal 'maximize the chance of making money in one game' may not be consistent with 'more is better'. Which do you actually mean? Algebraist 23:12, 7 July 2009 (UTC)[reply]
I'm saying that my ultimate goal is to make any "money": therefore, if I can put up 11 cards, I'll not run any risks afterward. If I wanted to maximise my "money", I might put down the previously-mentioned ♠4 so as to make a place for the later-appearing ♥3, in hopes that I might use that card later; since I don't want to run any risks, I'll not consider doing that. However, I'm not simply going to stop the game once I get my eleventh card: if a card appears that I could put up as the twelfth, I'll not leave it down simply because I have eleven and don't care anymore. In short — although I'll keep trying to make money after I get into the black, I'll not do anything that puts me back into the red. I brought this up to make it clear that my behaviour after getting the eleventh card is (as far as I can see) irrelevant to the original question. Nyttend (talk) 00:18, 8 July 2009 (UTC)[reply]
Ok, that makes more sense. You are interested in a strategy that maximizes the probability of having positive money, and among all strategies that maximize that, one that maximizes the expected gain. As you have correctly noted, your "secondary objective" and your actions after 11 cards were placed (as long as you're not undoing your work) have no effect on the probability of successfully placing 11 cards.
Now that we know what the question is, we can go back to the fact that I have no idea how to find the answer.
Going back to the storekeeper, I'll say again that "the average card game loses money" is very different from "most games got less than eleven cards put back up". -- Meni Rosenfeld (talk) 10:51, 8 July 2009 (UTC)[reply]


Card games such as this are hard to calculate probabilities for analytically, but you can take a different approach. If you know how to program computers, you can write a program to generate a random deck, deal itself a game, and then play it, according to whatever strategy you decide to code in. Run that program 1,000,000 times, and keep track of the results. The probabilities you're looking for should appear empirically in the data, by the law of large numbers. I hope that helps... -GTBacchus(talk) 17:33, 10 July 2009 (UTC)[reply]
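To make that suggestion concrete, here is a skeleton of such a simulation (Python; play_one_game is a placeholder of my own invention, since the real work is writing a Klondike engine and a strategy to put inside it):

import random

def play_one_game():
    """Placeholder for a real Klondike engine plus strategy. It should
    deal a shuffled deck, play the game out under the chosen strategy,
    and return the net 'money' (5 * cards_placed - 52). The coin flip
    below is only a stand-in so that the harness runs."""
    return 55 - 52 if random.random() < 0.5 else -52

def estimate_profit_probability(trials=1_000_000):
    wins = sum(1 for _ in range(trials) if play_one_game() > 0)
    p = wins / trials
    se = (p * (1 - p) / trials) ** 0.5   # normal-approximation standard error
    return p, se

With a million trials the standard error is at most about 0.0005, so for any fixed strategy the empirical proportion pins the probability down quite precisely.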

Inverse of a Matrix Mod 2

I have a square matrix with binary entries and I want to find its multiplicative inverse mod 2. I think (and correct me if I am wrong) this means finding the inverse using operations mod 2. So every time I add or multiply, I need to do those operations using mod 2 arithmetic. The problem is that my matrices are large, such as 9x9 or 16x16, so doing them by hand is not even an option. My question is: is there some way to do this in MATLAB or Mathematica? Are there any built-in functions, packages, or programs/m-files that I can use to do this quickly for either program? What about PARI? I know very little about PARI, so if someone knows how to do it in PARI, any help with the commands will be appreciated. Thanks! -Looking for Wisdom and Insight! (talk) 08:37, 7 July 2009 (UTC)[reply]

Why not just take the inverse and reduce mod 2? 98.210.252.135 (talk) 08:46, 7 July 2009 (UTC)[reply]

Because the inverse under the usual addition and multiplication may not even have ANY integer values so taking mod 2 doesn't even make sense.-Looking for Wisdom and Insight! (talk) 08:58, 7 July 2009 (UTC)[reply]

But if the matrix is invertible mod 2, the denominators will all be odd (since they must divide the determinant), so those just go away when you reduce mod 2. Alternatively, you can multiply through by the determinant to get the adjugate and clear the fractions first. But the suggestion below sounds better anyway. 98.210.252.135 (talk) 17:13, 7 July 2009 (UTC)[reply]
In Mathematica, where A is your matrix:
Inverse[A, Modulus -> 2]
Obviously there are also other ways to do this (e.g., using the adjugate of A taken mod 2, which is not efficient). -- Meni Rosenfeld (talk) 09:28, 7 July 2009 (UTC)[reply]
Remember to check first whether the matrix is invertible, which can be determined by calculating its determinant: if it is 1 mod 2 then the matrix is invertible, but if it is 0 mod 2, then it is not. MuZemike 23:50, 13 July 2009 (UTC)[reply]

You can always use straightforward Gaussian elimination in the finite field GF(2). 208.70.31.206 (talk) 02:46, 8 July 2009 (UTC)[reply]
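For instance, here is a minimal sketch of that approach in Python (the function name is mine); over GF(2), subtraction is just XOR, so Gauss-Jordan elimination is particularly simple:

def inverse_mod2(A):
    """Invert a square 0/1 matrix over GF(2) by Gauss-Jordan elimination.
    A is a list of lists; raises ValueError if A is singular mod 2."""
    n = len(A)
    # Build the augmented matrix [A | I], reducing entries mod 2
    M = [[A[i][j] % 2 for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for col in range(n):
        # Find a pivot row with a 1 in this column
        pivot = next((r for r in range(col, n) if M[r][col]), None)
        if pivot is None:
            raise ValueError("matrix is singular mod 2")
        M[col], M[pivot] = M[pivot], M[col]
        # Clear the column everywhere else by XORing with the pivot row
        for r in range(n):
            if r != col and M[r][col]:
                M[r] = [a ^ b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

A 9x9 or 16x16 matrix is far below the size at which this naive cubic-time elimination becomes slow.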

Understanding the basic quadratic formula proof

I was trying to follow the proof of the quadratic formula by the completing-the-square method, but from   I can see how   but don't see how   . My reasoning would be   , as you would divide both sides by 2a, and when you divide by a fraction you invert the denominator, so   to get   . Where am I going wrong, if at all? --Dbjohn (talk) 19:12, 7 July 2009 (UTC)[reply]

Where did   come from? If  , then  . Algebraist 19:21, 7 July 2009 (UTC)[reply]

Okay, I see where I went wrong: my mind was getting mixed up with someone else's explanation that used different letters. Anyway, the mistake I was making is that you need to get "h" alone and find out what h equals, and   . Okay, I can follow the rest of it now. --Dbjohn (talk) 20:49, 7 July 2009 (UTC)[reply]
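For reference, here is the standard completing-the-square derivation written out in full:

ax^2 + bx + c = 0
\;\Longrightarrow\;
x^2 + \frac{b}{a}x + \frac{c}{a} = 0 .

Matching x^2 + \frac{b}{a}x against (x+h)^2 = x^2 + 2hx + h^2 forces 2h = \frac{b}{a}, so h = \frac{b}{2a}: one divides b/a by 2, not by 2a. Then

\left(x + \frac{b}{2a}\right)^2 = \frac{b^2}{4a^2} - \frac{c}{a} = \frac{b^2 - 4ac}{4a^2}
\;\Longrightarrow\;
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} .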

Solving the differential equation

What is the geometrical shape of a curve (in a plane) which would gather parallel beams of light, for example, into a single point? Trying to answer this question, I arrive at this differential equation (the assumptions are that the light beams are parallel to the x-axis, come in from +∞, and are to be focused at A(p,0)):

  , with initial condition:  

Although I know the parabola of the form   is an answer, I don't know how to solve this equation, or whether there is another solution to it or not. Re444 (talk) 22:38, 7 July 2009 (UTC)[reply]

Your initial equation can be written as
 , with  .
But then, call  . If I did the maths well, I arrived (for the first branch) at the simple  . Pallida  Mors 06:01, 8 July 2009 (UTC)[reply]
Oops, that's  . Z's and 2's tend to be confused... Pallida  Mors 06:08, 8 July 2009 (UTC)[reply]

Oh! thanks a lot! Re444 (talk) 09:52, 8 July 2009 (UTC)[reply]
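For completeness, here is a sketch verifying the known solution directly (this checks the parabola against the reflection law rather than retracing the substitution above). Take y^2 = 4px, whose focus is A(p, 0):

y^2 = 4px \;\Longrightarrow\; y' = \frac{2p}{y},
\qquad
m = \frac{y}{x - p} = \frac{4py}{y^2 - 4p^2},

where m is the slope of the line from the point (x, y) to the focus. A horizontal incoming ray is reflected through the focus exactly when the tangent makes equal angles with the ray and with the focal line, i.e. when the tangents of the two angles agree:

\frac{m - y'}{1 + m y'}
= \frac{2p\,(y^2 + 4p^2)}{y\,(y^2 - 4p^2)} \cdot \frac{y^2 - 4p^2}{y^2 + 4p^2}
= \frac{2p}{y} = y',

which holds identically (away from y^2 = 4p^2, where the focal line is vertical). Sliding the vertex along the axis gives the confocal family y^2 = 4a(x - p + a), each member of which also focuses horizontal rays at A(p, 0); the initial condition singles out one of them.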