Wikipedia:Reference desk/Archives/Mathematics/2015 May 31

Mathematics desk
< May 30 << Apr | May | Jun >> June 1 >
Welcome to the Wikipedia Mathematics Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


May 31 edit

Question in Statistics edit

I am trying to solve a Question in Statistics, for which we are using R and SAS, and it is about a Survey of a number of women, giving facts about themselves to determine whether or not they have Diabetes. We were given a Training Set of 200 people, then a test set of a further 332, and my understanding in Classification, is the training set is used to get a Model or equation to determine membership of either the group that has diabetes, or the one that does not. We assigned zero for no Diabetes, and 1 if the Lady did have Diabetes. We ran code given to us, and had to answer a number of questions which I did until the last, and this was to be given details of one extra woman, and to work out whether or not she either had diabetes or could be predicted to have it, and I am not sure how to do it. We carried out a Linear Discriminant Analysis, a Logistic Regression and a Quadratic Discriminant Analysis, and the summaries of the LDA and Logistic, which we are to use, are as follows, where below I have decided to show the whole Code : Assignment 4

  1. first, set the working directory to the data file location (this can be easily done in RStudio Menu/Session/Set working directory or by using setwd("~/path to working directory/")) import the ' ' separated .txt files

> setwd("P:/STAT315") > pima<- read.table("pima.txt",header=TRUE) >pima$type <- factor(pima$type) > pima_test <- read.table("pima_test.txt", header=TRUE) > pima_test$type <- factor(pima_test$type) > # Linear Discriminant Analysis > library(MASS) > (pima_lda <- lda(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, prior=c(0.66, 0.34))) Call : lda(type ~ npreg + glu + bp + skin + bmi + ped + age, data = pima, prior = c(0.66, 0.34)) Prior probabilities of groups: 0 1

   0.66 0.34 

Group means: npreg glu bp skin bmi ped age

 0   2.916667  113.1061   69.54545  27.20455  31.07424  0.4154848   29.23485
 1   4.838235  145.0588   74.58824  33.11765  34.70882  0.5486618   37.69118

Coefficients of linear discriminants: LD1 npreg 0.0794995781 glu 0.0240316424 bp -0.0018125857 skin -0.0008317413 bmi 0.0494891916 ped 1.2530603130 age 0.0314375125

# Variable tab the Name given to the two by two Table from the Pima Type Training Set as shown here

> tab <- table(pima$type, predict(pima_lda)$class)

# From the two by two in the Training Set Table of those with Diabetes and those Without, add              
# Row One Column Two to Row Two Column one, then divide by Total Number of Women, to get
#  the Training Error for the Linear Discriminant Analysis Model

> (tab[1,2] + tab[2,1])/sum(tab) [1] 0.23

  1. Variable tabtest the Name given to the two by two Table from the Pima Type Test Set as shown here

>tabtest<- table(pima_test$type, predict(pima_lda, newdata=pima_test)$class)

  1. From the two by two in the Test Set Table of those with Diabetes and those Without, add
# Row One Column Two to Row Two Column one, then divide by Total Number of Women, in                
#  order to obtain  the Error on a Test Set for the Linear Discriminant Analysis Model

> (tabtest[1,2] + tabtest[2,1])/sum(tabtest) [1] 0.2018072

  1. Cross Validation using the Package ipred

> library(ipred) > mypredict.lda <- function(object, newdata) predict(object, newdata = newdata)$class > errorest(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, model=lda, estimator="cv", predict=mypredict.lda, est.para=control.errorest(k=199)) Call: errorest.data.frame(formula = type ~ npreg + glu + bp + skin +

   bmi + ped + age, data = pima, model = lda, predict = mypredict.lda, 
   estimator = "cv", est.para = control.errorest(k = 199))
   199-fold cross-validation estimator of misclassification error 

Misclassification error: 0.245 > # Logistic Regression > lmod <- glm(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, family=binomial()) > summary(lmod) Call: glm(formula = type ~ npreg + glu + bp + skin + bmi + ped + age, family = binomial(), data = pima) Deviance Residuals: Min 1Q Median 3Q Max -1.9830 -0.6773 -0.3681 0.6439 2.3154 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -9.773062 1.770386 -5.520 3.38e-08 *** npreg 0.103183 0.064694 1.595 0.11073 glu 0.032117 0.006787 4.732 2.22e-06 *** bp -0.004768 0.018541 -0.257 0.79707 skin -0.001917 0.022500 -0.085 0.93211 bmi 0.083624 0.042827 1.953 0.05087 . ped 1.820410 0.665514 2.735 0.00623 ** age 0.041184 0.022091 1.864 0.06228 . Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1)

   Null deviance: 256.41  on 199  degrees of freedom

Residual deviance: 178.39 on 192 degrees of freedom AIC: 194.39 Number of Fisher Scoring iterations: 5 > pclass <- predict(lmod, newdata=pima_test, type="response") > 0.5 > pclass <- predict(lmod, newdata=pima_test, type="response") > tabtestlogistic <- table(pima_test$type, pclass) > (tabtestlogistic[1,2] + tabtestlogistic[2,1])/sum(tabtestlogistic) [1] 0.003012048

  1. Or :> pclass<-predict(lmod,newdata=pima_test,type="response")>.5

> tabtestlogistic <- table(pima_test$type, pclass) > (tabtestlogistic[1,2] + tabtestlogistic[2,1])/sum(tabtestlogistic) [1] 0.1987952

  1. This Time using the available Data to fit a Quadratic Discriminant Analysis Model

> setwd("P:/STAT315") > pima <- read.table("pima.txt", header=TRUE) > pima$type <- factor(pima$type) > pima_test <- read.table("pima_test.txt", header=TRUE) > pima_test$type <- factor(pima_test$type) > library(MASS)

  1. Setting up the Quadratic Discriminant Analysis Model

> (pima_qda <- qda(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, prior=c(0.66, 0.34)))

  1. Summary of the Model

Call: qda(type ~ npreg + glu + bp + skin + bmi + ped + age, data = pima, prior = c(0.66, 0.34)) Prior probabilities of groups : 0 1

          0.66 0.34 

Group means : npreg glu bp skin bmi ped age

          0   2.916667 113.1061 69.54545  27.20455 31.07424  0.4154848  29.23485
          1   4.838235 145.0588 74.58824  33.11765 34.70882  0.5486618  37.69118
  1. Variable tabq the Name given to the two by two Table from the Pima Type Training Set as shown here, # but this time for the qda

> tabq <- table(pima$type, predict(pima_qda)$class)

  1. From the two by two in the Training Set Table of those with Diabetes and those Without, add # Row One Column Two to Row Two Column one, then divide by Total Number of Women, # this occasion to get the Training Error for the Quadratic Discriminant Analysis Model

> (tabq[1,2] + tabq[2,1])/sum(tabq) [1] 0.23

  1. Variable tabqtest is the Name given to the two by two Table from the Pima Type Test Set as shown here, # but now it is for the qda

> tabqtest <- table(pima_test$type, predict(pima_qda, newdata=pima_test)$class)

  1. From the two by two in the Test Set Table of those with Diabetes and those Without, add # Row One Column Two to Row Two Column one, then divide by Total Number of Women, # on this occasion to get the Error on a Test Set for the Quadratic Discriminant Analysis Model

> (tabqtest[1,2] + tabqtest[2,1])/sum(tabqtest) [1] 0.2289157

  1. Cross Validation using the Package ipred for Quadratic Discriminant Analysis

> library(ipred) > mypredict.qda <- function(object, newdata) predict(object, newdata = newdata)$class > errorest(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, model=qda, estimator="cv", predict=mypredict.qda, est.para=control.errorest(k=199)) Call : errorest.data.frame(formula = type ~ npreg + glu + bp + skin +

   bmi + ped + age, data = pima, model = qda, predict = mypredict.qda, 
   estimator = "cv", est.para = control.errorest(k = 199))

199-fold cross-validation estimator of misclassification error

Misclassification error: 0.275

Now my understanding is that for the Logistic, I take the coefficients in the estimates column, and multiply each by the actual data values for this one particular woman, but I am not sure if I use the intercept all seven times, or once or not at all, then the number I find I raise to the power of e, and divide this by this same number to the power of e plus 1, to undo the logit ( expit ). The data for the woman in question is : npreg glu bp skin bmi ped age

5 111 81 33 25.1 0.36 58

which are the seven explanatory variables, and type, either 0 for no Diabetes and 1 for Diabetes, is the Response. In LDA we are told to take the coefficients and multiply each by the values for the woman above and see if it is greater than zero, which here it is, but I do not know what that signifies. I also did work in SAS, which gives two sets of coefficients, 0 for no Diabetes and 1 for Diabetes, and we multiply each of the woman's values by each of the coefficients, and here the value relevant to 0 was greater than the one I worked out for 1, so this suggests to me this Lady will not get Diabetes, or at least not be said to have it. This SAS Data is as follows : Calculations for 0 with respect to no Diabetes : -35.51043-0.17897×5+111×0.09573 +81×0.44203-0.26259×33+25.1× 0.96574 +.36× 4.78151 +0.14675×58 = 35.83263 While the Calculations to do with 1 for there being Diabetes present are as follows : -46.10679-0.05820×5+111×0.13224+81×0.43928-33×0.26386+1.04092×25.1+.36×6.68513+0.19451×58 = 34.97047 Also, I did not understand why in the training set there were some women that were misclassified, as well as in the Test Set, when I thought the Training set was meant to be good enough to predict the test set. Sorry for the longness of my Question How do I sort this out ? Thanks Chris the Russian Christopher Lilly 11:13, 31 May 2015 (UTC)[reply]