2014-06-16
I'm trying to fit a logistic regression to the spam data (which can be found on the webpage of The Elements of Statistical Learning) using the R2WinBUGS package. My approach was to first divide the data into an 80% training set and a 20% test set. I can fit the model using the 80% set, but I don't know how to write WinBUGS code to predict on new observations (say, the 20% test set), and I wonder whether this approach to studying model/classification precision makes sense in a Bayesian approach?
All replies
2014-6-16 02:21:53
Predicting with Bayesian models, and especially with BUGS, is very easy. Just set the response in the test set to NA. You then also need to specify initial values for the response: set those to NA for the training cases and to a reasonable value for the test cases.

BUGS will then sample from the posterior predictive distribution for the response values you set to NA. Note that these distributions incorporate the uncertainty about the regression coefficients. You can take the median of these samples if you want point estimates, but the standard deviation of the samples will also be quite informative.

Here is a rather minimal example:

model
{
    for (i in 1:N)
    {
        y[i] ~ dnorm(mu, 1)
    }
    mu ~ dunif(-1000, 1000)
}

#data
list(N=10, y = c(-1,0,1,-1,0.5,-0.5,2,-1.5, NA, NA))
#inits
list(mu = 0, y = c(NA,NA,NA,NA,NA,NA,NA,NA,0,0))
You can then get posterior predictive distributions for y[9] and y[10]. This example does not contain predictors, but the approach also works with them; note that you would not set the predictors to NA, they remain unchanged (fully observed) for the test cases.
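
To connect this to the original spam question, here is a hedged sketch (mine, not the answerer's) of what the same NA trick might look like for a logistic regression with a single predictor, set up from R for R2WinBUGS. The names x, beta0, beta1, the vague priors, and the file name spam_logit.txt are illustrative assumptions only.

library(R2WinBUGS)

# A hypothetical logistic-regression model: the two test-set responses in y
# are NA, while the predictor x is fully observed for every case.
model.txt <- "
model {
    for (i in 1:N) {
        y[i] ~ dbern(p[i])
        logit(p[i]) <- beta0 + beta1 * x[i]
    }
    beta0 ~ dnorm(0.0, 1.0E-3)
    beta1 ~ dnorm(0.0, 1.0E-3)
}"
writeLines(model.txt, "spam_logit.txt")

# Data: the last two cases play the role of the test set and will be predicted.
data <- list(N = 10,
             y = c(0, 1, 0, 1, 1, 0, 1, 0, NA, NA),
             x = c(-1, 0, 1, -1, 0.5, -0.5, 2, -1.5, 0.3, 1.2))

# Inits: NA for the observed responses, a legal value (0 or 1) for the NA ones.
inits <- function() list(beta0 = 0, beta1 = 0,
                         y = c(rep(NA, 8), 0, 0))

Fitting this with bugs() and monitoring p (or y) then gives posterior predictive spam probabilities for cases 9 and 10; a full call is sketched further down.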

@Edit after Comment:

You can also do this differently and separate the test and training data in the model. That would look like this:

model
{
    for (i in 1:N.train)
    {
        y.train[i] ~ dnorm(mu, 1)
    }
    for (i in 1:N.test)
    {
        y.test[i] ~ dnorm(mu, 1)
    }
    mu ~ dunif(-1000, 1000)
}

#data
list(N.train=8, N.test = 2, y.train = c(-1,0,1,-1,0.5,-0.5,2,-1.5))
#inits
list(mu = 0, y.test = c(0,0))
This might look somewhat easier, but note that you will also need to split any predictors in the model (my example has none); you might then have vectors like sex.train and sex.test. Personally I prefer the first way, because it is more terse.
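
To tie this back to the R2WinBUGS workflow the question asks about, here is a hedged sketch (my own) of driving this train/test formulation from R. The file name model_split.txt, the chain and iteration settings, and reliance on the default WinBUGS location are assumptions, not part of the original answer.

library(R2WinBUGS)

# Assumes the train/test model above has been saved as "model_split.txt".
data  <- list(N.train = 8, N.test = 2,
              y.train = c(-1, 0, 1, -1, 0.5, -0.5, 2, -1.5))
inits <- function() list(mu = 0, y.test = c(0, 0))

# Monitor mu and the held-out responses; the y.test draws are samples from
# the posterior predictive distribution.
fit <- bugs(data, inits,
            parameters.to.save = c("mu", "y.test"),
            model.file = "model_split.txt",
            n.chains = 3, n.iter = 5000, n.burnin = 1000, DIC = FALSE)
# (pass bugs.directory = "..." if WinBUGS is not installed in the default path)

apply(fit$sims.list$y.test, 2, median)   # point predictions
apply(fit$sims.list$y.test, 2, sd)       # predictive uncertainty

The spread of the y.test draws reflects both the residual noise and the uncertainty about mu, which is exactly the point made above.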

And yes, I think this is a reasonable starting point. While some kinds of overfitting will show up in a Bayesian model as very large posterior sds for the coefficients, you still impose a model structure that might not fit the data well, which can also make your predictions poor. You should also consider (for example) full cross-validation, where you repeat this step with different splits of the original data.
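
As a purely illustrative sketch of that last suggestion (my own, with several assumptions): suppose the spam data live in a data frame spam with a 0/1 response column spam and, for brevity, a single numeric predictor column x (the real data has many more), and that the hypothetical spam_logit.txt model from the earlier sketch is used. Repeated random 80/20 splits could then look roughly like this.

library(R2WinBUGS)
set.seed(1)
n.reps <- 10
err <- numeric(n.reps)

for (r in 1:n.reps) {
    test.idx <- sample(nrow(spam), size = round(0.2 * nrow(spam)))
    y <- spam$spam
    y[test.idx] <- NA                        # hold out the test responses
    data  <- list(N = nrow(spam), y = y, x = spam$x)
    inits <- function() list(beta0 = 0, beta1 = 0,
                             y = ifelse(is.na(y), 0, NA))
    fit <- bugs(data, inits, parameters.to.save = "p",
                model.file = "spam_logit.txt",
                n.chains = 3, n.iter = 2000, DIC = FALSE)
    # Posterior mean of p[i] is the predicted spam probability for case i.
    p.hat <- colMeans(fit$sims.list$p)[test.idx]
    err[r] <- mean((p.hat > 0.5) != spam$spam[test.idx])
}
mean(err); sd(err)                           # estimated test error and its spread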

2014-6-16 02:23:11
       
Simple data splitting is inefficient if you want to estimate the performance of your model; CV and bootstrapping usually yield better estimates. This is a general fact and not specific to Bayesian modelling and/or WinBUGS. When cross-validating you would still do what you propose, but with different splits of the data. I will add a point to address your second question in my answer. – Erik Feb 11 '13 at 16:21

"Inefficient" is meant in the statistical sense (i.e., low precision / high standard error), not in the computational sense. 100 repeats of 10-fold cross-validation, like the bootstrap, can be quite efficient.
