[Case Study]Multilevel Analysis using PyMC

1741

收藏 2014-07-01

In any case, I'm writing to ask for help in specifying a multilevel model (mixed model) for what I call partially clustered designs. An example of a partially clustered design is a 2-condition randomized psychotherapy study where subjects are randomly assigned to conditions. In condition 1, subjects are treated in small groups of say 8 people a piece. In the condition 2, usually a control, subjects are only assessed on the outcome (they don't receive an intervention). Thus you have clustering in condition 1 but not condition 2.The model for a 2-condition study looks like (just a single time point to keep things simpler): y_ij = b_0 + b_1*Tx + u_j*Tx + e_ijwhere y_ij is the outcome for the ith person in cluster j (in most multilevel modeling software and in PROC MCMC in SAS, subjects in the unclustered condition are all in clusters of just 1 verson), b_0 isthe overall intercept, b_1 is the treatment effect, Tx is a dummy coded variable coded as 1 for the clustered condition and 0 for the unclustered condition, u_j is the random effect for cluster j, and e_ij is the residual error. The variance among clusters is \sigma^2_u and the residual term is \sigma^2_e (ideally you would estimate a unique residual by condition).Because u_j interacts with Tx, the random effect is only contributes to the clustered condition.In my PyMC model, I expressed the model in matrix form - I find it easier to deal with especially for dealing with the cluster effects.Namely: Y = XB + ZU + Rhere X is an n x 2 design matrix for the overall intercept and intervention effect, B is a 1 x 2 matrix of regression coefficients, Z is an n x c design matrix for the cluster effects (where c is equal to the number of clusters in the clustered condition), and U is a c x 1 matrix of cluster effects. The way I've written the model below, I have R as an n x n diagonal matrix with \sigma^2_e on the diagonal and 0's on the off-diagonal.
All priors below are based on a model fit in PROC MCMC in SAS. I'm trying to replicate the analyses in PyMC so I don't want to change the priors.
The data for my code can be downloaded here:http://dl.dropbox.com/u/613463/run1.csv.The data consist of 200 total subjects. 100 in the clustered condition and 100 in the unclustered. In the clustered condition there are 10 clusters of 10 people each. There is a single outcome variable.he PyMC set-up file can be downloaded here (also pasted below): http://dl.dropbox.com/u/613463/analysis1.py The file for running the PyMC set-up file can be downloaded here: http://dl.dropbox.com/u/613463/run_analysis1.pyI have 3 specific questions about the model:

Given the description of the model, have I successfully specified the model? The results are quite similar to PROC MCMC, though the cluster variance (\sigma^2_u) differs more than I would expect due to Monte Carlo error. The differences make me wonder if I haven't got itquite right in PyMC.
Is there a better (or more efficient) way to set up the model? The model runs quickly but I am trying to learn Python and to get better at optimizing how to set up models (especially multilevel models).
How can change my specification so that I can estimate unique residual variances for clustered and unclustered conditions? Right now I've estimated just a single residual variance. But I typically want separate estimates for the residual variances per intervention condition.

Here are my notes on the matter:

This code worked for me without any modification. :) Although when I tried to run it twice in the same ipython session, it gave me strange complaints. (for pymc version 2.1alpha, wall time 78s).For the newest version in the git repo (pymc version 2.2grad, commit a77b7aa28c75f6d0e8172dd1f1c3f2cf7358135, wall time 75s) it didn't complain.
I find the data wrangling section of the model quite opaque. If there is a difference between the PROC MCMC and PyMC results, this is the first place I would look. I've been searching for the most transparent ways to deal with data in Python, so I can share some of my findings as applied to this block of code.
The model could probably be faster. This is the time for me to recall the two cardinal rules of program optimization: 1) Don't Optimize, and 2) (for experts only) Don't Optimize Yet.That said, the biggest change to the time PyMC takes to run is in the step methods. But adjusting this is a delicate operation. More on this to follow.
Changing the specification is the true power of the PyMC approach,and why this is worth the extra effort, since a random effects model like yours is one line of STATA. So I'd like to write out in detail how to change it. More on this to follow.
Indentation should be 4 spaces. Diverging from this inane detail will make python people itchy.

I think that speeding up the MCMC is more fun than cleaning up the data wrangling, so I'll assume that the data is being set up as desired for now and mess around with the step methods.  Commit linked herehttps://github.com/aflaxman/pymc-multilevel-example/commit/03d0e64653a99085ac5f391acd0ad5839ddfaae4is a small example of this, which doesn't speed things up, but could mean less samples are required of the chain.To actually speed things up (and also potentially reduce the number of samples required), I'm going to combine the beta0 and beta1 stochs. This reduces the wall time to 58s, but the diagnostic plots make me think the it needs a longer burn-in period.  I like to throw out the first half of my samples, so I'll make that change, too.  Pro tip: I added a <code>reload(analysis1)</code> to the run_analysis1.py script, so that my changes to the model automatically make it into ipython when I say <code>time run run_analysis1</code>. Commit linked herehttps://github.com/aflaxman/pymc-multilevel-example/commit/6bd55cef9c5d854f7755924c705ab8e111e20b3cI'm not keeping careful track of if the results change as I modify thestep methods, which is probably not a recommended practice.  But it isfun to see the methods get faster and faster as I group more and morein the AM step method.  (43s wall time for M.B and M.U in one AM,commit here; 32s wall time for all stochs together, but the resultsare not pretty)https://github.com/aflaxman/pymc-multilevel-example/commit/49dea954c0c681ad74a2b427a2e6ca594e501c4fSometimes I've had good luck with choosing initial values for the MCMCby maximizing the posterior or something related to it.  Here is codeto do that, which doesn't take very long (36s wall time, andrespectible looking results are back)https://github.com/aflaxman/pymc-multilevel-example/commit/e8c376b8685c64c7183a8aeb14ec05b095042652So that little burst of messing around has halved the run time.  Toobad I didn't work on the things I was supposed to be doing for thelast hour.Another hour of messing around has given me some confidence that 200Ksamples is certainly enough, and 20K samples probably is enough toguide model development.  To assess this, I relied on theautocorrelation plot of the stoch traces, which I added to theMatplot.plot routine, since I think everyone should use itapproach. Commit linked herehttps://github.com/aflaxman/pymc-multilevel-example/commit/e8545ca5bec6e3a1e5dbb7ba20bb957bc67c0d95Now I'm ready to try simplifying the data wrangling section, and theneverything will be in order to extend the model to have differentvariance for the clustered and unclustered observations.I like to use pylab's csv2rec function for loading data files.  Itdeals nicely with the column headings, and converts data to thecorrect type pretty routinely.  I'm going to do two commits in theprocess of changing this code, one which checks carefully that I'm notbreaking anything, and then another to yield a nice short block ofcode. Commits linked here https://github.com/aflaxman/pymc-multilevel-example/commit/8c06499b287e1bbd9dd7a4146bc627d6e9187ad8 and here https://github.com/aflaxman/pymc-multilevel-example/commit/11ff14dab2377491e8d4d155acfd74b7ca6c93d0 Now, on to changing the model!  I added a separate variance parameterfor the treatment and control groups in a slightly sneaky way.  Itonly took 10 keystrokes (roughly), but maybe it would be better to usemore code and have a clearer approach.  Commit linked herehttps://github.com/aflaxman/pymc-multilevel-example/commit/a46531085c673e62458b8b28153f50aebae036deBefore:
      B: [-0.07  0.3 ]
      u: [-0.14  0.02 0.3 0.5 0.36 -1.02 0.33  0.23 -0.28 -0.35]
var_e1: 0.94
var_u: 0.4

After:
      B: [-0.06  0.31]
      u: [-0.13  0. 0.28  0.53  0.39 -1.08 0.34  0.22 -0.31 -0.35]
var_e1: [ 1.12  0.79]
var_u: 0.41

Since I sneakily used the indicator run1.treat to index the var_e1array, var_e1[0] is the control variance and var_e1[1] is thetreatment variance.  So the change in the model doesn't change any ofthe parameter means, but it does show that the variance of thecontrol group is higher than for the treatment group.

Too bad I didn't generate before-and-after numbers on the uncertaintyintervals, that is something that might have changed in an interestingway with this more complicated model.

In conclusion, here is why I think that this approach is worth the extra effort, compared to usingxtmixed in STATA.  Although the STATA model is almost done when figure outthe single line <code>xtmixed y treat ||j:</code>, that is notquite what you want.  The control groups are not supposed to haverandom effects, so you have to find a
hack for that, maybe<code>xtmixed y treat ||j: treat,nointercept</code>.  And then you want to add different variances for treatment and control, which isdoable, but also hacky, maybe with <code>xtmixed y treat ||j:treat, nointercept ||treat:</code>. And now you've got a very short line, butone that is going to be very hard to understand in thefuture.  And what if you want to fiddle with the priors?  Or change the noise model?  Much better to have 30 clear
lines of PyMC, in my opinion.

p.s. Django and Rails have gotten alot of mileage out of emphasizing _convention_ in frequently performedtasks, and I think that PyMC models could also benefit from thisapproach.  I'm sure I can't develop our conventions myself, but Ihave changed all the file names to move towards what I think we mightwant them to look like. Commit linked here
https://github.com/aflaxman/pymc-multilevel-example/commit/603112c1de3d0fd8fa8fdc76a21ec3f290d30a06
My analyses often have these basicparts: data, model, fitting code, graphics code.  Maybe your do to.

扫码加我拉你入群

请注明：姓名-公司-职位

以便审核进群资格，未注明则拒绝

扫码加我 拉你入群

分享

扫码加好友，拉您进群

扫码加我拉你入群