2014-04-12
I'm working on a discrete-time survival analysis (following along with Singer and Willett's chapters) and I've run into a problem that may occur fairly often in discrete time. A bit of background: 42 out of 261 persons had the event over a span of about 1200 days, which is my time unit. I elected to work in discrete time rather than continuous time because I thought, maybe incorrectly, that discrete time would be easier to work with and present to an inexperienced audience. I'm aggregating time to eliminate time chunks with no events. Using a three-month aggregation, I have one time chunk with no events. The estimation gives a large B coefficient for that chunk, which is no surprise. My question is whether an acceptable remedy is to add a tiny value to that cell, changing the 0 to, say, .001, if you think in crosstab terms. Then, if it is an acceptable remedy, is carrying it out just a matter of changing the event status indicator from 0 to .001 for a randomly chosen case at that time point?
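(For readers who want to see the problem concretely, here is a minimal synthetic sketch in Python with pandas and statsmodels. The sample size, probabilities, and the zero-event period are all invented, not the poster's data. With one dummy per period, the dummy for a period containing no events has no finite maximum-likelihood estimate; the fit drifts toward minus infinity, which is the "large B coefficient".)

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Hypothetical person-period data: 200 people, up to 5 periods each.
# By construction, no events ever occur in period 3 (the problem case).
rows = []
for pid in range(200):
    for period in range(1, 6):
        p = 0.0 if period == 3 else 0.05
        event = int(rng.random() < p)
        rows.append((pid, period, event))
        if event:
            break  # a person leaves the risk set after the event

pp = pd.DataFrame(rows, columns=["id", "period", "event"])

# One dummy per period, no intercept: the usual discrete-time baseline.
X = pd.get_dummies(pp["period"], prefix="t").astype(float)
fit = sm.GLM(pp["event"], X, family=sm.families.Binomial()).fit()
print(fit.params)  # t_3 drifts toward -infinity: quasi-complete separation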



All replies
2014-4-12 03:52:01
I do not think it is wise to correct reality to make it fit the model, but (if anything) the other way around. On the other hand, survival analysis is about probabilities, and it is quite likely that at any particular period the number of events is larger or smaller than predicted, including zero events as a limiting case. When the total number of events is 42 over 1200 periods, zero-event periods are a necessity.
Time chunks might be useful for some purposes, but (1) by lumping time periods together you are deliberately forfeiting information; and (2) the actual length of the chunks is arbitrary: your conclusions may vary if you use 2 months, 3 months, or 17 or 43 days as your time-chunk convention. At any rate one should try different ways of carving the chunks, aiming at the shortest chunk that avoids unstable results (e.g. a coefficient estimate that is extremely sensitive to the random occurrence of one more or one fewer event in a given chunk).
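A quick way to probe that sensitivity is to count zero-event chunks under several candidate widths. A minimal sketch (synthetic event days that only mimic the totals in the question: 42 events among 261 people over 1200 days):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical day-level data: 261 people followed up to 1200 days,
# 42 of them with an event on a random day (mimics the totals only).
n, horizon = 261, 1200
event_day = np.full(n, np.nan)
event_day[:42] = rng.integers(1, horizon + 1, size=42)

for width in (30, 60, 90):  # candidate chunk lengths in days
    n_chunks = int(np.ceil(horizon / width))
    last = np.where(np.isnan(event_day), horizon, event_day)
    chunk = np.ceil(last / width).astype(int)  # chunk where follow-up ends
    ev = pd.Series(np.where(np.isnan(event_day), 0, 1))
    counts = ev.groupby(chunk).sum().reindex(range(1, n_chunks + 1), fill_value=0)
    print(f"{width}-day chunks: {(counts == 0).sum()} of {n_chunks} have zero events")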
You do not clarify what kind of survival analysis you are doing. If it is Cox regression, recall that Cox regression uses only information on the succession or time-order of events, not on the exact length of time elapsed.

Also, Cox regression generates a function for the hazard rate (or conversely the survival probability) including EXP(BX), with B coefficients for each predictor, and these B coefficients are one single set of coefficients for the entire analysis, not one coefficient per chunk of time. Thus I do not understand what you mean by "a large B coefficient for that chunk".

Hope this helps.



Hector Maletta






2014-4-12 03:53:01
I do not know the details of your research program; I thought you were dealing with a SURVIVAL problem, i.e. trying to ascertain the probabilities of surviving for a given span of time (after a certain "start" state) before an event strikes. This is not the same as the probabilities of the event at various periods in comparison with the probability of the event at some reference period. An example of the latter would be to ascertain the probability of tornados in June, July, August, September, etc. relative to their probability in some "reference" month. An example of the former is the probability that a given location would "survive" without any tornado from January until a variable date (January to May, January to June, January to September, etc.). It is a completely different problem.

In the latter case you may use logistic regression (probability of the event for cases described by category 1, 2, 3, 4, ..., k, relative to the reference probability of the event for cases described by the reference category). For instance, suppose you investigate the probability of tornados with one single predictor, "Month of year"; this predictor has 12 categories, one per month, and you use one of the months (say, May) as your reference category. (You may have data from multiple years for the same location, or from multiple locations in the same year; multiple years for a given location may seem more logical as an example, given the geographical variability of tornados and their relatively stable recurrence over time.) You may also use some other concurrent predictor, say accumulated annual rainfall over the 12 months up to the start of the tornado season, for all years (or locations) considered. From these data, you obtain the odds of tornados in September relative to May (probability of tornados in September, divided by probability of tornados in May), and likewise the odds of tornados for all the other months. In total, you obtain eleven odds (one per month, all relative to the reference month). Your logistic function for the probability of tornados in a given month is p(k)=EXP(BX)/[1+EXP(BX)], where BX=b0+b1X1+...+bkXk. The logarithm of the odds is BX, and the odds are EXP(BX). The odds that a tornado happens in a given month k, say July, relative to the base month (May) in years with rainfall x(t), equal the probability of a tornado happening in month k, relative to the probability of a tornado happening in May, for years with cumulative 12-month rainfall = x(t). (I use May in this example because the probability of tornados in January is likely to be zero.)
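To make the odds arithmetic concrete, a minimal sketch (all coefficient values invented for illustration):

import numpy as np

# Invented coefficients for the tornado example: intercept, a dummy for
# July (reference month: May), and rainfall in units of 100 mm.
b0, b_july, b_rain = -2.0, 0.8, -0.3

def prob(bx):
    return np.exp(bx) / (1 + np.exp(bx))  # p = EXP(BX)/[1+EXP(BX)]

rain = 6.0  # 600 mm of cumulative 12-month rainfall
bx_july = b0 + b_july + b_rain * rain
bx_may = b0 + b_rain * rain  # reference month: all month dummies are 0
print("p(tornado in July) =", prob(bx_july))
print("p(tornado in May)  =", prob(bx_may))
print("odds ratio, July vs May =", np.exp(b_july))  # odds are EXP(BX)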

In the survival case the situation is different. Suppose you use survival analysis to predict the chances of surviving different lengths of time without experiencing a tornado (say, in a given area, like Kansas), using a "time variable" (month of year) and perhaps some predictor (say, rainfall again). In this case, time is NOT a predictor: it is the one-directional dimension along which events can occur. Your hazard function will have only one predictor (rainfall). The cumulative hazard rate h(t) for a year with rainfall X=x(t) will mean: "number of events expected to have occurred from the starting time, say January, to month t, in years with rainfall x(t)". (I use January in this example because the reference time in survival analysis should be at the start of the relevant period, not in the middle of it; one might also use March or April, if the start of the tornado season is always after March.) The associated survival probability up to month k, for a year with rainfall x(t), gives you the probability of not having a tornado until month k, for year t with rainfall x(t).
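As a numeric illustration of that reading (all values invented): with a baseline monthly hazard and a proportional effect of rainfall, the survival probability up to month k is EXP(-H(k)), where H(k) is the cumulative hazard.

import numpy as np

# Invented monthly baseline hazards: zero in winter, peaking mid-season.
h0 = np.array([0.00, 0.00, 0.01, 0.03, 0.06, 0.05,
               0.04, 0.03, 0.02, 0.01, 0.00, 0.00])
b_rain = -0.4  # assumed (invented) rainfall coefficient
x = 1.0        # rainfall covariate for this year, in arbitrary units

H = np.cumsum(h0) * np.exp(b_rain * x)  # cumulative hazard up to each month
S = np.exp(-H)                          # survival: no tornado through month k
for k, s in enumerate(S, start=1):
    print(f"P(no tornado through month {k}) = {s:.3f}")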

Logistic regression estimates the probability of a tornado in each different month. Survival analysis estimates the probability of not having the tornado up to each different month.

Cox regression is a particular kind of "proportional hazard" survival analysis, where (ordinarily) the hazard rates stand to the reference or base hazard rate in a constant proportion over time. If your chances of survival are twice as large as mine, guys like you would be twice as likely as guys like me to survive up to every month, no matter how distant the month considered. There is no chance that your survival chances approach mine over time: they remain twice as large. For example, if 800 mm/yr of accumulated rainfall up to the start of the season affords a reduction of 20% in the incidence of tornadoes (relative to the reference case which has, say, 500 mm/yr rainfall), this reduction of 20% in the odds of tornadoes will hold for all time intervals, i.e. for the relative chances of tornadoes up to all months. Cox regression may, however, accommodate non-proportional hazards by introducing time-dependent covariates (such as accumulated rainfall UP TO EACH MONTH). Other models of survival analysis can account for more sophisticated relationships between time, covariates, and events.
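A minimal sketch of the proportionality claim (survival values invented): under proportional hazards, S1(t) = S0(t)^HR, so the ratio of cumulative hazards, -log S, is the same at every time point.

import numpy as np

S0 = np.array([0.99, 0.97, 0.94, 0.90, 0.85])  # invented baseline survival
HR = 0.8                                        # e.g. a 20% lower hazard
S1 = S0 ** HR                                   # proportional hazards
print(np.log(S1) / np.log(S0))                  # 0.8 at every time point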



Hector Maletta




2014-4-12 03:53:58
In chapter 11 of that book, S&W show how one can fit a "discrete-time hazard model" using a logistic regression procedure.  To quote:

"Our goal is to show that although the [discrete-time hazard] model may appear complex and unfamiliar, it can be fit using software that is familiar by applying standard logistic regression analysis in the /person-period data set/."

A "person-period data set" has multiple rows per person, one row per period of observation.  There is also an Event indicator variable (call it EVT). For persons who experience the event, EVT=0 on all rows but the last.  For those who do not experience the event, EVT=0 on all rows.  For more details, see Singer & Willett's book (particularly chapters 11 & 12).

Bruce Weaver
Professor, Lakehead University, Canada

2014-4-12 03:55:44
Bruce,
I understand that, but to me it is not the same thing. I have not read Singer & Willett's book, but I tend to think a person-period dataset does not imply that survival probabilities decrease monotonically over time: if you start out alive today, it may well happen (with such a dataset) that your predicted probability of surviving up to next November is lower than your probability of still being alive next December, which would be a bit unreal.
Even a multilevel model with persons and periods cannot account for the ordered nature of periods: they would simply be several "periods" equivalent to several "observations", with no particular order. Perhaps an ordering of periods could be achieved by using "time elapsed" as one of the predictors, but it all looks to me like a convoluted way of arriving at the same point.
Perhaps some people can more easily manage logistic regression software than survival analysis software, but to me they look equally easy or equally difficult.

And errors of interpretation can arise in either, no matter how familiar the software. (Recall the frequent mistake of using the "classification table" in logistic regression as a criterion of goodness of fit. This is wrong because "fit" in a probabilistic prediction is a fit of the probability, measured as a proportion computed over a group of cases, not the actual occurrence of the event for every individual with p > 0.50 and its non-occurrence for every individual with p < 0.50.)

Hector


2014-4-12 04:08:36
Hector, I have this book in front of me. It seems to me that the censoring and event indicator variables would ensure that the hazard ratios were accurate, and from what I gather, the claim Singer and Willett are making is that they are simply using logistic regression to reproduce standard hazard modeling, without needing to familiarize yourself with new software. My understanding of the claim is that the two become the same thing, and from what I'm seeing, that appears to be the case.

In the same way, while specialized software exists to perform propensity score matching, all of the methods used in that software can be replicated manually in SPSS or SAS using separate steps. The specialized software makes it easier to perform these functions, but it is not introducing new math, so to speak.

The hazard function is the probability that you experience an event at a given time, given that you did not experience that event in any previous time period. Doesn't the censoring allow the model to calculate this such that it isn't creating a probability for someone already dead? I assume that was the point you were trying to make?
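In symbols, h(t_j) = Pr(event in period j | no event before period j), which a life table estimates as events divided by the number still at risk. A minimal sketch (counts invented):

import numpy as np

at_risk = np.array([200, 180, 150, 120])  # still event-free entering each period
events = np.array([10, 12, 9, 6])         # events during each period
hazard = events / at_risk                  # Pr(event in j | survived to j)
print(hazard)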

If what you are saying is that someone's probability of being alive can't increase with time, that isn't true either. Cancer patients have an increased probability of staying in remission the longer they have been in remission. Alcoholics have a higher probability of staying sober the longer they stay sober after treatment (i.e., they are more likely not to relapse once they have put together a great deal of recovery).
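(These examples describe a hazard that falls over time. That is compatible with a survival curve that never increases: in a discrete-time model, survival is the running product of (1 - h_j), which cannot go up. A minimal numeric sketch with invented hazards:)

import numpy as np

h = np.array([0.30, 0.20, 0.10, 0.05])  # invented per-period hazards, falling
S = np.cumprod(1 - h)                    # S(k) = (1-h1)(1-h2)...(1-hk)
print(S)                                 # non-increasing even as the hazard drops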

The ordering aspect of time is up to you. In any statistical analysis of time, interpreting the periods in a specific order is still up to the researcher; the model can still make "wonky" time comparisons that make no sense. Statistics doesn't understand that time flows from past to present; only we do. It is up to you to use the variables indicating time and censoring to ensure that the analysis makes sense.

Am I misunderstanding your concern?

Matthew J Poes
Research Data Specialist
Center for Prevention Research and Development
University of Illinois
