Experimental Design and Data Analysis
General information, including the course purpose, lecture times and locations, tutorials/labs, assessment, and textbooks, can also be downloaded here as a single document: the first handout given in the first lecture.
What is all this jumbled stuff?! And what does it have to do with econometrics?!
Originally I just put up a memorial tablet for people to remember and take warning from; I never meant for people to keep worshipping it.
Instructor: Scott Long
Teaching Assistant Spring 2005/2006: Jason Cummings
S650 is the second course in sociology’s graduate sequence in applied statistics. The first course, S554, deals with models in which the dependent variable is continuous. These include the linear regression model, seemingly unrelated regressions, and systems of simultaneous equations. S650 deals with regression models in which the dependent variable is limited or categorical. Such models include probit, logit, ordered logit, and Poisson regression, among others. The prerequisite for this class is a prior course in regression. To see the syllabus, click here.
Most materials other than the course notes (available at TIS or the Campus Bookstore) can be downloaded here. Files will be added throughout the semester.
If you want to install the ado files needed for this class, follow this link. You will also find sample programs and data sets at that location. While you may freely use my ado files, you must purchase Stata itself, either from the Stata Corporation or from the IU Stat/Math Center.
Enrollment: Unfortunately, there are more students who want to take S650 than there are seats in the class. First priority is given to graduate students in sociology, since this is a required course for them. Otherwise, authorizations for the class are given on a first-come, first-served basis. If you are interested in taking the class, contact the graduate secretary in sociology to get on the list. The graduate secretary (socgrad@indiana.edu) will contact you regarding authorization for the class. If you are given an authorization, you need to sign up for the class during the normal enrollment period; if you do not, your authorization will be given to the next student on the wait list.
Time conflicts: If you have another class that overlaps with the lecture time for S650, you will need to take one of the classes in another semester. If you have a time conflict with all of the lab times, you should take 650 some other semester. If you can attend some of the labs each week and you are already familiar with Stata (or can learn it on your own), you will probably do fine but might have to work harder than students who can attend lab. While most of the lab time is used for students doing independent work, the teaching assistant will give some short lectures related to the assignments. For example, he/she might provide additional information about keeping a research log or how to format tables using Word.
Lecture Notes
Practice Problems with Solutions
Statistical Inference, Spring 2002
Instructor: Peng Zeng
Office: 230C Parker Hall
Email: zengpen AT auburn DOT edu
Phone: (334) 844-3680
Office hours: 3:30-4:30 pm, Tuesday/Thursday, or by appointment
George W. Collins, II
Conventional methods for handling missing data, like listwise deletion or regression imputation, are prone to three serious problems: they make inefficient use of the available data, they can yield biased parameter estimates, and they tend to produce standard errors that understate the true uncertainty.
These newer methods for handling missing data, maximum likelihood and multiple imputation, have been around for at least a decade, but they have become practical only in the last few years with the introduction of widely available, user-friendly software. Maximum likelihood and multiple imputation have very similar statistical properties. If the assumptions are met, both are approximately unbiased and efficient; that is, they have minimum sampling variance. What's remarkable is that these newer methods rest on less demanding assumptions than those required by conventional methods for handling missing data. At present, maximum likelihood is best suited for linear models or log-linear models for contingency tables. Multiple imputation, on the other hand, can be used for virtually any statistical problem.
This course will cover the theory and practice of both maximum likelihood and multiple imputation. Maximum likelihood for linear models will be demonstrated with Amos 4, a software package designed for estimating structural equation models with latent variables. Multiple imputation will be demonstrated with two new SAS procedures, PROC MI and PROC MIANALYZE.
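The seminar itself demonstrates multiple imputation with SAS; purely as an illustration of the same impute-analyze-combine workflow, here is a minimal sketch in R using the mice package (an alternative tool, not one used in the course), with an invented data frame and model:

## Minimal sketch of the multiple-imputation workflow, using R's mice package
## rather than the SAS procedures used in the seminar; data and model are invented.
library(mice)

set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
x1[sample(n, 40)] <- NA                      # introduce some missing values
dat <- data.frame(y, x1, x2)

imp  <- mice(dat, m = 5, printFlag = FALSE)  # step 1: create 5 completed data sets
fits <- with(imp, lm(y ~ x1 + x2))           # step 2: analyze each completed data set
summary(pool(fits))                          # step 3: combine results (Rubin's rules)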
In addition to Professor Allison's text Missing Data, participants receive a bound manual containing detailed lecture notes (with equations and graphics), examples of computer printout, and many other useful features. This book frees participants from the distracting task of note taking.
1. Assumptions for missing data methods
2. Problems with conventional methods
3. Maximum likelihood (ML)
4. ML with EM algorithm
5. Direct ML with Amos
6. ML for contingency tables
7. Multiple Imputation (MI)
8. MI under multivariate normal model
9. MI with SAS
10. MI with categorical and nonnormal data
11. Interactions and nonlinearities
12. Using auxiliary variables
13. Other parametric approaches to MI
14. Linear hypotheses and likelihood ratio tests
15. Nonparametric and partially parametric methods
16. Sequential generalized regression models
17. MI and ML for nonignorable missing data
Participants in the April 2005 seminar were asked to rate the course on a scale of 1 (worst) to 10 (best). The average score for 27 respondents was 9.2. They were also asked if they wished to make an attributed statement regarding the course. Here are all the comments that were received:
"This has been a great learning experience for me. Intensive, yet reasonably paced, it offered a balanced combination of theories of missing data adjustment and practical applications. For someone like me who has had little previous experience with missing data analysis, this is a good way to get started."
Anca Romantan, Annenberg School for Communication, University of Pennsylvania
"Wonderful course! Makes you realize what your data/analysis is 'missing'."
Faika Zanjani, University of Pennsylvania
"Dr. Allison explains things thoroughly and with enough datail that the student is able to use the material after the course. A large amount of material is carefully condensed and presented in such a way as to still be easily comprehended. The course has an amazing balance between theory and practice. The presentations are engaging."
Jim Godbold, Mount Sinai School of Medicine
"This is a great class. I would recomend it for anyone doing applied or simulation research with missing data."
Carolyn Furlow, Georgia State University
"Even for a novice researcher with no SAS experience, this course has been an invaluable review of conceptual and practical issues related to missing data. Clear, cogent and thorough."
Angela Duckworth, Positive Psychology Center, University of Pennsylvania
"This course is very helpful and Dr. Allison explains complicated contents very easily."
Sunhee Park, University of Pennsylvania School of Nursing
"Theoretically informed, but a very practical 'how-to-do' approach to very common problems. Readily applicable to 'real-world' situations."
Daniel K. Cooper, Harris Interactive
"Missing data is becoming a big issue in all industries, from telecommunications to bank/financial services. Professor Allison taught us how to tackle this problem with the most up-to-date methodologies (both theoretical and practical approaches)."
Shakuntala Choudhury, Senior Marketing Statistician
Categorical Data Analysis
http://www.stat.ufl.edu/~presnell/Courses/sta4504-2000sp/
Course Information
Instructor
The instructor for this section is Brett Presnell. His office hours and other contact information are given on Presnell's home page.
Syllabus
Here is the syllabus for the course (in PDF format).
Handouts
Lecture Notes: copies of the transparencies used in class (chapters 1, 2, and 4 were done on the blackboard). Provided in three formats, 1, 2, and 4 slides to a page, for those who wish to conserve paper (pdf files).
Chapter 3 slides. (2 to a page version) (4 to a page version).
Chapter 5 slides (2 to a page version) (4 to a page version).
Chapter 6 slides (2 to a page version) (4 to a page version).
Chapter 8 slides (2 to a page version) (4 to a page version).
Downloading and using data from the General Social Survey.
SAS
Most of the computations for this class will be demonstrated using SAS. SAS is available on the PCs in the CIRCA labs (such as CSE 211). The CIRCA "SAS for Windows" handout will get you started (hard copies are also available from CIRCA). You can also get SAS for your home PC through the new Student Home-Use Program (current price is $35 for one academic year).
SAS code for examples done in class (and for some of the exercises)
SAS Manuals This is a link to nearly a full set of SAS manuals. You might specifically be interested in the entries for PROC FREQ , PROC GENMOD, PROC CATMOD, and PROC LOGISTIC. Simple "PROCS", like MEANS, SORT, and UNIVARIATE can be found in the SAS Procedures Guide, while more involved procedures are in the SAS/STAT User's Guide.
R and Rweb
Many (all?) of the computations for this class can be done using "R", a free, open-source implementation of the "S" statistical programming language. You can install R on your own PC or use the web-based version, Rweb. Whenever time permits, I will make available R programs (scripts) for the various examples done in class and in the text.
The R page for this course: everything you need to know about R (yeah, right).
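As a small, unofficial illustration of the kind of computation involved (the counts below are invented, not from the course), here is an R sketch of a chi-squared test for a 2x2 table and the corresponding logistic regression:

## Invented 2x2 table: exposure (rows) by outcome (columns)
tab <- matrix(c(30, 70, 15, 85), nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("case", "control")))
chisq.test(tab)                                     # test of independence
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])   # sample odds ratio

## The same comparison as a logistic regression on grouped binomial data
dat <- data.frame(exposure = c("yes", "no"),
                  case = tab[, "case"], control = tab[, "control"])
fit <- glm(cbind(case, control) ~ exposure, family = binomial, data = dat)
summary(fit)
exp(coef(fit))                                      # baseline odds and odds ratio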
Other Things
Some data sources:
General Social Survey (15 March 1999 release).
SDA: Survey Documentation and Analysis: Click on SDA Archive to see some of the available survey data sources. The General Social Survey is also available here. The Multi-Investigator Survey might yield some interesting information (how do things like the order or wording of questions affect responses?).
An Example of Misinterpreted Odds Ratios
The Effect of Race and Sex on Physicians' Recommendations for Cardiac Catheterization
Misunderstandings about the Effects of Race and Sex on Physicians' Referrals for Cardiac Catheterization
Race, Sex, and Physicians' Referrals for Cardiac Catheterization
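The gist of that exchange is how easily an odds ratio gets read as a ratio of probabilities. A quick numerical sketch in R (with illustrative rates chosen here, not figures taken from the articles):

p1 <- 0.85   # hypothetical referral probability, group 1
p2 <- 0.91   # hypothetical referral probability, group 2
(p1 / (1 - p1)) / (p2 / (1 - p2))   # odds ratio, about 0.56
p1 / p2                             # ratio of probabilities, about 0.93
## Describing the first number as "group 1 is over 40% less likely to be referred"
## would badly overstate a gap of about 6 percentage points.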
g(x) = pi N(x, theta_1) + (1 - pi) N(x, theta_2)
The model has five parameters: pi, the two means, and the two variances.
Suppose we want to do MLE. The log likelihood is
l(theta | X) = SUM_i log [ pi N(x_i, theta_1) + (1 - pi) N(x_i, theta_2) ]
Maximizing this is difficult because of the + inside the logarithm.
Also, the actual maximum of this likelihood gives parameter estimates that are unwanted: set one component's mean equal to a single observation and let its variance shrink toward zero. The density of that component at that observation grows without bound, so the total likelihood is unbounded as well.
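As a quick numerical check (simulated data, not from the notes): the R function below evaluates this mixture log-likelihood, and shrinking one component's standard deviation toward zero while its mean sits exactly on a data point drives the value upward without bound.

set.seed(1)
x <- c(rnorm(50, 0, 1), rnorm(50, 4, 1))        # simulated two-group data

mix_loglik <- function(x, p, mu1, sd1, mu2, sd2) {
  sum(log(p * dnorm(x, mu1, sd1) + (1 - p) * dnorm(x, mu2, sd2)))
}

mix_loglik(x, 0.5, 0, 1, 4, 1)                  # a sensible set of parameter values

## Component 1 centered exactly on x[1]: its term grows like -log(sd1) as sd1 -> 0,
## while every other term stays bounded because component 2 still covers those points.
sapply(c(1e-2, 1e-50, 1e-100, 1e-200),
       function(s) mix_loglik(x, 0.5, x[1], s, 4, 1))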
EM is a general iterative procedure for finding a local optimum for this type of hard MLE problem.
Example: waiting for the AP&M elevator, where the waiting time has an exponential distribution, which has the Markov property of history-independence: Pr(T > t+r | T>t) = Pr(T > r)
What is the expected time to wait, called mu?
Data: wait 7 min, success
wait 12 min, give up (censored data)
wait 8 min, success
wait 5 min, give up
Guess mu = 8. Fill in the missing data: by memorylessness, a wait censored at time t has expected total length t + mu, so the completed values are 7, 12 + 8 = 20, 8, and 5 + 8 = 13. The new estimate of mu is the mean of these, 48/4 = 12. Repeat.
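A minimal R sketch of this iteration (the numbers are the ones from the example; the code itself is not part of the original notes):

t_obs <- c(7, 12, 8, 5)                # observed times, in minutes
event <- c(TRUE, FALSE, TRUE, FALSE)   # TRUE = elevator arrived, FALSE = gave up (censored)

mu <- 8                                # initial guess
for (iter in 1:100) {
  ## E step: by memorylessness, a wait censored at time t has expected total length t + mu
  filled <- ifelse(event, t_obs, t_obs + mu)
  ## M step: the complete-data MLE of an exponential mean is the sample mean
  mu <- mean(filled)
}
mu
## Converges to 16, which matches the closed-form censored-data MLE:
## total observed time / number of completed waits = 32 / 2 = 16.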
E step: compute the expected value of the missing information (or, more generally, a probability distribution over it), given the observed data and the current parameter estimates.
M step: compute the MLE of the model parameters, treating the expected values filled in for the missing data as if they had been observed.
Theorem: Under certain conditions, this process converges to parameter values theta at which the likelihood is locally maximal.
The other situation where EM helps is when maximizing the original incomplete likelihood l(theta | X) directly is too difficult, and latent variables are introduced to simplify the problem. Consider the general mixture-modeling scenario where we have components numbered i = 1 to i = M.
Suppose we have observed data X generated by a pdf with parameter theta. To estimate theta, we want to maximize the log-likelihood l(theta; X) = log p(X; theta).
Suppose there is additional data called Z also generated with parameter theta. Names for the Z data include latent, missing, and hidden.
Let the "complete" data be T = (X,Z) with log-likelihood l_0(theta; X, Z).
In the Gaussian mixture case, z_i can be a 0/1 variable that reveals whether x_i was generated by theta_1 or theta_2.
Since p(Z, X | theta') = p(Z | X, theta') p(X | theta'), we have p(X | theta') = p(Z, X | theta') / p(Z | X, theta').
Changing to log-likelihoods, l(theta'; X) = l_0(theta'; Z, X) - l_1(theta'; Z | X), where l_1 is the log-likelihood based on the conditional distribution of Z given X.
Now take the expectation of each side over Z, where Z follows its conditional distribution given X and a current parameter value theta (different from theta'). On the left we have
E[ l(theta'; X) ] = E[ log p(X; theta') ] = log p(X; theta'),
since the left-hand side does not involve Z. On the right we have
E[ l_0(theta'; Z, X) ] - E[ l_1(theta'; Z | X) ], with Z ~ p(Z | X, theta).
Call this expression Q(theta', theta) - R(theta', theta).
We have a lemma that says we can increase this expression just by increasing Q(theta', theta).
Lemma: If Q(theta', theta) > Q(theta,theta) then Q(theta', theta) - R(theta', theta) > Q(theta, theta) - R(theta, theta).
In words, this lemma says that if the expected complete log-likelihood is increased, then the incomplete log-likelihood is also increased.
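The lemma is stated without proof in the notes; a sketch (added here) follows from Jensen's inequality, which shows that the R term can never work against an increase in Q:

\[
R(\theta',\theta) - R(\theta,\theta)
  = E\!\left[\log \frac{p(Z \mid X, \theta')}{p(Z \mid X, \theta)}\right]
  \le \log E\!\left[\frac{p(Z \mid X, \theta')}{p(Z \mid X, \theta)}\right]
  = \log \int p(z \mid X, \theta')\, dz = \log 1 = 0,
\]

where the expectation is over Z ~ p(Z | X, theta). Hence R(theta', theta) <= R(theta, theta) for every theta', so whenever Q(theta', theta) > Q(theta, theta), the difference Q - R increases as well.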
Based on the E step, the M step is to find the value of theta' that maximizes Q(theta', thetaj), where thetaj denotes the current parameter estimate at iteration j.
What is Q(theta', thetaj)? It is E [ l_0(theta'; T) ] where the expectation averages over alternative values for the missing data Z, whose distribution depends on the known data X and the parameter thetaj.
Often, a major simplification is possible in the E step. The simplification is to bring the expectation inside the l_0 function, so it applies just to Z:
E[ l_0(theta'; X, Z) ] = l_0(theta'; X, E[Z])
where the distribution of Z is a function of thetaj and also of X.
Often we can simplify this by finding a single special z value such that the integral over Z is the same as evaluating the integrand for this special z value.
The obvious choice for the special z value is the expectation of Z. We want to use z = E[ Z | X, thetaj ] as an imputed value for Z, instead of averaging over all possible values of Z. In this case Q(theta', thetaj) = log p(X, E[ Z | X, thetaj ] | theta').
In general, the expectation of a function equals the function of the expectation if and only if the function is linear. Fortunately, many complete-data log-likelihoods are linear in the missing data Z.
A further simplification: if a missing variable Yi is binary-valued (0/1), then E[ Yi ] = P(Yi = 1).
In the M step we have a function of theta to maximize. Usually we do this by computing the derivative and solving for where it equals zero. Sometimes the derivative is a sum of terms, each of which involves only a subset of the parameters in theta; then we can solve separately for each subset by setting the corresponding term to zero.
Now return to the two-component mixture. Ignoring the terms involving pi, the complete-data log-likelihood is
l_0(theta'; X, Z) = SUM_i [ (1 - Zi) log f(xi, theta'1) + Zi log f(xi, theta'2) ].
We want to take the average value of this, for Z following a distribution that depends on X and thetaj. Note that log f(xi, theta'1) and log f(xi, theta'2) do not depend on Z, so we take them outside the expectation and get
SUM_i [ (1 - E[Zi]) log f(xi, theta'1) + E[Zi] log f(xi, theta'2) ].
For fixed X and thetaj, we can compute E[Zi] using Bayes' rule:
E[Zi] = P(Zi = 1 | X, thetaj) = P(Zi = 1 | xi, thetaj)
      = P(Zi = 1 and xi | thetaj) / [ P(Zi = 1 and xi | thetaj) + P(Zi = 0 and xi | thetaj) ]
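Putting the E step (Bayes' rule for E[Zi]) and the M step (separate weighted MLEs for each parameter) together, here is a minimal R sketch of EM for the two-component Gaussian mixture; the simulated data and the starting values are ad hoc additions, not part of the original notes:

set.seed(2)
x <- c(rnorm(100, 0, 1), rnorm(100, 4, 1.5))   # simulated data from two components

## starting values (Zi = 1 indicates component 2, as in the derivation above)
p <- 0.5; mu1 <- min(x); mu2 <- max(x); sd1 <- sd(x); sd2 <- sd(x)

for (iter in 1:200) {
  ## E step: E[Zi] = P(Zi = 1 | xi, current parameters), by Bayes' rule
  d1 <- (1 - p) * dnorm(x, mu1, sd1)   # proportional to P(Zi = 0 and xi)
  d2 <- p * dnorm(x, mu2, sd2)         # proportional to P(Zi = 1 and xi)
  z  <- d2 / (d1 + d2)

  ## M step: each parameter is solved for separately, using weighted sufficient statistics
  p   <- mean(z)
  mu1 <- sum((1 - z) * x) / sum(1 - z)
  mu2 <- sum(z * x) / sum(z)
  sd1 <- sqrt(sum((1 - z) * (x - mu1)^2) / sum(1 - z))
  sd2 <- sqrt(sum(z * (x - mu2)^2) / sum(z))
}
c(p = p, mu1 = mu1, mu2 = mu2, sd1 = sd1, sd2 = sd2)

## Observed-data log-likelihood at the fitted values (it can only increase across iterations)
sum(log((1 - p) * dnorm(x, mu1, sd1) + p * dnorm(x, mu2, sd2)))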
Copyright © 1998-2005 Mike Brookes, Imperial College, London, UK. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
To cite this manual use: Brookes, M., "The Matrix Reference Manual", [online] http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html, 2005
The Matrix Reference Manual
Quantitative Research Methods: Multivariate, Spring 2004
OP, this is too expensive and I can't afford it. Can't you lower the price a bit?
If nobody can afford it, you still won't get any money, and without that money you can't buy more of the good stuff on the forum!
Alternatively, you can download the *.raw file and the *-readin.do file and run the *-readin.do file in your version of Stata. This is useful if you are running an earlier version of Stata.
Please let me know if you have trouble reading the files or if you would prefer to have the data in another format.
To use the *.ado files, put them in your current directory, in your Stata "ado" directory, or in another directory where Stata knows to look for them.
These are *not* thoroughly tested functions! Please let me know of any bugs that you find in these functions.
You might also examine the functions longplot and linkplot from that same website.