Dear Experts:
I hope you all had a nice weekend. If the significance of the Kolmogorov-Smirnov test is 0.200 and that of the Shapiro-Wilk test is 0.016, does the variable follow the normal distribution? I ask because the first value is > 0.05 and the second is < 0.05.
Thanks.
Omar.
Hi Omar,
The two tests focus on different aspects of non-normality. The Shapiro-Wilk
test is generally considered the better one (at least there is a paper, by Shapiro &
Wilk, where they show mathematically that their test is the most powerful for
detecting non-normality).
There are, nonetheless, other details to consider:
- Sample size: the SW test tends to be oversensitive with large samples
(say, over 100), and you don't mention yours.
- Skewness & kurtosis: check both coefficients to investigate the
causes of non-normality. Skewness is, in general, more problematic
than kurtosis. Its effects are more important (at least in Student's
t tests) than those of kurtosis, although high kurtosis (usually
a sign of the presence of outliers) can dramatically reduce the
efficiency of parametric methods like ANOVA.
- Also take a look at the box-plot.
Dear Ms. Marta:
Do you mean that the SW test is more efficient with large sample sizes than the KS test?
Many thanks.
Omar.
Hello, if Mauchly's Test of Sphericity is significant (I am running
GLM to test a MANCOVA), is it absolutely necessary that I use the
Greenhouse-Geisser correction of the degrees of freedom, or can I still
use the df's from the "sphericity assumed" calculation? In other words,
how bad is the violation of the sphericity assumption for the validity of
the significant results?
Thanks a lot.
Bozena
Bozena Zdaniuk, Ph.D.
University of Pittsburgh
UCSUR, 6th Fl.
121 University Place
Pittsburgh, PA 15260
Ph.: 412-624-5736
Fax: 412-624-4810
Email: bozena@pitt.edu
Perhaps I didn't explain myself correctly. Mauchly's test indicates
whether the correlations among the repeated measures are similar (the
so-called sphericity assumption). If Mauchly's test is significant
because the correlations really are different, then you must adjust the
degrees of freedom using the G-G epsilon. But sometimes (in heavy-tailed
distributions) Mauchly's test can be significant even with similar
correlations. Heavy tails are NOT a cause of a failure in
sphericity, but a cause of a FALSE POSITIVE Mauchly's test. In that
case (Mauchly's test significant & very heavy-tailed distributions
-outliers present-) you might consider that the sphericity condition
is OK and avoid the epsilon correction of the DF.
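To see where these quantities show up, here is a minimal repeated-measures GLM sketch (the names t1, t2, t3 for the repeated measures and group for the between-subjects factor are hypothetical placeholders); SPSS prints Mauchly's test and the epsilon-corrected tests automatically:
GLM t1 t2 t3 BY group
  /WSFACTOR=time 3 Polynomial
  /WSDESIGN=time
  /DESIGN=group.
* The output contains Mauchly's Test of Sphericity and, in the Tests of.
* Within-Subjects Effects table, rows for Sphericity Assumed,.
* Greenhouse-Geisser and Huynh-Feldt, so both sets of df's can be compared.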
I'm not really fond of mathematical transformations (you lose contact
with your data to the same degree that you gain normality). I remember that
high kurtosis could be prevented by taking two measures, instead of
one, and averaging them. This must be done during data collection
(it has to be planned at the design stage); it can't be done
now with your data.
Square roots or logarithms might eliminate part of the kurtosis, but
they are better suited to positively skewed data. If your data are
symmetric, then these transformations could add negative skewness to
your problems, but, as the saying goes, "the proof of the pudding is in
the eating". Try them and see what happens with your data.
HTH
Marta
Hello group,
I need some quick help with testing normality.
I have a 3x2x2 factorial design with two factors being scale (one with 3
levels and one with 2 levels) and one factor being nominal (2 levels). I would
like to test for normality (an assumption for factorial ANOVA) and would
like to know how to do that in SPSS. Can somebody also provide some general
syntax for SPSS?
What are the implications if normality does not hold?
Kind Regards,
Karl
The best approach is to test the normality of the residuals of the model; the normality assumption refers to the residuals anyway. You can save the residuals and use a descriptive procedure to test them for non-normality. You could also plot the residuals and do an ocular test (eyeball it), although some journals are a bit rigid. Whether you find significant deviations from normality will depend on two things: whether the data are normal and how large the sample is. With a very large sample, almost any data will show significant deviations from normality; with a very small sample, almost no data will. It's a persistent problem with tests of assumptions.
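For instance, a minimal sketch for a three-way factorial design (the names y, a, b and c are hypothetical): fit the model with UNIANOVA, save the residuals, and run EXAMINE on them:
UNIANOVA y BY a b c
  /SAVE=RESID
  /DESIGN=a b c a*b a*c b*c a*b*c.
EXAMINE VARIABLES=RES_1
  /PLOT HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.
* /SAVE=RESID adds the residuals as RES_1; NPPLOT requests the.
* Kolmogorov-Smirnov (Lilliefors) and Shapiro-Wilk tests of normality.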
Paul R. Swank, Ph.D.
Professor, Developmental Pediatrics
Director of Research, Center for Improving the Readiness of Children for Learning and Education (C.I.R.C.L.E.)
Medical School
UT Health Science Center at Houston
Dear Experts.
When I need to check normality I run the Kolmogorov-Smirnov test; if Sig. < 0.05 I conclude the distribution is non-normal, otherwise I conclude it is normal. Please let me know if this is the right way, or whether there is another way.
Thanks.
Hello:
I need to check for bivariate normality but am unclear about how to perform
the procedure in SPSS. I have 33 variables in a data set in which I want to
run factor analysis, but I know there is positive skewness from the
univariate analysis. Here are the specific questions for which I need
advice.
1. Can anyone tell me how to perform the bivariate analysis? If I run a
regression on each pair of variables and request plots, which plots should I
be concerned with? How do I save the residuals? And then, what should I do
with the residuals?
2. If my sample size is approximately 210 (presumably large), should I
not concern myself with multivariate normality?
I have checked the SPSS archives and was unable to find more specific
information on bivariate normality. Thus, I would greatly appreciate your
insights on the topic.
Thanks,
Sealvie
I would suggest that you test multivariate normality using Mardia's PK statistic, which is available in the PRELIS package. There is a fairly large literature on the use and interpretation of this statistic.
HTH,
KS
For personalized and professional consultation in statistics and research
design, visit www.statisticsdoc.com
Sealvie,
1. For a SPSS macro to examine bivariate/multivariate normality have a
look at:
http://www.columbia.edu/~ld208/
2. Also look at:
http://www.stat.umn.edu/~drak0020/classes/5021/labs/bivnorm/
The author of this site (Douglas Drake) provided me with the following
citations:
@article{best:rayn:1988,
Author = {Best, D. J. and Rayner, J. C. W.},
Title = {A Test for Bivariate Normality},
Year = 1988,
Journal = {Statistics \& Probability Letters},
Volume = 6,
Pages = {407--412},
Keywords = {Goodness-of-fit; Skewness; Kurtosis}
}
@article{maso:youn:1985,
Author = {Mason, Robert L. and Young, John C.},
Title = {Re-examining Two Tests for Bivariate Normality},
Year = 1985,
Journal = {Communications in Statistics, Part A -- Theory and Methods},
Volume = 14,
Pages = {1531--1546},
Keywords = {Goodness-of-fit; Ring test; Line test}
}
@article{pett:1979,
Author = {Pettitt, A. N.},
Title = {Testing for Bivariate Normality Using the Empirical
Distribution Function},
Year = 1979,
Journal = {Communications in Statistics, Part A -- Theory and Methods},
Volume = 8,
Pages = {699--712},
Keywords = {Goodness of fit; Cramer-von Mises}
}
@article{vita:1978,
Author = {Vitale, Richard A.},
Title = {Joint Vs. Individual Normality},
Year = 1978,
Journal = {Mathematics Magazine},
Volume = 51,
Pages = {123--123},
Keywords = {Bivariate normal distribution}
}
@article{mard:1975,
Author = {Mardia, K. V.},
Title = {Assessment of Multinormality and the Robustness of
{H}otelling's $T^2$ Test},
Year = 1975,
Journal = {Applied Statistics},
Volume = 24,
Pages = {163--171},
Keywords = {Bivariate distributions; Mahalanobis distance; Multivariate
kurtosis; Multivariate skewness; Non-normality; Permutation
test}
}
@article{kowa:1970,
Author = {Kowalski, Charles J.},
Title = {The Performance of Some Rough Tests for Bivariate Normality
Before and After Coordinate Transformations to Normality},
Year = 1970,
Journal = {Technometrics},
Volume = 12,
Pages = {517--544},
Keywords = {Goodness of fit}
}
3. http://www.stat.nus.edu.sg/~biman/
4. John Marden wrote a paper on the use of various plots. It was to be
published in Statistical Science some time in 2005.
Bob Green
Happy Holidays to All:
I have a data set where none of the variables that I wish to use in my regression analysis follows the normal distribution. Further, some of these variables have extreme outliers (which may account for the violations of normality). What is the best way to deal with these outliers, short of excluding them from the analysis, given that they account for approximately 8% of the data? And can I still run parametric tests even though the assumption of normality has been violated?
Any help would be appreciated.
Hi
The variables you are talking about, are they dependent or independent? For regression models, you don't need normally distributed independent (predictor) variables. Moreover, you don't even need the dependent variable to be normally distributed; what you need is for the residuals to be normally distributed. Build your model, save the residuals and check their normality (either visually, with a histogram, or formally, with the Shapiro-Wilk test).
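A minimal sketch of that workflow (y, x1 and x2 are hypothetical placeholders for your DV and predictors):
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SAVE RESID(res_1).
EXAMINE VARIABLES=res_1
  /PLOT HISTOGRAM NPPLOT.
* NPPLOT adds the normality tests for the saved residuals.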
HTH
Marta
Hello all,
regression analysis has four assumptions, which are: 1) the assumption of
linearity, 2) the assumption of independence, 3) the assumption of constant
variance, and 4) normality.
From a practical viewpoint, how do you test these assumptions? Are there
methods in SPSS that I can use for that? Through the experimental procedure I
have ensured that each measurement is independent by randomization. However,
is there a statistical method that can test whether, or even how well, these
assumptions hold in the data? What about the other three assumptions?
Kind Regards,
Karl
There was a pretty good thread on this general topic, recently, with
subject "Data Screening". You'd do well to look that thread up. I'll
give citations, which are all from that thread; the comments are too
long to post here.
I'll take the assumptions in a convenient order, rather than the order
you've given them.
>4) Normality
The assumption is normal distribution of the residuals, not of the DV
or any IVs. Hector Maletta posted an extensive discussion; date-time is
Wed, 28 Sep 2005 22:08:00 -0300.
>1) the assumption of linearity
Theory may indicate a non-linear relationship, in which case it's
proper to transform the variables so that the theoretically expected
relationship is linear.
If theory is lacking, you usually have to make the assumption of
linearity and live with it, because the data will usually not show any
clear deviation from it. That also means, however, that within the accuracy
of your measurements, any deviations don't matter much.
You can test linearity directly by including higher-order terms,
typically quadratic to start with, in your model, and testing for their
significance as a group (a sketch follows below). However,
- The quadratic terms can be so highly correlated with the linear terms
that the resulting model can't be estimated. There are formal
procedures to handle this, but if you're using only quadratic terms,
it's usually enough to change the measurement origin of the IVs so
their means are near 0, certainly less than 1 SD.
- Adding quadratic terms, with product terms, adds a lot of degrees of
freedom to the model: n(n+1)/2, if you have n IVs in the model. Very
often you won't have enough data for a model that size.
- Unless you have a pretty high R-square in the linear model (sorry, I
can't give you numbers), you have little hope of 'seeing' non-linear
effects.
See also my posting Wed, 28 Sep 2005 20:17:13 -0400, in the cited
thread. (The discussion of non-linearity starts a ways down the post.)
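A minimal sketch of this approach with two hypothetical predictors x1 and x2 (the centering constants 10 and 3 stand in for their sample means): enter the centered linear terms first, then the quadratic and product terms as a second block, and read the R-square-change test for that block:
COMPUTE x1c = x1 - 10.
COMPUTE x2c = x2 - 3.
COMPUTE x1c2 = x1c**2.
COMPUTE x2c2 = x2c**2.
COMPUTE x1x2 = x1c*x2c.
EXECUTE.
REGRESSION
  /STATISTICS COEFF R ANOVA CHANGE
  /DEPENDENT y
  /METHOD=ENTER x1c x2c
  /METHOD=ENTER x1c2 x2c2 x1x2.
* A significant F change for the second block is evidence of departure from linearity.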
>2) the assumptions of independence
You said you "have ensured that each measurement is independent by
randomization". Can you say more what your study is, and how you drew
the samples? And, given that you drew randomly from your available
data, could you instead have used all your available data?
It's also assumed that the residuals are statistically uncorrelated
with the IVs (not necessarily independent). There's nothing much you can do
about that, except adopt the convention that the portion that's correlated
with (explainable by) the IVs is part of the DV, not the residual.
If time is an IV, successive residuals can fail to be independent
because the process hasn't had enough time to change between
measurements. Statistics such as the Durbin-Watson are used for this.
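In SPSS the Durbin-Watson statistic can be requested directly in REGRESSION (y and x1 are hypothetical names):
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1
  /RESIDUALS DURBIN.
* The Durbin-Watson value appears in the Model Summary table.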
Analogous problems can theoretically arise with very closely-spaced
measurements of other IVs, but I don't think this is considered a
common problem in practice.
>3) the assumption of constant variance [of the residuals]
I.e., homoscedasticity. Sometimes theory will indicate higher
measurement errors in different parts of the DV range, which can
sometimes be addressed by transformations. Hector discussed this in his
post I cited above, dated Wed, 28 Sep 2005 22:08:00 -0300.
If you have an ANOVA-like problem, i.e. multiple measurements for the same
values of the IVs, you can check this directly. See the CELLPLOTS keyword of
the PLOT subcommand in MANOVA (Advanced Models module). If you have a variable
that you suspect is a measure of the residual variance, you can use WLS
(Regression Models module).
See also Hector's discussion of residuals, twice cited above.
1) Check visually with plots. Also, if the values of the IV are
repeated several times (for instance, x=2 occurs 3 times, x=3 several times,
and so on), you can run a linearity test using MEANS (I'll give you
more details if you want).
3) You can try White or Breusch-Pagan/Koenker tests
http://www.spsstools.net/Syntax/RegressionRepeatedMeasure/Breusch-PaganAndKoenkerTest.txt
http://www.spsstools.net/Syntax/RegressionRepeatedMeasure/WhiteTestStatisticsAndSignificance.txt
I think both syntaxes are unduly complex (I wasn't a good programmer
when I wrote them, perhaps they can be simplified), but they work.
Technical details for both methods are described here:
http://pages.infinit.net/rlevesqu/spss.htm#Heteroscedasticity
4) Save the residuals:
- If sample size is big (let's say n>100) get a histogram with a
normal curve and check visually for any departures from normality
- If sample size is smaller, then run EXAMINE and ask for the normality
tests (Shapiro-Wilk & Kolmogorov-Smirnov (Lilliefors)). Also, you can
take a look at the skewness/kurtosis coefficients.
HTH,
Marta
Hi to everybody
I got a private request for the syntax to run a linearity test in
regression when you have repeated X values. As I thought more people could
be interested, I'm posting it:
(Q) Can you outline how it works? (Unfortunately,
repeating values of the IV aren't very common.)
(A) They aren't, unless you plan them at the design step.
* The following example gives the reaction times to a visual stimulus
in 15 subjects that have taken a certain dose of alcohol (0/40/80 g).
DATA LIST LIST/ id alcohol rtime (3 F4.0).
BEGIN DATA
1 0 3
2 0 1
3 0 2
4 0 4
5 0 2
6 40 5
7 40 3
8 40 4
9 40 6
10 40 6
11 80 7
12 80 5
13 80 6
14 80 8
15 80 7
END DATA.
VAR LABEL rtime 'Reaction time (ms)'
/alcohol 'Alcohol dose (g)'.
* You can do a standard regression analysis, and also this one.
MEANS TABLES=rtime BY alcohol
/CELLS MEAN COUNT STDDEV
/STATISTICS LINEARITY .
As you'll see when you run the syntax, you get an ANOVA table where
the between-groups variation is further split into linearity (1 df)
and deviation from linearity (k-2 df). The deviation term is non-significant
for these data, showing that the relationship between alcohol dose and
reaction time doesn't deviate from linearity.
I have a non-parametric version of this method, based on Kruskal-Wallis
and Cuzick test for monotonic trend, just in case you are interested.
Regards
Hello
I seem to recall a posting indicating that ANOVA with percentages
required special consideration. I could not locate the thread in the
archives, hence my question. For example, in an educational study
examining differences in the way material is presented, is there a
reason for using either the number correct (or raw score) on an exam or
the percentage correct (entered as a decimal) in the ANOVA analysis? In
the case of the former the data would range from 0 points correct to a
maximum amount if all answers are correct. In the case of percentages
the data would range from 0.0 to 1.0.
Thanks for your help.
Randy Richter
Randy,
I do not see problems with using percentages as the DV in an ANOVA.
The thread you recall was, I believe, about using a binary or dummy variable
as the DV, for which the average is equivalent to the proportion choosing
the "1" alternative instead of the zero.
Hector
There are three things to watch out for that are associated with
assumptions of ANOVA. Normality, of course, but the test is often robust
to this violation. However, independence and homogeneity of variance are
also potential problems. These often go hand in hand with non-normality.
For instance, with percentage data where the mean percent is less than
.25 or > .75, the mean and variance may not be independent. Likewise, if
one group has a mean percentage around .5 and the other around .2, there
may be a difference in the variances as well.
Paul R. Swank, Ph.D.
Professor, Developmental Pediatrics
Director of Research, Center for Improving the Readiness of Children for
Learning and Education (C.I.R.C.L.E.)
Medical School
UT Health Science Center at Houston
Hi,
I've run a multiple linear regression in SPSS with one dependent variable and two independent variables, and all the assumptions were satisfied, but R square is very low, about 0.3. I think that is because my variables are not normally distributed, which is why I was considering transforming my data with a logarithmic transformation towards a normal distribution and repeating the regression, but I don't know how to transform them. And do I have to test any other assumptions after applying the transformation?
Thanks
Razan,
1. Your variables do not need to be normally distributed in order to use
regression, and even less so in order to get a high correlation coefficient.
You are probably confused by the fact that linear regression requires that the
residuals, i.e. the random errors of prediction (the differences between
predicted and observed values), have a normal distribution on both sides of
the regression line.
2. A low or near-zero linear [multiple] correlation coefficient may be due
to (a) the absence of any systematic relationship between your IVs and DV, or
(b) the existence of a relationship which is non-linear. As an example of
(b), if your scatterplot shows a cloud of points with the shape of a U,
there may well be a quadratic relationship even though the linear coefficient
is near zero.
3. The method of least squares used to estimate regression functions is based
on the assumption of a linear relationship between the variables involved.
When the relationship is not linear there are two ways to go: (a) identify the
non-linear function linking the variables and transform it in some way that
yields a linear function, then apply least-squares linear regression; or (b)
approximate the non-linear function by means of non-linear regression or
curve-fitting, which do not use the linear least-squares algorithm. Some
non-linear functions are amenable to linearization, some are not. For
instance, a quadratic equation like y=a+bX+cX^2 can be linearized if you
define a new variable Z=X^2 and use the linear equation y=a+bX+cZ; likewise
the equation y=aX^b can be linearized by taking logarithms, as
log y = log a + b(log X).
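A minimal SPSS sketch of the two linearizations in point 3 (x and y are hypothetical variable names; natural logs are used, but any base works if used consistently):
* Quadratic relationship: y = a + bX + cX^2.
COMPUTE z = x**2.
REGRESSION /DEPENDENT y /METHOD=ENTER x z.
* Power function: y = aX^b, i.e. log y = log a + b*log x.
COMPUTE logy = LN(y).
COMPUTE logx = LN(x).
REGRESSION /DEPENDENT logy /METHOD=ENTER logx.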
4. The fact that a certain mathematical function fits your data is no great
feat. You can always find some function that does that. The trick is finding
a function for which you have a theoretical explanation. So it is not
advisable to go around blindly trying different mathematical functions until
one of them "fits". In fact, you may find several, perhaps an infinite
number of functions that reasonably fit the data, and that is arguably worse
than not having any.
5. If no reasonable function fits the shape of the data, perhaps your data
just show little relationship at all between the variables...
Hector
Hi Mr. Hector,
First of all thank you very much for your quick response.
Razan
Depending on the number of cases you have and the subject-matter area, a
multiple correlation of .55 (R**2 = .3) could even be suspiciously high. What
are your variables? How are they measured? How many cases do you have? How
were they selected?
Art
Art@DrKendall.org
Social Research Consultants
University Park, MD USA Inside the Washington, DC beltway.
(301) 864-5570