Sample Size for Factor Analysis?

2014-03-29

Hello, I was asked to do a factor analysis of 40 variables, but I only have 70 cases. Needless to say, I had to increase the iterations to 100 to get the program to converge, and I still believe that it makes no sense to do a factor analysis with fewer than 2 cases per variable. I was then asked to provide a citation for that. Could someone point me to a source discussing the minimum cases-per-variable requirement for factor analysis that I can cite? Thanks a lot.

All replies

ReneeBK

2014-3-29 10:10:25

Comrey & Lee (1992, A first course in factor analysis) give as a guide sample sizes of:

50 as very poor
100 as poor
200 as fair
300 as good
500 as very good
1000 as excellent for factor analysis.

Tabachnick & Fidell (Using multivariate statistics, 4th ed) recommend at least 300 cases.
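For anyone who wants to apply the Comrey & Lee guide quickly, here is a minimal Python sketch. The function name `comrey_lee_label` is made up for illustration, and it assumes that a sample falling between two guide values takes the lower label, which the quoted list does not actually specify.

```python
# Comrey & Lee (1992) guide values as quoted above: (minimum N, label).
COMREY_LEE = [
    (1000, "excellent"),
    (500, "very good"),
    (300, "good"),
    (200, "fair"),
    (100, "poor"),
    (50, "very poor"),
]

def comrey_lee_label(n_cases: int) -> str:
    """Return the label of the highest guide value that n_cases reaches.

    Assumption (not from the source): a sample between two guide
    values takes the lower of the two labels.
    """
    for threshold, label in COMREY_LEE:
        if n_cases >= threshold:
            return label
    return "below the 'very poor' mark"

print(comrey_lee_label(70))   # the original poster's sample
print(comrey_lee_label(300))  # Tabachnick & Fidell's suggested minimum
```

Under that reading, the 70 cases in the original question land in the "very poor" band.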

ReneeBK

2014-3-29 10:10:39

Perhaps not the most authoritative citation, but see the APA publication edited by Grimm and Yarnold, Reading and Understanding Multivariate Statistics, 8th Ed., 2003, Washington DC, page 100. There it is referred to as the subjects-to-variables ratio (STV): "the minimum number of observations in one's sample should be at least five times the number of variables."
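The arithmetic behind the STV rule is simple enough to sketch in Python. The helper names `stv_ratio` and `min_cases` are hypothetical; the 5:1 default is the rule quoted above, and the 70 cases and 40 variables come from the original question.

```python
import math

def stv_ratio(n_cases: int, n_variables: int) -> float:
    """Subjects-to-variables ratio."""
    return n_cases / n_variables

def min_cases(n_variables: int, cases_per_variable: float = 5.0) -> int:
    """Smallest whole number of cases that satisfies a cases-per-variable rule."""
    return math.ceil(n_variables * cases_per_variable)

print(stv_ratio(70, 40))      # ratio for the original question
print(min_cases(40))          # cases needed at 5 per variable
print(min_cases(40, 10.0))    # cases needed at a stricter 10 per variable
```

So 70 cases on 40 variables gives a ratio of 1.75, against 200 cases needed to reach 5:1.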

ReneeBK

2014-3-29 10:11:10

I would look at the article by MacCallum et al. in Psychological Methods, as well as some in Multivariate Behavioral Research (MBR), that show problems with rules of thumb for EFA. One needs to take into account scaling issues, over/under-determination, communalities/saturation, etc.

Robert Marshall <marshall_pmp@comcast.net> wrote: Perhaps not the most authoritative citation, but see the APA publication edited by Grimm and Yarnold, Reading and Understanding Multivariate Statistics, 8th Ed., 2003, Washington DC, page 100. There it is referred to as the subjects-to-variables ratio (STV): "the minimum number of observations in one's sample should be at least five times the number of variables."

ReneeBK

2014-3-29 10:12:28

I do not remember a specific citation, but the general idea is that factor analysis is a derivation of regression, and regression rests on the normal distribution of estimation errors. This approximate normality comes from the central limit theorem (it is often loosely credited to "the law of large numbers") and is a tendency shown by errors as N, or more exactly the "degrees of freedom", gets larger and larger. The degrees of freedom equal the number of cases minus the number of variables minus one, N-k-1, which in your case is quite small. Because your cases are few, the margin of error of your estimates will be very wide, and you cannot be sure of their probable true values in the universe or population, especially for minor factors after the first or second one, where the coefficients or loadings will be close to zero (so it may be difficult to tell whether they differ from zero in the population).

An old rule of thumb says you need at the very least 10 cases per variable, but this is "the very least". With fewer than 30-50 cases, experimental error distributions hardly ever resemble a normal curve. So my advice is to try a model with fewer variables, possibly one underlying factor if your 40 variables are mostly explained by one overarching factor, or to abandon factor analysis altogether and try some more modest approaches like a simple summated scale, simple regression, 2- or 3-way cross-tabulations, and the like. Next time, go bigger in your sample design. And then again, do you really have a theory so complex that it requires no fewer than 40 independent factors? Isaac Newton explained the universe with only two or three variables, and did very well indeed, thank you.
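To put a rough number on how wide the margin of error gets with few cases, here is a small Python sketch. It is not a standard error for factor loadings; it only shows the approximate 95% confidence interval of a single Pearson correlation (the raw material of a factor analysis) using the Fisher z-transform. The function name `correlation_ci` and the example value r = 0.30 are chosen for illustration.

```python
import math

def correlation_ci(r: float, n: int, z_crit: float = 1.96) -> tuple:
    """Approximate confidence interval for a Pearson r via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("need more than 3 cases")
    z = math.atanh(r)                 # Fisher z of the observed correlation
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Degrees of freedom N-k-1 for the original question (70 cases, 40 variables).
print("df =", 70 - 40 - 1)

for n in (70, 300, 1000):
    lo, hi = correlation_ci(0.30, n)
    print(f"N={n:4d}: r=0.30, approx. 95% CI ({lo:.2f}, {hi:.2f})")
```

With 70 cases an observed correlation of 0.30 is compatible with anything from roughly 0.07 to 0.50, which is why small loadings on minor factors are so hard to pin down.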

ReneeBK

2014-3-29 10:13:40

I have been following this discussion with much interest, as I have a
similar problem at hand.
For years, we have been conducting a consumer satisfaction survey that
consists of one page, about 10 questions, plus a single open-ended
question. Although the questions were intended to probe consumer
satisfaction in a number of different areas, basically the level of
correlation is so high that it seems that we're really only tracking one
factor: overall satisfaction.

So we conducted literature reviews, and went back to the drawing boards,
formulating more than 100 questions in 6 broad areas of consumer
satisfaction. Our intention was to pilot test these questions with
participants, examine the results, throw out the redundant questions
(discerned through factor analysis), and emerge with, say, 20 questions
known to reflect different dimensions of consumer satisfaction. However,
our sample size thus far is in the pitiful range: perhaps 35 respondents.
Needless to say, we have a long way to go. With our response rates, and
consumer base, we would be lucky to get more than 100 respondents in a year.

In order to improve the subjects to variables ratio (STV), we need either
to greatly increase the sample size (which is difficult for us to do), or
reduce the number of variables, or both. Our questions are short simple
statements requesting responses on a 5-point Likert scale. Some of the
questions are worded in almost identical language, and some of these are
almost certainly redundant. Given our relatively small sample size thus
far, what is the best way to proceed to remove redundant questions while
retaining maximum diversity of responses?

From one perspective, it would appear that rank correlations might be the
preferred measure of association, but I wonder if Likert scales are,
analytically speaking, equivalent to rank order variables? What other
measures would be most appropriate? I hesitate to downgrade the measure of
association to categorical, because that throws out the information on
directionality and degree. Likewise, I hesitate to overgrade the measure of
association to ratio, because clearly the intervals are arbitrary and not
additive.

Intuitively, I am seeking to extract, out of these 100 questions, 4-5
groups of 2-3 questions each, such that within-group correlations are high,
but correlations with the other groups are low. The within-group redundancy
reinforces degree of satisfaction with that particular factor, and the low
between-group correlation assures that different aspects of satisfaction
are represented.

Suggestions, please?
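One possible way to screen for near-duplicate questions with rank correlations is sketched below in Python. Everything here is illustrative: the item names, the eight made-up respondents, and the 0.8 cutoff are assumptions, and with only about 35 real respondents any cutoff would be shaky. Ties, which are unavoidable on a 5-point scale, get average ranks.

```python
import math
from itertools import combinations

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def redundant_pairs(items, threshold=0.8):
    """items: dict of item name -> responses (same respondents, same order).

    Only strong positive correlations are flagged; reverse-worded items
    would need to be recoded first.
    """
    flagged = []
    for a, b in combinations(items, 2):
        rho = spearman(items[a], items[b])
        if rho >= threshold:
            flagged.append((a, b, round(rho, 2)))
    return flagged

# Made-up Likert responses from 8 respondents, for illustration only.
responses = {
    "q1": [5, 4, 4, 2, 1, 3, 5, 2],
    "q2": [5, 4, 5, 2, 1, 3, 4, 2],   # worded almost like q1
    "q3": [1, 5, 2, 4, 3, 3, 2, 5],
}
print(redundant_pairs(responses))
```

Pairs that come out flagged are candidates for dropping one of the two questions; which one to keep is a judgment about wording, not something the correlation decides.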

ReneeBK

2014-3-29 10:14:03

In this kind of case my main suggestion is to forget about factor analysis and simply add up the number of "correct" answers. If all questions are highly correlated and clearly measure various aspects of overall satisfaction, subtle differences in weighting (provided by factor analysis) would not matter much, and would probably vary from one sample to the next. So go ahead with a no-weight (i.e. equal-weight) scale and relax. You can check whether this simple additive score still correlates well with individual questions, and with other (external) indicators associated with satisfaction (such as returning for more), but assuming all goes well, the simple scale is easier to compute, easier to explain, and lacks the many statistical pitfalls of factor analysis and regression. It only lacks the false pretenses of scientificity that come from mere difficulty or sophistication. Some people live off being difficult, and get famous just because of that, e.g. some postmodern "philosophes", but you had better not care much about that.
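A minimal Python sketch of the equal-weight scale and the check against individual questions follows. The data are invented, and the function names are hypothetical. One choice here goes slightly beyond the post: each item is correlated with the sum of the other items (the "corrected" item-total correlation), so that an item does not inflate its own correlation with the total.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sum_scores(rows):
    """Equal-weight scale: add up each respondent's answers."""
    return [sum(row) for row in rows]

def corrected_item_total(rows):
    """Correlate each item with the sum of the remaining items."""
    totals = sum_scores(rows)
    result = []
    for j in range(len(rows[0])):
        item = [row[j] for row in rows]
        rest = [t - v for t, v in zip(totals, item)]
        result.append(pearson(item, rest))
    return result

# Made-up data: 6 respondents (rows) by 4 satisfaction items (columns).
rows = [
    [5, 4, 5, 4],
    [4, 4, 4, 5],
    [2, 1, 2, 2],
    [3, 3, 2, 3],
    [1, 2, 1, 1],
    [4, 5, 4, 4],
]
print(sum_scores(rows))
print([round(r, 2) for r in corrected_item_total(rows)])
```

An item whose corrected item-total correlation is much lower than the rest is a sign that it may not belong on the same scale.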

ReneeBK

2014-3-29 10:23:46

You can try to use Dwyer's extension analysis. You start by creating a set of homogeneous item packages or parcels: combine sets of 2-4 items into new scales by reviewing the item correlations (combine those items with the highest inter-item correlations). Then factor analyze the item parcels; you will have reduced the number of variables in the factor analysis to about 10-15 (instead of 40), so convergence and iterations should behave better. Rotate and then use the Dwyer extension procedure described in Gorsuch (1983), Factor Analysis (2nd Ed.), pages 236-238. Essentially, the factor solution of the parcels is projected onto the original set of items. You'll get your factor structure and pattern matrix (if you rotate obliquely) of the 40 items.

If you need some background on item parceling, you can find out more about it by searching "item parcels." I know their use is controversial. You can also check up on Andrew Comrey's work in developing his personality inventory and Ray Cattell's work.
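The parceling step could look roughly like the Python sketch below. It is a simplification and not the procedure from Gorsuch: it only forms pairs (the post suggests 2-4 items per parcel), it pairs items greedily by their highest correlation, and it does not attempt the Dwyer extension itself. Item names and data are made up.

```python
import math
from itertools import combinations

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def greedy_parcels(items):
    """Pair up items, always taking the most correlated pair still available.

    items: dict of item name -> scores (same respondents, same order).
    """
    pairs = sorted(
        combinations(items, 2),
        key=lambda p: pearson(items[p[0]], items[p[1]]),
        reverse=True,
    )
    used, parcels = set(), []
    for a, b in pairs:
        if a not in used and b not in used:
            parcels.append((a, b))
            used.update((a, b))
    return parcels

def parcel_scores(items, parcels):
    """Average the member items of each parcel, respondent by respondent."""
    return {
        "+".join(p): [sum(v) / len(v) for v in zip(*(items[name] for name in p))]
        for p in parcels
    }

# Made-up scores for 4 items from 6 respondents, for illustration only.
items = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 4, 6, 5],
    "c": [3, 1, 4, 1, 5, 2],
    "d": [3, 1, 4, 2, 5, 2],
}
parcels = greedy_parcels(items)
print(parcels)
print(parcel_scores(items, parcels))
```

The parcel scores would then go into the factor analysis in place of the individual items.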

ReneeBK

2014-3-29 10:26:17

In addition to the recommended ratios of 10 to 20 people per variable, the following has also been suggested:

"Some Monte Carlo simulation research (Guadagnoli & Velicer, 1988) suggest ... replicable factors tend to be estimated if:
1. factors are each defined by four or more measured variables with structure coefficients each greater than .6 [in absolute value], regardless of sample size; or
2. factors are each defined with 10 or more structure coefficients each around .4 [in absolute value], if sample size is greater than 150; or
3. sample size is at least 300." (Thompson, 2004, p. 24)

Linda

Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. Washington, DC: American Psychological Association.
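The three conditions quoted from Thompson can be written as a small Python check. This is only a sketch: the function name is hypothetical, "around .4" is interpreted here as "at least .4", and the example coefficients are invented.

```python
def likely_replicable(structure_coefficients, n_cases):
    """Apply the three quoted conditions; any one of them is sufficient.

    structure_coefficients: one list of coefficients per factor.
    Assumption: "around .4" is treated as an absolute value of at least 0.4.
    """
    def count_above(factor, cutoff, strict):
        if strict:
            return sum(1 for c in factor if abs(c) > cutoff)
        return sum(1 for c in factor if abs(c) >= cutoff)

    rule1 = all(count_above(f, 0.6, True) >= 4 for f in structure_coefficients)
    rule2 = n_cases > 150 and all(
        count_above(f, 0.4, False) >= 10 for f in structure_coefficients
    )
    rule3 = n_cases >= 300
    return rule1 or rule2 or rule3

# Invented coefficients for two factors, for illustration only.
strong = [[0.70, 0.65, -0.80, 0.61, 0.20], [0.72, 0.68, 0.75, 0.66]]
weak = [[0.50, 0.45, 0.40], [0.55, 0.30, 0.42]]

print(likely_replicable(strong, 70))   # rule 1 applies at any sample size
print(likely_replicable(weak, 70))
print(likely_replicable(weak, 300))    # rule 3 applies
```

Note that rule 1 can only be checked after the analysis has been run, since it depends on the coefficients actually obtained.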

ReneeBK

2014-3-29 10:28:02

I have over 100 variables (potential questions for a survey), and so far only about 30 pilot test responses.

One thought that occurs to me is that our 100 variables actually fall into half a dozen groups. Each group of questions was designed to elicit a particular dimension of consumer satisfaction. Rather than attempting to run a factor analysis on all 100+ variables at once, with so few cases, would it make more sense to:

1. run the factor analysis on one group of questions at a time;
2. reduce the group to one or two questions with the highest loadings on the principal component;
3. repeat the above procedure for each group of questions;
4. finally, conduct a factor analysis on the reduced set of variables to test the hypothesis that consumer satisfaction as reflected in this set of questions really is multidimensional?
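The first steps of the procedure above could be sketched like this in Python with NumPy. The data are simulated (30 respondents, one group of five items driven by a single latent score), the function name is hypothetical, and the "loading" is taken to be the item's correlation with the first principal component of the group's correlation matrix. With only about 30 cases the ranking of items can easily change from sample to sample.

```python
import numpy as np

def top_items_by_first_component(data, names, keep=2):
    """Return the `keep` items with the largest absolute first-component loadings.

    data: cases x items array for ONE group of questions.
    """
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # eigenvalues in ascending order
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])   # first principal component
    order = np.argsort(-np.abs(loadings))              # largest |loading| first
    return [(names[i], float(abs(loadings[i]))) for i in order[:keep]]

# Simulated group of 5 items; noisier items track the latent score less well.
rng = np.random.default_rng(0)
latent = rng.normal(size=30)
noise_sd = np.array([0.3, 0.3, 1.0, 1.5, 2.0])
group = latent[:, None] + rng.normal(size=(30, 5)) * noise_sd
names = ["q1", "q2", "q3", "q4", "q5"]

print(top_items_by_first_component(group, names, keep=2))
```

Repeating this for each group and then analysing the retained items together would correspond to the last two steps; that final analysis is not shown here.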

The guiding theory here is that consumer satisfaction has multiple components. Each group of questions is designed to elicit degree of satisfaction with a particular dimension of consumer experience suggested in the literature. There is a great deal of overlap in the language of the questions, as we seek to identify the language that has resonance with our consumers. Our goal is to develop a consumer satisfaction instrument for our agency that is genuinely multidimensional, allowing the agency to get a better idea of where improvements are most needed. Our current instrument is short, and seems to address different issues, but the answers we get are so highly correlated that we really only seem to be measuring global satisfaction, which is really not a very useful result.

ReneeBK

2014-3-29 10:29:00

I would like to come back to my question: how do you reduce the complexity of a large set of variables if you have few cases?

This happens in comparative political science all the time when you have countries as cases and a large set of variables that describe them.

I now have a set of some 20 countries in Europe. If you study the EU member states at an aggregate level today you have 27 countries. There are no more member states. I have even fewer cases due to unequal coverage of the countries in my sources (the OECD data do not survey the same countries as the EU, the European Social Survey, etc.). At the same time I have a large set of variables describing the economic, social, and cultural structure of the same 20 countries. So how can I find a pattern in the variables if the condition of 10 cases per variable for a sound factor analysis is not met?

A second question: Factor analysis does not print the KMO or the anti-image correlation (AIC) info, even if I demand all stats in the print command. Is this due to the low number of cases? How can I force SPSS to print the KMO or the AIC info?

ReneeBK

2014-3-29 10:29:32

Some kludges.

Create meaningful subsets of the variables.

Sidestep the question about whether the obtained matrices are reasonable representations of the population matrix.
IFF you want to consider the 27 countries the total population about which you wish to make statements, then take a large dose of salt, hold your nose, and pretend that the obtained correlation matrix IS the population matrix. Write out the matrix products (means, SDs, Rs) and read them back in, faking the number of cases. Use unit weights to create summative scores of standardized item variables.

Create a few nominal-level variables that record cluster membership, obtained by clustering the countries on each of the subsets mentioned above. Add an additional cluster identifier for cases that do not have the variables needed to create the cluster. Each membership value in a clustering would stand for a meaningful profile.

Relate the cluster memberships to each other with CROSSTABS, CATPCA and TWOSTEP, treating the membership variables as nominal level.

Create choropleth (patch) maps of the memberships. Try different coordinate systems, including weighting visual area by population.

Relate the cluster memberships to variables that were not used to create that clustering. E.g., relate industrial clusters to housing variables, etc.

ReneeBK

2014-3-29 10:29:52

Another approach you might consider is Partial Least Squares. This is useful for both categorical and continuous (scale) dependent variables. This is available in SPSS Statistics v16 or 17 as an add-in via programmability that can be downloaded from Developer Central (www.spss.com/devcentral). Of course, you don't get all the inferential apparatus of traditional regression methods, but it has the advantage of finding best combinations of predictors for particular dependent variables.

HTH,
Jon Peck

ReneeBK

2014-3-29 10:30:56

Resolution number: 20414  Created on: Aug 21 2001  Last Reviewed on: Feb 28 2009

Problem Subject:  FACTOR does not print KMO or Bartlett test for Nonpositive Definite Matrices

Problem Description:  I have run the SPSS FACTOR procedure with principal components analysis (PCA) as the extraction method. I requested the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and the Bartlett test of sphericity but neither of these measures was printed. The "Communalities", "Total Variance Explained" and "Component Matrix" tables were printed. Why was my request for KMO and Bartlett's sphericity test ignored?

Resolution Subject: KMO, Bartlett's sphericity, and anti-image correlation not printed for nonpositive definite matrices

Resolution Description:
It is likely the case that your correlation matrix is nonpositive definite (NPD), i.e., that some of the eigenvalues of your correlation matrix are not positive numbers. If this is the case, there will be a footnote to the correlation matrix that states "This matrix is not positive definite." Even if you did not request the correlation matrix as part of the FACTOR output, requesting the KMO or Bartlett test will cause the title "Correlation Matrix" to be printed. The footnote will be printed under this title if the correlation matrix was not requested. An NPD matrix will also result in suppression of other output from the 'Descriptives' dialog of the Factor dialog, namely the inverse of the correlation matrix, the anti-image correlation matrix, and the significance values for the correlations. If you had requested a factor extraction method other than PCA or unweighted least squares (ULS), an NPD matrix would have caused the procedure to stop without further analysis.

Matrices can be NPD as a result of various other properties. A correlation matrix will be NPD if there are linear dependencies among the variables, as reflected by one or more eigenvalues of 0. For example, if variable X12 can be reproduced by a weighted sum of variables X5, X7, and X10, then there is a linear dependency among those variables and the correlation matrix that includes them will be NPD. If there are more variables in the analysis than there are cases, then the correlation matrix will have linear dependencies and be NPD. Remember that FACTOR uses listwise deletion of cases with missing data by default. If you had more cases in the file than variables in the analysis but also had many missing values, listwise deletion could leave you with more variables than retained cases. Pairwise deletion of missing data can also lead to NPD matrices. Negative eigenvalues may be present in these situations. See the following chapter for a helpful discussion and illustration of how this can happen.

Wothke, W. (1993) Nonpositive definite matrices in structural modeling. In K.A. Bollen & J.S. Long (Eds.), Testing Structural Equation Models. Newbury Park NJ: Sage. (Chap. 11, pp. 256-293).
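Both routes to a linear dependency described above can be reproduced outside SPSS. The following Python/NumPy sketch uses simulated data and arbitrary sizes and weights (30 cases on 40 variables, and a fourth variable built from three others); it is only an illustration, not what FACTOR does internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # simulated data, for illustration only

# Case 1: more variables (40) than cases (30).
wide = rng.normal(size=(30, 40))
corr_wide = np.corrcoef(wide, rowvar=False)
print("rank:", np.linalg.matrix_rank(corr_wide), "of", corr_wide.shape[0])
print("smallest eigenvalue:", np.linalg.eigvalsh(corr_wide).min())

# Case 2: plenty of cases (100), but one variable is a weighted sum of three others.
x = rng.normal(size=(100, 3))
x_dep = 0.5 * x[:, 0] + 0.3 * x[:, 1] - 0.2 * x[:, 2]
corr_dep = np.corrcoef(np.column_stack([x, x_dep]), rowvar=False)
print("smallest eigenvalue:", np.linalg.eigvalsh(corr_dep).min())
print("determinant:", np.linalg.det(corr_dep))
```

In both cases the smallest eigenvalue and the determinant are zero up to floating-point rounding, so the correlation matrix cannot be inverted.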

Elements of the KMO and Bartlett test statistic cannot be calculated if the correlation matrix is NPD. See the formulae for these statistics in the current Statistical Algorithms documentation by clicking Help->Algorithms in SPSS, then scrolling down to the link for Factor Algorithms. Then click the link for Optional Statistics. The formulae are also on page 20 of the Factor chapter at
http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/factor.pdf

The Bartlett formula includes the log of the determinant of the correlation matrix. If there are linear dependencies, then the determinant of the matrix will be 0 and its log will be undefined. The KMO measure formula includes elements of the anti-image covariance matrix, whose calculation involves the inverse of the correlation matrix. If the correlation matrix has linear dependencies, then its inverse cannot be computed.

Apart from the inability to print the KMO or Bartlett's test, the presence of an NPD correlation matrix may lead you to rethink the choice of variables or attempt to acquire data on a larger sample to achieve more reliable results.
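To see why both statistics break down, here is a Python/NumPy sketch using the usual textbook formulas: Bartlett's chi-square as -(N - 1 - (2p + 5)/6) times ln|R| with p(p - 1)/2 degrees of freedom, and KMO as the sum of squared correlations divided by that sum plus the sum of squared partial correlations. It has not been checked line by line against the SPSS algorithms document, and the function names, tolerance, and example matrices are assumptions.

```python
import numpy as np

def _require_positive_definite(corr, tol=1e-10):
    # A zero or negative eigenvalue means ln|R| and the inverse are unusable.
    if np.linalg.eigvalsh(corr).min() < tol:
        raise ValueError("correlation matrix is not positive definite")

def bartlett_sphericity(corr, n_cases):
    """Bartlett's test statistic and degrees of freedom (textbook formula)."""
    _require_positive_definite(corr)
    p = corr.shape[0]
    _, logdet = np.linalg.slogdet(corr)
    chi2 = -(n_cases - 1 - (2 * p + 5) / 6) * logdet
    return float(chi2), p * (p - 1) // 2

def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    _require_positive_definite(corr)
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)          # partial (anti-image) correlations
    off = ~np.eye(corr.shape[0], dtype=bool) # off-diagonal elements only
    r2 = np.sum(corr[off] ** 2)
    a2 = np.sum(partial[off] ** 2)
    return float(r2 / (r2 + a2))

# An invertible example matrix and a singular one (first two variables identical).
ok = np.array([[1.0, 0.6, 0.5], [0.6, 1.0, 0.4], [0.5, 0.4, 1.0]])
singular = np.array([[1.0, 1.0, 0.5], [1.0, 1.0, 0.5], [0.5, 0.5, 1.0]])

print(bartlett_sphericity(ok, n_cases=100))
print(kmo(ok))
try:
    bartlett_sphericity(singular, n_cases=30)
except ValueError as err:
    print("singular matrix:", err)
```

The explicit eigenvalue check is there because, in floating point, the determinant of a singular matrix often comes out as a tiny nonzero number rather than exactly 0.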

tianwk

2020-2-23 13:37:26

thanks for sharing
