John, Mike and Kate get the following percentages in their Maths, Science, English and Music exams:
        Maths   Science   English   Music
John    80      85        60        55
Mike    90      85        70        45
Kate    95      80        40        50
There are 12 scores in total. Each score is one person's exam result in one subject, so a score here is simply the value where a row and a column intersect.
Now let's informally define a principal component:
In the table above, can you easily plot the data in a 2D graph? No, because there are four subjects (which means four variables):
You could plot two subjects in the exact same way you would with x & y co-ordinates in a 2D graph.
You could even plot three subjects in the same way you would plot x, y & z in a 3D graph (though this is generally bad practice, because some distortion is inevitable in the 2D representation of 3D data).
But how would you plot 4 subjects?
At the moment we have four variables, each of which represents just one subject. One way around this is to combine the subjects into, say, just two new variables that we can then plot. This is the idea behind dimensionality reduction techniques such as multidimensional scaling.
Principal component analysis is a form of multidimensional scaling. It is a linear transformation of the variables into a lower-dimensional space that retains the maximal amount of information about the variables. In this example, that would let us look at the types of subjects each student is perhaps more suited to.
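Concretely, each new variable is a weighted sum of the four subject percentages. As a rough sketch (the weight symbols w and the label PC_k are notation introduced here for illustration, not part of the original example):

```latex
\mathrm{PC}_k = w_{1k}\,\text{Maths} + w_{2k}\,\text{Science} + w_{3k}\,\text{English} + w_{4k}\,\text{Music}, \qquad k = 1, 2
```

The analysis chooses the weights; the two resulting values per student are what gets plotted.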
A principal component is therefore a linear combination of the original variables. In R, this is:
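# Exam percentages: one row per student (John, Mike, Kate), one column per subject
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80), English = c(60, 70, 40), Music = c(55, 45, 50))
prcomp(DF, scale. = FALSE)  # columns are centered by default but not rescaled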
This will give you something like the following (only the first two principal components are shown, for the sake of simplicity):
           PC1           PC2
Maths      0.27795606    0.76772853
Science   -0.17428077   -0.08162874
English   -0.94200929    0.19632732
Music      0.07060547   -0.60447104
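If you want to work with these weights rather than just read them off the printout, they can be pulled out of the fitted object. The following is a minimal sketch, assuming the same data as above; the variable names `pca`, `loadings` and `gram` are chosen here for illustration. It also checks a general property of PCA weights: each column has unit length and the columns are orthogonal to each other. Note that the sign of a whole column is arbitrary, so your output may show a component with every sign flipped.

```r
# Same exam data as in the example above
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))

pca <- prcomp(DF, scale. = FALSE)

# The weights (loadings) live in the `rotation` element; keep the first two columns
loadings <- pca$rotation[, 1:2]
print(loadings)

# t(loadings) %*% loadings should be close to the 2x2 identity matrix
gram <- crossprod(loadings)
print(round(gram, 6))
```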
So what is a Principal Component Score?
It's a score from the table at the end of this post.
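The output from R means we can now plot each person's scores across all subjects in a 2D graph. Multiply each percentage by the weight for its subject (taken here to about two decimal places) and add up the results, once with the PC1 weights (x) and once with the PC2 weights (y):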
        x                                            y
John    0.28*80 + -0.17*85 + -0.94*60 + 0.07*55      0.77*80 + -0.08*85 + 0.19*60 + -0.60*55
Mike    0.28*90 + -0.17*85 + -0.94*70 + 0.07*45      0.77*90 + -0.08*85 + 0.19*70 + -0.60*45
Kate    0.28*95 + -0.17*80 + -0.94*40 + 0.07*50      0.77*95 + -0.08*80 + 0.19*40 + -0.60*50
Which simplifies to:
        x        y
John    -44.6    33.2
Mike    -51.9    48.8
Kate    -21.1    44.35
There are six principal component scores in the table above. You can now plot the scores in a 2D graph to get a sense of the type of subjects each student is perhaps more suited to.
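The hand calculation can be reproduced in R. What follows is a minimal sketch, assuming the same data as before; the names `raw_scores`, `centred_scores` and `shift` are made up for this illustration. One caveat: the tables above multiply the raw percentages by the weights, whereas `prcomp` subtracts each subject's mean first, so the scores it stores in `pca$x` differ from the hand-computed ones by a constant shift per component. The relative positions of the three students in the plot are the same either way.

```r
# Same exam data, this time with the students as row names
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50),
                 row.names = c("John", "Mike", "Kate"))

pca <- prcomp(DF, scale. = FALSE)
loadings <- pca$rotation[, 1:2]

# Scores as computed by hand above: raw percentages times the weights
raw_scores <- as.matrix(DF) %*% loadings

# Scores as stored by prcomp: mean-centered percentages times the weights
centred_scores <- pca$x[, 1:2]

# The two differ only by this per-component constant
shift <- colMeans(DF) %*% loadings

print(raw_scores)
print(centred_scores)

# Plot the six scores as three labelled points in 2D
plot(raw_scores, type = "n", xlab = "PC1", ylab = "PC2")
text(raw_scores, labels = rownames(raw_scores))
```

Because the full-precision weights are used instead of two-decimal approximations, the printed values will be close to, but not exactly, the figures in the table.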