How to calculate PCA (Principal Components Analysis) by hand?

How to calculate PCA (Principal Components Analysis) by hand?


This image has an empty alt attribute; its file name is pca-9.jpg

When you conduct PCA (Principal Components Analysis), do you simply accept the result which software programs provide? If we just accept results without any doubts, we never understand the principle of PCA. In this time, I’ll introduce how PCA is calculated step by step, and if you read this post, I believe you can fully understand the concept of PCA. Here is one data. Let’s say we measured kernel number per ear (KN), average of kernel weight per ear (KW) and cob dry weight per ear (Cob weight) about five corn genotypes.

GenotypeKNKWCob weight
A616335.5242.5
B576359.8237.2
C615335.6234.3
D590357.8243.8
E606333.3232.6

If you follow step by step, it’s a simple mathematics in middle school.



1. Data normalization

Using mean and standard deviation, we can convert each data to Z value.

GenotypeKNKWCob weight
A616335.5242.5
B576359.8237.2
C615335.6234.3
D590357.8243.8
E606333.3232.6
Mean600.6344.4238.1
Stdev17.313.24.9

As an example, I show the formula how 0.90 and -1.11 were calculated in Excel. If you follow this equation z = (x - µ)/σ , simply we can calculate the standardized (or normalized) data.

GenotypeKNKWCob weight
A0.89-0.670.90
B-1.431.17-0.18
C0.83-0.67-0.77
D-0.611.021.16
E0.31-0.84-1.11
Mean000
Stdev111


2. Calculating covariance

If we finished data normalization, the next step is to calculate covariance among variables. I calculated covariance in Excel.

I leave formula in Excel below for clear understanding.

#Excel formula 
KN ↔ KN =(((B2-B7)*(B2-B7))+((B3-B7)*(B3-B7))+((B4-B7)*(B4-B7))+((B5-B7)*(B5-B7))+((B6-B7)*(B6-B7)))/(5-1) = 1.00

KN ↔ KW =(((B2-B7)*(C2-C7))+((B3-B7)*(C3-C7))+((B4-B7)*(C4-C7))+((B5-B7)*(C5-C7))+((B6-B7)*(C6-C7)))/(5-1) =-0.93

KN ↔ Cob weight =(((B2-B7)*(D2-D7))+((B3-B7)*(D3-D7))+((B4-B7)*(D4-D7))+((B5-B7)*(D5-D7))+((B6-B7)*(D6-D7)))/(5-1) =-0.17

KW ↔ KN =(((B2-B7)*(C2-C7))+((B3-B7)*(C3-C7))+((B4-B7)*(C4-C7))+((B5-B7)*(C5-C7))+((B6-B7)*(C6-C7)))/(5-1) =-0.93

KW ↔ KW =(((C2-C7)*(C2-C7))+((C3-C7)*(C3-C7))+((C4-C7)*(C4-C7))+((C5-C7)*(C5-B7))+((C6-B7)*(C6-C7)))/(5-1) = 1.00

KW ↔ Cob weight =(((C2-C7)*(D2-D7))+((C3-C7)*(D3-D7))+((C4-C7)*(D4-D7))+((C5-C7)*(D5-D7))+((C6-C7)*(D6-D7)))/(5-1) =0.45

Cob weight ↔ KN =(((B2-B7)*(D2-D7))+((B3-B7)*(D3-D7))+((B4-B7)*(D4-D7))+((B5-B7)*(D5-D7))+((B6-B7)*(D6-D7)))/(5-1) =-0.17

Cob weight ↔ KW =(((C2-C7)*(D2-D7))+((C3-C7)*(D3-D7))+((C4-C7)*(D4-D7))+((C5-C7)*(D5-D7))+((C6-C7)*(D6-D7)))/(5-1) =0.45

Cob weight ↔ Cob weight =(((D2-D7)*(D2-D7))+((D3-D7)*(D3-D7))+((D4-D7)*(D4-D7))+((D5-D7)*(D5-D7))+((D6-D7)*(D6-D7)))/(5-1) = 1.00

However, is that really necessary to calculate covariance like this? If you fully understand covariance, you’re familiar with below equation I suggest.

If we divide covariance by multiplication of standard deviation of x and y, it will be correlation (r). That is, the below correlation formula we usually know is the same as what I suggested above.

Let’s think about this. We already normalized data, which means standard deviation is 1. Then, the below equation would be possible as standard deviation of x and y is 1.

cov (x,y) = r

For more details, please visit below websites and YouTube video.


Simple linear regression (1/5)- correlation and covariance



It means when data is normalized, covariance and correlation will be the same. In Excel, we can easily obtain correlation.

If you follow above steps, you can obtain correlation like below.

Let’s copy and paste each value in the same relationship to the blank cells, and this result will be the same as what we calculated by hand.



3. Eigenvector and Eigenvalue

From here, we need to understand the concept of vector. However, it would not be necessary to fully understand what it is. Instead, let’s grab the concept we need to analyze PCA.

Eigenvalue is called lamda (λ), and here is a formula

Ax=λx

and this formula is also explained as below

(A−λI)x=0

For example, with A matrix and λI matrix, we can think the right matrix. Then, the covariance table we calculated will be expressed as below.

This is 3×3 matrix. Unlike 2×2 matrix, 3×3 matrix will be calculated as below

First, let’s calculate eigenvalue, λ

{(1-λ){(1-λ)(1-λ)-(0.450.45)} - {-0.93((-0.93(1-λ))-(0.45-0.16))} + {-0.16((-0.930.45)-((1-λ)*-0.16))} = 0

Now we need to find λ. We might be able to primitively calculate in our notes, but let’s work smart!! Using a simple R code, we can easily obtain λ.

matrix <- matrix(c(1, -0.93, -0.16, -0.93, 1, 0.45, -0.16, 0.45, 1), 3, 3, byrow=TRUE)
matrix 

Now I generated covariance table as 3×3 matrix. Then, I use the below R code.

ev <-eigen(matrix)
values <- ev$values
values

Now eigenvalue was calculated.

λ = 2.10205207, 0.87571854, 0.02222939

To verify this value is correct, I’ll substitute those λ values in (1-λ){(1-λ)(1-λ)-(0.450.45)} - {-0.93((-0.93(1-λ))-(0.45-0.16))} + {-0.16((-0.930.45)-((1-λ)*-0.16))} = 0

When we substitute each λ value in the equation, all of them becomes 0.

Now we can obtain eigenvector.

(example)

λ = ≈ 2.102

When λ is ≈ 2.102, this is how to calculate eigenvector.

-1.11 x1 -0.93 x2 – 0.17 x3 = 0 —— (a)

-0.93 x1 -1.11 x2 + 0.45 x3 = 0 —— (b)

-0.17 x1 +0.45 x2 – 1.11 x3 = 0 —— (c)

Now we need to find x1, x2, x3. Again!! we might be able to primitively calculate in our notes, but there are many websites to calculate eigenvector.

For example, I’ll obtain eigenvector in below website.

https://www.symbolab.com/solver/matrix-eigenvectors-calculator

First, let’s generate 3×3 matrix, and input covariance values, and in front of the matrix, type ‘Eigenvalue’

The website shows the λ equation

{(1-λ){(1-λ)(1-λ)-(0.450.45)} - {-0.93((-0.93(1-λ))-(0.45-0.16))} + {-0.16((-0.930.45)-((1-λ)*-0.16))} = 0

is summarized as

3 + 3λ2 - 1.907λ + 0.04092 = 0

and λ is ≈ 0.022, ≈ 0.876, ≈2.102, which is the same value we obtained from R. Now we can trust eigenvalues R provides.


Now, let’s calculate eigenvectors when λ is ≈2.102. Again!! let’s generate 3×3 matrix, and input covariance values, but now in front of the matrix, type ‘Eigenvectors’

Eigenvectors corresponding to each Eigenvalue was calculated. When λ is ≈2.102, eigenvectors are ≈ – 1.701 , 1.844, 1

Let’s check this values are correct in Excel. If those values are correct, in below equation, when we input x1 ≈ – 1.701, x2 ≈ 1.844 , x3 ≈ 1, all equation will be 0.

-1.11 x1 -0.93 x2 – 0.17 x3 = 0 —— (a)

-0.93 x1 -1.11 x2 + 0.45 x3 = 0 —— (b)

-0.17 x1 +0.45 x2 – 1.11 x3 = 0 —— (c)

All equations become 0, and therefore when λ is ≈2.102 the below matrix is validated.

If we check eigenvectors from the other two λ values, the matrix is also validated.



4. Calculating component score

Now, let’s sort eigenvalue from large to small values (2.102, 00876, 0.022).

To verify these values are correct, I’ll check in R.

ev <-eigen(matrix)
ev

Eigenvectors are different from what we calculated. How these values were calculated in R?

In R, eigenvectors are ‘Normalized Eigenvector.’ The way to normalize eigenvectors is like below. Simply, it’s kind of reducing the length of vector.

Based on the above equation, I re-calculated eigenvectors. For your understanding, I leave a formula about how 4.41 was calculated in Excel.

Now our eigenvectors is the same as R provided. The difference of positive and negative values will be explained in the last part.

Using these eigenvectors, we can calculate component score per genotype.

Do you remember we normalized our data in the first step? Now we use the normalized data to calculate component score.

For example, component score of Genotype A will be calculated as

Genotype A
PC1 = -0.6299*KN + 0.6827*KW + 0.3702*Cob weight
PC2 = 0.4227*KN + -0.0985*KW + 0.9009*Cob weight
PC3 =-0.6515*KN + -0.7240*KW + 0.2266*Cob weight

In Genotype A, KN = 0.89, KW = -0.67, Cob weight = 0.90.

Therefore, it will be calculated as below

PC1 = -0.6299*0.89 + 0.6827*-0.67 + 0.3702*0.90 = -0.6848
PC2 = 0.4227*0.89 + -0.0985*-0.67+ 0.9009*0.90 = 1.253
PC3 = -0.6515*0.89 + -0.7240*-0.67 + 0.2266*0.90 = 1.109

There are many websites to calculate matrix.

In the below website, let’s generate 5×3 matrix and 3×3 matrix, and input the values

https://matrix.reshish.com/multiplication.php

and the result will be

It was a long journey!!! Here is the final component score



<Verification using R>

We primitively calculated PCA by hand. Now let’s verify this calculation is correct. I’ll calculate each process I introduced using R.

0. generating data

In R, I generate the same data.

Genotype<- c("A","B","C","D","E")
KN<- c(616,576,615,590,606)
KW<- c(335.5,359.8,335.6,357.8,333.3)
COB<- c(242.5,237.2,234.3,243.8,232.6)
dataA<- data.frame(KN,KW,COB)
row.names(dataA) <- Genotype
dataA

1. Data normalization

I’ll normalize data using scale()

dataA$KN<- scale(dataA$KN, center=TRUE, scale=TRUE)
dataA$KW<- scale(dataA$KW, center=TRUE, scale=TRUE)
dataA$COB<- scale(dataA$COB, center=TRUE, scale=TRUE)
dataA

It’s the same as what we calculated by hand.



2. Calculating covariance

In R, it’s not necessary to do this process when conducting PCA, but to verify our calculation is correct, let’s check the covariance in R.

cor(dataA)

It’s the same as what we calculated by hand.



3. Eigenvector

Now, it’s the time to obtain eigenvectors

pca<-prcomp(dataA)
pca

In R, eigenvectors are quickly calculated. It’s the same as what we calculated by hand. The difference of positive and negative values will be explained in the last part.

We remember the complex process to calculate eigenvectors by hand, but in R, we can obtain in 5 seconds. However, now we understand how eigenvectors are calculated, which means we know the principle.

We can skip the first step in R, using the blue color code.

Genotype<- c("A","B","C","D","E")
KN<- c(616,576,615,590,606)
KW<- c(335.5,359.8,335.6,357.8,333.3)
COB<- c(242.5,237.2,234.3,243.8,232.6)
dataA<- data.frame(KN,KW,COB)
row.names(dataA) <- Genotype
dataA

pca<-prcomp(dataA,scale=TRUE) 
pca


4. Calculating component score

Now, let’s calculate component score. We calculated component score as multiplying the matrix between normalized data and eigenvectors, and the same code is below.

round(scale(dataA) %*% pca$rotation,3)
or 
round(pca$x,3)

It’s the same as what we calculated by hand. The difference of positive and negative values will be explained in the last part. The small difference is due to round-off in different decimal points.



Making a graph

In the below website, let’s draw a graph about eigenvectors and component scores

https://www.geogebra.org/calculator

I compared the value of eigenvectors and component scores from R (left graph) with the values from my calculation (right graph).

The difference of positive and negative values between R and my calculation is expressed like above graph. If we turn over the left graph, it will be the same in the right graph. Anyhow, the interpretation of the result of both graphs will be the same.

I’ll draw a graph using R.

install.packages("factoextra")
library(factoextra)

fviz_pca_var(pca, labelsize = 4, repel = TRUE) +
  labs(title="") +
  theme_grey(base_size=18, base_family="serif")+
  theme(axis.line= element_line(size=0.5, colour="black"),
        axis.title.y= element_text (margin = margin(t=0, r=0, b=0, l=0)),
        plot.margin= margin(-0.7,-0.7,0,-0.7,"cm")) +
  windows(width=5.5, height=5)

It’s the same as what I drew by hand.

If you copy and paste the below code in R, you can obtain the same graph above.

install.packages("factoextra")
library(factoextra)

Genotype<- c("A","B","C","D","E")
KN<- c(616,576,615,590,606)
KW<- c(335.5,359.8,335.6,357.8,333.3)
COB<- c(242.5,237.2,234.3,243.8,232.6)
dataA<- data.frame(KN,KW,COB)
row.names(dataA) <- Genotype

pca<-prcomp(dataA,scale=TRUE)

fviz_pca_var(pca, labelsize=4, repel=TRUE) +
  labs(title="") +
  theme_grey(base_size=18, base_family="serif")+
  theme(axis.line=element_line(size=0.5, colour="black"),
        axis.title.y=element_text(margin=margin(t=0,r=0,b=0,l=0)),
        plot.margin=margin(-0.7,-0.7,0,-0.7,"cm")) +
windows(width=5.5, height=5)


How to interpret the graph?

1) The range of eigenvector

The eigenvector was normalized, the range is always between -1 and 0. If you see the different data range in PC1 and PC2, we can guess those eigenvectors were not normalized. If you obtain PCA, I believe it’s always normalized Eigenvector.

2) Variance of components

summary(pca)

This is the summary of component scores. It shows standard deviation, and cumulative variance in PC1 to PC3. It says PC1 and PC2 account for 99% variance from the whole data.

In standard deviation, let’s check variance as squaring (s =√v, v = s2)

1.44972 = 2.10
0.93492 = 0.87
0.156272 = 0.02

Do you remember those values? We saw those values when we calculate PCA step by step. It’s eigenvalue

That is, eigenvalue (λ) is variance of components. That’s why we sort from the large value.

In PC1, it account for 70.1% of the total variance, and in PC2, it accounts for 29.1% of the total variance. How these values were calculated?

Let’s add up all eigenvalue (λ)

2.102 + 0.876 + 0.022 ≈ 3

and
2.102 / 3 ≈ 0.70 (variance ratio in PC1)
0.876 / 3 ≈ 0.29 (variance ratio inPC2)

3) Interpretation of vector

A) Direction and proximity of vector

A and B is negatively correlated (in PC1, A is positive value, while B is negative value).
A and C is positively correlated (in PC1, both A and B are positive value).

If two vectors are adjacent, and the angle between two vectors is small, two variables are positively correlated. If the angle between two vectors is 90°, two variables are independent.

B) Length of vector

As mentioned above, A and B; and C and D are negatively correlated, but A and B highly affects variance of the data while C and D less affect the variance. It means A and B is the most important variables to explain the whole data.


Tip!! PCA analysis using SAS Studio

GenotypeKNKWCob weight
A616335.5242.5
B576359.8237.2
C615335.6234.3
D590357.8243.8
E606333.3232.6

In SAS STUDIO, please go to Tasks and Utilities and select Multivariate Analysis > Principle Component Analysis. Then choose KN, KW, Cob weight at Analysis variables in ROLES section.

I’d like to see the outcome from PC1, PC2, PC3, and therefore I selected 3 in Number of components for score and pattern. Then, in Select plots to display, please select Default and addtional plots and click all options. Then click the run button.

Let’s compare the outcomes provided by SAS with the ones we calculated by hand. All values are the same. From now on, we can fully understand the values that statistical programs provide for PCA analysis and have also learned the underlying principle.

Now, let’s compare the graphs. When we compared the graphs we drew manually with the ones provided by R, we noticed that they were inverted up and down. However, when we compared the graphs we drew manually with the ones provided by SAS, they were the same. Therefore, the inversion of the graph doesn’t matter as the pattern remains the same.




Leave a Reply

If you include a website address in the comment section, I cannot see your comment as it will be automatically deleted and will not be posted. Please refrain from including website addresses.