[STAT Article] Step-by-Step Guide to Calculating and Analyzing Principal Component Analysis (PCA) by Hand

[STAT Article] Step-by-Step Guide to Calculating and Analyzing Principal Component Analysis (PCA) by Hand





This image has an empty alt attribute; its file name is pca-9.jpg

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much variability in the data as possible. It transforms the original variables in a dataset into a new set of uncorrelated variables called principal components, ordered by the amount of variance they capture from the original dataset.

Here’s the step of Principal Component Analysis (PCA).

1. Standardize the Data: 
Since PCA is affected by the scale of the variables, it often begins with standardizing the dataset, so each feature contributes equally to the analysis.

2. Compute the Covariance Matrix:
This step captures the relationships between the features. A high covariance between two variables suggests they are strongly related.

3. Calculate Eigenvalues and Eigenvectors:
Eigenvalues represent the amount of variance captured by each principal component, while eigenvectors represent the direction of these components.

4. Sort and Select Principal Components:
The principal components are ranked by the eigenvalues. You select the top components, which capture the majority of the variance, to reduce dimensionality without losing significant information.

When you conduct PCA, do you simply accept the results provided by the software without considering the preceding steps? If we just accept results without any doubts, we never understand the principle of PCA.

In this time, I’ll introduce how PCA is calculated step by step, and if you read this post, I believe you can fully understand the concept of PCA.






Here is one data for corn. Let’s say we measured kernel number per ear (KN), average of kernel weight per ear (KW) and cob dry weight per ear (Cob weight) about five corn genotypes.

GenotypeKNKWCob weight
A616335.5242.5
B576359.8237.2
C615335.6234.3
D590357.8243.8
E606333.3232.6

If you follow step by step, it’s a simple mathematics in middle school.






1. Standardize the Data

Using mean and standard deviation, we can convert each data to Z value.

GenotypeKNKWCob weight
A616335.5242.5
B576359.8237.2
C615335.6234.3
D590357.8243.8
E606333.3232.6
Mean600.6344.4238.1
Stdev17.313.24.9

As an example, I show the formula how 0.90 and -1.11 were calculated in Excel. If you follow this equation z = (x - µ)/σ , simply we can calculate the standardized (or normalized) data.

GenotypeKNKWCob weight
A0.89-0.670.90
B-1.431.17-0.18
C0.83-0.67-0.77
D-0.611.021.16
E0.31-0.84-1.11
Mean000
Stdev111

For more details, please visit below my posts.


□ Reading Materials

[Data article] Why is data normalization necessary when visualizing data?
[Data article] Data Normalization Techniques: Excel and R as the Initial Steps in Machine Learning


Recently, I developed an R package called normtools() to easily normalize datasets. I’ll show you how easily we can standardize (or normalize) data.

# to create dataset
group=rep(c("2024"),5)
genotype=c("A","B","C","D","E")
kn=c(616,576,615,590,606)
kw=c(335.5,359.8,335.6,357.8,333.3)
cob_weight=c(242.5,237.2,234.3,243.8,232.6)
df=data.frame(group, genotype,kn,kw,cob_weight)

# to install the package
if(!require(remotes)) install.packages("remotes")
if (!requireNamespace("normtools", quietly = TRUE)) {
  remotes::install_github("agronomy4future/normtools")
}
library(remotes)
library(normtools)

# to run normtools()
z_test= normtools(df, c("group"),c("kn","kw","cob_weight"), method= 1) # 1 or "z_test"
z_test
I added a column named 'group,' which is a dummy variable to enable normalization within groups. If there is only one level, such as 2024, all data will be normalized together. If there are two levels, such as 2023 and 2024, the data will be normalized separately within each level.

If you run the above code, you can obtain the normalized dataset as follows:

  group genotype  kn    kw cob_weight Normalized_kn mean_kn    sd_kn Normalized_kw mean_kw   sd_kw Normalized_cob_weight mean_cob_weight sd_cob_weight
1  2024        A 616 335.5      242.5     0.8923976   600.6 17.25688    -0.6744263   344.4 13.1964             0.8959603          238.08      4.933255
2  2024        B 576 359.8      237.2    -1.4255182   600.6 17.25688     1.1669848   344.4 13.1964            -0.1783812          238.08      4.933255
3  2024        C 615 335.6      234.3     0.8344497   600.6 17.25688    -0.6668485   344.4 13.1964            -0.7662285          238.08      4.933255
4  2024        D 590 357.8      243.8    -0.6142477   600.6 17.25688     1.0154284   344.4 13.1964             1.1594780          238.08      4.933255
5  2024        E 606 333.3      232.6     0.3129186   600.6 17.25688    -0.8411384   344.4 13.1964            -1.1108286          238.08      4.933255

This package provides normalized data along with the mean and standard deviation used for normalization. To simplify, I’ll select only the normalized data.

df1= data.frame(subset(z_test, select=c("Normalized_kn","Normalized_kw","Normalized_cob_weight")))

df1
  Normalized_kn Normalized_kw Normalized_cob_weight
1     0.8923976    -0.6744263             0.8959603
2    -1.4255182     1.1669848            -0.1783812
3     0.8344497    -0.6668485            -0.7662285
4    -0.6142477     1.0154284             1.1594780
5     0.3129186    -0.8411384            -1.1108286

This data matches what I calculated manually. For more details, please visit my post below introducing the package.


Practices in Data Normalization using normtools() in R






2. Calculating covariance

Once data normalization is complete, the next step is to calculate the covariance among variables. I calculated the covariance in Excel using the following equation.

I leave formula in Excel below for clear understanding.

#Excel formula 
KN ↔ KN =(((B2-B7)*(B2-B7))+((B3-B7)*(B3-B7))+((B4-B7)*(B4-B7))+((B5-B7)*(B5-B7))+((B6-B7)*(B6-B7)))/(5-1) = 1.00

KN ↔ KW =(((B2-B7)*(C2-C7))+((B3-B7)*(C3-C7))+((B4-B7)*(C4-C7))+((B5-B7)*(C5-C7))+((B6-B7)*(C6-C7)))/(5-1) =-0.93

KN ↔ Cob weight =(((B2-B7)*(D2-D7))+((B3-B7)*(D3-D7))+((B4-B7)*(D4-D7))+((B5-B7)*(D5-D7))+((B6-B7)*(D6-D7)))/(5-1) =-0.17

KW ↔ KN =(((B2-B7)*(C2-C7))+((B3-B7)*(C3-C7))+((B4-B7)*(C4-C7))+((B5-B7)*(C5-C7))+((B6-B7)*(C6-C7)))/(5-1) =-0.93

KW ↔ KW =(((C2-C7)*(C2-C7))+((C3-C7)*(C3-C7))+((C4-C7)*(C4-C7))+((C5-C7)*(C5-B7))+((C6-B7)*(C6-C7)))/(5-1) = 1.00

KW ↔ Cob weight =(((C2-C7)*(D2-D7))+((C3-C7)*(D3-D7))+((C4-C7)*(D4-D7))+((C5-C7)*(D5-D7))+((C6-C7)*(D6-D7)))/(5-1) =0.45

Cob weight ↔ KN =(((B2-B7)*(D2-D7))+((B3-B7)*(D3-D7))+((B4-B7)*(D4-D7))+((B5-B7)*(D5-D7))+((B6-B7)*(D6-D7)))/(5-1) =-0.17

Cob weight ↔ KW =(((C2-C7)*(D2-D7))+((C3-C7)*(D3-D7))+((C4-C7)*(D4-D7))+((C5-C7)*(D5-D7))+((C6-C7)*(D6-D7)))/(5-1) =0.45

Cob weight ↔ Cob weight =(((D2-D7)*(D2-D7))+((D3-D7)*(D3-D7))+((D4-D7)*(D4-D7))+((D5-D7)*(D5-D7))+((D6-D7)*(D6-D7)))/(5-1) = 1.00

However, is that really necessary to calculate covariance like this? If you fully understand covariance, you’re familiar with below equation I suggest.

If we divide covariance by multiplication of standard deviation of x and y, it will be correlation (r). That is, the below correlation formula we usually know is the same as what I suggested above.

Let’s think about this. We already normalized data, which means standard deviation is 1. Then, the below equation would be possible as standard deviation of x and y is 1.

cov (x,y) = r

For more details, please visit below my post and YouTube video.


□ Reading Materials

Simple linear regression (1/5)- correlation and covariance



It means when data is normalized, covariance and correlation will be the same. In Excel, we can easily obtain correlation.

If you follow above steps, you can obtain correlation like below.

Let’s copy and paste each value in the same relationship to the blank cells, and this result will be the same as what we calculated by hand.


In R, we can easily obtain correlation.

cor(df1)
                      Normalized_kn Normalized_kw Normalized_cob_weight
Normalized_kn             1.0000000    -0.9271996            -0.1613366
Normalized_kw            -0.9271996     1.0000000             0.4525646
Normalized_cob_weight    -0.1613366     0.4525646             1.0000000

Or, in R, we can also obtain covariance.

cov(df1)
                      Normalized_kn Normalized_kw Normalized_cob_weight
Normalized_kn             1.0000000    -0.9271996            -0.1613366
Normalized_kw            -0.9271996     1.0000000             0.4525646
Normalized_cob_weight    -0.1613366     0.4525646             1.0000000

Either way, the value is the same. This is because the data is normalized, and the standard deviation is 1. Therefore, in normalized data, covariance and correlation will be the same. Anyway, this data matches what I calculated in Excel.






3. Eigenvector and Eigenvalue

From here, we need to understand the concept of vector. However, it would not be necessary to fully understand what it is. Instead, let’s grab the concept we need to analyze PCA.

Eigenvalue is called lamda (λ), and here is a formula

Ax=λx

and this formula is also explained as below

(A−λI)x=0

For example, with A matrix and λI matrix, we can think the right matrix. Then, the covariance table we calculated will be expressed as below.

This is 3×3 matrix. Unlike 2×2 matrix, 3×3 matrix will be calculated as below

First, let’s calculate eigenvalue, λ

{(1-λ){(1-λ)(1-λ)-(0.450.45)} - {-0.93((-0.93(1-λ))-(0.45-0.16))} + {-0.16((-0.930.45)-((1-λ)*-0.16))} = 0

Now we need to find λ. We might be able to primitively calculate in our notes, but let’s work smart!! Using a simple R code, we can easily obtain λ.

matrix= matrix(c(1, -0.93, -0.16, -0.93, 1, 0.45, -0.16, 0.45, 1), 3, 3, byrow=TRUE)

matrix 
      [,1]  [,2]  [,3]
[1,]  1.00 -0.93 -0.16
[2,] -0.93  1.00  0.45
[3,] -0.16  0.45  1.00

Now I generated covariance table as 3×3 matrix. Then, I use the below R code.

ev= eigen(matrix)
values= ev$values

values
[1] 2.10205207 0.87571854 0.02222939

Now eigenvalue was calculated.

λ = 2.10205207, 0.87571854, 0.02222939

To verify this value is correct, I’ll substitute those λ values in (1-λ){(1-λ)(1-λ)-(0.450.45)} - {-0.93((-0.93(1-λ))-(0.45-0.16))} + {-0.16((-0.930.45)-((1-λ)*-0.16))} = 0

When we substitute each λ value in the equation, all of them becomes 0.

Now we can obtain eigenvector.

(example)

λ = ≈ 2.102

When λ is ≈ 2.102, this is how to calculate eigenvector.

-1.11 x1 -0.93 x2 – 0.17 x3 = 0 —— (a)

-0.93 x1 -1.11 x2 + 0.45 x3 = 0 —— (b)

-0.17 x1 +0.45 x2 – 1.11 x3 = 0 —— (c)

Now we need to find x1, x2, x3. Again!! we might be able to primitively calculate in our notes, but there are many websites to calculate eigenvector.

For example, I’ll obtain eigenvector in below website.

https://www.symbolab.com/solver/matrix-eigenvectors-calculator

First, let’s generate 3×3 matrix, and input covariance values, and in front of the matrix, type ‘Eigenvalue’

The website shows the λ equation

{(1-λ){(1-λ)(1-λ)-(0.450.45)} - {-0.93((-0.93(1-λ))-(0.45-0.16))} + {-0.16((-0.930.45)-((1-λ)*-0.16))} = 0

is summarized as

3 + 3λ2 - 1.907λ + 0.04092 = 0

and λ is ≈ 0.022, ≈ 0.876, ≈2.102, which is the same value we obtained from R. Now we can trust eigenvalues R provides.


Now, let’s calculate eigenvectors when λ is ≈2.102. Again!! let’s generate 3×3 matrix, and input covariance values, but now in front of the matrix, type ‘Eigenvectors’

Eigenvectors corresponding to each Eigenvalue was calculated. When λ is ≈2.102, eigenvectors are ≈ – 1.701 , 1.844, 1

Let’s check this values are correct in Excel. If those values are correct, in below equation, when we input x1 ≈ – 1.701, x2 ≈ 1.844 , x3 ≈ 1, all equation will be 0.

-1.11 x1 -0.93 x2 – 0.17 x3 = 0 —— (a)

-0.93 x1 -1.11 x2 + 0.45 x3 = 0 —— (b)

-0.17 x1 +0.45 x2 – 1.11 x3 = 0 —— (c)

All equations become 0, and therefore when λ is ≈2.102 the below matrix is validated.

If we check eigenvectors from the other two λ values, the matrix is also validated.






4. Calculating component score

Now, let’s sort eigenvalue from large to small values (2.102, 00876, 0.022).

To verify these values are correct, I’ll check in R.

ev= eigen(matrix)

ev
eigen() decomposition
$values
[1] 2.10205207 0.87571854 0.02222939

$vectors
           [,1]        [,2]       [,3]
[1,]  0.6299076  0.42274681 -0.6515379
[2,] -0.6827452 -0.09849582 -0.7239873
[3,] -0.3702371  0.90087941  0.2265851

Eigenvectors are different from what we calculated. How these values were calculated in R?

In R, eigenvectors are ‘Normalized Eigenvector.’ The way to normalize eigenvectors is like below. Simply, it’s kind of reducing the length of vector.

Based on the above equation, I re-calculated eigenvectors. For your understanding, I leave a formula about how 4.41 was calculated in Excel.

Now our eigenvectors are the same as R provided. The difference of positive and negative values will be explained in the last part.

Using these eigenvectors, we can calculate component score per genotype.

Do you remember we normalized our data in the first step? Now we use the normalized data to calculate component score.

For example, component score of Genotype A will be calculated as

Genotype A
PC1 = -0.6299*KN + 0.6827*KW + 0.3702*Cob weight
PC2 = 0.4227*KN + -0.0985*KW + 0.9009*Cob weight
PC3 =-0.6515*KN + -0.7240*KW + 0.2266*Cob weight

In Genotype A, KN = 0.89, KW = -0.67, Cob weight = 0.90.

Therefore, it will be calculated as below

PC1 = -0.6299*0.89 + 0.6827*-0.67 + 0.3702*0.90 = -0.6848
PC2 = 0.4227*0.89 + -0.0985*-0.67+ 0.9009*0.90 = 1.253
PC3 = -0.6515*0.89 + -0.7240*-0.67 + 0.2266*0.90 = 1.109

There are many websites to calculate matrix. In this website, let’s generate 5×3 matrix and 3×3 matrix, and input the values (https://matrix.reshish.com/multiplication.php)

and the result will be

It was a long journey!!! Here is the final component score!!






Principal Components Analysis using R

We primitively calculated PCA by hand. Now let’s verify this calculation is correct. I’ll calculate each process I introduced using R.

1. to create data

First, I’ll create the dataset.

genotype=c("A","B","C","D","E")
kn=c(616,576,615,590,606)
kw=c(335.5,359.8,335.6,357.8,333.3)
cob_weight=c(242.5,237.2,234.3,243.8,232.6)
df=data.frame(genotype,kn,kw,cob_weight)

df
  genotype  kn    kw    cob_weight
1        A  616  335.5  242.5
2        B  576  359.8  237.2
3        C  615  335.6  234.3
4        D  590  357.8  243.8
5        E  606  333.3  232.6

However, to conduct PCA, categorical variable should be row name. So, I’ll change ‘genotype’ as the row name.

genotype=c("A","B","C","D","E")
kn=c(616,576,615,590,606)
kw=c(335.5,359.8,335.6,357.8,333.3)
cob_weight=c(242.5,237.2,234.3,243.8,232.6)
df=data.frame(kn,kw,cob_weight)
row.names(df)= genotype

df
   kn    kw cob_weight
A 616 335.5      242.5
B 576 359.8      237.2
C 615 335.6      234.3
D 590 357.8      243.8
E 606 333.3      232.6





2. to normalize data

I’ll normalize data using scale()

df$kn= scale(df$kn, center=TRUE, scale=TRUE)
df$kw= scale(df$kw, center=TRUE, scale=TRUE)
df$cob_weight= scale(df$cob_weight, center=TRUE, scale=TRUE)

df
          kn         kw cob_weight
A  0.8923976 -0.6744263  0.8959603
B -1.4255182  1.1669848 -0.1783812
C  0.8344497 -0.6668485 -0.7662285
D -0.6142477  1.0154284  1.1594780
E  0.3129186 -0.8411384 -1.1108286





3. Calculating covariance

In R, it’s not necessary to do this process when conducting PCA, but to verify our calculation is correct, let’s check the covariance in R.

cor(df)
                   kn         kw cob_weight
kn          1.0000000 -0.9271996 -0.1613366
kw         -0.9271996  1.0000000  0.4525646
cob_weight -0.1613366  0.4525646  1.0000000





3. Eigenvector

Now, it’s the time to obtain eigenvectors

pca= prcomp(df)

> pca
Standard deviations (1, .., p=3):
[1] 1.4496795 0.9348841 0.1562719

Rotation (n x k) = (3 x 3):
                  PC1         PC2        PC3
kn         -0.6289457 -0.42563473 -0.6505862
kw          0.6824142  0.09867947 -0.7242743
cob_weight  0.3724758 -0.89949844  0.2283952

In R, eigenvectors are quickly calculated. It’s the same as what we calculated by hand. The difference of positive and negative values will be explained in the last part.

We can skip the step ‘2. Normalize data’ by directly adding scale=TRUE inside prcomp().”

genotype=c("A","B","C","D","E")
kn=c(616,576,615,590,606)
kw=c(335.5,359.8,335.6,357.8,333.3)
cob_weight=c(242.5,237.2,234.3,243.8,232.6)
df=data.frame(kn,kw,cob_weight)
row.names(df)= genotype

pca= prcomp(df, scale=TRUE) 

pca
Standard deviations (1, .., p=3):
[1] 1.4496795 0.9348841 0.1562719

Rotation (n x k) = (3 x 3):
                  PC1         PC2        PC3
kn         -0.6289457 -0.42563473 -0.6505862
kw          0.6824142  0.09867947 -0.7242743
cob_weight  0.3724758 -0.89949844  0.2283952

Calculating eigenvectors by hand was a complex process, but in R, we can obtain them in just 5 seconds. However, now we understand how eigenvectors are calculated, meaning we know the underlying principles.






4. Calculating component score

Now, let’s calculate component score. We calculated component score as multiplying the matrix between normalized data and eigenvectors, and the same code is below.

round(scale(df) %*% pca$rotation,3)
or 
round(pca$x,3)

     PC1    PC2    PC3
A -0.688 -1.252  0.113
B  1.626  0.882  0.041
C -1.265  0.268 -0.235
D  1.511 -0.681 -0.071
E -1.185  0.783  0.152

The result is the same as what we calculated by hand. The difference in positive and negative values will be explained in the final section. The small discrepancy is due to rounding at different decimal points.






Making a graph

In this website (https://www.geogebra.org/calculator), let’s draw a graph about eigenvectors and component scores. I compared the value of eigenvectors and component scores from R (left graph) with the values from my calculation (right graph).

The difference in positive and negative values between R and my calculation is shown in the graph above. If we flip the left graph, it will match the right graph. Nonetheless, the interpretation of the results for both graphs remains the same.

I’ll draw a graph using R.

if (require("factoextra")== F) install.packages("factoextra")
library(factoextra)

fviz_pca_var(pca, labelsize= 4, repel= TRUE) +
  labs(title="") +
  theme_grey(base_size=18, base_family="serif")+
  theme(axis.line= element_line(size=0.5, colour="black"),
        axis.title.y= element_text (margin = margin(t=0, r=0, b=0, l=0)),
        plot.margin= margin(-0.7,-0.7,0,-0.7,"cm")) +
  windows(width=5.5, height=5)

It’s the same as what I drew by hand.


If you copy and paste the below code in R, you can obtain the same graph above.

if (require("factoextra")== F) install.packages("factoextra")
library(factoextra)

genotype=c("A","B","C","D","E")
kn=c(616,576,615,590,606)
kw=c(335.5,359.8,335.6,357.8,333.3)
cob_weight=c(242.5,237.2,234.3,243.8,232.6)
df=data.frame(kn,kw,cob_weight)
row.names(df)= genotype

pca= prcomp(df, scale=TRUE) 

fviz_pca_var(pca, labelsize=4, repel=TRUE) +
  labs(title="") +
  theme_grey(base_size=18, base_family="serif")+
  theme(axis.line=element_line(size=0.5, colour="black"),
        axis.title.y=element_text(margin=margin(t=0,r=0,b=0,l=0)),
        plot.margin=margin(-0.7,-0.7,0,-0.7,"cm")) +
windows(width=5.5, height=5)





How to interpret the graph?

1) The range of eigenvector

The eigenvector is normalized, with values always ranging between -1 and 0. If you observe a different data range in PC1 and PC2, it suggests that those eigenvectors were not normalized. When performing PCA, the eigenvectors are typically normalized by default.

2) Variance of components

summary(pca)

Importance of components:
                          PC1    PC2     PC3
Standard deviation     1.4497 0.9349 0.15627
Proportion of Variance 0.7005 0.2913 0.00814
Cumulative Proportion  0.7005 0.9919 1.00000

This is the summary of component scores. It shows standard deviation, and cumulative variance in PC1 to PC3. It says PC1 and PC2 account for 99% variance from the whole data.

In standard deviation, let’s check variance as squaring (s =√v, v = s2)

1.44972 = 2.10
0.93492 = 0.87
0.156272 = 0.02

Do you remember those values? We encountered them when we calculated PCA step-by-step—they’re the eigenvalues.

That is, eigenvalue (λ) is variance of components. That’s why we sort from the large value.

In PC1, it accounts for 70.1% of the total variance, and in PC2, it accounts for 29.1% of the total variance. How were these values calculated?

Let’s add up all eigenvalue (λ)

2.102 + 0.876 + 0.022 ≈ 3

and
2.102 / 3 ≈ 0.70 (variance ratio in PC1)
0.876 / 3 ≈ 0.29 (variance ratio inPC2)


3) Interpretation of vector

A) Direction and proximity of vector

A and B is negatively correlated (in PC1, A is positive value, while B is negative value).
A and C is positively correlated (in PC1, both A and B are positive value).

If two vectors are adjacent, and the angle between two vectors is small, two variables are positively correlated. If the angle between two vectors is 90°, two variables are independent.

B) Length of vector

As mentioned above, A and B; and C and D are negatively correlated, but A and B highly affects variance of the data while C and D less affect the variance. It means A and B is the most important variables to explain the whole data.


Tip!! PCA analysis using SAS Studio

GenotypeKNKWCob weight
A616335.5242.5
B576359.8237.2
C615335.6234.3
D590357.8243.8
E606333.3232.6

In SAS STUDIO, please go to Tasks and Utilities and select Multivariate Analysis > Principle Component Analysis. Then choose KN, KW, Cob weight at Analysis variables in ROLES section.

I’d like to see the outcome from PC1, PC2, PC3, and therefore I selected 3 in Number of components for score and pattern. Then, in Select plots to display, please select Default and additional plots and click all options. Then click the run button.

Let’s compare the outcomes provided by SAS with those we calculated by hand. All values match. From now on, we can fully understand the values that statistical programs provide for PCA analysis and have gained insight into the underlying principles.

Now, let’s compare the graphs. When we compared the graphs, we drew manually with those provided by R, we noticed they were inverted vertically. However, when we compared the manually drawn graphs with those from SAS, they were identical. Therefore, the inversion of the graph doesn’t affect the interpretation, as the pattern remains consistent.







Comments are closed.