If you are a data scientist, or aspiring to be one, you must have come across the term "The Curse of Dimensionality". In simple terms, the curse of dimensionality refers to the challenges we run into when handling data in a high dimensional space (put plainly: too many features) in terms of visualization, inference, compute, and so on.
As the dimension of the data grows, the performance of machine learning algorithms often degrades, and this is where dimensionality reduction comes into play!

You can think of dimensionality reduction simply as a technique to reduce the number of features in a dataset. Fewer features mean a simpler and more efficient machine learning model! One of the most commonly used methods for dimensionality reduction is Principal Component Analysis (PCA).

This article will demystify how PCA works under the hood and how dimensionality reduction is accomplished using it. Let's get started!

What is Dimensionality Reduction?

To put it simply, dimensionality reduction is a technique that reduces the features of a dataset to a lower dimension without losing the essence of the data.

"When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the 'essence' of the data. This is called dimensionality reduction." (Page 11, Machine Learning: A Probabilistic Perspective, 2012)

Dimensionality reduction is broadly classified into two categories: Feature Selection and Feature Extraction. Feature Selection is a method in which the most important features are retained and the rest are simply left out. Feature Extraction, on the other hand, is a slightly more advanced technique in which features in a higher dimensional space are transformed into a lower dimensional space. PCA belongs to this category!

In this blog, we will demystify the foundational concepts behind PCA, an unsupervised learning algorithm.

What is Principal Component Analysis (PCA)?

Principal Component Analysis is a popular dimensionality reduction technique. In simple terms, it transforms n-dimensional features (n being large) into k-dimensional features (where k is less than n) with minimal loss of the information present in the data.

PCA is also widely used for visualization: higher dimensional data is projected into a lower dimension for easy representation.

Let's understand the concept of PCA in a simple way. Say there are two variables, X1 and X2. Which of these variables would you choose if you wanted to reduce the dimension from two (2-D) to one (1-D)?

Would it be X1, X2, or a new variable altogether? And why should we reduce the dimension at all? Let's take these questions one by one.

To reduce from 2-D to 1-D, we would pick a new variable, say Z1, with the highest variance, i.e., the variable that holds most of the information present in the data. This new variable Z1 will be a linear combination of X1 and X2.

The linear combination corresponds to a line (the EigenVector of the highest EigenValue of the Covariance Matrix), which becomes the new feature space onto which the data is projected with the least projection error.

Why should we reduce the dimension? Say we have 1000 variables in a dataset. The main question to answer is whether all the variables carry equal importance. If the information present in the 1000 variables can be captured just as well by, say, 10 variables, then we have a dimensionality reduction problem.

This is just an outline of how PCA reduces data to lower dimensions. The most interesting part is how PCA forms a new feature space with fewer dimensions.

In general, 

PCA is a linear transformation of an n-dimensional space to a k-dimensional space (k << n), along directions (which may have either a positive or negative sense) chosen to project the original data with minimized projection error.

Advantages of PCA 

  • PCA is highly effective at reducing the number of dimensions.
  • It removes correlated features automatically.
  • Visualization of higher dimensional data is made easier.
  • Model performance often improves.
  • It has low sensitivity to noise.

Disadvantages of PCA 

  • PCA is sensitive to outliers.
  • It assumes that the features are correlated; if they are not, there is little to compress.
  • Some information present in the data may be lost; selecting the number of principal components should be handled with care.
  • Without proper standardization, features measured on widely different scales can bias the results.
  • The resulting principal components are difficult to interpret.
  • The covariance matrix can be difficult to evaluate when the number of features is very large.

How does PCA work?

This part explains how Principal Component Analysis can be implemented using a step-by-step approach. It is important to understand each step and the mathematics behind it. So let's not wait and get started.

  1. Import the Data 

The first step in building PCA is to gather the necessary data, which is the most important part of any machine learning technique, and import it.

  2. Standardisation

Features present in the dataset may be measured on different scales (units). If the features are in different units, it is important to standardise them to a common scale. This gives every feature zero mean and unit standard deviation (μ = 0, σ = 1), i.e., each value is transformed as z = (x − μ) / σ.

  3. Compute Covariance Matrix

After standardising all features in the data, the covariance matrix is calculated. The covariance matrix is a quantitative measure of how two or more variables are related to each other.

  4. Compute EigenValues and EigenVectors

From the covariance matrix, EigenValues (λ) and EigenVectors (x) are calculated; these are the core of PCA.

  5. Selecting Principal Components

The EigenValues are sorted in descending order to identify the most significant Principal Components. The first Principal Component (PC1) contains the maximum variance of the data, and the second Principal Component is orthogonal (perpendicular) to the first.

  6. Projection to new Feature space

With the selected Principal Components, the newly formed data, which is just the projection of the actual data onto a lower dimensional space, can be calculated.

Diving into Math behind PCA

In the last section, we went through the steps followed in PCA. Now, to make the most of the concepts mentioned earlier, we have to understand the mathematics behind them. I promise this will not be a boring maths class.

Covariance Matrix

So, what is a covariance matrix? Covariance describes how two variables vary with each other; it captures variation across two dimensions at a time. If there are many variables, the covariance is calculated for every pair of variables to build up the final covariance matrix.

Say we have 3 variables (x, y, z); the pairs will be

  • Covariance of ( x,y )
  • Covariance of ( y,z )
  • Covariance of ( x,z )

The formula to calculate covariance is

cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

where the sum runs over the n observations, and x̄ and ȳ are the means of x and y.

Covariance can range from −∞ to +∞. It can be negative or positive, while a value of zero indicates no (linear) covariance between the two variables.
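As a quick illustration (not part of the original post; the variable values here are made up), the pairwise covariances of three variables can be computed with NumPy's np.cov, which returns the full covariance matrix:

```python
import numpy as np

# Three illustrative variables with five observations each
x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
z = np.array([1.0, 0.8, 1.2, 0.9, 1.1])

# np.cov treats each row as a variable and returns the 3 x 3 covariance matrix:
# the diagonal holds each variable's variance, the off-diagonal entries hold
# cov(x, y), cov(y, z) and cov(x, z). It uses the (n - 1) denominator by default.
cov_matrix = np.cov(np.vstack([x, y, z]))
print(cov_matrix)
```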

Eigen Decomposition 

EigenValues (λ) and EigenVectors (x) are calculated from the covariance matrix, which is a square n × n matrix. The following steps are followed to arrive at the EigenValues and EigenVectors.

det(C − λI) = 0   (solve for the EigenValues λ)

(C − λI) x = 0   (solve for the corresponding EigenVectors x)

Here C is the covariance matrix and I is the n × n identity matrix.

Among the computed EigenValues (sorted in descending order), the highest EigenValue (λmax) and its corresponding EigenVector define the Principal Component with the highest variance. The top k EigenVectors (x1, x2, . . . , xk) together form the new feature space onto which the data is projected in k dimensions (k << n).
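A minimal NumPy sketch of this step (illustrative data and names, not code from the original post):

```python
import numpy as np

# Illustrative data: 100 observations of 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
cov_matrix = np.cov(X, rowvar=False)               # 3 x 3 covariance matrix

# np.linalg.eigh is suited to symmetric matrices such as a covariance matrix.
# It returns eigenvalues in ascending order, so we reverse to sort descending.
eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigen_values)[::-1]
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors[:, order]            # columns reordered to match

print("EigenValues (descending):", eigen_values)
print("Share of variance captured by PC1:", eigen_values[0] / eigen_values.sum())
```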

Projected Data 

Once the EigenValue and corresponding EigenVectors are found, the data formed by PCA is calculated. 

Given {x1, x2, . . . , xk} = the selected EigenVectors and XScaled = the standardised data:

The dot product of the EigenVectors and the scaled data produces the data in the new feature space. This newly formed data is a projection of the actual data.

Projected Data = EigenVectors · XScaled

This data is projected along the directions of the selected EigenVectors (the Principal Components) in a lower dimensional space.
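Continuing the sketch above (again with illustrative data, not the original post's code), the projection is a single matrix product:

```python
import numpy as np

# Illustrative raw data: 100 observations of 5 features
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))

# Standardise each feature to zero mean and unit standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen decomposition of the covariance matrix, sorted descending as before
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X_scaled, rowvar=False))
order = np.argsort(eigen_values)[::-1]
eigen_vectors = eigen_vectors[:, order]

# Projected Data = dot product of the scaled data with the top-k EigenVectors
# (equivalently, eigen_vectors[:, :k].T @ X_scaled.T in the transposed convention).
k = 2
projected_data = X_scaled @ eigen_vectors[:, :k]   # shape (100, 2)
print(projected_data[:5])
```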

Working through PCA with a Use Case

Now that we have learnt what PCA is and how it works, let's get hands-on experience with PCA using Python.

The dataset used for this coded example is the digits dataset from the scikit-learn library.

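The original snippet is not reproduced here; a minimal sketch of loading the data with scikit-learn's load_digits would be:

```python
from sklearn.datasets import load_digits

# Load the digits dataset: 1797 samples, each an 8 x 8 image flattened to 64 features
digits = load_digits()
X = digits.data      # feature matrix of shape (1797, 64)
y = digits.target    # digit labels 0-9

print(X.shape)       # (1797, 64)
```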

The digits dataset consists of 1797 observations and 64 features. Each observation is an 8×8 image of a digit.

Next, we standardise the features to a single scale with zero mean and unit standard deviation. StandardScaler is the scikit-learn transformer that standardises all features to this common scale.

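Again, the original code is not shown; a standard way to do this step looks roughly like:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the 64 digit features and transform them so that each
# feature has zero mean and unit standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # X is the (1797, 64) matrix loaded above

print(X_scaled.mean(axis=0).round(3)[:5])  # approximately 0 for every feature
print(X_scaled.std(axis=0).round(3)[:5])   # 1 for varying features (constant pixels stay 0)
```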

Let's see how PCA works internally by calculating the covariance matrix and performing the Eigen decomposition ourselves.

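A sketch of that internal computation on the scaled digits data (reconstructed with NumPy, assuming X_scaled from the previous step; this is not the original post's code):

```python
import numpy as np

# Covariance matrix of the standardised digits data: shape (64, 64)
cov_matrix = np.cov(X_scaled, rowvar=False)

# Eigen decomposition; eigh returns eigenvalues in ascending order, so sort descending
eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigen_values)[::-1]
eigen_values, eigen_vectors = eigen_values[order], eigen_vectors[:, order]

# Fraction of the total variance captured by the top three principal components
print(eigen_values[:3] / eigen_values.sum())
```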

Now let's build our PCA model on the digits dataset. We will also see how the dimensionality reduction is performed and how visualization is made easier using PCA.

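The exact code is not shown in the original; a typical way to obtain the 3-component projection tabulated below with scikit-learn's PCA would be:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Reduce the 64 standardised features to 3 principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Collect the projected data and the digit labels in a DataFrame
df_pca = pd.DataFrame(X_pca, columns=["dim_1", "dim_2", "dim_3"])
df_pca["labels"] = y

print(df_pca.head())
```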
dim_1         dim_2         dim_3        labels
-1.259466     21.274884     -9.463054    0
7.957611      -20.768699    4.439506     1
6.991923      -9.955986     2.958558     2
-15.906105    3.332464      9.824372     3
23.306867     4.269061      -5.675128    4

Transformed data (first five rows)

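The plotting code is not included in the original; one common way to visualise the three components is a matplotlib 3D scatter plot coloured by digit label, for example:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection)

# 3D scatter of the three principal components, coloured by digit label
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
scatter = ax.scatter(df_pca["dim_1"], df_pca["dim_2"], df_pca["dim_3"],
                     c=df_pca["labels"], cmap="tab10", s=10)
ax.set_xlabel("dim_1")
ax.set_ylabel("dim_2")
ax.set_zlabel("dim_3")
fig.colorbar(scatter, ax=ax, label="digit label")
plt.show()
```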

We have arrived at the end of this blog. You can experiment with PCA on any dataset. Here, we reduced the dimension from 64 to just 3, and we visualized how high dimensional data with many features is projected into a 3-dimensional space using PCA.

Conclusion

PCA is a multi-purpose technique used for dimensionality reduction as well as data visualisation. We covered topics like dimensionality reduction, visualisation, and the inner workings of PCA. We also experimented with PCA on the built-in sklearn digits dataset, which has many features.

I hope this blog gave you a clear understanding of how PCA works. Please give a thumbs up if you like this blog. Keep learning. 
