Principal Component Analysis (PCA) is an unsupervised algorithm used for dimensionality reduction. The idea is simple: remove features that are redundant or add little value to decision making, so that there are fewer distinct features, reducing processing time without compromising accuracy. Other popular dimensionality reduction techniques include matrix factorisation and linear discriminant analysis.
In layman’s terms, you can say that PCA summarizes the data by highlighting the important features while discarding the less useful or less valuable ones.
PCA is a widely used technique for exploratory data analysis and predictive modelling. Karl Pearson first formulated it in 1901 as the principal axis theorem; Harold Hotelling later developed it further and named it Principal Component Analysis in the 1930s.
What is PCA?
Time to get a bit more technical!
PCA (Principal Component Analysis) is a linear dimensionality reduction technique which converts a set of potentially correlated variables (features) into a set of linearly uncorrelated variables called principal components, using an orthogonal transformation. All features are converted to numerical values before the transformation is performed. A simple diagram to illustrate this:
As we can see, there are several variables in the original dataset, each representing an axis: x, y, z, x’, y’. After the algorithm is applied, the variables are converted into a reduced set of principal components, PC1 and PC2.
So, what is an orthogonal transformation?
Two or more variables are orthogonal if there is no correlation between them, i.e. their correlation is zero. In matrix terms, suppose we transform a vector v into another vector u by multiplying it with a matrix A:
u = Av, where A is the matrix
If the transpose of A also acts as its inverse, the matrix A is said to be orthogonal:

AᵀA = I, i.e. Aᵀ = A⁻¹

An orthogonal transformation preserves dot products, and therefore lengths and angles. So if we have an object represented as a vector and multiply it by an orthogonal matrix A, all the geometric properties of the vector are retained; the object is simply rotated or reflected.
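To make this concrete, here is a small NumPy sketch (the rotation angle is chosen purely for illustration); a rotation matrix is a classic example of an orthogonal matrix:

```python
import numpy as np

# A 2-D rotation by 45 degrees -- an orthogonal matrix
theta = np.pi / 4
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Defining property of an orthogonal matrix: its transpose is its inverse
print(np.allclose(A.T @ A, np.eye(2)))   # True

# Orthogonal transformations preserve lengths: the vector is only rotated
v = np.array([3.0, 4.0])
u = A @ v
print(np.isclose(np.linalg.norm(u), np.linalg.norm(v)))  # True
```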
For simplicity’s sake, let us take an example dataset with two dimensions. We will see the correlation between the X and Y variables, and then see how we can flip and rotate the linear plot (an orthogonal transformation) to make it one-dimensional:
We can see that the coordinates x and y are plotted, and there is a positive correlation between them. When we rotate the axes so that the line passing through the most data points becomes the new axis, the data can be described along that single line, i.e. a single dimension, thus achieving dimensionality reduction.
Before digging into more details about PCA, let us understand the important concepts we need to know to perform it.
Dimensionality

Dimensionality is the number of features in the dataset. If your dataset has 100 columns, its dimensionality is 100.
Variance

Variance measures how much a random variable or feature deviates from its expected value (mean). Mathematically, it is the average of the squared differences between the actual values and the mean. Since the differences are squared, variance is always non-negative. In simpler terms, variance is a measure of how spread out values are around their expected value. Covariance, by contrast, measures how two variables vary together.
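A quick sketch of the definition in NumPy (the sample values are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
mean = x.mean()                  # expected value: 5.0

# Variance = average of squared deviations from the mean
var = np.mean((x - mean) ** 2)
print(var)                       # 5.0
print(np.var(x))                 # 5.0 -- same result with NumPy's built-in
```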
Eigenvector & eigenvalue
Let us consider a non-zero vector v. If multiplying a square (n × n) matrix A by the vector v (i.e. Av) yields a scalar multiple of v, then v is said to be an eigenvector of the square matrix A. This can be represented by the following equation:
Av = λv, where v is the eigenvector and λ is the corresponding eigenvalue.
As we know, a vector has both direction and magnitude. Applying the transformation A to one of its eigenvectors scales the vector by λ but does not change its direction. Further, an eigenvector must be non-zero. Rearranging the equation Av = λv gives:
Av − λv = (A − λI)v = 0, where I is the identity matrix
Eigenvectors and eigenvalues help us understand and interpret data better and are used to transform data and represent it in a more manageable form. Both concepts are vital for performing data science. As we will see later, it is straightforward to calculate eigenvectors and eigenvalues using Python’s NumPy library.
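For instance, here is a minimal sketch using NumPy’s `np.linalg.eig` (the matrix values are illustrative):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# Eigendecomposition: columns of eigen_vectors are the eigenvectors of A
eigen_values, eigen_vectors = np.linalg.eig(A)

# Each eigenpair satisfies A v = lambda v
v0 = eigen_vectors[:, 0]
print(np.allclose(A @ v0, eigen_values[0] * v0))   # True
print(sorted(eigen_values))                        # ascending: close to 2 and 5
```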
Correlation

The extent or degree to which variables are related to each other is called correlation. A positive correlation between A and B means A increases as B increases, and vice versa. A negative correlation means A increases as B decreases, and vice versa. If there is no correlation between A and B, the value is 0. The correlation coefficient always lies between −1 and +1.
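A small sketch of these extremes with NumPy’s `np.corrcoef` (the arrays are constructed so the relationships are perfectly linear):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2 * a + 1          # perfectly positively correlated with a
c = -a                 # perfectly negatively correlated with a

# np.corrcoef returns a correlation matrix; [0, 1] is the off-diagonal entry
corr_ab = np.corrcoef(a, b)[0, 1]
corr_ac = np.corrcoef(a, c)[0, 1]
print(corr_ab, corr_ac)   # 1.0 -1.0
```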
We have already discussed how PCA reduces an n-dimensional space to a lower dimension by reducing the number of variables. The reduction is performed by choosing directions that do not correlate with each other, i.e. they are orthogonal (perpendicular) to each other. Each such direction minimizes the average squared perpendicular distance from the data points to the line. Repeating this process forms an orthogonal basis in which the individual data dimensions are uncorrelated (orthogonal). These basis vectors are known as principal components; in the diagram above, they are PC1 and PC2.
Feature vector

A vector, as we know, has both direction and magnitude, so we can represent it spatially. A vector that contains multiple elements (attributes) describing an object is called a feature vector. A feature is a numerical representation of some property of an object, such as a keyword, an image, or a sound length. For an image, a feature vector can contain information like colour, size, shape, edges, intensity and so on. A feature vector is represented as a matrix:
F = [f1, f2, …, fn]ᵀ (a column vector of n features)
The space spanned by all the feature vectors together is called the feature space:
Feature vectors can represent many kinds of data, which makes them an excellent input format for machine learning algorithms and data analysis.
How does the PCA algorithm work?
We already know that all the principal components resulting from the reduction are orthogonal. The principal components (PCs) are nothing but linear combinations of the original variables and follow the principle of least squares. The PCs are ordered by importance: PC1 is the most important, PC2 is less important than PC1, PC3 less than PC2, and so on. This is because the variance captured by the components decreases as we move from PC1 (first) to PCn (last).
So, how does this orthogonal transformation happen? How can we flip or rotate the dimensions and reduce them? The simple answer is by finding the components with the largest variance. Let us perform PCA step by step:
1. Standardization of variables
Initially, all the variables will have a different range of values. For example, the variable height may range from 60-180, whereas weight values may range from 1-100. To find the correlation between both, we need to bring them to a comparable scale. Otherwise, the variables that have a larger range will overshadow the variables with smaller ranges, and we would get biased results. Standardization is done using the following formula:
s = (value – mean)/standard deviation
Refer to our article on Statistics and probability to know how to calculate the above values.
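The formula above can be sketched directly in NumPy, or with scikit-learn’s StandardScaler (the height values here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

heights = np.array([[160.0], [170.0], [180.0]])

# Manual z-score: (value - mean) / standard deviation
s_manual = (heights - heights.mean()) / heights.std()

# Same result with scikit-learn (it also uses the population std, ddof=0)
s_sklearn = StandardScaler().fit_transform(heights)
print(np.allclose(s_manual, s_sklearn))   # True
```

After standardization the variable has mean 0 and standard deviation 1, so variables with different original ranges become comparable.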
2. Computing the covariance matrix
A covariance matrix is used to determine the correlations between all possible pairs of variables. If the covariance is negative, the variables are inversely proportional; if it is positive, the variables are directly proportional to each other. The covariance matrix can be represented as:
C = | Cov(a,a)  Cov(a,b) |
    | Cov(b,a)  Cov(b,b) |
Where a & b are the variables and the matrix is a 2×2 or 2-dimensional matrix, meaning that there are two variables. If there are ten dimensions, the covariance matrix will be a 10×10 matrix.
The covariance matrix is symmetric, since Cov(a,b) = Cov(b,a).
Finding covariance helps us identify heavily dependent variables that contain biased and redundant information which can result in reduced model performance.
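A quick sketch with NumPy’s `np.cov` (the data is randomly generated, with b deliberately constructed to depend heavily on a):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = 2 * a + rng.normal(scale=0.1, size=100)   # strongly dependent on a

# np.cov expects variables as rows, so stack a and b
C = np.cov(np.stack([a, b]))

print(C.shape)                         # (2, 2) -- one row/column per variable
print(np.isclose(C[0, 1], C[1, 0]))   # True: the matrix is symmetric
```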
3. Calculating the eigenvector and eigenvalue
Eigenvectors and eigenvalues are calculated from the covariance matrix. Remember that every eigenvector has an associated scalar value, called an eigenvalue. The sum of all the eigenvalues of the covariance matrix equals the total variance. The principal components are ordered by significance, with the first component (PC1) being the most significant. So, if there are ten original variables in your dataset, there will be ten eigenvalues and hence 10 PCs, but the most important information will be contained in the first few, and somewhere from PC6…PC10 can be eliminated. Mathematically, principal components are linear combinations of the initial variables and represent the directions of maximum variance. If you plot a graph, each PC is a new axis that offers the best angle to view the data.
So, how do we identify which PC is most significant? The one with the highest eigenvalue.
4. Determine the principal components
We determine the principal components (PC) using the values of eigenvector and eigenvalue. The components are identified in such a way that they still preserve the important information contained in the data.
If we arrange the eigenvectors in order from highest to lowest eigenvalue, we get PC1 to PCn for an n-dimensional dataset. Suppose there are three variables in the dataset; the covariance matrix then yields three eigenvalues. Let’s say the values are 0.23 (λ1), 1.34 (λ2) and 0.007 (λ3), respectively. We observe that
λ2 > λ1 > λ3
This means the eigenvector (v2) corresponding to the eigenvalue λ2 is PC1, v1 corresponding to λ1 is PC2, and v3 corresponding to λ3 is PC3.
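The ranking step above can be sketched with the same λ values (the vector labels v1–v3 stand in for the actual eigenvectors):

```python
import numpy as np

eigen_values = np.array([0.23, 1.34, 0.007])   # lambda1, lambda2, lambda3
labels = ['v1', 'v2', 'v3']

# Sort eigenvectors by descending eigenvalue to rank the principal components
order = np.argsort(eigen_values)[::-1]
ranking = [labels[i] for i in order]
print(ranking)   # ['v2', 'v1', 'v3'] -> PC1, PC2, PC3
```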
5. Dimensionality reduction using PCA
Once we know the Principal components, we can reduce the dimensions of the data by applying the algorithm and get the reduced set of variables, which can be further processed for data analysis. Imagine having 1000 variables in the original dataset and then reducing them into about 200 variables and then analysing only those 200 variables with all the important information retained – what a saving of resources and time!
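The whole reduction can be sketched end to end: project the standardized data onto the top-k eigenvectors. This is a minimal illustration on random data, assuming the data is already standardized:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features

# Covariance matrix and its eigendecomposition
cov = np.cov(X.T)
eig_vals, eig_vecs = np.linalg.eig(cov)

# Keep the two eigenvectors with the largest eigenvalues as projection matrix W
top2 = np.argsort(eig_vals)[::-1][:2]
W = eig_vecs[:, top2]                  # shape (3, 2)

# Project the data onto the principal components: 3 features become 2
X_reduced = X @ W
print(X_reduced.shape)                 # (100, 2)
```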
Pros and Cons of PCA
PCA(Principal Component Analysis) has its advantages and disadvantages, just like everything else in this universe!
Advantages of PCA
- Removes correlated features: the resulting principal components are uncorrelated with each other
- With fewer features, the performance of the algorithm significantly improves
- With fewer variables (features), the risk of overfitting is reduced
- Makes data visualization easier because of the lower dimensionality
Disadvantages of PCA
- It is essential to standardize the data and scale the features to find the correct principal components, which requires converting all features into numerical values
- Since principal components are linear combinations of the original features, they are not easily readable or interpretable
- Information loss can occur if the principal components are not selected properly
Implementing PCA in Python
It is very easy to implement PCA in Python because of its friendly libraries. For this article, we will use the scikit-learn library, which implements PCA in its decomposition module. If you want a quick recap of Python, read our article on Python for data science.
Let’s first load the dependencies (libraries):
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the data from your dataset using the pandas library
dataset = pd.read_csv('path to csv')
df1 = pd.DataFrame(dataset, columns=[required columns])

# Perform data preprocessing on the original dataset: standardization,
# removing/replacing null values, converting variables into numerical values etc.
X_std = StandardScaler().fit_transform(df1)  # standardization

# Find the covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_matrix = (X_std - mean_vec).T.dot(X_std - mean_vec) / (X_std.shape[0] - 1)
# Equivalently:
cov_matrix = np.cov(X_std.T)

# Calculate eigenvalues and eigenvectors (eigendecomposition)
eigen_values, eigen_vectors = np.linalg.eig(cov_matrix)

# Sort eigenvalue/eigenvector pairs in descending order of eigenvalue
# to rank the principal components
eigen_sorted = sorted(
    [(np.abs(eigen_values[i]), eigen_vectors[:, i]) for i in range(len(eigen_values))],
    key=lambda pair: pair[0], reverse=True)

# Apply the PCA technique for dimensionality reduction
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

# Plot the cumulative variance explained by the principal components
pca_full = PCA().fit(X_std)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Total explained variance')
plt.show()
```
Applications of PCA
There are several applications of PCA(Principal Component Analysis) dimensionality reduction technique in the fields of computer vision, facial recognition and image compression. It is also used extensively in medical diagnosis, anomaly detection, bioinformatics, data mining, financial risk management, spike sorting (neuroscience) etc.
We have included as many details as possible about PCA (Principal Component Analysis) in this article. PCA is reasonably straightforward when done with care and focus. It takes practice to understand and identify the principal components in your dataset. Since many libraries help you in the process, it is crucial to know how to use those libraries and frameworks. You should also be familiar with statistics and mathematics to understand the calculations for variance, mean square and other important parameters.