What is Unsupervised Learning?

By | May 16, 2020
What is Unsupervised Learning

If you are new to machine learning, we suggest you read our articles about Machine learning and supervised learning. If you know about both, you can quickly grasp unsupervised learning as well.

What is Unsupervised Learning?

In unsupervised learning methods, data is fed to the system. The machine classifies, sorts, groups and finds patterns on its own without any human intervention. This is unlike supervised learning where we label or classify the inputs. Thus, unsupervised methods can solve complex problems and can group the data based on similarities or differences on its own. There is no prior training involved.

Vamware

For example,

A child sees many people around. How does the child differentiate between other children, adults, elderly? You wouldn’t point each person and label them – rather, the child will observe and find similar characteristics, for example, hair, body built, height, voice and other features to categorize people into different ages.

Compared to supervised learning, unsupervised learning is more difficult. Since we have nothing to compare or label, we need experts to find out whether the insights are useful or not. Further, the algorithm may pick some categories that may confuse the algorithm and product irrelevant results.

Let us take a simple example,

Suppose you feed data containing bats and balls. Unsupervised learning can sort bats and balls based on the characteristics of each; however, it might add more categories like the color – so will it sort based on color or based on object type? This makes unsupervised learning a little more unpredictable compared to other machine learning methods.

Despite this, unsupervised learning is considered useful and important because –

  • It can find many classes and patterns in data that may not be possible otherwise
  • Labeled data needs human intervention, but unlabeled data doesn’t, so this method is more suitable for real-time data
  • Labeling huge datasets also involves a lot of time and cost, so only a few samples can be labeled manually

Types of Unsupervised Learning

There are many unsupervised learning types, which come under three categories –

  • Clustering – Clustering involves sorting the data into various groups (called clusters) based on their similarities. Each cluster has similar items and is different from other clusters. The features for distinction can be anything like color, shape, size, etc…

Clustering

  • Hard clustering – In this type, every data point either belongs to one cluster completely or not. The above example of colors is hard clustering.
  • Soft clustering (Fuzzy clustering) – in this type, instead of putting the data points strictly into a cluster, the likelihood of the data point belonging to a particular cluster is assigned. So, each data point will have a probability (less or more) to be assigned to any of the clusters.

Unsupervised Learning Algorithms

Based on this concept, there are 100s of algorithms, but some are more popular than others. We will discuss some of the most popular algorithm types –

  • Connectivity models – In this type of model, we consider that the data points that are closer to each other have more similarities than the data points that are far. These models are useful for small datasets. Either all the data sets are first classified and then put into various clusters, or all data points are put into a single cluster and then divided as the distance increases.
  • Distribution models – These models work on the probability that all the data points of a cluster come under the same distribution.
  • Density models – the data points are separated as per regions of different density and the data points within certain regions are put into one cluster. An example of this model is DBSCAN.
  • Centroid models – these models are iterative and the number of clusters to be created should be decided in the beginning i.e. we should know the dataset. Similarities are determined by the proximity of data points to the centroid of the clusters.
  • Hierarchical models – This type of model creates a tree structure. The structure can be either top-down or bottom-up. Initially, each data point is treated as a cluster. As we iterate, we can combine two clusters based on the distance between them (smallest). This process is repeated until we reach the root, that is when all the data points will be in one cluster. Similarly, we can build the other clusters too.

K-means clustering

It is an efficient algorithm, based on the centroid model. We choose k number of points (clusters) in the beginning. Next, assign all the data points that are nearest to each of the k clusters to the corresponding cluster. Then calculate the mean of all the points that belong to the cluster (centroid). The value will become the next centroid. We repeat the process till then centroid value remains constant – once this happens, it means the data points have been all grouped. The main issue with this algorithm is that it has a high variance. Thus, the performance is not as good as other powerful algorithms.

DBSCAN

Clusters are not always in circles or spheres. They can take any other shape and size too. This is the main idea behind DBSCAN. Also, DBSCAN doesn’t need you to define the number of clusters beforehand. It can also identify noise and doesn’t include it in any of the clusters. The starting point is randomly chosen. After this, we can get the neighbors of this selected point, and if there are sufficient neighbors, clustering can happen. All the neighbors of the particular selected point become a part of the cluster. Same way, other clusters are formed. If a particular neighbor is too far away from the arbitrary data point, it is considered as noise.

EM (Expectation maximization) clustering using Gaussian mixture models

In this method, we use both the mean and standard deviation to form the cluster, thus giving it an elliptical shape. The data points are assumed to have Gaussian distribution. GMM is a very flexible model because one data point can belong to multiple clusters. To determine the class of an overlapping data point, we can just measure the probability of the data point to both the classes and choose the higher value.

Clustering algorithms are widely used–

  • To detect anomalies in data, for example, a customer who is randomly liking a lot of pages of Facebook in a short time,
  • For image compression and face detection
  • To merge points in a map that are close together
  • for market segmentation, for example, identify customers who like to buy coke along with pizza and separate them from those who like to order fries with pizza, etc…

Generalization (Dimensionality Reduction)

Generally, in classification problems, there are many random variables or features. As the number of features increases, the complexity of the model increases and so does the training time. If features are correlated, then adding so many features becomes redundant. By using dimensionality reduction, we can reduce the number of random variables (features) and obtain a set of ‘principal’ variables. Dimensionality reduction has 2 components – feature extraction (reduces the high-dimensional data to lower dimension) and feature selection (find a smaller subset of the original set of features). Dimensionality reduction methods can be linear or nonlinear.

Generalization

Principal Component Analysis (PCA)

PCA is a linear method for dimensionality reduction. It is based on the principle that the variance of data when it is reduced from higher dimension space to lower dimension (for example, 3D to 2D, or 2D to 1D) should be maximum. PCA helps in reducing the storage space through data compression and reduces overall computation time other than removing redundant features. This is done by transforming the variables into a set of principal components (another set of variables). These variables are orthogonal (zero correlation) because the principal components are eigenvectors from a covariance matrix. This ensures that the variance is retained even when the dimensions are reduced.

Singular Value Decomposition (SVD)

SVD is the most common method for matrix decomposition (factorization). In this method, the original matrix is split into two smaller matrices as,

A = U x D x VT

U and V are smaller matrices and D is the diagonal matrix. VT is the transpose of the matrix V.

SVD works on both square and non-square matrices and the matrices U and V are orthonormal. It can be used for data reduction, Principal Component Analysis (PCA), least-squares linear regression, removing noise from data and compressing images. The most important use of SVD is in a movie recommendation system where the ratings for movies are sparse. For example, 100 users would have watched a movie, but only 10 would have reviewed it. For such sparse data, SVD provides a good way of factorization and completion of data.

Random forests

Random forests are great algorithms for feature selection. We can create a large set of trees and find the most informative subset of trees. With that, we can reduce the number of trees (levels) to only those that do not have similar features. Random forests are an easy and efficient way to identify the features that are most predictive when compared to other features.

There are many other methods for dimensionality reduction, but these are the most performant ones. Some of the most useful applications of dimensionality reduction are –

  • Collaborative filtering and Recommender systems
  • Text mining
  • Customer Relationship Management
  • Face and handwriting recognition

Association rules

Association algorithms find links or patterns in the data and build a predictive model based on strong rules. A simple example would be market basket analysis, where if a person purchases rice and some lentils, he would also most likely like to purchase salt and some other spices; or if a person purchases bread, he is also likely to purchase butter, eggs or jam. All these show the typical association of one item with the other related items through data mining.

Customer 1 Bread, Butter Associations
Customer 2 Bread, Eggs {Bread, Eggs}
Customer 3 Bread, Butter, Eggs {Bread, Butter}
Customer 4 Milk, Eggs {Milk, Eggs}
Customer 5 Milk, Bread, Eggs

Some popular association rule algorithms are –

Apriori

If you have often seen ‘You may also like’ on Amazon and other e-commerce websites, this algorithm is the reason behind it. Apriori is a simple algorithm where-in through data mining, associations and patterns can be formed about the data. The name comes from the prior knowledge of the properties of items that can be bundled together. The most common use of this algorithm is seen in market basket analysis; however, it can also be used in detecting common medical conditions and diseases.

There are 3 steps to calculate association rule –

  • Support –Support tells us about the items that are frequently bought together.

Support = frequency (A, B)/N

Items with low fraction can be filtered out with this step.

  • Confidence – Suppose most users purchase items A and B together. Confidence tells us the number of times A & B are bought together, given the number of times A is bought.

Confidence = frequency (A, B)/frequency(A)

This filters those items that have less frequency compared to others. For example, if A is bread and B is butter, and there is C that represents jam, and the confidence for A&B is way more than A&C, then we can filter out A&C.

  • Lift – Even after support and confidence, there might be many records left. Thus, we calculate the lift to create the final association rule. Lift tells us about the strength of any rule. A higher lift value indicates a better association. For example, if a person buys bread and the chances of buying eggs are twice, the lift value will be 2.

FP-Growth

Frequent Pattern (FP) growth algorithm is an improvement over the Apriori algorithm. In this algorithm, a frequent pattern tree is created to represent the database. The tree depicts the association between different itemsets. One frequent item is used to fragment the data. Each fragmented pattern’s itemsets are analyzed and the most frequent patterns are identified. Unlike Apriori, which needs database scans for each iteration, FP growth creates a tree structure by scanning the database only twice. Every node of the tree indicates one item of the itemset. The first node (root) is null, and the subsequent nodes represent itemsets. It is a scalable algorithm, however more expensive compared to Apriori.

ECLAT

ECLAT or Equivalence Class Clustering and bottom-up Lattice Traversal is another more scalable version of the basic Apriori algorithm. As the name suggests, it follows the bottom-up approach, which makes it a faster algorithm. It needs less memory as it follows a depth-first approach.

Association algorithms are typically used –

  • To analyze browsing patterns of users
  • To predict sales and discounts
  • For targeted advertising
  • To place products on the racks in a shop – for example, keeping bread-eggs on the same rack

Summary

Well, in this article, we introduced you to the most important concepts in unsupervised learning. If you are just beginning your career as a data scientist or machine learning engineer, this introduction is a good head start for you. Check out our list of machine learning books to get in-depth knowledge about these algorithms and other important ML concepts. A quick summary of the article –

  • In unsupervised learning, the machine learns without any human intervention. There are no labels, unlike supervised learning.
  • There are three types of algorithms – clustering, association and generalization
  • Clustering algorithms are used to group similar items together based on features (random variables)
  • Generalization or dimensionality reduction algorithms reduce the dimensions to eliminate redundant features
  • Association rules find links and patterns in the data for predictive analysis

Although unsupervised learning models are unpredictable and need more data, they are extremely important because they can handle huge sets of raw, unstructured and unknown data with ease, which is not possible by humans.

You might be also interested in:

Leave a Reply

Your email address will not be published. Required fields are marked *