Introduction to Classification Algorithms


There are three main types of machine learning algorithms – supervised, unsupervised, and reinforcement. Classification is a type of supervised learning in which incoming data is classified or labelled based on past data. The algorithm is trained by feeding it different types of labelled data that can be categorized into different outcomes. For example, we can train a model to identify cats, dogs, and horses in the data it is fed, based on features like eyes, colour, size, shape, etc. Once the model is trained, it can categorize a new set of data into the same categories – cats, dogs, and horses.

Classification Algorithms Scenarios

Different machine learning algorithms can solve different types of problems. Sometimes more than one model (algorithm) is applied to the data to get the final solution. Some typical scenarios where classification algorithms are used are:

  • Detecting if an email is spam or not
  • Identifying the gender of a person based on certain features (like hair length, body structure, etc.)
  • Grouping similar items together – for example, the same type of files, same fruits, the same type of news items, the genre of music and movies, etc.
  • Identifying if a user will buy particular accessories with a product

There is no particular rule that states what type of algorithms should be used for a particular problem. Generally, data scientists and machine learning engineers perform controlled experiments to determine the most suitable algorithm for a particular task. If we were to represent a classification algorithm mathematically, it would be a mapping function from the input variables A into output variables B, where B = f(A).
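To make the mapping B = f(A) concrete, here is a minimal sketch in Python using scikit-learn; the feature rows and labels are made up purely for illustration. The classifier learns f from labelled examples and then applies it to unseen input.

```python
# A minimal sketch of B = f(A): learn f from labelled data, then apply it.
# The feature values and labels below are invented for illustration only.
from sklearn.neighbors import KNeighborsClassifier

A = [[150, 0], [170, 1], [160, 0], [180, 1]]  # input variables (toy features)
B = ["cat", "dog", "cat", "dog"]              # output variables (known labels)

f = KNeighborsClassifier(n_neighbors=3).fit(A, B)  # learn the mapping f
print(f.predict([[165, 1]]))                       # apply f to new input
```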

What are Classification Algorithms?

As we have already read, classification algorithms classify or categorize items based on their similarities. The model is trained using a set of data for which the output is already known. Once the model is trained and has good accuracy, new data is supplied for testing and evaluation. The outcomes are called target classes or class labels. For example, if the model has to classify the fruits below based on their shape, size, and colour, the target classes will be each of the fruits.

(Figure: a mixed fruit basket; image source: Wikipedia)

When we classify this, we will get the result as:

(Figure: the same fruits, each labelled with its class)

This labelling and classification of data can be done using many algorithms. Before getting into that, let us explore some common terminologies used in classification.

Important terms in classification

Feature – a measurable property of the data. For example, height, length and size are features.

Classifier – an algorithm that maps the input into one of the categories. For example, a linear classifier or a naïve Bayes classifier.

Label – a name assigned to an output category. For example, Apple and Orange are labels for fruits.

Class – the categories into which the problem's instances are sorted. For example, in the problem of identifying whether an email is spam or not, the target classes are ‘spam’ and ‘not spam’.

Types of classification

There are different types of classification algorithms:

1. Binary classification

Binary classification has two target labels. These are usually ‘yes’ or ‘no’ outcomes to a problem. For example, a person will buy a particular course or not, an email is spam or not, a patient has a particular disease or not, a product has passed a quality test or not. Binary classification always reduces to one of the outcomes below:

(Figure: binary classification outcomes)

Some of the popular algorithms used for binary classification are k-NN, decision trees, logistic regression, SVM and Naïve Bayes. Algorithms like logistic regression and SVM are inherently binary, although they can be extended to handle more than two classes.
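As a small sketch of binary classification in practice, the snippet below trains a logistic regression on a synthetic two-class data set from scikit-learn; the data is generated, not real.

```python
# A minimal binary classification sketch using synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)   # learn the yes/no boundary
print(clf.predict(X_test[:5]))                     # predicted 0/1 labels
print(clf.score(X_test, y_test))                   # classification accuracy
```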

The accuracy of each of the above classifiers can be evaluated using various methods, such as the confusion matrix, ROC curve, cumulative gain and lift charts, and binary classification metrics like classification accuracy, error rate, sensitivity and specificity.
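A few of these metrics can be computed with scikit-learn as in the sketch below; the true and predicted labels are toy values chosen only to show the calls.

```python
# Evaluating a binary classifier with a confusion matrix and common metrics.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy model predictions

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))     # classification accuracy
print(recall_score(y_true, y_pred))       # sensitivity (true positive rate)
print(precision_score(y_true, y_pred))    # precision
```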

2. Multi-class classification

If a task has more than two class labels, it comes under multi-class classification. Unlike binary classification, there is no yes or no outcome; instead, classification is done amongst a set of known classes. A data sample can belong to only one class – for example, an animal can be a lion or a tiger but not both at the same time. Multi-class classification is the most commonly used type of all, some of the use-cases being:

  • Filtering out particular words in a text (email, document, resume, etc.) to identify phishing emails, relevant keywords etc.
  • Identifying the rating (A, U, U/A) of a movie based on the tags and content
  • Helping businesses predict the next set of products that a user might like to buy based on their current purchases and preferences
  • Handwriting recognition, image classification, intent classification

The most popular algorithms used for this type of classification are k-NN, decision trees, random forest, gradient boosting, Naïve Bayes and neural network classifiers. Don’t worry, we will learn about each of these later in this article.
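As a small sketch (not a definitive implementation), a multi-class model can be trained on the classic three-class iris data set that ships with scikit-learn:

```python
# Multi-class classification on the iris data set (three flower classes).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)            # labels are 0, 1 or 2
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict(X_test[:5]))               # exactly one class per sample
print(clf.score(X_test, y_test))             # accuracy over all three classes
```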

3. Multi-label classification

In multi-label classification, a data sample can belong to several labels at once. For example, a person can be adept in history as well as physics, or love both comedy and horror movies, a piece of news can belong to both the sports and the music categories, and so on. Multi-label classifiers require knowledge of the correlation between the various classes and can be considered a generalization of multi-class classification. The most common classifiers used are modified versions of k-NN, decision trees, kernel methods and neural networks. Multi-label algorithms can be implemented in Python using the scikit-learn package.
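Here is a minimal sketch of this in scikit-learn, using a synthetic multi-label data set (every value here is generated for illustration):

```python
# Multi-label classification: each sample may carry several labels at once.
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

X, Y = make_multilabel_classification(n_samples=100, n_classes=3,
                                      random_state=0)  # Y has 3 label columns

clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0)).fit(X, Y)
print(clf.predict(X[:3]))   # each row can contain several 1s (active labels)
```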

4. Imbalanced classification

In imbalanced classification, the examples are not equally distributed across the classes. It is typically a binary classification problem in which most of the examples belong to a majority class A (the ‘normal’ cases) and only a few belong to a minority class B (the ‘abnormal’ cases).

Some typical scenarios where imbalanced classification is used are outlier detection, fraud detection and medical diagnosis tests. Decision trees, SVM and logistic regression are commonly used algorithms. Performance is measured using metrics such as precision, recall and F-measure.
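As a sketch of how this might look, the snippet below generates a data set where roughly 95% of the samples fall in the majority class and reports precision, recall and F-measure; the class proportions are assumptions chosen for illustration.

```python
# Imbalanced binary classification: ~95% "normal" vs ~5% "abnormal" samples.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# class_weight="balanced" compensates for the rarity of the minority class
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # precision/recall/F1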

Types of Learners

There are two types of learners in a classification algorithm:

  • Lazy learners: These learners perform classification only after receiving the test data. They spend more time on prediction and less time on training. k-nearest neighbours and case-based reasoning are lazy learning algorithms.
  • Eager learners: Eager learners do not wait for testing data. They work on training the model, thus spending more time on training, and reducing the time spent on prediction. Decision trees, Artificial neural networks and Naïve Bayes are eager learning algorithms.

Popular classification algorithms

We have discussed the types of classification above and the algorithms used for each. Let us now explore the algorithms in more detail. We have already described some of them in our article on supervised learning, if you wish to go into details.

1. k-NN

k-NN refers to k-nearest neighbours. In this algorithm, the category of the new data sample is found by determining its nearest neighbours. If the data point has more neighbours of a particular class, the data point is assigned the same class. A simple diagram to illustrate the same is as follows:

(Figure: k-NN – a new white circle with three blue and two red neighbours)

In the above picture, for the white circle, there are three blue neighbours and two red neighbours. As 3 > 2, the new circle will be classified in the blue category.
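The same majority vote can be sketched with scikit-learn’s KNeighborsClassifier; the five toy points below stand in for the blue and red neighbours in the figure.

```python
# k-NN: the new point takes the class of the majority of its k neighbours.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 2], [5, 5], [6, 5]]   # toy points (made up)
y = [0, 0, 0, 1, 1]                            # 0 = "blue", 1 = "red"

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[2, 3]]))   # majority of the 5 neighbours is "blue" (0)
```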

2. SVM (Support Vector Machines)

In SVM, each data item (sample) is marked as a point in n-dimensional space, where n is the number of features. Each feature represents the axis coordinate. To classify the items, a hyperplane is constructed that best separates each class of data items.

Linear hyperplane: In this case, we can separate the data items linearly, i.e. with a straight line. Sometimes there might be an outlier (as shown below); however, SVM is largely insensitive to such outliers and effectively ignores them.

(Figure: a linear hyperplane separating two classes, with one outlier)

Non-linear hyper-plane: If you have the below plot of data items, how would you classify them?

(Figure: data points that cannot be separated by a straight line)

For this, we have to create another dimension and represent the data in it. If we look at the data from this new dimension, below is how it looks (think of the view below as the same data seen from the side, and the earlier view as seen from the top):

(Figure: the same data viewed from the side, now separable by a straight line)

If we now classify the data, we can see that the hyperplane is a circle.

(Figure: projected back to the original view, the separating boundary is a circle)

This idea underlies the kernel trick: a lower-dimensional space is converted into a higher-dimensional one in which the classes become separable by a hyperplane. In practice, the kernel function computes this separation implicitly, without having to construct the extra feature (axis) by hand.
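A short sketch of this with scikit-learn’s SVC: on concentric circles (a synthetic data set that is not linearly separable), a linear kernel struggles, while the RBF kernel, which applies the kernel trick, separates the classes.

```python
# The kernel trick in practice: linear vs RBF kernel on concentric circles.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)   # no straight line separates circles
rbf = SVC(kernel="rbf").fit(X, y)         # implicit higher-dimensional mapping
print(linear.score(X, y), rbf.score(X, y))  # rbf accuracy should be near 1.0
```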

3. Decision tree

As the name suggests, decision trees split the data into smaller subsets to form a tree-like structure. The nodes split based on the answer to a question; for example, the question “will it rain tomorrow?” can have three answers – yes, no and maybe. Based on these three answers, the data can be split into smaller subsets. From these answers, further questions are formed, and the tree keeps splitting until no more splits are possible. Check out my article on the decision tree in machine learning for details on this algorithm.
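As a minimal sketch, scikit-learn can print the yes/no questions a fitted tree has learned; the iris data set and the depth limit are chosen only to keep the output readable.

```python
# A small decision tree and the questions (splits) it has learned.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # each node is a question that splits the data
```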

4. Naïve Bayes

Naïve Bayes, which is based on the Bayes theorem, is considered to be the easiest of the classification algorithms, and it is also one of the most popular. Bayes theorem uses probability and conditional probability to predict the class of unseen data. For simplicity, ‘naïve’ Bayes assumes that each feature in the data set is independent of the others. The Bayes theorem is as follows –

P(A|B) = P(A)*P(B|A)/P(B)

where, P(A|B) = posterior probability of the target A, given B (attributes) are true

P(A) = prior probability of class A

P(B|A) = likelihood of or probability of the predictor B given the class A is true

P(B) = probability of the predictor B (attributes)

Let us take a simple example to understand this algorithm.

Let us say we keep only the honour cards of a pack – the ace, king, queen and jack of each suit, 16 cards in total. If we draw a card and are told it is a Q, what is the probability that it is the Q of spades, i.e. that the card is a spade?

We have to predict the chance that the card is a spade (A) given that it is a Q (B), using only these 16 cards.

The probability of getting a ‘Q’ out of the 16 cards, i.e. P(B) = 4/16 = 1/4

The probability of getting a spade, i.e. P(A) = 4/16 = 1/4

Now, the likelihood of getting a Q if we know that the card is a spade, i.e. P(B|A) = 1/4 (one Q among the four spades)

Therefore, the posterior probability that the card is a spade, given that it is a Q:

P(A|B) = (1/4 * 1/4) / (1/4) = 1/4
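The same calculation can be checked with plain Python arithmetic; nothing here is library-specific.

```python
# Reproducing the card example numerically with Bayes theorem.
p_q = 4 / 16                 # P(B): four queens among the 16 cards
p_spade = 4 / 16             # P(A): four spades among the 16 cards
p_q_given_spade = 1 / 4      # P(B|A): one queen among the four spades

p_spade_given_q = p_spade * p_q_given_spade / p_q   # Bayes theorem
print(p_spade_given_q)                              # 0.25, as computed above
```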

This is a simple example; however, it is excellent for understanding how the probabilities combine. Some of the real-world problems solved by the Naïve Bayes algorithm are –

  • Text classification
  • Sentiment analysis
  • Filtering spam messages
  • Recommender systems (along with collaborative filtering)

In Python, the scikit-learn package offers several types of Naïve Bayes models, including Gaussian, Bernoulli and Multinomial.
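Here is a minimal sketch using the Gaussian variant, which suits continuous features like the iris measurements:

```python
# Gaussian Naïve Bayes on the iris data set.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)        # assumes features are independent Gaussians
print(nb.predict(X[:1]))           # most probable class for the first sample
print(nb.predict_proba(X[:1]))     # posterior probability of each class
```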

5. Logistic regression

While the name suggests it has something to do with regression, logistic regression is a popular algorithm for binary classification problems (binomial logistic regression). There can also be more than two classes, which is called multinomial logistic regression. The core of the algorithm is the logistic or sigmoid function, which maps any real value into a value between 0 and 1. If we plot it, we get an S-shaped curve, the function being p(x) = 1 / (1 + e^(-βx)). The curve flattens out near 0 and 1.

As the input becomes large and positive, the output gets closer to 1. On the contrary, as the input becomes large and negative, the output gets closer to 0.

Logistic regression is considered a generalized form of a linear model, because the decision boundary of the classification is linear. The model produces probabilities based on which classification is done, and a threshold value is set that decides the outcome. A simple example is predicting whether an employee will get a job. Since probability always lies between 0 and 1, the threshold is typically set at 0.5, i.e. if the probability is greater than 0.5, the employee is more likely to get the job. This probability is determined from a combination of factors, let’s say, confidence, communication and subject knowledge. After applying the logistic (sigmoid) function, we would get a plot in the shape of an S for the given data.

(Figure: an S-shaped logistic regression curve)
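A few sample values of the sigmoid make the S shape and the 0.5 threshold easy to see; this sketch uses only the Python standard library.

```python
# The logistic (sigmoid) function maps any real value into (0, 1).
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 3))  # tends to 0 on the left, 1 on the right
```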

6. Random forest

Random forest builds many decision trees; generally, the more trees, the stronger the algorithm becomes. Random forest reduces the over-fitting that can occur with a single decision tree. In this algorithm, random samples are drawn from the data and decision trees are constructed from these samples. Each decision tree gives an outcome; if there are n decision trees, we get n outcomes. Finally, voting is performed across the trees, and the best-predicted (majority) value becomes the outcome. It is a type of ensemble method for machine learning.

(Figure: a random forest – many decision trees voting on the final class)
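A short sketch with scikit-learn, where predict() already performs the voting across the individual trees:

```python
# Random forest: many trees are built and their votes decide the class.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(len(forest.estimators_))   # the 100 individual decision trees
print(forest.predict(X[:3]))     # majority vote across all trees
```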

7. Gradient boosting

GBM (Gradient Boosting Machine) is another ensemble algorithm; it builds a prediction model by combining weak learners, most often decision trees. By combining the weak models one after another, we get a more robust model. This is done by minimizing a loss function such as the mean squared error (MSE), i.e. the average of the squared differences between the target and predicted values. Each new weak model is fitted to the errors of the current ensemble, so the overall error reduces as models are added. Depending on the loss, the gradient takes different forms (for example, the residual vector for squared loss or the sign vector for absolute loss), and GBM uses the optimization technique gradient descent to find the minimum error by moving in the direction of steepest descent. Think of this as going down a hill along the steepest path until you reach the bottom, from where you can only go upwards.

(Figure: gradient descent moving step by step towards the minimum error)
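As a small sketch, scikit-learn’s GradientBoostingClassifier adds one shallow tree per stage, each fitted to the errors left by the previous stages; the parameters below are illustrative defaults, not tuned values.

```python
# Gradient boosting: weak trees added in sequence, each correcting the last.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=0).fit(X_train, y_train)
print(gbm.score(X_test, y_test))   # accuracy of the combined (boosted) model
```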

8. Neural networks

Neural networks are deep learning techniques that try to mimic the brain in making decisions and predicting outcomes. In such a network, each neuron is connected to many others through connection links. Each link carries a weight, which contains information about the input signal and either amplifies or inhibits it. The output is the result of processing the input through an activation function. The three layers of an artificial neural network are:

  • The input layer (collects the input and gives a pattern to the hidden layers)
  • Hidden layers (where the input is processed; there can be n hidden layers)
  • Output layer (collects and transmits the output)

The hidden layer consists of aggregation and activation functions. The entire ANN structure can be represented as:

(Figure: structure of an artificial neuron; image source: Wikimedia)

where the transfer function is the weighted sum net_j = Σ_i x_i · w_ij, and the activation o_j = φ(net_j) is the neuron's output.

ANN follows an iterative learning process in which rows are presented to the network one at a time, and the weights are adjusted each time based on the errors of the previous output. The weights keep adjusting until the desired accuracy in predicting the correct class is achieved. A trained neural network can also classify entirely new data patterns that it was not trained on. The several types of ANN include the Perceptron, Convolutional Neural Networks, LSTMs, shallow neural networks, recursive neural networks, modular neural networks, etc.
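A minimal sketch of this iterative weight adjustment uses scikit-learn’s MLPClassifier (a simple feed-forward network); the single hidden layer of 10 neurons is an arbitrary choice for illustration.

```python
# A small feed-forward neural network trained iteratively on iris.
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
mlp = MLPClassifier(hidden_layer_sizes=(10,),  # one hidden layer, 10 neurons
                    max_iter=2000,             # iterations of weight updates
                    random_state=0).fit(X, y)
print(mlp.score(X, y))   # training accuracy after the weights are adjusted
```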

Neural networks are used for NLP (Natural Language Processing), text classification, part-of-speech tagging, paraphrase detection, language generation, document summarization, targeted marketing, medical diagnosis and many more.

Applications of Classification Algorithms

Some of the most critical and practical applications of classification algorithms are:

  • Speech recognition
  • Handwriting detection
  • Document classification, image and video classification
  • Biometric identification
  • Recommender systems
  • Internet traffic inspection
  • Click-stream analysis (i.e. to predict if a user will buy certain products based on their past clicks and purchase history)

Conclusion

Through this article, we have introduced you to one of the supervised learning approaches, i.e. classification algorithms, which include many classifiers like SVM, logistic regression, random forest and so on. As this is only an introductory article, full implementations are out of scope – the short sketches above are only illustrative – and will be covered in separate articles in detail. To summarize,

  • Data is categorized into several classes using different classifiers
  • A type of Supervised learning mechanism
  • Makes conclusions and predictions based on the values received through model training
  • Most popular algorithms are SVM, Random forest, Naïve Bayes, decision tree, GBM, neural networks, logistic regression etc., each of which can be implemented using the scikit-learn package of Python
  • Different types of classification algorithms are binary, multi-class, multi-label and imbalanced.
