What is Supervised Learning?
Machines are constantly learning new things and processing more data than ever, which lets us solve business problems quickly and accurately. A machine that can use its “brain” to make decisions and solve problems is said to be an “artificially intelligent” system. Such a system can be trained to become better through learning, just as a human brain improves by observing similar patterns and practicing with similar kinds of data. The process by which a machine is trained and modeled, with or without human intervention, is called “machine learning.” Machine learning is of various types:
- Supervised learning
- Semi-supervised learning
- Unsupervised learning
- Reinforcement learning
Supervised learning is arguably the simplest of these. Think of it as similar to how you would teach a concept to your child. A child at a tender age wouldn’t know different colors or shapes. When you teach them about pink, blue, red, black, white and so on, you already know the colors (both the input and the output), and you are essentially passing on the knowledge that you have.
You do this by showing them objects of various colors and asking them to make the correct choice. For example, you ask them to pick a yellow fruit from a basket of fruits, or show them a leaf and ask its color. The child then learns to recognize and differentiate objects by color. If the child makes mistakes, you just correct them.
For a slightly older child, you would probably want to teach counting and adding numbers. So, you give them balls, cups or spoons to count, and once they count right, you give them examples of how one spoon and one more spoon become two spoons! This goes on until the child can do the same themselves with any combination of numbers, and you feel accomplished.
Supervised learning is based on the same principle – only that you are teaching the concepts to a computer!
In this type of machine learning, a training dataset is fed to a learning system, and once the machine is trained, it predicts outcomes on new data based on its previous learning experience. Some common applications of supervised learning are movie recommendation systems like those on Prime and Netflix; automatic classification of emails into categories like spam, important and social; traffic updates based on weather, route congestion and so on; and determining the age or salary of a person based on various known factors.
More technically, a supervised learning outcome can be represented in an input-output format as,
Y = function(X)
where X is the input, and Y is the output determined by applying the mapping algorithm or function created during the training phase.
Some features of supervised learning are –
- It works on a labeled dataset, i.e. data in which every input has a known output.
- The machine learns from past experience, which is nothing but the data fed to it.
- The performance of the machine usually improves with more data (i.e. more experience).
- A common split is to use about 80% of the data for training the model (learning) and about 20% to validate or test the model.
- During the testing phase, the output data generated by the machine is compared with the actual output results to determine the accuracy.
- Based on the accuracy, the model is tweaked and re-validated if required.
- The final model built through examples and learning is called a predictive model.
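The train-and-test cycle described in the list above can be sketched in a few lines of Python. This is only an illustrative sketch: the data (hours studied and a pass/fail label), the function names and the very simple threshold "model" are all assumptions made up for the example, not part of any particular library.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the labeled rows and split them into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of predictions that match the known (actual) labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labeled data: (hours_studied, passed)
data = [(hours, hours >= 5) for hours in range(10)]
train, test = train_test_split(data)

# "Training": use the smallest hours value labeled True in the training set as a threshold.
threshold = min(hours for hours, passed in train if passed)

# "Testing": compare the model's output with the actual labels of the held-out rows.
predictions = [hours >= threshold for hours, _ in test]
test_accuracy = accuracy(predictions, [label for _, label in test])
print(len(train), len(test), test_accuracy)
```

If the accuracy on the held-out 20% is not good enough, the model is adjusted and validated again, as the list above describes.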
Types of Supervised Learning
As we see in the above diagram, supervised learning can be of two types: classification and regression.
Each of these types has many algorithms, the most common of which are listed in the diagram below the type.
Classification divides a set of information into categories. For example, in our color example before, we can classify colored objects into different categories – pink, blue, green and so on. If we have different fruits, we can classify and separate them based on colors – yellow fruits, red fruits, green fruits, etc…
The most relevant example of classification that we see every day is how your email is classified into different categories as it arrives – inbox, important, spam, social, promotions and more.
In the same way, determining whether a person would purchase a particular item is a classification problem, specifically a binary classification problem, as there are only two choices: yes or no!
The training data set would have all kinds of data that would help the system create a mapping function, Y = function(X).
The result (Y) of a classification task would be a set of discrete values. There are many applications of classification algorithms other than the spam filter, like determining a credit score, handwriting, face and speech recognition, medical imaging, detection of pedestrians by an automated car driving system, sentiment analysis, classification of drugs, document classification, etc…
Support Vector Machines
Suppose there are n features that can be used to classify data points into categories. The SVM algorithm finds a hyperplane in this n-dimensional space (a straight line when there are two features, a flat plane when there are three) that acts as a decision boundary separating the data points of one class from those of the other classes. Among all the hyperplanes that separate the classes, SVM picks the one with the maximum margin, i.e. the greatest distance to the nearest data points of each class, which helps it classify new data points more reliably. For example, let us say we want to accept interview applications from applicants who have scored more than 65% and are less than 25 years of age. Based on these two features, marks (x) and age (y), each applicant is a pair of coordinates (x, y) whose output is either a red circle or a green triangle.
Since the data points are scattered in our case, a straight line cannot separate them, and we need a non-linear (in this case circular) decision boundary. On a normal plane, this information will look messy and scattered. The circular boundary separates the items while keeping the largest possible margin from the items closest to it.
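To make computations easier for an n-dimensional space with more features, it is a good idea to use a kernel function. A kernel function returns the dot product that two points would have in a higher-dimensional space, where the classes may become linearly separable, without ever computing their coordinates in that space. This reduces the complexity of the computations: only the dot product changes to match the space we have chosen.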
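To make the kernel idea concrete, here is a small sketch using a degree-2 polynomial kernel on two features. The mapping phi and the sample points are assumptions chosen for illustration; the point is only that the kernel gives the same number as the dot product in the larger space, without building that space.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(point):
    """Explicit map from the 2-D input space to a 3-D feature space."""
    x1, x2 = point
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(z))   # dot product after mapping both points to 3-D
via_kernel = poly_kernel(x, z)   # same quantity without the mapping
print(explicit, via_kernel)
```

Both values come out as 16 (up to floating-point rounding), which is why an SVM can work with the kernel alone.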
Training an SVM model takes a long time and requires extensive computation, which makes SVM costly for huge datasets with many features.
A linear classifier achieves classification (grouping based on categories) by computing a weighted sum of the feature values and assigning a class according to which side of a threshold that sum falls on. Linear classifiers are simple and well suited to working with huge datasets. They can be further divided into generative and discriminative models. The most common generative model is the Naïve Bayes classifier, whereas the most common discriminative model is the logistic regression model. An SVM that uses a linear kernel is also a linear classifier; when the classes cannot be separated by a straight line or a flat plane, as in the example discussed above, a non-linear kernel is needed.
Naïve Bayes classifier – The Naïve Bayes classifier is based on Bayes theorem, which determines the probability of an event from the evidence that has been observed. It is called naïve because it assumes that the features are independent of each other and that each feature contributes equally to the final outcome. For example, consider determining whether a cricket match should be played on a particular day based on factors like a holiday, temperature, wind conditions, humidity, etc… Each of these factors is treated as independent of the others and as contributing equally to the outcome (match = Yes or No).
Mathematically, Bayes theorem calculates the probability of occurrence of an event A, given that event B has already occurred (is true), as –
P(A|B) = (P(B|A)*P(A))/P(B)
where B is the event that has already happened, i.e. the evidence, and P(A|B) is the probability of event A’s occurrence considering that event B has occurred.
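As a quick worked example of the formula, the sketch below plugs in some numbers for the cricket scenario. The probabilities are hypothetical values invented for illustration, with A standing for "the match is played" and B for "the day is sunny".

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical probabilities, assumed only for this example
p_play = 0.6              # P(A): a match is played on a given day
p_sunny_given_play = 0.5  # P(B|A): the day was sunny, given that a match was played
p_sunny = 0.4             # P(B): a day is sunny

p_play_given_sunny = bayes_posterior(p_sunny_given_play, p_play, p_sunny)
print(round(p_play_given_sunny, 2))  # 0.75
```

With these numbers, knowing that the day is sunny raises the probability of a match from 0.6 to about 0.75.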
Logistic regression – Logistic regression itself is not a classifier but provides a means to create one: it estimates a probability between 0 and 1, and applying a cut-off to that probability gives the class. It is extensively used whenever the dependent variable is a simple yes/no, 1/0, true/false, success/failure, etc… i.e. a binary or, more technically, a dichotomous variable. Apart from the dependent variable (let us say Y), there are explanatory variables (consider them as X) too. The value of the Y variable is predicted using the values of the X variables. For example, to determine if a person is eligible for a home loan (yes/no), we need to know various factors (X variables) like monthly income, expenditure, nature of the job, marital status, number of family members, etc… All these X variables are used to create a mathematical equation, or function, that determines the value of Y.
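The home loan example can be sketched as follows. The weights, the bias and the choice of just two features are assumptions made up for illustration; in practice these values would be learned from labeled training data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the 'yes' class: sigmoid of a weighted sum of the X variables."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def classify(features, weights, bias, cutoff=0.5):
    """Turn the probability into a yes/no decision using a cut-off."""
    return predict_probability(features, weights, bias) >= cutoff

# Hypothetical model: features are (monthly income, monthly expenditure), in thousands
weights = (0.08, -0.10)
bias = -1.0

applicant_a = (60.0, 20.0)
applicant_b = (20.0, 30.0)
print(classify(applicant_a, weights, bias), classify(applicant_b, weights, bias))
```

The first applicant gets a probability above the 0.5 cut-off and is classified as eligible, while the second falls below it.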
Decision trees split a huge dataset into smaller subsets that form a tree structure. Along with categorical data, decision trees can also handle numerical data. Each internal node of the tree represents a test on a variable, and each branch represents an outcome of that test. Decision trees are among the most widely used classification algorithms. They are easy to implement, easy to interpret and often reasonably accurate. Here is a simple tree to determine whether a student will take up a particular course or not –
Some common applications of decision trees are in the fields of healthcare, education, manufacturing, engineering, law, and marketing.
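A decision tree of this kind can be written down as nested tests. The sketch below is a hand-built, hypothetical tree for the "will the student take up the course" question; the feature names and the structure are assumptions for illustration and are not learned from data.

```python
# Internal nodes test one feature; leaves hold the predicted class.
course_tree = {
    "feature": "has_prerequisites",
    "branches": {
        False: {"leaf": "no"},
        True: {
            "feature": "fee_affordable",
            "branches": {
                True: {"leaf": "yes"},
                False: {"leaf": "no"},
            },
        },
    },
}

def predict(node, sample):
    """Follow the branch matching each test result until a leaf is reached."""
    while "leaf" not in node:
        node = node["branches"][sample[node["feature"]]]
    return node["leaf"]

student = {"has_prerequisites": True, "fee_affordable": True}
print(predict(course_tree, student))  # yes
```

A learning algorithm would choose the tests and their order automatically from the training data; here they are fixed by hand to show how a prediction walks down the tree.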
Random forest is a combination of many decision trees whose individual outputs are combined (an ensemble). The individual trees should have very low correlation with each other, so that their errors tend to cancel out and the combined prediction is better than that of any single tree. Adding more trees generally makes the result of a random forest model more stable and helps prevent overfitting. The name random comes from the fact that each decision tree is built with some randomness, giving a forest of trees that are largely uncorrelated: each tree is built on a random sample of the data, and each node is split using a random subset of the features. The final output is the mode (for classification) or the mean (for regression) of the individual trees’ predictions.
Note – Since decision trees are very flexible and have no set maximum depth, overfitting can occur. Random forests reduce this problem through random sampling of the data points and features. Each decision tree is trained on a different random sample and a limited set of features, and the trees’ predictions are then averaged (or put to a majority vote) to produce the final prediction of the random forest.
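The sampling-and-voting idea can be sketched with very small trees. In this illustrative sketch each "tree" is just a single split (a stump) on one made-up feature, so only the random sampling of data points is shown, not the random choice of features; the data and function names are assumptions.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a one-split tree: pick the threshold on x with the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        errors = sum((x >= threshold) != label for x, label in rows)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def train_forest(rows, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(train_stump(sample))
    return forest

def forest_predict(forest, x):
    """Classification output: the mode (majority vote) of the individual trees."""
    votes = Counter(x >= threshold for threshold in forest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: the label is True when x is 5 or more
rows = [(x, x >= 5) for x in range(10)]
forest = train_forest(rows)
print(forest_predict(forest, 9), forest_predict(forest, 0))
```

Individual stumps may pick slightly different thresholds because they see different samples, but the majority vote smooths out those differences.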
K-nearest neighbor or KNN requires almost no training time and is easy to interpret, although prediction can be slow on large datasets because every new point has to be compared with the stored training data. The predictive power is as good as other classification models in many cases. KNN keeps all the training data instead of building a model in advance and doesn’t make any assumptions about the data. Thus, it is called a ‘lazy-learning’ and ‘non-parametric’ learning algorithm.
Suppose we have classified all the training data into two categories, A and B. A new data point that comes in now has to be categorized into A or B. KNN uses feature similarity to do this, by finding the nearest neighbors of the new data point. If more of those neighbors belong to category A, the new data point will also be categorized as A; otherwise it will be categorized under B.
Here, category A is in purple and B in yellow. The red triangle is the new data point that needs to be categorized. We can choose how many nearest neighbors (k) to consider; in this case, we are finding the 3 nearest neighbors (k = 3), of which two belong to category A.
So, did you guess where the red triangle belongs?
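The same vote can be worked out in code. The coordinates below are hypothetical points chosen so that, as in the description above, two of the three nearest neighbors belong to category A; they are not taken from the figure.

```python
import math
from collections import Counter

def knn_classify(training, point, k=3):
    """Return the majority label among the k training points closest to `point`."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: ((x, y), category)
training = [
    ((2.0, 2.0), "A"), ((2.0, 3.0), "A"), ((1.0, 1.0), "A"),
    ((4.0, 3.0), "B"), ((6.0, 6.0), "B"), ((7.0, 5.0), "B"),
]
new_point = (3.0, 3.0)

print(knn_classify(training, new_point, k=3))  # A
```

The three nearest neighbors of the new point are two A points and one B point, so the new point is assigned to category A.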
Regression algorithms are widely used to give continuous output, as opposed to the discrete categorized value given by classification algorithms. Regression models are well suited for forecasting, finding cause-effect relationships between variables, trend analysis, drug response modeling, optimization of delivery and other logistics, etc… There are many types of regression algorithms used – linear, logistic, polynomial, ridge, stepwise, lasso and elastic net, with the first three being the most popular ones.
We have already discussed logistic regression, which despite its name is mostly used for classification: it produces a continuous probability value, and applying a cut-off to that value forms a binary classifier.
Let us look briefly at linear and polynomial regression, which are the other most commonly used regression algorithms.
1. Linear regression
Linear regression is the easiest and most widely used algorithm for finding a best guess from the available data set. In this method, there are one or more independent variables (consider X) and a dependent variable (say Y), and we determine the linear relationship between X and Y by plotting a regression line, which is nothing but the best-fit line: the line that passes as close as possible to the data points overall. Linear regression with one independent variable is called simple linear regression, and with more than one independent variable it is called multiple linear regression. The regression line is represented using a linear equation such as –
Y = a + X*b + error
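where b is the slope of the straight line and a is the intercept. The values of a and b are found using the least squares method, which chooses them so that the sum of the squared errors is as small as possible. Plotted on a graph, a simple linear equation of this form is a straight line.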
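For a single independent variable, the least squares values of a and b have a simple closed form, sketched below. The data points are hypothetical and were generated from Y = 1 + 2X with no error term, so the fit should recover those values.

```python
def fit_line(xs, ys):
    """Least squares estimates of intercept a and slope b for Y = a + X*b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data lying exactly on Y = 1 + 2X
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
print(a, b)  # 1.0 2.0
```

With noisy real-world data the points would not lie exactly on the line, and a and b would be the values that make the squared errors as small as possible.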
2. Polynomial regression
When the relationship between X and Y is not linear, the best fit need not be a straight line. It can be a polynomial equation representing a curved line. Thus, the regression line or best-fit line is a curve that fits the data points as closely as possible. With curved lines, one has to be careful about overfitting or underfitting and choose the right degree for the curve. A polynomial equation can be represented as –
Y = a + b1*X + b2*X^2 + … + bn*X^n + error
where n is the degree of the polynomial, i.e. the highest power of the independent variable X, and b1 to bn are the coefficients of the individual terms.
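A polynomial fit is still a least squares problem, only with the powers of X acting as extra inputs. The sketch below fits a polynomial of degree n in a single variable X, i.e. a + b1*X + … + bn*X^n, by solving the normal equations with a small hand-written solver. The helper names and the sample data (generated from Y = 1 + 2X + 3X^2 with no error term) are assumptions for illustration.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit_polynomial(xs, ys, degree):
    """Least squares coefficients [a, b1, ..., bn], lowest power first."""
    size = degree + 1
    normal = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve(normal, rhs)

# Hypothetical data lying exactly on Y = 1 + 2X + 3X^2
xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

coeffs = fit_polynomial(xs, ys, degree=2)
print([round(c, 6) for c in coeffs])  # [1.0, 2.0, 3.0]
```

Choosing a degree that is too high lets the curve chase noise in the data (overfitting), while a degree that is too low misses the real shape (underfitting).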
In this article, we have taken a quick look at supervised learning and the various algorithms that are used to perform it. Here are quick reference points to summarise what we have learned about supervised learning –
- In supervised learning, a machine learns through a set of data that has defined input and output.
- With the necessary training, the machine learns and improves itself. This learning is applied during the testing stage to determine the strength and performance of the machine learning algorithm.
- There are two types of supervised learning algorithms – classification and regression
- Many classification algorithms, such as decision trees, random forests and KNN, can be used for regression too.
- In classification, data points are categorized or sorted into various categories. For example, a classifier can predict which age group of customers is more likely to purchase a particular item, such as a hearing aid, the Alexa device, a tablet or an iPod.
- Regression algorithms produce a continuous numerical value (and not a category), typically stored as an integer, double or float. Examples include predicting a home loan amount, estimating the age of a person, forecasting the sales value for the next 5 years, etc…
- Classification algorithms solve the problem of yes/no, for example, whether sales will increase or decrease in the next few quarters, whereas regression algorithms give a specific output, for example, the estimated value of sales in the next few quarters.