There are many types of machine learning algorithms. Among the most important are supervised learning and unsupervised learning. Here, we are going to discuss supervised learning.
What is Supervised Learning?
With machines that are constantly learning new things and processing more data than ever, we are able to solve business problems quickly and accurately. A machine that can use its “brain” to make decisions and solve problems is said to be an “artificially intelligent” system. Such a system can be trained to become better through learning, just like a human brain improves itself by observing similar patterns and practicing with similar kinds of data.
The process where a machine can be trained and modeled with or without any human intervention is called “Machine learning.”
Types of Machine Learning
Machine learning is of various types, such as:
- Supervised learning
- Semi-supervised learning
- Unsupervised learning
- Reinforcement learning
Naturally, the simplest of the lot is supervised learning. Think of it as similar to how you would teach a concept to a child.
A young child at a tender age wouldn't know different colors or shapes. Suppose you teach her about pink, blue, red, black, white, and other colors. You already know the colors (both input and output), and you are essentially teaching her with the knowledge (data and information) that you already have.
By showing the child objects of various colors and asking her to make the correct choice, you allow her to learn color recognition. For example, you ask her to pick a yellow fruit from a basket of fruits, or show her a leaf and ask about its color. The child will gradually learn how to recognize and differentiate objects by their colors.
If the child makes mistakes, you simply correct them. For a slightly older child, you would probably want to teach counting and adding numbers. So, you give her balls, cups, or spoons to count, and once she counts right, you show her examples of how one spoon and one more spoon become two spoons. This goes on until the child can do the same herself with any combination of numbers.
Supervised learning is based on the same principle as that of these examples, only that you are teaching the concepts to a computer!
In this type of machine learning, the training dataset is fed to a learning system, and once the machine is trained, it predicts outcomes on new datasets based on its previous learning experience. Some of the common applications of supervised learning are:
- Movie recommendation systems like Amazon Prime and Netflix.
- Automatic classification of emails into various categories like spam, important, and social.
- Traffic updates on the basis of weather conditions, route congestion, etc.
- Determining the age or salary of a person based on various factors.
Technically, we can represent a supervised learning outcome in an input-output format as:
Y = function(X)
Here, X is the input, and Y is the output determined by applying the mapping algorithm or function created during the training phase. Some important features of supervised learning are:
- It works on a labeled dataset, i.e., data with a defined set of input and output parameters.
- The machine learns from past experiences that are nothing but the data fed to it.
- The performance of the machine improves with more data (i.e., more experience).
- Usually, 80% of the data is used for training the model (learning), while the remaining 20% is for validating or testing the model.
- During the testing phase, the output data generated by the machine is compared with the actual output results to determine the accuracy.
- Based on the accuracy, the model is tweaked and re-validated if required.
- The final model built through the examples and learning is called a predictive model.
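The split-and-evaluate workflow described in the points above can be sketched in a few lines of Python. The dataset, the 80/20 ratio, and the function names here are illustrative, not part of any particular library:

```python
import random

def train_test_split(data, train_frac=0.8, seed=42):
    """Shuffle the dataset and split it into training and test portions."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual output labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical labeled dataset: (input, output) pairs.
dataset = [(x, x % 2) for x in range(100)]
train, test = train_test_split(dataset)
print(len(train), len(test))  # 80 20
```

During testing, `accuracy` compares the model's outputs on the held-out 20% against the known labels, and the model is tweaked if the score is too low.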
Types of Supervised Learning
Supervised learning can be of two types:
- Classification
- Regression
Each of these types encompasses many algorithms. The following sections detail the two types of supervised learning and the most common algorithms of each.
1. Classification
Classification divides a set of information into categories. For example, in our color example before, we can classify colored objects into different categories: pink, blue, green, and so on. If we have different fruits, we can classify and separate them on the basis of color, such as yellow fruits, red fruits, and green fruits.
The most relevant example of classification we see daily is how your email service classifies incoming mail into different categories: inbox, important, spam, social, promotions, and more.
Similarly, determining whether a person will purchase a particular item is a classification task, specifically a binary classification problem, as there are only two choices: yes or no!
The training dataset contains all kinds of data that help the system create a mapping function, Y = function(X). The result, Y, of a classification task is a set of discrete values. There are many applications of classification algorithms other than spam filtering.
Examples include determining a credit score, identifying handwriting, face and speech recognition, medical imaging, detection of pedestrians using an automated car driving system, sentiment analysis, classification of drugs, and document classification.
i. Support Vector Machines
Suppose there are n features that help to classify data points into categories. The SVM algorithm finds a hyperplane in n-dimensional space (a decision boundary) that separates the data points of one class from those of the other classes. Among all such boundaries, SVM chooses the one with the maximum margin, i.e., the greatest distance to the nearest data points of each class, which helps it classify new data points more reliably.
For example, let us say we want to accept interview applications of applicants with scores greater than 65% and less than 25 years of age. Based on these two features, i.e., marks (x) and age (y), we have a pair of coordinates (x, y) that give output in the form of either a red circle or a green triangle. Since the data points are scattered in our case, we have a non-linear (in this case, a circular) hyperplane.
On a normal plane, this information looks messy and scattered, and no straight line can separate the two classes. The circular decision boundary separates the items while keeping the maximum margin from the points closest to it (the support vectors). To make computations easier in an n-dimensional space with more features, it is a good idea to use a kernel function.
A kernel function reduces the complexity of the computation: instead of explicitly mapping the points into the higher-dimensional space, it replaces the dot product with one computed directly in the original space (the "kernel trick"). Training an SVM, however, requires extensive computation, which makes it costly for huge datasets with many features.
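The kernel trick can be demonstrated numerically. The sketch below, in plain Python with a made-up pair of 2-D points, shows that a degree-2 polynomial kernel gives the same value as an explicit mapping into the higher-dimensional feature space, with far less work:

```python
import math

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2-D point."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y, degree=2):
    """Kernel trick: (x . y + 1)^d, computed without mapping to the new space."""
    return (dot(x, y) + 1.0) ** degree

x, y = [3.0, 1.0], [2.0, 4.0]
explicit = dot(phi(x), phi(y))   # dot product in the 6-D feature space
via_kernel = poly_kernel(x, y)   # same value from the original 2-D space
print(explicit, via_kernel)      # both ≈ 121.0
```

For higher degrees and dimensions the explicit feature space grows explosively, while the kernel's cost stays that of one dot product, which is exactly why kernels matter for SVMs.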
ii. Linear Classifiers
A linear classifier achieves classification (grouping based on categories) by making its decision from a linear combination of the feature values, treating each feature's contribution as independent of the others. Linear classifiers are simple and ideal for working with huge datasets. We can divide linear classifiers into generative and discriminative models.
The most common generative model is the Naïve Bayes classifier, whereas the most common discriminative model is the logistic regression model. Interestingly, an SVM with a linear kernel also qualifies as a linear classifier; when the classes are not linearly separable, however, a non-linear kernel is required.
a. Naïve Bayes Classifier
The Bayes classifier is based on the Bayes theorem that determines the probability of an event occurring on the basis of various features that are considered independent of each other. Each feature contributes equally to the final outcome.
For example, determining whether a cricket match will be played on a particular day depends on factors like whether it is a holiday, the temperature, wind conditions, and humidity. Each of these factors is independent of the others and contributes equally to the outcome (match = yes or no).
Mathematically, the Bayes theorem calculates the probability of the occurrence of an event A, given that an event B has already occurred (the evidence), as:
P(A|B) = (P(B|A) * P(A)) / P(B)
Here, B is the event that has already happened, and P(A|B) is the probability of event A's occurrence considering that event B has occurred.
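A minimal sketch of this formula in Python. The probabilities below are invented for the cricket-match example, purely for illustration:

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: a match is played on 40% of days (P(A)),
# it is windy on 25% of days (P(B)), and it is windy on 10% of
# the days on which a match is played (P(B|A)).
p_match_given_windy = bayes(p_b_given_a=0.10, p_a=0.40, p_b=0.25)
print(p_match_given_windy)  # ≈ 0.16
```

The Naïve Bayes classifier multiplies such per-feature probabilities together, which is where the independence assumption comes in.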
b. Logistic Regression
Logistic regression itself is not a classifier but provides a means to create one. It is extensively used whenever the dependent variable is a simple yes/no, 1/0, true/false, or success/failure, i.e., a binary (technically called a dichotomous) variable. Apart from the dependent variable (let us say Y), there are explanatory variables (consider them as X) too.
The value of the Y variable is predicted using the values of X variables. For example, to determine if a person is eligible for a home loan (yes/no), we need to know various factors (X variables) like monthly income, expenditure, nature of the job, marital status, and the number of family members. All these X variables help to create a mathematical equation or the function that determines the value of Y.
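A minimal sketch of such a classifier in Python. The sigmoid function squashes the linear combination of the X variables into a probability, and a threshold turns that probability into a yes/no answer. The weights, bias, and input values below are invented for illustration; a real model would learn them from labeled data:

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) probability range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_eligibility(income, expenditure, weights, bias):
    """Hypothetical logistic model: loan-approval probability from two X variables."""
    z = weights[0] * income + weights[1] * expenditure + bias
    probability = sigmoid(z)
    return probability, probability >= 0.5  # threshold makes it a binary classifier

# Illustrative parameters only, not learned from real data.
prob, approved = predict_eligibility(income=6.0, expenditure=2.0,
                                     weights=[0.8, -1.1], bias=-1.5)
print(round(prob, 3), approved)
```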
iii. Decision Trees
Decision trees split a huge dataset into smaller subsets that form a tree structure. Along with categorical data, decision trees can also handle numerical data. Each node of the tree represents a test on the variable, and the branch represents the resultant outcome of the test. Since they are easy to implement, neat and accurate, decision trees are the most widely used classification algorithms.
Some common applications of decision trees are in the fields of healthcare, education, manufacturing, engineering, law, and marketing.
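A decision tree is, in effect, a set of nested if/else tests on the variables. Here is a hand-written sketch for the cricket-match example from earlier; every rule and threshold below is made up for illustration, whereas a real tree would be learned from data:

```python
def play_match(is_holiday, temperature_c, wind_kmh, humidity_pct):
    """A tiny hand-written decision tree for the cricket-match example.
    Each `if` is a node testing one variable; the returned strings are leaves."""
    if not is_holiday:
        return "no"   # assumed rule: matches are only scheduled on holidays
    if wind_kmh > 30:
        return "no"   # too windy to play
    if humidity_pct > 80:
        return "no"   # too humid
    return "yes" if 15 <= temperature_c <= 35 else "no"

print(play_match(True, 25, 10, 50))  # yes
```

Numerical thresholds like `wind_kmh > 30` show how decision trees handle numerical data alongside categorical data such as `is_holiday`.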
iv. Random Forest
Random forest is a combination (an ensemble) of various decision trees. The individual trees have very low correlation with each other, which yields better predictions than any single tree. Adding more trees makes the result more accurate and also helps prevent overfitting. The name "random" comes from the randomness built into each tree: each one is trained on a random sample of the data points and considers only a random subset of the features when splitting its nodes.
The final output is the mode (majority vote) of the individual trees' predictions for classification, or their mean for regression.
Note: Since decision trees are very flexible and have no set maximum depth, a problem of overfitting can occur. Random forests solve this problem through random sampling: each decision tree is trained on a limited and different subset of the data and features, and the trees' predictions are then combined to produce the final prediction of the random forest.
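The aggregation step can be sketched in a few lines; the per-tree predictions below are invented for illustration:

```python
from statistics import mean, mode

# Hypothetical predictions from five independently trained decision trees.
class_votes = ["spam", "spam", "ham", "spam", "ham"]   # classification forest
value_preds = [310.0, 290.0, 305.0, 298.0, 302.0]      # regression forest

forest_class = mode(class_votes)   # majority vote for classification
forest_value = mean(value_preds)   # average for regression
print(forest_class, forest_value)  # spam 301.0
```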
v. K-Nearest Neighbor
K-nearest neighbor, or KNN, requires almost no training time and is easy to interpret; the real computation happens at prediction time. Its predictive power is as good as that of other classification models in most cases. KNN keeps all the data for training purposes and doesn't make any assumptions about the data. Thus, it is called a "lazy-learning" and "non-parametric" learning algorithm.
Suppose we classify all the training data into two categories: A and B. A new data point that comes in must be put into either A or B. KNN uses feature similarity to do this, by finding the nearest neighbors of the new data point. If more of those neighbors belong to category A, the new data point is also assigned to A; otherwise, it is categorized under B.
Imagine category A plotted in purple and category B in yellow, with a red triangle marking the new data point that requires categorization. We can use as many nearest neighbors as we like to improve accuracy. Here, we find the k = 3 nearest neighbors, of which two belong to category A; since the majority of the neighbors belong to A, the red triangle is assigned to category A.
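A minimal KNN classifier fits in a dozen lines of plain Python. The 2-D training points below are made up; `math.dist` computes the Euclidean distance between two points:

```python
import math
from collections import Counter

def knn_predict(training_data, new_point, k=3):
    """Classify new_point by majority vote among its k nearest labeled neighbors."""
    nearest = sorted(
        training_data,
        key=lambda item: math.dist(item[0], new_point),  # Euclidean distance
    )[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical 2-D points labeled A or B.
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((6, 6), "B"), ((7, 7), "B"), ((6, 7), "B")]
print(knn_predict(training, (2, 2)))  # A
```

Note that all the work (sorting by distance) happens at prediction time, which is exactly what "lazy learning" means.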
2. Regression
Regression algorithms give a continuous output, contrary to the discrete, categorized values given by classification algorithms. Regression models are much more suitable for forecasting, finding cause-effect relationships between variables, trend analysis, drug response modeling, delivery optimization, and other logistics problems.
Regression algorithms are of several types, namely linear, logistic, polynomial, ridge, stepwise, lasso, and elastic net. Of these, the first three are the most popular. We discussed logistic regression above; it suits both classification and regression, since its continuous output can be cut off at a threshold to form a binary classifier.
Let us look at linear and polynomial regression, the other two most popular regression algorithms, below:
i. Linear Regression
Linear regression is the easiest and most popular algorithm for modeling a linear relationship in the available dataset. In this method, there are one or more independent variables (consider X) and a dependent variable (say Y).
Here, we determine the linear relationship between X and Y by plotting a regression line, which is nothing but the best-fit crossing or nearing the maximum number of data points. Multiple linear regression is a linear regression plot with more than one independent variable.
The following linear equation represents the regression line:
Y = a + b*X + error
Here, b is the slope of the straight line, and a is the intercept. To find the values of a and b, we use the Least Square Method (LSM).
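The LSM formulas can be sketched directly in Python. The sample points below are invented and lie exactly on Y = 1 + 2X, so the fit should recover a = 1 and b = 2:

```python
def least_squares(xs, ys):
    """Least Square Method for a single X variable:
    b = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    a = y_mean - b * x_mean
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    a = y_mean - b * x_mean
    return a, b

# Points generated from Y = 1 + 2X with no noise, for clarity.
a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```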
ii. Polynomial Regression
The relationship between X and Y need not always be a straight line. When the data follows a curve, a polynomial equation represents it better, so the regression line, or best-fit line, in this case is a curve that fits the maximum number of data points.
With curved lines, one has to be careful about over- or underfitting and plot the right curve. The following equation represents a typical polynomial model:
Y = a + b1*X + b2*X^2 + ... + bn*X^n + error
Here, n is the degree of the polynomial; the higher the degree, the more flexible the curve.
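A quick numerical sketch of why a curve can fit curved data better than a line. The data points and both sets of coefficients below are invented; a real fit would learn the coefficients, e.g., via least squares on the expanded polynomial features:

```python
def poly_predict(coeffs, x):
    """Evaluate Y = a + b1*X + b2*X^2 + ... for coefficients [a, b1, b2, ...]."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Data generated from the curve Y = 1 + 2X^2 (no noise, for clarity).
xs = [0, 1, 2, 3]
ys = [1, 3, 9, 19]

line = [0.0, 6.0]        # a straight-line guess: Y = 6X
curve = [1.0, 0.0, 2.0]  # a degree-2 model:      Y = 1 + 2X^2

def sse(coeffs):
    """Sum of squared errors of the model on the data."""
    return sum((poly_predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys))

print(sse(line), sse(curve))  # the curve fits this curved data far better
```

Comparing such error sums on held-out data, rather than on the training points, is what guards against overfitting a too-flexible curve.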
In this article, we discussed supervised learning and also the various algorithms to perform the same. Here are some brief points to quickly summarize what we learned about supervised learning:
- In supervised learning, a machine learns through a set of data that has defined input and output.
- With the necessary training, the machine learns and improves itself. This learning is applied during the testing stage to determine the strength and performance of the machine learning algorithm.
- There are two types of supervised learning algorithms: classification and regression.
- Most of the classification algorithms can be used for regression too.
- In classification, data points are grouped into various categories. For example, it can predict what age group of customers is more likely to purchase a particular set of items, such as a hearing aid, an Alexa device, a tablet, or an iPod.
- Regression algorithms produce a continuous numerical value (and not a category), typically an integer or floating-point number. Examples include predicting home loan amounts, the age of a person, and forecasting sales values for the next 5 years.
- Classification algorithms solve the problem of yes or no. For example, whether sales will increase or decrease in the next few quarters. Regression algorithms, on the other hand, give a specific output. For example, the estimated value of sales in the next few quarters.