Machine learning is a process by which machines learn to perform specific tasks without much human intervention. A machine can learn in various ways. It can be trained on a known data set in which every input comes with its expected output; from these pairs, the machine derives a general function that can then be applied to new, unseen data. Because the correct outputs are supplied by an external source, this mechanism is called supervised learning. In the more complex real world, labelled outputs are not always available. The machine then has to find patterns and structure in the data on its own. This is called unsupervised learning. An even more dynamic form of machine learning is where an agent learns by experience, taking actions and adjusting them to maximize a cumulative reward. This method, known as reinforcement learning, forms the basis for today’s game-playing programs and robotics.
Simple day-to-day examples
Before we dive into details of algorithms, let me give you a few simple examples of how we are dependent on machine learning knowingly or unknowingly.
- Suppose you have loads of photos on Google and you want to search for photos of a particular person. Google shows you an option ‘People’ where it has sorted the photos based on face recognition. Google also asks you whether a person X is the same or different based on different photos of them taken at different angles. Based on this information, Google can improve its results every time.
- When you ask a question to voice assistants like Alexa/Cortana/Siri, they can talk to you like a human. They can understand and process human speech, search the web for the information that you need and present it to you in a human voice. This is possible through speech recognition.
- You want to watch a new series, and Netflix can recommend one! How? It can analyse your previous watching patterns, or the series/movies watched by other users who have similar taste to yours, and then recommend what you might like. This gives you a personalised experience and saves browsing time!
- You open your email to find personalised marketing content based on your browsing patterns. For example, if I search for headphones on Flipkart and just close the browser, a little later Flipkart can send me an email offering discounts on the same and other similar headphones!
- You want to play a game of chess, but alas, no company! Well, no more worries, as the computer (machine) can play with you and dynamically choose its moves based on yours. You can simply fiddle around just to check the intelligence of the machine!
- During the COVID-19 pandemic situation, doctors were fully occupied in treating patients and handling emergencies. In such cases, chatbots were able to handle the queries of patients, based on their symptoms. These self-triage systems can advise a patient whether they need to call a doctor, visit the doctor or remain in home isolation.
These are only a few of the many machine learning examples, and we could keep going on and on! But before that, can you recognise which type of learning each of the above examples falls under? Don’t worry if you can’t answer this question right away; we will revisit this section after reading the article!
Types of Machine Learning Algorithms
As we have learnt above, there are three main types of learning algorithms – supervised, unsupervised and reinforcement algorithms.
1. Supervised learning algorithms
These are the algorithms where the machine is fed with input-output pairs, and the machine finds a function f that fits those pairs. The function is derived using a labelled set of data. It is then used to analyze a new set of data for which the output is not known. In supervised learning, the function f maps the input X to the output Y as,
Y = f(X)
There are two approaches to find the function f. The first is empirical risk minimization, where the model picks the function f that best fits the training data, i.e. the one that minimizes the average loss (the empirical risk R) on that data. The other is structural risk minimization, where a penalty term is added to control the bias-variance trade-off, so that the chosen model does not simply memorize the training data.
Training stops once the desired accuracy is achieved. Supervised algorithms are of two types:
- Classification: These algorithms classify data into different categories based on a set of features. For example, sorting of fruits based on colour, sorting of text based on specific keywords, sorting of emails based on their content – important, spam, social, sports etc.
- Regression: These algorithms model the relationship between a dependent variable and one or more independent variables in order to predict a continuous value. For example, obesity (dependent variable) depends on various factors like overeating, lack of exercise etc. (independent variables). Regression techniques are extensively used for sales forecasting and time-series modelling, i.e. wherever we can fit a line or curve that minimizes the distance between the data points and the curve. The three main factors that determine the type of regression are the number of independent variables, the shape of the regression curve (linear, S-shape, etc.) and the type of dependent variable. In our obesity example, we could assume a linear relationship between obesity and eating habits.
2. Unsupervised learning algorithms
In this type of learning, the data comes without labels, so the expected output is not known. The machine learns on its own and identifies similar patterns and behaviour to build a model. Unsupervised learning is a bit more complicated and unpredictable than supervised learning because there is no known correct answer to check the results against. Real-world datasets usually don’t come with labels, which often makes unsupervised methods the more practical choice. The main types of unsupervised learning algorithms are clustering, association rules and dimensionality reduction.
- Clustering: In this type, data is sorted into groups (clusters) based on their similarities. The similarities can be in anything like color, shape, size, etc. For example, people who like to order pizzas over weekends can be clustered together to send promotional emails over the weekend.
- Association rules: Association rule learning finds patterns and links in data and builds rules to create a model. For example, a person who buys bread would most likely buy milk, eggs or butter; thus, those items can be shown together in a cart, for a user to add them easily.
- Dimensionality Reduction: If there are lots of features in a dataset, it becomes difficult to train the model. Thus, it is crucial to find the essential features that can yield high accuracy. Redundant features that are very similar to others can be eliminated as they would not give out any useful extra information, making the number of features less than in the original data set. This makes it easier to analyse the data and predict outcomes.
3. Reinforcement learning algorithms
Reinforcement learning is an approach in which an agent takes actions in a specific environment so as to maximize its cumulative reward. When it is combined with neural networks, it is known as deep reinforcement learning. For example, consider teaching a child to write the number ‘8’. Since children have a short attention span, we have to continually remind the child of the reward she will get for doing her best. Once the child writes ‘8’, we reward her with a new set of color pencils. If she doesn’t finish, we take away her share of apple pie, and if she finishes fast, we give her both the pencil set and chocolate. Reinforcement can be positive or negative, and there are three approaches to it: value-based, model-based and policy-based.
- Value-based: In a value-based method, the agent tries to maximize a value function, i.e. the long-term return it expects from each state
- Model-based: Here a virtual model is created for the environment, and the agent learns to perform in that particular environment
- Policy-based: In this approach, a policy is learned directly so that the action chosen in every state maximizes the rewards
The decision making in reinforcement learning is sequential, and there are no supervisors involved. Time plays a pivotal role in the decision-making process.
Top Machine Learning Algorithms
Now that we have understood the different types of learning algorithms, we will learn about a few specific algorithms of each type. There are hundreds of machine learning algorithms, but you will find the ones below most useful and easy to understand.
1. Naïve Bayes Classifier
It is a probabilistic supervised (classification) algorithm based on Bayes’ theorem. It is a highly scalable model that computes the probability of each class from a set of features, under the ‘naïve’ assumption that the features are independent of each other given the class. For example, how about determining the probability of a person purchasing a matching pair of earrings, knowing that they purchased a necklace and a green-coloured dress? The observed events (purchasing a necklace and a green-coloured dress) are treated as independent pieces of evidence that help determine the probability of buying earrings. In general,
P(A|B) = (P(B|A)*P(A))/P(B)
where, P(A|B) = Posterior probability of event A occurring given B occurred
P(B|A) = probability that B occurs given that A occurred
P(B) and P(A) = individual probabilities of occurrence of B and A
Some critical applications of Naïve Bayes algorithm are in data mining like text processing, text classification, removing insignificant words from the text, finding keywords in an email, classifying email as spam etc.
Naïve Bayes classifier can be of different types – Gaussian classifier, multinomial classifier, Bernoulli’s classifier. It is a fast method for classification and easy to train as it requires less data for training.
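To make the formula concrete, here is a minimal sketch in plain Python. The spam/ham classes, the words ‘offer’ and ‘free’ and all the probabilities are made-up numbers for illustration only; the function names are hypothetical and not taken from any library.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b


def naive_bayes_posteriors(priors, likelihoods, observed):
    """Posterior of each class, assuming the observed features are
    independent of each other given the class (the 'naive' assumption)."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())  # plays the role of P(B), the evidence
    return {label: score / total for label, score in scores.items()}


# Made-up numbers: 20% of emails are spam; P(word | class) per class.
priors = {"spam": 0.2, "ham": 0.8}
likelihoods = {
    "spam": {"offer": 0.5, "free": 0.4},
    "ham": {"offer": 0.05, "free": 0.1},
}

single = posterior(prior_a=0.2, likelihood_b_given_a=0.5, evidence_b=0.14)
combined = naive_bayes_posteriors(priors, likelihoods, ["offer", "free"])
print(round(single, 3), round(combined["spam"], 3))
```

The first call applies the formula to a single piece of evidence; the second multiplies the per-word likelihoods, which is only valid under the independence assumption.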
2. Support Vector Machine
In this supervised learning algorithm, each data item is plotted in an n-dimensional space, where n represents the number of features. So, if we have three features, we will have three coordinate axes. Data points of different categories are then separated by a hyper-plane (a line in two dimensions, a plane in three, and so on). SVM looks for the hyper-plane with the largest margin, i.e. the one that lies farthest from the nearest points of each category; those nearest points are the support vectors. Determining this hyper-plane is the essence of SVM. If the data cannot be separated linearly, SVM uses a kernel function that implicitly maps the low-dimensional input space into a higher-dimensional space where a separating hyper-plane exists. This is called the kernel trick. Thus, SVM can be used for both linearly and nonlinearly separable data.
Did you already guess that it is a classification algorithm? Well, it can be used for regression too. The principles for SVR (Support Vector Regression) are the same but used for regression, i.e. finding a function that gives a real number for output instead of classifying objects.
Some examples where SVM is used are face detection, image classification, handwriting detection, classification of genes, etc.
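The idea behind moving to a higher dimension can be sketched in a few lines of Python. This is not an SVM trainer: the feature map is built explicitly (a real kernel computes the same effect implicitly), and the weights of the separating hyper-plane are hand-picked for this toy data rather than learned.

```python
def feature_map(x):
    """Lift a 1-D point into 2-D by adding x squared as a second coordinate."""
    return (x, x * x)


def linear_classifier(point, w=(0.0, 1.0), b=-4.0):
    """Sign of w . point + b; w and b are hand-picked for illustration."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score > 0 else -1


# Toy data: the outer points are class +1, the inner points are class -1.
# No single threshold on x separates them, so they are not linearly
# separable in one dimension.
xs = [-3.0, -2.5, -1.0, 0.0, 1.0, 2.5, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

predictions = [linear_classifier(feature_map(x)) for x in xs]
print(predictions == labels)
```

After the mapping, the straight line x² = 4 separates the two classes, which is exactly what a linear separator in the lifted space looks like.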
3. Decision Trees
These supervised learning algorithms predict response values using decision rules derived from the features. Decision trees can be used for classification and regression. For example, based on a customer’s lifestyle choices, occupation and spending preferences, we can predict what income range they would fit in. Decision trees can be categorised based on the type of target variable as a categorical variable decision tree or continuous variable decision tree. There are many algorithms for building decision trees for classification and regression, like ID3, CART, C4.5 and C5.0.
Decision trees in machine learning are easy to understand and implement and are widely used in determining the best path to reach a goal, making strategic business decisions, etc. Decision trees are also combined into ensembles through boosting or bagging, which can help where there is an unbalanced class distribution.
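As a rough illustration of how one decision rule can be derived from a feature, the sketch below picks the best threshold on a single numeric feature using Gini impurity, the criterion used by CART. The spending figures and income labels are invented, and a full tree would repeat this search recursively on each side of the split.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Try midpoints between sorted distinct values and return the
    (threshold, weighted impurity) pair with the lowest impurity."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Invented data: monthly spending (in thousands) and income range.
spending = [10, 12, 15, 30, 35, 40]
income = ["low", "low", "low", "high", "high", "high"]

threshold, impurity = best_split(spending, income)
print(threshold, impurity)
```

The resulting rule reads as "if spending is at most the threshold, predict low income, otherwise high".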
4. K-Means Clustering
K-means clustering is an unsupervised learning algorithm where data is divided into homogeneous clusters (groups) based on similarities in their patterns. K-means aims to minimize the distance between the data points in a cluster and that cluster’s centroid.
The number of clusters is decided at the beginning itself and is denoted by the letter ‘K’. A good value of ‘K’ can be estimated using the ‘elbow’ method. Initially, K random points are selected from the data points as the centroids. Then every point is assigned to its closest centroid. Once all the points are assigned to clusters, each centroid is recomputed as the mean of the points in its cluster. This process is repeated until the assignments stop changing (or a maximum number of iterations is reached). K-means is computationally fast and often produces tighter clusters than hierarchical clustering algorithms.
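The assign-and-recompute loop can be sketched in plain Python as below. The points, the seed and the function name are assumptions made for this example; it does not include the elbow method, so K is simply given.

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0):
    """Plain k-means: returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random data points as initial centroids
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we are done
            break
        centroids = new_centroids
    return centroids, clusters


# Two visibly separate groups of made-up 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

On data this well separated the result does not depend on the random start; on real data it can, which is why k-means is often run several times with different seeds.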
Some applications of K-means clustering are in market research (analyzing which group of people prefer what kind of products and services), image processing, pattern recognition, data mining and predictive analysis.
5. Linear regression
Linear regression helps determine the extent of the linear relationship of a dependent variable with one or more independent variables. It is one of the simplest forms of the supervised learning regression model and can be represented using the equation,
Y = a + bX
where Y is the target (or outcome), X is the predictor (i.e. the input), a is the y-intercept (a constant), and b is the slope of the line, also called the regression coefficient. A simple example of linear regression is the relationship between the age of a person and the risk of them catching infections: as age increases, the risk of getting an infection increases. In the same way, if a person’s weight rises steadily with how much they eat, plotting one against the other gives a roughly straight line.
It is also possible for linear regression to have more than one explanatory variable to predict the outcome of a response variable. This type of regression is known as multiple regression or multiple linear regression (MLR). Some other forms of linear regression are general linear models, generalized linear models, heteroscedastic models, hierarchical linear models etc.
Linear regression models are usually fitted using the least-squares method, which picks the line that minimizes the sum of squared vertical distances between the data points and the line. Linear regression is used for forecasting an effect, forecasting trends, determining the strength of predictors, demand analysis etc.
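For a single predictor, the least-squares values of a and b in Y = a + bX have a simple closed form, sketched below in plain Python. The data is made up so that it lies exactly on the line Y = 1 + 2X.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up data generated from Y = 1 + 2X

a, b = fit_line(xs, ys)
print(a, b)
```

With noisy data the points would no longer sit on the line, and a and b would be the values that make the squared errors as small as possible.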
6. Logistic regression
Logistic regression is a statistical model that uses a logistic function (also known as the sigmoid function) to model binary variables, giving the outcomes as yes/no, true/false, pass/fail and so on. There are only two possible values.
In other words, it is a supervised learning method for predicting the probability of a categorical dependent variable, which is why it is listed under classification algorithms in some places. The dependent variable is binary, i.e. it can take only two possible values, 0 or 1. Logistic regression can be represented as:
p = 1/(1 + e^(-(b0 + b1x)))
The one question that arises in everyone’s mind is that if the algorithm is used for classification, why do we call it logistic regression and not logistic classification?
That’s because it classifies the data only when combined with a decision rule. But it predicts probability value which is a continuous value ranging from 0 to 1. Since it has something to do with a continuous real value (and not category), it is called regression, but used for classification.
Logistic regression is closely related to linear regression: the same linear combination of inputs is passed through the sigmoid function so that the outcome can be treated as categorical. Logistic regression can be used in the field of medicine for detection and diagnosis of diseases and in predictive analysis for likelihood indication, for example, whether a person will buy a certain product or whether a student will pass their exams.
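The split between "predict a probability" and "apply a decision rule" can be shown in a short Python sketch. The coefficients b0 and b1 below are hypothetical values chosen by hand (think of x as hours studied and the outcome as pass/fail); in practice they would be learned from data.

```python
import math


def predict_probability(x, b0, b1):
    """Logistic (sigmoid) function applied to the linear term b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def classify(x, b0, b1, threshold=0.5):
    """Decision rule: turn the continuous probability into a 0/1 class."""
    return 1 if predict_probability(x, b0, b1) >= threshold else 0


b0, b1 = -4.0, 1.0  # hypothetical coefficients, not fitted to any data

for hours in (2.0, 4.0, 6.0):
    p = predict_probability(hours, b0, b1)
    print(hours, round(p, 3), classify(hours, b0, b1))
```

The regression part produces a value between 0 and 1; only the threshold in classify turns it into a category.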
7. Apriori Algorithm
Apriori is an unsupervised learning algorithm used for data mining, i.e. mining frequently purchased items and the association rules between them. The algorithm is applied to a database with a massive number of transactions to find similarities in buying patterns. Through this algorithm, many companies were able to increase their sales, offer effective discounts and make purchases easy for the customers. The example of bread-eggs, onion-potatoes and beer-diaper has been used time and again everywhere to illustrate the concept. The idea is that the customers who buy bread will most likely buy eggs as well. Thus, these two items can be bundled together so that buyers can easily pick them, and similarly for other combinations. The three important components of this algorithm are:
- Support – the default popularity of an item. For example, if in 100 transactions, 80 included bread, the support = transactions involving bread/total transactions = 80/100 or 80%
- Confidence – it is the likelihood that customers who bought bread also bought eggs. Suppose, out of the 80 who bought bread, 60 bought eggs also. So, confidence = transactions involving both bread and eggs/transactions involving bread = 60/80 or 75%
- Lift – it is how much more likely eggs are to be bought when bread is bought, compared with eggs being bought in general. Lift = Confidence(bread → eggs)/Support(eggs). A lift greater than 1 means customers who buy bread are ‘lift’ times more likely to buy eggs than customers in general; a lift of 1 means the two items are bought independently of each other.
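The three measures can be computed directly from a list of transactions, as in the Python sketch below. The 100 transactions are made up so that 80 contain bread and 60 contain both bread and eggs; the count of 70 transactions containing eggs is an extra assumption needed to compute lift. This covers only the measures, not the candidate generation and pruning steps of the full algorithm.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions with the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Made-up basket data: 60 bread+eggs, 20 bread only, 10 eggs only, 10 other.
transactions = (
    [{"bread", "eggs"}] * 60 + [{"bread"}] * 20 + [{"eggs"}] * 10 + [{"milk"}] * 10
)

print(support(transactions, {"bread"}))
print(confidence(transactions, {"bread"}, {"eggs"}))
print(lift(transactions, {"bread"}, {"eggs"}))
```

Here the lift comes out slightly above 1, so in this invented data bread buyers are a little more likely than the average customer to buy eggs.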
8. Mean Shift
Mean shift is a type of unsupervised clustering algorithm which is centroid-based. Every data point starts out as a candidate centroid, and each candidate is repeatedly updated to the mean of the points within a given region around it. Candidates that end up at nearly the same place are duplicates and are merged to identify the final set of centroids.
This algorithm intends to locate the modes (local maxima) of a density function through iteration. The candidates keep shifting towards the nearest highest-density point, i.e. the cluster’s centroid, during each iteration. Unlike K-means, this algorithm doesn’t need the number of clusters (k) to be specified. The algorithm detects the number of clusters on its own.
The mean-shift algorithm is also called the mode-seeking algorithm and is used in several machine learning and deep learning applications like image processing and computer vision.
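A one-dimensional version of this shifting process fits in a few lines of Python. The sketch uses a flat window of fixed width (the bandwidth) and a simple merge rule for duplicates; both choices, and the data, are assumptions made for illustration.

```python
def mean_shift_1d(points, bandwidth, max_iters=100, tol=1e-6):
    """Shift every point to the mean of its neighbours until it stops
    moving, then merge the end positions that coincide."""
    modes = []
    for start in points:
        x = float(start)
        for _ in range(max_iters):
            window = [p for p in points if abs(p - x) <= bandwidth]
            new_x = sum(window) / len(window)  # mean of the points in the window
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(new_x)
    # Remove duplicates: end positions closer than half a bandwidth
    # are treated as the same centroid.
    centroids = []
    for m in sorted(modes):
        if not centroids or abs(m - centroids[-1]) > bandwidth / 2:
            centroids.append(m)
    return centroids


# Made-up data with two dense regions, around 1 and around 5.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids = mean_shift_1d(points, bandwidth=1.0)
print(centroids)
```

Note that the number of clusters is never passed in; it falls out of the data and the chosen bandwidth.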
9. Principal Component Analysis
PCA is a dimensionality reduction algorithm that combines the original features of a data set into a smaller number of new features, thus reducing the complexity of the dataset and the training time of the model. It works because features that are linearly correlated carry largely the same information. The value of correlation can range from -1 to 1, where 1 (or -1) means the two features are perfectly linearly related. So, if a feature A and feature B have a correlation value of 0.95, then they are highly correlated and most of their information can be captured by a single combined feature.
You might think that if the number of features is reduced, we might lose important information. Some information is lost, but if we keep the components with the highest variance, the loss is usually small. The final dataset is a set of orthogonal ‘principal components’, which are uncorrelated variables built as linear combinations of the original features.
PCA is widely used in image compression, computer vision, facial recognition, finding patterns in psychology, data mining, bioinformatics etc.
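For two features, the first principal component can be worked out by hand from the covariance matrix, as in the plain-Python sketch below. The data is invented (two almost perfectly correlated features), the function is limited to the 2-D case, and it assumes the points are not all identical.

```python
import math


def pca_2d(points):
    """First principal component of 2-D data.
    Returns (component, projected values, explained variance ratio)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum(cx * cx for cx, _ in centered) / (n - 1)
    var_y = sum(cy * cy for _, cy in centered) / (n - 1)
    cov_xy = sum(cx * cy for cx, cy in centered) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    trace = var_x + var_y
    det = var_x * var_y - cov_xy ** 2
    largest = trace / 2 + math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    # Matching eigenvector, normalized to unit length.
    if abs(cov_xy) > 1e-12:
        vx, vy = cov_xy, largest - var_x
    else:
        vx, vy = (1.0, 0.0) if var_x >= var_y else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    component = (vx / norm, vy / norm)
    # Each 2-D point is replaced by a single number along the component.
    projected = [cx * component[0] + cy * component[1] for cx, cy in centered]
    return component, projected, largest / trace


# Invented data: the second feature is nearly a copy of the first.
data = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.2), (8.0, 7.8), (10.0, 10.0)]
component, projected, explained = pca_2d(data)
print(component, round(explained, 4))
```

The explained-variance ratio shows how much of the total variance survives when the two features are replaced by the single projected value.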
10. Markov Decision Process
MDP is used in situations where the outcome of a problem is partly under the control of the decision-maker and partly random. The goal of MDP is to find a policy function (solution) that specifies the action that the decision-maker chooses in a particular state ‘s’ to maximize the cumulative (sum of) rewards. The process is defined by a 4-tuple:
- A set of states S
- A set of actions A; each state s from the state space S has a set of available actions As
- Probability P, which is the probability that taking action a in state s at discrete time t leads to state s1 at time t+1
- Reward R, which is the immediate expected reward received when the state changes from s to s1
The reward can be positive (points) or negative (punishments), and the biggest rewards often arrive only towards the end. The current state and the chosen action, not the full history, determine the probabilities of the next state and reward (this is the Markov property). MDP is a classic example of learning from interaction to achieve the desired goal. The decision-maker (agent) interacts continually with the environment, and the environment responds to the actions by moving to new states and presenting new situations to the agent.
MDP is widely used in modelling environments that are discrete, stochastic and sequential. However, continuous-time MDP is also possible, where the decision can be made at any time decided by the decision-maker, as opposed to discrete time-intervals.
MDP is extensively used in economics, queuing systems, manufacturing, automatic control systems and robotics.
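One common way to find the policy of a small MDP is value iteration, sketched below in Python. The two-state process, its probabilities and rewards, and the discount factor gamma (which weights future rewards and is not part of the 4-tuple above) are all invented for this example.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-9):
    """transitions[state][action] is a list of
    (probability, next_state, reward) tuples. Returns (values, policy)."""

    def action_value(outcomes, values):
        # Expected immediate reward plus discounted value of the next state.
        return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)

    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:  # terminal state: its value stays 0
                continue
            best = max(action_value(o, values) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    policy = {
        s: max(actions, key=lambda a: action_value(actions[a], values))
        for s, actions in transitions.items()
        if actions
    }
    return values, policy


# Invented MDP: "safe" always pays 1 and stays put; "risky" reaches the
# goal (reward 12) half of the time and otherwise stays put with no reward.
transitions = {
    "start": {
        "safe": [(1.0, "start", 1.0)],
        "risky": [(0.5, "goal", 12.0), (0.5, "start", 0.0)],
    },
    "goal": {},
}

values, policy = value_iteration(transitions)
print(policy, round(values["start"], 3))
```

Value iteration needs the probabilities and rewards to be known in advance, which is what separates it from learning methods that discover them by trial and error.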
Bonus Machine learning algorithm
11. Q learning
Q learning is a reinforcement learning algorithm that finds an optimal policy function. This policy tells the decision-maker (agent) about the action to be taken under various situations. There is no model (of the environment) required, and thus it is called a ‘model-free’ algorithm. The ‘q’ refers to the ‘quality’ of action taken by the agent, rather than the action itself and is represented as Q(s, a) where a is the action and s is the state.
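In Q-learning, the agent does not know the rewards in advance; it learns an estimate Q(s, a) of the total reward it can expect after taking action a in state s, and then tries to perform the action with the highest estimate. The Q value depends on the rewards from all the future states s1, s11…. up to sn. In its standard form, this can be represented as,
Q(s, a) = r + γ · max over a1 of Q(s1, a1)
which is the Bellman Equation: r is the immediate reward, s1 is the next state and a1 ranges over the actions available in s1. The value of γ (the discount factor, between 0 and 1) decides how much future rewards contribute: the closer it is to 0, the less they count.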
So, how is the value of Q determined? Initially, we assign random values for Q, and the process is repeated after receiving feedback. The values are then updated, and with the experience, we get the optimal value of Q.
A simple example of Q learning will be a car parking game, where a car can be parked in many ways, but there is one best way, which gives maximum points. If a wrong path is followed, then you lose points. Note that just like MDP, the sequence of events and time matters in Q learning. However, the agent does not have to remember all the previous states: the current state and the learned Q values are enough to choose the next action.
Some applications of Q learning are in self-driving cars, news recommendations, traffic signal control etc.
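A tabular version of Q learning can be sketched in Python on a toy problem. Everything here is an assumption for illustration: a corridor of five positions with the goal at the right end, a reward of 10 for reaching it, a penalty of 1 for every other step, and hand-picked learning rate, discount factor and exploration rate.

```python
import random

N_STATES = 5       # corridor positions 0..4; position 4 is the goal
ACTIONS = (-1, 1)  # move left or move right


def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 10.0, True
    return next_state, -1.0, False


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn Q(s, a) by trial and error; the agent never sees a model."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(1000):  # safety cap on the episode length
            # Epsilon-greedy: mostly exploit, sometimes explore.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move Q(s, a) towards reward + gamma * best future value.
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)]
            )
            state = next_state
            if done:
                break
    return q


q_table = train()
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

The update uses only the current state, the action, the reward and the next state, so no history of earlier states has to be stored.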
How should I learn machine learning algorithms?
If you are a programmer and know at least one programming language like C++, Java, Python or R, you can start with the basics of machine learning right away. If you don’t know any programming language yet, that’s the best thing: you are like a clean sheet of paper and can select any language you want to learn! (My two cents: Python is the most comfortable to learn, but if you have time, go for Java, so that other languages become easier to pick up if the need arises later on!)
Once you choose the language, make sure you learn the following concepts thoroughly:
- Data structures like arrays, linked lists, stacks, queues, 2D arrays, sets, binary trees, hash tables etc.
- Strings, Boolean, control statements (for, while, do etc.), conditional statements (if-else if)
- Sorting algorithms like the bubble sort, merge sort, quick sort, selection sort etc.
- Object-oriented programming, system design
- Writing lots of medium-complexity and straightforward code
With basic programming skills, you can get started with machine learning algorithms. Most books come with at least a crash course on programming languages as well. Also, you need to be familiar with the library functions to be used while coding, which will come with practice, practice and more practice!
In this article, we have introduced you to some standard and powerful machine learning algorithms along with their applications. These algorithms will be covered in detail in separate articles; full implementations are beyond the scope of this article. If you wish to learn machine learning, you could start with free online video tutorials and some excellent machine learning books. With COVID-19 not ending anytime soon, machine learning algorithms are becoming all the more important for predicting the spread, identifying high-risk patients, helping patients with basic treatment through chatbots and diagnosing more patients in less time. All in all, machine learning jobs are among the highest paying jobs in the market now.