Machine learning is one of the top career options right now, alongside data science. If you have good knowledge of machine learning algorithms, you can easily move on to becoming a data scientist. Machine learning requires good technical knowledge and expertise in math, statistics, and probability. You should also have strong analytical skills and the aptitude to solve complex problems. On average, a machine learning engineer with about 5-7 years of experience earns about $134,756 a year. They are in high demand and will be more so as artificial intelligence technologies evolve in the years to come.
Machine learning is vast. However, the fundamental concepts remain the same. In our article, we have listed some very common questions asked in machine learning interviews that help interviewers understand your level of knowledge as well as an aptitude for the subject. Other than the purely technical questions, you might be asked some subjective questions like –
- Why have you chosen machine learning and not data science as your career?
- Which according to you is the most preferred language for machine learning and why?
- Describe the projects or models you have previously built using machine learning algorithms.
- If you have done any courses or certifications, they may ask you a little about neural networks and deep learning as well.
If you have previously done any ML projects, tell the interviewer(s) about your experience, problems faced, steps you took to solve the challenges, languages and tools used and so on.
Top Machine Learning Interview Questions and Answers
Let us now dive into the different Machine learning interview questions and answers that are asked in most machine learning interviews, with comprehensive answers for each –
Question: What is machine learning? Why is it important?
Answer: Machine learning is the ability of a machine to learn and identify patterns by itself, just like a human brain. Machines are fed with data sets and algorithms are applied to build and train models to produce highly accurate outcomes. For example, to find what kind of movies are more popular amongst people of a particular age group, we can feed data into machines to identify the similarities in the genre of movies people of a particular age group watch. Machine learning algorithms will then learn through the data, build a model, and improve it with more iterations.
Machine learning is important because –
- More accurate than human (manual) processing
- Saves a lot of effort, resources, and cost
- More customer requests can be processed in less time with automated systems that can take care of similar queries
- Machines can take care of monotonous jobs helping businesses focus on business logic and other important matters
Question: What are the three types of ML algorithms?
Answer: The three basic types are supervised learning, unsupervised learning, and reinforcement learning. Learn more about these in our Machine Learning Introduction article.
Question: How do you differentiate between classification and regression methods?
Answer: Both regression and classification are supervised machine learning types. Some differences are –
|Regression||Classification|
|The output variable is a numerical or continuous value, like a quantity or size.||The output variable is a category and not a quantitative value.|
|For example, predicting the age of a person, the price of a house, etc…||For example, determining whether a person will purchase a certain product, whether sales will increase or decrease in the next quarter, etc…|
|Some regression algorithms are linear regression, regression trees, and SVM.||Examples of classification algorithms include logistic regression, Naïve Bayes, kNN, and decision trees.|
Question: What will you do to ensure that your model is not overfitting?
Answer: Here are a few ways –
- Don’t introduce too many input variables and parameters
- Use cross-validation, i.e. repeatedly train the model on one part of the data set and evaluate it on the held-out part, rotating which part is held out
- Regularization i.e. adding a penalty in error function, so that the coefficients do not take any extreme values, thus reducing noise components
- Through bias-variance trade-off i.e. obtaining the optimal value of bias and variance. Low bias and high variance can cause overfitting, so reducing the variance and increasing the bias will help
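To make the cross-validation point concrete, here is a minimal sketch of how k-fold splitting can be done. The function name `k_fold_indices` and the sample sizes are assumptions chosen for illustration, and a real project would more likely use a library routine.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Every sample lands in the test part exactly once across the 5 folds.
folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be trained and scored once per fold, and the scores averaged. A large gap between training and held-out scores is a common sign of overfitting.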
Question: If your model is underfitting, what can you do to correct the model?
Answer: Underfitting happens when the model is too simple or does not take all the essential variables into consideration. It can also happen when too little data is provided. We can take care of underfitting by –
- Increasing the training time for the model
- Increasing the amount of data supplied to train the model
- Increasing the number of parameters and input variables, thus increasing the complexity of the model
Question: What is the naïve Bayes theorem? What are its disadvantages?
Answer: Naïve Bayes is a classification algorithm, which is mainly used for text classification, using high-dimensional training data. Some common uses for Naïve Bayes are spam filter, classification of news items, sentiment analysis, etc… The equation to calculate the posterior probability is,
P(B/A) = P(A/B)*P(B)/P(A)
where P(B/A) is the posterior probability of B being true given that A is true, P(A) is the probability of A being true, P(B) is the probability of B being true, and P(A/B) is the probability of A being true given that B is true.
The algorithm is very powerful; however, some disadvantages are –
- Loss of accuracy because of the assumptions of other features being independent
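- Data scarcity – if a feature value never occurs with a class in the training data, its estimated probability becomes zero unless smoothing is applied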
Question: What is the meaning of ‘Naïve’ in naïve Bayes theorem?
Answer: The naïve Bayes is called naïve because it assumes that the occurrence of one feature is independent of the occurrence of other features (predictors). In other words, the conditional probability is the pure product of individual probabilities, which in real-life is not always possible. For example, if you like chocolate and chili, as per naïve Bayes, you should be liking “chili-chocolate” as well, which is practically incorrect.
Question: How is machine learning different from deep learning?
Answer: Both ML and deep learning are branches of AI. Here are some of the differences –
|Machine learning||Deep learning|
|models can be trained with less data||requires huge dataset to train the model|
|the accuracy may or may not be high||provides high accuracy|
|it is difficult to improvise and tune the model||the model can be tuned in several ways|
|The training time for the model is less||requires more training time as it is data-intensive|
|three types of algorithms – supervised, unsupervised, reinforcement||deep learning works on neural networks consisting of three kinds of layers – input, hidden and output|
|some common algorithms are clustering, classification, regression etc…||some examples are convolutional neural networks (CNN), recurrent neural networks (RNN) etc…|
Question: Mention some of the important applications of machine learning.
Answer: Some of the most relatable applications are –
- Recommender Systems – Movie and product recommendations by sites like Netflix and Amazon respectively are the most popular ML applications. Netflix recommends movies to a user based on preferences of other similar users as well as previous preferences of the particular user. Same way, Amazon recommends similar products based on the user’s browsing patterns and the items other people bought along with a particular item.
- Voice Assistants – Voice assistants like Alexa, Cortana, and Siri use speech recognition, which converts spoken words into text. These assistants can play music, set reminders, manage task lists, set appointments, etc…
- Image Recognition – Facebook’s face tagging is a good example of image (face) recognition. In the same way, Google Photos recognizes and groups people based on their faces. Image recognition also includes character recognition for handwriting detection.
- Medical Diagnostics – Machine learning algorithms are used for detecting various diseases by analyzing clinical parameters, for research purposes, and for better tracking of patient health.
Apart from these, ML is also used in self-driving cars, live traffic updates, fraud detection, Google Translate, and dynamic flight pricing.
Question: Which one is better, a decision tree or a random forest? Why?
Answer: More often, random forests are a better version of individual decision trees.
Random forests are a collection of many decision trees. As more and more trees are added, the accuracy and stability of the model increase. Here are a few reasons why we think random forests are better –
- Random forest is fully grown because of the presence of multiple trees, so the feature space has multiple smaller regions
- The boundary from multiple trees is smooth and has a high resolution compared to that of a single tree.
- Each tree in a random forest is trained on a random sample of the data (and of the features), and the predictions of all the trees are combined. Hence, it reduces the overfitting problem that can happen with individual decision trees.
Question: Do you know about the kernel trick? How is it useful?
Answer: A kernel is a function that computes the dot product of two vectors in a high-dimensional space. The dot product of two vectors A and B is written as D = A.B.
The kernel trick is used with the SVM algorithm to improve its accuracy and performance. If a data set has n features, the SVM separates the classes using an (n-1)-dimensional hyperplane. When the data is not linearly separable, a linear SVM performs poorly, and explicitly transforming the data into a higher dimension can be expensive (imagine the number of transformations when n is a big number).
With the kernel trick, we can use a linear classifier to classify data with non-linear parameters without explicitly transforming the vectors into higher-dimension. That means, the computation is still done in the lower dimension, but we can separate the data that we weren’t able to do before using the kernel method, which separates the data in higher-dimensional space.
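As a small illustration of the idea, the sketch below uses a degree-2 polynomial kernel on 2-D vectors. The kernel choice, the helper names, and the sample vectors are assumptions picked for the example. The kernel value is computed entirely in the original 2-D space, yet it matches the dot product taken after an explicit mapping into 3-D.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def explicit_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

x, y = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, y)                      # never leaves 2-D
mapped_value = dot(explicit_map(x), explicit_map(y))  # goes through 3-D
print(kernel_value, mapped_value)
```

Both routes give the same number (up to floating-point rounding), which is why an SVM can work with the kernel alone and skip the explicit transformation.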
Question: How is machine learning related to artificial intelligence?
Answer: Machine learning is a subset of artificial intelligence where machines can perform certain tasks, learn, and improve through experience and data. In other words, machine learning comes under the umbrella of AI and focuses on writing computer algorithms that solve different types of problems like classification, regression, etc…
Question: Explain some ways to handle corrupt or missing data.
Answer: There are several ways to do this –
- Predicting the missing/corrupt values using different machine learning algorithms. For example, we can use linear regression to predict a null value for a numerical column.
- For nulls in the dataset, algorithms like KNN and random forest provide good support in filling missing values.
- We can create a new category, like ‘unknown’ to cater to missing or corrupt data. This way, we wouldn’t have to eliminate any data.
- Deleting rows that have corrupt or missing values
- Replacing the value with the mean, median, or mode of the column
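The simplest of these options, filling gaps with a summary statistic, can be sketched with the standard library alone. The `impute` helper and the `ages` column are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the known values."""
    known = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 25, None, 40]
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "mode"))
```

The median is usually the safer choice when the column contains outliers, while the mode also works for categorical columns.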
Question: What are the most common types of bias in machine learning?
Answer: They are –
- Confirmation bias – happens when the person analyzing the data has some assumptions about the data. To prove the same, they exclude certain variables from the analysis itself.
- Selection bias – this happens when the sample doesn’t represent the entire population of data. For example,
- Outliers – data points that are predominantly different from other values. For example, a value with an age of 35 years in the dataset that contains data about people from age 50-80.
- Overfitting and underfitting – Overfitting happens when a model fits the data points too closely and captures even the noise, reducing the performance of the model on new data. Underfitting is just the opposite, where an overly simplistic model is generated which is unable to capture the correct relationship between the features and the target variable.
- Confounding variables – variables that influence the independent and dependent variable. For example, exercising is directly proportional to weight loss. You might, however, miss that diet control may have a significant impact on weight loss. Thus, diet control is a confounding variable.
Question: What are the differences between machine learning and data science?
Answer: The main difference is that machine learning is a branch of AI whereas data science uses machine learning algorithms to perform predictive analysis, finding trends and forecast the future. To elaborate, read our article about DS vs ML.
Question: What is collaborative filtering and where is it used?
Answer: The most common application of collaborative filtering is recommender systems (think Netflix).
Collaborative filtering is an approach to find user preferences by learning from previous user experiences, for example, browsing history, genres of movies watched, etc… By knowing this information, machines can recommend relevant items to the user thereby increasing sales as well as user satisfaction. There are two collaborative filtering approaches –
- User-based – finds similar patterns between users who found a particular item interesting. For example, if user A likes comedy movies, and user B also liked a particular comedy movie, will user A also like that movie?
- Item-based – finds similarities between two items to find if a user who likes item A will like an item B or not
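A user-based approach can be sketched roughly as follows. The ratings, user names, and movie titles are made up for illustration, and cosine similarity over shared ratings is only one of several similarity measures that could be used here.

```python
import math

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}
ratings = {
    "alice": {"Comedy A": 5, "Comedy B": 4, "Drama C": 1},
    "bob":   {"Comedy A": 5, "Comedy B": 5, "Drama C": 1, "Comedy D": 5},
    "carol": {"Comedy A": 1, "Comedy B": 1, "Drama C": 5, "Drama E": 5},
}

def cosine_similarity(u, v):
    """Cosine similarity over the movies both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(u[m] ** 2 for m in shared))
    norm_v = math.sqrt(sum(v[m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Suggest unseen movies rated by the most similar other user."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine_similarity(ratings[user], ratings[name]))
    unseen = [m for m in ratings[nearest] if m not in ratings[user]]
    return nearest, unseen

print(recommend("alice"))
```

Here the user most similar to alice is bob, so the movie bob rated that alice has not seen becomes the recommendation. An item-based variant would compare columns (movies) instead of rows (users).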
Question: Explain the Naïve Bayes classifier.
Answer: Naïve Bayes classifier is a supervised learning algorithm based on Bayes’ theorem. The theorem calculates the probability of occurrence of event A if another event B has already happened. A is called the hypothesis and B is called the evidence. The formula to calculate the probability of A when B has occurred can be written as –
P(A/B) = P(B/A)*P(A)/P(B)
where P(A/B) is the probability of occurrence of A when B has occurred,
P(A) is the probability of occurrence of event A
P(B) is the probability of occurrence of event B
P(B/A) is the probability of occurrence of B when A has occurred
Now, if we have a vector v having multiple features, Naïve Bayes can calculate the probability that the vector belongs to each of the classes C which have k possible outcomes.
P(Ck/v) = P(v/Ck)*P(Ck)/P(v)
The numerator P(v/Ck)*P(Ck) is the joint probability P(Ck, v1, v2, …, vn) = P(v1|v2…vn, Ck)*P(v2|v3…vn, Ck)*…*P(vn|Ck)*P(Ck)
Naïve Bayes assumes that the features are conditionally independent of each other given the class. That is, the occurrence of feature v1 is independent of feature v2 and so on, so each factor in the product reduces to P(vi|Ck).
Using the probability model, a classifier can be constructed. Naïve Bayes classifier is used for face recognition, sentiment analysis, spam detection, weather prediction, etc…
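The sketch below shows one way such a classifier could be built for a toy spam example. The training messages are invented, the function names are hypothetical, and add-one (Laplace) smoothing is an assumption added here so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (list_of_words, label). Returns the counts the model needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in samples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict(model, words):
    """Return the class with the highest log-probability score."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log P(Ck) plus the sum of log P(vi|Ck), with add-one smoothing
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_naive_bayes(training)
print(predict(model, ["free", "money"]))
print(predict(model, ["meeting", "notes"]))
```

Logs are summed instead of multiplying raw probabilities to avoid numerical underflow, and P(v) is dropped because it is the same for every class.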
Question: Explain the terms likelihood, marginal likelihood, and prior probability with respect to the Bayes algorithm.
Answer: To answer this, let’s understand the Bayes theorem. It is as follows –
Bayes’ theorem is the basis of the Naïve Bayes method that we have seen in previous questions. It finds the probability of an event happening based on prior knowledge of the conditions related to the event (in the below equation P(A/B), the posterior probability). For example, if a person is overweight and has high sugar levels, the likelihood of them having diabetes is high.
P(A/B) = P(B/A)*P(A)/P(B)
In this theorem,
Likelihood – is the probability of event B to be true, if A is already true i.e. P(B/A)
Marginal likelihood – The probability that B is true, i.e. P(B)
Prior probability – The probability or knowledge that A is true, i.e. P(A)
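A short worked example may help tie the three terms together. It uses the full form of Bayes’ theorem, P(A/B) = P(B/A)*P(A)/P(B). All the probabilities below are hypothetical numbers chosen only for illustration, with A standing for "has diabetes" and B for "has a high sugar level".

```python
def posterior(likelihood, prior, marginal):
    """P(A/B) = P(B/A) * P(A) / P(B)."""
    return likelihood * prior / marginal

# Hypothetical numbers, chosen only for illustration.
prior = 0.10        # P(A): a person has diabetes
likelihood = 0.80   # P(B/A): high sugar level given diabetes
false_alarm = 0.20  # P(B/not A): high sugar level without diabetes

# Marginal likelihood P(B) via the law of total probability.
marginal = likelihood * prior + false_alarm * (1 - prior)
result = posterior(likelihood, prior, marginal)
print(round(marginal, 2), round(result, 4))
```

Even with a likelihood of 0.80, the posterior stays well below that value because the prior is small, which is the kind of reasoning interviewers often probe.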
Question: What is a time series dataset? What is the best algorithm to analyze such a data set?
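Answer: In a time-series dataset, the data points are recorded at successive times or intervals. No single algorithm is best for every time series. Classical choices include ARIMA and exponential smoothing, and general-purpose algorithms such as SVM and KNN can also be used.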
Question: In what scenarios should you use a logistic regression algorithm?
Answer: Logistic regression works well for binary or dichotomous variables (let’s say Y) i.e. a variable that can have only two possible values, like yes-no, 1-0, true-false. There can be many variables X that are related to the Y-variable and the outcome of Y (yes or no) depends on the combination of these explanatory X variables.
Question: What is PCA? How does component rotation affect PCA?
Answer: PCA or Principal Component Analysis is a technique in statistics that transforms the data into new coordinates to differentiate and summarize the data points where there are many correlated variables. For example,
|sample 1||sample 2||sample 3||sample 4|
For these variables, we can plot a graph, with X, Y, and Z axes respectively for variables 1, 2, and 3. However, if there are more than 3 variables, it would not be possible to plot a graph. Therefore, we can reduce the dimension of the datasets through PCA so that they can be clustered. PCA draws a line through the origin and then finds out the difference between a point in the line and the actual data point. Then it optimizes the line to minimize the distance by finding the squared distance between the points.
Component rotation is the orthogonal rotation used for factor analysis in PCA. Through rotation, we can get clusters of correlated variables. Rotation brings the data points at a distance relative to the origin, however, it doesn’t change the relative location of the points from each other. This makes it easier to calculate variance.
Question: How can you achieve maximum variance in PCA?
Answer: To get the maximum variance, we select the dimension that increases the distance between most data points. Then we project the points onto a vector v and calculate the variance of the projections.
Question: What is gradient descent? Explain its use with an example.
Answer: To understand gradient descent, we should know what a gradient is.
The gradient is a measure of the change in the output function based on small changes in the input. Gradient is generally the slope of a function when represented in a graphical form. If the slope is steep the learning in a model is faster, and if the slope is zero, there is no learning.
Gradient descent is an iterative algorithm that minimizes a given function. Each step moves in the direction opposite to the gradient, and the size of the step is set by the learning rate, which is the rate at which we move towards the local minimum.
Gradient descent is mainly used to minimize cost function iteratively.
Here is a nice KDNuggets article that gives a practical example to explain Gradient descent.
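For a quick feel of the mechanics, here is a minimal sketch on a one-variable function. The cost function f(x) = (x - 3)^2, the learning rate, and the step count are assumptions chosen so the result is easy to check by hand.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a local minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Illustrative cost function f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

The result approaches x = 3, where the gradient is zero. A learning rate that is too large would make the updates overshoot and diverge, while one that is too small would need many more steps.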
Question: What is a correlation matrix? How is it useful?
Answer: It is a matrix (or table) that shows correlation coefficients between variables. Every cell in the table shows the correlation between two variables. Both rows and columns list the same variables, and a cell whose row and column refer to the same variable always has a value of 1, since a variable is perfectly correlated with itself. Coefficients range from -1 to 1, and a larger absolute value means a stronger correlation. Here is a simple example –
| ||People who like chocolates||People who like ice creams|
|People who like chocolates||1||0.87|
|People who like ice creams||0.87||1|
As the number of variables increases, the matrix gains one row and one column per variable. Each coefficient still describes only the pair of variables in its row and column.
The correlation matrix is mainly used to find patterns in large datasets. It can also be used for exploratory data analysis and as a diagnostic measure to perform deeper analysis.
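A correlation matrix of this kind can be computed from raw columns as sketched below, using the Pearson coefficient. The survey scores and column names are hypothetical, and in practice a data analysis library would normally do this in one call.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """columns: dict of name -> list of values. Returns a nested dict of coefficients."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

# Hypothetical survey scores (1-10) from five people.
data = {
    "chocolate": [9, 7, 8, 3, 5],
    "ice_cream": [8, 7, 9, 2, 6],
    "broccoli":  [2, 4, 1, 9, 6],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The diagonal comes out as 1, the matrix is symmetric, and the sign of each off-diagonal entry shows whether two preferences tend to rise together or move in opposite directions.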
Question: What is the gradient boosting algorithm? How is it different from a random forest?
Answer: It is an algorithm used for classification and regression problems. It produces a prediction model that is an ensemble of various weaker prediction models, for example, a decision tree. GBM can reduce bias and variance, but can overfit. It is a type of sequential classifier.
Both GBM and Random forest are examples of an ensemble, which is just a collection of predictions to give the final prediction.
Some differences between GBM and Random forest are –
|Gradient boosting (GBM)||Random forest|
|multiple random samples are produced; however, the final prediction is the weighted average of all the predictions||uses a bootstrapping method i.e. generating random samples from the dataset and predicting the final one as a simple average of all the samples|
|we can use SVM, decision tree or any algorithm that works best for us||uses only decision trees for prediction|
|It is a boosting type of ensemble||Bagging type ensemble|
|consists of sequential classifiers i.e. build one tree at a time and the next tree depends on the correctness of the previous tree (weak learner)||each classifier is independent of other|
|reduces variance and bias both||can reduce variance|
|model can overfit||handles overfitting quite well|
Question: Explain about SVM.
Answer: SVM or Support Vector Machine is a supervised learning algorithm. In its basic form, it is a non-probabilistic, linear, binary classifier. SVM can also perform non-linear classification using the kernel trick, where inputs are implicitly mapped to a higher-dimensional space in which a separating hyperplane can be found. SVM finds the boundary (a line, plane, or hyperplane) that separates the classes with the largest margin, where the margin is the distance between the boundary and the closest data points, called the support vectors.
Question: State some differences between k-means clustering and kNN algorithms.
Answer: Since the names are similar, many people get confused between both. Here are some important differences –
|k-means clustering||kNN|
|unsupervised algorithm used for clustering; no labelled data||supervised learning algorithm used for classification, i.e. we give labelled data as input for training|
|k refers to the number of clusters (centroids), which are initially picked at random||k refers to the number of nearest neighbours consulted to label a new data point|
|the algorithm is complete when the points stop moving from one cluster to another||there is no iterative training; each new point is classified from its k nearest neighbours|
|we can choose k using the elbow method and the silhouette method||k can be chosen using cross-validation, and the result evaluated with a confusion matrix|
Question: What are type I and type II errors?
Answer: We make many decisions based on probabilities. Thus, there are chances of making mistakes and coming to incorrect conclusions. There are 4 possible outcomes of any decision using statistics, of which two are correct and the other two are wrong (errors). The two errors Type I and Type II are related and increasing one decreases the other and vice versa.
Type I error occurs when a true null hypothesis is rejected. It is also called a false positive. It means that you assert that something is true when it is false in reality.
For example, you think that you will win a game of monopoly by strategically tossing the dice and buying certain specific properties. And, when you do that, you win the game a greater number of times. However, if that’s not true, and your winning with those conditions is just by chance – that’s a type I error.
Type II error occurs when a false null hypothesis is not rejected. It is also called a false negative. It means that you assert that something is false when it is true in reality.
For example, a medical test reports that a patient is healthy when the patient actually has the disease. There was a real effect to detect, and the test missed it.
Question: What is stratified sampling? How is it better than random sampling?
Answer: Stratified sampling is a sampling method where the population is divided into smaller groups, also called strata, based on characteristics like age, marks, gender, income group, etc., and a probability sample is drawn from each group. Compared to random sampling, stratified sampling gives more precision for the same sample size. We can also reduce the sample size depending upon the precision we want to achieve.
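The procedure can be sketched in a few lines. The population, the `age_group` field, and the 10% fraction are hypothetical, and proportional allocation (the same fraction from every stratum) is just one possible allocation rule.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction at random from each stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        # Keep at least one record per stratum so small groups are not lost.
        size = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, size))
    return sample

# Hypothetical population: 80 people aged under 40 and 20 aged 40 or over.
population = [{"id": i, "age_group": "under_40" if i < 80 else "40_plus"} for i in range(100)]
sample = stratified_sample(population, key=lambda r: r["age_group"], fraction=0.1)
print(len(sample))
```

Unlike a simple random sample of 10 people, this sample is guaranteed to keep the 80:20 ratio of the two age groups.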
Question: What type of analytics can be done with machine learning algorithms?
Answer: Predictive analytics is done using machine learning algorithms. It is used in many industries for automation, marketing, detection of diseases, prevention and detection of fraud, forecasting, etc… Usually, supervised machine learning techniques are used for prediction.
Question: Mention some of the Python (or R) libraries you have used for machine learning algorithms.
- TensorFlow, etc
- CARET(Classification and Regression training)
- random forest
- wordcloud, etc…
Question: What is the difference between a training and testing set? How do you split the data into training and testing set?
Answer: The training set is used to learn from the data and build a model. The test set is used to validate the model and measure its accuracy on data it has not seen. We can split the data by specifying the percentage that goes into each set. The data is usually split 80-20, although no rule says so. In Python, you can use scikit-learn’s train_test_split to split the data.
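To show what such a split does under the hood, here is a dependency-free sketch. It is not scikit-learn’s implementation, and the helper name `split_train_test`, the seed, and the 50-row dataset are assumptions made for the example.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(50))
train, test = split_train_test(rows)
print(len(train), len(test))
```

Shuffling before splitting matters when the rows are ordered (for example, by date or by class), and fixing the seed keeps the split reproducible between runs.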
Question: What is a confusion matrix? Give an example to explain.
Answer: The confusion matrix helps visualize the accuracy of a classification model by comparing the predicted and actual classes.
|Actual \ Predicted||FALSE||TRUE|
|FALSE||TRUE NEGATIVE (TN)||FALSE POSITIVE (FP)|
|TRUE||FALSE NEGATIVE (FN)||TRUE POSITIVE (TP)|
Where TP/(TP + FP) indicate the precision.
TP is true positive i.e. predicted value is rightly predicted as true
FP – where negative or false value is incorrectly predicted as true
FN – positive values incorrectly predicted as negative values
TN – negative value rightly predicted as actual negative
The accuracy can be calculated as,
Accuracy = (TP + TN)/(TP + TN + FP + FN)
For example, a classifier algorithm determines the age range of a person based on their facial features. We feed the data along with labels and ask the algorithm to predict the values. Once all the values are predicted, we can apply the accuracy-test to evaluate the correct values.
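The four counts and the metrics derived from them can be computed as in the sketch below. The label lists are hypothetical, and a binary true/false problem is assumed to keep the example short.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for boolean labels."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

# Hypothetical labels for eight samples.
actual    = [True, True, True, False, False, False, True, False]
predicted = [True, False, True, False, True, False, True, False]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, fp, fn, tn, precision, recall, accuracy)
```

Accuracy alone can be misleading on imbalanced data, which is why precision and recall are usually reported alongside it.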
Question: Suppose you are playing a game of chess with the machine. What type of learning algorithm is it?
Answer: A game of chess with a computer is a type of reinforcement learning. Rather than feeding all the moves into the computer, it is made to learn the game and think like how a human would. The computer improves as more and more scenarios and moves are put forward. It is implemented using neural networks. The system takes a decision and makes a move, checks if it is the right move through feedback, and learns the next set of moves. The feedback can be positive or negative based on which the computer is either punished or rewarded.
Question: What are the differences between classification and regression algorithms? How do you decide which one to use?
Answer: Both classification and regression are supervised learning algorithms. However, classification divides the data into categories (class labels) and not a single value. For example, a set of balls can be categorized (or sorted) based on their colors. Regression algorithms predict a certain value or numerical quantity. For example, the age of a person based on their characteristics, likes, lifestyle choices, etc… Classification problems are evaluated using accuracy whereas regression problems are evaluated using root mean square error.
Some algorithms like decision trees can be used for classification as well as regression problems. If the problem is to label the data and assign it to a group, then we can use classification algorithms, however, if it is to find a quantity or a continuous value, we should use regression algorithms.
Question: What is the decision tree? What are the types of decision tree algorithms?
Answer: A decision tree is a supervised learning algorithm used for both regression and classification. It is called a tree because based on the features, certain questions can be formed the answers to which split the data into further questions or categories until there are no more categories left to split. Read more about the decision tree for machine learning.
Question: Which type of algorithm is email spam detection? How does it work?
Answer: Email spam detection is a classification problem, and the decision tree algorithm is one way to solve it. Decision trees split the data based on a series of questions. The questions are related to the attributes of the dataset. Here is our detailed article on how decision trees in machine learning work.
Question: There are many machine learning algorithms. How will you choose which one to use for which problem?
Answer: Many times, choosing the right algorithm is a matter of trial and error. You can try different algorithms and compare their cross-validated accuracy. If your dataset is huge, you should go for algorithms with high variance and low bias, whereas if your dataset is small, you should try algorithms or models that have low variance and high bias.
Question: What is a recall?
Answer: Recall is the number of true positives divided by the total number of actual positives in a classification problem, i.e. TP/(TP + FN). For example, if a set of 15 fruits contains 5 apples and the algorithm correctly identifies 4 of them as apples, the recall is 4/5. (If it labelled 7 fruits as apples in total, 4/7 would be its precision, not its recall.) Learn more from Wikipedia.
Question: What are the steps involved in machine learning?
Answer: Although machine learning mainly involves building, training, and testing a model, there are several other preparation steps involved to reach this final stage of model creation. The various steps of machine learning are –
- Data collection – just having a large dataset is not enough. The accuracy of a model depends upon the quality of data.
- Data preparation – wrangle and prepare data for training purposes. You may have to clean the data, remove duplicates, correct missing or corrupt values, normalize data, change data types, etc… Perform exploratory analysis on the data and split it into training and testing sets, e.g. 80:20 or 70:30.
- Choose the appropriate model (algorithm) to solve the given problem
- Train the model – model should be trained such that the accuracy score is high so that we can get an accurate prediction. To achieve this, many iterations are performed so that the model can improvise and improve.
- Test the model – Using testing data (new data), test the accuracy of the model.
- Tune the model parameters to increase the performance and correct overfitting/underfitting.
- Make predictions based on the outcome of the model
Question: What is data mining?
Answer: Data mining is the process of finding patterns and trends in large datasets (also known as big data) using statistics, machine learning, and other aspects of data science. It is also called Knowledge Discovery in Data (KDD). Some of the most popular data mining techniques are classification, regression, clustering, prediction, etc… R, Python, and Oracle are some of the most popular tools used for data mining. Have a look at our article on data science tools to see the complete list.
Question: What is the meaning of K in the KNN algorithm?
Answer: KNN stands for the k-nearest neighbours algorithm. In this, ‘k’ refers to the number of nearest neighbours that are considered when classifying a new data point.
Question: Explain how the KNN algorithm works.
Answer: KNN uses feature similarity to classify new data. KNN doesn’t make any assumptions about the data, and all the training data is kept and used at prediction time. Suppose we have a new data point, shown in the figure as the red triangle. This is the point we need to classify as belonging either with the purple squares or with the yellow circles.
The algorithm calculates the distance between the query instance (red triangle) and every labelled sample, and then picks the ‘k’ samples with the smallest distances as the nearest neighbours. In the figure, there are more purple neighbours of the red triangle than yellow ones. Hence, the triangle will be classified as a purple square.
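The same procedure can be sketched in code. The coordinates below are hypothetical stand-ins for the purple squares and yellow circles, Euclidean distance is assumed as the distance measure, and k = 3 is an arbitrary choice.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """training: list of ((x, y), label). Returns the majority label of the k nearest points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical points standing in for the purple squares and yellow circles.
training = [
    ((1.0, 1.0), "purple"), ((1.5, 2.0), "purple"), ((2.0, 1.5), "purple"),
    ((5.0, 5.0), "yellow"), ((5.5, 4.5), "yellow"), ((6.0, 5.5), "yellow"),
]
label = knn_classify(training, query=(2.0, 2.0), k=3)
print(label)
```

An odd value of k is often preferred for two-class problems so that the vote cannot end in a tie.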
Question: Explain unsupervised learning algorithms.
Answer: Unsupervised learning algorithms are ML algorithms where there is no labelled data and the machine figures out patterns by self-learning. These algorithms are capable of performing more complex analysis compared to supervised methods. The main idea of unsupervised learning is to cluster the data into similar groups. Read our article on unsupervised learning to know more.
Question: What is the difference between inductive and deductive machine learning?
Answer: Inductive learning is learning from experience: the learner observes specific examples and generalizes a rule from them. For example, if nobody warns a child about fire and the child touches it once, the child learns from that experience not to repeat it.
In deductive learning, conclusions are drawn from a set of pre-defined principles and not from experience. For example, if a child is about to touch fire, we tell them that fire is dangerous, and the child applies that rule without having to experience a burn even once.
Machine learning models can be built using either approach, although most algorithms that learn from data are inductive.
Question: What is ensemble learning? Explain about boosting and bagging.
Answer: Ensemble learning is a process where multiple models are combined to produce the final outcome. For example, a random forest is an ensemble learning algorithm that combines many decision trees to generate the final prediction. Two of the most common ensemble methods are bagging and boosting.
- Bagging – Bagging (bootstrap aggregating) is used when we want to decrease the variance while the bias remains about the same. Several subsets are drawn from the dataset by random sampling with replacement, and a separate model is trained on each subset. The predictions of all the models are then averaged (or put to a majority vote for classification) to get the final prediction. Bagging helps with overfitted models; however, it doesn’t help much with models that have a high bias. Random forest is an example of a bagging-type ensemble.
- Boosting – This is a sequential ensemble type where models are trained one after the other. After each model is trained, the samples it got wrong receive more attention in the next round, so the next model focuses on them. In a way, you are making the ensemble learn from its previous mistakes, which mainly corrects high bias and can also reduce variance. AdaBoost does this by giving wrongly classified samples a higher weight, while GBM (Gradient Boosting Machine) fits each new model to the remaining errors of the previous ones; both are boosting-type ensembles.
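To make the bagging idea concrete, here is a minimal sketch in plain Python. It covers bagging only (not boosting), and everything in it is an assumption for illustration: the base model is a one-feature threshold classifier ("decision stump"), the helper names fit_stump, bagging_fit, and bagging_predict are hypothetical, and the data is made up.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-feature threshold model: predict 1 if x > threshold, else 0."""
    best_threshold, best_errors = None, None
    for threshold in sorted(set(xs)):
        # Count how many samples this threshold would misclassify.
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        # Bootstrap sample: draw n indices at random, with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration: small values are class 0, large are class 1.
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

model = bagging_fit(xs, ys)
print(bagging_predict(model, 0.5), bagging_predict(model, 9.5))
```

Each stump sees a slightly different sample and so picks a slightly different threshold; the vote smooths out those differences, which is where the variance reduction comes from.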
Question: What is dimension reduction? How is it done?
Answer: Every day, huge volumes of data are collected from sources like social media, shopping websites, and smartphone apps. Making sense of this data and drawing solid inferences from it becomes a challenge, especially when there are a lot of features (variables).
With dimensionality reduction, we reduce the number of variables to a smaller set that captures as much of the information in the original set as possible.
Dimensionality reduction saves storage space because there are fewer features (dimensions) to store. It also makes the model easier and faster to train. Some techniques for dimensionality reduction are –
- Missing value ratio – If one or more variables have too many missing values, we can drop the variable, thus reducing the size.
- Low variance filter – If a variable has little or no variance, i.e. its value stays the same through most or all of the dataset, it carries little information and can be removed.
- High correlation filter – Variables having a high correlation with each other are likely to carry the same kind of information, so we need not retain both.
- Random forest – A trained random forest reports an importance score for each feature, so we can use it to select the small subset of features that are most helpful for training the model.
- Factor analysis – In this method, variables are grouped by their correlations, where the variables of the same group have a high correlation with each other but a low correlation with the variables of other groups. Each group is called a factor.
- Principal component analysis – In this algorithm, a set of variables is converted to a new set of variables extracted from the original ones. The new variables are called principal components. Each principal component is a linear combination of the original variables.
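Two of the simpler techniques above, the low variance filter and the high correlation filter, can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the function names pearson and select_features, the threshold values, and the toy columns are all hypothetical, and a real project would tune the thresholds to its data.

```python
import statistics

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def select_features(columns, min_variance=1e-3, max_correlation=0.95):
    """Return the names of the columns that survive both filters."""
    # Low variance filter: drop columns whose values barely change.
    kept = {
        name: values
        for name, values in columns.items()
        if statistics.pvariance(values) > min_variance
    }
    # High correlation filter: keep a column only if it is not highly
    # correlated with a column that has already been selected.
    selected = []
    for name in kept:
        if all(abs(pearson(kept[name], kept[other])) < max_correlation
               for other in selected):
            selected.append(name)
    return selected

# Toy dataset, made up for illustration.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],    # near-duplicate of height_cm
    "country_code": [1.0, 1.0, 1.0, 1.0, 1.0],      # constant, so zero variance
    "weight_kg": [70.0, 52.0, 85.0, 60.0, 77.0],
}

selected = select_features(data)
print(selected)
```

Here the constant column is dropped by the variance filter and the inches column is dropped because it duplicates the centimetres column, leaving two of the four features.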
Question: Which is your preferred language for machine learning? Why?
Answer: The answer to this question depends on your personal experience. For example, if you have worked with R, you can say that it has many ready-to-use, advanced libraries for complex calculations and that RStudio makes it easy to visualize data. If you have extensive experience in Python, on the other hand, you can state some of its advantages: Python has many libraries built specifically for machine learning, and it is easy to code and understand.
Question: Have you worked with any unsupervised learning algorithm? Explain any one.
Answer: This question is based on your personal experience. The interviewer may also ask variations of the same question, such as "explain any one unsupervised learning algorithm" or "what are unsupervised learning algorithms?" You can check our article on unsupervised learning to know about the various algorithms.
Question: What is the standardization of data?
Answer: Standardization is a method used for rescaling data attributes. It rescales each attribute so that its mean is 0 and its standard deviation is 1. In Python, standardization can be done using scikit-learn’s StandardScaler as,
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)  # X is the 2-D array of input features
rescaledVal = scaler.transform(X)
Question: Mention some of the tools used for machine learning.
Answer: Some tools are scikit-learn, RapidMiner, TensorFlow, PyTorch, and KNIME.
Question: What is an analysis of covariance (ANCOVA)?
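Answer: ANCOVA is an extension of ANOVA that includes a covariate, i.e. a continuous variable that is measured alongside the dependent variable and the categorical independent variable(s). It compares group means on the dependent variable while statistically controlling for the effect of the covariate.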
Question: If you have a small training set, which ML algorithm will you use? What if your training set is big?
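Answer: Simple models with high bias and low variance, such as logistic regression and naive Bayes classifiers, are usually best suited for small training sets because they are less likely to overfit. With a large training set, more flexible models with low bias and high variance, such as decision tree ensembles or neural networks, become practical because there is enough data to fit them reliably.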
Question: Suppose you write an algorithm to detect if a person has a particular disease. What should be the minimum accuracy of the model and how to improve it?
Answer: To detect whether a person has a particular disease, we use classification algorithms. There is no single minimum accuracy that fits every disease; the acceptable level depends on how costly a missed case or a false alarm is, though for medical diagnosis the bar is usually very high. Accuracy alone can also be misleading when the disease is rare, so we should evaluate the model using the True Positive Rate (TPR, or sensitivity) and the True Negative Rate (TNR, or specificity). To improve the model, the training data should be balanced, and a boosting-style algorithm can be used so that wrongly predicted samples get a higher weight in each iteration.
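A short, hedged example shows how sensitivity and specificity are computed and how they differ from plain accuracy. The function name sensitivity_specificity and the patient data are made up for illustration; the sketch assumes labels are 1 (has the disease) and 0 (healthy) and that both classes appear in the data.

```python
def sensitivity_specificity(actual, predicted):
    """Return (TPR, TNR, accuracy) for binary labels where 1 = disease, 0 = healthy."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # sick, predicted sick
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # healthy, predicted healthy
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # healthy, predicted sick
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # sick, predicted healthy
    tpr = tp / (tp + fn)          # sensitivity: share of sick people detected
    tnr = tn / (tn + fp)          # specificity: share of healthy people cleared
    accuracy = (tp + tn) / len(pairs)
    return tpr, tnr, accuracy

# Made-up scenario: 100 patients, 5 of whom have the disease,
# and a useless model that predicts "healthy" for everyone.
actual = [1] * 5 + [0] * 95
predicted = [0] * 100

tpr, tnr, accuracy = sensitivity_specificity(actual, predicted)
print(tpr, tnr, accuracy)
```

The model is 95% accurate yet detects none of the sick patients (sensitivity 0), which is why accuracy on its own is a poor yardstick for rare conditions.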
Question: Suppose you are the sales head of your company. There hasn’t been much increase in the sales of your product in the last quarter and you want to know why. How can ML help in achieving this?
Answer: Machine learning can help in several ways –
- Using data science and machine learning algorithms, we can analyze the sales data and identify the target audience so that sales reps can focus on marketing to those people.
- Machine learning algorithms can also generate a profile of an effective sales rep, including key roles and responsibilities, so that reps can deliver their best.
- Machine learning algorithms are well suited to predictive analysis. They can quickly find patterns and trends and identify the points where sales deviate from the expected level.
- Through machine learning, the accuracy of risk models can be improved, which helps mitigate risks and other unforeseen obstacles.
Question: Which one is more important – false positive or false negative?
Answer: One can’t say in general which is more important; both false positives and false negatives have their impact, and the answer depends on the problem. In medical testing, for example, a false positive can lead to unnecessary treatment, which is dangerous because it can cause side effects in addition to wasting time and money. On the other hand, a false negative means the disease is missed, so the real issue that the patient has goes untreated.
Both errors can be reduced by collecting more information through other variables, improving the TPR and TNR (aka sensitivity and specificity) of the test, or taking the test multiple times. However, for a given test, lowering one type of error usually raises the other. In such cases, scientists evaluate which type of error is less dangerous and make appropriate decisions.
Question: What is the R squared method? What is it used for?
Answer: R squared (R²) is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variables. To calculate R², we take the best-fit line from a regression model, subtract the predicted values from the actual values, and square the results. Summing these squared errors gives the unexplained variance. The total variance is calculated by subtracting the average actual value from each actual value, squaring the results, and summing them. R² can then be calculated as,
R² = 1 – (unexplained variance / total variance)
R² is used to judge how well a regression line fits the data. A value of 100% indicates that the model explains all the variability in the response data around the mean. In general, a higher value of R² indicates a better fit.
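The calculation described above can be written out directly. This is a minimal sketch; the function name r_squared and the numbers are made up for illustration, and the predicted values are simply assumed to come from some fitted regression line.

```python
def r_squared(actual, predicted):
    """R squared = 1 - (unexplained variance / total variance)."""
    mean_actual = sum(actual) / len(actual)
    # Unexplained variance: squared errors between actual and predicted values.
    unexplained = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variance: squared deviations of the actual values from their mean.
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - unexplained / total

# Made-up values; `predicted` stands in for the output of a fitted model.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.5, 8.5]

score = r_squared(actual, predicted)
print(round(score, 2))
```

With these numbers the squared errors sum to 1.0 and the total squared deviation is 20.0, so the score is 1 – 1/20 = 0.95, i.e. the model explains 95% of the variance.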
Question: What are the advantages and disadvantages of a linear model?
Answer:
Advantages –
- The simplest model to execute.
- Easy to visualize.
- When we know that the relationship between the independent and dependent variables is linear, it is the most suitable algorithm.
Disadvantages –
- Most of the time, there is no clear linear relationship between the dependent and independent variables.
- Linear regression is sensitive to outliers.
- It can be used only when there is a linear relationship between the variables.
Question: What is the difference between RMSE and F1 scores?
Answer: F1 score is a classification metric defined as the harmonic mean of recall and precision. It ranges from 0 to 1, where 1 is the best and 0 is the worst.
F1 = 2 * (recall*precision/(recall + precision))
F1 is more useful than accuracy when the class distribution is uneven.
RMSE (Root Mean Square Error) is a regression metric that measures the standard deviation of the errors (residuals), i.e. how spread out the residuals are. It is calculated as the square root of the Mean Square Error (MSE), which is the average of the squared differences between predicted and actual values over all the data points. In short, F1 evaluates classification models, while RMSE evaluates regression models.
MSE = (1/n) ∑ eᵢ², where the sum ∑ is for i = 1 to n and eᵢ is the difference between the actual and predicted value of the i-th data point.
RMSE = √MSE
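Both formulas can be checked with a small Python sketch. The function names f1_score and rmse and all the numbers are made up for illustration; the F1 helper assumes binary labels (1 = positive) and at least one true positive, so no division by zero occurs.

```python
import math

def f1_score(actual, predicted):
    """F1 = 2 * (recall * precision) / (recall + precision), for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    return 2 * (recall * precision) / (recall + precision)

def rmse(actual, predicted):
    """RMSE = square root of the mean of the squared errors."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Classification example (made-up labels): 3 true positives,
# 1 false positive, 1 false negative.
labels_true = [1, 1, 1, 0, 0, 0, 0, 1]
labels_pred = [1, 1, 0, 0, 0, 1, 0, 1]
f1 = f1_score(labels_true, labels_pred)

# Regression example (made-up values): errors are 1, 0, -1, 0.
values_true = [3.0, 5.0, 7.0, 9.0]
values_pred = [2.0, 5.0, 8.0, 9.0]
error = rmse(values_true, values_pred)

print(f1, round(error, 4))
```

In the classification example precision and recall are both 3/4, so F1 is 0.75; in the regression example MSE is 2/4 = 0.5, so RMSE is about 0.7071.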
Question: What is an intelligent cloud?
Answer: The integration of cloud computing with machine learning is termed the intelligent cloud. Running ML algorithms and storing data on the cloud can greatly reduce computation time. Most major cloud providers offer ML capabilities. Some examples of ML using the cloud are cognitive cloud services, chatbots, personal assistants, IoT, and business intelligence.
The interviewer might ask you to write down formulae, draw diagrams, and explain concepts like linear regression, SVM, decision trees, and others using pictures. They might ask you one or two questions about real-life practical problems that can be solved using machine learning such as –
- Recommendation systems
- Email spam filtering
- How to remove outliers from data
- Giving a practical case for using decision tree
- Selecting the right algorithm for the given problem with a given dataset
For example, which algorithm is best suited to determine whether a person has cancer, which is the best algorithm to understand the shopping preferences of people aged between 15 and 20 years, or which algorithm is the best one to predict whether a particular person will opt for a particular course.
Think about each question from an analytical angle. Whether your choice of algorithm is right or wrong, your justification should be logical and thorough. What matters most to the interviewer is whether you can think logically, give appropriate reasoning for your actions, and solve problems mathematically by analysing huge datasets.