Data science is one of the top career options right now, and the average salary of a data scientist can go up to $117,345 per annum. Cracking a data science interview, however, is not easy.
There are various technical and non-technical data science interview questions that the interviewer may ask you to test your overall skills, including analytical, programming, statistics, domain, creativity, and communication skills. Some interviews could be mixed, i.e., the interviewer could ask you questions on data science and machine learning, so you need to be well-prepared.
Here are a few generic but important data science interview questions that you can expect in most interviews:
- Why did you choose data science as your career path?
- What tools would you like to work with for data science?
- What have you learned about data science in your previous projects?
- Have you ever faced challenges in any stage of data science? If yes, how did you overcome them?
These data science interview questions are subjective and can have different answers based on individual experiences. For example, for the first one, you can say that you have always liked mathematics and love to solve analytical problems.
Remember that the interviewer intends to ask you these questions to test your motivation and involvement in the subject. If you are beginning your career in data science, you can talk about any courses or certifications you did in data science and mention the projects, challenges you faced, etc.
Top 50 Data Science Interview Questions
Let us now discuss the top 50 data science interview questions and answers asked in most interviews. We have divided them into three levels: Basic, Intermediate, and Advanced. So, let's start.
Basic Data Science Interview Questions
1. What is data science? What are the steps involved in data science?
Most web definitions begin by calling data science an interdisciplinary field. However, it is better to answer the question in your own words than to recite a definition.
Data science is the study of vast amounts of data collected from numerous sources. It helps businesses gain valuable insights into customer needs, preferences, behaviors, and patterns, and helps predict the future course of action based on past and present data.
The steps involved in data science are:
- Problem definition – This step involves identifying and defining the problem. For example, “Why did the product sales reduce in the last quarter?”
- Data collection – This step involves collecting data from various sources, like databases and surveys.
- Data wrangling or data preparation – Removal of redundancy and outliers, completing missing data and filtering relevant data.
- Machine learning and data mining – This step involves building, training, and testing various algorithms to analyze data and produce valuable insights.
- Data visualization – View the insights gained in a readable format for non-technical (business) persons.
- Predictions and generating solutions – In this step, business analysts and other stakeholders decide the next course of action for the business problem defined in the first step.
2. What is the difference between data science and machine learning?
Data science involves many stages, including problem definition, data collection, cleaning, analysis, visualization, and prediction. Machine learning, on the other hand, is just one step in the data science lifecycle, where the data is modeled for analysis and for obtaining insights.
3. How are data science and artificial intelligence related?
Artificial intelligence involves training machines to behave and make decisions similar to humans. Machine learning is a subset of AI in which machines learn from data, and it is the part of AI most widely used in data science for performing various analytical operations. ML is thus the main link between data science and AI.
4. Which is the best language for data science and why?
You can choose your preferred language or the one you have used the most. But the reason should be more than just saying, “because I have worked only with this language.” For example, if you say that your favorite language is Python, you can say that it has a lot of libraries that cater to data science operations, making it easy to code. Further, Python has a straightforward syntax and enables faster programming.
5. What are the tools and techniques used for data science?
Each stage of data science uses different tools. For example, Python, R, KNIME, Excel, and SAS are used for data analysis. Similarly, Tableau, D3, and ggplot2 are available for visualization.
Note: Check out the list of the most popular data science tools.
6. If you have a dataset of 1 lakh (100,000) students who have applied for a particular course, how would you fetch only those who satisfy all the eligibility criteria to apply for the course?
This can be quickly done by using a SQL query. Suppose the eligibility criteria are that the students should obtain a minimum of 75% marks and be aged between 20 and 23 years.
Let us write a simple SQL query for the same:
select * from student where avg_marks >= 75 and student_age between 20 and 23;
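To try the query above without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the four sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database with a hypothetical `student` table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, avg_marks REAL, student_age INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [
        ("Asha", 82.5, 21),   # eligible
        ("Ravi", 74.0, 22),   # marks below 75
        ("Meera", 90.0, 24),  # older than 23
        ("Kiran", 75.0, 20),  # eligible: both boundaries are inclusive
    ],
)

# Same filter as the query above; BETWEEN includes both end points.
rows = conn.execute(
    "SELECT name FROM student "
    "WHERE avg_marks >= 75 AND student_age BETWEEN 20 AND 23 "
    "ORDER BY name"
).fetchall()
eligible_names = [row[0] for row in rows]
conn.close()

print(eligible_names)  # ['Asha', 'Kiran']
```

Note that `>=` and `BETWEEN` both include the boundary values, so a student with exactly 75% marks or aged exactly 20 or 23 is selected.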
7. What is the difference between data wrangling and data transformation?
There is a thin line of difference between data wrangling and data transformation (ETL). Essential differences between the two are:
| Data Wrangling | Data Transformation (ETL) |
|---|---|
| Involves exploring the raw data and preparing it. | The data is usually received in a well-structured format and is transformed into another format. |
| Used by data analysts, line managers, and business analysts who want specific answers. | Used by IT professionals who extract, transform, and load data to the required data store and in the specific format. |
| Popular tools are Apache Spark, Hadoop, MS Excel, SAS Data Preparation, and PowerBI. | MSSQL Server, SAS Data Integration Studio, and Matillion are some of the popular data transformation tools. |
8. What is data mining? How is it useful?
Data mining is analyzing data from various sources to find patterns and valuable insights for making predictions. It helps enhance marketing, build customized user experience, understand customer requirements better, improve sales, and so on.
Let's understand this with an example. Let us say you own a coffee shop, and the business is growing well. You now want to expand the business by adding new products to provide a better experience for coffee lovers. How would you know what products will sell the most?
You must study how customers react to a particular product, what they usually eat with their coffee, and so on. All this information and more has to be analyzed to make a prediction. This process is called data mining.
9. List some of the data-cleaning techniques.
Following are some of the standard data-cleaning techniques:
- Remove redundant or duplicate values.
- Fill in missing values and remove irrelevant values.
- Remove or correct data with typos and wrong inputs and information.
- Correct the data types and assign appropriate values, like N/A, for values that don’t fit into any data type category.
10. Tell us about the most common applications of data science.
The most common data science applications are recommender systems, speech and image recognition (like Alexa, Siri, and Cortana), digital marketing, healthcare systems, and logistics.
Note: Check our top 10 data science applications for a complete list of examples.
11. A company’s sales for a quarter dropped from 73% to 58%. How would you find out the possible causes?
The reduction in sales could be due to many factors. We can analyze the data for that quarter, along with the previous two quarters, to identify them. Possible causes include a competitor offering a better price for the same product or service, a change in company policies or structure, or a lack of new value additions or attractive offers that would encourage customers to stay.
12. What are outliers? How can you treat them?
An outlier is an extraordinarily high or low value in the data that lies outside of the region where the other data points in the data sample lie. An outlier can be a real value or just a mistake in the input data.
The following ways help to handle outliers:
- Remove the records with outliers – If you think outliers can produce skewed results in your data, remove them.
- Correct the value – Correct the value if you think the outlier is just a mistake in input data. For example, someone might have entered 2000 in place of 20.00.
- Cap the outlier data – Set a maximum or minimum level for the particular field where the outlier is present.
- Transform the data – Instead of changing individual records, you can transform the whole field where the outlier is present. For example, you can create an average of the values and use the average instead of individual values.
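The removal and capping options above can be sketched in a few lines of Python. The price list is made up, and the 1.5 × IQR rule used to flag outliers is a common rule of thumb that this article does not prescribe, so treat it as an assumption.

```python
from statistics import quantiles

# Hypothetical prices; 2000.0 is a likely typo for 20.00.
prices = [18.5, 19.0, 20.0, 20.5, 21.0, 21.5, 22.0, 2000.0]

# Flag values that lie more than 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles(prices, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower or p > upper]

# Option 1: remove the records with outliers.
removed = [p for p in prices if lower <= p <= upper]

# Option 2: cap (clip) every value to the [lower, upper] range.
capped = [min(max(p, lower), upper) for p in prices]

print(outliers)  # [2000.0]
```

Capping keeps the record count unchanged, which matters when the other fields of the record are still useful.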
13. What is big data? How is raw data transformed into a more usable form?
Big data refers to the vast volumes of structured or unstructured data collected and stored in different data stores to be processed later. There are many tools to transform raw data into a more usable form. Some of the most popular ones are Microsoft SSIS, SAS Data Integration Studio, Oracle Data Integrator, and Informatica PowerCenter.
14. Suppose in your data collected, 25% of the data is missing. What can you do to correct it?
If the data set is vast, the records with missing values may simply be dropped, provided enough data remains and the values are missing at random. If it is a relatively small dataset, the missing values can be filled in (imputed) using the mean or median of the other values. Python has the pandas library, which provides functions for the mean, median, and other statistical measures.
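Both options can be illustrated with a minimal sketch that uses only the Python standard library instead of pandas. The `ages` column and its values are hypothetical.

```python
from statistics import mean

# Hypothetical column in which 2 of 8 values (25%) are missing.
ages = [23, None, 31, 27, None, 35, 29, 23]

observed = [a for a in ages if a is not None]
missing_share = 1 - len(observed) / len(ages)

# Option 1: drop the missing entries (reasonable when plenty of data remains).
dropped = observed

# Option 2: impute each missing entry with the mean of the observed values.
fill_value = mean(observed)
imputed = [fill_value if a is None else a for a in ages]

print(f"{missing_share:.0%} missing, filled with {fill_value}")
```

Mean imputation keeps every record but shrinks the variance of the column, so the median or a model-based estimate may be a better fit for skewed data.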
15. What steps would you follow to understand the shopping preferences of people between the ages of 30 and 40?
First, we need to collect data from the target audience (people aged 30 to 40). Next, we need to filter the data and keep only the relevant information, like previous purchase history, page likes, browsing data, and age. Then, we need to transform the data so that statistical analysis methods can be applied to find common trends and patterns. Lastly, we can use a visualization tool to view the outcomes and insights.
16. What are the differences between overfitting and underfitting?
Overfitting and underfitting are two common modeling errors that can occur when machine learning algorithms are applied to a training data set. Thus, both conditions are responsible for the poor performance of machine learning algorithms. The following table enumerates the various essential differences between the two:
| Overfitting | Underfitting |
|---|---|
| The model learns even from noise and inaccurate or irrelevant data entries. | The model is unable to capture or fit the underlying trend of the data. |
| It happens when the model is too complex for the amount of training data, so it memorizes details and noise and then categorizes new data incorrectly. | It happens when the model is too simple for the data, for example when trying to build a linear model with a non-linear set of data. |
| The model performs very well on the training data but poorly on unseen data. | The model performs poorly on both the training data and unseen data, so it is prone to mistakes and wrong predictions. |
| Popular techniques to avoid overfitting include pruning, regularization, and cross-validation. | Underfitting can be reduced by using a more expressive model, adding informative features, or reducing regularization. |
17. What is sampling? What are the differences between cluster sampling and systematic sampling?
Sampling is selecting a specific part of the data from a vast set. Based on the sample, inferences about the entire data can be made. The sample can be selected using probability-based (random) methods or non-probabilistic methods like research-based judgment. The sample size should be neither too big nor too small.
- Cluster sampling – In this method, the data scientist or analyst divides the data into clusters. A random selection of clusters is then drawn, and the sampled clusters are analyzed to produce an outcome for the entire data.
- Systematic sampling – In this method, samples are collected from a random starting point, but the interval is fixed and periodic. Called the sampling interval, it is determined using the desired sample size.
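The two methods can be compared with a minimal sketch. It assumes a hypothetical population of 100 record IDs, fixed-size clusters of 20 records, and cluster sampling in the sense of drawing whole clusters at random.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

population = list(range(1, 101))  # hypothetical record IDs 1..100

# Systematic sampling: random starting point, then every k-th record.
sample_size = 10
interval = len(population) // sample_size   # sampling interval k = 10
start = random.randrange(interval)          # random start within the first interval
systematic_sample = population[start::interval]

# Cluster sampling: split the data into clusters, then pick whole clusters at random.
cluster_size = 20
clusters = [population[i:i + cluster_size]
            for i in range(0, len(population), cluster_size)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [record for cluster in chosen_clusters for record in cluster]

print(systematic_sample)
print(len(cluster_sample))
```

In the systematic sample, consecutive records are always exactly one sampling interval apart, whereas the cluster sample contains two complete blocks of neighboring records.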
18. What types of biases can occur during data sampling?
There are many forms of biases that can occur during data sampling. The most common are:
- Confirmation bias – This type of bias occurs when analysts want to prove or confirm an assumption or support a conclusion that has been defined beforehand.
- Selection bias – This type of bias occurs if the data selected is not an accurate representation of the complete data. For example, many people might complete a survey, but if most of the target audience did not, the resulting data is biased.
- Outliers – These are extreme values, i.e., too low or too high. Such values don't fit with other data points and can cause algorithm errors.
- Overfitting and underfitting – Overfitting occurs due to data overload causing the model to analyze even noise and inaccurate values. On the other hand, lack of data causes underfitting, thus building a model that doesn’t give the real picture of the underlying data.
- Confounding variable – It is a variable that has a hidden impact on the other dependent variables but wasn’t accounted for in the data. There can be many confounding variables in a data set. These can cause errors in judgment rendering the analysis useless.
19. What is the difference between data analysis and data analytics?
Both terms are used interchangeably, but there is a subtle difference. Data analytics is a general term that encompasses data analysis and other activities like data identification, data extraction and validation, filtering, and making analysis-based predictions. While data analysis examines past data, analytics also includes looking into the future.
20. What are the steps involved in data analysis?
- Identify and set the goals – What are we trying to achieve by performing the analysis? What is the problem we are trying to solve?
- Setting up the parameters of the measurement – What are we going to measure? What kind of data do we require?
- Data collection and cleaning – Identify and gather data from various possible sources, and clean the data to remove inaccuracies, redundancy, duplicates, etc.
- Analyze and interpret the data – Analyze the data using different tools and techniques. Apply algorithms if necessary and visualize the data using various tools to interpret the insights obtained.
21. How is a training set different from a testing set?
Datasets are divided into training and testing sets, with the larger portion given to the training set. The training set is used to build the machine learning model, while the testing set is used to validate the model on data it has not seen. In other words, the model's coefficients are learned from the training set, and the same coefficients are then applied to the testing set to measure how accurate the predictions are.
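A split like this can be sketched with the standard library alone. The helper name `split_train_test`, the 80/20 ratio, and the seed are illustrative assumptions, not a fixed rule.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # shuffle so the split is random
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))            # hypothetical dataset of 100 records
train, test = split_train_test(rows)

print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters when the records are stored in some order (for example, by date or by class), because otherwise the testing set would not be representative.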
22. What are the different types of machine learning algorithms?
There are three main types of machine learning algorithms – Supervised, Unsupervised, and Reinforcement.
Intermediate Data Science Interview Questions
23. List some differences between linear regression and logistic regression.
| Linear Regression | Logistic Regression |
|---|---|
| Used when the dependent variable is continuous and the line (equation) is linear. | Used when the dependent variable is binary (yes/no, 0/1, true/false). |
| Dependent and independent variables have a linear relationship. | There may or may not be a linear relationship between independent and dependent variables. |
| The independent variables can be correlated with each other. | Independent variables should not be correlated. |
24. How is machine learning different from deep learning?
Both machine learning and deep learning are branches of AI. Machine learning finds extensive use in data science to build models and create algorithms that give valuable insights into data. An ML system trains itself on the data fed to it and then uses test data for validation and improvement. Deep learning is a further subset of ML that uses neural networks.
Neural nets contain multiple layers of ‘neurons’ stacked one above the other, similar to how the human brain's neurons are structured. Deep learning requires more data and training time, but the neural network selects the features, whereas, in machine learning, feature selection is made by humans. You can think of deep learning as a more specialized form of ML.
25. Which algorithm will you use to find out what movies a particular person would prefer to watch?
A recommender system is ideal for finding user preferences. Most recommender systems are based on user-based or item-based collaborative filtering algorithms. Matrix factorization and clustering are other popular approaches for finding what the user likes.
26. What is data visualization? What are the tools used to perform data visualization?
After data is analyzed and the insights are produced, it can be challenging for non-technical persons to grasp the outcomes fully. Presenting the data in a graphical or chart form makes it easier for business analysts and stakeholders to understand the insights. This process, where insights are communicated using visual tools, is called data visualization. Tableau, D3, PowerBI, FineReport, and Excel are among the most popular data visualization tools. The basic visualization functions available in Python and R work just fine too.
27. What is the difference between logistic regression and Support Vector Machine?
Both logistic regression and SVM are used to solve classification problems and are supervised learning algorithms. Some main differences between both are:
| Support Vector Machine (SVM) | Logistic Regression |
|---|---|
| Used for classification as well as regression problems. | Used for classification problems only. |
| Leverages a hyperplane (decision boundary) to separate data into classes. It uses a kernel to determine the best decision boundary. | Makes use of a sigmoid function to find the relationship between variables. |
| Works well on semi-structured or unstructured data. | Works well with pre-defined independent variables. |
| Less risk of overfitting. | More risk of overfitting. |
| Based on geometric properties. | Based on statistical approaches. |
28. List some disadvantages of the linear model.
The linear model assumes that all relationships between independent and dependent variables are linear, thus over-simplifying problems that may not be simple. When the true relationship is non-linear, this leads to inaccurate results. Linear models are also sensitive to outliers and noisy data, which can distort the fitted line, and with many features they can still overfit.
29. State some differences between a decision tree and a random forest.
| Decision Tree | Random Forest |
|---|---|
| A decision tree is a single tree that acts upon one dataset. | A random forest is a collection of multiple decision trees. |
| The tree is built using the entire dataset, considering all the variables and features. | Each tree is built from a random sample of observations and a random subset of features (variables), and the outputs of the trees are combined. |
| Simple and easy to interpret. | Harder to interpret, because the randomness cannot be controlled. For example, you cannot control which feature went into which decision tree. |
| Less accurate than a random forest on unseen validation data. | Usually more accurate; accuracy improves as trees are added and then levels off. |
30. What is the ROC curve? How does it work?
The ROC (Receiver Operating Characteristic) curve shows the diagnostic ability of a binary classifier as a graphical plot. It is created by plotting the true positive rate (TPR), which is the share of actual positives that were correctly predicted as positive, against the false positive rate (FPR), which is the share of actual negatives that were incorrectly predicted as positive, at various classification thresholds. ROC curves are used extensively in the medical field for evaluating diagnostic tests.
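To see how the points of the curve arise, here is a minimal sketch that sweeps the threshold over a handful of made-up labels and classifier scores. The helper names `roc_points` and `area_under_curve` are hypothetical, and the area is computed with the trapezoid rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]                  # hypothetical actual labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]      # hypothetical classifier scores

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(round(auc, 3))  # 0.889
```

An area of 1.0 would mean every positive is scored above every negative, while 0.5 corresponds to random guessing.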
31. What is a confusion matrix?
A confusion matrix measures performance for classification problems in machine learning. In a classification problem, the outcome is not a single value but two or more classes. The confusion matrix is a table (or matrix) that compares the predicted positive and negative values with the actual positive and negative values. These values can be used to determine the precision using the formula: Precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives.
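The four cells of a binary confusion matrix and the precision formula can be worked through on a small example. The actual and predicted labels below are made up for illustration.

```python
from collections import Counter

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical true labels
predicted = [1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model output

# Count each (actual, predicted) combination.
counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]   # true positives
fp = counts[(0, 1)]   # false positives
fn = counts[(1, 0)]   # false negatives
tn = counts[(0, 0)]   # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(actual)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")   # TP=3 FP=1 FN=1 TN=3
print(precision, recall, accuracy)          # 0.75 0.75 0.75
```

Recall and accuracy are included only for comparison; they come from the same four counts.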
32. Suppose you have a pack of 52 cards, and you pick one card. If you know that the card is a face card, what is the probability that it is a queen?
This can be calculated using the Bayes formula as shown below:
- There are four queens in the pack, so the probability of getting a queen is P(Q) = 4/52.
- Every queen is a face card, so P(F|Q) = 1.
- There are three kinds of face cards (jack, queen, and king) with four of each in a pack, so the probability of getting a face card is P(F) = 12/52.
The probability that the card is a queen, given that it is a face card, is:
P(Q|F) = (P(F|Q) × P(Q)) / P(F)
P(Q|F) = (1 × 4/52) / (12/52) = 1/3
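The result can be double-checked with exact fractions and by brute-force counting over a deck. This is only a verification sketch; the rank and suit labels are arbitrary.

```python
from fractions import Fraction
from itertools import product

# Bayes' formula with exact fractions.
p_queen = Fraction(4, 52)
p_face_given_queen = Fraction(1)
p_face = Fraction(12, 52)
p_queen_given_face = p_face_given_queen * p_queen / p_face

# Brute-force check: count queens among the face cards of a full deck.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))
face_cards = [card for card in deck if card[0] in ("J", "Q", "K")]
queens = [card for card in face_cards if card[0] == "Q"]
counted = Fraction(len(queens), len(face_cards))

print(p_queen_given_face, counted)  # 1/3 1/3
```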
33. How would you define the bias-variance trade-off?
The bias-variance trade-off is a property of predictive models: models with high bias in the parameter estimation tend to have lower variance (across the samples), and vice versa. Both bias and variance contribute to the prediction error, so the goal is to find the level of model complexity that minimizes the total error.
34. What is meant by binary classification?
In binary classification, the elements of a data set are classified into two groups (categories) by applying a classification rule. For example, positive-negative, true-false, yes-no, etc. Some of the commonly used, simple and effective binary classification algorithms are logistic regression, random forest, and SVM.
35. What is the normal distribution of data?
Many datasets approximately follow a normal distribution, whose probability distribution is a bell curve (shaped like a bell). The normal distribution is symmetric about the mean, and values near the mean occur more frequently than values far from it. Some examples of approximately normal data are human height and marks in an exam.
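The shape of the bell curve can be explored with `statistics.NormalDist` from the Python standard library. The mean of 170 cm and standard deviation of 10 cm are assumed values for illustration, not measured data.

```python
from statistics import NormalDist

# Hypothetical heights: mean 170 cm, standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)

# Share of values within one and two standard deviations of the mean.
within_one_sd = heights.cdf(180) - heights.cdf(160)
within_two_sd = heights.cdf(190) - heights.cdf(150)

# Symmetry: the density is the same 10 cm below and 10 cm above the mean.
density_below = heights.pdf(160)
density_above = heights.pdf(180)

print(f"{within_one_sd:.1%} within 1 SD, {within_two_sd:.1%} within 2 SD")
```

Roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two, which is why values far from the mean are rare.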
36. Why are statistics vital for data science?
Statistics provides fundamental tools and techniques to get deeper insights into the data. It also helps us quantify and analyze uncertainty in the data and obtain consistent results. Statistics is used in several stages of data science, namely problem definition, data acquisition, data exploration, analysis, modeling, validation, and reporting.
37. Do you know about covariance and correlation?
Both covariance and correlation measure the dependency and relationship between two variables. However, some important differences between the two are:
| Covariance | Correlation |
|---|---|
| Indicates only the direction of the linear relationship between the variables. | Indicates both the strength and the direction of the relationship. |
| Measured in units (the product of the units of the two variables). | Not measured in units; it always lies between -1 and 1. |
| Calculated as the expected value of the product of the two variables' deviations from their expected values. | Calculated by dividing the covariance of the two variables by the product of their standard deviations. |
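The effect of units can be seen in a short sketch that computes both measures by hand. The study-hours and score data are made up, and the sample (n - 1) form of covariance is an assumption.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]                     # hypothetical hours studied
score = [2, 4, 5, 4, 5]                     # hypothetical score out of 5
score_scaled = [s * 100 for s in score]     # same scores in different units

cov, cov_scaled = covariance(hours, score), covariance(hours, score_scaled)
corr, corr_scaled = correlation(hours, score), correlation(hours, score_scaled)

print(cov, cov_scaled)                        # 1.5 150.0
print(round(corr, 3), round(corr_scaled, 3))  # 0.775 0.775
```

Rescaling one variable multiplies the covariance by the same factor but leaves the correlation unchanged, which is why correlation is easier to compare across datasets.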
38. What is the p-value?
The p-value is used in hypothesis testing. It is the probability of obtaining results at least as extreme as the ones observed, assuming that the null hypothesis is correct. Equivalently, it is the smallest level of significance at which the null hypothesis would be rejected, so a small p-value is evidence against the null hypothesis. P-values can be looked up in p-value tables or computed with statistical software.
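As one concrete case, a two-sided z-test for a sample mean can be sketched with the standard library. The sample figures, the known standard deviation, and the 0.05 significance level are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """p-value of a two-sided z-test with known standard deviation `sigma`."""
    z = (sample_mean - null_mean) / (sigma / sqrt(n))
    # Probability of a result at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical test: null hypothesis says the mean is 50; we observed 52.
p_value = two_sided_p_value(sample_mean=52, null_mean=50, sigma=5, n=25)
alpha = 0.05                      # chosen significance level
reject_null = p_value < alpha

print(round(p_value, 4), reject_null)
```

Here the z-score is 2, so the p-value is about 0.0455 and the null hypothesis is rejected at the 5% level but would not be at the 1% level.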
39. Define regularization and explain why it is essential.
Regularization refers to techniques used to avoid overfitting on the given training set and to reduce the error on new data by fitting the function appropriately. Overfitting happens when the learning algorithm fits the noise along with the underlying pattern, so the result is the pattern plus noise (error or deviation). Regularization adds a penalty on model complexity, which discourages the model from fitting the noise.
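The shrinking effect of a penalty can be shown with ridge (L2) regression in its simplest form: one feature and no intercept, where the slope has the closed form Σxy / (Σx² + λ). The data points and the penalty values are made up, and ridge is only one of several regularization techniques.

```python
def ridge_slope(xs, ys, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2 for a no-intercept model."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lam)

xs = [1, 2, 3, 4]              # hypothetical feature values
ys = [2.1, 3.9, 6.2, 7.8]      # hypothetical noisy targets, roughly y = 2x

unregularized = ridge_slope(xs, ys, lam=0)    # ordinary least squares
regularized = ridge_slope(xs, ys, lam=10)     # penalty pulls the slope toward 0

print(round(unregularized, 4), round(regularized, 4))  # 1.99 1.4925
```

A larger penalty `lam` gives a smaller coefficient; in practice its value is usually chosen by cross-validation.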
40. What is a confounder variable?
A confounder, or confounding variable, is an external variable that influences both the independent and the dependent variables, distorting the apparent relationship between them. Because of such an influence, the entire experiment can go wrong and yield useless outcomes. To avoid such problems, confounding variables must be identified and considered before starting the experiment.
41. Give some differences between R and Python.
R and Python are two powerful languages for data science, particularly for data analysis. The following table lists the major differences between the two:
| Python | R |
|---|---|
| Easy to learn and interpret. Features simple and readable text. | Needs more time to learn, understand the syntax, and write programs. |
| More popular than R. | More users are switching from R to Python than the other way around. |
| Used chiefly for programming by developers. | Researchers and analysts use R for scholarly and research purposes. |
| Has a rich set of libraries but less than R. | Has extensive libraries that are readily available. |
| Good documentation and tremendous community support. | Good documentation, GitHub repos, and broad community support. |
| Simple and understandable graphs, though not as fancy as R. | Great graphical depictions using RStudio. |
For developers, Python is the preferred language to start with. However, R should be your choice if you do more statistical analysis and need excellent graphs for reporting.
42. What is the fastest and most accurate method of data cleaning?
There are many ways to perform data cleaning. The answer to this question can be subjective depending upon the tool(s) you have previously used. For example, you can perform data cleaning by writing code in Python or any other language, or you could do it at the database level using SQL queries.
MS Excel is another excellent tool for filtering, correcting, and cleaning data. However, code-level cleaning with SQL or Python can run into performance issues on very large data, and Excel works only on medium or small data sets. For huge datasets, distributed tools like Apache Spark and Hadoop are better suited because they work faster and give good results. Some other efficient data cleaning tools are PowerBI and Alteryx.
43. What are the main types of statistical analysis techniques?
There are three main types of statistical analysis techniques:
- Univariate – The simplest type of analysis, where only one variable is involved. The purpose of the analysis is to describe the data and find patterns. It doesn't deal with causes or relationships in the underlying data.
- Bivariate – As the name suggests, this technique involves two variables. It deals with causes and relationships. For example, do the sales of an AC go up during summer or winter? Through the analysis, we can find the relationship between the two variables.
- Multivariate – This technique is used when the data contains more than two variables. The focus of the analysis depends on the questions that need to be answered. For example, a company promoting four products may want to analyze which one gets a better response.
44. What is a false negative? How does it impact accuracy?
A false negative is a situation where a test result is negative when it should have been positive. For example, a pregnancy test might show negative during the very early stages even though the person is pregnant. False negatives reduce accuracy and are risky because a system you believe is safe is not actually safe, which can have serious consequences later. For example, if a quality check on raw materials passes when it should fail, the defective material goes through and leads to a poor product later.
45. What is collaborative filtering? Where is it used?
It is a technique that selects content or products (or any other items) that a particular user may like based on the preferences of users with similar interests. Collaborative filtering works by finding, within a relatively larger dataset, a small subset of people whose tastes are similar to the user's. It finds use in recommendation systems like Amazon product recommendations and Netflix movie recommendations.
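A minimal user-based version of this idea is sketched below. The users, movie ratings, the choice of cosine similarity over shared items, and the rule of recommending items rated 4 or higher are all assumptions for illustration; production recommenders are considerably more involved.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"Inception": 5, "Titanic": 1, "Matrix": 4},
    "bob":   {"Inception": 4, "Titanic": 1, "Matrix": 5, "Interstellar": 5},
    "carol": {"Inception": 1, "Titanic": 5, "Notebook": 5},
}

def cosine_similarity(a, b):
    """Cosine similarity over the items that both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(a[item] ** 2 for item in shared))
    norm_b = sqrt(sum(b[item] ** 2 for item in shared))
    return dot / (norm_a * norm_b)

def recommend(user, min_rating=4):
    """Recommend items the most similar user liked and `user` has not rated."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine_similarity(ratings[user], ratings[u]))
    return sorted(item for item, rating in ratings[nearest].items()
                  if rating >= min_rating and item not in ratings[user])

recommendations = recommend("alice")
print(recommendations)  # ['Interstellar']
```

Alice's ratings line up closely with Bob's and poorly with Carol's, so Bob's highly rated movie that Alice has not seen is recommended.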
46. How would you perform data analytics?
Data analytics involves many steps and interactions with different teams. Successful data analytics involves the following steps:
- Descriptive analytics – This step involves collecting all the previous and current data and obtaining historical trends. It answers the question ‘What happened?’ by summarizing the collected data to find trends and patterns.
- Diagnostic analytics – The next step is diagnostic analytics, where we find the leading cause(s) of the problem. This step answers ‘Why did it happen?’ and helps in making improvements. Machine learning algorithms are well suited to diagnostic analytics.
- Predictive analytics – Once we know the ‘why’, we can perform predictive analytics to understand ‘What will happen next?’ For example, forecasting for the next quarter, semester, or year. This type of analytics uses past and present data to estimate future outcomes.
- Prescriptive analytics – The final step is prescriptive analytics, which uses advanced AI techniques like natural language processing, image recognition, and heuristics to train the model to get better at the tasks at hand. For example, recommendation systems or search engines become better as more and more data is fed to them.
47. What is ensemble learning, and what are its techniques?
Ensemble learning involves using multiple algorithms to get better predictions and outcomes than we would obtain from using only one algorithm. Although this type of learning mostly finds use in supervised learning algorithms like decision trees and random forests, it can also be used with unsupervised algorithms. It involves more computation than a single model to give the best results. Some standard techniques of ensemble learning are:
- Bayes optimal classifier,
- Bayesian model averaging, and
- Bucket of models.
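The basic idea of combining several models can be illustrated with simple majority voting, a technique that is not in the list above but is easy to show in a few lines. The three "models" here are hypothetical threshold rules standing in for trained classifiers.

```python
from collections import Counter

# Three hypothetical weak classifiers: each predicts 1 above its own threshold.
def model_a(x):
    return int(x > 3)

def model_b(x):
    return int(x > 5)

def model_c(x):
    return int(x > 7)

def majority_vote(x, models):
    """Return the class predicted by most of the models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

models = [model_a, model_b, model_c]
predictions = {x: majority_vote(x, models) for x in (2, 4, 6, 8)}
print(predictions)  # {2: 0, 4: 0, 6: 1, 8: 1}
```

With an odd number of binary classifiers there are no ties, and the ensemble is wrong only when a majority of its members are wrong at the same time.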
48. Suppose you build a model and train it. Now, you feed the test data, but the results are inaccurate. What could be the possible reasons for this?
There could be many reasons, such as:
- Wrong feature selection.
- Wrong choice of algorithm.
- The model is overfitting or underfitting.
49. What is an eigenvector?
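It is a non-zero vector whose direction does not change (it stays on the same line) when a linear transformation is applied to it. The vector is only scaled by a factor called the eigenvalue. Eigenvectors and eigenvalues are used in many applications in machine learning and computer vision.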
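This property can be checked by hand on a small matrix. The 2×2 matrix below is an arbitrary example whose eigenvectors are easy to verify, and the helper names are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def is_parallel(u, v):
    """True if two 2-D vectors lie on the same line through the origin."""
    return u[0] * v[1] - u[1] * v[0] == 0

A = [[2, 1],
     [1, 2]]

v1 = [1, 1]     # eigenvector with eigenvalue 3
v2 = [1, -1]    # eigenvector with eigenvalue 1
w = [1, 0]      # not an eigenvector

print(mat_vec(A, v1))  # [3, 3]  -> 3 * v1, same direction
print(mat_vec(A, v2))  # [1, -1] -> 1 * v2, same direction
print(mat_vec(A, w))   # [2, 1]  -> direction changed
```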
50. What is the difference between a data scientist and a data analyst?
Data analysts work very closely with data and have strong technical knowledge - of programming languages, SQL, statistics, etc. - to apply tools and techniques and find trends and patterns in the data. The question for which data has to be analyzed comes to a data analyst through a data scientist.
Data scientists identify problems and questions and present analysts' results to stakeholders and business analysts. They should know machine learning algorithms, have strong business acumen, and possess effective communication and presentation skills.
To crack the data science interview, you must familiarize yourself with the tools and techniques. Also, stay updated about the new tools available in the market for data science. You should also have a good understanding of machine learning and its algorithms and be able to explain how you have used some of them in your previous projects (if any).
Besides this, it would be a good idea to know about a few excellent books on data science and read at least one of them to thoroughly familiarize yourself with the topics. Sometimes interviewers do ask about the data science books you follow!
Hope this article helps you learn about the most essential and practical data science interview questions companies ask to hire data scientists. Happy learning!