Data science is one of the top career options right now, and the average salary of a data scientist can go up to $117,345 per annum. Cracking a data science interview, however, is not easy. There are a variety of technical and non-technical data science interview questions that the interviewer may ask you to test your overall skills, which include analytical, programming, statistics, domain, creativity, and communication skills.
Some interviews could be mixed, i.e. the interviewer could ask you questions on both data science and machine learning, so you need to be well-prepared. We will be soon covering all the important machine learning interview questions in a separate article. For programming language interview questions, refer to our specific page about interview questions.
Here are a few generic data science interview questions that are important and are asked in most of the interviews:
- Why did you choose data science as your career path?
- What are some of the tools you like to work with for data science?
- What have you learned about data science in your previous projects?
- Have you ever faced challenges in any stage of data science? If yes, how did you overcome them?
These data science interview questions are subjective and can have different answers based on individual experiences. For example, for the first one, you can say that you have always liked mathematics and love to solve analytical problems.
Remember that the interviewer’s intention in asking these questions is to test your motivation and involvement in the subject. If you are just beginning your career in data science, you can talk about any courses or certifications you did on data science and mention the projects, challenges you faced, etc.
Top 50 Data Science Interview Questions
Let us now discuss the top 50 data science interview questions and answers asked in most interviews. So, let’s start.
Question: What is data science? What are the steps involved in data science?
Answer: Web definitions usually start with “data science is an interdisciplinary field.” However, you can answer the question in your own words rather than just reciting the definition.
Data science is the process or study of huge amounts of data collected from numerous sources. It helps businesses gain useful insights about customer needs, preferences, behaviors, and patterns, and helps predict the future course of action based on past and present data. The steps involved in data science are:
- Problem definition – The problem is identified and defined. For example, “Why did the product sales reduce in the last quarter?”
- Data collection – Data is collected from various sources, like databases and surveys.
- Data wrangling or data preparation – Removal of redundancy and outliers, completing missing data and so on to filter relevant data.
- Machine learning and data mining – Various algorithms are built, trained and tested to analyze data and produce useful insights.
- Data visualization – View the insights gained in a readable format for non-technical (business) persons as well.
- Predictions and generating solutions – In this step, business analysts and other stakeholders decide the next course of action for the business problem defined in the first step.
Question: What is the difference between data science and machine learning?
Answer: Data science involves a lot of stages from problem definition, data collection, cleaning, analysis, visualization, prediction, etc. Machine learning, on the other hand, is just one step in the data science lifecycle where the data is modeled for analysis and obtaining insights.
Question: How are data science and artificial intelligence related?
Answer: Artificial intelligence involves training machines to behave and make decisions similar to humans. Machine learning is a subset of AI that is extensively used in data science to perform various analytical operations. ML is the main link between data science and AI.
Question: Which is the best language for data science and why?
Answer: You can choose your preferred language or the one that you have used the most. But the reason should be more than just saying, “because I have worked only with this language.”
For example, if you say that your favorite language is Python, you can say that it has a lot of libraries that cater to data science operations, making it easy to code. Further, Python has a very easy syntax and enables faster programming.
Question: What are the tools and techniques used for data science?
Answer: There are many tools used in each stage of data science. For example, for data analysis, we use various tools like Python, R, KNIME, Excel, and SAS. Similarly, for visualization, Tableau, D3, and ggplot2 are the tools available.
Note: Check out the list of the most popular data science tools.
Question: If you have a dataset of 1 lakh (100,000) students who have applied for a particular course, how would you fetch only those students who satisfy all the eligibility criteria for the course?
Answer: This can be easily done by using a SQL query. Suppose the eligibility criteria are that the students should obtain a minimum of 75% marks and should be aged between 20 and 23 years. Let us write a simple SQL query for the same:
SELECT * FROM student WHERE avg_marks >= 75 AND student_age BETWEEN 20 AND 23;
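To try the query without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the `student_name` column, and the four sample rows are assumptions made up for illustration; only `avg_marks` and `student_age` come from the query above.

```python
import sqlite3

# Hypothetical schema and rows, invented only to exercise the query above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_name TEXT, avg_marks REAL, student_age INTEGER)"
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [
        ("Asha", 82.5, 21),   # eligible
        ("Ravi", 74.9, 22),   # marks too low
        ("Meera", 91.0, 24),  # too old
        ("Kabir", 75.0, 20),  # eligible (both bounds are inclusive)
    ],
)


def eligible_students(connection, min_marks=75, min_age=20, max_age=23):
    """Return the names of students who meet every eligibility criterion."""
    rows = connection.execute(
        "SELECT student_name FROM student "
        "WHERE avg_marks >= ? AND student_age BETWEEN ? AND ? "
        "ORDER BY student_name",
        (min_marks, min_age, max_age),
    )
    return [name for (name,) in rows]


print(eligible_students(conn))
```

Passing the thresholds as query parameters, rather than hard-coding them, makes it easy to change the criteria without rewriting the query.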
Question: What is the difference between data wrangling and data transformation?
Answer: There is a thin line of difference between data wrangling and data transformation (ETL). Important differences between the two are:
|Data wrangling||Data transformation|
|Involves exploring the raw data and preparing it.||The data is usually received in a well-structured format and is transformed into another format.|
|Used by data analysts, line managers, and business analysts who want specific answers.||Used by IT professionals who extract, transform and load data to the required data store and in the specific format.|
|Popular tools are Apache Spark, Hadoop, MS Excel, SAS Data Preparation, and PowerBI.||MSSQL Server, SAS Data Integration Studio, and Matillion are some of the popular data transformation tools.|
Question: What is data mining? How is it useful?
Answer: Data mining is the process of analyzing data from various sources to find patterns and useful insights for making predictions. It is useful for enhancing marketing, building customized user experience, understanding customer requirements better, improving sales, and so on.
Let’s understand this with an example. Let us say you own a coffee shop, and the business is growing well. You now want to expand the business by adding new products to provide a better experience for coffee lovers.
How would you know what products will sell the most? For doing that, you have to study how customers react to a particular product and what they like to eat most of the time, along with coffee. All this information and more has to be analyzed to make a prediction. This process is called data mining.
Question: What are some of the techniques used for data cleaning?
Answer: Following are some of the common data cleaning techniques:
- Remove redundant or duplicate values.
- Fill missing values and remove irrelevant values.
- Remove or correct data with typos and wrong inputs and information.
- Correct the data types and assign appropriate values, like N/A, for values that don’t fit into any data type category.
Question: Tell us about the most common applications of data science.
Answer: The most common applications of data science are recommender systems, speech and image recognition (like Alexa, Siri, and Cortana), digital marketing, healthcare systems, and logistics.
Note: For a complete list with examples, check our top 10 data science applications.
Question: A company’s sales for a quarter dropped from 73% to 58%. How would you find out the possible causes?
Answer: The reduction in sales could be due to many factors. To determine the same, we can analyze the data for the particular quarter or the previous two quarters. Doing this will help us understand the various factors that led to the decline.
It could be because of a competitor offering a better price for the same product or service, and/or change in company policies or structure, and/or no new value addition or good offers for the customers to stay, et cetera.
Question: What are outliers? How can you treat them?
Answer: An outlier is an extremely high or low value in the data that lies outside of the region where the other data points in the data sample lie. An outlier can be a real value or just a mistake in the input data. The following approaches help to handle outliers:
- Remove the records with outliers – If you think outliers can produce skewed results in your data, remove them.
- Correct the value – If you think that the outlier is just a mistake in input data, correct the value. For example, someone might have entered 2000 in place of 20.00.
- Cap the outlier data – Set a maximum or minimum level for the particular field where the outlier is present.
- Transform the data – You can use a transformation of the data for the particular field rather than the data itself. For example, you can create an average of the values and use the average instead of individual values.
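As a rough illustration of detecting and capping outliers, the sketch below uses the interquartile range (IQR) rule. The IQR rule and the multiplier of 1.5 are a common convention chosen here as an assumption, not something the steps above prescribe, and the `prices` list is made-up data.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


def cap_outliers(values, k=1.5):
    """Clamp every value into the [lower, upper] range."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical prices; 2000 was probably meant to be 20.00.
prices = [20, 21, 19, 22, 20, 18, 21, 2000]
print(find_outliers(prices))
print(cap_outliers(prices))
```

Whether to remove, correct, or cap a flagged value still depends on whether it is a genuine reading or an input mistake.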
Question: What is big data? How is raw data transformed into a more usable form?
Answer: Big data refers to the huge volumes of structured or unstructured data that is collected and stored in different data stores to be processed later. There are many tools to transform raw data into a more usable form. Some of the most popular ones are Microsoft SSIS, SAS Data Integration Studio, Oracle Data Integrator, and Informatica PowerCenter.
Question: Suppose in your data collected, 25% of data is missing. What can you do to correct it?
Answer: If the dataset is huge and the values are missing at random, the records with missing data may simply be dropped, as the remaining data is still enough for analysis. If it is a relatively small dataset, the missing values can be estimated using the mean or median of the other values. Python has the pandas library, which provides functions for the mean, median, and other statistical measures.
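Here is a minimal sketch of filling missing values with the mean, written with only the standard library so that it runs anywhere. The `ages` list and the use of `None` to mark a missing value are assumptions for illustration; with pandas, the same idea is usually expressed with its `fillna` and `mean` methods.

```python
import statistics


def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column in which 2 of 8 values (25%) are missing.
ages = [22, None, 25, 21, None, 24, 23, 23]
print(impute_with_mean(ages))
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is best treated as a quick baseline.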
Question: What are the steps you would follow to understand the shopping preferences of people between the age of 30 and 40?
Answer: First, we need to collect the data of the target audience (30-40 years). Next, we need to filter the data and save only the relevant information like their previous purchase history, page likes, browsing data, age, etc.
Next, we need to transform the data so that we can apply some statistical analysis methods on the data to find common trends and patterns. Lastly, we can use a visualization tool to view various outcomes and insights.
Question: What are the differences between overfitting and underfitting?
Answer: Overfitting and underfitting are two common modeling errors that can occur when machine learning algorithms are applied to a training data set. Thus, both conditions are responsible for the poor performance of machine learning algorithms. The following table enumerates the various important differences between the two:
|Overfitting||Underfitting|
|The model starts learning even from noise and inaccurate or irrelevant data entries.||The model is unable to capture or fit the underlying trend of data.|
|It happens when the model is too complex for the data supplied to it. Because it fits too many details and noise, it categorizes new data incorrectly.||Happens when there is not enough data to build the model or while trying to build a linear model with a non-linear set of data.|
|The model performs well on the training data but poorly on unseen data.||The model performs poorly on both training and unseen data, making a lot of mistakes and wrong predictions.|
|It can be avoided by pruning, regularization, or cross-validation of data.||Can be prevented by feeding the model more data, adding relevant features, or using a more flexible model.|
Question: What is sampling? What are the differences between cluster sampling and systematic sampling?
Answer: Sampling is selecting a specific part of the data from a huge set. Based on the sample, inferences about the entire data can be made. The sample can be selected using probability and randomization or non-probabilistic methods like research-based judgment. The sample size should be ideal, neither too big nor too small.
- Cluster sampling – In this method, the data scientist or analyst divides the data into different clusters. A random sample of clusters is then selected, and the sampled clusters are analyzed to produce an outcome for the entire data.
- Systematic sampling – In this method, samples are collected from a random starting point, but the interval is fixed and periodic. Called the sampling interval, it is determined by dividing the population size by the desired sample size.
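The two methods can be sketched in a few lines of Python. In this sketch, cluster sampling is taken to mean randomly choosing whole clusters and keeping every member of the chosen ones, and the sampling interval is taken to be the population size divided by the sample size. The student IDs and school names are made-up data.

```python
import random


def systematic_sample(items, sample_size, rng):
    """Pick every k-th item after a random start, where k is the sampling interval."""
    interval = len(items) // sample_size   # sampling interval
    start = rng.randrange(interval)        # random starting point
    return items[start::interval][:sample_size]


def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and return every member of the chosen ones."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}


rng = random.Random(42)  # fixed seed so the sketch is repeatable

students = list(range(1, 101))  # hypothetical student IDs 1..100
print(systematic_sample(students, 10, rng))

# Hypothetical clusters: student IDs grouped by school.
schools = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8], "west": [9]}
print(cluster_sample(schools, 2, rng))
```

The systematic sampler assumes the sample size is no larger than the population; otherwise the interval would be zero.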
Question: What types of biases can occur during data sampling?
Answer: There are many forms of biases that can occur during data sampling. The most common are:
- Confirmation bias – This can occur when analysts want to prove or confirm an assumption or support a particular conclusion that has been defined beforehand.
- Selection bias – This type of bias occurs if the data selected is not an accurate representation of the complete data. For example, a survey might be completed by many people, but if the majority of the target audience did not respond, the resulting data is biased.
- Outliers – Outliers are extreme values, i.e. too low or too high. Such values don’t fit with other data points, causing errors in algorithms.
- Overfitting and underfitting – Overfitting occurs when the model is complex enough to fit even noise and inaccurate values. On the other hand, an overly simple model or a lack of data causes underfitting, thus building a model that doesn’t give the real picture of the underlying data.
- Confounding variable – It is a variable that has a hidden impact on the other dependent variables but wasn’t accounted for in the data. There can be many confounding variables in a data set. These can cause errors in judgment rendering the analysis useless.
Question: What is the difference between data analysis and data analytics?
Answer: Both terms are used interchangeably, but there is a subtle difference. Data analytics is a general term that encompasses data analysis as well as other activities like data identification, data extraction and validation, filtering, and making analysis-based predictions. While data analysis examines past data, analytics also includes looking into the future.
Question: What are the steps involved in data analysis?
- Identify and set the goals – What are we trying to achieve by performing the analysis? What is the problem we are trying to solve?
- Setting up the parameters of the measurement – What are we going to measure? What kind of data do we require?
- Data collection and cleaning – Identify and gather data from various possible sources, clean the data to remove inaccuracies, redundancy, duplicates, etc.
- Analyze and interpret the data – Analyze the data using different tools and techniques. Apply algorithms if necessary and visualize the data using various tools to interpret the insights obtained.
Question: How is a training set different from a testing set?
Answer: Huge datasets are divided into training and testing sets, where a large portion is given to the training set. The training set is used to build the machine learning model, while the testing set is used to validate the model on data it has not seen. The training set finds the coefficients, and the same coefficients are then applied to the testing set to measure how accurate the model is.
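The idea can be sketched with a simple straight-line model. The 80/20 split ratio, the helper names, and the generated data are all assumptions for illustration. The coefficients are computed from the training rows only and then evaluated on the held-out test rows.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared gap between actual and predicted y values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Hypothetical data: y = 3x + 2 with a small deterministic wobble.
data = [(x, 3 * x + 2 + (0.5 if x % 2 else -0.5)) for x in range(50)]

train, test = train_test_split(data, test_fraction=0.2)
slope, intercept = fit_line(train)  # coefficients come from training data only
print(len(train), len(test))
print(mean_squared_error(test, slope, intercept))  # error on unseen rows
```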
Question: What are the different types of machine learning algorithms?
Answer: There are three main types of machine learning algorithms – Supervised, Unsupervised, Reinforcement.
Question: List some differences between linear regression and logistic regression.
|Linear regression||Logistic regression|
|Used when dependent variables are continuous, and the line (equation) is linear.||Used when the dependent variable is binary (yes/no, 0/1, true/false).|
|Dependent and independent variables have a linear relationship.||The independent variables are assumed to be linearly related to the log-odds of the outcome rather than to the outcome itself.|
|The independent variables can be correlated with each other.||Independent variables should not be correlated.|
Question: How is machine learning different from deep learning?
Answer: Both machine learning and deep learning are branches of AI. Machine learning finds extensive use in the field of data science to build models and create algorithms that give useful insights into data. ML uses data fed to the system to train a model and then applies the trained model to testing data for validation and improvement.
Deep learning is a further subset of ML that uses neural networks. Neural nets contain multiple layers of ‘neurons’ stacked one above the other similar to how the neurons of the human brain are structured. Deep learning requires more data and more training time, but the neural network selects the features, whereas, in machine learning, feature selection is made by humans.
You can think of deep learning as a more specialized form of ML.
Question: Which algorithm will you use to find out what kind of movies a particular person would prefer to watch?
Answer: A recommendation system is ideal for finding user preferences. Most recommender systems are based on user-based or item-based collaborative filtering algorithms. Nonetheless, matrix factorization and clustering are other popular approaches to find what the user likes.
Question: What is data visualization? What are the tools used to perform data visualization?
Answer: After data is analyzed and the insights are produced, it can be tough for non-technical persons to fully grasp the outcomes. Visualizing the data in a graphical or chart form can enhance the viewing experience and make it easier for business analysts and stakeholders to understand the insights better.
This process where insights are communicated using visual tools is called data visualization. Some of the best data visualization tools are Tableau, D3, PowerBI, FineReport, and Excel. Basic visualization functions that are available in Python and R work just fine too.
Question: What is the difference between logistic regression and Support Vector Machine?
Answer: Both logistic regression and SVM are used to solve classification problems and are types of supervised learning algorithms. Some main differences between both are:
|Support Vector Machine||Logistic regression|
|Used for classification as well as regression problems.||Used for classification problems only.|
|Leverages a hyperplane (decision boundary) to separate data into classes. It uses a kernel to determine the best decision boundary.||Makes use of a sigmoid function to find the relationship between variables.|
|Works well on semi-structured or unstructured data.||Works well with pre-defined independent variables.|
|Less risk of overfitting.||More risk of overfitting.|
|Based on geometric properties.||Based on statistical approaches.|
Question: List some disadvantages of the linear model.
Answer: The linear model assumes that all the relationships between independent and dependent variables are linear, thus over-simplifying a problem that may not be very simple. This may lead to inaccurate results. The model is also sensitive to outliers and noisy data, which can distort the fitted line. A linear model cannot be used in cases where the relationship between independent and dependent variables is non-linear.
Question: State some differences between a decision tree and a random forest.
|Decision tree||Random forest|
|Decision tree acts upon one dataset.||Random forest is a collection of multiple decision trees.|
|The tree is built by considering the entire dataset, taking into account all the variables and features.||Each tree is built from a random sample of observations and a random subset of features (variables), and the predictions of all the trees are combined.|
|Simple and easy to interpret.||Harder to interpret because the randomness cannot be controlled. For example, you cannot control which features each decision tree uses.|
|Less accurate than random forest for unexpected validation dataset.||Higher accuracy that generally improves as more trees are added, before leveling off.|
Question: What is the ROC curve? How does it work?
Answer: The ROC, or Receiver Operating Characteristic, curve shows the diagnostic ability of a binary classifier as a graphical plot. It is created by plotting the true positive rate (TPR), which is the fraction of actual positives that were correctly predicted as positive, against the false positive rate (FPR), which is the fraction of actual negatives that were incorrectly predicted as positive, at various classification thresholds. ROC curves are used extensively in the medical field for testing.
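To make the construction concrete, here is a small sketch that computes the points of a ROC curve by sweeping a threshold over the classifier scores. It takes TPR as TP / (TP + FN) and FPR as FP / (FP + TN). The labels and scores are made-up, and the area under the curve is added as a common summary, which is an extra beyond the answer above.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from the highest score down."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
print(curve)
print(auc(curve))
```

A curve that hugs the top-left corner (area close to 1) indicates a classifier that separates the two classes well, while the diagonal (area 0.5) corresponds to random guessing.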
Question: What is a confusion matrix?
Answer: A confusion matrix is a table used to measure performance for classification problems in machine learning. In a classification problem, the outcome is not a single value but two or more classes. The confusion matrix is a table (or matrix) that compares the predicted positive and negative values with the actual positive and negative values. These values can determine the precision using the formula:
Precision = (TP)/(TP + FP), where TP is true positive and FP is false positive.
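The four cells of a binary confusion matrix, and the precision formula above, can be sketched as follows. The label lists are made-up, and recall and accuracy are included as related measures, which goes slightly beyond the answer.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "TP": counts[(1, 1)],
        "FP": counts[(0, 1)],
        "FN": counts[(1, 0)],
        "TN": counts[(0, 0)],
    }


def precision(c):
    """Share of predicted positives that are actually positive."""
    return c["TP"] / (c["TP"] + c["FP"])


def recall(c):
    """Share of actual positives that were predicted as positive."""
    return c["TP"] / (c["TP"] + c["FN"])


def accuracy(c):
    """Share of all predictions that are correct."""
    return (c["TP"] + c["TN"]) / sum(c.values())


# Hypothetical actual and predicted labels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

c = confusion_counts(y_true, y_pred)
print(c)
print(precision(c), recall(c), accuracy(c))
```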
Question: Suppose you have a pack of 52 cards, and you pick one card. If you know that the card is a face card, then what is the probability that it is a queen?
Answer: This can be calculated using the Bayes formula as shown below:
- There are 4 queens in the pack, so the probability of getting a queen, P(Q) = 4/52.
- We know that a queen is a face card, so P(F|Q) = 1, as a queen will always remain a face card.
- There are 12 face cards in a pack (4 jacks, 4 queens, and 4 kings), which means the probability of getting a face card, P(F), is 12/52.
The probability of getting a queen from the pack of cards when you know that it’s a face card, P(Q|F), is:
P(Q|F) = (P(F|Q) × P(Q)) / P(F)
P(Q|F) = (1 × 4/52) / (12/52) = 4/12 = 1/3
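The same result can be checked by building the deck and counting, using exact fractions to avoid rounding. This is only a sanity check of the arithmetic, and the variable names are chosen for illustration.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

face_cards = [card for card in deck if card[0] in {"J", "Q", "K"}]
queens = [card for card in deck if card[0] == "Q"]

p_q = Fraction(len(queens), len(deck))      # P(Q) = 4/52
p_f = Fraction(len(face_cards), len(deck))  # P(F) = 12/52
p_f_given_q = Fraction(1)                   # every queen is a face card

p_q_given_f = p_f_given_q * p_q / p_f       # Bayes formula
print(p_q_given_f)

# Direct count for comparison: queens among the face cards.
print(Fraction(len(queens), len(face_cards)))
```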
Question: How would you define the bias-variance trade-off?
Answer: The bias-variance trade-off is a property of predictive models: models with high bias in the parameter estimation tend to have lower variance (across samples) and vice-versa. Both bias and variance cause error, and the model complexity at which the total of these errors is smallest is said to be the trade-off point.
Question: What is meant by binary classification?
Answer: In binary classification, the elements of a data set are classified into two groups (categories) by applying a classification rule. For example, positive-negative, true-false, yes-no, etc. Some of the commonly used, simple and effective binary classification algorithms are logistic regression, random forest, and SVM.
Question: What is the normal distribution of data?
Answer: Many datasets approximately follow a normal distribution, which is a bell curve (shaped like a bell) depicting probability distribution. The normal distribution is symmetric about the mean, and data around the mean value occur more frequently than data far from the mean. Some examples of approximately normal data are human height and marks in an exam.
Question: Why are statistics important for data science?
Answer: Statistics provides fundamental tools and techniques to get deeper insights into the data. It also helps us to quantify and analyze data uncertainty with consistent results. Statistics is used in several stages of data science, namely problem definition, data acquisition, data exploration, analysis, modeling, validation, and reporting.
Question: Do you know about covariance and correlation?
Answer: Both covariance and correlation measure the dependency and relationship between two variables. However, some important differences between the two are:
|Covariance||Correlation|
|Indicates only the direction of the linear relationship between the variables.||Indicates both strength and direction of the relationship.|
|Measured in the product of the units of the two variables.||Not measured in units; its value always lies between -1 and 1.|
|Calculated as the expected value of the product of the deviations of the two variables from their expected values.||The correlation coefficient of two variables can be determined by dividing their covariance by the product of the standard deviations of the variables.|
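A short sketch makes the unit difference visible. It uses the sample versions of both measures (dividing by n - 1), which is an assumption, and made-up study hours and marks. Rescaling the marks changes the covariance but leaves the correlation untouched.

```python
import math


def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


# Hypothetical study hours and exam marks.
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]
marks_out_of_1000 = [m * 10 for m in marks]  # same data on a different scale

print(covariance(hours, marks), covariance(hours, marks_out_of_1000))
print(correlation(hours, marks), correlation(hours, marks_out_of_1000))
```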
Question: What is the p-value?
Answer: The p-value is used in hypothesis testing. It is the probability of obtaining results at least as extreme as the ones actually observed, assuming that the null hypothesis is correct. In other words, it is the smallest significance level at which the null hypothesis would be rejected. P-values can be obtained from p-value tables or statistical software.
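As an illustration, consider a made-up experiment: a coin is tossed 100 times and lands heads 60 times, and the null hypothesis is that the coin is fair. The sketch below computes an exact two-sided p-value from the binomial distribution. The function name and the convention of summing every outcome that is no more likely than the observed one are assumptions chosen for this example.

```python
from math import comb


def binomial_two_sided_p_value(successes, trials, p_null=0.5):
    """Exact two-sided p-value: total probability, under the null hypothesis,
    of every outcome that is no more likely than the one observed."""

    def pmf(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # The tiny tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-12))


p_value = binomial_two_sided_p_value(60, 100)
print(p_value)
print(p_value < 0.05)  # would the null hypothesis be rejected at the 5% level?
```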
Question: Define regularization and explain why it is important.
Answer: Regularization refers to techniques used to avoid overfitting on the given training set and to reduce the error on new data, typically by adding a penalty on model complexity.
Overfitting happens when the learning algorithm also takes the noise into consideration along with the correct data. Thus, the model fits both the pattern and the noise (error or deviation). Regularization aims to keep the model from fitting the noise along with the pattern.
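One widely used form is L2 (ridge) regularization, which adds a penalty proportional to the square of the coefficient. The sketch below shows it for the simplest possible case, a line through the origin with a single coefficient. The model, the `penalty` parameter name, and the data are assumptions for illustration.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Slope w of the no-intercept line y = w * x that minimises
    sum((y - w * x)^2) + penalty * w^2  (L2 / ridge regularization).
    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + penalty)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


# Hypothetical noisy readings of roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for penalty in (0.0, 1.0, 10.0):
    print(penalty, fit_slope(xs, ys, penalty))
```

A larger penalty shrinks the coefficient toward zero, trading a little extra bias for lower variance. The penalty strength is normally chosen by cross-validation.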
Question: What is a confounder variable?
Answer: A confounder or confounding variable is an external variable that influences both the independent and dependent variables, distorting the apparent relationship between them. Because of such an influence, the entire experiment can go wrong and yield useless outcomes. To avoid such problems, confounding variables need to be identified and considered well before starting the experiment.
Question: Give some differences between R and Python.
Answer: R and Python are two of the most powerful languages for data science, particularly for data analysis. The following table lists the major differences between the two:
|Python||R|
|Easy to learn and interpret. Features simple and readable text.||Needs more time to learn, understand the syntax, and write programs.|
|More popular than R.||Number of users switching from R to Python is more.|
|Used mostly for programming by developers.||Can be used by researchers and analysts for scholarly and research purposes.|
|Has a rich set of libraries, but less than R.||Has extensive libraries that are easily available.|
|Good documentation and huge community support.||Good documentation, GitHub repos, and wide community support.|
|Simple and understandable graphs, though not as fancy as R.||Great graphical depictions using RStudio.|
For developers, Python is the preferred language to start with. However, if you do more statistical analysis and need excellent graphs for reporting, R should be your choice.
Question: What is the fastest and most accurate method of data cleaning?
Answer: There are many ways to perform data cleaning. The answer to this question can be subjective depending upon the tool(s) you have previously used. For example, you can perform data cleaning by writing code in Python or any other language, or you could do it at the database level using SQL queries. MS Excel is another excellent tool for filtering, correcting and cleaning data.
However, SQL and Python (code-level) can run into performance issues on very large data, and Excel works only on medium or small datasets. For huge datasets, distributed tools such as Apache Spark and Hadoop work faster and give good results. Some other efficient data cleaning tools are PowerBI and Alteryx.
Question: What are the main types of statistical analysis techniques?
Answer: There are 3 main types of statistical analysis techniques:
- Univariate – The simplest type of analysis where there is only one variable, i.e. only one value changes. The purpose of the analysis is just to describe the data and find patterns. Also, it doesn’t deal with any cause or relationships in the underlying data.
- Bivariate – As the name suggests, this technique involves two variables. It deals with cause and relationships. For example, will the sales of air conditioners go up during summer or winter? Through the analysis, we can find the relationship between the two variables involved.
- Multivariate – When data contains more than two variables, this statistical analysis technique is used. The focus of the analysis depends on the questions that need to be answered through analysis. For example, a company trying to promote four products and analyzing which one has a better response.
Question: What is a false negative? How does it impact accuracy?
Answer: A false negative is a situation where a test returns a negative result that turns out to be wrong, i.e. you should have got a positive result, but you got a negative one. For example, a pregnancy test might show negative during the very early stages even though the person is pregnant.
Each false negative is a wrong prediction, so it lowers accuracy. More importantly, a system that you think is safe is not actually safe, which may lead to serious consequences in later stages. For example, if a quality check on some raw materials passes when it should fail, the defective material goes through, leading to a bad product later on.
Question: What is collaborative filtering? Where is it used?
Answer: It is a technique that filters content or products (or any other items) to find those that a particular user may like, based on the preferences of users with similar interests. Collaborative filtering is done by finding a small subset of similar people from a relatively larger dataset. It finds use in recommendation systems like Amazon product recommendations and Netflix movie recommendations.
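A minimal sketch of user-based collaborative filtering is shown below. It assumes cosine similarity over co-rated items and a similarity-weighted average of ratings, which is one of several possible designs. The users, movies, and ratings are made-up data.

```python
import math

# Hypothetical ratings (1-5); a missing key means the user has not rated the item.
ratings = {
    "ana": {"Inception": 5, "Interstellar": 4, "Titanic": 1},
    "ben": {"Inception": 5, "Interstellar": 5, "Tenet": 5, "The Notebook": 2},
    "chloe": {"Titanic": 5, "Inception": 1, "Tenet": 2, "The Notebook": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=1):
    """Rank unseen items by the similarity-weighted average rating of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda item: scores[item] / weights[item], reverse=True)
    return ranked[:top_n]


print(recommend("ana", ratings))
```

Here ana's tastes are much closer to ben's than to chloe's, so ben's ratings carry more weight in the ranking.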
Question: How would you perform data analytics?
Answer: Data analytics involves many steps and interactions with different teams. Successful data analytics involves the following steps:
- Descriptive analytics – In this step, all the previous and current data is collected, and historical trends are obtained. This is to determine an answer to the question ‘What happened?’. It is just a summarization of the collected data to find trends and patterns.
- Diagnostic analytics – The next step is diagnostic analytics, where we find the main cause(s) of the problem. This step answers ‘Why did it happen?’ and helps in improving. This step is done using machine learning algorithms.
- Predictive analytics – Once we get to know the ‘why’, we can perform predictive analytics to understand ‘What will happen next?’ For example, forecasting for the next quarter, semester or year. This type of analytics uses past and present data to forecast the future.
- Prescriptive analytics – The final step is prescriptive analytics that uses advanced AI techniques like natural language processing, image recognition, and heuristics to train the model to get better at the tasks at hand. For example, recommendation systems or search engines become better as more and more data is fed to them.
Question: What is ensemble learning, and what are the techniques for it?
Answer: Ensemble learning involves using multiple algorithms to get better predictions and outcomes than we would obtain from using only one algorithm. Although this type of learning mostly finds use in supervised learning algorithms like decision trees and random forests, it can also be used with unsupervised algorithms. It involves more computation than using a single model to give the best results. Some common techniques of ensemble learning are:
- Bayes optimal classifier,
- Bayesian model averaging, and
- Bucket of models.
Question: Suppose you build a model and train it. Now, you feed the test data, but the results are not accurate. What could be the possible reasons for this?
Answer: There could be many reasons, such as:
- Wrong feature selection.
- Wrong choice of algorithm.
- Model is overfitting or underfitting.
Question: What is an eigenvector?
Answer: It is a vector whose direction does not change when a linear transformation is applied; it is only scaled by a factor called the eigenvalue. Eigenvectors, along with eigenvalues, are used in many applications in machine learning and computer vision.
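A tiny numeric check shows what this means. The matrix and vectors below are chosen purely for illustration: multiplying the matrix by an eigenvector gives back a scaled copy of the same vector, while any other vector changes direction.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


A = [[2, 1],
     [1, 2]]

v = [1, 1]   # eigenvector of A with eigenvalue 3
u = [1, -1]  # eigenvector of A with eigenvalue 1
w = [1, 0]   # not an eigenvector of A

print(mat_vec(A, v))  # [3, 3]: v scaled by 3
print(mat_vec(A, u))  # [1, -1]: u scaled by 1
print(mat_vec(A, w))  # [2, 1]: points in a different direction from w
```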
Question: What is the difference between a data scientist and a data analyst?
Answer: Data analysts work very closely with data and have strong technical knowledge – of programming languages, SQL, statistics, etc. – to apply tools and techniques and find trends and patterns in the data.
The question for which data has to be analyzed comes to a data analyst through a data scientist. Data scientists identify problems and questions and present the results of analysts to stakeholders and business analysts. They should know machine learning algorithms, have strong business acumen, and possess effective communication and presentation skills.
To crack the data science interview, you need to familiarize yourself with the data science tools and techniques. Also, stay updated about the new tools available in the market for data science. You should also have a good understanding of machine learning and its algorithms and be able to explain how you have used some of them in your previous projects (if any).
Besides this, it would be a good idea to know about a few good books on data science and read at least one of them to familiarize yourself with the topics. Sometimes interviewers do ask about the data science books you follow!
Hope this article helps you learn about the most important and practical data science interview questions asked by companies to hire data scientists. Happy learning!