Data science is one of the top career options right now, and the average salary of a data scientist can go up to $117,345 per annum. Cracking a data science interview is not easy: the interviewer may ask a variety of technical and non-technical questions to test your overall skill set – analytical, programming, statistics, domain knowledge, creativity, and communication.
Some interviews are mixed, i.e., the interviewer may ask questions on both data science and machine learning, so you need to be well-prepared for both. We will soon cover all the important machine learning interview questions in a separate article. For programming language interview questions, refer to our dedicated interview questions page.
While this article mostly focuses on the technical and business-related data science interview questions, here are a few bonus questions that are important and asked in most interviews –
- Why did you choose data science as your career path?
- What are some of the tools you like to work with for data science?
- What have you learned about data science in your previous projects?
- Have you ever faced challenges in any stage of data science? If yes, how did you overcome them?
These data science interview questions are subjective and can have different answers based on individual experience. For the first one, for example, you can say that you have always liked maths and love solving analytical problems. Remember that the interviewer's intention in asking these questions is to test your motivation and involvement in the subject. If you are just beginning your career in data science, you can talk about any courses or certifications you have completed, and mention your projects, the challenges you faced, and so on.
Top 50 Data Science Interview Questions
Let us now focus on the top 50 data science interview questions asked in most interviews, along with answers that will give you an edge over other candidates –
Question: What is data science? What are the steps involved in data science?
Answer: Usually, the web definition starts as ‘data science is an interdisciplinary field…’.
However, you can answer the question in your words rather than just giving the definition.
Data science is the process or study of huge amounts of data collected from numerous sources; it helps businesses gain useful insights about customer needs, preferences, behaviors, and patterns, and helps predict the future course of action based on past and present information. The steps involved in data science are –
- Problem definition – The problem is identified and defined. For example, “why did the product sales reduce in the last quarter?”
- Data collection – data is collected from various sources like databases, surveys, etc…
- Data wrangling or data preparation – remove redundancy, outliers, missing data, and so on, to retain only the relevant data
- Data mining and machine learning – various algorithms are built, trained, and tested to analyze the data and produce useful insights
- Data visualization – to view the insights in a readable format that can be understood by non-technical (business) persons as well
- Predictions and generating solutions – In this step, business analysts and other stakeholders decide the next course of action for the business problem defined in the first step
Question: What is the difference between data science and machine learning?
Answer: Data science involves a lot of stages from problem definition, data collection, cleaning, analysis, visualization, prediction, etc, whereas machine learning is just one step in data science where the data is modeled for analysis and obtaining insights.
Question: How is machine learning related to artificial intelligence?
Answer: Artificial intelligence involves training machines to behave and make decisions similar to humans. Machine learning is a subset of AI that is extensively used in data science to perform various analytical operations. ML is the main link between data science and AI.
Question: Which is the best language for data science and why?
Answer: You can choose your preferred language or the one that you have used most. But the reason should be more than just saying, “because I have worked only with this language”.
For example, if you say that your favorite language is Python, you can say that it has a lot of libraries that cater to data science operations, making it easy to code. Further, Python has a very easy syntax and enables faster programming.
Question: What are the tools and techniques used for data science?
Answer: There are many tools used in each stage of data science – for example, for data analysis, various tools like Python/R, KNIME, Excel, SAS, etc are used. Similarly, for visualization, Tableau, D3, Ggplot2 can be used. Check out the list of most popular data science tools.
If asked, you can mention 1-2 lines about a tool.
Question: If you have a dataset of 1 lakh students, who have applied for a particular course, how would you fetch only those students, who satisfy all the eligibility criteria to apply for the course?
Answer: This can be easily done using a SQL query. Suppose the eligibility criteria are that the students should have obtained a minimum of 75% marks and be aged between 20 and 23 years. Let us write the simple query –
select * from student where avg_marks >= 75 and student_age between 20 and 23;
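The same filter can be sketched in Python as well; this is a toy illustration using a list of dictionaries, with hypothetical records and the same column names as the query above (on a dataset of 1 lakh rows, a database query or a pandas DataFrame would be more practical):

```python
# Hypothetical student records mirroring the SQL table above
students = [
    {"name": "Asha", "avg_marks": 82, "student_age": 21},
    {"name": "Ravi", "avg_marks": 70, "student_age": 22},
    {"name": "Meena", "avg_marks": 91, "student_age": 25},
]

# Keep only students meeting both eligibility criteria
eligible = [s for s in students
            if s["avg_marks"] >= 75 and 20 <= s["student_age"] <= 23]

print([s["name"] for s in eligible])  # only Asha satisfies both conditions
```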
Question: What is the difference between data wrangling and data transformation?
Answer: There is a thin line of difference between data wrangling and transformation (ETL) process –
| Data wrangling | Data transformation |
| --- | --- |
| Involves exploring the raw data and preparing it | Data is usually received in a well-structured format and transformed into another format |
| Data is used by data analysts, line managers, and business analysts who want specific answers | Data is used by IT professionals who extract, transform, and load data into the required data store in the prescribed format |
| Some tools used are Apache Spark/Hadoop, MS Excel, SAS Data Preparation, PowerBI, etc. | Some tools used are MS SQL Server, SAS Data Integration Studio, Matillion, etc. |
Question: What is data mining? How is it useful?
Answer: Data mining is the process of analyzing data from various sources to find patterns and useful insights for making predictions. It is useful for better marketing, building customized user experience, understanding customer requirements better, improving sales and so on.
You can give an example – let us say, you own a coffee shop and the business is growing well. You now want to expand the business by adding new products to provide a better experience for coffee lovers. How would you know what products will sell the most? You have to study how customers react to a particular product, and what they like to eat most along with coffee. All this information and more has to be analyzed to make a prediction – this process is called data mining.
Question: What are some of the techniques used for data cleaning?
Answer: Some of the common techniques are –
- Remove redundant/duplicate values
- Fill missing values and remove irrelevant values
- Remove/correct data with typos and wrong inputs and information
- Correct the data types and assign appropriate values (like NA) for values that don’t fit into any data type category
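The techniques above can be sketched in plain Python. This is a toy illustration with made-up records and rules – in practice, a library like pandas would handle deduplication, missing values, and type conversion:

```python
raw = [
    {"id": 1, "age": "23"},
    {"id": 1, "age": "23"},   # duplicate record
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": "abc"},  # typo / wrong input
]

def clean(records):
    seen, cleaned = set(), []
    for r in records:
        key = (r["id"], r["age"])
        if key in seen:          # remove duplicate values
            continue
        seen.add(key)
        age = r["age"]
        try:
            age = int(age)       # correct the data type
        except (TypeError, ValueError):
            age = None           # assign an NA-like value for bad inputs
        cleaned.append({"id": r["id"], "age": age})
    return cleaned

print(clean(raw))
```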
Question: Tell us about the most common applications of data science.
Answer: The most common applications are recommender systems, speech and image recognition (like Alexa/Siri/Cortana), digital marketing, healthcare systems, logistics, etc… For a complete list with examples, check our top 10 data science applications.
Question: A company’s sales for a quarter dropped from 73% to 58%. How would you find out the possible causes?
Answer: The reduction in sales could be due to many factors. We can analyze the data for that quarter, and for the previous couple of quarters, to understand the factors that led to the decline – it could be a competitor offering a better price for the same product/service, a change in company policies or structure, or a lack of new value additions or offers that give customers a reason to stay.
Question: What are outliers? How can you treat them?
Answer: An outlier is an extremely high or low value in the data that lies outside of the region where the other data points in the data sample lie. An outlier can be a true one or just a mistake in the input data. Outliers can be handled in the following ways –
- Remove the records with outliers – if you think outliers can produce skewed results in your data, remove them.
- Correct the value – if you think that the outlier is just a mistake in input data, correct the value. For example, someone might have entered 2000 as 20.00.
- Cap the outlier data – set a maximum or minimum level for the particular field where the outlier is present.
- Transform the data – you can use a transformation of the data for the particular field rather than the data itself. For example, you can create an average of the values, and use the average instead of individual values.
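As a rough sketch of the capping approach, here is one common rule of thumb (the 1.5 × IQR fences) implemented with Python's standard library; the data values are hypothetical:

```python
import statistics

def cap_outliers(values):
    """Cap values outside the 1.5*IQR fences (one common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(v, low), high) for v in values]

# 2000 is likely a mis-typed 20.00; capping pulls it back to the upper fence
data = [10, 11, 12, 12, 13, 12, 11, 13, 10, 2000]
capped = cap_outliers(data)
print(capped)
```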
Question: What is big data? How is raw data transformed into a more usable form?
Answer: Big data refers to the huge volumes of structured or unstructured data that are collected and stored in different data stores to be processed later – volumes too large for traditional data-processing tools to handle efficiently.
There are many tools to transform raw data into a more usable form – Microsoft SSIS, SAS Data Integration Studio, Oracle Data Integrator, Informatica PowerCenter, etc…
Question: Suppose 25% of the data you collected is missing. What can you do to correct it?
Answer: If the dataset is huge, the records with missing values may simply be removed, as dropping them won't significantly affect the overall analysis.
If it is a relatively smaller dataset, the missing values can be imputed using, for example, the mean or median of the other values. Python's pandas library provides functions such as mean() and fillna() for this.
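A minimal sketch of mean imputation using only Python's standard library (with pandas, `df.fillna(df.mean())` achieves the same on a DataFrame); the scores are hypothetical:

```python
from statistics import mean

scores = [72, None, 85, 90, None, 78]  # hypothetical column with missing values

observed = [v for v in scores if v is not None]
fill = mean(observed)  # mean of the observed values: 81.25
imputed = [v if v is not None else fill for v in scores]
print(imputed)  # [72, 81.25, 85, 90, 81.25, 78]
```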
Question: What are the steps you would follow to understand the shopping preferences of people between the age of 30 to 40?
Answer: First, we need to collect the data of the target audience (30-40 years). Next, we need to filter the data and save only the relevant information like their previous purchase history, page likes, browsing data, age, etc. Then we need to transform the data so that we can apply some statistical analysis methods and find common trends and patterns on the data. Lastly, we can use a visualization tool to view various outcomes and insights.
Question: What are the differences between overfitting and underfitting?
Answer: Overfitting and underfitting are two common modeling errors that can occur when Machine learning algorithms are applied to a training data set. Thus, both conditions are responsible for the poor performance of machine learning algorithms.
| Overfitting | Underfitting |
| --- | --- |
| The model learns even from noise and inaccurate or irrelevant data entries | The model is unable to capture or fit the underlying trend of the data |
| Happens when the model picks up more detail than required; too many details and noise lead to incorrect categorization | Happens when there is not enough data to build the model, or when a linear model is fitted to a non-linear set of data |
| The model fits the training data too closely and so builds unrealistic models | The model trains with too little information and so is prone to many mistakes and wrong predictions |
| Can be avoided by pruning, regularization, or cross-validation | Can be prevented by feeding more data or by selecting only the required features, i.e., reducing the total number of features |
Question: What is sampling? What are the differences between cluster sampling and systematic sampling?
Answer: Sampling is selecting part of the data from a huge set. Based on the sample, assumptions about the entire data can be made. The sample size can be selected based on probability and randomization or non-probabilistic methods like research-based judgment. The sample size should be ideal – not too big or too small.
- Cluster sampling – in this method, the data scientist/analysts divide the data into different clusters. From each cluster, a random sample is collected and the sampled clusters are analyzed to produce an outcome for the entire data.
- Systematic sampling – In this method, samples are collected from a random starting point, but the interval is fixed and periodic. This interval is called the sampling interval. It is determined using the desired sample size.
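Systematic sampling, for instance, can be sketched in a few lines of Python; the population here is just hypothetical record numbers:

```python
import random

population = list(range(1, 101))  # hypothetical records numbered 1..100

def systematic_sample(pop, size):
    """Collect samples at a fixed, periodic interval from a random start."""
    interval = len(pop) // size           # the sampling interval
    start = random.randrange(interval)    # random starting point
    return pop[start::interval][:size]

random.seed(0)
sample = systematic_sample(population, 10)
print(sample)  # every 10th record, beginning at a random offset
```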
Question: What types of biases can occur during data sampling?
Answer: Many biases can occur. The most common are –
- Confirmation bias – This can occur when analysts want to prove or confirm an assumption or support a particular conclusion that has been defined beforehand.
- Selection bias – This bias occurs when the data selected is not representative of the complete population. For example, in a survey, many people might respond, but if the majority of the target audience did not, the resulting data is biased.
- Outliers – Outliers are extreme values – too low or too high. Such values don’t fit with other data points causing errors in algorithms.
- Overfitting and underfitting – overfitting occurs when the model is too complex and picks up noise and inaccurate values along with the signal. On the other hand, underfitting is caused by a lack of data (or too simple a model), producing a model that doesn't give the real picture of the underlying data.
- Confounding variable – these are variable(s) that have a hidden impact on the other dependent variables but weren’t accounted for in the data. These can cause errors in judgment rendering the analysis useless.
Question: What is the difference between data analysis and data analytics?
Answer: Both terms are used interchangeably but there is a subtle difference. Data analytics is a general term that encompasses data analysis as well as other activities like data identification, data extraction and validation, filtering, and making predictions based on the analysis of data. While data analysis examines past data, analytics also includes looking into the future.
Question: What are the steps involved in data analysis?
- Identify and set the goals – what are we trying to achieve by performing the analysis? What is the problem we are trying to solve?
- Setting up the parameters of measurement – what are we going to measure? What kind of data do we require?
- Data collection and cleaning – identify and gather data from various possible sources, clean the data to remove inaccuracies, redundancy, duplicates, etc…
- Analyze and interpret the data – analyze the data using different tools and techniques, apply algorithms if necessary and visualize the data using various tools to interpret the insights obtained
Question: How is a training set different from a testing set?
Answer: Huge datasets are divided into training and testing sets, where the larger portion is given to the training set. The training set is used to build the machine learning model, and the testing set is used to validate and improve the model. The model learns its parameters (coefficients) from the training set, and those same parameters are then evaluated on the unseen testing set to check how well the model generalizes.
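A minimal sketch of such a split, written without libraries for illustration (in practice, scikit-learn's `train_test_split` is commonly used):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle the data, then hold out a fraction of it for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))
train, test = train_test_split(records)
print(len(train), len(test))  # 80 20
```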
Question: What are the different types of machine learning algorithms?
Answer: There are 3 main types of machine learning algorithms – Supervised, Unsupervised, Reinforcement. Learn about the important algorithms here.
Question: List some differences between linear regression and logistic regression.
| Linear regression | Logistic regression |
| --- | --- |
| Used when the dependent variable is continuous and the equation (line) is linear | Used when the dependent variable is binary (yes/no, 0/1, true/false) |
| Dependent and independent variables have a linear relationship | There may or may not be a linear relationship between the independent and dependent variables |
| The independent variables can be correlated with each other | Independent variables should not be correlated |
Question: How is machine learning different from deep learning?
Answer: Both machine learning and deep learning are branches of AI. Machine learning is widely used in data science to build models and create algorithms that give useful insights into data. Machine learning uses the data fed to the system to train itself, and the trained algorithm is then validated and improved on testing data.
Deep learning is a subset of ML that uses neural networks. Neural networks contain multiple layers of ‘neurons’ stacked one above the other similar to how the neurons of the human brain are structured. Deep learning requires more data and more training time, but the neural network selects the features, whereas in machine learning feature selection is done by humans.
You can think of deep learning as a more specialized form of ML.
Question: Which algorithm will you use to find out what kind of movies a particular person would prefer to watch?
Answer: To find user preferences, a recommendation system is used. Most recommender systems are based on user-based or item-based collaborative filtering algorithms. Matrix factorization and clustering are other popular approaches.
Question: What is data visualization? What are the tools used to perform data visualization?
Answer: After data is analyzed and the insights are produced, it can be tough for non-technical persons to fully grasp the outcomes. Visualizing the data in graphical or report form can enhance the viewing experience and make it easier for business analysts and stakeholders to understand the insights better. This process where insights are communicated using visual tools is called data visualization. Some specialized tools are – Tableau, D3, PowerBI, FineReport, Excel, along with the basic visualization functions that are available in Python and R.
Question: What is the difference between logistic regression and Support Vector Machine?
Answer: Both logistic regression and SVM are used to solve classification problems and are types of supervised learning algorithms. Some main differences between both are –
| Support Vector Machine | Logistic regression |
| --- | --- |
| Used for classification as well as regression problems | Used for classification problems |
| Uses a hyperplane (decision boundary) to separate data into classes; uses a kernel to determine the best decision boundary | Uses a sigmoid function to model the relationship between variables |
| Works well on semi-structured or unstructured data | Works well with pre-defined independent variables |
| Less risk of overfitting | More risk of overfitting |
| Based on geometric properties | Based on statistical approaches |
Question: List some disadvantages of the linear model.
Answer: The linear model assumes that all the relationships between independent and dependent variables are linear, thus over-simplifying a problem that may not be very simple. This may lead to inaccurate results. This model can also treat noisy data as useful and lead to overfitting. A linear model cannot be used in cases where the relationship between independent and dependent variables is non-linear.
Question: State some differences between decision tree and random forest.
| Decision tree | Random forest |
| --- | --- |
| A decision tree acts upon one dataset | A random forest is a collection of multiple decision trees |
| The tree is built by considering the entire dataset, taking into account all the variables and features | Each tree is built from a random sample of observations and a random subset of features |
| Simple and easy to interpret | The randomness cannot be controlled – for example, you cannot control which feature went into which decision tree |
| Less accurate than a random forest on unseen validation data | High accuracy that tends to increase with the number of trees |
Question: What is the ROC curve? How does it work?
Answer: ROC, or the Receiver Operating Characteristic curve, shows the diagnostic ability of a binary classifier as a graphical plot. It is created by plotting the true positive rate (TPR) – the proportion of actual positives correctly predicted as positive – against the false positive rate (FPR) – the proportion of actual negatives incorrectly predicted as positive – at various threshold settings. ROC curves are used extensively in the medical field for diagnostic testing.
Question: What is a confusion matrix?
Answer: A confusion matrix is a measure of performance for classification problems in machine learning. In a classification problem, the outcome is not a single value but one of two or more classes. The confusion matrix is a table that contrasts the predicted positive and negative values along one axis with the actual positive and negative values along the other. These values can be used to determine precision using the formula,
Precision = (TP)/(TP + FP), where TP is True positive and FP is false positive.
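A small worked example in Python, with made-up labels, showing how the four confusion-matrix cells and precision are computed:

```python
# Hypothetical predictions vs actual labels (1 = positive, 0 = negative)
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)
print(tp, fp, fn, tn, precision)  # 3 1 1 3 0.75
```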
Question: You have a pack of 52 cards and you pick one card. Given that the card is a face card, what is the probability that it is a Queen?
Answer: This can be calculated using the Bayes formula,
There are 4 queens in the pack, so the probability of getting a queen, P(Q) = 4/52,
We know that a queen is a face card, so P(F/Q) = 1, as a queen is always a face card.
There are 12 face cards in a pack (Jack, Queen, and King in each of the four suits), so the probability of getting a face card, P(F), is 12/52.
The probability of getting a queen given that the card is a face card, P(Q/F), can be calculated as,
P(Q/F) = (P(F/Q) × P(Q)) / P(F)
P(Q/F) = (1 × 4/52) / (12/52) = 4/12 = 1/3
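The same calculation can be checked in Python with exact fractions:

```python
from fractions import Fraction

p_q = Fraction(4, 52)        # P(Q): 4 queens in 52 cards
p_f_given_q = Fraction(1)    # P(F|Q): every queen is a face card
p_f = Fraction(12, 52)       # P(F): 12 face cards (J, Q, K of each suit)

# Bayes' theorem: P(Q|F) = P(F|Q) * P(Q) / P(F)
p_q_given_f = p_f_given_q * p_q / p_f
print(p_q_given_f)  # 1/3
```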
Question: How would you define bias-variance trade-off?
Answer: The bias-variance trade-off is the statistical property of predictive models whereby models with high bias in their parameter estimates tend to have lower variance across samples, and vice versa. Both bias and variance contribute to the total error, and the trade-off is about finding the model complexity that minimizes the combined error.
Question: What is meant by binary classification?
Answer: In binary classification, the elements of a data set are classified into two groups (categories) by applying a classification rule. For example, positive-negative, true-false, yes-no, etc… Some of the commonly used, simple and effective binary classification algorithms are Logistic regression, random forest, and SVM.
Question: What is the normal distribution of data?
Answer: Many natural datasets approximately follow a normal distribution, which is a bell curve (shaped like a bell) depicting a probability distribution. The normal distribution is symmetric about the mean, which means that data near the mean occur more frequently than data far from the mean. Some examples of approximately normally distributed quantities are the height of people and marks in an exam.
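As a small illustration, Python's `statistics.NormalDist` can show the familiar rule that roughly 68% of values lie within one standard deviation of the mean, for a hypothetical distribution of exam marks:

```python
from statistics import NormalDist

# Hypothetical exam marks: mean 65, standard deviation 10
marks = NormalDist(mu=65, sigma=10)

# Share of students within one standard deviation of the mean (55 to 75)
within_one_sd = marks.cdf(75) - marks.cdf(55)
print(round(within_one_sd, 3))  # ≈ 0.683
```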
Question: Why are statistics important for data science?
Answer: Statistics provide fundamental tools and techniques to get deeper insights into the data. It also helps us to quantify and analyze data uncertainty with consistent results. Statistics is used in several stages of data science – problem definition, data acquisition, data exploration, analysis, modeling, validation, and reporting.
Question: Do you know about covariance and correlation?
Answer: Both covariance and correlation measure the dependency and relationship between two variables. However, some differences between both are –
| Covariance | Correlation |
| --- | --- |
| Indicates only the direction of the linear relationship between the variables | Indicates both the strength and the direction of the relationship |
| Values carry units (the product of the variables' units) | Values are unit-free and lie between −1 and 1 |
| Calculated as the total variation of two variables from their expected values | The correlation coefficient of two variables is determined by dividing their covariance by the product of the standard deviations of the variables |
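A short worked example, computing both measures from first principles on made-up data:

```python
from statistics import mean, pstdev

x = [2, 4, 6, 8]
y = [1, 3, 5, 7]

n = len(x)
mx, my = mean(x), mean(y)

# Population covariance: average product of deviations from the means
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

# Correlation: covariance scaled by the standard deviations (unit-free)
corr = cov / (pstdev(x) * pstdev(y))
print(cov, corr)  # covariance 5.0; correlation 1.0 (perfect linear relationship)
```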
Question: What is the p-value?
Answer: The p-value is used in hypothesis testing. It is the probability of obtaining results at least as extreme as the ones observed, assuming that the null hypothesis is true. A small p-value (typically below the chosen significance level, e.g., 0.05) indicates strong evidence against the null hypothesis, in which case the null hypothesis is rejected. P-values can be looked up in p-value tables or computed by statistical software.
Question: Define regularization and explain why it is important.
Answer: Regularizations are techniques used to avoid overfitting on the given training set and reduce the error by properly fitting the function on the dataset. Overfitting happens when the learning algorithm also takes into consideration the noise along with the correct data. Thus, the result will be a pattern and noise (error or deviation). The goal of regularization is to eliminate the noise from the pattern.
Question: What is a confounder variable?
Answer: A confounder, or confounding variable, is an external variable that influences both the independent and dependent variables, distorting their apparent relationship. Because of such an influence, the entire experiment can be ruined and produce useless outcomes. To avoid such problems, confounding variables should be identified and accounted for before the experiment is started.
Question: Give some differences between R and Python.
Answer: R and Python are both the most preferred and powerful languages for data science, particularly analysis. Some differences between both are –
| Python | R |
| --- | --- |
| Easy to learn and interpret, with simple and readable syntax | Needs more time to learn and understand the syntax and to write programs |
| More popular than R | More users switch from R to Python than vice versa |
| Used mostly for programming by developers | Used by researchers and analysts for scholarly and research purposes |
| Has a rich set of libraries, though fewer for statistics than R | Has extensive, easily available libraries |
| Good documentation and huge community support | Good documentation, GitHub interface, and community support |
| Simple and understandable graphs, though not as fancy as R's | Great graphical depictions using RStudio |
For developers, Python is the preferred language to start with. However, if you do more statistical analysis and need excellent graphs for reporting, R should be your choice.
Question: What is the fastest and most accurate method of data cleaning?
Answer: There are many ways to do data cleaning, so this answer can be subjective depending on what tools you have previously used. For example, you can perform cleaning by writing code in Python or another language, or you could do it at the database level using SQL queries. MS Excel is another excellent tool for filtering, correcting, and cleaning data. However, code-level cleaning in SQL or Python can run into performance issues, and Excel works only on small or medium datasets. For huge datasets, Apache Spark and Apache Hadoop are among the best tools – relatively easy to learn, fast, and effective. Some other tools are PowerBI and Alteryx.
Question: What are the main types of statistical analysis techniques?
Answer: There are three main statistical analysis techniques –
- Univariate – the simplest type of analysis, where there is only one variable. The purpose of the analysis is just to describe the data and find patterns; it doesn't deal with causes or relationships in the underlying data.
- Bivariate – As the name suggests, this technique involves two variables. It deals with cause and relationships, for example, will the sales of AC go up during summer or winter? Through the analysis, we can find the relationship between the two variables involved.
- Multivariate – When data contains more than two variables, this technique is used. The focus of the analysis depends on the questions that need to be answered through analysis. For example, a company trying to promote four products and analyzing which one has a better response.
Question: What is a false negative? How does it impact accuracy?
Answer: A false negative is a situation where a test incorrectly returns a negative result, i.e., you should have got a positive result but got a negative one. For example, a pregnancy test might show negative during the very early stages when the person is actually pregnant. It impacts accuracy because a system you believe is safe may not be safe in a real sense, which can lead to serious consequences later. For example, if a quality check on raw materials passes when it should fail, the faulty item moves on, leading to a bad product later.
Question: What is collaborative filtering? Where is it used?
Answer: It is a technique that filters out content or products (or any other items) that a particular user may like based on recommendations from users with similar interests. It is done by finding a small subset of people from a relatively larger dataset. It is used in recommendation systems like Amazon product recommendations, Netflix movie recommendations, etc…
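A toy sketch of user-based collaborative filtering – the users, items, ratings, and similarity rule here are all made up for illustration:

```python
# Toy user-based collaborative filtering: recommend items liked by the
# most similar user. Ratings and names are hypothetical.
ratings = {
    "alice": {"item1": 5, "item2": 4, "item3": 1},
    "bob":   {"item1": 5, "item2": 5, "item4": 4},
    "carol": {"item3": 5, "item4": 2},
}

def similarity(u, v):
    """Count of items two users rated similarly (within 1 point)."""
    common = set(ratings[u]) & set(ratings[v])
    return sum(1 for i in common if abs(ratings[u][i] - ratings[v][i]) <= 1)

def recommend(user):
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: similarity(user, u))
    # Suggest items the nearest neighbour rated that the user hasn't seen
    return [i for i in ratings[nearest] if i not in ratings[user]]

print(recommend("alice"))  # bob is most similar to alice, so suggest item4
```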
Question: How would you perform data analytics?
Answer: Data analytics involves many steps and interactions with different teams. Successful analytics involves the following steps –
- Descriptive analytics – in this step, all the previous and current data is collected and historical trends are obtained. This determines the answer to the question 'what happened'. It is just a summarization of the collected data to find trends and patterns.
- Diagnostic analytics – next, we find the main causes of the problem. This step answers 'why it happened' and helps in improving the situation. It is often done using machine learning algorithms.
- Predictive analytics – once we know the 'why', we can forecast 'what will happen', i.e., make predictions for the next quarter, semester, or year. This type of analytics requires both past and present data.
- Prescriptive analytics – the final step, which uses advanced AI techniques like natural language processing, image recognition, heuristics, etc. to train the model to get better at the tasks at hand. For example, recommendation systems and search engines become better as more and more data is fed into them.
Question: What is ensemble learning and what are the techniques for it?
Answer: Ensemble learning involves using multiple algorithms to get better predictions and outcomes than we would obtain from one algorithm alone. This type of learning is mostly used in supervised learning algorithms like decision trees and random forests, but can also be used with unsupervised algorithms. It involves more computation than using a single model, to give the best results. Some common techniques are –
- Bayes optimal classifier
- Bayesian model averaging
- Bucket of models
Check the whole list of techniques on the Wikipedia page.
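As a minimal illustration of the ensemble idea, here is a simple majority-vote combiner over hypothetical predictions from three classifiers (this shows the intuition behind voting ensembles, not a full implementation):

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine predictions from several models by majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three weak classifiers on five samples
model_a = [1, 0, 1, 1, 0]
model_b = [1, 1, 1, 0, 0]
model_c = [0, 0, 1, 1, 1]

print(majority_vote([model_a, model_b, model_c]))  # [1, 0, 1, 1, 0]
```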
Question: Suppose you build a model and train it. Now, you feed the test data, but the results are not accurate. What could be the possible reasons for this?
Answer: There could be many reasons –
- Wrong feature selection
- Wrong choice of algorithm
- Model is overfitting or underfitting
Question: What is an eigenvector?
Answer: An eigenvector is a non-zero vector whose direction does not change when a linear transformation is applied – the transformation only scales it, i.e., Av = λv, where λ is the corresponding eigenvalue. Eigenvectors and eigenvalues are used in many applications in machine learning and computer vision, such as principal component analysis (PCA).
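A quick numerical check, using a hypothetical 2×2 matrix, that multiplying an eigenvector by the matrix only scales it:

```python
# v = (1, 1) is an eigenvector of A with eigenvalue 3: A*v = 3*v,
# so the direction of v is unchanged by the transformation.
A = [[2, 1],
     [1, 2]]
v = [1, 1]

Av = [A[0][0] * v[0] + A[0][1] * v[1],
      A[1][0] * v[0] + A[1][1] * v[1]]

print(Av)  # [3, 3], i.e., 3 * v
```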
Question: What is the difference between a data scientist and data analyst?
Answer: Data analysts work very closely with data and have strong technical knowledge – a programming language, SQL, statistics, etc. – to apply tools and techniques and find trends and patterns in the data. The question for which data has to be analyzed usually comes to a data analyst from a data scientist. Data scientists identify problems and questions, and present the results of the analysis to stakeholders and business analysts. They need knowledge of machine learning algorithms, strong business acumen, and effective communication and presentation skills.
To crack the data science interview, you need to be familiar with the tools that you have used, and stay updated about the new tools available in the market for data science. You should also have a good understanding of machine learning and its algorithms, and of how you have used some of them in your previous projects (if any). Besides this, it is a good idea to know a few good books on data science – and read at least one of them to ground yourself thoroughly in the topics. Sometimes interviewers do ask about the book you follow!
Hope this article helps you learn about the most important and practical aspects of data science used by companies to hire their data scientists. Happy learning!