Data Science is said to be an interdisciplinary field that consists of many phases. Data Scientists require many different skills like statistics, algebra, probability, programming, domain knowledge, etc., to define business problems accurately and then solve them. At each stage of the entire data science lifecycle, data scientists use various tools and techniques to get the best possible results. We will discuss the various stages of the data science lifecycle and how datasets change after every stage.
Before that, we will see a little about the central component around which the entire data science is built – Data!
What is Data – Lifecycle of Big Data
We are well aware that this is the age of big data. Because the traditional relational databases are no longer able to handle the amount of data generated per day, rather per second, the concept of big data was introduced to store and manipulate data for data science purposes. Many frameworks, like Apache Hadoop, Spark, Hive, etc., enable us to process data in a faster and more efficient manner.
The lifecycle of data has become simple because of big data frameworks:
- Data Disclosure: It involves the collection and storage of big data. Data can be acquired or collected from various data sources like cloud, databases, data warehouses, etc., and pre-processed to remove redundant, duplicate, or missing values. The resultant data is then stored appropriately using the big data storage systems, where the data integrity and security are of utmost importance.
- Data Manipulation: Data is then aggregated and analyzed. If there are multiple datasets, they are combined, and the ETL (Extract, Transform, and Load) process occurs. The data is then analyzed, and reports are generated to take further action on data.
- Data Consumption: Once the final dataset is ready, it can be used to extract useful insights. Relevant data can be turned into information, knowledge, or business solutions that generate more revenue, while unwanted data can be deleted.
This data lifecycle is a part of the data mining stage of data science, where data is acquired from various sources. Some analysis is performed on the data to make it usable for getting knowledge or insights.
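As a rough illustration of the collection, clean-up, and aggregation steps described above, here is a minimal sketch in plain Python. The records, the field names (order_id, region, amount), and the helper functions are all hypothetical and only stand in for whatever a real big data framework would process.

```python
from collections import defaultdict

# Hypothetical raw records collected from two sources; None marks a missing value.
source_a = [
    {"order_id": 1, "region": "north", "amount": 120.0},
    {"order_id": 2, "region": "south", "amount": None},   # missing value
    {"order_id": 3, "region": "north", "amount": 80.0},
]
source_b = [
    {"order_id": 3, "region": "north", "amount": 80.0},   # duplicate of a record in source_a
    {"order_id": 4, "region": "south", "amount": 50.0},
]


def clean(records):
    """Drop records with missing values and repeated order_ids (first one wins)."""
    seen, cleaned = set(), []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue
        if rec["order_id"] in seen:
            continue
        seen.add(rec["order_id"])
        cleaned.append(rec)
    return cleaned


def total_by_region(records):
    """Aggregate the cleaned records into a small report: total amount per region."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["region"]] += rec["amount"]
    return dict(totals)


combined = clean(source_a + source_b)   # combine the datasets, then clean them
report = total_by_region(combined)      # aggregate for reporting
print(report)
```

At real data volumes the same steps would normally run inside a framework such as Spark rather than in Python lists, but the order of operations stays the same.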
Data Science Lifecycle Stages
Now that we have seen how big data works, we will dive into the various stages of the data science lifecycle. Each stage depends on the one before it, and if errors are found, we go back to a previous stage. For example, if during Exploratory Data Analysis we realize that we have been using the wrong set of parameters all along, we have to go back to an earlier step, i.e., data acquisition, to get more relevant datasets.
I have seen many blogs on the web, defining many stages that are confusing and similar. Here is a simple yet complete version that I have extracted based on my experience and some research from the web.
Problem Definition and Business Data Understanding
The first step is to find out the business problem and give it a clear, actionable definition. To do this, you need to ask a lot of relevant questions and formulate hypotheses. If some supporting data is available, you should explore the data and arrive at some inference regarding defining the business problem.
A hypothesis is nothing but an informed assumption about a particular target audience. Usually, a business problem is all about finding proof for a hypothesis. For example, if most 20-year-old girls prefer a specific hair color brand, we can hypothesize that it is the best brand available in the market. To prove this hypothesis, however, we need more data, such as how many days the trend was observed for, whether the product was offered at a discount at that time, and so on.
Suppose you are a competitor brand and this business problem is real. Then the solution would be based on the data you collect and on how you can improve the sales of your brand by offering more unique hair colors, better prices or discounts, perhaps a free service, and so on.
Excel is a handy tool for writing down problem statements, and flowcharts and cause-and-effect matrices also help to define the problem.
Once we have defined our business problem, we need to find relevant data to support the hypothesis or find the solution to our problem. This step requires extensive research and data collection from various sources like the internet, interviews, surveys, databases, legacy systems, cloud servers, data warehouses, and so on. Multiple data sources are combined and integrated into a single large source through the mining process. Data mining is not just data collection – it involves pre-processing the data to find initial trends and patterns in the data to affirm that we are moving in the right direction in solving the business problem. If the insights we get through mining don’t support our initial statement, we have to go back to the first stage of (re)defining the business problem.
Data mining is a big step and involves the following steps:
- Data collection through multiple sources
- Understanding the data and how to make sense of it
- Making the data structured and usable through pre-processing such as aggregation, cleaning, sorting, etc.
- Transforming the data into a cleaner, better dataset
- Exploratory Data Analysis (EDA) to find general trends and patterns
Data sources can be offline resources like journals, files, magazines, newspapers, interviews, questionnaires, TV shows, etc., or online resources like cloud servers, databases, data warehouses, etc. The pipeline, i.e., streaming or batch data, is another source of data for mining.
Some popular tools for data mining are Knime, Weka, RapidMiner, Excel, and Spark.
Feature Engineering & Predictive Modeling
Once we have a clean dataset with some data analysis in hand, we can decide on the right parameters or features to help us create a model that solves our business problem. This means selecting the most important features of the data, determining the dependent and independent variables, finding correlations between parameters, and reducing the total number of features using dimensionality reduction algorithms like Principal Component Analysis (PCA).
Once we have the features ready, we need to select an appropriate algorithm to train and test the model. We may need to apply more than one algorithm as algorithm and model selection are usually trial and error processes. Parameters should be tuned till we achieve a certain level of accuracy, and the model has to be continuously retrained based on feedback.
Some popular feature engineering techniques are handling outliers, binning, imputation, feature splitting, log transform, scaling, etc.
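A few of these techniques can be sketched with the Python standard library alone. The columns (ages, incomes), the bin cut-offs, and the function names below are assumptions made up for illustration; in practice, libraries such as pandas or scikit-learn are the more common choice.

```python
import math
import statistics

ages = [22, 35, None, 58, 41]                     # hypothetical column with a missing value
incomes = [30000, 45000, 52000, 250000, 61000]    # hypothetical right-skewed column


def impute_median(values):
    """Imputation: replace missing entries (None) with the median of the observed ones."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def bin_age(age):
    """Binning: map a numeric age to a coarse category (the cut-offs are arbitrary)."""
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"


def log_transform(values):
    """Log transform: compress a long right tail; log1p also handles zeros."""
    return [math.log1p(v) for v in values]


def min_max_scale(values):
    """Scaling: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages_filled = impute_median(ages)
age_groups = [bin_age(a) for a in ages_filled]
log_incomes = log_transform(incomes)
scaled_incomes = min_max_scale(incomes)
print(ages_filled, age_groups, scaled_incomes)
```

Note that min_max_scale would divide by zero on a constant column; a real pipeline would need to guard against that case.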
Some popular machine learning and deep learning algorithms for predictive modeling are classification models like Naïve Bayes, SVM, etc., clustering models like k-means clustering, regression models like time-series, linear regression, etc., and neural networks.
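To make the train-and-test idea concrete, here is a minimal sketch that fits a one-feature linear regression by ordinary least squares and checks it on held-out rows. The synthetic data (y = 3x + 2 plus noise), the 80/20 split, and the fit_line helper are assumptions chosen for illustration, not a recommended modeling setup.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data, assumed for illustration: y = 3x + 2 plus a little Gaussian noise.
rng = random.Random(0)
xs = [float(i) for i in range(50)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 1) for x in xs]

# Hold out 20% of the rows for testing; the model never sees them during training.
indices = list(range(len(xs)))
rng.shuffle(indices)
test_idx, train_idx = indices[:10], indices[10:]

slope, intercept = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])

# Evaluate on the held-out rows with the mean absolute error.
test_mae = sum(abs(slope * xs[i] + intercept - ys[i]) for i in test_idx) / len(test_idx)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MAE={test_mae:.2f}")
```

If the error on the held-out rows is much worse than on the training rows, that is usually a sign to revisit the features or try a different algorithm.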
Model Continuous Evaluation
Once the model is trained and deployed, it has to be thoroughly evaluated by using fresh (unknown) datasets. There are many methods to perform model evaluation like model reporting, cross-validation, A/B testing, etc. Model evaluation should include all the aspects like context evaluation, input evaluation, and performance evaluation, i.e., we have to see if the model can fit into the context and solve the purpose for which the model was built (called the goodness of fit). It should clearly define the dependent and independent variables and the relationship between them. Based on the performance of a model, it has to be either re-evaluated or accepted. Some statistical tools for model evaluation are:
- Root Mean Square Error (RMSE): RMSE is a widely used metric for regression problems. Squaring each error removes the negative values, so every term is non-negative, and it makes large deviations count more heavily. Taking the square root brings the result back to the units of the target variable. If there are N observations, RMSE is the square root of the mean of the squared differences between predicted and actual values, i.e.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{predicted}_i - \mathrm{actual}_i\right)^2}$$
- Confidence Interval (CI): A CI expresses the uncertainty in an estimate computed from a sample of the dataset. Since it is an interval, a CI is a range of values. The confidence level (for example, 95%) is the proportion of all the possible CIs, built from repeated samples, that contain the 'true' value of the unknown parameter. If the sample size is larger, the CI will be narrower, thus giving a more precise estimate.
- Chi-square test: It is used to test whether there is a statistically significant difference between the observed frequencies and the expected frequencies. It is a hypothesis test, also known as the χ² test, whose test statistic follows a chi-squared distribution under the null hypothesis. First, we need to find the expected and the observed frequencies for each category. For each category, the square of the difference between the observed frequency (fo) and the expected frequency (fe) is divided by the expected frequency; the sum (∑) of these terms gives the chi-squared value:
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
- Confusion Matrix: A confusion matrix, also called an error matrix, is an N×N table used with classification models, where N is the number of classes (or categories). It gives a visual representation of the correct and incorrect predictions for each class. For two classes, it looks like this:
|Actual Class|Predicted Positive|Predicted Negative|Rate|
|Positive|True Positive (TP)|False Negative (FN) → Type II error|Sensitivity: TP/(TP+FN)|
|Negative|False Positive (FP) → Type I error|True Negative (TN)|Specificity: TN/(TN+FP)|
| |Precision: TP/(TP+FP)|Negative Predictive Value: TN/(TN+FN)| |
- ROC (Receiver Operating Characteristic) Curve: The ROC curve is a graphical plot of the TPR (True Positive Rate, or Sensitivity) against the FPR (False Positive Rate) at different threshold settings. FPR is the probability of a false alarm, i.e., (1 − Specificity) or one minus the True Negative Rate. The curve can be generated by plotting the cumulative distribution function (the area under the probability distribution from minus infinity (-∞) to the discrimination threshold) of the detection probability on the y-axis against that of the false-alarm probability on the x-axis.
- Gini Index: The Gini Index is a statistical dispersion measurement used to calculate the purity of a node in a decision tree, i.e., it gives the probability that a randomly chosen element of the node would be wrongly classified if it were labeled at random according to the distribution of classes in that node. The value is always between 0 and 1, and 0 means the node is pure. If we want to calculate the Gini Index using a graph based on the Lorenz curve, it is given by A/(A+B), where A is the area between the line of equality and the Lorenz curve, and B is the area under the Lorenz curve.
- Cross-validation: Cross-validation is a statistical method to estimate how a machine learning model will perform on new data by repeatedly training it on one part of the available dataset and testing it on the rest. There are many methods to perform cross-validation like the hold-out method, k-fold cross-validation, leave-p-out cross-validation, etc.
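Several of the measures in the list above are short enough to sketch in plain Python. The function names and sample labels are made up for illustration. The formulas follow the standard textbook definitions, and the confidence interval uses a normal approximation for the mean, which is only one of several ways to build one.

```python
import math
import statistics
from collections import Counter
from statistics import NormalDist


def rmse(predicted, actual):
    """Root mean square error: square root of the mean squared difference."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation CI for the mean: mean +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * statistics.stdev(sample) / math.sqrt(len(sample))
    mean = statistics.fmean(sample)
    return mean - margin, mean + margin


def chi_square(observed, expected):
    """Chi-squared statistic: sum of (fo - fe)^2 / fe over all categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP and TN for a binary classifier."""
    pairs = Counter((a == positive, p == positive) for a, p in zip(actual, predicted))
    return {
        "TP": pairs[(True, True)],
        "FN": pairs[(True, False)],
        "FP": pairs[(False, True)],
        "TN": pairs[(False, False)],
    }


def classification_rates(c):
    """Rates derived from the four cells of a binary confusion matrix."""
    return {
        "sensitivity": c["TP"] / (c["TP"] + c["FN"]),
        "specificity": c["TN"] / (c["TN"] + c["FP"]),
        "precision": c["TP"] / (c["TP"] + c["FP"]),
        "npv": c["TN"] / (c["TN"] + c["FN"]),
    }


def gini_impurity(labels):
    """Gini impurity of a tree node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists so that each row is tested exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Hypothetical labels: 1 is the positive class, 0 the negative class.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
print(classification_rates(confusion_counts(actual, predicted)))
```

The k_fold_indices helper assigns rows to folds in a fixed, interleaved order; shuffling the rows first is usually advisable when the data is sorted in some meaningful way.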
Data Visualization & Business Reports
Visualization is used in various phases of the data science lifecycle. It lets us take in all the relevant data at a single glance and understand it quickly for further analysis and processing. It helps us determine the important variables, separate them from non-essential variables, and find correlations between variables (columns of datasets).
Also, not all stakeholders and business analysts are technical people. To show them the outcomes of machine learning and the insights found by processing the data using all the above steps, we need an interactive way that is also easily understood by everyone. This is possible through graphs and plots that present data in various ways. These are visualizations that can be easily explained, and all the insights can be seen at once. Visualizations can also convey the business impact of the insights, the Key Performance Indicators (KPIs), and the future plan to improve or solve the problem at hand. Some popular visualization tools are Excel and Tableau, which provide different types of charts and visualizations suitable for various problems.
We have seen that solving a problem successfully with data science requires many steps. Each stage of the data science lifecycle is essential and makes the overall procedure much more manageable and predictable.