## Introduction

Data science is a single term that carries a lot of weight: it covers various stages, and each stage depends on the previous one for the complete process to run smoothly. The most time-consuming step in the entire data science lifecycle is said to be the data preparation stage, where raw data is transformed into useful data through many methods and strategies. Many tools help perform data science better and save a lot of time, but the process remains the same.

## Data Science Process [Step by Step Guide]

In this write-up, we will discuss the complete process involved in data science. You can also read about the different phases of data science through our introductory data science article.

The most important term in data science is ‘Big Data’ because this is where it all started. Big data is the data used for performing data science. You may wonder why it is called big data. For one, it is huge in volume; secondly, it has a lot of variety and is mostly unstructured, raw, and not directly usable; and lastly, the rate or velocity at which the data is generated every day (or rather, every second!) is enormous.

That said, to get something useful out of this messy big data, we need to follow a process, which we are calling the data science process and is as follows:

- Process and prepare data for the analysis
- Apply suitable algorithms based on the business requirement(s)
- Tune the parameters of the algorithms to optimize the obtained results
- Select, build and evaluate the final model

### 1. Data Preparation

As we already mentioned, what we receive from different sources is raw, unprocessed, messy data, and we need to make it suitable for analysis. Without this, even if we apply the best algorithms, we will not get the desired results. Data is mostly represented in tabular format for analysis as it is easy to view and process that way.

#### 1.1 Data Format

| Txn ID | Item bought | Date | Price | Discount applied? |
|--------|-------------|------|-------|-------------------|
| 1 | Amazon Kindle | Jan 23 | 6999 | No |
| 2 | Apple iPad Air | Jan 24 | 40500 | Yes |
| 3 | Samsung washing machine | Jan 23 | 13100 | No |
| 4 | Amazon Kindle | Jan 26 | 6800 | Yes |
| 5 | One Plus HDTV | Jan 27 | 13920 | Yes |

Here each row represents a data point (a single transaction), and each column is a variable or feature. We have given a representation with very few variables; in reality, there can be more than 100. Depending on the business objective, we may have to prioritize which variables to show. For example, the above data shows the product bought on a particular day and whether a discount was applied. If we want to know how many customers purchased each item under the discount scheme, we can derive a new set of variables:

| Product name | Discount price | Number of purchases | Date |
|--------------|----------------|---------------------|------|
| Amazon Kindle | 6800 | 200 | Jan 25 |
| Apple iPad Air | 40500 | 35 | Jan 26 |
| Samsung washing machine | 12900 | 40 | Jan 26 |
| Amazon Kindle | 6500 | 250 | Jan 26 |
| Apple iPad Air | 40000 | 40 | Jan 27 |
| One Plus HD TV | 13920 | 320 | Jan 26 |
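Deriving such a summary from raw transactions is a routine aggregation step. Here is a minimal sketch in plain Python; the transaction records below are made up for illustration and are not real data:

```python
# Aggregate raw transactions into per-product counts of discounted purchases.
transactions = [
    {"item": "Amazon Kindle", "price": 6999, "discount": False},
    {"item": "Amazon Kindle", "price": 6800, "discount": True},
    {"item": "Apple iPad Air", "price": 40500, "discount": True},
    {"item": "One Plus HDTV", "price": 13920, "discount": True},
]

# Count how many purchases of each item used a discount.
discounted_counts = {}
for t in transactions:
    if t["discount"]:
        discounted_counts[t["item"]] = discounted_counts.get(t["item"], 0) + 1
```

In practice this kind of aggregation is usually a one-liner with a library such as pandas (`df.groupby("item")`), but the underlying operation is the same.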

#### 1.2 Variable types

To apply machine learning algorithms, the variables should be standardized and converted into values that can be compared. In the table above, the discount price is a floating-point value, the number of purchases is an integer, and whether a discount was applied is Boolean. There are four main data types used for variables:

- Binary: The simplest, having two values – 0/1, true/false, yes/no. Here, binary is used to represent whether a discount is applied or not.
- Integer: Integer type is used to represent whole numbers. For example, the number of purchases above is an integer.
- Categorical: Categories can be used to describe and separate items based on similarity. For example, the products here are categories. A more general categorization will be tablet, washing machine, TV.
- Continuous: Continuous values are those with decimal points, i.e., floating-point numbers. In our example, we have rounded off the price to the nearest integer, but in most real data, the price is a decimal value like 1.50, 300.75, etc.
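These four variable types map naturally onto programming-language types. A quick sketch, with illustrative values taken from the transaction example:

```python
# One row of the transaction table, showing each of the four variable types.
row = {
    "discount_applied": True,       # binary: yes/no
    "num_purchases": 200,           # integer: whole number
    "product": "Amazon Kindle",     # categorical: a label, not a quantity
    "price": 6999.50,               # continuous: decimal value
}

# Before modelling, categorical labels are usually encoded as numbers,
# e.g. by assigning each distinct product an index.
products = ["Amazon Kindle", "Apple iPad Air", "One Plus HDTV"]
encoded = products.index(row["product"])
```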

#### 1.3 Variable selection

As we mentioned above, there may be hundreds of variables associated with a data point, but not all of them will be useful for a particular problem. To start with, the most critical variables are determined through trial and error. Once we get the results and the feedback, we can remove or add variables and improve the results with each iteration. For example:

| Exercise hours per day | Calorie intake per day | Weight | Dress size |
|------------------------|------------------------|--------|------------|
| 3 | 1000 | 60 | L |
| 4 | 560 | 50 | S |
| 1 | 700 | 72 | XL |
| 0 | 800 | 89 | XXL |
| 5 | 900 | 62 | XL |

From the above data, we observe that weight depends on calorie intake and the number of hours of exercise. Weight correlates negatively with exercise hours: the more you exercise, the lower your weight. Weight correlates positively with calorie intake: the higher the intake, the higher the weight. Weight also correlates positively with dress size.

We try to identify variables that have either a positive or negative correlation with the variable in question. In the process, we eliminate those variables with zero correlations.
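These correlations can be computed directly. Below is a sketch of the Pearson correlation coefficient applied to the exercise-hours and weight columns of the table above, written in plain Python; in practice you would use a library call such as `pandas.DataFrame.corr()` or `numpy.corrcoef`:

```python
# Pearson correlation between exercise hours and weight,
# using the values from the small illustrative table above.
exercise = [3, 4, 1, 0, 5]
weight = [60, 50, 72, 89, 62]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(exercise, weight)
# r is strongly negative here: more exercise goes with lower weight.
```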

#### 1.4 Feature engineering

Sometimes, we have to engineer new features to extract the best variables from the data. For example, from the data above we can infer that those with a high calorie intake are less likely to reduce weight, or that those who wear XL or XXL dresses are less likely to wear slim-fit clothing.

In the same way, more than one variable can be combined into a single new variable that represents both. For example, both calorie intake and dress size have a positive correlation with weight, so we can combine the two into one new variable. This is called dimensionality reduction, and it helps find the most important features in the data.
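As a toy illustration of combining two variables into one (a real pipeline would use a technique such as PCA), we can standardize each variable and average them. The numeric encoding of dress sizes below is an assumption made purely for this example:

```python
# Combine calorie intake and dress size into one standardized variable.
calories = [1000, 560, 700, 800, 900]
dress_size = [4, 1, 5, 6, 5]  # illustrative encoding: S=1, L=4, XL=5, XXL=6

def zscores(xs):
    # Standardize: subtract the mean, divide by the standard deviation.
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]

# The combined feature is the average of the two standardized columns.
combined = [(a + b) / 2 for a, b in zip(zscores(calories), zscores(dress_size))]
```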

#### 1.5 Missing data

In a dataset, we rarely find all the values correct and filled in. In reality, we never have a complete dataset with perfect values: there are always missing, incorrect, or redundant values, which are approximated, computed, or removed from the dataset. For example, observe this Netflix rating data for various movies from different users:

| | Movie 1 | Movie 2 | Movie 3 |
|---|---------|---------|---------|
| User 1 | 5 | | 4 |
| User 2 | 5 | 4 | |
| User 3 | | 3.5 | 3 |
| User 4 | 4.5 | | |

We see that these users seem to have watched all three movies, but only a few have rated them. That doesn’t mean the others disliked a movie or wanted to give it a zero rating. But to recommend these or similar movies to other users, we need more data, i.e., more ratings, so we have to fill in the gaps with approximate values. In this case, removing the data or marking all the empty entries as 0 won’t work, so the best option is to fill each missing value with the average rating for that movie. For example, for movie 3, the average rating is (4 + 3)/2 = 3.5, so the other two missing values can be set to 3.5.

If a column is not very important and has many missing values, we can remove it or merge it with another column. Supervised learning algorithms are also often used to compute missing values. Though time-consuming, this method is more accurate than any other for populating missing values.
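The mean-imputation strategy described above takes only a few lines. A sketch using the movie 3 column, with `None` marking a missing rating:

```python
# Movie 3 ratings for the four users; None means the user did not rate it.
movie3 = [4, None, 3, None]

# Average over the ratings that are present: (4 + 3) / 2 = 3.5.
present = [r for r in movie3 if r is not None]
mean = sum(present) / len(present)

# Replace each missing rating with the column mean.
filled = [mean if r is None else r for r in movie3]
```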

### 2. Algorithm selection

With a knowledge of machine learning, we can match different algorithms to different types of problems. Some algorithms are supervised, some unsupervised, and more advanced forms of learning come from reinforcement learning. Some popular machine learning and deep learning algorithms are k-means clustering, support vector machines, principal component analysis, decision trees, association rules, neural networks, Q-learning, etc.

If you want to know whether you can play a cricket match today, a decision tree can make that decision for you. If you want to categorize fruits based on their nutritional values, use classification. If you want to predict the value of a house in the next five years, use regression. In the same way, you can use clustering algorithms to learn what kind of food is preferred by different age groups, or association rules to understand the purchasing patterns of consumers.
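For instance, the cricket-match decision can be pictured as a tiny hand-written decision "tree". The features and thresholds below are invented for illustration; a real decision tree would learn its splits from data:

```python
# A two-level decision rule for "can we play today?".
# outlook and wind_kmph are assumed features; the 30 km/h cutoff is made up.
def can_play(outlook, wind_kmph):
    if outlook == "rainy":
        return False          # first split: no play in rain
    if wind_kmph > 30:
        return False          # second split: too windy
    return True

print(can_play("sunny", 10))
```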

Learn more about machine learning types and machine learning algorithms.

### 3. Parameter tuning

So, out of the many algorithms we listed above, how do you decide which one to use? In most cases, multiple algorithms will be suitable for a particular problem. For example, to decide about a cricket match, we can surely use a decision tree, but if there are too many factors involved, we may have to use principal component analysis and random forest! Usually, it isn’t easy to come to a concrete conclusion in the first go, and many models are tried before the best one is decided.

One algorithm can itself give different results based on the parameters selected and tuned. Check out our article on the decision tree to understand this better; you will note that the minimum and maximum number of terminal nodes and the maximum tree depth are some of the tuning parameters of a decision tree.

Same way, each algorithm has at least one tuning parameter. k-nearest neighbors have ‘k’, i.e., the number of nearest neighbors as the parameter; neural networks have tuning parameters like the number of hidden layers, initial weights, learning rate, and many more.
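Tuning a parameter usually means sweeping over candidate values and keeping the one that scores best. Here is a sketch for the 'k' of k-nearest neighbours on a tiny one-dimensional toy dataset; the points, labels, and candidate k values are all made up for illustration:

```python
# Toy parameter sweep: pick the k with the best leave-one-out accuracy.
points = [(1.0, "A"), (1.2, "A"), (1.1, "A"), (5.0, "B"), (5.2, "B"), (4.9, "B")]

def knn_predict(x, train, k):
    # Majority label among the k training points closest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

best_k, best_acc = None, -1.0
for k in (1, 3, 5):
    hits = 0
    for i, (x, lab) in enumerate(points):
        train = points[:i] + points[i + 1:]   # leave one point out
        hits += knn_predict(x, train, k) == lab
    acc = hits / len(points)
    if acc > best_acc:
        best_k, best_acc = k, acc
```

Real libraries automate exactly this loop (e.g. scikit-learn's grid search), usually with cross-validation instead of a single pass.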

Based on parameter tuning, a model can be:

- Overfit – yields near-perfect predictions on the current data but is too sensitive to new datasets, treating random variations (noise) as patterns.
- Underfit – too generalized; it ignores important patterns, compromising accuracy.
- Ideally fit – the right parameter tuning, which captures the most important patterns while ignoring minor, random variations, giving good accuracy on any dataset.

The simplest way to minimize errors while keeping the model simple is regularization, wherein, as the model’s complexity increases, a new parameter, known as the penalty parameter, artificially increases the prediction error, keeping both accuracy and complexity in check. Keeping the model simple helps maintain generalizability.
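The penalty idea can be seen in miniature: a ridge-style loss adds a penalty strength times the sum of squared weights to the error, so a model with larger or more weights pays a higher price. The error and weight values below are illustrative only:

```python
# Ridge-style regularized loss: error term plus a weight penalty.
def ridge_loss(errors, weights, lam):
    mse = sum(e ** 2 for e in errors) / len(errors)
    penalty = lam * sum(w ** 2 for w in weights)   # grows with model complexity
    return mse + penalty

# Same prediction errors, but the second model uses larger, more numerous weights.
simple_model = ridge_loss([1.0, -1.0], [0.5], lam=0.1)
complex_model = ridge_loss([1.0, -1.0], [3.0, -2.5, 4.0], lam=0.1)
```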

### 4. Model evaluation

Once a model is built by selecting the right algorithm and tuning the parameters, it must be evaluated on different datasets. The most common is the testing set, created by splitting the total dataset into training and testing sets at the very beginning.

There are many metrics to determine the model accuracy, and the most common metrics are:

#### 4.1 Classification metrics

Percentage of correct predictions – The most straightforward measure of accuracy is the proportion of correct predictions. For example, we could be 70% right about a weather forecast. Though simple, this method says nothing about the nature of the prediction errors (the remaining 30%).

Confusion matrix – The confusion matrix provides details of where the model succeeded and where it failed. It is represented as:

| Actual (↓) / Predicted (→) | Will rain | Will not rain |
|----------------------------|-----------|---------------|
| Rained | 60 (True Positive, TP) | 30 (False Negative, FN) |
| Did not rain | 30 (False Positive, FP) | 40 (True Negative, TN) |

This matrix shows not only how many predictions were wrong but also what type of error was made: false positives versus false negatives. It is the most popular evaluation method for classification.
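From the four counts in the matrix above, the common classification metrics follow directly:

```python
# Counts taken from the confusion matrix above.
tp, fn, fp, tn = 60, 30, 30, 40

accuracy = (tp + tn) / (tp + fn + fp + tn)   # 100/160 = 0.625
precision = tp / (tp + fp)                   # of predicted "rain", how many rained
recall = tp / (tp + fn)                      # of actual rain days, how many we caught
```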

#### 4.2 Regression Metric

Root Mean Squared Error (RMSE) – Regression involves continuous values, so errors are measured as the difference between predicted and actual values. Squaring the errors amplifies large ones, which makes this metric sensitive to outliers in the data.
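RMSE in code, on illustrative predicted and actual values (not from a real model):

```python
# Root Mean Squared Error: square the errors, average, then take the root.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 230.0]

squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = (sum(squared) / len(squared)) ** 0.5
# Note how the single large error (30) dominates the result.
```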

#### 4.3 Validation

Metrics can show great results on the current dataset, but what about new datasets? Metrics may not be the most reliable evaluation measure, especially for overfit models! Thus, we need further validation procedures. Most of the time, validation is done by splitting the data into training and testing sets, but if the dataset is small, we may not have the option to split it further. In such cases, we can use cross-validation, where the dataset is divided into multiple segments: one segment is used for testing while the others are used for training, and the iterations continue until each segment has been used for testing exactly once. Here is a representation of cross-validation:

| Segment 1 | Segment 2 | Segment 3 | |
|-----------|-----------|-----------|--|
| Test | Train | Train | Result 1 |
| Train | Test | Train | Result 2 |
| Train | Train | Test | Result 3 |

Each result will differ slightly from the others. The final accuracy estimate is the average across all the iterations, and accounting for these variations improves the reliability of the estimate.
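The three-segment scheme above can be sketched as a plain loop. A real workflow would fit and score a model on each split; here we only record the split sizes, and the nine-element dataset is illustrative:

```python
# Three-fold cross-validation skeleton: each segment is the test set once.
data = list(range(9))
folds = [data[0:3], data[3:6], data[6:9]]

results = []
for i, test in enumerate(folds):
    # All other folds together form the training set for this iteration.
    train = [x for j, fold in enumerate(folds) if j != i for x in fold]
    # A real model would be fit on `train` and scored on `test` here.
    results.append((len(train), len(test)))

average_train_size = sum(t for t, _ in results) / len(results)
```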

What happens if the accuracy is low?

We may have to start from scratch: prepare the dataset again, build the model again, or tune the parameters again.

## How to learn and master data science

Data science is huge, and learning the complete end-to-end process is no joke; the challenges only grow as you learn more. The rewards are many, and so are the responsibilities. To start with, we recommend you check out our Data Science roadmap, which lists the skills you need to start your career as a data scientist. Also read our blog on data science career opportunities to see which area you want to specialize in. Each of these roles requires specialized skills beyond the general skills needed by data scientists.

## Conclusion

In this article, we have understood the various steps of data science. We have introduced you to a lot of important technical terms used in data science. This article is not exhaustive but an introductory account of the steps involved in the data science process. We will deal with each of the processes in detail in upcoming articles.