Data Science Roadmap

November 24, 2020

The approach

I searched a lot on the web about how to become a data scientist and, frankly speaking, I did not find any comprehensive answer. There are lots of blogs and articles, but they miss the ‘hands-on experience’ part of it, which I feel outweighs all other skills. Until you have done a project or two, you will never know the ‘real’ problems faced by data scientists. Even with all the foundational, business and presentation skills, you might face unknown challenges every single day.

That said, there should be a starting point, and therefore, I thought of putting all the resources that have helped me so far in one place, so that a beginner can know what to focus on. Of course, the hands-on experience and projects will come in handy later on, but to start with, everyone should concentrate on honing their basic skills.


Data Science Roadmap

A roadmap is like a checklist that you keep referring back to as you reach each milestone. Let us explore each of these steps (milestones) in turn:

Step 1: Mathematics


Knowledge of mathematics is a core skill for a data scientist. If you are an engineering graduate, you should already be familiar with most of the advanced concepts, such as matrices and vectors. If you are not, you will at least vaguely remember the topics below. High school math may have seemed scary, but you will come to appreciate what you learned back then, because now you can see it happening for real.

1. Linear Algebra

Linear algebra includes linear equations (y = 2x + 1), vectors, matrices and matrix operations, types of matrices, sets, log and exponential functions, and eigenvalues and eigenvectors. If you have heard of a few machine learning algorithms, you will be able to appreciate these concepts. Vectors and matrices form the base of solving any data science problem: large amounts of (unstructured) data are most easily represented using them. As an example, if each employee of your organization has three core skills (SK1, SK2, SK3), those can be represented as a skill vector with three dimensions. This vector acts like a list. Now, think about the actual data. Let’s say there are ten employees (E1, E2, …, E10); we then have a skill vector for each employee, i.e. a list of lists, which is nothing but a matrix.

EmpSK = [[E1SK1, E1SK2, E1SK3], [E2SK1, E2SK2, E2SK3], …, [E10SK1, E10SK2, E10SK3]]

This is nothing but a matrix with ten rows and three columns.
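
To make this concrete, here is a minimal NumPy sketch of the same skill matrix (the skill scores are made up for illustration):

import numpy as np

# Each row is one employee's skill vector (SK1, SK2, SK3);
# stacking the ten rows gives a 10 x 3 matrix.
emp_sk = np.array([
    [7, 4, 9],  # E1
    [5, 8, 6],  # E2
    [3, 7, 8],  # E3
    [9, 2, 5],  # E4
    [6, 6, 7],  # E5
    [4, 9, 3],  # E6
    [8, 5, 6],  # E7
    [2, 7, 9],  # E8
    [7, 7, 4],  # E9
    [5, 3, 8],  # E10
])

print(emp_sk.shape)         # (10, 3): ten rows, three columns
print(emp_sk.mean(axis=0))  # average score per skill across all employees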

2. Probability

Probability means ‘chance’: the chance that something happens. A very common example is the chance of getting heads when you toss a one-rupee coin. Since there are two equally likely outcomes, heads or tails, each has a probability of 1/2 (a 50:50 chance). Probability concepts like joint probability, conditional probability and marginal probability are commonly used in decision trees and in Bayes’ theorem for machine learning.

Probability distributions like the Bernoulli, uniform, normal and exponential distributions are used extensively for likelihood estimation, exploratory data analysis, pattern analysis, detecting outliers etc.
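
As a quick illustration of conditional probability, the sketch below plugs made-up numbers into Bayes’ theorem, P(A|B) = P(B|A)·P(A) / P(B):

# Bayes' theorem with illustrative (made-up) numbers:
# 20% of mail is spam, the word 'free' appears in 60% of spam
# and in 5% of genuine mail.
p_spam = 0.2
p_free_given_spam = 0.6
p_free_given_ham = 0.05

# Marginal probability of seeing the word 'free' at all.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Conditional probability that a mail containing 'free' is spam.
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.75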

3. Calculus

Calculus helps us generalize a solution. For example, if we select a certain training set for analysis and get good results, we need to make sure the same results hold for the rest of the dataset.

Similarly, multivariate calculus becomes essential when the data has multiple variables; for example, height, weight and age are all important parameters when screening a patient for an ailment. Can you recall how this type of data is represented? Yes, vectors! Most real-world datasets, such as those in natural language processing, images and financial data, have more than one variable. So, to predict the outcome, we use multivariate calculus to determine how each variable contributes to the solution.

4. Optimization

The concepts of maxima and minima are very important for optimization, which means finding the maximum or minimum of a function to obtain the most accurate value. Optimization techniques help you understand the internal workings of an algorithm. With knowledge of optimization, you can reach the desired accuracy faster, as you will know how to tune the parameters. With experience and a deeper understanding of optimization, you will be able to develop new algorithms as well!
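
For instance, here is a minimal gradient-descent sketch that finds the minimum of f(x) = (x − 3)²; the learning rate and starting point are arbitrary choices for illustration:

# Minimise f(x) = (x - 3)**2 by stepping opposite to the gradient.
# The derivative is f'(x) = 2 * (x - 3).
x = 0.0             # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (x - 3)
    x -= learning_rate * gradient

print(round(x, 4))  # very close to 3.0, the true minimum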

Let us see how the Math concepts of Linear Algebra, Probability, Calculus and Optimization are used for different machine learning algorithms:

  • Logistic regression is based on the S curve, or the sigmoid function (see the short sketch after this list).
  • PCA, the most popular dimensionality reduction algorithm, uses eigenvalue and eigenvector.
  • Naïve Bayes and Bayes theorem use probability theory.
  • Linear regression uses linear equations and matrices.
  • Backpropagation in deep learning algorithms uses differential calculus, and so does gradient descent.
  • SVM uses the vector dot product for maximization and matrices for the kernel trick.
  • Decision tree uses conditional probability to determine entropy and thus the information gain.
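
As a small illustration of the first point above, the sigmoid function squashes any real number into the range (0, 1), which is what lets logistic regression output a probability:

import math

def sigmoid(z):
    # The S-shaped curve used by logistic regression.
    return 1 / (1 + math.exp(-z))

print(sigmoid(-4))  # close to 0
print(sigmoid(0))   # exactly 0.5
print(sigmoid(4))   # close to 1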
Learning the required math for data science

Many courses and books teach the maths needed here. If you pick up your high school books, you might be able to refresh some basics; however, it is best to start afresh with a structured course.

Step 2: Statistics


Statistics involves collecting huge amounts of data for processing and analysis, which fits perfectly into the definition of data science. Statistics gives you an easy way to summarize, analyse and visualize data in various representations. Knowing the statistical formulae and how to use them helps in various stages of data science. There are two main types of statistics:

1. Descriptive statistics

This is the simplest way of representing data. With descriptive statistics, we can get a basic understanding of the data: for example, finding the mean, median, mode (measures of central tendency), range, standard deviation, variance, correlation etc. This type of basic analysis is the first step for quantitative data and can be easily visualized using graphs and charts. Learn more.
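
For example, Python’s built-in statistics module covers most of these summaries out of the box (the sample values are made up):

import statistics

data = [12, 15, 15, 18, 22, 22, 22, 30]

print(statistics.mean(data))    # 19.5
print(statistics.median(data))  # 20.0
print(statistics.mode(data))    # 22
print(statistics.stdev(data))   # sample standard deviation
print(max(data) - min(data))    # range: 18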

2. Inferential statistics

With inferential statistics, we go one step further and start making inferences from the results of the descriptive statistics. Inferential statistics is subjective and not as easy as descriptive statistics. For example, suppose that out of 100 people (the population), a sample of 70 is chosen and asked to choose between two art designs, A1 and A2. If 47 of them choose A1 (47/70 ≈ 67%), we can infer that about 67% of the population likes A1. The sample itself is described using descriptive statistics, and generalizing from the sample to the population is the basis of estimation and hypothesis testing. Learn more.
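
A minimal sketch of that inference, using a normal-approximation 95% confidence interval for the sample proportion (the 1.96 factor comes from the normal distribution):

import math

n = 70         # sample size
chose_a1 = 47  # people in the sample who preferred design A1

p_hat = chose_a1 / n                     # sample proportion, about 0.67
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
margin = 1.96 * se                       # 95% margin of error

print(f"Estimated share preferring A1: {p_hat:.2f} +/- {margin:.2f}")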

There are many good online courses and resources for learning statistics.

Step 3: Programming


From data processing to analysis and visualization, programming concepts and computing skills are essential for data science. Simple tasks like iteration, sequencing, recognizing patterns, arranging data and writing functions help break a big problem down into smaller sub-problems. Some important aspects of programming you should know for data science (irrespective of the language) are:

1. Data structures

It is important to understand arrays, lists, maps, hash tables, sets, graphs and trees. Data structures hold huge volumes of data and come with several operations for sorting, filtering etc., making it easy to work with the data. Data structure knowledge is also a fundamental requirement for algorithms. Learn more.
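
Python’s built-in data structures already cover most of these (the records below are made up):

# List: an ordered collection, easy to sort and filter.
scores = [88, 72, 95, 61]
print(sorted(scores, reverse=True))  # [95, 88, 72, 61]

# Dict (map / hash table): constant-time lookup by key.
ages = {"asha": 24, "ravi": 31}
print(ages["ravi"])                  # 31

# Set: unique members and fast membership tests.
skills = {"python", "sql", "python"}
print("sql" in skills, len(skills))  # True 2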

2. Control structures

Control structures define the workflow of an application. For example, if you want to print a welcome message only if a person is above the age of 18, you can put a check like this:

if (age > 18) {
    // print the welcome message
}

In the same way, if you want to print a special message for the first five website visitors, you can use a ‘for’ loop:

// greet the first 5 visitors
for (int i = 1; i <= 5; i++) {
    // print the special message
}

There are several other control structures that help you shape the workflow, like switch-case, while and do-while. Learn more.

3. OOP concepts

Object-oriented programming (OOP) concepts form the base of most languages you may want to learn. Most languages are object-based, which makes it easy to group related data. For example, a ‘Student’ object can store all the details of a student: name, age, class, subjects, marks, hobbies, parents’ names etc. Watch this video for a quick review of OOP concepts.
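
A minimal Python sketch of that idea (the attribute names are illustrative):

class Student:
    # Bundles all of a student's details into a single object.
    def __init__(self, name, age, subjects):
        self.name = name
        self.age = age
        self.subjects = subjects

    def summary(self):
        return f"{self.name} ({self.age}) studies {', '.join(self.subjects)}"

s = Student("Meera", 16, ["Maths", "Physics"])
print(s.summary())  # Meera (16) studies Maths, Physics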

Step 4: Languages


So, how do you learn the above concepts and practise them? You can choose either of the languages below, R or Python, and start practising. In my personal experience, Python is easier to understand and more readable, and you can learn it quickly too. However, for more advanced users, R has excellent libraries and tools for data analysis and visualization.

1. Python

Python has a rich set of libraries built particularly for data science, which is the most important reason it is so widely preferred for the field. If you have prior programming experience, learning Python will be a breeze for you; if you don’t, it will still be a breeze. The syntax is simple and understandable, and all you need are logical thinking and problem-solving skills. Check out our article on Python for data science to start learning Python.

2. R

Similar to Python, R is another powerful language; it offers strong support for structured data analysis and data visualization, along with rich libraries for solving complex problems. R treats data in the form of vectors, arrays, lists, matrices, factors and data frames, which makes it easy for you to play with data and manipulate it. Start learning the basic concepts of R for data science.

The language you choose for data science depends on a lot of factors, and both languages have their strengths and weaknesses. Here is a nice article that compares both – R vs Python.

Programming concepts are used in every phase of data science. You might quickly write a routine to perform mundane, repetitive cleaning tasks, or a sort program to order the data on the fly. If you know how to write code, you can save a lot of time on things that would otherwise take very long. Programming also helps in data analysis and visualization, and you can perform most statistical and mathematical operations using Python and R.

3. SQL

SQL is essential for data science. Writing queries, grouping data, selecting parts of the data, filtering, joining, merging and sorting can all be done easily at the database level using SQL. Further, SQL can run scripts that perform multiple tasks at once. SQL is used in the data preparation, manipulation and exploratory data analysis stages. SQL for data science is a great place to learn all the SQL you need, and the SQL for data analysis crash course gives a quick overview.
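
To keep everything in one language, here is a small sketch that runs a typical grouping query through Python’s built-in sqlite3 module (the table and column names are made up):

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Pune", 120.0), ("Pune", 90.0), ("Delhi", 200.0)],
)

# Grouping, aggregation and sorting done at the database level.
rows = conn.execute(
    "SELECT city, SUM(amount) AS total FROM orders "
    "GROUP BY city ORDER BY total DESC"
).fetchall()
print(rows)  # [('Pune', 210.0), ('Delhi', 200.0)]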

4. Java

Although Python and R are the preferred languages for most data science purposes, companies also consider Java a good option for several reasons:

  • Java is object-oriented.
  • It is robust and has applications in almost all domains.
  • A lot of older applications use Java as their primary programming language, as it simplifies integration and works across many platforms.
  • Most big data frameworks, like Hadoop and Hive, are written in Java, and Spark runs on the JVM.
  • Java has rich, easy-to-understand libraries for machine learning (WEKA), statistical analysis, natural language processing, data cleaning, data visualization, data mining, and data import and export.
  • Java is powerful, scalable and strongly typed, and well suited to data science projects with strict deadlines.

If you already know Python, C, C# or C++, Java will be easy to learn. You can take the Object-Oriented Programming in Java specialization on Coursera to get started.

Step 5: Big Data


Big data is the structured and unstructured data that firms receive every day. This data is huge in volume and raw; it cannot be used immediately and needs a lot of processing before it becomes fit for use. In fact, cleaning the data is the most difficult and time-consuming process in the entire data science lifecycle. But before that, we need a stable data store for this huge volume of dynamic data.

Relational database systems are not flexible enough for storing unstructured data, so we need something more. That’s where Hadoop and Spark come in!

1. Apache Hadoop

Hadoop is, in itself, a vast topic, with its own ecosystem and a combination of many components that help store data in a distributed environment. It uses a programming model called MapReduce to process big data. The main concepts you should familiarize yourself with are the Hadoop ecosystem, clusters, HDFS, MapReduce, Hive, Pig and YARN. Start learning Hadoop through this short course on Hadoop from Udemy.
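
MapReduce itself runs on a cluster, but the programming model is easy to mimic in plain Python; this sketch walks through the map, shuffle and reduce phases of the classic word count:

from collections import defaultdict

lines = ["big data is big", "data needs processing"]

# Map phase: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}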

2. Apache Spark

Spark is a fast engine for large-scale data processing, providing features like optimized query execution and in-memory caching. It follows an iterative approach to data handling and overcomes the limitations of MapReduce. Some important concepts to know in Spark are Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX, the graph API. Since Spark is a huge framework, having a book for learning and reference will prove useful; Spark: The Definitive Guide is the best book for learning, guidance and reference. You will also want to learn PySpark, the Python API for Spark.
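
A minimal PySpark sketch, assuming Spark is installed locally and a file named sales.csv exists (both are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a (hypothetical) CSV file and aggregate it in a distributed way.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.groupBy("city").agg(F.sum("amount").alias("total")).show()

spark.stop()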

3. NoSQL DB

As a data scientist, you should know about NoSQL databases. These databases store data in flexible structures rather than in multiple relational tables. For example, MongoDB, one of the most popular NoSQL databases, stores data as documents, and a document maps directly to an application object. Some other NoSQL databases are Cassandra, DynamoDB, ElasticSearch and HBase (built on Hadoop HDFS).
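
A small sketch using pymongo, the official Python driver for MongoDB, assuming a server is running locally (the database, collection and fields are made up):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
students = client["school"]["students"]  # database and collection

# Each document maps directly to a Python dict / application object.
students.insert_one({"name": "Meera", "age": 16, "subjects": ["Maths"]})

for doc in students.find({"age": {"$gt": 15}}):  # query by field
    print(doc["name"])

client.close()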

Step 6: Data Wrangling


Up to this point, we have been defining the individual core skills. Once you are done with all the bits and pieces of learning, it is time to combine those core skills and put them to use in the following stages of data science.

Data wrangling includes transforming and mapping raw data into a more usable format. This way, the data becomes more valuable and fit for analytics.

1. Data collection

Data collection requires extensive research, because data has to be gathered from various sources and then merged. It also involves related skills like data cleaning, sorting, administration and reporting.

2. Data cleaning

Data cleaning is the most tedious and critical task of all the data science phases. It includes treating missing values, detecting and correcting outliers, validating data, removing duplicates etc.
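
A minimal pandas sketch of those cleaning steps (the toy data is made up):

import pandas as pd

df = pd.DataFrame({
    "age":  [25, None, 31, 25, 200],  # a missing value and an outlier
    "city": ["Pune", "Delhi", None, "Pune", "Delhi"],
})

df = df.drop_duplicates()                         # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # treat missing values
df = df[df["age"] < 120]                          # drop an obvious outlier
print(df)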

3. Data manipulation

Data manipulation means extracting only the necessary data from a huge dataset; not all features and not all values are fit to be included, and what we need largely depends on the business problem to be solved. Data manipulation includes indexing, sorting, grouping, filtering, aggregation, subsetting, merging, pivot tables, combining, creating new variables (features) and reshaping, as the short sketch below illustrates.
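
Here is a short pandas sketch covering a few of those operations (the data is made up):

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 90, 150],
})

# Filtering, grouping and aggregation.
print(sales[sales["revenue"] > 95])
print(sales.groupby("region")["revenue"].sum())

# Reshaping into a pivot table: regions as rows, quarters as columns.
print(sales.pivot_table(index="region", columns="quarter", values="revenue"))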

4. Data visualization

Data visualization involves viewing data as graphs and charts to understand its features easily and clearly. It also helps in viewing different combinations of features, to see which set of parameters best describes the data and the business problem. To start with, you should be familiar with basic plots like histograms, scatter plots, line plots, bar charts, box plots and heatmaps.

Excel is a great tool to start with for visualization, and so is Tableau. You can set up Tableau easily and drag and drop parameters to see different visualizations. Learn the basics of Tableau.
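
If you prefer to stay in code, matplotlib covers the basic plot types mentioned above; a minimal sketch with made-up data:

import matplotlib.pyplot as plt

heights = [150, 160, 165, 170, 172, 175, 180, 185]
weights = [50, 58, 61, 68, 70, 74, 80, 88]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(heights, bins=4)      # distribution of a single variable
ax1.set_title("Heights")
ax2.scatter(heights, weights)  # relationship between two variables
ax2.set_title("Height vs weight")
plt.tight_layout()
plt.show()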

Step 7: Cloud Computing


Cloud computing has become essential for big data analytics. Deploying data solutions and expanding computing power has become easier and faster with the cloud. There are three major cloud providers in the market: Amazon AWS, Microsoft Azure and Google Cloud. Each has its own set of powerful features, and it is up to each organization to decide which one to use. As a data scientist, you should know the basic concepts, like building pipelines, scheduling and running tasks within them, and using the toolkits the platforms provide.

Step 8: Machine learning


Machine learning is an important phase in which machines find patterns and trends in data and use them to solve business problems.

1. Algorithms

A data scientist is expected to know the basics of the common algorithms: the purpose of each, the different types, and so on. For example, what is PCA for, how can we use the kernel trick, or what is reinforcement learning? For more machine learning topics, click here.
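
For example, scikit-learn lets you see what PCA does in a couple of lines (the data points are made up and deliberately lie along one direction):

import numpy as np
from sklearn.decomposition import PCA

# Six 3-D points that vary mostly along a single direction.
X = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3],
              [4, 8, 4], [5, 10, 5], [6, 12, 6]], dtype=float)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # project onto the top two components

print(X_reduced.shape)                # (6, 2)
print(pca.explained_variance_ratio_)  # first component explains ~everything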

2. Identifying the type of problems

This is not an essential skill, but it will give you an advantage over other candidates in an interview. If you have done a couple of projects and courses, you will be able to say a bit about which algorithm best suits a particular problem. For example, if we want to decide “whether to take up a course or not” based on factors like availability of seats, finances, location and safety, a decision tree would be a good choice. In the same way, to determine whether an email is spam or not, Naïve Bayes is often the most suitable, and so on.

3. Identifying issues with data

Once the machine learning algorithm has built the model, the model has to be evaluated. It may be over-fitting or under-fitting, and the dataset itself may be imbalanced. You should also know the model evaluation techniques; for classification, for example, there are several metrics: accuracy, precision and recall. Learn more about these in Google’s Machine Learning Crash Course.
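
A small scikit-learn sketch of those three metrics, using made-up true and predicted labels:

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # 0.75: share of correct predictions
print(precision_score(y_true, y_pred))  # 0.75: of predicted 1s, how many are real
print(recall_score(y_true, y_pred))     # 0.75: of real 1s, how many were found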

4. Deep learning

Once you have a fair idea of machine learning, you can get into the basics of deep learning. If you are from an engineering background, you might already have studied a bit of neural networks and fuzzy logic. If you have not, there is nothing to worry about: these are slightly advanced topics and are generally learnt with experience. If you wish to learn the basics, you can watch this YouTube video.

Step 9: GitHub


If you have heard of GitHub, you might be thinking, why do we have to mention a version management system in a data science roadmap?

GitHub has no direct relation to data science, but knowing it will help you build better code through community review and connect you with experts in the field. It is a repository hosting service that provides version control and basic task management for projects. GitHub hosts a huge community of millions of programmers, many working on open-source projects, who can view and review each other’s work.

Step 10: Data Science Projects

Now that we have put all the core learning pieces together, we have to learn from and develop projects. Many projects are already available through courses and tutorials, and you should study them to understand how they work. After that, try to create your own project to get a real feel for data science.

1. Movie recommendation system


This is a very popular use case of data science. You probably watch YouTube videos, Netflix etc. When you watch a particular set of videos, you get similar kinds of recommendations the next time around. How? Through recommendation systems, which work on each user’s preferences and the preferences of users with similar interests. Movie recommendation systems are based on collaborative filtering. You should definitely go for this guided project on Coursera.
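
To make the idea concrete, here is a tiny user-based collaborative-filtering sketch: find the user whose taste is most similar to yours and recommend what they rated highly (all ratings are made up):

import numpy as np

# Rows are users, columns are movies; 0 means "not rated yet".
ratings = np.array([
    [5, 4, 0, 1],  # you
    [5, 5, 4, 1],  # user 1: very similar taste
    [1, 0, 2, 5],  # user 2: opposite taste
], dtype=float)

def cosine(u, v):
    # Cosine similarity: closer to 1.0 means more similar taste.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Find the user most similar to "you" (row 0).
sims = [cosine(ratings[0], ratings[i]) for i in (1, 2)]
best = (1, 2)[int(np.argmax(sims))]

# Recommend unseen movies that the similar user rated highly.
unseen = np.where(ratings[0] == 0)[0]
recs = [int(m) for m in unseen if ratings[best][m] >= 4]
print(f"Most similar user: {best}; recommended movie index(es): {recs}")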

2. Chatbot


A chatbot is AI-based intelligent software that interacts with humans to solve common problems in natural human language. Chatbots can be rule-based or self-learning bots. They are one of the most popular use cases nowadays, so this would be a good project to take on. Learn here.

3. Text mining


Text mining is a process in which text from users is analysed to identify sentiment or other insights. For example, when a customer reviews a product, the tone of the text and the use of certain words let text mining determine how the customer feels about it. Text mining is used in many domains, like manufacturing, retail, healthcare, telecom and insurance. Check out the project here.

Create your own project

Once you get the hang of a few existing projects, you should try some of your own too. You can download datasets from free websites and frame any problem of your choice. There are plenty of open forums like GitHub and Stack Overflow, so even if you get stuck, there will be a lot of help on offer. Some examples of simple projects you can work on are:

  • Eating patterns of people in your community – for example, how many people often prefer eating food from outside, and on which days do most people order it?
  • In what ways do people keep themselves healthy – exercise, gym, yoga, diet? What is the percentage of healthy vs unhealthy people?
  • How is COVID-19 changing people’s lifestyle choices?

Step 11: Data science tools


Along the way, while learning the other skills, you will have come across many tools used for data science. Tools make it easy for developers to concentrate on the business logic, as they can do most of the maths and statistics for you. You do not have to master all the tools just yet, but knowing a bit about them won’t hurt. Read an overview of data science tools.

Step 12: Challenges and competitions

Taking part in online challenges and competitions will boost your confidence and help you build a network, and you can learn a lot from them. Several popular websites host such challenges regularly.

Step 13: Interview Preparation

By this time, you should be confident and thorough with at least a few concepts in data science. If you have done a few projects, brush up on those topics before the interview. Generally, a data scientist is interviewed on both data science and machine learning concepts, so you should be clear on both; revising with a few good resources beforehand helps.

Road ahead

That’s it for this article, guys! Before I end, one important note: along with all the skills above, you should also get the knack of strategic management, i.e. once you get insights from the data, you should be able to convert them into workable actions. That is when the full potential of data analysis, and the whole purpose of data science, is fulfilled.

I hope you now have enough information to kick-start your data science journey. The most important part of data science is cleaning the raw data, so practise that as much as possible. Another big hurdle is feature selection, which will not come easy: out of 1,000 features, you may need to select only a few important ones, because the rest will be redundant. Though there are methods and techniques for this, it mostly depends on the capability of the data scientist. This is where the creative, out-of-the-box problem-solving skills (that you must have read about everywhere) come into the picture. Start learning today, and master each topic at your own pace!
