Before we explore the skills a data scientist needs, you should be clear that tools and skills are different things. Knowing data science tools is an advantage, but you don’t need to know loads of them. The right skills, however, are a basic requirement. For example, there are many tools to visualise data, but deciding what kind of data has to be visualised is up to the data scientist! You can learn the tools on the go, but you need the skills right from the beginning!
A little bit of Data Science
Data science is focussed on getting meaningful insights from data received from various sources. For example, Facebook generates huge amounts of user data every day, and by analysing this data, Facebook can understand more about its users: their likes and dislikes, activities, comments, groups etc. This helps the company with focussed advertising and with improving its own services too.
To know the skills required for data science, we need to understand the various sub-areas of data science. Data science involves a series of processes and steps, and you can specialise in any or all of them. It is always a good choice to have a thorough knowledge of more than one phase. Here are the steps involved in data science:
- Looking at the business opportunity (defining the business problem)
- Finding data relevant to the opportunity (data collection)
- Clean, filter and sort data (data cleaning)
- Perform exploratory analysis (Exploratory data analysis)
- Train the model and check accuracy (machine learning)
- Visualising data (data visualisation)
- Measure success (making business decisions)
The first two steps require you to have good business acumen, the next few require you to be technically sound, and for all these steps you need problem-solving skills, curiosity and good communication skills.
Top skills you need to be a data scientist
There are different types of skills you need, technical and non-technical, apart from foundational skills like math, statistics, computing and domain knowledge. As a data scientist, you should combine the skills of a data analyst, a machine learning engineer and a data engineer! For example, a data analyst need not know machine learning concepts, but as a data scientist, you should have machine learning knowledge too. Here are some very important skills you need:
1. Math skills
I have seen many people loathe the very name of maths from childhood. Maybe what they learnt as part of academics was purely theoretical, and they never understood where all of it was supposed to be applied. Well, you are in luck, because much of the math that you have studied before is put to practical use in data science. The following are math essentials you need to know:
Linear algebra: Linear algebra forms the core of machine learning algorithms. At its simplest, it deals with systems of linear equations that are solved for variables such as x and y. For example, 2x+y = 1, x+2y = 1, what are x & y? Rings any bells?
Of course! The above equations are called linear equations, and they are the starting point of linear algebra. These linear equations can also be represented in the form of matrices or vectors. For example, a matrix representation would be:
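$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$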
A linear equation with two variables is a line, and we need only two dimensions to represent it. But we can have any number of variables. For n variables, we need n dimensions, and it is easier to represent that with a matrix. We can perform many operations on a matrix like addition, subtraction, multiplication, inversion and so on. Even the concepts of Singular Value Decomposition, eigenvectors and eigenvalues, which are very useful for Principal Component Analysis, are based on matrices. You can learn about linear algebra and matrices through this free MIT course.
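To make the matrix idea concrete, here is a minimal sketch that solves the two equations from the example above (2x+y = 1 and x+2y = 1). It uses only the Python standard library and Cramer's rule; the function name solve_2x2 is made up for this illustration, and in real projects you would more likely hand the same matrix to NumPy's numpy.linalg.solve.

```python
from fractions import Fraction


def solve_2x2(a, b):
    """Solve the 2x2 linear system a * [x, y] = b using Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21  # determinant of the coefficient matrix
    if det == 0:
        raise ValueError("matrix is singular; there is no unique solution")
    x = Fraction(b[0] * a22 - a12 * b[1], det)
    y = Fraction(a11 * b[1] - b[0] * a21, det)
    return x, y


coefficients = [[2, 1], [1, 2]]  # left-hand side of 2x + y = 1, x + 2y = 1
constants = [1, 1]               # right-hand side

x, y = solve_2x2(coefficients, constants)
print(x, y)  # 1/3 1/3
```

Exact fractions are used here so that the answer can be checked by hand: substituting x = y = 1/3 satisfies both equations.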
Calculus: For optimisation, you need to know concepts like derivatives and gradient descent, an optimisation method built on derivatives. These concepts are used in machine learning and deep learning algorithms. Other than these, univariate and multivariate calculus are extensively used, for example in regression and in the back-propagation of neural networks. Specifically, you should know limits, continuity, the mean value theorem, maxima, minima, partial derivatives, differential equations, infinite series summation, the chain rule and so on.
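As a rough illustration of how derivatives drive optimisation, the sketch below runs gradient descent on a simple function, f(x) = (x - 3)^2, whose derivative is 2(x - 3). The function, the starting point and the learning rate are all assumptions chosen for the example, not values from any particular algorithm.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# f(x) = (x - 3)^2 has its minimum at x = 3; its derivative is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))  # 3.0
```

Each step moves x a little way downhill, so the distance from the true minimum shrinks by a constant factor per iteration. Training a neural network applies the same idea to many parameters at once.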
Statistics: Statistics is used in almost every phase of data science. From data summaries using measures of central tendency like the mean, median and mode, to analysis using variance, covariance, correlation, etc., many statistical operations are performed on the data. Sampling, random number generation, measurement etc. are applied to almost every dataset. In fact, programming languages like Python and R are so popular for data science because they offer many statistical functions that make a data scientist’s job easy.
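Here is a small sketch of those summaries in Python, using the built-in statistics module. The two lists are made-up sample data, and covariance and correlation are written out by hand so the formulas stay visible.

```python
import statistics

# Hypothetical sample data: hours studied and the matching exam scores.
hours_studied = [2, 4, 4, 5, 7, 8]
exam_scores = [50, 60, 62, 68, 80, 85]

mean = statistics.mean(hours_studied)
median = statistics.median(hours_studied)
mode = statistics.mode(hours_studied)
variance = statistics.variance(hours_studied)  # sample variance (n - 1)


def covariance(xs, ys):
    """Sample covariance: how two variables vary together."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled to the range -1..1."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


print(mean, median, mode, variance)
print(round(correlation(hours_studied, exam_scores), 3))
```

A correlation close to 1 here simply says that, in this invented sample, more hours go together with higher scores.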
Probability: Where there is statistics, there is probability. Concepts like expected value, likelihood estimation, conditional probability, random variables, probability distributions (normal, binomial, chi-square), probability calculus, Bayes’ theorem, the central limit theorem and hypothesis testing are all extremely important for machine learning and data analysis.
For all the probability and statistics that you need, refer to our detailed article on statistics and probability concepts.
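To see conditional probability and Bayes’ theorem at work, consider this minimal sketch. The numbers (a 1% prior, a 95% detection rate and a 5% false positive rate) are hypothetical and chosen only to show the calculation.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(condition | positive result) using Bayes' theorem."""
    # Total probability of seeing a positive result at all.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical test: 1% of people have the condition, the test detects
# 95% of true cases and wrongly flags 5% of healthy people.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, only about 16% of positive results are true cases in this example, because the condition itself is rare. Catching this kind of counter-intuitive result is exactly why probability matters in data analysis.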
Discrete math: Since we mostly use programming languages for performing all the logic and computing, we often don’t worry much about discrete math concepts. However, if you are new to programming as well, it wouldn’t hurt for you to learn these concepts before you start. These will give you a thorough understanding of the internal working of various data structures, graphs etc. Some important concepts to learn are:
- Sets, stack, queue, graphs, array, hash table, trees
- Graphs, properties of graph, degree
- inductive, deductive and propositional logic
- Big-O notation (for example O(n)) and growth of functions
Although discrete math may seem dry in the beginning, you will enjoy learning through this beginner level discrete mathematics course from Coursera.
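Several of the concepts listed above (sets, hash tables, queues, graphs and degree) fit into one short sketch. The graph below is made up for illustration; it is stored as a dictionary of sets, and a breadth-first search with a queue finds the shortest path length between two nodes.

```python
from collections import deque

# An undirected graph as an adjacency list: a hash table (dict) of sets.
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}


def degree(graph, node):
    """The degree of a node is the number of edges touching it."""
    return len(graph[node])


def shortest_path_length(graph, start, goal):
    """Breadth-first search; runs in O(V + E) for V nodes and E edges."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # goal is not reachable from start


print(degree(graph, "D"))                     # 3
print(shortest_path_length(graph, "A", "E"))  # 3
```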
2. Programming skills
As a data scientist, you will have multiple roles to play. You would be involved in application development, data management, application testing, applying algorithms and much more. Programming skills will help you think critically and arrive at useful insights by asking the right questions. R & Python are the most popular programming languages because of their rich set of libraries and ample support for statistical analysis.
Python: Python is easy to code and read. It is easy to learn, and if you know any other programming language, learning Python will be a breeze for you. You should be thorough with the syntax, object-oriented programming concepts, flow control, rich libraries like pandas, NumPy, scikit-learn etc. You can check all about the important Python data science libraries on our website.
R: R is another language popularly used for data science. Python and R are often compared; however, both have their advantages and limitations. R can easily perform all the steps of data science from cleaning the data to machine learning. R works natively with vectors and has a rich set of libraries for graphs. Some popular packages are ggplot2, dplyr, janitor, shiny etc. To learn more about R, read our article on R for data science.
SQL: Since data science has everything to do with data, knowledge of SQL is essential. SQL forms the base of data collection, wrangling, sorting, filtering, extraction, exploratory data analysis, and other processing before any algorithms are applied to the data. You should know about SELECTing columns and renaming them with AS, aggregate functions like AVG, MAX, MIN, SUM etc., filtering rows using WHERE with BETWEEN, IS NULL, LIKE etc., grouping using GROUP BY, filtering groups using HAVING, and sorting using ORDER BY.
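The query below is a small sketch that puts several of these clauses together. It runs against an in-memory SQLite database through Python's built-in sqlite3 module; the orders table and its rows are invented for the example.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Asha", "Pune", 120.0),
        ("Asha", "Pune", 80.0),
        ("Ben", "Delhi", 40.0),
        ("Ben", "Delhi", 30.0),
        ("Chen", "Pune", 300.0),
        ("Dee", "Delhi", None),  # missing amount
    ],
)

query = """
    SELECT customer,
           COUNT(*)    AS order_count,
           AVG(amount) AS avg_amount,
           SUM(amount) AS total_amount
    FROM orders
    WHERE amount IS NOT NULL            -- filter rows first
      AND amount BETWEEN 25 AND 500
    GROUP BY customer                   -- one result row per customer
    HAVING SUM(amount) > 100            -- then filter whole groups
    ORDER BY total_amount DESC          -- finally sort the result
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note the order of evaluation: WHERE removes individual rows before grouping, while HAVING removes whole groups after the aggregates are computed. That is why Ben, whose orders total 70, does not appear in the result.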
While you won’t be selling a product, balancing a budget or creating an advertisement, you should be thorough with the domain you are working in. That’s because you need to know which insights to derive to solve a particular problem. You should be able to look at the bigger picture of the whole project and understand why it was started in the first place.
Technical skills (Hard skills)
The above skills were technical enough, but they are the basic, mandatory ones. The skills below are specific to data science; knowing them will surely give you an advantage, and if you have decided to be a data scientist, you should be familiar with them:
1. Data wrangling
Data wrangling or data munging is the process where raw data is transformed into a more useful set which can be analysed. Raw data is often complex and unstructured and has a lot of missing values, which need to be handled before the data can be understood properly. Wrangling also includes discovering datasets from different sources and structuring them before they are cleaned up. As a good data scientist, you should know how to integrate your data from different sources, improve the quality of data and validate the dataset so that you can get better insights from it. Data wranglers use SQL extensively! Some tools like Excel, Google Dataprep and OpenRefine are quite popular for data wrangling. Python provides libraries like NumPy and pandas (with Matplotlib for quick visual checks) for data wrangling. In R, packages like dplyr, purrr and jsonlite are some good tools for data wrangling.
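The following sketch shows these wrangling steps on a tiny scale: integrating two sources, tidying text and filling in missing values. The records, the field names and the choices to fill a missing age with the median and a missing spend with 0.0 are all assumptions for illustration; with real data you would typically do the same work with pandas or dplyr.

```python
import statistics

# Two hypothetical sources describing the same customers.
crm_records = [
    {"id": 1, "name": " alice ", "age": "34"},
    {"id": 2, "name": "BOB", "age": ""},  # missing age
    {"id": 3, "name": "Carol", "age": "29"},
]
billing_records = [
    {"id": 1, "spend": 250.0},
    {"id": 3, "spend": 90.0},
]


def wrangle(crm, billing):
    """Merge both sources, tidy names and fill in missing values."""
    spend_by_id = {row["id"]: row["spend"] for row in billing}
    known_ages = [int(row["age"]) for row in crm if row["age"].strip()]
    median_age = statistics.median(known_ages)

    cleaned = []
    for row in crm:
        age = int(row["age"]) if row["age"].strip() else median_age
        cleaned.append({
            "id": row["id"],
            "name": row["name"].strip().title(),       # consistent formatting
            "age": age,                                # missing age -> median
            "spend": spend_by_id.get(row["id"], 0.0),  # assumed: no record means 0.0
        })
    return cleaned


for record in wrangle(crm_records, billing_records):
    print(record)
```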
2. Machine learning
Well, machine learning is the heart of data science. Building a model that identifies the most important patterns and trends is the main purpose of data analysis and helps make appropriate business decisions. Learning the different types of machine learning (supervised, unsupervised and reinforcement learning) and understanding when to use which algorithm is the key to getting good results with less time and fewer resources. Some popular algorithms that are easy for you to start with are:
- Linear regression
- Logistic regression
- Decision tree
- k-Nearest Neighbor
- Naïve Bayes
Most machine learning algorithms can be implemented using the libraries provided by Python (SciPy, scikit-learn) and R (caret, kernlab, with dplyr for data preparation), helping machine learning engineers focus on their business logic rather than worrying about the details of building, training and validating the model.
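To give a feel for what such a library does under the hood, here is a bare-bones sketch of one algorithm from the list above, k-Nearest Neighbor, followed by a simple accuracy check. The data points and labels are invented, and a real project would rely on a tested implementation such as the one in scikit-learn.

```python
import math
from collections import Counter


def knn_predict(training_data, point, k=3):
    """Label a point by majority vote among its k nearest training points."""
    neighbours = sorted(
        training_data, key=lambda item: math.dist(item[0], point)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: (features, label).
training_data = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((2.0, 1.5), "small"),
    ((6.0, 6.5), "large"),
    ((7.0, 7.0), "large"),
    ((6.5, 8.0), "large"),
]
test_data = [((1.2, 1.4), "small"), ((6.8, 7.2), "large")]

# Accuracy: the share of held-out points the model labels correctly.
correct = sum(
    knn_predict(training_data, features) == label
    for features, label in test_data
)
accuracy = correct / len(test_data)
print(accuracy)  # 1.0
```

Keeping the test points separate from the training points is the important habit here: accuracy measured on data the model has already seen says very little.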
3. Data visualisation
It is usually better to represent data graphically. Firstly, it is easier to understand trends and patterns when they are shown graphically, and secondly, a chart gives a neat and clear distinction between various trends and important aspects of the data. For example, a simple bar chart enables us to understand how data is grouped into various categories. In the same way, a line plot can be used to show the relationship between two variables. Some other common plots that help visualise data are histograms, scatter plots, box and whisker plots, correlation matrices, word clouds etc.
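The idea behind a bar chart, counting how many items fall into each category, can be sketched in a few lines. The example below draws the bars as plain text so that it does not assume any plotting library; the department data is made up, and a real chart would normally come from a tool like Matplotlib, Excel or Tableau.

```python
from collections import Counter


def text_bar_chart(values):
    """Return one text line per category, most frequent first."""
    counts = Counter(values)
    width = max(len(str(category)) for category in counts)
    return [
        f"{str(category).ljust(width)} | {'#' * count} {count}"
        for category, count in counts.most_common()
    ]


# Hypothetical data: the department of each employee in a small team.
departments = ["sales", "it", "sales", "hr", "sales", "it"]

chart = text_bar_chart(departments)
for line in chart:
    print(line)
```

Even in this crude form, the grouping is visible at a glance, which is the whole point of visualisation.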
Some good tools for visualisation are Excel and Tableau. Know which one is better for your project through this article on Excel vs Tableau. Other powerful visualisation tools are Power BI, Google Analytics and D3.
4. Software engineering
Having some software engineering experience will help in your data science work. Even just knowing what a software engineer does will show you which practices to pick up as you prepare for your journey. Of course, a software engineer is usually an expert in at least one programming language. A software engineer can be a good asset in a data science team because of the following skills (usually by default):
- Writing reusable, modular code by following best practices, code refactoring
- Writing proper comments and documenting the code steps
- Knowledge of version control
- Unit testing of code
- Logging and various levels of logging
- Collaboration with the other teams for build, deployment and testing
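A few of the practices above (a small reusable function, documentation, logging and unit tests) can be sketched together as follows. The function normalise and its behaviour for identical values are assumptions made for this example.

```python
import logging
import unittest

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def normalise(values):
    """Scale a list of numbers to the range 0..1 (min-max scaling).

    Raises ValueError for an empty list. If all values are identical,
    a list of zeros is returned and a warning is logged.
    """
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if low == high:
        logger.warning("All values are identical; returning zeros")
        return [0.0 for _ in values]
    logger.debug("Scaling %d values from range %s..%s", len(values), low, high)
    return [(value - low) / (high - low) for value in values]


class NormaliseTest(unittest.TestCase):
    def test_scales_to_unit_range(self):
        self.assertEqual(normalise([10, 20, 30]), [0.0, 0.5, 1.0])

    def test_identical_values_become_zeros(self):
        self.assertEqual(normalise([5, 5]), [0.0, 0.0])

    def test_rejects_empty_input(self):
        with self.assertRaises(ValueError):
            normalise([])
```

Saved in a file, the tests can be run with python -m unittest. The warning and debug calls show two different logging levels: with the level set to INFO, the warning is printed while the debug message stays hidden.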
You might be wondering what role the cloud has to play in data science. Well, the cloud connects everything via the internet and makes it easily accessible. For example, through cloud technologies we can access database servers, data analytics and other software from anywhere, no matter where in the world they reside. This makes data access easy, fast and affordable. Data need not be stored locally: filtering, extraction, sorting etc. can all be performed without transferring data from central servers to local systems, saving loads of bandwidth (remember, we are talking about huge datasets). Delivering data and analytics over the cloud is popular enough to have its own name: Data as a Service (DaaS). In this service, data vendors provide data storage, data processing, data integration and analytics services to companies over a network connection. Some popular cloud computing platforms for data science are Amazon Web Services (AWS), Microsoft Azure and Google Cloud.
Soft skills (non-technical skills)
Soft skills are as important as technical skills but cannot be acquired by training or reading books. These come with hands-on experience. You have to work on projects (in a company or from internet resources, courses etc.) to practically understand the common issues, important highlights and how to extract important information from data.
Data intuition: This is the skill that sets a data scientist apart from a data analyst. While a data analyst works on the facts and data in front of them, data scientists think one step ahead. A data scientist has a keen eye for identifying the important parts of data with just a high-level look at it. Critical, out-of-the-box thinking and data-driven problem-solving skills are very important for a data scientist to be able to look at data from different perspectives.
1. Communication skills
A data scientist should be able to communicate the findings and reports to the stakeholders and decision-makers. Good communication skills can help you put forward your point effectively and influence major decisions. Good communication doesn’t just mean command over English; it is the ability to explain your points in the simplest possible manner so that your audience can relate to and understand your presentation or report.
2. Business sense
To explain your findings and insights to various stakeholders, you should know the business & technical jargon well. This will build trust and confidence in your work and help you present your points in a much better manner. Remember, as we said earlier, these are skills that come with time and experience.
Knowing tools is an advantage
While you can do everything on your own from scratch, using tools will help you save a lot of time and resources. Tools can make tasks easier. For example, with the data visualisation tool Tableau, you can view different insights just by drag and drop. This will help you understand data better, and in less time too. In the same way, there are a lot of machine learning libraries and packages that have the common functionality in place, so that you don’t have to code it from scratch. That is why languages like Python and R are preferred for data science: they have loads of libraries to perform common math and statistical tasks. Our comprehensive list of data science tools will help you learn the tools you need for each stage of data science.
How to start learning data science skills?
There are many free and paid courses on the internet. You can also buy some good books to get the basic knowledge and then go ahead with the courses. If you have a bit of programming knowledge and are from a math/statistics background, I would suggest you start self-learning and exploring before joining any big courses. If you do not have any programming or math background, start with those concepts first. Pick any one language and learn the basics of math and statistics. Here is how you can start learning:
- Books and videos: Read books on data science, offline or online. Many books are available for free online. Here is a list of the best data science books. Complement your learning through videos and tutorials on YouTube and other sites for the topics on which you seek more information.
- Coding: Be thorough with at least one popular data science programming language. If you already know C++/Java, it will be easier for you to learn Python or R. Further, knowledge of SQL is quite important. Start learning with Google’s Python course.
- Visualisation tools: Visualisation tools can help you get the data scientist mindset. You will be able to view data easily, explore its various aspects and find unexpected insights as well. Learn more about data science tools.
- Work on projects: Try as many projects as possible. Datasets are freely available on the internet. You can also think of your own business problem and work on it.
- Certifications: If you are planning to make data science your career path, certification will help you stand out from others. Apart from the learning itself, you will be able to set yourself apart with certificates from data science courses provided by Coursera, Udemy, edX and other online learning platforms.
- Having a basic understanding of the cloud environment will be a huge plus point.
We have covered a lot of skills, but remember that the list can never be complete. There are many other small and big skills required to be proficient. However, you cannot learn all of them at once. Some are learnt on the go, and you should be prepared to face new challenges and pick up new skills every other day! Some skills are perhaps not as important as the ones we discussed but are surely nice to have: cloud, deep learning, Hadoop/Spark, TensorFlow etc. Your desire to consistently learn and upskill yourself will help you become a highly skilled data scientist!