Introduction to Data Analysis
Data has been one of the most talked-about words for quite some time now. Analyzing data means making sense of it using various techniques and tools. For example, if a child has trouble learning, you might collect information from their parents or peers about their behaviour in other settings, and then research what the underlying problem could be and why the child is not able to learn well. Based on the information collected from various sources and perspectives, you can identify patterns and understand behaviours, mood swings, emotions in different situations, and so on. All of this and more is called data analysis. With the right analysis, we can help solve the child’s learning problem and plan their learning.
Why Data Analysis
Nearly every industry is data-driven. Of the vast amount of data that users generate every day, putting even 1% to use can give a lot of insight into the prospects of a business. Data can be analyzed and used in many ways by businesses to get useful predictions and use cases for the future. For example, using the preferences and most visited pages of a user, Facebook suggests other similar pages, Amazon recommends products to users based on their shopping preferences, Netflix recommends movies based on what kind of movies a user generally prefers, sales representatives look at customers’ buying preferences to promote different products, and so on. This improves customer experience and gives better direction to businesses.
The Data Analysis Process
Before we get into techniques, we want to touch upon the entire data analysis process so that you can appreciate the techniques better. Data analysis involves the following steps:
- Understanding the problem statement (specification): identify why you need to perform data analysis and what kind of data you would need.
- Data collection: collect data from various sources, like surveys, databases, interviews, etc.
- Data cleaning and filtering: remove the unnecessary data, clean, sort, filter data
- Apply algorithms and statistical tools and techniques to gain insights
- Visualize the results, generate results, make predictions
- Find trends and make future business decisions
Data Analysis Techniques Classification
Data analysis techniques can be classified in many ways. The information on the web about this is overwhelming and sometimes confusing. You will commonly find the following classifications:
1. Initial and Main Data Analysis
Initial data analysis doesn’t talk about the main business problem at all. It is concerned with the following –
- Data quality
- Quality of measurements
- Mathematical transformations
- The success of the randomization process
- Characteristics of the data sample
- Documentation of the findings in this stage
- Nonlinear analysis
The techniques used for initial analysis are:
1.1 Univariate (single variable)
The simplest technique of all, involving just one variable. There are no relationships. For example,
|Age group||number of people|
|> 65 years||3|
The patterns in univariate analysis can be determined using central tendency, standard deviation, and dispersion methods (we will describe them later in this article).
1.2 Bivariate (correlations)
If there are two variables involved in your analysis, you tend to find correlations. The two variables can be independent or dependent. For example, you might want to categorize students based on their average marks and branch of study –
|Group #||Avg. Marks||Branch|
|Group 1||> 70||Computers and IT|
|Group 2||55-70||Electronics and Electrical|
|Group 3||45-55||Chemical Engg.|
Scatter plots, regression analysis, and correlation coefficients are the common types of bivariate analysis.
1.3 Graphical Techniques
Graphs provide quick visual references to view the characteristics of the entire data. Graphs are easier to comprehend than statistical equations. Some of the most useful plots are box plots, time series, scatter plots, histograms, probability plots, etc. These can easily identify outliers, ranges, relationships between variables and datasets, and much more.
1.4 Nominal and Ordinal Variables
Nominal and ordinal variables are used to subcategorize different types of data.
Nominal scale: used to label the data. For example,
What is your favourite colour? – yellow, blue, red, pink, orange.
What is your gender? – male, female
How is the drink? – hot, cold, warm
The labels having only 2 categories are called ‘dichotomous’.
Ordinal scale: sub-categorizes values based on their order. The size of the step from one value to the next is not well defined. For example, you might have seen surveys like:
How was the food? – bad, average, good, very good, excellent. You cannot quantify the difference between average and good, or good and very good, etc.
1.5 Continuous Variables
Continuous variables can be analyzed by comparing means. Such analysis is very useful in medical research, where measurements such as heart rate or blood pressure can take a wide range of values. Different types of statistical tests can be applied to the datasets depending on whether the data is normally or non-normally distributed.
Here is a table that summarizes the different tests –
Main data analysis is done to answer the question that led to the analysis and submit a report based on the research. It involves –
- Exploratory or Confirmatory Approach: In exploratory data analysis (EDA), datasets are analyzed visually to infer the main characteristics. EDA is the next step to initial data analysis and can use statistical methods as well. EDA is done just before applying algorithms for building models and if the results are not up to the mark, the data is again processed or cleaned. The most popular methods for EDA are:
- Graphical Techniques: Box plots, histograms, multi-variate charts, run charts, scatter plots, etc.
- Dimensionality Reduction: PCA, multilinear PCA, non-linear dimensionality reduction
- Quantitative Techniques: Trimean, ordination, median
- Stability of Results: The model applied to analyze the data should perform on new data about as accurately as it does on the training data. This can be checked through model validation (for example, cross-validation) and bootstrapping.
2. Quantitative vs Qualitative Data Analysis Techniques
2.1. Quantitative Data Analysis
Quantitative data can be expressed in numbers and can be discrete (like the number of questions in an exam, the number of words in a document, etc.) or continuous (the age of a person, the height of a building, etc.). It can be represented using graphs and tables. Quantitative data analysis provides limited outputs, such as identifying a problem or giving direct numerical answers to a question. For example, it cannot say whether a person failed or passed; it can only say that they got 45% marks based on some calculations. In simple words, quantitative data analysis alone cannot make any concrete decisions.
Quantitative Data Analysis Techniques
2.1.1 Frequency Distribution (Histogram)
Through frequency distribution, we can get a big picture of how often specific values occur in the dataset. The most popular tool for frequency distribution is the histogram. A common example of a frequency distribution is counting how many persons in a dataset fall into each income bracket, i.e. how many earn < 5 lakhs per annum (LPA), 5-10 LPA, 10-25 LPA, etc.
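The income-bracket example can be sketched in a few lines of Python. The income values below are made up for illustration, and the choice to treat each bracket's lower bound as inclusive is an assumption, since the text does not say where a value like exactly 10 LPA belongs.

```python
from collections import Counter

# Hypothetical annual incomes in lakhs per annum (LPA); made-up values.
incomes_lpa = [3.5, 4.2, 6.0, 7.5, 9.9, 12.0, 18.5, 24.0, 4.8, 11.0]

BRACKETS = ["< 5 LPA", "5-10 LPA", "10-25 LPA", ">= 25 LPA"]

def income_bracket(income):
    """Map an income to a bracket (lower bound inclusive, by assumption)."""
    if income < 5:
        return "< 5 LPA"
    if income < 10:
        return "5-10 LPA"
    if income < 25:
        return "10-25 LPA"
    return ">= 25 LPA"

# Count how many incomes fall into each bracket.
frequency = Counter(income_bracket(x) for x in incomes_lpa)

# Print a simple text histogram, one '#' per person.
for bracket in BRACKETS:
    print(f"{bracket:>10} | {'#' * frequency[bracket]}")
```

A plotting library would normally draw the bars; the text histogram here only shows the counting step that any histogram relies on.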
2.1.2 Descriptive Statistics
Central tendency (mean, median, mode) and dispersion are used to further analyze the data.
- Mean is the average value of the entire sample. It is very commonly used, especially when there are no outliers in the data.
- Median is the middle value of the entire distribution. For example, for the numbers 1 to 23, the median is 12, whereas for 1 to 22 it is 11.5 (the average of the two middle values, 11 and 12). If there is an outlier or the data values are unevenly distributed, the median is quite useful.
- Mode – when there are non-numeric data, we can use mode i.e. the most popular item in the distribution.
- Range: It is the difference between the highest and lowest value of the variable. If the range is high, it indicates that the values are more dispersed.
- Variance: It is the average of the squared differences from the mean.
- Standard deviation: indicates how much variation a variable exhibits from its mean value. It is the square root of variance.
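The measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on a made-up list of marks; it uses the population variance (`pvariance`), which matches the "average of the squared differences" definition, whereas `statistics.variance` would divide by n - 1 instead.

```python
import statistics

# Made-up sample of marks, for illustration only.
marks = [45, 55, 55, 60, 70, 75, 95]

mean = statistics.mean(marks)            # average value
median = statistics.median(marks)        # middle value of the sorted data
mode = statistics.mode(marks)            # most frequent value
value_range = max(marks) - min(marks)    # highest minus lowest
variance = statistics.pvariance(marks)   # average squared deviation from the mean
std_dev = statistics.pstdev(marks)       # square root of the variance

print(mean, median, mode, value_range)
print(round(variance, 2), round(std_dev, 2))
```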
2.1.3 Comparing Means (T-Tests)
The t-test is an inferential statistical method used to determine whether the difference between the means of two groups, which may be related in certain features, is statistically significant, i.e. whether a difference of that size is unlikely to have arisen from sampling variation alone.
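As an illustration, the sketch below computes the t statistic and degrees of freedom for two independent groups. It assumes Welch's variant of the test (which does not require equal variances), and the two groups of marks are made up. Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; in practice a routine such as SciPy's `scipy.stats.ttest_ind` with `equal_var=False` is typically used for that.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (divide by n - 1).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    se_a, se_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Made-up marks for two groups of students.
group_a = [72, 75, 78, 80, 85]
group_b = [60, 62, 65, 70, 73]

t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 2))
```

A larger absolute t value, for the given degrees of freedom, means the observed difference in means is less likely to be due to chance.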
2.1.4 Cross-Tabulation
Also called a pivot table in Excel, cross-tabulation is a popular technique in which the relationships and interactions between different variables can be clearly seen. It can analyze categorical data and present it in an easy-to-view manner.
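A cross-tabulation of two categorical variables can be sketched by counting how often each pair of categories occurs together. The records below (branch of study and exam result) are made up for illustration.

```python
from collections import Counter

# Made-up records: (branch, result).
records = [
    ("Computers", "pass"), ("Computers", "pass"), ("Computers", "fail"),
    ("Electronics", "pass"), ("Electronics", "fail"), ("Electronics", "fail"),
    ("Chemical", "pass"),
]

# Count each (branch, result) combination.
counts = Counter(records)
rows = sorted({branch for branch, _ in records})
cols = sorted({result for _, result in records})

# Print the table: one row per branch, one column per result.
print(f"{'':<12}" + "".join(f"{c:>6}" for c in cols))
for r in rows:
    print(f"{r:<12}" + "".join(f"{counts[(r, c)]:>6}" for c in cols))
```

Combinations that never occur, such as a Chemical student failing in this toy data, simply show up as 0.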
2.1.5 Correlation
Correlation is used to analyze the relationship between two variables. For example, a person who goes to work is quite likely to have little time to watch mega serials, so these two factors/variables are correlated (here negatively: more of one goes with less of the other).
|Correlation value||Relationship between the 2 variables|
|1||direct positive relation, i.e. increase in one variable will increase the other too.|
|-1||inverse or negative relation, i.e. increase in one variable will decrease the second and vice versa.|
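The two extreme values in the table can be reproduced with a small Pearson correlation function. This is a minimal sketch on made-up numbers; the function name `pearson_r` is just illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases with hours
falling = [10, 8, 6, 4, 2]   # decreases as hours increase

print(pearson_r(hours, rising))   # perfect positive relation
print(pearson_r(hours, falling))  # perfect negative relation
```

Real data rarely hits exactly 1 or -1; values closer to 0 indicate a weaker linear relationship.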
2.1.6 Linear Regression
Regression goes a step further than correlation: it fits a model to the data, shows how good the fit is, and helps in the statistical testing of the variables. Linear regression is the most popular and simplest form. It can be represented as y = ax + b, where y is the dependent variable, x is the independent variable, a is the slope that indicates the relationship between x and y, and b is the intercept. For example, a negative a indicates an inverse relationship between x and y.
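For a line y = ax + b, the ordinary least-squares estimates of the slope a and the intercept b can be computed directly from the data. The sketch below uses made-up points that lie exactly on y = 2x + 1, so the fit recovers those values; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx               # slope
    b = mean_y - a * mean_x     # intercept
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # generated from y = 2x + 1

a, b = fit_line(xs, ys)
print(a, b)

# Predict y for a new x using the fitted line.
print(a * 6 + b)
```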
2.1.7 Text Analytics
Text analysis comes in handy when there is a huge amount of data from which we need to find only the useful text. For example, text analytics can be used to scan thousands of resumes for certain keywords based on the requirements, such as digital, SEO, marketing, manager, tools, etc.
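A very simple version of the resume-scanning idea is exact keyword matching. The resume texts, the keyword set, and the threshold of two matches below are all hypothetical; real text analytics would usually also handle word variants (for example, "managed" vs "manager"), which this sketch deliberately does not.

```python
import re

# Hypothetical keywords taken from a job requirement.
required_keywords = {"seo", "marketing", "manager"}

# Hypothetical resume snippets.
resumes = {
    "resume_1": "Digital marketing manager with SEO and analytics experience.",
    "resume_2": "Backend developer skilled in Java and SQL.",
    "resume_3": "Marketing intern; managed SEO tools for a blog.",
}

def matched_keywords(text, keywords):
    """Return the keywords that appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return keywords & words

# Score each resume by the number of matched keywords.
scores = {name: len(matched_keywords(text, required_keywords))
          for name, text in resumes.items()}

# Keep resumes with at least two matches, best first.
shortlist = [name
             for name, score in sorted(scores.items(), key=lambda kv: -kv[1])
             if score >= 2]
print(scores)
print(shortlist)
```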
2.2 Qualitative Data Analysis
Qualitative data provides depth to the results of quantitative analysis. It can define a problem, generate new ideas, or make decisions. Qualitative data can be collected through group discussions, interviews, archived data, observations, case studies, etc. This type of analysis is exploratory and subjective.
Qualitative data analysis techniques are further divided into two approaches: deductive and inductive.
1. Inductive Analysis
Starts with a ‘question’ and the collection of relevant data, which are then used to identify patterns and arrive at a hypothesis or theory. This type of qualitative analysis is thorough and requires more time.
2. Deductive Analysis
Starts with a hypothesis or a theory that is then tested by collecting data and performing analysis. Since we already know what we are collecting data for, this type of analysis is quick.
Qualitative Data Analysis Techniques
2.2.1 Content Analysis
Content analysis is the process where verbal or behavioural data is categorized to classify, summarize, and tabulate the same. Using this technique, researchers can look for specific words or concepts in the entire content and identify patterns in communication.
2.2.2 Narrative Analysis
Involves the analysis of the same stories, interviews, letters, conversations told by different users, and identifying the context of the respondents while doing so. It is mainly used for knowledge management i.e. identifying, representing, sharing, and communicating knowledge rather than just collecting and processing the data.
2.2.3 Discourse Analysis
It is used for analyzing spoken language, sign language, and written text to understand how language is used for real-life purposes. Discourse can be argument, narration, description, or exposition. For example, a discourse between teacher and student, child and doctor, etc.
2.2.4 Grounded Analysis
In this approach, one single case is analyzed to formulate a theory. After that, additional cases are examined to determine if the theory works for those cases as well. This method uses inductive reasoning i.e. moves from the specific case to generalization.
2.2.5 Framework Analysis
Framework analysis is applied to projects that have specific questions and a limited time frame. This technique provides highly structured outputs of summarized data. For example, analysis of interview data. The researcher listens to the audio and reads the transcripts line by line, applies labels (or code) describing the important parts of the interview. Multiple researchers can work on the same data and then compare their labels or code to further narrow down the codes to analyze and apply for the rest of the transcripts. The code then acts as a framework. Since the interview data is qualitative, it is huge in volume. That’s why the data is charted into a matrix and summarized.
Steps to conduct qualitative data analysis –
- Applying codes
- Determining patterns, themes, and relationships
- Data Summarization
3. Mathematical and Statistical Techniques
3.1 Dispersion Analysis
This type of analysis helps us in determining the variation in the items. The dispersion measures the variation of items amongst themselves and then around the average. Variation means how spread out the entire data is. Some techniques for dispersion analysis are variance, standard deviation, interquartile range, box plots, dot plots, etc.
3.2 Regression Analysis
This is a powerful data analysis technique to find relationships between two or more variables. It is used to find trends in data. Regression can be linear, non-linear, logistic, etc.
3.3 Resampling
Consists of repeatedly drawing small samples from the original data sample. Rather than relying on analytical methods, it uses repeated experiments to build up a sampling distribution.
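One common form of this idea is the bootstrap, sketched below as an assumption about what repeated sampling looks like in practice: resamples of the same size as the original are drawn with replacement, and the mean of each resample is recorded. The data values, the number of resamples, and the fixed seed are all illustrative choices.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Means of resamples drawn with replacement from the sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [statistics.mean(rng.choices(sample, k=len(sample)))
            for _ in range(n_resamples)]

# Made-up measurements.
sample = [12, 15, 11, 19, 14, 13, 17, 16]

means = sorted(bootstrap_means(sample))

# Rough 95% interval for the mean: the 2.5th and 97.5th percentiles
# of the 1000 resampled means.
lower, upper = means[25], means[974]
print(statistics.mean(sample), lower, upper)
```

The spread of the resampled means gives an experimental estimate of how much the sample mean would vary, without any distributional formula.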
3.4 Descriptive Analysis
Gives a fair idea of the distribution of data, detects typos or outliers, identifies relationships between variables, etc., to prepare the data for further in-depth analysis. Descriptive analysis can be performed on three combinations of variables:
- Both variables are quantitative – in this case, we can use scatter plots
- Both variables are qualitative – we can prepare a contingency table
- One qualitative and one quantitative variable – we can calculate the summary for the quantitative variable classified using the qualitative variable and then plot box or whisker plots of the quantitative variable with the qualitative variable.
3.5 Factor Analysis
It is a data analysis technique that reduces the number of features or variables to a smaller number of factors. The technique uses eigenvalues as a measure of how much of the variance in the observed variables each factor explains. Factors with low eigenvalues are discarded.
3.6 Time Series Analysis
It is a data analysis technique to find trends in the time series data (data collected over a period represented in terms of the time or interval). Through this analysis, we can use an appropriate model for forecasting and hence make better business decisions.
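One simple way to expose a trend in time series data is a moving average, which smooths out short-term fluctuations. This is only one of many possible techniques and is chosen here for illustration; the monthly sales figures and the window size of 3 are made up.

```python
def moving_average(values, window):
    """Average of each consecutive run of `window` values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Made-up monthly sales figures.
monthly_sales = [10, 12, 11, 15, 14, 18, 17, 21]

trend = moving_average(monthly_sales, window=3)
print([round(t, 2) for t in trend])
```

The raw series dips in some months, but the smoothed series rises steadily, which makes the upward trend easier to see.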
3.7 Discriminant Analysis
This technique classifies data into smaller groups based on one or more quantitative variables. For example, a doctor can identify those patients who are at high, medium, or low risk of COVID-19 based on attributes like cough, fever, sneezing, etc.
4. AI and ML Techniques
4.1 Decision Trees
Decision trees split the data into subsets by asking relevant questions related to the data.
4.2 Principal Component Analysis (PCA)
This is a technique where the dimensionality of the data is reduced by rotating the data onto its principal directions. PCA considers the direction with the highest variance to be the most important.
4.3 Support Vector Machines (SVM)
In this data analysis technique, data is mapped to a higher-dimensional space to categorize the data points. Even if the data is not linearly separable in its original space, this mapping can help achieve a separation. The hyperplane is selected such that the classes are separated in the best possible manner. Amongst other applications, SVM is widely used in sentiment analysis to detect if the statement made by a person is positive, negative, or neutral. It learns to separate positive and negative words (happy, angry, sad, good, excited, etc.) using the training dataset, and then classifies any new statement.
4.4 Artificial Neural Networks
These are a series of algorithms that recognize the underlying relationships in a set of data in a way that mimics how a human brain works. ANN is inspired by biological neural networks. Such programs learn on their own without much external supervision or support. ANN algorithms are widely used in gaming, pattern recognition, face detection, image recognition, handwriting recognition, etc. They can also be used to detect diseases like cancer.
4.5 Fuzzy Logic
Most real-life situations are vague or fuzzy. For such uncertain cases, fuzzy logic provides flexible reasoning. For example, “is the dessert sweet?” can have answers like yes, a little, a lot, no. Here, little and lot tell us the degree of sweetness. Humans naturally think in such degrees: rather than just saying the dessert is sweet, a person would perhaps say it is less sweet, or too sweet. Fuzzy logic aims to bring the same way of thinking to a computer, which can help in decision making and dealing with uncertainties.
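Degrees like "a little" or "too sweet" are commonly modelled with membership functions that return a value between 0 and 1. The sketch below uses triangular membership functions; the category names, the use of grams of sugar as the input, and all the breakpoints are assumptions made purely for illustration.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def sweetness_memberships(sugar_grams):
    """Degree to which a dessert belongs to each (hypothetical) category."""
    return {
        "a little": triangular(sugar_grams, 0, 10, 20),
        "sweet": triangular(sugar_grams, 10, 20, 30),
        "too sweet": triangular(sugar_grams, 20, 30, 40),
    }

print(sweetness_memberships(15))
print(sweetness_memberships(30))
```

A dessert with 15 g of sugar is partly "a little" sweet and partly "sweet" at the same time, which is exactly the kind of overlap that a strict yes/no answer cannot express.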
4.6 Simulation Analysis
Getting real-time data for training and testing ML algorithms can become a challenge, especially for critical projects where high-quality data is required and you are unable to judge which algorithm is best to use. Using simulated data gives you control over the features, volume, and frequency of data. You can analyze simulated data to identify the right algorithm for a particular problem.
4.7 Market Segmentation Analysis
It is the study of customers by dividing them into groups based on their characteristics like age, gender, income, lifestyle preferences, etc. This way, companies can focus on a smaller group, known as the target audience, for promoting their products. For example, men in their early 30s are more likely to use electronic products than men above 65 years. K-means clustering and latent class analysis are some popular methods for market segmentation analysis.
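To make the k-means idea concrete, here is a minimal one-dimensional sketch that segments customers by age alone. The ages, the three starting centroids, and the fixed number of iterations are illustrative assumptions; real segmentation would use several features and a library implementation.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Basic k-means on a list of numbers; returns (centroids, clusters)."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Made-up customer ages.
ages = [22, 25, 27, 31, 33, 35, 62, 66, 70]

centroids, clusters = kmeans_1d(ages, centroids=[20, 40, 60])
print(centroids)
print(clusters)
```

Each resulting cluster is a candidate segment, for example younger, middle, and older customers, that can be targeted separately.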
4.8 Multivariate Regression Analysis
This data analysis technique measures the degree of linearity between one or more independent variables with one or more dependent variables. It is an extension of linear regression and involves selecting the features, normalizing the features, selecting the hypothesis and cost function, and minimizing the cost function and testing the hypothesis. It is implemented using matrix operations. For example, a doctor wants to analyze the health of people of various age groups based on their eating patterns, exercise, and sleep. She collects all this data to analyze how a healthy diet, exercise, and discipline can lead to a healthy lifestyle.
5. Visualization and Graph Techniques
5.1 Charts and plots
5.1.1 Column and bar chart
These charts are used to represent numerical differences between different categories. A bar chart can be plotted horizontally or vertically. Vertical bar charts are called column charts. Each category in the chart is represented by a rectangle (bar) where the height of the rectangle represents the value.
5.1.2 Line chart
The data points in line charts are connected by straight line segments to form a continuous line. Line charts are used to show trends over a period of time.
5.1.3 Area chart
Area charts are used to depict time-series relationships. In area charts, data points are plotted using line segments. The area between the line and the x-axis is filled with colour or shading.
5.1.4 Pie chart
It is a circular chart that represents proportions of data based on various categories. The arc length of each slice (and therefore its angle and area) is proportional to the share it represents compared to the other slices in the chart.
5.1.5 Funnel chart
As the name suggests, it is a funnel-shaped chart and often represents stages and potential revenue per stage in a sales process. We can visualize the progressive reduction in data as it passes through each stage. The top of the chart is the broadest while the bottom is the narrowest.
5.1.6 Scatter plot
Scatter plots show the relationship between two sets of numeric variables as X-Y cartesian coordinates. The values of the variables are represented using dots.
5.1.7 Bubble Chart
Bubble charts add dimension to the data; hence you can represent the relationship between three variables. The third dimension is represented by the size of the bubble (circle).
5.1.8 Gantt Chart
It is a horizontal bar chart often used for project scheduling, planning, tracking tasks, and overall project management. It can take in complex information about the project and display it in an easy-to-understand way.
5.1.9 Frame Diagrams
Using frame diagrams, we can represent the hierarchy in the form of an inverted tree structure.
5.2 Maps
5.2.1 Heat map
Uses colour as a visualization tool for data analysis. For example, if you want to know the red zones where COVID-19 is more prevalent, the heat map will show it in an easy-to-visualize manner.
5.2.2 Point Map
It stores spatial geographical information represented using points. The points can be identified using class names, values, or ID.
5.2.3 Flow map
Flow maps show the movement of objects/people from one location to another. For example, migrant workers moving from one state to another during this lockdown period can be tracked using flow maps.
A treemap (rectangular tree diagram) represents a hierarchical structure as a big rectangle. Each smaller rectangle inside the big rectangle indicates the proportion of a variable.
Excel and Tableau are among the most popular tools for data analysis through visualization.
Tools for Data Analysis
- R/Python: Powerful and flexible, these programming languages have rich libraries that can be used to perform complex calculations, generate graphs, and perform predictive analysis. R particularly has loads of libraries for statistical analysis, such as regression analysis, cluster algorithms, and more.
- KNIME: It is an open-source tool that can perform data pre-processing, cleansing, ETL, analysis, and visualization. This is a great tool for those who have less programming knowledge but have good business know-how.
- Tableau: Tableau is a powerful analysis and visualization tool. You can load data from various sources and drag and drop elements to analyze data from various perspectives.
- SAS: SAS is another great tool for non-programmers that can perform data preparation and integration of data from various sources. SAS tools use the SAS programming language as the base and have complex data analysis capabilities.
- Excel: Excel is everyone’s favourite. Well, almost. Starting your data analysis journey with Excel will help you get a feel for charts, filters, and other data preparation and transformation activities more easily and quickly. Excel works well in providing insights for smaller data sets and can be integrated with SQL for real-time analysis.
- Fine Report: It is a nice drag-and-drop tool that has several options for data analysis. The format is similar to Excel; however, the tool provides visual plug-in libraries and a variety of dashboard templates to design different types of reports.
Challenges in Accurate Data Analysis
Data analysis comes with its challenges, the good part being they are all solvable. Some challenges are:
- Overwhelming amounts of data: There is so much data generated every day that it is difficult to know what is required and what is not. An automated system that organizes data and eliminates unnecessary data can address this challenge.
- Data from multiple sources: There might be different sets of data, located in different places and formats. Missing any key factor can lead to incorrect analysis. The data collection stage is thus extremely important.
- Visual representation of data: Data represented as graphs, charts, or other visualizations is easier to understand. Doing this manually, for example gathering information from various sources and then creating a report, is quite slow and tedious. However, various tools make the life of a data analyst easier.
- Cognitive Biases: It is possible that people unintentionally tend to ignore or overlook information that doesn’t support their personal views and concentrate on information that supports their views. Such kinds of assumptions and uncertainty can lead to biased analysis.
- Lack of Relevant Skills: Sometimes, even experienced professionals may not have the necessary skills or experience in some techniques. This could lead to longer analysis time and less accuracy. The solution for this is to hire people with the right skills and to use analysis systems that are easily usable and understandable.
Examples of Data Analysis
Some common examples of data analysis are:
- Business intelligence
- Smart buildings
Check out Wikipedia to know more about these examples.
In this article, we have seen various types of data analysis techniques and how each is unique in achieving business solutions. To perform accurate data analysis, we need a strong team with good analytical skills that performs thorough data collection, determines the statistical significance or goal, ensures the credibility of data and analysis methods, and reports the extent to which analysis was performed.
Once the analysis is performed, we should check for reliability and consistency of the created results using techniques like cross-validation and sensitivity analysis.
Before you choose the technique to use, you should take into account the scope and nature of work, budget, infrastructure constraints, final reports to be generated, etc.
So, which data analysis technique are you going to use for your project?