Data Visualization is one of the most significant skills that a Data Scientist can possess, and it is not a field that a Data Scientist can ignore or underrate. Data Visualization is an ocean in itself, and there are a number of courses and resources devoted to this specific field of Data Science. Many universities offer Data Visualization courses, and some labs focus on Data Visualization alone.
Today we deal with Big Data, and the data we get is generally messy and uncleaned. As Data Scientists, we process the data, filter it, and extract the valuable parts from the bulk. We then visualize what we have extracted so that we can communicate what the data says and take the appropriate actions; that is what Data Science is all about.
What is Data visualization?
There are many definitions of Data Visualization, and two of the most popular are:
“The use of computer-supported, interactive, visual representations of abstract data to amplify cognition” [Card et al., 1999]
“The representation and presentation of data to facilitate understanding” [Kirk, 2016]
Both definitions center on representing and presenting data, and both end by naming the purpose of Data Visualization: to amplify cognition and to facilitate understanding.
How does Data Visualization help in understanding data?
Suppose you have raw data made up of x and y coordinates. Just by looking at the numerical values, it might take you 4 to 5 minutes to work out the relationship between x and y. But if those coordinates are plotted on a graph, you can tell the relationship within a second, and this shows the power of Data Visualization.
With Data Visualization we can understand the flow of data. If the data is visualized properly, you do not need to be a master of mathematics or statistics to find the relationship between the data values. You can just look at the graph or other visual presentation and pick out the insight in the data.
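As a minimal sketch of this idea, the snippet below prints a small set of coordinates as raw numbers and then plots the same numbers. The values are made up for illustration (y is simply the square of x), and the output file name is an assumption.

```python
import matplotlib.pyplot as plt

# Illustrative coordinates (made up for this sketch): y is the square of x.
x_values = [1, 2, 3, 4, 5, 6, 7, 8]
y_values = [x ** 2 for x in x_values]

# Raw numbers: the relationship is not obvious at a glance.
for x, y in zip(x_values, y_values):
    print(x, y)

# The same numbers on a graph: the upward curve is visible immediately.
fig, ax = plt.subplots()
ax.plot(x_values, y_values, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("relationship.png")  # or plt.show() in an interactive session
```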
Data Visualization in Python using Matplotlib
Python is popular for its data science and data visualization libraries. In Python, you will find many open-source libraries that can plot graphs from the data you pass to them.
In this article, we cover only the Python Matplotlib library and go through some of its basic plotting styles. Matplotlib is the de facto standard Python data visualization library, and it is highly compatible with other Python Data Science libraries such as Pandas, NumPy, and scikit-learn. Even tutorials for PyTorch, one of the most popular Machine Learning libraries, use matplotlib to plot graphs.
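To illustrate the compatibility with NumPy, here is a small sketch that passes NumPy arrays straight to matplotlib without converting them to lists. The sine curve and the output file name are only illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 evenly spaced points between 0 and 2*pi, stored in a NumPy array.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y)  # NumPy arrays are accepted directly
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
fig.savefig("sine.png")  # or plt.show() in an interactive session
```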
Install Python Matplotlib library for Data Visualization
Matplotlib is an open-source library, and you can easily install it using the Python pip command.
pip install matplotlib
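One simple way to confirm that the installation worked is to import the library and print its version. If the import raises an ImportError, the package was most likely installed for a different Python interpreter than the one you are running.

```python
import matplotlib

# Prints the installed version string, for example "3.x.y".
print(matplotlib.__version__)
```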
Plot a line Graph using Python Matplotlib library
Line graphs are very straightforward and only need x and y coordinates to plot the graph. To plot a line graph we use the plot() method with the x and y coordinate data. The plot() method also accepts some optional parameters such as color, marker, markersize, etc.
import matplotlib.pyplot as plt

x_points = [10, 20, 30, 40, 50, 60]
y_points = [100, 292, 68, 300, 50, 70]

# Green line with a star marker at every data point
plt.plot(x_points, y_points, color="green", marker="*", markersize=10)
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.show()
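Beyond color and marker, plot() also accepts parameters such as linestyle, linewidth, and label. The sketch below draws two lines on the same axes and adds a legend. The second series is hypothetical and is there only to show several lines on one graph; the output file name is also an assumption.

```python
import matplotlib.pyplot as plt

x_points = [10, 20, 30, 40, 50, 60]
y_points = [100, 292, 68, 300, 50, 70]
# Hypothetical second series, added only to show several lines on one graph.
y_other = [80, 120, 160, 200, 240, 280]

fig, ax = plt.subplots()
ax.plot(x_points, y_points, color="green", marker="*", markersize=10,
        linestyle="--", linewidth=1.5, label="Series A")
ax.plot(x_points, y_other, color="blue", marker="o", label="Series B")
ax.set_xlabel("X axis")
ax.set_ylabel("Y axis")
ax.legend()  # builds the legend from the label= values above
fig.savefig("line_styles.png")  # or plt.show() in an interactive session
```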
Plot histogram using Python matplotlib Library
Histograms look similar to bar charts, but they are generally used to show how a group of numbers is distributed. The values are grouped into ranges (bins), and the height of each bar shows how many values fall into that range, so the tallest bar marks the most frequent range. A histogram is often used to check the frequency of values in a data set.
To plot a histogram using matplotlib we can use the pyplot hist() method.
import matplotlib.pyplot as plt

student_ages = [11, 13, 12, 12, 13, 15, 12, 12, 13, 14, 15, 14, 15, 12, 15, 16, 16, 13]

# Use as many bins as there are distinct ages in the list
bins = len(set(student_ages))

plt.hist(student_ages, bins)
plt.xlabel("Ages")
plt.ylabel("Number of Students")
plt.show()
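When bins is given as a whole number, matplotlib splits the range between the smallest and largest value into that many equal-width bins, so the bin edges do not necessarily fall between whole ages. A possible alternative, sketched below with the same ages, is to pass explicit bin edges so that each age gets its own bar. The half-step edges and the output file name are choices made for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

student_ages = [11, 13, 12, 12, 13, 15, 12, 12, 13, 14, 15, 14, 15, 12, 15, 16, 16, 13]

# One bin per whole-number age: edges at 10.5, 11.5, ..., 16.5.
bin_edges = np.arange(min(student_ages) - 0.5, max(student_ages) + 1.5, 1.0)

fig, ax = plt.subplots()
# hist() returns the count per bin, the bin edges, and the drawn bars.
counts, edges, _ = ax.hist(student_ages, bins=bin_edges, edgecolor="black")
ax.set_xlabel("Ages")
ax.set_ylabel("Number of Students")
fig.savefig("age_histogram.png")  # or plt.show() in an interactive session

print(counts)  # number of students aged 11, 12, 13, 14, 15, 16
```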
Plot Bar charts using Python Matplotlib Library
In bar charts, we use the height of each bar to represent the data. They are very useful when you are presenting data from a short survey.
import matplotlib.pyplot as plt
import numpy as np

sections = ['Section 1', 'Section 2', 'Section 3', 'Section 4', 'Section 5']
boy_means = [89, 80, 75, 71, 70]
girl_means = [84, 85, 83, 80, 79]

x = np.arange(len(sections))  # positions of the section labels
width = 0.35                  # width of each bar

fig, ax = plt.subplots()
# Shift the two groups left and right so the bars sit side by side
rects1 = ax.bar(x - width/2, boy_means, width, label='Boys')
rects2 = ax.bar(x + width/2, girl_means, width, label='Girls')

ax.set_ylabel('Average Marks')
ax.set_title('Result')
ax.set_xticks(x)
ax.set_xticklabels(sections)
ax.legend()

fig.tight_layout()
plt.show()
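If readers need the exact values, one option is to write each value above its bar. The sketch below does this with the Axes.bar_label() method, which is available in Matplotlib 3.4 and later; on older versions this call will not work. The padding value and the output file name are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

sections = ["Section 1", "Section 2", "Section 3", "Section 4", "Section 5"]
boy_means = [89, 80, 75, 71, 70]
girl_means = [84, 85, 83, 80, 79]

x = np.arange(len(sections))
width = 0.35

fig, ax = plt.subplots()
boys = ax.bar(x - width / 2, boy_means, width, label="Boys")
girls = ax.bar(x + width / 2, girl_means, width, label="Girls")

# Write each bar's value just above it (requires Matplotlib 3.4+).
boy_labels = ax.bar_label(boys, padding=2)
girl_labels = ax.bar_label(girls, padding=2)

ax.set_ylabel("Average Marks")
ax.set_title("Result")
ax.set_xticks(x)
ax.set_xticklabels(sections)
ax.legend()
fig.tight_layout()
fig.savefig("bar_labels.png")  # or plt.show() in an interactive session
```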
Plot Pie Chart using Python Matplotlib library
The pie chart represents each value in a data set as a slice of a pie. A larger slice represents a larger share of the total, and a smaller slice represents a smaller share.
Using the matplotlib pie() method we can create a pie chart with Python.
import matplotlib.pyplot as plt

labels = 'Berries', 'Oranges', 'Grapes', 'Apples'
prices = [15, 30, 45, 10]
explode = (0, 0.1, 0, 0)  # pull the second slice (Oranges) out slightly

fig1, ax1 = plt.subplots()
ax1.pie(prices, explode=explode, labels=labels, shadow=True, startangle=90)
plt.show()
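Slice sizes can be hard to compare by eye, so it often helps to print the percentage on each slice. The sketch below uses the autopct parameter of pie() for this and also computes the shares by hand to show what those percentages are. The format string and the output file name are choices made for this sketch.

```python
import matplotlib.pyplot as plt

labels = ["Berries", "Oranges", "Grapes", "Apples"]
prices = [15, 30, 45, 10]

# Share of each slice, computed by hand to show what autopct will print.
total = sum(prices)
shares = [100 * p / total for p in prices]
print(shares)

fig, ax = plt.subplots()
# autopct formats each slice's percentage, here with one decimal place.
ax.pie(prices, labels=labels, autopct="%1.1f%%", startangle=90)
ax.axis("equal")  # keep the pie circular
fig.savefig("pie_percent.png")  # or plt.show() in an interactive session
```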
Plot Scatter Plot using Python Matplotlib Library
The scatter plot is used to display the relationship between two data sets. For example, it can be used to display the relationship between the height and weight of different people.
To plot a scatter plot using matplotlib we use the scatter() method. The scatter() method can draw the plot from just the x and y data sets.
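import matplotlib.pyplot as plt

# Both lists must have the same length (eight values each).
# Note: Python reads 5.10 as the float 5.1.
heights = [5, 5.5, 5, 7, 5.8, 5.9, 5.10, 6]
weight = [50, 60, 65, 67, 67, 68, 70, 75]

plt.scatter(heights, weight, marker='^')
plt.xlabel("Height")
plt.ylabel("Weight")
plt.show()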
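To put a number on the relationship a scatter plot shows, one common choice is the Pearson correlation coefficient, which NumPy can compute with corrcoef(). The sketch below uses hypothetical measurements (heights in centimetres, weights in kilograms) rather than the values above, and shows the coefficient in the plot title. The output file name is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical measurements: heights in centimetres, weights in kilograms.
heights_cm = [152, 160, 165, 168, 172, 175, 180, 183]
weights_kg = [50, 60, 65, 67, 67, 68, 70, 75]

# Pearson correlation coefficient: values near +1 indicate a strong
# positive linear relationship between the two lists.
r = np.corrcoef(heights_cm, weights_kg)[0, 1]

fig, ax = plt.subplots()
ax.scatter(heights_cm, weights_kg, marker="^")
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title(f"Height vs. weight (r = {r:.2f})")
fig.savefig("height_weight.png")  # or plt.show() in an interactive session
```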
How to design Good Visualization?
To design a good Data Visualization, the visualization must be trustworthy, accessible, and elegant.
Trustworthy means that the visualization is well-founded in its context. There must be solid evidence behind the visualization, and every aspect of it must be clear to the user. For instance, just by looking at the data visualization, the viewer should get the overall insight of the data.
Accessible means that the visualization works for its whole audience. Your data visualization will be inspected not only by experts but by non-experts too, so it must be presented in such a way that everyone can get the most out of it. If the viewer cannot follow the representation, the data could mislead them. So it is very important for a data scientist to know their audience and to choose a presentation that is easy to understand.
Elegant means that the visualization shows only what it needs to. When we visualize our data, we visualize only the part that is most relevant to the context. A data set can have many properties associated with it, but in a single visualization we show only one or two of them. If we pack too many properties into one visualization, they distort it and we lose the main result. When we design the visualization, we should also consider clean styling and decoration, and there should be no unnecessary information on the visuals.
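The three qualities above can be sketched in matplotlib terms. The example below is only one possible reading of them: a clear title and axis labels for accessibility, a value axis that starts at zero so bar heights are not exaggerated, and no frame lines that carry no information. The survey data, colors, and file name are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical survey result used only for illustration.
fruits = ["Apples", "Berries", "Oranges", "Grapes"]
votes = [10, 15, 30, 45]

fig, ax = plt.subplots()
ax.bar(fruits, votes, color="steelblue")

# Accessible: say plainly what the chart and its axes show.
ax.set_title("Favourite fruit among 100 respondents")
ax.set_xlabel("Fruit")
ax.set_ylabel("Number of votes")

# Trustworthy: start the value axis at zero so bar heights are not exaggerated.
ax.set_ylim(bottom=0)

# Elegant: drop the frame lines that carry no information.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

fig.savefig("clean_chart.png")  # or plt.show() in an interactive session
```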
Data Visualization is an integral part of Data Science, and without it, you will have a hard time communicating what your data says. Data Visualization in Python is easy to get started with, and matplotlib is the de facto standard library for plotting data on a graph. There is a lot more to Data Visualization, and here we have provided only a brief introduction to it. Python, which is one of the most popular programming languages, supports many other libraries for data visualization as well.