Python Data Science Libraries


Introduction

Data science, as you may already know, is a field that involves many steps to help data scientists derive useful information from data and make business decisions. Typically, the data science lifecycle consists of the following steps –

  • Data collection – surveys, crawlers etc.
  • Data preparation – cleaning, wrangling, processing, filtering etc.
  • Data mining – modelling, classification/clustering etc.
  • Data analysis – exploratory, confirmatory, descriptive, predictive, regressive, qualitative analysis etc.
  • Data visualization – summarization, business intelligence, decision making etc.

Each of these stages is a huge undertaking in itself, and doing them all manually would be a waste of resources and time. Instead, we can let machines do the heavy lifting – through various libraries – so we can focus on the business aspects and reuse code that already exists.


A little more about the stages of data science

Programming languages like Python offer loads of packages that make data science much more straightforward, which is why Python is one of the most preferred languages for it. Name the task, and Python has a library for it; all you need to do is apply it to your business scenario. Here are some essential libraries that you should know about if you are planning to practise data science with Python:

  1. Data collection: In this stage, data is collected from various sources. This could be from databases, clouds, interviews, surveys etc. Some popular libraries used in this stage are Pandas, Beautiful Soup, Requests.
  2. Data preparation: Data preparation involves cleaning the raw data to make it suitable for further processing. This stage uses libraries like Pandas, Numpy.
  3. Data processing: Data processing is the initial step of data analysis, where exploratory analysis is carried out to get the main features and statistical information about data. Main libraries used in this step are Pandas, Seaborn and matplotlib.
  4. Data analysis: In this step, we train and build a model that helps us arrive at the best possible solution to the given problem. A problem can be classification or regression, and for each of these there are many machine learning algorithms. Although it is difficult to determine which algorithm is best for a particular problem, Python libraries like scikit-learn, TensorFlow and gensim let you apply many algorithms and compare the results without much manual effort.
  5. Data visualization: Presenting data well is essential for making non-technical and business people understand what we want to convey. It also helps compare the various approaches, making it easier to draw conclusions and make decisions. Some popular libraries for visualization are seaborn, matplotlib, ggplot and Plotly.

Best Python data science libraries

In this section, let us dive deep into the libraries, what each contains and how they are useful in various stages of data science. We will also see how to install and import each of these.

1. Pandas

As you have seen above, Pandas is used in many stages of data science. It has compelling features like:

  • Data handling capabilities through Dataframes and series. Through these, data can be represented and manipulated efficiently and quickly.
  • Indexing, organizing, labelling and alignment of data can be done using Pandas methods, which lets us view and manipulate data easily. For example, we can look at just the first few rows of a dataset with millions of records, create as many dataframes as needed, and rename columns to give them more recognizable labels. Without such methods, it would be very difficult to view data and understand its main features.
  • Handling missing data – missing data can produce wrong results and make the model inaccurate. Pandas handles missing data using the fillna() function, which replaces missing values with a value of your choice. Similarly, to check whether a value is null, we can use functions like notnull() and isnull().
  • Cleaning – Pandas has some great methods for data cleaning. For example, we can change the index of a DataFrame and use the .str accessor's methods to clean text columns. In the same way, the .applymap() method can be used to clean the whole dataset element-wise, and columns can be removed simply with the drop() method.
  • Built-in functions for read and write operations – Pandas can read from and write to many file types, with a dedicated method for each. For example, to write to .csv we can use the DataFrame to_csv() method, and to read from CSV we use the read_csv() method. The general pattern for these methods is to_<filetype>() and read_<filetype>().
  • Combine multiple datasets – usually, the complete data consists of data from various sources combined. It is difficult to manually merge data from various sources. However, Pandas makes it easy through various methods like
    • merge() – to join data of common columns,
    • join() – to combine data on one column,
    • append()/concat() – to combine dataframes across columns or rows
  • Visualization – using the DataFrame plot() method, we can create various charts and graphs like scatter plots, bar graphs, pie charts and histograms.
  • Statistical functions – methods like median(), mode(), std(), sum(), mean() are commonly used for descriptive analysis. Further, the describe() function summarizes all the statistics at once.

Installing and importing Pandas

If you have a Python Jupyter notebook, you can try these commands right after installing Python itself (prefix shell commands like pip with ! when running them from a notebook cell).

pip install pandas
import pandas as pd

A simple example of dataframe:

pd.DataFrame({"EmployeeId": [12, 13, 14, 15], 
              "Skill": ['Data science', 'machine learning', 'java', 'python']})

(Output: the resulting employee table with EmployeeId and Skill columns.)
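To see a few of the features listed above in action, here is a minimal sketch; the salary column, its values and the missing entry are made up purely for illustration. It fills the missing value, merges two dataframes on their common column and summarizes the numeric data:

import pandas as pd
import numpy as np

# a dataframe with one missing salary value
employees = pd.DataFrame({"EmployeeId": [12, 13, 14, 15],
                          "Salary": [50000, np.nan, 62000, 58000]})

# check which values are missing, then replace them
print(employees["Salary"].isnull())
employees["Salary"] = employees["Salary"].fillna(0)

# merge with the skills dataframe on the common EmployeeId column
skills = pd.DataFrame({"EmployeeId": [12, 13, 14, 15],
                       "Skill": ['Data science', 'machine learning', 'java', 'python']})
combined = pd.merge(employees, skills, on="EmployeeId")

# summarize all the numeric statistics in one call
print(combined.describe())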

2. Numpy

Numpy is one of the fundamental packages of Python. It lets you represent your data as n-dimensional arrays, which makes data processing far more efficient and takes the burden of scientific computation off developers. For example, here are the steps to create an array:

pip install numpy
import numpy as np

arr = np.array( [[ 1, 2, 3], [ 4, 5, 6], [ 7, 8, 9]])

We can get the dimensions and type of the array simply by using arr.ndim and type(arr) respectively. Numpy also offers methods for slicing, indexing and performing operations on arrays. For example, if we want to double the elements of an array arr = np.array([1, 2, 3]), we can simply write arr = arr * 2, and the result will be [2, 4, 6].
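Here is a minimal sketch of slicing and indexing on the same 3x3 array created earlier:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr[0, 2])      # a single element: row 0, column 2 -> 3
print(arr[1])         # the entire second row -> [4 5 6]
print(arr[:, 1])      # the second column of every row -> [2 5 8]
print(arr[0:2, 0:2])  # the top-left 2x2 sub-array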

Numpy also provides trigonometric functions like sin, cos etc. In the same way, we can sort structured arrays by specifying the field to order by. For example,

# define the data types for the array
dtypes = [('name', 'S10'), ('phone', int)] 
 
# the actual array data 
data = [('John', 7390898934), ('Mac', 8889283421),  
           ('Joey', 8779233420), ('Dave', 8342287730)]            

# create the array with data and data types
arr = np.array(data, dtype = dtypes) 

# Sort the array by name
print ("\nSorted array by names:\n", 
            np.sort(arr, order = 'name'))

Numpy can perform a lot more functions, and you can check them all on the official reference documentation page.

3. Seaborn

Seaborn is a visualization library built on matplotlib (which we will discuss next). Seaborn makes it easier to work with DataFrames as compared to Matplotlib, but it is in no way a replacement of matplotlib. Seaborn complements matplotlib beautifully and has many unique features like:

  • Visualizing linear regression models
  • Styling graphics through built-in themes
  • Plotting time-series data
  • Viewing bivariate and univariate data

You will still need to import and use matplotlib to plot the basic graph and then set styles and scales using seaborn. There are five themes available in the latest version of seaborn – dark, white, ticks, darkgrid and whitegrid – with darkgrid being the default. To remove the right and top axes, we can use the despine() function (which is not available in matplotlib). A lot of customization is possible through the set_style() method, to which you can pass parameters like font.family, axes.grid, grid.color, xtick.color, ytick.direction, figure.facecolor and many more. We can enhance the aesthetics of plots by choosing colors with the color_palette() function. Here is a simple histogram using seaborn.

To install seaborn, use pip install seaborn

import numpy as np
from matplotlib import pyplot as plt
import seaborn as sb

sb.set_style("darkgrid", {'axes.axisbelow': False, 'axes.facecolor': 'Orange'})

a = np.array([12,88,45,47,80,73,43,84,12,20,34,45,79,26,37,39,11]) 
plt.hist(a, bins = [0,20,40,60,80,100]) 
plt.title("example for seaborn") 
plt.show()

This will give the output as:

(Output: a histogram styled with the darkgrid theme and an orange plot background.)

4. Matplotlib

We have already seen how simple it is to use matplotlib in the previous section. It is the central library for data visualization and contains loads of methods to plot any type of graphs and charts. Some features of matplotlib are:

  • Most suitable and commonly used library for 2D plots
  • Any type of plot can be created using this library, such as bar, line, scatter and pie charts
  • Multiple subplots can be easily created using subplot() function
  • Matplotlib can also display images using the imshow() function
  • Also supports 3d graphs like surface, scatter, wireframe etc.
  • Supports streamplots and ellipses
  • The name comes from its MATLAB-like plotting interface ('mat' + 'plot' + 'lib')

Here is an example to show how an image can be displayed using matplotlib. But before that, let us see the installation and import statements:

pip install matplotlib scipy
from scipy import misc, ndimage
from matplotlib import pyplot as plt
# get the sample face image from scipy's misc package (despite the variable name, it is a raccoon)
panda = misc.face()
# rotate it for some fun; you can skip this if you don't want to rotate your image
panda_rotate = ndimage.rotate(panda, 180)
plt.imshow(panda_rotate)
plt.show()

A simple scatter plot using matplotlib:

from matplotlib import pyplot as plt 
# x-axis values 
x = [1, 3, 5, 7, 9] 
# Y-axis values 
y = [2, 4, 6, 4, 1]   
# Function to plot scatter 
plt.scatter(x, y) 
# function to show the plot 
plt.show()

(Output: a scatter plot of the five points.)
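The feature list above also mentioned subplots. Here is a minimal sketch, with made-up values, that places a line plot and a bar chart side by side using the subplot() function:

from matplotlib import pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 15, 25]

# 1 row, 2 columns, first plot
plt.subplot(1, 2, 1)
plt.plot(x, y)
plt.title("line")

# 1 row, 2 columns, second plot
plt.subplot(1, 2, 2)
plt.bar(x, y)
plt.title("bar")

plt.show()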

Here is a simple 3D plot using matplotlib. Note that to plot 3D graphs, we need the mpl_toolkits package, which is automatically installed along with the matplotlib package:

from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

def func(x, y):
    return np.cos(np.sqrt(x ** 2 + y ** 2))
# create evenly spaced sequences of the given intervals 
x = np.linspace(-5, 5, 20)
y = np.linspace(-5, 5, 20)

X, Y = np.meshgrid(x, y)
Z = func(X, Y)
fig = plt.figure()
#this will create a 3d mesh grid
ax = plt.axes(projection='3d')
# You can use the method contour3D as well (try it!)
ax.contourf(X, Y, Z)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
#to view at a different angle, use the property view_init
ax.view_init(45, 30)

fig

(Output: a filled 3D contour plot of the function, viewed at the specified angle.)

In this section, we have seen some advanced functions of both matplotlib and Numpy! Notice that all the libraries work together to produce the desired results.

5. Scikit-learn

A must-have library for machine learning, this package contains virtually every algorithm you can think of, from classification and regression to clustering and dimensionality reduction. Scikit-learn is built on top of libraries like SciPy, NumPy and matplotlib, and its code can be reused in various contexts. Scikit-learn also provides methods to split the data into training and testing sets and to test the accuracy of a model in the case of supervised learning, along with cross-validation utilities and a range of evaluation metrics. The package comes with sample datasets that you can use for practice, for example the iris dataset. To access the features of the dataset, you can use the .data member:

from sklearn import datasets
iris = datasets.load_iris()
print(iris.data)

For the sake of simplicity in understanding, let us take this dataset itself to explore the various models of scikit-learn. Let us say we want to predict the probability of a particular iris to be of type setosa. We will start with the simplest of all algorithms – the Naïve Bayes. Look out for the explanation of the code in the inline comments.

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()
# use the petal width (the fourth feature) as the single input
X = iris["data"][:,3:]
# our target type is setosa, whose integer label in the dataset is 0
y = (iris["target"]==0).astype(int)

#We use the gaussian formula for classifier
clf = GaussianNB()
clf.fit(X,y)
X_new = np.linspace(0,6,100).reshape(-1,1)

# predict the probability of the new data being setosa

y_proba = clf.predict_proba(X_new)

# Plot the data
plt.plot(X,y,"b.")
#find the probability; remember our target is setosa
plt.plot(X_new,y_proba[:,1],"r",label="Iris-Setosa")
plt.plot(X_new,y_proba[:,0],"b",label="Not Iris-Setosa")
# these are for the text to appear
plt.xlabel("Petal width", fontsize=10)
plt.ylabel("Probability", fontsize=10)
plt.legend(loc="upper right", fontsize=10)
plt.show()

When we plot the data, we get the probability value between 0 and 1 as:

(Output: a plot of the predicted probability of Iris-Setosa and Not Iris-Setosa against petal width, with values between 0 and 1.)

Note that we have used GaussianNB.

sklearn has classes for each algorithm. For example, for logistic regression, you will import the same from a linear model:

from sklearn.linear_model import LogisticRegression

For Random forest, you will use,

from sklearn.ensemble import RandomForestClassifier

Use from sklearn.cluster import KMeans for K-means algorithm, and so on.

sklearn also has a function, train_test_split, that partitions the data into training and testing sets.
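As a minimal sketch, here is how the iris data loaded earlier could be split into such sets; the 80/20 split and the random_state value are arbitrary choices:

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)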

6. Requests

As the name suggests, Requests is a module that handles all types of HTTP requests. It is easy to use and has many features, like adding headers, passing URL parameters, SSL verification etc. You can use pip install requests to install the requests package.

When we use the GET method, we get back a detailed response object.

import requests
resp = requests.get('https://www.techgeekbuzz.com/')

We can print individual elements of the response,

print(resp.encoding) 
print(resp.status_code)
print(resp.elapsed) 
print(resp.url) 
print(resp.headers['Content-Type'])

We can send a query string as a dictionary containing strings using the params keyword.

query = {'s': 'supervised learning'}
resp = requests.get('https://www.techgeekbuzz.com/', params=query)
print(resp.url)

Similarly, we can use the POST method too. Post methods are useful for submitting forms automatically.

resp = requests.post('https://www.techgeekbuzz.com', data = {'search': 'Data Science'})
resp.raise_for_status()
# stream the response body into a file, one chunk at a time
with open('what-is-data-science', 'wb') as filed:
    for chunk in resp.iter_content(chunk_size=100):
        filed.write(chunk)

The above code writes the response content into a file named 'what-is-data-science', reading it in chunks of 100 bytes at a time. I have set the chunk size to 100 for simplicity's sake; you can increase it to 50000 or more. Requests also allows you to set cookies, session objects and headers. The requests package is helpful when you want to scrape webpage information from the web.

7. PyTorch and TensorFlow

PyTorch is an open-source library used mainly for applications like natural language processing and computer vision. PyTorch provides tensor computing with Graphics Processing Unit (GPU) acceleration and deep neural networks. It defines the torch.Tensor class to work on homogeneous rectangular arrays of numbers. Companies like NVIDIA and AMD use PyTorch.

We have already explored how Numpy can do scientific computations. However, it is comparatively slow and cannot use GPUs for faster computation. The fundamental object of PyTorch, the tensor, is also an n-dimensional array, but in addition to scientific computations it can keep track of gradients and computational graphs. PyTorch has many methods to manipulate tensors.

So, what’s a tensor after all?

As we mentioned before, a tensor is an n-dimensional array – a vector or matrix that can represent any type of data. A tensor holds values of the same data type with a known shape, and that shape represents the dimensionality of the array.
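As a minimal sketch, assuming PyTorch is installed (pip install torch), here is how a tensor is created and how PyTorch tracks gradients through it:

import torch

# a 2x2 tensor of floats; requires_grad tells PyTorch to track operations on it
t = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
print(t.shape, t.dtype)   # torch.Size([2, 2]) torch.float32

# a simple computation; PyTorch builds the computational graph as it goes
out = (t * t).sum()
out.backward()            # compute the gradient of out with respect to t
print(t.grad)             # d(out)/dt = 2 * t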

PyTorch is widely used by data scientists, AI developers, researchers and machine learning experts for deep learning models. It is flexible, has a simple interface, and offers dynamic computational graphs.

So, why TensorFlow?

PyTorch is more intuitive than Tensorflow as you can create graphs on the go, whereas TensorFlow can create only a static graph, i.e. before you run the model, you should have defined the entire computational graph. It is also much easier to learn than TensorFlow.

As you might have guessed by now, TensorFlow is a library similar to PyTorch, used for deep learning. TensorFlow was developed by Google, while PyTorch, developed by Facebook, builds on the earlier Torch library. TensorFlow is more established and has a bigger community, with more tutorials and resources for learning, and for production-ready, scalable projects it is often the more suitable choice. It also has a visualization tool, TensorBoard, through which you can view ML models in the browser itself. One common way to install TensorFlow is through Anaconda: create a .yml file listing the dependencies, build an environment from it, and then run pip install tensorflow inside that environment. Open the .yml file you created and add the following:

name: <filename>
dependencies:  
- python=3.6  
- jupyter  
- ipython  
- pandas

The above file, when used, installs the dependencies mentioned. Create the environment from the yml file using the conda env create -f <filename>.yml command and activate it using the conda activate <filename> command. After the environment is ready, check whether all the dependencies are installed and then install tensorflow using the pip command mentioned above.

The next step is to import and use it!

import tensorflow as tf
hi = tf.constant("hey there, good evening")
hi

TensorFlow conducts all operations in a graph, i.e. computations take place successively: the nodes are connected, and each operation is called an op node. The main input is a tensor holding the feature vector, which goes through an operation and produces a new tensor that is fed to the next operation. A tensor has three properties – a label (name), a data type and a shape (dimension). For example,

tens1 = tf.constant(12, tf.int16)

will produce output as

Tensor("Const:0", shape=(), dtype=int16)

showing the name, shape and data type respectively. The dtype can be anything from float32 to string, bool and so on.

If a function is applied to the value,

X = tf.constant([3.0], dtype = tf.float32)
print(tf.sqrt(X))

The tensor object will now have a shape value,

  Tensor("Sqrt:0", shape=(1,), dtype=float32)

This is just to show you the basics of TensorFlow objects. The detailed explanation of TensorFlow is beyond the scope of this article and will be covered separately in another article. You can learn more about TensorFlow from the official website.

8. Arrow

Arrow is a library designed specifically to handle dates, times and timestamps. Dates and times are usually a big headache to work with and need extensive coding; Arrow removes this pain. Arrow has a lot of features like:

  • Very simple creation options for standard inputs
  • Time Zone conversions, UTC by default
  • Generates ranges, floor, ceiling, time span for time frames from microseconds to years
  • Supports locales
  • Support for relative offsets
  • Extensible

You can install Arrow using pip install arrow and then import it using the import arrow statement.

A simple example,

import arrow as ar
utc = ar.utcnow()
utc
<Arrow [2020-07-22T18:38:12.126947+00:00]>

local = utc.to('UTC-5')
local
<Arrow [2020-07-22T13:44:32.675160-05:00]>

Arrow can search a date in a string and parse it:

ar.get('I was born on 05 September 1975', 'DD MMMM YYYY')
<Arrow [1975-09-05T00:00:00+00:00]>

You can humanize the time:

now = ar.utcnow()
later = now.shift(hours=4)
later

This will give output as <Arrow [2020-07-22T22:53:55.509976+00:00]>

If you humanize it with later.humanize(now), you will get:

'in 4 hours'

The API is self-explanatory, and functions are exhaustive yet simple to understand. You can use Arrow whenever your project requires extensive date manipulation and working with date ranges.

9. Beautiful soup

Beautiful Soup scrapes information from websites for mining. It sits on top of an HTML or XML parser and builds a parse tree from which data can be extracted. The installation process is the same –

pip install beautifulsoup4

import requests
from bs4 import BeautifulSoup

# BeautifulSoup parses HTML text, so fetch the page with requests first
resp = requests.get('https://www.techgeekbuzz.com')
soup = BeautifulSoup(resp.text, 'html.parser')

Once you create a soup object, you can use methods and attributes like find_all(), prettify() and contents to extract data from the HTML.
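For example, here is a minimal sketch that prints the destination of every link on the page, using the soup object created above:

# find_all() returns every matching tag; here we collect the href of each anchor
for link in soup.find_all('a'):
    print(link.get('href'))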

How to create a custom library in Python

So, that was about some of the most popular and widely used libraries.

Python is one of the most popular languages for data science, mainly because of the libraries it supports. However, if as a developer you feel that a library has not yet been developed for something, and that the code could be reused for other purposes too, Python allows you to create custom libraries as well. A library is nothing but a collection of Python modules organized as a package. It is very simple to create a module:

def myownfunc(param):
    print(param)

After creating the module, save the file with an appropriate name (say, myownmodule) and .py extension. To use this module,

import myownmodule
myownmodule.myownfunc("hello")

The next step is to package this module into a library. We can add more modules and combine all of them into one library. We do this using the setup module – create setup.py in the root directory of the package, then run python setup.py sdist to create the source distribution. Running this will create a tar-gzipped file stored under the dist (distribution) directory. For others to use this package, you have to upload it to PyPI (the Python Package Index). This requires registering and creating an account on PyPI; you can then upload your package using 'twine'.
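For reference, a minimal setup.py might look like the sketch below; the name, version and description are placeholders:

# setup.py, placed in the root directory of the package
from setuptools import setup, find_packages

setup(
    name="myownlibrary",       # placeholder package name
    version="0.1.0",           # placeholder version
    description="A small collection of my own modules",
    packages=find_packages(),  # pick up every package in the directory
)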

Why libraries

We have already explored the use of various libraries in Python. If we have a library, we can reuse code already written by somebody else for the same purpose. This saves a lot of time and resources and helps businesses focus on their business logic alone. Further, in Python it is easy to find out what a library can do by using the built-in help() function. We can also create custom libraries and modules in Python that, when distributed, can be used by others, building a strong community and network of developers.

Conclusion

Someone recently asked me – how is a package different from a library? The main difference is that functions and modules in a library may or may not be related to each other but are put together as one so that you can build your program over it. The programs in a package are closely related to each other. A library consists of many packages that serve different purposes.

The above list of libraries is in no way comprehensive. There are many libraries like Keras, Ggplot, Plotly, Scraper etc. which are very useful. The libraries you choose depend on your business needs and use cases, but the basic ones like Numpy, Pandas, Matplotlib and scikit-learn are used in almost all data science and machine learning problems. Let us know in comments if you want us to write about a specific Python library!
