SQL for Data Science

By | September 28, 2021
SQL For Data Science

SQL is as essential for data as data is for data science. If you don’t know already, SQL stands for Structured Query Language. It is a programming language just like R and Python, but unlike them, SQL stores and manages data in a relational database through the use of queries.

For example, if you want to get the details of a particular customer out of millions of customer detail records in a database, you can simply do so by ‘querying’ or asking the database for the record. We can perform different types of operations using SQL statements. But before we explain that, let’s first look into relational database management systems.

Vamware

DBMS and Relational Databases

So, what is a relational database? A relational database management system (or RDBMS) is a type of DBMS that stores data in the form of tables. This helps us to store data in a more organized and structured manner.

For example, if you have to store data about a customer and her purchases, you can have two tables. The first table is to store the customer data (e.g. customer table) and the other for storing details of her orders (e.g. order table).

The relationship between the customer and order tables can be created using a unique identifier. The most common identifier being the unique id or customer identification number associated with each customer. SQL is the most common language that facilitates accessing and querying databases.

SQL

Structure Query Language can perform the following operations:

  • DML (Data Manipulation Language) statements – This includes select, insert, update and delete. These statements manipulate or filter data. For example, select customer_id, customer_name, invoice_id, invoice_date from orders where customer_id = ‘1234’. Here, id, name, date, and others are the columns of the table orders.
  • DDL (Data Definition Language) statements – These include create, drop and alter commands. For example, drop table orders will remove the table orders from the database.
  • DCL (Data Control Language) statements – These statements control the access to the data by granting and revoking permissions to different users. GRANT and REVOKE are DCL commands.
  • TCL (Transaction Control Language) statements – These control the transactions. For example, these statements decide the point the transactions like insert, update, and delete should be committed, and what happens when an error occurs. Also, to what extent the transaction should be rolled back. Transaction control is done using commands like begin, commit, and rollback.

Primary and Foreign Keys

In the earlier section, we talked about a unique identifier that draws the relationships between two or more tables of an RDBMS. This unique identifier is called ‘ the primary key’. Each table has one primary key that has unique values. For example, in a customer table, customer_id can be a primary key, as it will be unique. Interestingly, customer_name cannot be a primary key because there may be two or more customers with the same name.

Now, if customer_id is the primary key of the customer table, and if the orders table also has the same column customer_id, then customer_id and perhaps the order_id combined together (or the customer_id alone) can act as a foreign key that identifies a unique record for a particular customer.

This foreign key also establishes the necessary relationship between the customer and the orders tables. Now, because of the presence of the foreign key, it becomes easier to fetch both the customer and order details using a single query. How? By using joins.

Joins in SQL

Both the tables can be ‘joined’ together to get the required details from both the tables. SQL joins can be performed on more than two tables as well. There are 4 types of SQL joins:

  1. Full outer,
  2. Left outer,
  3. Right outer, and
  4. Inner join.

Note: Learn more about joins.

While getting data from multiple tables using joins, it is possible that some values are missing. These missing values are stored as NULL. Please note that NULL is different from a zero value. Rather than being zero, NULL means the value does not exist.

How is it possible to scan through millions of records to get the data that we are searching for?

To speed up the retrieval of data using select statements or even where clauses, SQL indexes are used. An index can be created on any column (which is not modified much) of the table. Further, nested queries (query inside a query) can be used to further filter down the criteria and make searching and selecting of data quicker.

Up above, we discussed the concepts of SQL statements, primary and foreign keys, joins, indexes, and null values in brief. These all are important SQL concepts for data science.

There are some more important concepts that we will be discussing in this article and are essential for a data scientist. Before we go on discussing those, let us take a brief look into data science and where SQL fits in the overall data science lifecycle.

Data Science

The importance of data has been increasing. With data, companies gain new insights and solutions to business problems. Data science is centered around data, and so is SQL. A typical data scientist has to collect data from various sources, then filter the data, manipulate it to suit a business problem, or transform it to analyze it further, apply algorithms and statistical measures to it, finally process and visualize data that can be then interpreted to put to use or perform further analysis on it.

Data science is so important because data is not easy to collect. The data that a data scientist receives are just chunks of raw or unstructured data. It can be in any form – text, image, sensor data from devices, and so on.

Through filtering, sorting, and other mechanisms, data is transformed into a more usable form that can lead to solutions for complex business problems. Some typical examples of data science applications are recommendation systems for movies and online purchases, inventory tracking and update for manufacturing unit, and speech and image recognition.

No wonder data science has set its foot in almost all the domains, may it be e-commerce, healthcare, finance, logistics, manufacturing, or banking.

Data Science Lifecycle

It has 2 important phases:

  1. Data collection, conditioning, and data cleaning
  2. Data modeling and mining

SQL plays a pivotal role during data conditioning, cleaning, modeling, and mining.

1. Data cleaning and conditioning

When we receive input data from various sources, it can contain missing values, redundant values, duplicates, null values, and so on. Also, the data is highly unstructured (big data) and raw. By using DBMS normalization, a proper database schema can be designed to structure the data by splitting it into tables and adding constraints, dependencies and keys to indicate the relationships between the different tables.

Next, SQL queries can be used to work upon this organized schema, and automatic batch scripts are run to remove irrelevant or corrupt data without any manual intervention. This makes the cleaning process faster and more reliable. We can also use R or Python for data cleaning; however, most of the things can be done via SQL queries. This results in reducing the amount of code to be written.

2. Data modeling and mining

With a proper schema in place, model planning and building become easy. As we have seen previously, we have customer details in one table and order details in another. Having big amounts of unstructured big data loaded into a structured form saves time, effort and makes querying the database easier and faster.

By creating queries to get information, data can be mined easily to get patterns, trends, and relevant information in the least amount of time. SQL has specific features to aggregate, filter, sort, index, and display data based on many conditions.

Also, data from multiple tables can be joined together, and only the required columns can be displayed. Moreover, triggers can be created to automate certain tasks, and indexes can be created to fetch data faster.

Why is SQL Essential for Data Science?

In spite of the wide popularity of languages like Python and R, SQL has remained an essential tool for data science. It is one of the most trusted and reliable ways to obtain, filter and manipulate data. Moreover, SQL can be used with R, Python, Hadoop, Hive, and other data science tools to get the best of both worlds.

Data Analysis and SQL

Now, let’s get started with some basic examples to understand how much SQL one needs for data science. To keep things simple, we will take the example of a simple student database that stores information about students of an engineering college. Consider the following small set of data:

student_id student_name date_of_birth department hobby total_marks
1 John 04-03-1984 CSE singing 94
2 Mark 17-02-1985 CSE reading 92
3 Joe 04-04-1985 ECE reading 91
4 Richard 03-11-1984 EEE photography 93.5
5 Samantha 08-08-1983 IT dancing 86
6 Janice 14-05-1984 IT watching movies 84

Choice of Data

Let us say you want to see the details of all the students. For doing so, you need to use the following SQL query:

select * from student;

This will give all the records present in the database. However, if we want to display only certain columns, we can just query those using the following SQL query:

select student_name, department from student;

We can use different types of expressions to present data in different ways. For example:

select student_id + ‘—’ + department as roll_number from student;

This will fetch the following result:

roll_number

-----------

1-CSE
2-CSE
3-ECE
4-EEE
5-IT
6-IT

Note that roll_number is just a temporary variable that stores the result. It is not a column.

Filtering the Most Relevant Data and Applying Aggregate Functions

How about finding only those who like reading in their pastime? Or those who belong to the ECE department? Or just the number of students who scored marks between 85 and 90? Let’s see.

Filtering the required data is amazingly simple in SQL. Consider the following statement:

select * from student where hobby = ‘reading’;

This fetches all the columns of the table where the hobby is reading. Now, look at this SQL query:

select count(student_id) from student where marks between 85 and 90;

This fetches a single value, the count of students who secured marks between 85 and 90. Next, we have a big SQL query:

select student_name, department, marks from student group by department having marks > 90 order by marks desc;

Okay, this requires a detailed explanation, but first, let us see the result:

John         CSE      94
Mark         CSE      92
Richard      EEE      93.5
Joe          ECE      91

We see that only those records with marks > 90 are displayed in descending order. Also, the group by clause ensures that the rows are grouped department wise. That’s why we see 92 before 93.5.

Let us say we want to get the list of departments, but we don’t want the values to repeat. Since we have more than one value for some departments, this situation is likely. To avoid that, we use the word distinct.

select distinct(department) from student;

How about we get the overall average marks of students in each department? For such, we use the following SQL statement:

select avg(marks), department from student group by department;

It will give an output as:

93.0|CSE
91.0|ECE
93.5|EEE
85.0|IT

Do you see how powerful yet simple SQL is? We can get any combination of data that we need. There are many such functions like SUM, MIN, and MAX that are useful to get trends, patterns, and other insights.

Identifying Missing Values

In real-world scenarios, the data won’t be as nice and readable as in our examples above. It will have lots of missing values, incorrect values, etc. Databases have a special way of identifying missing values. It is called NULL.

NULL values can cause differences in the measurements and final results. As we saw above, in SQL queries, we filter data using where conditions. For example:

select * from student where department = ‘IT’;
or select * from student where department != ‘IT’;

With null values, these comparisons (= and !=) doesn’t work. To get a NULL value, we have to use IS and IS NOT as shown below:

select * from student where department IS NULL;
select * from student where department IS NOT NULL;

That’s why NULL values will never appear in any result where we make regular comparisons leading to errors in data. Imagine having a dataset with millions of records and thousands of NULL values. This will greatly impact data analysis. Thus, null values should be corrected to include default or zero values so that no columns (or data) are missed out.

Date Time Functions

Date time functions are extensively used to compare dates, analyze data based on different quarters, months or years. There are many functions that can convert data into any format. The timestamp data type includes the time in hours, minutes and seconds.

For example, we can get only those records where the year of birth of the student is 1984 using the following SQL query:

select year(date_of_birth), student_name from student where year(date_of_birth) = 1984;

Check out all the date-related functions here.

Combining Data from Multiple Tables

Real data is huge, complex, and split into multiple tables. That means getting required collective information will need joining one or more tables and displaying the relevant columns from those tables. For this, SQL uses different types of joins:

  • Inner join – Gets only matching values in both the tables.
  • Left join – Returns all the records of the left table and matching records of the right table.
  • Right join – Returns all the records of the right table and matching records of the left table.
  • Full join – Returns all the records where there is a match on either the left or the right table.
  • Self-join – Join of a table with itself.

Suppose we want to list the students who have the same marks. In our above table, we do not have any such values right now. Let us update Joe’s marks to 93.5:

update student set marks = 93.5 where student_name = ‘Joe’;

and,

select s1.student_name, s2.student_name from student s1, student s2 where s1.marks = s2.marks and s1.student_name != s2.student_name;

This will now fetch – Joe|Richard. We used the same table to join itself in the above query. Let us say there is another table, books_library, that stores details of students borrowing books from the library.

book_id book_name author student_id date_of_issue date_of_return
23 Let us C++ Yashwant K. 1 23-01-2020 23-02-2020
24 Head First Java Kathy S. 2 20-01-2020 20-02-2020
25 Data Science from scratch Joel Grus 3 19-12-2019 19-01-2020
26 Headfirst Python Paul B. 4 12-12-2019 12-01-2020

If we wanted to get the details of which student bought which book, we can join the tables and display the desired columns as shown below:

select student.student_name, books_library.book_name, books_library.date_of_issue from books_library inner join student on s.student_id = b.student_id;

The above SQL query will yield the following output:

John
Let us C++
23-01-2020
Mark
Head First Java
20-01-2020
Joe
Data Science from scratch
19-12-2019
Richard
Headfirst Python
12-12-2019

Joins are a powerful way to consolidate data and represent it in different ways. More about joins.

Indexing

Indexing helps to get results faster and optimize performance, especially when we join two or more tables. Particularly, during a full join, which does a cartesian product, each row of the left table is compared with each row of the right table (m x n).

When there are millions of records, imagine the time it would take to execute a query. Thus, to improve the performance, SQL uses indexes. Think of them as the index that you have at the back of books. Indexing in SQL serves a similar purpose, i.e. making querying faster.

If there are no indexes, then SQL has to look through each row to match a column. This results in making querying slower. If we create an index on the column, then the lookup becomes extremely efficient and fast. You can also think of an index as a pointer to the row that has the value your query is searching for.

SQL Analysis Services (SSAS)

SSAS or SQL Server Analysis Services is an analytical data engine from Microsoft that helps in analyzing and mining data and performs data modeling, integration, analysis, and reporting. Check out more about SSAS.

Further Reading

In this article, we talked about the most important features of SQL that you should know to succeed as a data scientist. SQL is useful during data pre-processing, data analysis, and machine learning stages of data science. Interestingly, BigQuery ML from Google Cloud enables you to execute machine learning models using standard SQL queries.

Finally, there could be more commands and features that you might have to know to complete data science tasks. For example, creating triggers and writing small procedures or functions to get repetitive daily routines done faster. SQL, therefore, can easily handle complex scenarios with just one or two queries.

People are also reading:

Leave a Reply

Your email address will not be published. Required fields are marked *