SQL For Data Science

February 6, 2020

SQL is as essential to data science as the data itself. If you don’t know already, SQL stands for Structured Query Language – a language, like R or Python, but one designed specifically for storing, managing and retrieving data in a relational database through the use of queries. For example, if you want to get a customer’s details out of millions of records in a customer database, you can simply do so by ‘querying’, or asking, the database for that record. We can perform different types of operations using SQL statements.

DBMS and Relational Database

So, what is a relational database?

A relational database management system (or RDBMS) is a type of DBMS in which data is stored in the form of tables. This helps us store data in a more organized and structured manner. For example, if you have to store data about a customer and his purchases, you can have one table (eg. a customer table) to store the customer’s details and another (eg. an orders table) to store the details of his orders. The relationship between the customer table and the orders table can be created using a unique identifier, the most common being the unique id or customer identification number associated with each customer. The most common language supported for accessing and querying the database is SQL.

SQL

As we have learned in the introduction, SQL can perform the following operations –

  • DML (Data Manipulation Language) statements – select, insert, update and delete – statements that manipulate or filter data. For example, select customer_id, customer_name, invoice_id, invoice_date from orders where customer_id = '1234'; where customer_id, customer_name, invoice_id and invoice_date are columns of the table orders.
  • DDL (Data Definition Language) statements – These include create, drop and alter commands. For example, drop table orders will remove the table orders from the database.
  • DCL (Data Control Language) statements – These control the access to the data by granting and revoking permissions to different users using the GRANT and REVOKE commands.
  • TCL (Transaction Control Language) statements – These control transactions: for example, at which point changes made by inserts, updates and deletes should be committed, and, when an error occurs, to what extent the transaction should be rolled back. Transaction control is done using commands like begin, commit and rollback.
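To make these categories concrete, here is a minimal combined sketch, one statement of each type, assuming a simple orders table and an analyst user (both names are illustrative, not from a real schema):

create table orders (order_id int primary key, customer_id int, invoice_date date);   -- DDL: define the table
insert into orders values (1, 1234, '2020-01-15');                                    -- DML: add a row
grant select on orders to analyst;                                                    -- DCL: give a user read access
begin; update orders set invoice_date = '2020-01-16' where order_id = 1; commit;      -- TCL: wrap the change in a transaction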

In the earlier section, we talked about a unique identifier that draws the relationships between two or more tables of an RDBMS. This unique identifier is called the ‘primary key’. Each table has one primary key, and its values are unique. For example, in a customer table, customer_id can be a primary key, as it will be unique. customer_name cannot be a primary key, because there may be two or more customers with the same name.

Now, if customer_id is the primary key of the customer table, and the orders table also has a customer_id column, then customer_id in the orders table acts as a foreign key – it points back to a unique record in the customer table (within the orders table itself, order_id, or customer_id and order_id combined, identifies each order). This foreign key establishes the necessary relationship between the customer and the orders tables. Because of the presence of the foreign key, it becomes easy to fetch both the customer and order details using a single query.
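As a rough sketch, the customer and orders tables above might be declared like this (the column names and types here are illustrative, not a prescribed schema):

create table customer (
    customer_id   int primary key,
    customer_name varchar(100)
);

create table orders (
    order_id      int primary key,
    customer_id   int references customer(customer_id),   -- foreign key back to the customer table
    invoice_date  date
);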

How?

The two tables can be ‘joined’ together to get the required details from both of them. SQL joins can be performed on more than two tables as well. There are 4 types of SQL joins – full outer, left outer, right outer and inner join – which we cover in more detail below.

While getting data from multiple tables using joins, it is possible that some values are missing. These missing values are stored as NULL. NULL is different from a zero value. NULL means the value does not exist.

How is it possible to scan through millions of records to get the data that we are searching for?

To speed up the retrieval of data by select statements and where clauses, SQL indexes are used. An index can be created on any column of a table, preferably one that is not modified frequently.

Further, nested queries (a query inside a query) can be used to narrow down the filtering criteria and make the searching and selection of data more precise.
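For instance, sticking with the hypothetical customer and orders tables from earlier, a nested query could restrict the result to customers who placed an order in 2020:

select customer_name
from customer
where customer_id in (select customer_id from orders where invoice_date >= '2020-01-01');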

The concepts that we discussed briefly above – SQL statements, primary and foreign keys, joins, indexes and null values – are all important SQL concepts for data science. There are some more detailed concepts that we will discuss later in this article and that are essential for a data scientist. Before we go on to those, let us touch a bit upon data science and where SQL fits in the overall data science lifecycle.

Data Science

The importance of data has been increasing by the day, with companies deriving new insights and solutions to business problems from it. Data science is centered around data, and so is SQL. A typical data scientist has to collect data from various sources, filter it, manipulate or transform it to suit a business problem, apply algorithms and statistical measures to it, and finally process and visualize the results so that they can be interpreted into use cases or fed into further analysis.

Data science is so important because raw data is not easy to work with. The data that a data scientist receives is just chunks of raw or unstructured data in any form – text, images, sensor data from devices and so on. Through filtering, sorting and other mechanisms, the data is transformed into a more usable form – one that can lead to solutions for complex business problems. Some typical examples are recommendation systems for movies and online purchases, inventory tracking and updates for manufacturing units, speech and image recognition, and so on.

No wonder data science has set foot in almost every domain – e-commerce, healthcare, finance, logistics, manufacturing, banking and many more.

Data Science lifecycle

Data science lifecycle has 3 important phases.

  • Data collection, conditioning, and data cleaning
  • Data modeling and mining
  • Data visualization and interpretation

SQL plays a pivotal role during the first and second stages of the lifecycle i.e. – data conditioning, cleaning, modeling, and mining.

1. Data cleaning and conditioning

When we receive input data from various sources, it can contain missing values, redundant values, duplicates, nulls and so on. The data is highly unstructured (big data) and raw. By using DBMS normalization, a proper database schema can be designed to structure the data by splitting it into tables and adding constraints, dependencies and keys that indicate the relationships between the different tables. SQL queries can then work on this organized schema, and automated batch scripts can remove irrelevant or corrupt data without any manual intervention. This makes the cleaning process faster and more reliable. We can also use R or Python for data cleaning; however, with SQL alone, most of this can be done via queries directly, reducing the amount of code to be written.
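As a small sketch of what such cleanup queries might look like – assuming a staging table named raw_customer, whose name and columns are purely illustrative:

delete from raw_customer where customer_id is null;              -- drop rows missing the key column
update raw_customer set customer_name = trim(customer_name);     -- strip stray whitespace
delete from raw_customer where invoice_date > current_date;      -- discard obviously corrupt dates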

2. Data modeling and mining

With a proper schema in place, model planning and building become easier. As we have seen previously, we have customer details in one table and order details in another. Having large amounts of unstructured data loaded into a structured form saves time and effort and makes querying the database easier and faster.

By writing queries, data can be mined easily for patterns and trends, and relevant information can be retrieved in the least amount of time. SQL has specific features to aggregate, filter, sort, index and display data based on many conditions. Data from multiple tables can be joined together, only the required columns can be displayed, triggers can be created to automate certain tasks and indexes can be created to fetch data faster.

Why SQL is essential for data science

In spite of the wide popularity of languages like Python and R, SQL has remained an essential ingredient of data science. It is one of the most trusted and structured ways to obtain, filter and manipulate data. SQL can be used alongside R, Python, Hadoop, Hive, and others, giving you the best of both worlds.

Data Analysis & SQL

So, let’s get started with some basic examples to understand how much SQL one needs for data science.

I will take the example of a simple student database that stores information about students of a college. Consider a small set of data –

student_id  student_name  date_of_birth  department  hobby            marks
1           John          04-03-1984     CSE         singing          94
2           Mark          17-02-1985     CSE         reading          92
3           Joe           04-04-1985     ECE         reading          91
4           Richard       03-11-1984     EEE         photography      93.5
5           Samantha      08-08-1983     IT          dancing          86
6           Janice        14-05-1984     IT          watching movies  84

Choice of data

Let us say you want to see the details of all the students –

select * from student;

This will give all the records present in the database.

If we want to display only certain columns, we can just query those –

select student_name, department from student;

We can use different types of expressions to present data in different ways. For example,

select student_id || '-' || department as roll_number from student;

This will fetch the following result –

roll_number

———–

1-CSE
2-CSE
3-ECE
4-EEE
5-IT
6-IT

Note that roll_number is just an alias – a temporary name for the computed result. It is not a column of the table.

Filtering most relevant data and applying aggregate functions

How about finding only those who like reading in their pastime? Or those who belong to the ECE department? Or just the number of students who scored marks between 85 and 90?

Filtering the required data is amazingly simple in SQL.

select * from student where hobby = 'reading';

This fetches all the columns of the rows where the hobby is reading.

select count(student_id) from student where marks between 85 and 90;

This fetches a single value – the count of students who secured marks between 85 and 90.

select student_name, department, marks from student where marks > 90 order by department, marks desc;

Okay, this is a big one. Let us see the result –

John        CSE      94
Mark       CSE      92
Joe          ECE      91
Richard  EEE      93.5

We see that only those records with marks > 90 are displayed. The order by clause sorts the rows by department first and then by marks in descending order within each department – that’s why we see 92 before 93.5.

Let us say we want to get the list of departments, but we don’t want the same value to be shown again and again. Since some departments appear more than once, duplicates would otherwise be shown. To avoid that, we can use the keyword distinct.

select distinct department from student;

How about we get the overall average marks of students in each department?

select avg(marks), department from student group by department;

will give an output as –

93.0|CSE
91.0|ECE
93.5|EEE
85.0|IT

Do you see how powerful yet simple SQL is? We can get any combination of data that we need. There are many such functions like SUM, MIN, MAX, etc… that are useful to get trends, patterns, and other insights.

Identifying missing values

In reality, the data won’t be as nice and readable as in our table above. It will have lots of missing values, incorrect values, etc… Databases have a special way of identifying missing values – NULL.

NULL values can cause differences in the measurements and final results. As we saw above, in SQL queries, we filter data using where conditions.

select * from student where department = 'IT';
or
select * from student where department != 'IT';

With NULL values, this comparison (= and !=) doesn’t work. To test for NULL, we have to use IS NULL and IS NOT NULL –

select * from student where department IS NULL;
select * from student where department IS NOT NULL;

That’s why NULL values will never appear in any result where we do a regular comparison, which can lead to errors in the data. Imagine a dataset with millions of records, with thousands of rows having NULL for a column. This will greatly impact data analysis. Thus, null values should be handled explicitly – for example, replaced with a default or zero value – so that no data is silently missed.
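One common way to handle this is the standard coalesce function, which substitutes a default wherever a value is NULL; a brief sketch on our student table:

select student_name, coalesce(department, 'UNKNOWN') as department from student;
update student set marks = coalesce(marks, 0);    -- replace missing marks with a zero value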

Date time functions

Date time functions are extensively used to compare dates and to analyze data based on different quarters, months or years. There are many functions that can convert dates into different formats. The timestamp data type includes the time in hours, minutes and seconds too.

For example, we can get only those records where the year of birth of the student is 1984.

select year(date_of_birth), student_name from student where year(date_of_birth) = 1984;
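Note that the exact function names vary between databases – year() as used above is typical of MySQL and SQL Server, while the SQL-standard form uses extract, as in this equivalent sketch:

select extract(year from date_of_birth), student_name
from student
where extract(year from date_of_birth) = 1984;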


Combining data from multiple tables

For example’s sake, I created a simple table. However, real data will be huge and complex and split into multiple tables. That means getting all the information together will need joining two or more tables and displaying the relevant columns from those tables. For this, SQL uses different types of joins –

  • Inner join – Returns only those records that have matching values in both tables.
  • Left join – Returns all the records of the left table and the matching records of the right table.
  • Right join – Returns all the records of the right table and the matching records of the left table.
  • Full join – Returns all the records from both tables, matched where possible (a combination of left and right join).
  • Self-join – A join of a table with itself.

Suppose, we want to list the students who have the same marks. In our above table, we do not have any such values right now. Let us update Joe’s marks to 93.5.

update student set marks = 93.5 where student_name = 'Joe';

select s1.student_name, s2.student_name from student s1, student s2 where s1.marks = s2.marks and s1.student_name < s2.student_name;

This will now fetch – Joe|Richard

We used the same table to join itself in the above query.

Let us say there is another table, books_library which stores details of which student borrowed which books.

book_id  book_name                  author       student_id  date_of_issue  date_of_return
23       Let us C++                 Yashwant K.  1           23-01-2020     23-02-2020
24       Head First Java            Kathy S.     2           20-01-2020     20-02-2020
25       Data Science from scratch  Joel Grus    3           19-12-2019     19-01-2020
26       Headfirst Python           Paul B.      4           12-12-2019     12-01-2020

If we wanted to get the details of which student borrowed which book, we can join the tables and display the desired columns –

select s.student_name, b.book_name, b.date_of_issue from books_library b inner join student s on s.student_id = b.student_id;
John Let us C++ 23-01-2020
Mark Head First Java 20-01-2020
Joe Data Science from scratch 19-12-2019
Richard Headfirst Python 12-12-2019

Joins are a powerful way to consolidate data and represent it in different ways.
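For instance, a left join on the same two tables would also list students who have not borrowed any book – their book columns simply come back as NULL:

select s.student_name, b.book_name
from student s
left join books_library b on s.student_id = b.student_id;

With the sample data above, Samantha and Janice would appear in the result with NULL in the book_name column.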

Indexing

Indexing helps to get results faster and optimize performance, especially when we join two or more tables. In particular, a cross join (a cartesian product) compares each row of the left table with each row of the right table (m x n). When there are millions of records, imagine the time it would take to execute such a query. To improve performance, SQL uses indexes – think of them as the index at the back of a book – it is easy to get to a page by looking up a term.

Indexing in SQL works the same way – if there are no indexes, SQL has to look through every row to find a matching column value; if we create an index on that column, the lookup becomes extremely efficient and fast. Think of an index as a pointer to the row that holds the value your query is searching for.
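Creating an index is a single statement. For example, if we expect to filter the student table by department often, a sketch would be:

create index idx_student_department on student (department);

After this, a query like select * from student where department = 'IT'; can use the index rather than scanning every row.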

SQL Analysis Services

SSAS, or SQL Server Analysis Services, is an analytical data engine from Microsoft that helps in analyzing and mining data and performs data modeling, integration, analysis, and reporting.

Further reading

In this article, we have talked about the most important features of SQL that should be at your fingertips as a data scientist. SQL is useful during the data pre-processing, data analysis and machine learning stages of data science. BigQuery ML from Google Cloud even lets you build and run machine learning models using standard SQL queries.

There are more commands and features that you might need to know, for example creating triggers, or writing small procedures or functions to get repetitive daily routines done faster. SQL can easily handle complex scenarios with just one or two queries. Start learning SQL today to play with data.
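As one last sketch, here is what a simple trigger might look like in MySQL-style syntax – the marks_log audit table and the trigger itself are illustrative, and the exact syntax differs from database to database:

create trigger log_marks_update
after update on student
for each row
insert into marks_log (student_id, old_marks, new_marks, changed_on)
values (old.student_id, old.marks, new.marks, now());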

