
HTML is the standard markup language for creating web pages. It is generally used to structure the text, images, and video on a page, and it can present text in various forms such as links, plain text, tables, and lists. Let’s say you have an HTML file, or you want to grab an HTML page from the Internet, and you wish to extract its table data for analysis.
In this Python tutorial, I will walk you through a Python program that extracts table data from HTML web pages and saves it locally in CSV files.
Before we dive into the Python program, let’s discuss and install the libraries we will be using in this tutorial.
Required Libraries
Python requests Library
We will use the requests library to send an HTTP GET request to the web page and receive its HTML text in the response.
To install the requests library, run the following pip command in your terminal or command prompt.
pip install requests
Python beautifulsoup4 Library
The beautifulsoup4 library is an open-source Python library for extracting data from HTML and XML documents. We will use it to extract table data from the HTML page using tag names such as <table>, <tr>, <th>, and <td>.
You can install this library using the following pip install command.
pip install beautifulsoup4
Python csv Module
csv (Comma-Separated Values) is a module in the Python standard library, so you do not need to install it separately. As the name suggests, we can use it to read and write CSV files.
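As a quick illustration of the csv module on its own, here is a minimal sketch that writes two rows and reads them back. The rows and the temporary file location are made up for the example; only csv.writer and csv.reader come from the standard library.

```python
import csv
import os
import tempfile

# Made-up sample rows; the second cell contains a comma on purpose.
rows = [
    ["Control", "Note"],
    ["DataList", "Slower, but supports editing"],
]

# Write to a temporary folder so the example does not touch your project.
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back == rows)
```

The writer quotes the cell that contains a comma, which is why the reader can recover it as a single value.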
Note: Because this tutorial extracts data from HTML tables, I am assuming that you have some knowledge of how the HTML <table> tag works and how related tags such as <th>, <tr>, and <td> are used.
How to Convert HTML Tables into CSV Files in Python?
Let’s begin by importing the modules for our Python program.
import requests
from bs4 import BeautifulSoup
import csv
Now define a Python variable url for the web-page URL and request the page.
url = r"https://www.techgeekbuzz.com/difference-between-repeater-datalist-and-gridview/"
response = requests.get(url)  # send get request
html_page = response.text  # fetch HTML page
The get() function sends a GET request to the url, and the text property of the response returns the HTML of the web page as a string.
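The request step can be made a little more defensive. The sketch below is one possible variation, not part of the original program: the helper name fetch_html and the 10-second default timeout are illustrative assumptions, while the timeout argument and raise_for_status() are standard features of requests.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of `url`, raising on network or HTTP errors.

    `fetch_html` and the default timeout are hypothetical choices for
    this sketch; they are not part of the tutorial's program.
    """
    # Give up if the server does not respond within `timeout` seconds.
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into exceptions instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

You could then call fetch_html(url) inside a try block and catch requests.exceptions.RequestException, which is the base class for the errors requests raises.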
Now we will parse html_page with the BeautifulSoup() constructor so that we can search the parsed page using the BeautifulSoup find_all() method.
page = BeautifulSoup(html_page, 'html.parser') #parse html_page
Since this tutorial only deals with table data, let’s extract all the tables present in page.
tables = page.find_all("table")  # find tables

# print the total tables found
print(f"Total {len(tables)} Table(s) Found on page {url}")
find_all("table") returns a list of all the <table> tags present in page.
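To see find_all() at work without fetching anything from the web, here is a small sketch that parses an inline HTML string. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up page with two small tables, used only for illustration.
sample_html = """
<html><body>
  <table>
    <tr><th>Control</th><th>Introduced</th></tr>
    <tr><td>Repeater</td><td>Asp.Net 1.0</td></tr>
  </table>
  <table>
    <tr><td>GridView</td><td>Asp.Net 2.0</td></tr>
  </table>
</body></html>
"""

page = BeautifulSoup(sample_html, "html.parser")
tables = page.find_all("table")  # list-like collection of <table> tags

print(len(tables))
for table in tables:
    # Each table can be searched again for its own rows.
    print(len(table.find_all("tr")), "row(s)")
```

Note that find_all() searches recursively, so on pages with nested tables the inner tables show up as separate entries in the result.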
Now we will loop through every table in the tables list, create a new CSV file for each one, and write the table data to it.
for index, table in enumerate(tables):
    print(f"\n-----------------------Table{index+1}-----------------------------------------\n")

    table_rows = table.find_all("tr")

    # open csv file in write mode
    with open(f"Table{index+1}.csv", "w", newline="") as file:

        # initialize csv writer object
        writer = csv.writer(file)

        for row in table_rows:
            row_data = []

            # <th> data
            if row.find_all("th"):
                table_headings = row.find_all("th")
                for th in table_headings:
                    row_data.append(th.text.strip())
            # <td> data
            else:
                table_data = row.find_all("td")
                for td in table_data:
                    row_data.append(td.text.strip())
            # write data in csv file
            writer.writerow(row_data)

            print(",".join(row_data))
    print("--------------------------------------------------------\n")
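One limitation of the loop above is worth knowing: when a row contains any <th> tag, its <td> cells are skipped, so a row made of a header cell followed by data cells would lose its data. A possible variation, sketched below, collects both tag types in document order by passing a list of names to find_all(). The helper name row_cells and the sample HTML are made up for this example.

```python
from bs4 import BeautifulSoup

# Made-up table whose second row mixes a <th> with <td> cells.
sample_html = """
<table>
  <tr><th>Feature</th><th>DataList</th><th>Repeater</th></tr>
  <tr><th>Row selection</th><td>Yes</td><td>No</td></tr>
</table>
"""


def row_cells(row):
    """Return the stripped text of every <th> and <td> in a row, in order."""
    return [cell.text.strip() for cell in row.find_all(["th", "td"])]


table = BeautifulSoup(sample_html, "html.parser").find("table")
rows = [row_cells(tr) for tr in table.find_all("tr")]
print(rows)
```

For tables that never mix the two cell types in a single row, this behaves the same as the if/else version.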
Now put all the code together and execute it.
Python program to Convert web-page tables to CSV files
import requests
from bs4 import BeautifulSoup
import csv

url = r"https://www.techgeekbuzz.com/difference-between-repeater-datalist-and-gridview/"

response = requests.get(url)
html_page = response.text

soup = BeautifulSoup(html_page, 'html.parser')

# find <table>
tables = soup.find_all("table")
print(f"Total {len(tables)} Table(s) Found on page {url}")

for index, table in enumerate(tables):
    print(f"\n-----------------------Table{index+1}-----------------------------------------\n")

    # find <tr>
    table_rows = table.find_all("tr")

    # open csv file in write mode
    with open(f"Table{index+1}.csv", "w", newline="") as file:

        # initialize csv writer object
        writer = csv.writer(file)

        for row in table_rows:
            row_data = []

            # <th> data
            if row.find_all("th"):
                table_headings = row.find_all("th")
                for th in table_headings:
                    row_data.append(th.text.strip())
            # <td> data
            else:
                table_data = row.find_all("td")
                for td in table_data:
                    row_data.append(td.text.strip())
            # write data in csv file
            writer.writerow(row_data)

            print(",".join(row_data))
    print("--------------------------------------------------------\n")
Output
Total 3 Table(s) Found on page https://www.techgeekbuzz.com/difference-between-repeater-datalist-and-gridview/

-----------------------Table1-----------------------------------------

DataList,Repeater
Object Rendering
Its object rendered as a table,It uses templates to render its object.
Columns generation
It automatically generates columns using the data source.,It cannot generate columns.
Row selection
It can select a row from the data source.,It cannot select rows.
Content Editing
Using DataList control, we can edit object content.,It does not support content editing.
Data arrangement
It has RepeatDirection property which allows it to arrange data vertically and horizontally.,It cannot manually arrange data.
Performance
It is slower than Repeater.,It is very fast because of its lightweight.
--------------------------------------------------------


-----------------------Table2-----------------------------------------

GridView,Repeater
Debut
GridView was introduced in Asp.Net 2.0,The Repeater was introduced in Asp.Net 1.0.
Columns generation
It automatically generates columns using the data source.,It cannot generate columns.
Row selection
It can select a row from the data source.,It cannot select rows.
Content Editing
Using GridView control, we can edit object content.,It does not support content editing.
In-built methods
It comes with built-in paging and sorting methods.,No built-in support for Built-in paging and sorting developer has to code.
Auto formatting and styling
In GridView we get inbuilt auto format and styling feature.,It does not support these features.
Performance
It is slower than Repeater.,Because of its lightweight, it is faster as compared to GridView.
--------------------------------------------------------


-----------------------Table3-----------------------------------------

GridView,DataList
Debut
GridView was introduced in Asp.Net 2.0 version.,DataList was introduced in Asp.Net 1.0 version.
In-built methods
It comes with built-in paging and sorting methods.,No built-in support for Built-in paging and sorting, the developer has to code for these features.
Build-in CRUD operation
It comes with built-in Update and Deletes Operations, so the developer does not need to write code for simple operations.,If developer use DataList then he/she has to write code for the Update and Delete operations.
Auto formatting and styling
In GridView we get inbuilt auto format and styling feature.,It does not support these features.
Customizable Row
We do not get Customizable row separator feature in GridView.,DataList has SeparatorTemplate for customizable row separator.
Performance:
Its performance is the lowest as compared to Repeater and DataList.,It is faster than the GridView.
--------------------------------------------------------
When you execute the above program, it saves the CSV files (Table1.csv, Table2.csv, and so on) in the current working directory, which is the directory of your Python script if you run the script from there.
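If you would rather choose the output folder explicitly than rely on where the program is run from, one option is to build the file path with pathlib. The sketch below is only an illustration: write_table, output_dir, and the demo rows are hypothetical names and data, not part of the program above.

```python
import csv
import tempfile
from pathlib import Path


def write_table(rows, index, output_dir):
    """Write `rows` to <output_dir>/Table<index>.csv and return the path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path = output_dir / f"Table{index}.csv"
    with open(path, "w", newline="", encoding="utf-8") as file:
        csv.writer(file).writerows(rows)
    return path


# Demo with made-up rows, written to a temporary folder.
demo_dir = Path(tempfile.mkdtemp()) / "csv_output"
saved = write_table([["GridView", "DataList"], ["Debut", ""]], 1, demo_dir)
print(saved.name)
```

Passing encoding="utf-8" is also an assumption of this sketch; it avoids depending on the platform's default encoding when a table contains non-ASCII text.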
Conclusion
In this Python tutorial, we learned how to convert HTML tables to CSV files in Python. This tutorial is a small application of web scraping with Python. If you want to learn more about extracting data from web pages, you can read the official documentation of BeautifulSoup4.