Web scraping is the process of parsing and extracting data from a website (text, links, images, web tables, etc.) and saving it to an Excel/text file or a database for further analysis 📊 In the age of the internet, every website is a database: a huge amount of data is generated every day, and extracting it manually is time-consuming.
As a result, automating the process is important. If we want to extract product reviews from an e-commerce website (or financial data from Yahoo Finance, and so on) and there is no direct way to do so, web scraping skills come in handy.
There are many scenarios in which web scraping can be automated, including extracting all links from a page, data from multiple tables, tweets with a hashtag, an image, or a paragraph.
In this tutorial, I’ll show you how to extract a web table from a website and save the data in a CSV file using the Python module “BeautifulSoup” 🧼
What is BeautifulSoup?
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
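For instance, here is a minimal sketch (not part of the tutorial's steps, and using a made-up HTML snippet) showing how BeautifulSoup turns an HTML string into a parse tree that we can query:

from bs4 import BeautifulSoup

# Parse a small, hypothetical HTML snippet and read values from the parse tree
html = "<html><body><h1>World Population</h1><p>Population data goes here</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)  # -> World Population
print(soup.p.text)   # -> Population data goes here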
Example: Extract web table data from the “worldometer” website 🌐
I used the website to extract the “World Population by Region” table:
Inspect web element
Before writing Python code, we need to understand the web page structure and inspect web elements on the page to get each element's address.
Each web element on the page has an address that represents its location. Once we find the location, we can easily extract the element's information. In any browser, open the website, right-click on the page, and click Inspect. This opens a window for inspecting web elements. Every web element has an HTML tag and attributes, which we can use to find the element's address on the page.
“World Population by Region” is stored in tabular format with the tag ‘table’ and the attribute ‘class’. Now, our next step is to extract data from the table. I have split the code into the steps below:
Import required Python modules: For this, we’ll need to import the following Python modules: requests, BeautifulSoup, and csv:
import requests
from bs4 import BeautifulSoup
from csv import writer
Get HTML content from a website: The code below will request a web page and retrieve its HTML content.
url = "https://www.worldometers.info/world-population/"
response = requests.get(url).text
Parsing the HTML content using BeautifulSoup:
soup = BeautifulSoup(response, "html.parser")
Extract web table - World Population by Region: We can use the soup.find() method to locate the web table with the tag ‘table’ and the class attribute “table table-hover table-condensed” and save it to a ‘tabl’ object. If we want to get all of the tables on a web page, we can use the soup.find_all() method.
tabl = soup.find("table", class_="table table-hover table-condensed")
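As a side note, here is a quick sketch of the find_all() approach mentioned above, reusing the soup object from the parsing step; it returns a list of every matching table on the page:

# Get every table on the page instead of just one (illustrative sketch)
all_tables = soup.find_all("table")
print(len(all_tables))        # how many tables the page contains
for t in all_tables:
    print(t.get("class"))     # each table's class attribute, if any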
Extract data from the web table and save it in a CSV file: Table data is organized into rows and columns, and the same principle applies to web tables. In most cases, a table will have a header row and data rows. In other cases, however, the header row is absent and only data rows are provided. So, before extracting data from a table, one must examine its structure. The code below extracts data from the table and saves it as a CSV file:
header = []  # create an empty list for the header row

# Extract the header row
for i in tabl.find_all('th'):
    header.append(i.text)

with open("world_population_by_region.csv", "wt", newline='', encoding='utf-8') as csv_file:
    csv_writer = writer(csv_file, delimiter='|')
    csv_writer.writerow(header)  # write the header

    # Write the data rows to the CSV file
    for row in tabl.find_all('tr')[1:]:
        td = row.find_all('td')
        r = [i.text.replace('\n', '') for i in td]
        csv_writer.writerow(r)
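If the table you are scraping has no header row (the case mentioned above), one possible adaptation, shown only as a sketch reusing the tabl object and the writer import from the steps above (the output file name is just an example), is to check for th cells and write the data rows alone:

# Sketch: handle a table that has no <th> header cells
if not tabl.find_all('th'):
    with open("table_without_header.csv", "wt", newline='', encoding='utf-8') as csv_file:
        csv_writer = writer(csv_file, delimiter='|')
        for row in tabl.find_all('tr'):
            td = row.find_all('td')
            if td:  # skip rows that contain no data cells
                csv_writer.writerow([i.text.replace('\n', '') for i in td])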
Output:
The data is extracted into a CSV file with ‘|’ as the delimiter because the number fields contain commas (,). We can also use an Excel file as the output file to save the data. If there is a huge quantity of data, or if we need to keep historical data from a web page, we can use the Python database module sqlite3 to save the data in database tables.
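As a rough sketch of the sqlite3 option, reusing the tabl object from the extraction step (the database file, table, and column names below are only illustrative assumptions):

import sqlite3

# Store the scraped rows in a SQLite database instead of a CSV file
conn = sqlite3.connect("world_population.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS population_by_region (region TEXT, details TEXT)")

for row in tabl.find_all('tr')[1:]:
    cells = [i.text.replace('\n', '') for i in row.find_all('td')]
    if cells:
        # keep the region name in one column and the remaining cells, joined by '|', in another
        cur.execute("INSERT INTO population_by_region VALUES (?, ?)",
                    (cells[0], '|'.join(cells[1:])))

conn.commit()
conn.close()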
This is a simple example of how to extract web table content from a web page. You can also try extracting tables without headers, extracting all web tables, extracting dynamic web tables, extracting data from nested tables, extracting tables from several pages, and so on. In some cases, we may need the Selenium module to handle dynamic content, page scrolling, and other things.
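For dynamic pages, a minimal Selenium sketch (assuming Selenium and a Chrome driver are installed) might look like this: let the browser render the page, then hand the resulting HTML to BeautifulSoup exactly as before:

from selenium import webdriver
from bs4 import BeautifulSoup

# Render the page in a real browser so JavaScript-generated content is present
driver = webdriver.Chrome()
driver.get("https://www.worldometers.info/world-population/")
html = driver.page_source
driver.quit()

# From here on, the parsing steps are the same as above
soup = BeautifulSoup(html, "html.parser")
tabl = soup.find("table", class_="table table-hover table-condensed")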
Conclusion
The most important thing to remember is that users must understand the web page structure. Developers continuously update web pages to improve the user experience, which is why a page’s structure can change frequently, and we need to update our script or framework accordingly 💫
In this article, we learned about extracting a web table using BeautifulSoup. We can apply the same approach to extract other information as well, like links, images, paragraphs, page titles, etc.