Tuesday, 28 April 2026

April Python Bootcamp Day 17

 

Day 17: Web Scraping using Python

What is Web Scraping?

Web scraping is the process of extracting data from websites automatically using code instead of manually copying it.

It helps in:

  • Data collection
  • Automation
  • Building datasets
  • Market research

Tools Required

1. requests

  • Used to fetch web pages or API data
  • Works by sending HTTP requests

2. BeautifulSoup

  • Parses HTML content
  • Helps extract specific elements such as headings, links, and tables

Data Flow Understanding

HTML Scraping Flow:

Website → HTML → BeautifulSoup → Data

API Data Flow:

API Endpoint → JSON → Python → Data
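
Both flows start the same way: requests fetches the raw response, and only the parsing step differs. Here is a minimal sketch of the two flows (example.com is a placeholder website, and the todos endpoint is just one sample jsonplaceholder resource):

import requests
from bs4 import BeautifulSoup

# HTML flow: Website -> HTML -> BeautifulSoup -> Data
page = requests.get("https://example.com", timeout=10)   # placeholder website
soup = BeautifulSoup(page.text, "html.parser")
print(soup.title.text)                                   # pull an element out of the parsed HTML

# API flow: API Endpoint -> JSON -> Python -> Data
api = requests.get("https://jsonplaceholder.typicode.com/todos/1", timeout=10)
print(api.json())                                         # JSON is decoded straight into Python objects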

Sample HTML File (index.html)

<!DOCTYPE html>
<html>
<head>
<title>Sample Webpage</title>
</head>
<body>

<h1>Main Heading</h1>
<h2>Sub Heading</h2>
<h3>Section Heading</h3>

<p>This is a paragraph about web scraping.</p>
<p>Python makes scraping easy using BeautifulSoup.</p>

<a href="https://www.google.com">Google</a><br>
<a href="https://www.github.com">GitHub</a>

<h2>Student Table</h2>
<table border="1">
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
<th>Email</th>
</tr>
<tr>
<td>Piyush</td>
<td>21</td>
<td>Nagpur</td>
<td>piyush@example.com</td>
</tr>
<tr>
<td>Rahul</td>
<td>22</td>
<td>Pune</td>
<td>Rahul@gmail.com</td>
</tr>
</table>

</body>
</html>

Web Scraping using BeautifulSoup

from bs4 import BeautifulSoup

with open("index.html", "r", encoding="utf-8") as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, "html.parser")

# 1. Title
print(f"Title: {soup.title.text}")

# 2. Headings
for tag in ["h1", "h2", "h3"]:
    for heading in soup.find_all(tag):
        print(f"{tag.upper()}: {heading.text}")

# 3. Paragraphs
for p in soup.find_all("p"):
    print(p.text)

# 4. Links
for a in soup.find_all("a"):
    print(f"Text: {a.get_text()}, URL: {a.get('href')}")

# 5. Table Data
table = soup.find("table")
rows = table.find_all("tr")

for row in rows:
    cols = row.find_all(["td", "th"])
    data = [col.text.strip() for col in cols]
    print(data)

# 6. Extract all text
print(soup.get_text(separator="\n"))
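
One caveat with the table step above: soup.find("table") returns None when the page has no table, and calling find_all on None raises an AttributeError. A small defensive sketch, assuming the same soup object:

# Guard against a missing tag before using it
table = soup.find("table")
if table is None:
    print("No table found in this document")
else:
    for row in table.find_all("tr"):
        print([cell.text.strip() for cell in row.find_all(["td", "th"])])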

Web Scraping using APIs (JSON Data)

import requests

url = "https://jsonplaceholder.typicode.com/posts"

response = requests.get(url)

if response.status_code == 200:
    data = response.json()

    for post in data[:5]:
        print(f"Title: {post['title']}")
        print(f"Body: {post['body']}")
else:
    print("Error:", response.status_code)

Advanced Example: Fetch API Data and Save to CSV

import requests
import csv

def fetch_api_data(url):
    headers = {
        "User-Agent": "Mozilla/5.0"
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print("Error:", e)
        return None

url = "https://jsonplaceholder.typicode.com/users"
data = fetch_api_data(url)

if data:
    for user in data:
        print(f"{user['name']} - {user['email']}")

    with open("HR.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(['Name', 'Email'])

        for user in data:
            writer.writerow([user['name'], user['email']])
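
As a variation on the CSV step, csv.DictWriter writes rows from dictionaries keyed by column name, which avoids keeping the field order in sync by hand. A sketch assuming the same jsonplaceholder /users response shape (name and email keys); users_dictwriter.csv is an arbitrary file name:

import csv
import requests

# Fetch the same endpoint and write selected fields with DictWriter
users = requests.get("https://jsonplaceholder.typicode.com/users", timeout=10).json()

with open("users_dictwriter.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "email"])
    writer.writeheader()
    for user in users:
        writer.writerow({"name": user["name"], "email": user["email"]})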

Key Concepts Learned

  • Difference between HTML scraping and API scraping
  • How to parse HTML using BeautifulSoup
  • Extracting headings, paragraphs, links, and tables
  • Fetching JSON data using requests
  • Saving extracted data into CSV files

Best Practices for Web Scraping

  • Always check website permissions (robots.txt)
  • Avoid sending too many requests (rate limiting)
  • Use headers like User-Agent
  • Prefer APIs over HTML scraping when available
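
A minimal sketch combining these practices: it checks robots.txt with the standard-library urllib.robotparser, sends a User-Agent header, and sleeps between requests as simple rate limiting. The site and page URLs are placeholders:

import time
import requests
from urllib.robotparser import RobotFileParser

headers = {"User-Agent": "Mozilla/5.0 (learning-bot)"}

# Check the site's robots.txt before scraping (placeholder site)
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

urls = ["https://example.com/page1", "https://example.com/page2"]   # placeholder pages

for url in urls:
    if not robots.can_fetch(headers["User-Agent"], url):
        print(f"Skipping disallowed URL: {url}")
        continue
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)
    time.sleep(2)   # wait between requests to avoid overloading the server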

Summary

Web scraping is a powerful skill for automating data extraction.
BeautifulSoup lets you pull structured data out of HTML, while requests fetches both web pages and API responses.


Assignment Questions

Theory-Based

  1. What is web scraping? Explain with an example.
  2. Difference between web scraping and API data fetching.
  3. What is BeautifulSoup used for?
  4. Why is JSON preferred in APIs?
  5. What are headers in HTTP requests and why are they used?

Practical Questions

  1. Extract all headings (h1, h2, h3), paragraphs, and links from the given HTML file.
  2. Modify the code to extract only the emails from the table.
  3. Scrape only the table data and convert it into a list of dictionaries.
  4. Fetch data from https://jsonplaceholder.typicode.com/comments and print the name and email of the first 10 records.
  5. Save the API data into a CSV file with the columns ID, Name, Email.

Challenge Tasks

  1. Build a scraper that:
    • Extracts all links from a webpage
    • Saves them into a text file
  2. Create a script that:
    • Scrapes table data
    • Converts it into JSON format
  3. Combine both:
    • Scrape HTML data
    • Store it in CSV
    • Also fetch API data and merge both datasets
  4. Add error handling:
    • Handle missing tags
    • Handle request failures

