Day 17: Web Scraping using Python
What is Web Scraping?
Web scraping is the process of extracting data from websites automatically with code instead of copying it by hand.
It is useful for:
- Data collection
- Automation
- Building datasets
- Market research
Tools Required
1. requests
- Used to fetch webpage or API data
- Sends HTTP requests and handles the responses
2. BeautifulSoup
- Parses HTML content
- Helps extract specific elements such as headings, links, and tables
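Both libraries install with pip (pip install requests beautifulsoup4) and are typically used together: requests downloads the raw HTML, and BeautifulSoup parses it. A minimal sketch, using example.com as a placeholder URL:

import requests
from bs4 import BeautifulSoup

# Download the page (example.com is a placeholder URL)
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()  # raise an error for 4xx/5xx responses

# Parse the HTML and extract one element
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)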
Understanding the Data Flow
HTML Scraping Flow:
Website → HTML → BeautifulSoup → Data
API Data Flow:
API Endpoint → JSON → Python → Data
Sample HTML File (index.html)
<!DOCTYPE html>
<html>
<head>
    <title>Sample Webpage</title>
</head>
<body>
    <h1>Main Heading</h1>
    <h2>Sub Heading</h2>
    <h3>Section Heading</h3>

    <p>This is a paragraph about web scraping.</p>
    <p>Python makes scraping easy using BeautifulSoup.</p>

    <a href="https://www.google.com">Google</a><br>
    <a href="https://www.github.com">GitHub</a>

    <h2>Student Table</h2>
    <table border="1">
        <tr>
            <th>Name</th>
            <th>Age</th>
            <th>City</th>
            <th>Email</th>
        </tr>
        <tr>
            <td>Piyush</td>
            <td>21</td>
            <td>Nagpur</td>
            <td>piyush@example.com</td>
        </tr>
        <tr>
            <td>Rahul</td>
            <td>22</td>
            <td>Pune</td>
            <td>Rahul@gmail.com</td>
        </tr>
    </table>
</body>
</html>
Web Scraping using BeautifulSoup
from bs4 import BeautifulSoup

with open("index.html", "r", encoding="utf-8") as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, "html.parser")

# 1. Title
print(f"Title: {soup.title.text}")

# 2. Headings
for tag in ["h1", "h2", "h3"]:
    for heading in soup.find_all(tag):
        print(f"{tag.upper()}: {heading.text}")

# 3. Paragraphs
for p in soup.find_all("p"):
    print(p.text)

# 4. Links
for a in soup.find_all("a"):
    print(f"Text: {a.get_text()}, URL: {a.get('href')}")

# 5. Table Data
table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    cols = row.find_all(["td", "th"])
    data = [col.text.strip() for col in cols]
    print(data)

# 6. Extract all text
print(soup.get_text(separator="\n"))
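Besides find() and find_all(), BeautifulSoup also supports CSS selectors through select(). Continuing from the script above, the headings and links steps can be written more compactly; this is a sketch of the alternative style, not a replacement for the version above:

# 2. Headings, rewritten with a single CSS selector
for heading in soup.select("h1, h2, h3"):
    print(f"{heading.name.upper()}: {heading.get_text(strip=True)}")

# 4. Links, restricted to anchors that actually have an href attribute
for a in soup.select("a[href]"):
    print(a["href"])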
Web Scraping using APIs (JSON Data)
import requests

url = "https://jsonplaceholder.typicode.com/posts"
response = requests.get(url)

if response.status_code == 200:
    data = response.json()
    for post in data[:5]:
        print(f"Title: {post['title']}")
        print(f"Body: {post['body']}")
else:
    print("Error:", response.status_code)
Advanced Example: Fetch API Data and Save to CSV
import requests
import csv

def fetch_api_data(url):
    headers = {
        "User-Agent": "Mozilla/5.0"
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print("Error:", e)
        return None

url = "https://jsonplaceholder.typicode.com/users"
data = fetch_api_data(url)

if data:
    for user in data:
        print(f"{user['name']} - {user['email']}")

    with open("HR.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(['Name', 'Email'])
        for user in data:
            writer.writerow([user['name'], user['email']])
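To verify the file, you can read it back with csv.DictReader, which maps each row to a dictionary keyed by the header row:

import csv

# Read the CSV back; each row becomes a dict keyed by the header ('Name', 'Email')
with open("HR.csv", "r", encoding="utf-8") as file:
    for row in csv.DictReader(file):
        print(row["Name"], "-", row["Email"])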
Key Concepts Learned
- Difference between HTML scraping and API scraping
- How to parse HTML using BeautifulSoup
- Extracting headings, paragraphs, links, and tables
- Fetching JSON data using requests
- Saving extracted data into CSV files
Best Practices for Web Scraping
- Always check website permissions (robots.txt)
- Avoid sending too many requests (rate limiting)
- Use headers like User-Agent
- Prefer APIs over HTML scraping when available
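The first three practices can be automated. A minimal sketch using the standard library's urllib.robotparser plus a fixed delay between requests; the URLs and the one-second delay are placeholders:

import time
import urllib.robotparser

import requests

# Parse the site's robots.txt first (example.com is a placeholder domain)
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

headers = {"User-Agent": "Mozilla/5.0"}
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    # Skip anything robots.txt disallows for our user agent
    if not robots.can_fetch(headers["User-Agent"], url):
        print("Disallowed by robots.txt:", url)
        continue
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)
    time.sleep(1)  # simple rate limiting: at most one request per second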
Summary
Web scraping is a powerful skill for automating data extraction.
Using BeautifulSoup, you can extract structured data from HTML, while requests helps fetch data from APIs efficiently.
Assignment Questions
Theory-Based
- What is web scraping? Explain with an example.
- Difference between web scraping and API data fetching.
- What is BeautifulSoup used for?
- Why is JSON preferred in APIs?
- What are headers in HTTP requests and why are they used?
Practical Questions
- From the given HTML file, extract all:
  - Headings (h1, h2, h3)
  - Paragraphs
  - Links
- Modify the code to extract only the emails from the table.
- Scrape only the table data and convert it into a list of dictionaries.
- Fetch data from https://jsonplaceholder.typicode.com/comments and print the name and email of the first 10 records.
- Save the API data into a CSV file with the columns ID, Name, Email.
Challenge Tasks
- Build a scraper that:
  - Extracts all links from a webpage
  - Saves them into a text file
- Create a script that:
  - Scrapes table data
  - Converts it into JSON format
- Combine both:
  - Scrape HTML data
  - Store it in CSV
  - Also fetch API data and merge both datasets
- Add error handling:
  - Handle missing tags
  - Handle request failures