Web Scraping using Python: A Practical Guide with Examples
Learn how to extract data from websites using Python tools like Requests, BeautifulSoup, Selenium, and Scrapy. This guide covers ethics, pagination, error handling, and exporting data to CSV/JSON—with a complete mini-project.
What is Web Scraping?
Web scraping is the automated process of fetching web pages and extracting structured information (e.g., prices, reviews, articles) for analysis or integration. In Python, it's commonly done using requests to download HTML and BeautifulSoup or lxml to parse it. For dynamic, JavaScript-rendered sites, tools like Selenium or frameworks like Scrapy are used.
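In a nutshell, the static-page workflow is just "download, parse, select." Here is a minimal sketch (using the practice site we scrape properly later in this guide; nothing beyond requests and BeautifulSoup is assumed):
import requests
from bs4 import BeautifulSoup

html = requests.get("https://quotes.toscrape.com/", timeout=15).text
soup = BeautifulSoup(html, "lxml")
print(soup.title.get_text(strip=True))                        # the page <title>
print(soup.select_one(".quote .text").get_text(strip=True))   # first quote on the page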
Ethics, Legality & Robots.txt
- Check Terms of Service: Some sites prohibit scraping; ensure you comply.
- Respect robots.txt: Indicates crawl allowances but isn’t a legal contract (see the quick check after this list).
- Rate Limit: Avoid overloading servers; add delays and caching.
- Identify Yourself: Use a descriptive User-Agent string.
- Use Public/Test Sites: For practice, use quotes.toscrape.com, books.toscrape.com, etc.
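If you want to check robots.txt programmatically before crawling, the standard library’s urllib.robotparser is enough. A quick sketch (the user-agent string here is just an example):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()
# True/False depending on the site's directives for this agent and path
print(rp.can_fetch("ChandanScraper/1.0", "https://quotes.toscrape.com/page/2/"))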
Core Tools You’ll Use
| Library | Purpose | When to use |
|---|---|---|
| requests | Download HTML content | Static pages, APIs |
| BeautifulSoup | Parse & navigate HTML | Extract text/attributes from tags |
| Selenium | Automate a browser | Dynamic sites requiring JS execution |
| Scrapy | Full scraping framework | Large-scale, multi-page, robust pipelines |
Environment Setup
# Create and activate a virtual environment (recommended)
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
# Install essentials
pip install requests beautifulsoup4 lxml pandas
# Optional for JS-heavy sites
pip install selenium webdriver-manager
# For large projects
pip install scrapy
Example 1: Scrape Static HTML with Requests + BeautifulSoup
We’ll scrape quotes, authors, and tags from the classic practice site https://quotes.toscrape.com/.
import time
import csv
import requests
from bs4 import BeautifulSoup
BASE_URL = "https://quotes.toscrape.com"
HEADERS = {
"User-Agent": "Mozilla/5.0 (compatible; ChandanScraper/1.0; +https://example.com/contact)"
}
def fetch(url):
"""Fetch a URL and return BeautifulSoup-parsed document."""
resp = requests.get(url, headers=HEADERS, timeout=15)
resp.raise_for_status() # raises for 4xx/5xx
return BeautifulSoup(resp.text, "lxml")
def parse_quotes(doc):
"""Yield quote dicts from a page."""
for q in doc.select(".quote"):
text = q.select_one(".text").get_text(strip=True)
author = q.select_one(".author").get_text(strip=True)
tags = [t.get_text(strip=True) for t in q.select(".tags .tag")]
yield {"text": text, "author": author, "tags": ", ".join(tags)}
def find_next_page(doc):
next_link = doc.select_one(".pager .next a")
if next_link:
return BASE_URL + next_link.get("href")
return None
def crawl_all_pages(start_url=BASE_URL):
url = start_url
while url:
doc = fetch(url)
for item in parse_quotes(doc):
yield item
url = find_next_page(doc)
time.sleep(0.8) # be kind to the server
if __name__ == "__main__":
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
writer.writeheader()
for row in crawl_all_pages(BASE_URL):
writer.writerow(row)
print("Saved quotes.csv")
What this covers: requests headers, parsing, pagination, rate limiting, CSV export.
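If you prefer JSON over CSV, the same generator can feed json.dump instead; a small variation on the script above (reusing crawl_all_pages exactly as defined there):
import json

rows = list(crawl_all_pages(BASE_URL))
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)
print(f"Saved {len(rows)} quotes to quotes.json")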
Example 2: Handling Dynamic Pages with Selenium
Some pages render content via JavaScript. Use Selenium to control a real browser.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
# Set up Chrome (headless)
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://quotes.toscrape.com/js/") # JS-rendered version
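# Added suggestion (not part of the original script): on JS-rendered pages the
# quotes may not be in the DOM yet when find_elements runs, so wait for them explicitly.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))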
quotes = driver.find_elements(By.CLASS_NAME, "quote")
data = []
for q in quotes:
text = q.find_element(By.CLASS_NAME, "text").text
author = q.find_element(By.CLASS_NAME, "author").text
tags = [t.text for t in q.find_elements(By.CSS_SELECTOR, ".tags .tag")]
data.append({"text": text, "author": author, "tags": ", ".join(tags)})
driver.quit()
print(data)
Example 3: Quick Scrapy Spider
Scrapy is excellent for scalable crawls with built-in concurrency, pipelines, and retry logic.
scrapy startproject quotesproj
cd quotesproj
scrapy genspider quotes quotes.toscrape.com
# quotesproj/quotesproj/spiders/quotes.py
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
allowed_domains = ["quotes.toscrape.com"]
start_urls = ["https://quotes.toscrape.com/"]
custom_settings = {
"USER_AGENT": "ChandanScraper/1.0 (+https://example.com/contact)",
"DOWNLOAD_DELAY": 0.6,
}
def parse(self, response):
for q in response.css(".quote"):
yield {
"text": q.css(".text::text").get(),
"author": q.css(".author::text").get(),
"tags": q.css(".tags .tag::text").getall(),
}
next_href = response.css(".pager .next a::attr(href)").get()
if next_href:
yield response.follow(next_href, callback=self.parse)
scrapy crawl quotes -O quotes.json
# Outputs JSON to quotes.json
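Item pipelines are where per-item cleaning or validation usually lives in Scrapy. A minimal sketch of a hypothetical pipeline for this spider (dropped into quotesproj/quotesproj/pipelines.py and enabled via ITEM_PIPELINES):
# quotesproj/quotesproj/pipelines.py (illustrative example)
class CleanQuotePipeline:
    def process_item(self, item, spider):
        # Strip the decorative curly quotes and normalise the tag list
        item["text"] = item["text"].strip("“”").strip()
        item["tags"] = sorted(set(item["tags"]))
        return item

# Enable it in settings.py (or the spider's custom_settings):
# ITEM_PIPELINES = {"quotesproj.pipelines.CleanQuotePipeline": 300}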
Robustness: Error Handling, Retries, & Politeness
- Retry on transient failures: wrap requests in try/except; retry on HTTP 429/5xx with backoff.
- Time-outs: always set timeout in requests.
- Randomized delays & headers: avoid appearing like a bot, but never attempt to bypass explicit blocks.
- Caching: cache pages locally (e.g., requests-cache) during development; a sketch follows the retry example below.
import random, time, requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
retry = Retry(
total=5, backoff_factor=0.6,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET"]
)
session.mount("https://", HTTPAdapter(max_retries=retry))
def polite_get(url, headers):
resp = session.get(url, headers=headers, timeout=15)
# jittered delay
time.sleep(0.5 + random.random())
resp.raise_for_status()
return resp
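For the caching point above, requests-cache can transparently cache responses while you iterate on selectors. A minimal sketch (the cache name and expiry are arbitrary choices):
import requests_cache

# Responses are stored in a local SQLite cache for an hour
cached = requests_cache.CachedSession("dev_cache", expire_after=3600)
resp = cached.get("https://quotes.toscrape.com/", timeout=15)
print(resp.from_cache)  # False on the first call, True on repeats within the hour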
Mini Project: Scrape “Books to Scrape” with Pagination & CSV Export
We’ll extract book title, price, stock, rating, and product page URL from https://books.toscrape.com/.
import csv, time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
BASE = "https://books.toscrape.com/"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; ChandanScraper/1.0)"}
def get_soup(url):
r = requests.get(url, headers=HEADERS, timeout=15)
r.raise_for_status()
return BeautifulSoup(r.text, "lxml")
def parse_list_page(soup):
for art in soup.select("article.product_pod"):
title = art.h3.a["title"].strip()
price = art.select_one(".price_color").get_text(strip=True)
stock_text = art.select_one(".availability").get_text(strip=True)
rating = art.select_one(".star-rating")["class"][1] # e.g., "Three"
product_rel = art.h3.a["href"]
product_url = urljoin(BASE, product_rel)
yield {"title": title, "price": price, "stock": stock_text, "rating": rating, "url": product_url}
def next_page_url(soup, current_url):
nxt = soup.select_one("li.next a")
if nxt:
return urljoin(current_url, nxt["href"])
return None
def crawl_books():
url = BASE + "catalogue/page-1.html"
while url:
soup = get_soup(url)
for row in parse_list_page(soup):
yield row
url = next_page_url(soup, url)
time.sleep(0.7)
if __name__ == "__main__":
with open("books.csv", "w", newline="", encoding="utf-8") as f:
w = csv.DictWriter(f, fieldnames=["title", "price", "stock", "rating", "url"])
w.writeheader()
for item in crawl_books():
w.writerow(item)
print("Saved books.csv")
Next, load the resulting CSV into pandas for analysis/visualization.
Data Cleaning & Export with Pandas
import pandas as pd
df = pd.read_csv("books.csv")
df["price_num"] = df["price"].str.replace("£", "", regex=False).astype(float)
rating_map = {"One":1, "Two":2, "Three":3, "Four":4, "Five":5}
df["rating_num"] = df["rating"].map(rating_map).astype("Int64")
# Basic stats
print(df["price_num"].describe())
# Save clean CSV/JSON
df.to_csv("books_clean.csv", index=False)
df.to_json("books_clean.json", orient="records", force_ascii=False)
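From here, ordinary pandas analysis applies; for example, a quick look at how price relates to rating (purely illustrative):
# Average price per rating level (1-5)
print(df.groupby("rating_num")["price_num"].mean().round(2))
# The five cheapest five-star books
print(df[df["rating_num"] == 5].nsmallest(5, "price_num")[["title", "price_num"]])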
Common Pitfalls & How to Avoid Them
- Brittle selectors: Prefer stable attributes (id, data-*) over positional CSS.
- Ignoring robots.txt: Review crawl directives; throttle accordingly.
- Missing encodings: Use response.apparent_encoding or r.encoding = "utf-8" when needed.
- No logging: Add logs for troubleshooting and monitoring (a minimal sketch follows this list).
- No persistence: Save intermediate results; resume after failures.
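For the logging point, the standard logging module is usually enough; a minimal sketch (the format and level are just one reasonable choice):
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("scraper")

url = "https://quotes.toscrape.com/"  # example value
log.info("Fetching %s", url)
try:
    raise TimeoutError("simulated transient failure")
except TimeoutError as exc:
    log.warning("Retrying %s after error: %s", url, exc)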
Cheat Sheet
# Quick BeautifulSoup patterns
soup.select("div.item") # CSS select many
soup.select_one("h1.title") # CSS select single
el.get("href") # attribute
el.get_text(strip=True) # text
# Requests with headers & timeout
requests.get(url, headers=HEADERS, timeout=15)
# Selenium waits (recommended)
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".quote")))
Conclusion
Python makes web scraping accessible—from quick scripts with Requests/BeautifulSoup to production-grade crawlers in Scrapy. Start with static HTML, add Selenium only when necessary, and always respect ethics and performance.
Want the full source code or a Jupyter notebook version? Drop a comment and I’ll share it!