Thursday, 22 January 2026

Web Scraping using Python: A Practical Guide with Examples

Learn how to extract data from websites using Python tools like Requests, BeautifulSoup, Selenium, and Scrapy. This guide covers ethics, pagination, error handling, and exporting data to CSV/JSON—with a complete mini-project.

What is Web Scraping?

Web scraping is the automated process of fetching web pages and extracting structured information (e.g., prices, reviews, articles) for analysis or integration. In Python, it's commonly done using requests to download HTML and BeautifulSoup or lxml to parse it. For dynamic, JavaScript-rendered sites, tools like Selenium or frameworks like Scrapy are used.

Use cases: competitive pricing, news monitoring, SEO research, research datasets, trend analysis, job/real-estate listings, and more.
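
A minimal sketch of that static-page workflow (assuming requests, beautifulsoup4, and lxml are installed, as in the setup section below):

import requests
from bs4 import BeautifulSoup

# Download one page, parse the HTML, and pull out a single piece of data
html = requests.get("https://quotes.toscrape.com/", timeout=15).text
soup = BeautifulSoup(html, "lxml")
print(soup.title.get_text(strip=True))  # prints the page's <title> text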

Ethics, Legality & Robots.txt

  • Check Terms of Service: Some sites prohibit scraping; ensure you comply.
  • Respect robots.txt: It indicates which paths crawlers may access (it isn’t a legal contract, but you should honor it); a quick programmatic check is sketched after this list.
  • Rate Limit: Avoid overloading servers; add delays and caching.
  • Identify Yourself: Use a descriptive User-Agent string.
  • Use Public/Test Sites: For practice, use quotes.toscrape.com, books.toscrape.com, etc.

Disclaimer: This post is for educational purposes. Always follow applicable laws and site policies.
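
For the robots.txt point above, the standard library's urllib.robotparser offers a quick programmatic check; a minimal sketch against the practice site:

from urllib.robotparser import RobotFileParser

# Download and parse robots.txt, then ask whether a given URL may be fetched
rp = RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()
print(rp.can_fetch("ChandanScraper/1.0", "https://quotes.toscrape.com/page/2/"))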

Core Tools You’ll Use

Library        | Purpose                  | When to use
requests       | Download HTML content    | Static pages, APIs
BeautifulSoup  | Parse & navigate HTML    | Extract text/attributes from tags
Selenium       | Automate a browser       | Dynamic sites requiring JS execution
Scrapy         | Full scraping framework  | Large-scale, multi-page, robust pipelines

Environment Setup

# Create and activate a virtual environment (recommended)
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate

# Install essentials
pip install requests beautifulsoup4 lxml pandas

# Optional for JS-heavy sites
pip install selenium webdriver-manager

# For large projects
pip install scrapy

Example 1: Scrape Static HTML with Requests + BeautifulSoup

We’ll scrape quotes, authors, and tags from the classic practice site https://quotes.toscrape.com/.

import time
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://quotes.toscrape.com"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; ChandanScraper/1.0; +https://example.com/contact)"
}

def fetch(url):
    """Fetch a URL and return BeautifulSoup-parsed document."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()  # raises for 4xx/5xx
    return BeautifulSoup(resp.text, "lxml")

def parse_quotes(doc):
    """Yield quote dicts from a page."""
    for q in doc.select(".quote"):
        text = q.select_one(".text").get_text(strip=True)
        author = q.select_one(".author").get_text(strip=True)
        tags = [t.get_text(strip=True) for t in q.select(".tags .tag")]
        yield {"text": text, "author": author, "tags": ", ".join(tags)}

def find_next_page(doc):
    next_link = doc.select_one(".pager .next a")
    if next_link:
        return BASE_URL + next_link.get("href")
    return None

def crawl_all_pages(start_url=BASE_URL):
    url = start_url
    while url:
        doc = fetch(url)
        for item in parse_quotes(doc):
            yield item
        url = find_next_page(doc)
        time.sleep(0.8)  # be kind to the server

if __name__ == "__main__":
    with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
        writer.writeheader()
        for row in crawl_all_pages(BASE_URL):
            writer.writerow(row)
    print("Saved quotes.csv")

What this covers: requests headers, parsing, pagination, rate limiting, CSV export.

Example 2: Handling Dynamic Pages with Selenium

Some pages render content via JavaScript. Use Selenium to control a real browser.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Set up Chrome (headless)
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

driver.get("https://quotes.toscrape.com/js/")  # JS-rendered version

quotes = driver.find_elements(By.CLASS_NAME, "quote")
data = []
for q in quotes:
    text = q.find_element(By.CLASS_NAME, "text").text
    author = q.find_element(By.CLASS_NAME, "author").text
    tags = [t.text for t in q.find_elements(By.CSS_SELECTOR, ".tags .tag")]
    data.append({"text": text, "author": author, "tags": ", ".join(tags)})

driver.quit()
print(data)

Tip: Prefer Requests+BS4 when possible; Selenium is heavier. Consider official APIs or the network calls visible in your browser's DevTools first (a sketch follows).
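
For example, if the Network tab shows the page loading its data from a JSON endpoint, you can usually call that endpoint directly with requests and skip the browser. A sketch, assuming a hypothetical /api/quotes endpoint that returns JSON:

import requests

# Hypothetical JSON endpoint spotted in the DevTools Network tab
resp = requests.get(
    "https://quotes.toscrape.com/api/quotes",
    params={"page": 1},
    headers={"User-Agent": "ChandanScraper/1.0"},
    timeout=15,
)
resp.raise_for_status()
for q in resp.json().get("quotes", []):
    print(q.get("author", {}).get("name"), "|", q.get("text"))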

Example 3: Quick Scrapy Spider

Scrapy is excellent for scalable crawls with built-in concurrency, pipelines, and retry logic.

scrapy startproject quotesproj
cd quotesproj
scrapy genspider quotes quotes.toscrape.com

# quotesproj/quotesproj/spiders/quotes.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    custom_settings = {
        "USER_AGENT": "ChandanScraper/1.0 (+https://example.com/contact)",
        "DOWNLOAD_DELAY": 0.6,
    }

    def parse(self, response):
        for q in response.css(".quote"):
            yield {
                "text": q.css(".text::text").get(),
                "author": q.css(".author::text").get(),
                "tags": q.css(".tags .tag::text").getall(),
            }
        next_href = response.css(".pager .next a::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

scrapy crawl quotes -O quotes.json
# Outputs JSON to quotes.json (-O overwrites the file; -o appends to it)
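
For larger crawls, Scrapy's built-in politeness and retry settings are worth enabling instead of relying only on a fixed delay. A sketch of settings you might add to custom_settings (or settings.py); the values shown are illustrative:

# Illustrative additions to custom_settings (or quotesproj/settings.py)
custom_settings = {
    "USER_AGENT": "ChandanScraper/1.0 (+https://example.com/contact)",
    "ROBOTSTXT_OBEY": True,           # respect robots.txt directives
    "AUTOTHROTTLE_ENABLED": True,     # adapt the delay to server response times
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "RETRY_TIMES": 3,                 # retry transient failures (e.g., 5xx)
}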

Robustness: Error Handling, Retries, & Politeness

  • Retry on transient failures: wrap requests in try/except; retry on HTTP 429/5xx with backoff.
  • Time-outs: always set timeout in requests.
  • Randomized delays & headers: avoid appearing like a bot; but never attempt to bypass explicit blocks.
  • Caching: cache pages locally (e.g., with requests-cache) during development; a sketch follows the retry example below.

import random, time, requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(
    total=5, backoff_factor=0.6,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"]
)
session.mount("https://", HTTPAdapter(max_retries=retry))

def polite_get(url, headers):
    resp = session.get(url, headers=headers, timeout=15)
    # jittered delay
    time.sleep(0.5 + random.random())
    resp.raise_for_status()
    return resp
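
For the caching point above, a minimal sketch using requests-cache (install with pip install requests-cache; the cache name and expiry are illustrative):

import requests_cache

# Cache GET responses on disk so repeated development runs don't re-hit the server
session = requests_cache.CachedSession("dev_cache", expire_after=3600)

resp = session.get("https://quotes.toscrape.com/", timeout=15)
print(resp.from_cache)  # False on the first fetch, True on later runs within the hour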

Mini Project: Scrape “Books to Scrape” with Pagination & CSV Export

We’ll extract book title, price, stock, rating, and product page URL from https://books.toscrape.com/.

import csv, time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/"

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; ChandanScraper/1.0)"}

def get_soup(url):
    r = requests.get(url, headers=HEADERS, timeout=15)
    r.raise_for_status()
    return BeautifulSoup(r.text, "lxml")

def parse_list_page(soup):
    for art in soup.select("article.product_pod"):
        title = art.h3.a["title"].strip()
        price = art.select_one(".price_color").get_text(strip=True)
        stock_text = art.select_one(".availability").get_text(strip=True)
        rating = art.select_one(".star-rating")["class"][1]  # e.g., "Three"
        product_rel = art.h3.a["href"]
        product_url = urljoin(BASE, product_rel)
        yield {"title": title, "price": price, "stock": stock_text, "rating": rating, "url": product_url}

def next_page_url(soup, current_url):
    nxt = soup.select_one("li.next a")
    if nxt:
        return urljoin(current_url, nxt["href"])
    return None

def crawl_books():
    url = BASE + "catalogue/page-1.html"
    while url:
        soup = get_soup(url)
        for row in parse_list_page(soup):
            yield row
        url = next_page_url(soup, url)
        time.sleep(0.7)

if __name__ == "__main__":
    with open("books.csv", "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["title", "price", "stock", "rating", "url"])
        w.writeheader()
        for item in crawl_books():
            w.writerow(item)
    print("Saved books.csv")

Next steps: normalize price to numeric, map ratings (One–Five to 1–5), and feed the data into pandas for analysis/visualization, as shown in the next section.

Data Cleaning & Export with Pandas

import pandas as pd

df = pd.read_csv("books.csv")
df["price_num"] = df["price"].str.replace("£", "", regex=False).astype(float)
rating_map = {"One":1, "Two":2, "Three":3, "Four":4, "Five":5}
df["rating_num"] = df["rating"].map(rating_map).astype("Int64")

# Basic stats
print(df["price_num"].describe())

# Save clean CSV/JSON
df.to_csv("books_clean.csv", index=False)
df.to_json("books_clean.json", orient="records", force_ascii=False)
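
From here, a quick first analysis is only a line or two away, for example the average price per rating:

import pandas as pd

# Average price per rating band as a quick sanity check on the cleaned data
df = pd.read_csv("books_clean.csv")
print(df.groupby("rating_num")["price_num"].mean().round(2))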

Common Pitfalls & How to Avoid Them

  • Brittle selectors: Prefer stable attributes (id, data-*) over positional CSS.
  • Ignoring robots.txt: Review crawl directives; throttle accordingly.
  • Missing encodings: If scraped text looks garbled, set the encoding before reading r.text (e.g., r.encoding = r.apparent_encoding or r.encoding = "utf-8").
  • No logging: Add logs for troubleshooting and monitoring (a minimal setup is sketched after this list).
  • No persistence: Save intermediate results; resume after failures.
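
For the logging point above, a minimal setup sketch (the file name and format are illustrative):

import logging

# Log to both the console and a file so failed runs can be diagnosed later
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[logging.StreamHandler(), logging.FileHandler("scraper.log")],
)
log = logging.getLogger("scraper")

log.info("Fetching %s", "https://quotes.toscrape.com/page/2/")
log.warning("Got HTTP %s, retrying", 503)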

Cheat Sheet

# Quick BeautifulSoup patterns
soup.select("div.item")             # CSS select many
soup.select_one("h1.title")         # CSS select single
el.get("href")                      # attribute
el.get_text(strip=True)             # text

# Requests with headers & timeout
requests.get(url, headers=HEADERS, timeout=15)

# Selenium waits (recommended)
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".quote")))

Conclusion

Python makes web scraping accessible—from quick scripts with Requests/BeautifulSoup to production-grade crawlers in Scrapy. Start with static HTML, add Selenium only when necessary, and always respect ethics and performance.

Want the full source code or a Jupyter notebook version? Drop a comment and I’ll share it!

© Web Scraping with Python — by Chandan Chauhan
