
Web scraping is the process of automatically extracting data from websites, typically by fetching pages over HTTP and parsing their HTML.
Basic HTML document structure:

- `<html>`: root element containing all content
- `<head>`: metadata, title, and resource links (not visible)
- `<body>`: visible page content

```html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
<link rel="stylesheet" href="styles.css">
</head>
<body>
<!-- Visible content goes here -->
</body>
</html>
```

Essential HTML tags:
| Tag | Purpose | Example |
|---|---|---|
| `<h1>`-`<h6>` | Headings (`h1` largest) | `<h1>Main Title</h1>` |
| `<p>` | Paragraph | `<p>Text content</p>` |
| `<div>` | Division/container | `<div class="card">...</div>` |
| `<span>` | Inline container | `<span class="price">$20</span>` |
| `<a>` | Anchor/link | `<a href="url">Link text</a>` |
| `<table>` | Table structure | Contains rows and cells |
| `<tr>` | Table row | Contains table cells |
| `<td>` | Table data cell | Contains cell content |
| `<th>` | Table header | Header cell |
HTML attributes provide additional information about elements:
| Attribute | Purpose | Example |
|---|---|---|
| `class` | CSS styling identifier (can be reused) | `<div class="course-card">...</div>` |
| `id` | Unique identifier (used once per page) | `<div id="main-content">...</div>` |
| `href` | Link destination | `<a href="https://example.com">Link</a>` |
The DOM (Document Object Model) represents HTML as a tree structure: each element is a node, with parent, child, and sibling relationships.
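These tree relationships can be inspected programmatically; a minimal sketch using Beautiful Soup (introduced later in these notes) on a hypothetical HTML snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet to illustrate parent/child relationships in the DOM tree
html = "<html><body><div id='main'><p>Hello</p><p>World</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")
print(div.parent.name)                          # parent node: body
print([child.name for child in div.children])   # child nodes: ['p', 'p']
```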
When requesting a webpage (e.g., with GET), the server's response carries a status code that indicates how the request was handled:
| Code | Meaning | Action |
|---|---|---|
| 200 | OK | Success |
| 301/302 | Redirect | Follow redirect |
| 403 | Forbidden | Check headers/permissions |
| 404 | Not Found | Verify URL |
| 429 | Too Many Requests | Add delays |
| 500 | Server Error | Retry later |
| 503 | Service Unavailable | Wait and retry (server overloaded) |
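The table above can be encoded as a simple dispatch function; a minimal sketch (the function name and action labels are illustrative, not from any library):

```python
def next_action(status_code: int) -> str:
    """Map an HTTP status code to a scraping action, per the table above."""
    if status_code == 200:
        return "success"
    if status_code in (301, 302):
        return "follow-redirect"
    if status_code == 403:
        return "check-headers"
    if status_code == 404:
        return "verify-url"
    if status_code == 429:
        return "add-delay"
    if status_code in (500, 503):
        return "retry-later"
    return "unknown"

print(next_action(429))  # add-delay
```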
Common Python libraries for web scraping:

| Python Library | Purpose | Use Case |
|---|---|---|
| Beautiful Soup | HTML/XML parsing | Extracting data from HTML |
| Requests | HTTP requests | Fetching web pages |
| lxml | High-performance parser | Parsing large/malformed HTML |
| Pandas | Data manipulation | Structuring scraped data |
| Selenium | Browser automation | JavaScript-rendered content |
| Scrapy | Full scraping framework | Large-scale projects |
Installation:

```shell
pip install beautifulsoup4
pip install lxml
```

Import and initialize the soup object with the lxml parser:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'lxml')
```

| Parser | Speed | Fault Tolerance | Recommendation |
|---|---|---|---|
| `html.parser` | Moderate | Good | Default, built-in |
| `lxml` | Fast | Excellent | Recommended |
| `html5lib` | Slow | Best | HTML5 compliance |
| Signature | Purpose | Example |
|---|---|---|
| `find(tag, attributes)` | Returns first matching element | `soup.find('div', class_='card')` |
| `find_all(tag, attributes)` | Returns list of all matching elements | `soup.find_all('p', limit=5)` |
| `element.text` | Extract text content | `soup.find('h1').text` |
| `element.get('href')` | Get a specific attribute | `soup.find('a').get('href')` |
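These methods can be exercised on a small inline document; a minimal sketch with hypothetical HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML to demonstrate find, find_all, .text, and .get
html = """
<div class="card"><span class="price">$20</span>
  <a href="https://example.com">Details</a></div>
<div class="card"><span class="price">$35</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

first_card = soup.find("div", class_="card")     # first match only
prices = soup.find_all("span", class_="price")   # list of all matches
print(first_card.find("span").text)              # $20
print([p.text for p in prices])                  # ['$20', '$35']
print(soup.find("a").get("href"))                # https://example.com
```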
In Python, scraped data can be structured and manipulated with DataFrames (e.g., Pandas), which also export to multiple formats: the most common are CSV, JSON, SQLite, and Pickle.
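A minimal sketch of this workflow (the records and column names are hypothetical):

```python
import pandas as pd

# Hypothetical scraped records
records = [
    {"title": "Course A", "price": 20},
    {"title": "Course B", "price": 35},
]
df = pd.DataFrame(records)

csv_text = df.to_csv(index=False)         # CSV string (or pass a path to write a file)
json_text = df.to_json(orient="records")  # JSON array of row objects
print(csv_text)
```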
Websites can be static, where the server delivers the full HTML up front, or dynamic, where JavaScript generates or loads content in the browser after the initial page load.
Beautiful Soup is a parser: it converts HTML strings into navigable objects. It does not execute JavaScript or render dynamic content. Dynamic websites require a rendering engine (a browser) to execute JavaScript.
When to use Selenium: when content is rendered by JavaScript, loads after user interaction (clicks, scrolling, form input), or appears via infinite scroll.

Selenium architecture: a Python script sends commands through a WebDriver, which controls a real browser that loads and renders the page.

Basic Selenium workflow:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# 1. Configure a headless browser
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)

# 2. Load page (JavaScript executes automatically)
driver.get('https://example.com')

# 3. Get rendered HTML
html = driver.page_source

# 4. Close browser
driver.quit()
```

Notice how Selenium can use a headless browser: a browser that runs without a visible GUI. This means faster execution (no rendering overhead) and lower resource consumption (CPU, RAM, bandwidth).
Then, to extract data from rendered pages, a hybrid approach with Beautiful Soup can be implemented:
Hybrid Approach (Selenium + Beautiful Soup):
```python
from selenium import webdriver
from bs4 import BeautifulSoup

# 1. Use Selenium to render the JavaScript
driver = webdriver.Chrome()
driver.get('https://dynamic-site.com')
html = driver.page_source
driver.quit()

# 2. Use Beautiful Soup to parse the rendered HTML
soup = BeautifulSoup(html, 'lxml')
data = soup.find_all('div', class_='content')
```

Instead of rendering full pages, an efficient alternative is to locate where dynamic sites store their data. This is even faster than headless browsing and yields clean, structured data while reducing the load on the target site.
Location 1: JSON embedded in `<script>` tags, e.g. `var data = {...}` or `window.__DATA__ = {...}`.

Location 2: XHR/Fetch API requests: open the browser's Network tab to find the JSON endpoints the page calls, then request them directly.
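For Location 1, the embedded JSON can be pulled out with a regular expression; a minimal sketch on a hypothetical page source:

```python
import json
import re

# Hypothetical page source with data embedded in a <script> tag
html = '<script>window.__DATA__ = {"items": [{"name": "Course A", "price": 20}]};</script>'

# Capture everything between "window.__DATA__ =" and the closing "};"
match = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});', html, re.DOTALL)
data = json.loads(match.group(1))
print(data["items"][0]["name"])  # Course A
```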
```
Need to scrape a website?
│
▼
Is content visible with JavaScript disabled?
├── Yes → Standard scraping (Requests)
└── No  → Is data embedded in page source as JSON?
          ├── Yes → Extract JSON data
          └── No  → Use Selenium (browser automation)
```

Cron is a time-based job scheduler for automating script execution.
Syntax Pattern:

```
* * * * * command
│ │ │ │ │
│ │ │ │ └── Day of week (0-7)
│ │ │ └──── Month (1-12)
│ │ └────── Day of month (1-31)
│ └──────── Hour (0-23)
└────────── Minute (0-59)
```

Example:
```shell
# Every Monday at 9 AM
0 9 * * 1 /usr/bin/python3 /path/to/scraper.py
```

Use Cases:
Implementation: add the line to your crontab by running `crontab -e`.

Common anti-scraping measures and their mitigations:
| Measure | Description | Mitigation |
|---|---|---|
| CAPTCHAs | Human verification challenges | Solving services, lower request rates |
| Rate Limiting | IP-based request throttling | Request delays, IP rotation |
| IP Blocking | Banning scraper IP addresses | Proxy rotation, residential IPs |
| Dynamic Content | JavaScript-rendered content | Selenium, Playwright |
| Honeypots | Hidden traps for bots | Careful element selection |
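A polite setup addressing rate limiting can be sketched as follows (the User-Agent string, function name, and delays are placeholders, not a standard API):

```python
import random
import time

import requests

# Hypothetical polite-scraping setup: identify yourself and space out requests
session = requests.Session()
session.headers.update({"User-Agent": "MyScraper/1.0 (contact@example.com)"})

def polite_get(url, min_delay=1.0, max_delay=3.0):
    """Fetch a URL, then pause a random interval to respect rate limits."""
    response = session.get(url)
    time.sleep(random.uniform(min_delay, max_delay))
    return response
```

Randomizing the delay makes the request pattern look less mechanical than a fixed interval.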
Robots.txt: check `domain.com/robots.txt` to see which paths the site allows crawlers to access.

Best Practices:
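One such practice, honoring robots.txt, can be automated with Python's standard library; a sketch with a hypothetical robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content
robots_txt = """
User-agent: *
Disallow: /admin/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("MyScraper", "https://example.com/courses"))      # True
print(rp.can_fetch("MyScraper", "https://example.com/admin/users"))  # False
```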