Tech-study-notes

Scraping

Web scraping is the process of automatically extracting data from websites, typically by downloading pages over HTTP and parsing their HTML.

HTML fundamentals

Basic HTML document structure:

<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
    <link rel="stylesheet" href="styles.css">
  </head>
  <body>
    <!-- Visible content goes here -->
  </body>
</html>

Essential HTML tags:

| Tag | Purpose | Example |
| --- | --- | --- |
| `<h1>`–`<h6>` | Headings (`<h1>` largest) | `<h1>Main Title</h1>` |
| `<p>` | Paragraph | `<p>Text content</p>` |
| `<div>` | Division/container | `<div class="card">...</div>` |
| `<span>` | Inline container | `<span class="price">$20</span>` |
| `<a>` | Anchor/link | `<a href="url">Link text</a>` |
| `<table>` | Table structure | Contains rows and cells |
| `<tr>` | Table row | Contains table cells |
| `<td>` | Table data cell | Contains cell content |
| `<th>` | Table header | Header cell |

HTML attributes provide additional information about elements:

| Attribute | Purpose | Example |
| --- | --- | --- |
| `class` | CSS styling identifier (can be reused) | `<div class="course-card">...</div>` |
| `id` | Unique identifier (used once per page) | `<div id="main-content">...</div>` |
| `href` | Link destination | `<a href="https://example.com">Link</a>` |

The Document Object Model (DOM)

The DOM represents an HTML document as a tree: each tag is a node with a parent, children, and siblings, and this tree is what parsers like Beautiful Soup navigate.
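As a sketch of that tree structure, the tiny document below (an invented example) can be navigated by parent/child relationships with Beautiful Soup:

```python
from bs4 import BeautifulSoup

# A minimal document to illustrate the DOM tree
html = """
<html>
  <body>
    <div class="card">
      <h1>Main Title</h1>
      <p>Text content</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

print(p.parent.name)                                   # div (the parent node)
print([c.name for c in soup.body.div.find_all(True)])  # ['h1', 'p'] (child elements)
```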

HTTP Status Codes

When requesting a webpage (e.g., with GET), the server's response carries a status code that indicates how the request went:

| Code | Meaning | Action |
| --- | --- | --- |
| 200 | OK | Success |
| 301/302 | Redirect | Follow redirect |
| 403 | Forbidden | Check headers/permissions |
| 404 | Not Found | Verify URL |
| 429 | Too Many Requests | Add delays |
| 500 | Server Error | Retry later |
| 503 | Service Unavailable | Server overload; retry later |
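The table above can be folded into a small helper that maps a status code to the suggested action (a sketch; the mapping just restates the table, and the function name is an invention for illustration):

```python
def next_action(status_code: int) -> str:
    """Map an HTTP status code to a suggested scraper action."""
    if status_code == 200:
        return "parse"
    if status_code in (301, 302):
        return "follow redirect"
    if status_code == 403:
        return "check headers/permissions"
    if status_code == 404:
        return "verify URL"
    if status_code == 429:
        return "add delays"
    if status_code in (500, 503):
        return "retry later"
    return "inspect manually"

print(next_action(429))  # add delays
```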

Web Scraping tools and libraries

| Python library | Purpose | Use case |
| --- | --- | --- |
| Beautiful Soup | HTML/XML parsing | Extracting data from HTML |
| Requests | HTTP requests | Fetching web pages |
| lxml | High-performance parser | Parsing large/malformed HTML |
| Pandas | Data manipulation | Structuring scraped data |
| Selenium | Browser automation | JavaScript-rendered content |
| Scrapy | Full scraping framework | Large-scale projects |

Beautiful Soup key concepts

Installation:

pip install beautifulsoup4
pip install lxml

Import and initialize soup object with lxml parser:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'lxml')

Parser Comparison

| Parser | Speed | Fault tolerance | Recommendation |
| --- | --- | --- | --- |
| html.parser | Moderate | Good | Default, built-in |
| lxml | Fast | Excellent | Recommended |
| html5lib | Slow | Best | HTML5 compliance |

Core Beautiful Soup Methods

| Signature | Purpose | Example |
| --- | --- | --- |
| `find(tag, attributes)` | Returns first matching element | `soup.find('div', class_='card')` |
| `find_all(tag, attributes)` | Returns list of all matching elements | `soup.find_all('p', limit=5)` |
| `element.text` | Extract text content | `soup.find('a').text` |
| `element.get('href')` | Get a specific attribute | `soup.find('a').get('href')` |
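The four methods above in action on a small invented snippet:

```python
from bs4 import BeautifulSoup

html = ('<div class="card">'
        '<a href="https://example.com">Link text</a>'
        '<p>First</p><p>Second</p>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

card = soup.find("div", class_="card")    # first matching element
paragraphs = soup.find_all("p", limit=5)  # list of all matches (capped at 5)
link = soup.find("a")

print(link.text)         # Link text
print(link.get("href"))  # https://example.com
print(len(paragraphs))   # 2
```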

Scraping methodology

The scraping workflow

  1. Inspect the Target: Use browser developer tools (F12 on Chrome) to
    • View and edit HTML/CSS in real-time
    • Monitor HTTP requests
    • Test JavaScript and view errors
    • Hover over elements to see their HTML
  2. Identify Data Location: Find HTML tags containing target data
  3. Make HTTP Request: Fetch the page content
  4. Parse HTML: Convert to Beautiful Soup object
  5. Extract Data: Navigate and select elements
  6. Store Results: Save to file or database
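Steps 3–6 of the workflow above can be sketched as small functions (a sketch; the `card` class and the CSV schema are placeholder assumptions, not from any real site):

```python
import csv
import requests
from bs4 import BeautifulSoup

def fetch(url: str) -> str:
    # 3. Make HTTP request
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def extract(html: str) -> list:
    # 4-5. Parse HTML and extract data ('card' is a placeholder class)
    soup = BeautifulSoup(html, "html.parser")
    return [{"text": div.text.strip()}
            for div in soup.find_all("div", class_="card")]

def store(rows: list, path: str) -> None:
    # 6. Store results as CSV
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text"])
        writer.writeheader()
        writer.writerows(rows)
```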

Data storage and structuring

In Python, scraped data is easily manipulated with pandas DataFrames, which can also export to multiple formats; the most common are CSV, JSON, SQLite, and Pickle.
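A minimal sketch of structuring scraped records and exporting them (the records are invented; in-memory buffers stand in for files here):

```python
import io
import pandas as pd

# Scraped records as a list of dicts (invented sample data)
records = [
    {"title": "Course A", "price": 20},
    {"title": "Course B", "price": 35},
]

df = pd.DataFrame(records)

csv_buf = io.StringIO()
df.to_csv(csv_buf, index=False)               # CSV export
json_text = df.to_json(orient="records")      # JSON export

print(df["price"].mean())  # 27.5
```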

Dynamic Content

Websites can be static, where all content is present in the HTML returned by the server, or dynamic, where content is generated in the browser by JavaScript after the initial page load.

Selenium for dynamic content

Beautiful Soup is a parser: it converts HTML strings into navigable objects. It does not execute JavaScript or render dynamic content. Dynamic websites require a rendering engine (a browser) to execute JavaScript.

When to use Selenium:

  • Content appears only after JavaScript runs (e.g., infinite scroll, “load more” buttons)
  • Interaction is required (clicks, form submissions, logins)
  • No usable embedded JSON or API endpoint can be found

Selenium architecture:

  1. Browser: Chrome, Firefox, Edge, Safari (must be installed)
  2. WebDriver: Browser-specific automation driver
  3. Selenium Package: Python bindings for browser control

Basic Selenium workflow:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# 1. Configure a headless browser
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=chrome_options)

# 2. Load page (JavaScript executes automatically)
driver.get('https://example.com')

# 3. Get rendered HTML
html = driver.page_source

# 4. Close browser
driver.quit()

Notice that Selenium can use a headless browser, one that runs without a visible GUI. This means faster execution (no rendering overhead) and lower resource consumption (CPU, RAM, bandwidth).

Then, to extract data from rendered pages, a hybrid approach with Beautiful Soup can be implemented:

Hybrid Approach (Selenium + Beautiful Soup):

from selenium import webdriver
from bs4 import BeautifulSoup

# 1. Use Selenium to render JavaScript
driver = webdriver.Chrome()
driver.get('https://dynamic-site.com')
html = driver.page_source
driver.quit()

# 2. Use Beautiful Soup to parse the rendered HTML
soup = BeautifulSoup(html, 'lxml')
data = soup.find_all('div', class_='content')

Finding hidden data sources

Instead of rendering full pages, an efficient alternative is to locate where dynamic sites store their data. This is even faster than headless browsing, yields clean structured data, and reduces the load on the target site.

Location 1: embedded JSON in Script Tags:
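Many dynamic sites ship their data as a JSON blob inside a `<script>` tag. A hedged sketch of extracting it (the page, the `__DATA__` id, and the payload shape are all invented for illustration):

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page embedding its data as JSON in a script tag
html = """
<html><body>
<script id="__DATA__" type="application/json">
{"products": [{"name": "Widget", "price": 9.99}]}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", id="__DATA__")  # the id is an assumption
data = json.loads(script.string)             # clean structured data, no rendering

print(data["products"][0]["name"])  # Widget
```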

Location 2: XHR/Fetch API Requests:

  1. Open developer tools → Network tab
  2. Filter by “XHR” or “Fetch”
  3. Trigger content load (scroll, click)
  4. Inspect requests for JSON data endpoints
  5. Extract data directly from API responses
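Once such an endpoint is found, it can be queried directly with Requests. A sketch, assuming the endpoint returns JSON with an "items" key (the URL, parameter, and payload shape are hypothetical):

```python
import requests

def extract_items(payload: dict) -> list:
    # The "items" key is an assumption about the endpoint's response shape
    return payload.get("items", [])

def fetch_items(api_url: str, page: int = 1) -> list:
    """Query a JSON endpoint discovered in the Network tab."""
    response = requests.get(api_url, params={"page": page}, timeout=10)
    response.raise_for_status()
    return extract_items(response.json())
```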

Scraping decision tree

Need to scrape a website?
        │
        ▼
Is content visible with JavaScript disabled?
        │
    ┌───┴───┐
   Yes      No
    │        │
    ▼        ▼
Standard   Is data embedded in
scraping   page source as JSON?
(Requests)      │
            ┌───┴───┐
           Yes      No
            │        │
            ▼        ▼
        Extract    Use Selenium
         JSON      (browser
         data      automation)

Automation and Scheduling

Cron Jobs

Time-based job scheduler for automating script execution.

Syntax Pattern:

* * * * * command
│ │ │ │ │
│ │ │ │ └─── Day of week (0-7)
│ │ │ └───── Month (1-12)
│ │ └─────── Day of month (1-31)
│ └───────── Hour (0-23)
└─────────── Minute (0-59)

Example:

# Every Monday at 9 AM
0 9 * * 1 /usr/bin/python3 /path/to/scraper.py

Monitoring and Alerting

Use Cases:

  • Detect scraper failures (exceptions, non-200 responses)
  • Detect site layout changes (selectors suddenly match nothing)
  • Flag anomalies in the scraped data (empty or truncated results)

Implementation:

  • Log every run (Python logging module)
  • Send an alert (email, webhook) when a run fails or returns no data
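A minimal sketch of the log-and-alert pattern. The `scrape` and `alert` callables are caller-supplied assumptions: the alert hook could send an email, post to a webhook, etc.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

def run_with_alert(scrape, alert):
    """Run a scrape job; log success, call the alert hook on failure."""
    try:
        result = scrape()
        logger.info("scrape succeeded with %d records", len(result))
        return result
    except Exception as exc:
        logger.error("scrape failed: %s", exc)
        alert(str(exc))  # e.g., send an email or webhook notification
        return None
```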

Scraping considerations

Anti-Scraping Measures

| Measure | Description | Mitigation |
| --- | --- | --- |
| CAPTCHAs | Human verification challenges | Headless browsers, solving services |
| Rate limiting | IP-based request throttling | Request delays, IP rotation |
| IP blocking | Banning scraper IP addresses | Proxy rotation, residential IPs |
| Dynamic content | JavaScript-rendered content | Selenium, Playwright |
| Honeypots | Hidden traps for bots | Careful element selection |

Robots.txt: a plain-text file served at the site root (e.g., https://example.com/robots.txt) that declares which paths crawlers may or may not access. Check it before scraping.
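The standard library can check robots.txt rules via `urllib.robotparser`. A sketch parsing an invented rules file directly (normally you would point `set_url` at the live robots.txt and call `read()`):

```python
from urllib.robotparser import RobotFileParser

# Invented rules; a real file would be fetched from the site root
rules = """
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("my-scraper", "https://example.com/courses"))  # True
print(parser.can_fetch("my-scraper", "https://example.com/admin/x"))  # False
```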

Best Practices:

  1. Rate Limiting: Add delays between requests (1-3 seconds)
  2. User-Agent: Identify your scraper properly
  3. Terms of Service: Review website ToS
  4. Copyright: Don’t redistribute scraped content
  5. Server Load: Avoid overloading target servers
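Practices 1 and 2 can be combined in a small helper (a sketch: the User-Agent string and the 2-second delay are placeholders, and `fetch`/`sleep` are injectable so the pacing logic can be exercised without a network):

```python
import time
import requests

# Placeholder identity string; put real contact info in yours
HEADERS = {"User-Agent": "my-study-scraper/0.1 (contact@example.com)"}

def polite_fetch(urls, fetch=None, delay=2.0, sleep=time.sleep):
    """Fetch URLs with a fixed delay between consecutive requests."""
    if fetch is None:
        fetch = lambda url: requests.get(url, headers=HEADERS, timeout=10)
    results = []
    for i, url in enumerate(urls):
        if i:
            sleep(delay)  # rate limiting between requests
        results.append(fetch(url))
    return results
```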