The hottest Data Extraction Substack posts right now

Using separate proxy credentials for each website makes it easier to spot usage outliers and to set per-website thresholds.
Be selective about which resources your browser loads when using automation tools like Playwright, to avoid unexpected resource usage.
Set a threshold on proxy usage for every scraper, even seemingly trivial ones, to avoid unexpected costs.
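
A per-website usage threshold could be sketched roughly as follows. This is a minimal illustration, not the post author's implementation: the `ProxyBudget` class, the site names, and the byte limits are all hypothetical, and in practice the usage figures would more likely come from the proxy provider's per-credential reporting.

```python
from collections import defaultdict


class ProxyBudget:
    """Track proxy traffic per website and flag sites that exceed a cap."""

    def __init__(self, limits_bytes, default_limit_bytes):
        self.limits = dict(limits_bytes)          # hypothetical per-site caps
        self.default_limit = default_limit_bytes  # cap for sites without one
        self.used = defaultdict(int)

    def _limit_for(self, site):
        return self.limits.get(site, self.default_limit)

    def record(self, site, num_bytes):
        """Add traffic for a site; return False once the site is over its cap."""
        self.used[site] += num_bytes
        return self.used[site] <= self._limit_for(site)

    def over_budget(self):
        """List the sites whose recorded usage exceeds their threshold."""
        return sorted(
            site for site, used in self.used.items()
            if used > self._limit_for(site)
        )


# Illustrative numbers only.
budget = ProxyBudget({"shop.example": 1_000_000}, default_limit_bytes=200_000)
ok_small = budget.record("blog.example", 150_000)
ok_big = budget.record("shop.example", 900_000)
still_ok = budget.record("blog.example", 100_000)  # pushes blog.example past its cap
```

A scraper could stop or raise an alert as soon as `record` returns `False`, which keeps even a "trivial" job from running up an unnoticed bill.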

Microsoft Playwright is a browser automation tool used for web scraping that supports multiple browsers and can handle dynamic content and complex user interactions.
Browser automation tools like Playwright are crucial for scraping modern websites with dynamic content, interactive elements, and sophisticated front-end frameworks.
Playwright excels at rendering JavaScript, simulating user interactions, and evading anti-scraping measures, but it may be slower and require more resources than Scrapy.
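
One way to keep a Playwright scraper lighter is to skip resources the scraper does not need. The sketch below uses Playwright's synchronous Python API; the set of blocked resource types and the helper names `block_heavy_resources` and `fetch_rendered_html` are assumptions chosen for illustration, and the right set of types depends on the target site.

```python
# Assumed setup: pip install playwright && playwright install chromium
BLOCKED_TYPES = {"image", "media", "font"}  # illustrative choice, tune per site


def block_heavy_resources(route):
    """Abort requests for heavy resource types, let everything else through."""
    if route.request.resource_type in BLOCKED_TYPES:
        route.abort()
    else:
        route.continue_()


def fetch_rendered_html(url):
    """Render a page in headless Chromium and return the final HTML."""
    # Imported lazily so the handler above can be used without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.route("**/*", block_heavy_resources)  # intercept every request
        page.goto(url, wait_until="domcontentloaded")
        html = page.content()
        browser.close()
        return html
```

Calling `fetch_rendered_html` with a page URL returns the HTML after JavaScript has run, without downloading images, media, or fonts along the way.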

Data extraction from relational databases is important for building data pipelines. It's key to choose the right method based on factors like the type and size of the data.
There are several techniques to detect changes in the data, including using timestamps and database triggers. These help identify what new or changed records need to be extracted.
Different data extraction patterns exist, like periodic full reloads or incremental updates. Each method has its pros and cons, and the choice depends on the specific data needs and conditions.
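
Timestamp-based change detection can be illustrated with a small incremental extract. The sketch below uses an in-memory SQLite database; the `orders` table, its columns, and the `extract_changes` helper are hypothetical, and timestamps are stored as ISO 8601 strings so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T08:00:00"),
        (2, 25.5, "2024-01-02T09:30:00"),
        (3, 7.25, "2024-01-03T12:00:00"),
    ],
)


def extract_changes(conn, watermark):
    """Return rows changed after `watermark`, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, total, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark only if something was extracted.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


first_batch, mark = extract_changes(conn, "2024-01-01T23:59:59")

# A later update to row 1 moves its timestamp past the watermark.
conn.execute(
    "UPDATE orders SET total = 12.0, updated_at = '2024-01-04T10:00:00' "
    "WHERE id = 1"
)
second_batch, mark = extract_changes(conn, mark)
```

Each run picks up only rows modified since the previous run. Rows that are physically deleted leave no timestamp behind, so this approach alone does not detect them.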

Extracting data from systems involves querying and saving the data.
Consider incremental data load options like Change Data Capture or using updated_at columns.
Choosing between saving data in a text format (like CSV) or a binary format (like Parquet) has implications for efficiency and data structure.
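
The data-structure side of that choice shows up in a simple round trip. CSV stores only text, so types are lost on the way back in, whereas Parquet stores a schema alongside the data. The sketch below demonstrates the CSV half using only the standard library; the sample record is made up for illustration.

```python
import csv
import io

rows = [{"id": 1, "price": 9.5, "in_stock": True}]

# Write the record to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "in_stock"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
```

After the round trip, the integer, float, and boolean all come back as strings, so the consumer has to re-apply types that a schema-carrying format would have preserved.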

Laziness can be a superpower, leading to finding clever solutions to avoid tedious tasks.
Using Python and XML extraction, the author automated the process of extracting code snippets from a book.
By leveraging existing elements in the document structure, the author efficiently organized and named thousands of code snippets.
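
The idea of naming snippets from the document's own structure might look something like this. The post's actual XML schema is not shown here, so the DocBook-style element names (`chapter`, `section`, `programlisting`), the `id` attributes, and the file-naming scheme are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical book markup; real sources will differ.
BOOK_XML = """
<book>
  <chapter id="ch01">
    <section id="intro">
      <programlisting>print("hello")</programlisting>
      <programlisting>print("world")</programlisting>
    </section>
  </chapter>
  <chapter id="ch02">
    <section id="loops">
      <programlisting>for i in range(3): pass</programlisting>
    </section>
  </chapter>
</book>
"""


def extract_snippets(xml_text):
    """Map generated file names to snippet source, named from element ids."""
    root = ET.fromstring(xml_text)
    snippets = {}
    for chapter in root.iter("chapter"):
        for section in chapter.iter("section"):
            listings = section.iter("programlisting")
            for n, listing in enumerate(listings, start=1):
                name = f"{chapter.get('id')}_{section.get('id')}_{n:02d}.py"
                snippets[name] = (listing.text or "").strip()
    return snippets


snippets = extract_snippets(BOOK_XML)
```

Because the names come from ids already present in the markup, the same loop scales from three snippets to thousands without any manual labelling.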

The web scraping course provided by The Web Scraping Club is always free, though subscribing with a paid plan is appreciated.
The choice between using Scrapy and Playwright for web scraping depends on factors like anti-bot protection and content loading.
Setting up the environment for building a web scraper with Playwright involves installing Python, Playwright, and browser binaries.