The hottest Data Extraction Substack posts right now

And their main takeaways
The Web Scraping Club 39 implied HN points 28 Jan 24
  1. Using separate proxy credentials for each website makes it easier to spot usage outliers and to set per-website thresholds.
  2. Be selective about the resources to load in your browser when using automation tools like Playwright to avoid unexpected usage of resources.
  3. Set a threshold on proxy usage for every scraper, even seemingly trivial ones, to avoid unexpected costs.
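
The per-website threshold idea from these takeaways can be sketched in a few lines. This is a minimal illustration, not the post's own code: the `ProxyBudget` class, the byte-based limits, and the `shop.example` hostname are all assumptions made for the example. A real setup would more likely read usage from the proxy provider, keyed by the per-site credentials.

```python
from collections import defaultdict


class ProxyBudget:
    """Hypothetical helper: track proxy traffic per website and enforce a byte threshold."""

    def __init__(self, default_limit_bytes: int):
        self.default_limit_bytes = default_limit_bytes
        self.limits = {}              # optional per-site overrides
        self.used = defaultdict(int)  # bytes consumed so far, per site

    def set_limit(self, site: str, limit_bytes: int) -> None:
        self.limits[site] = limit_bytes

    def record(self, site: str, num_bytes: int) -> None:
        """Add traffic for a site and raise once that site's threshold is exceeded."""
        self.used[site] += num_bytes
        limit = self.limits.get(site, self.default_limit_bytes)
        if self.used[site] > limit:
            raise RuntimeError(
                f"proxy budget exceeded for {site}: {self.used[site]} > {limit} bytes"
            )


# Illustrative numbers only; real thresholds depend on the proxy plan.
budget = ProxyBudget(default_limit_bytes=1_000)
budget.set_limit("shop.example", 500)
budget.record("shop.example", 400)  # still within the threshold
```

Raising an exception stops the scraper as soon as one site overshoots, so a runaway job fails early instead of quietly consuming the whole proxy allowance.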
The Web Scraping Club 39 implied HN points 21 Jan 24
  1. Microsoft Playwright is a browser automation tool used for web scraping that supports multiple browsers and can handle dynamic content and complex user interactions.
  2. Browser automation tools like Playwright are crucial for scraping modern websites with dynamic content, interactive elements, and sophisticated front-end frameworks.
  3. Playwright excels at rendering JavaScript, simulating user interactions, and evading anti-scraping measures, but it may be slower and require more resources than Scrapy.
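
As a rough sketch of what rendering a dynamic page with Playwright can look like, the example below loads a URL in headless Chromium and returns the rendered HTML. To limit resource usage it also skips some resource types. The function names and the set of blocked types are assumptions for illustration. The sketch also assumes that `pip install playwright` and `playwright install chromium` have already been run.

```python
# Resource types the sketch chooses not to download (an assumption; tune per site).
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_block(resource_type: str) -> bool:
    """Return True for resource types the scraper does not need."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chromium and return the HTML after JavaScript has run."""
    # Imported here so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle_route(route):
        # Abort requests for unneeded resources, let everything else through.
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.route("**/*", handle_route)
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```

Calling `fetch_rendered_html("https://example.com")` would start a full browser for one page, which is the resource cost the takeaway contrasts with Scrapy.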
Practical Data Engineering Substack 59 implied HN points 18 Sep 23
  1. Data extraction from relational databases is important for building data pipelines. It's key to choose the right method based on factors like the type and size of the data.
  2. There are several techniques to detect changes in the data, including using timestamps and database triggers. These help identify what new or changed records need to be extracted.
  3. Different data extraction patterns exist, like periodic full reloads or incremental updates. Each method has its pros and cons, and the choice depends on the specific data needs and conditions.
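
Timestamp-based change detection can be illustrated with a small incremental extract. The sketch uses an in-memory SQLite database. The `customers` table, its `updated_at` column, and the `extract_changed_rows` helper are hypothetical names chosen for the example, and it assumes the source system reliably updates the timestamp on every change.

```python
import sqlite3


def extract_changed_rows(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed since the last run.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "2023-09-01T10:00:00"), (2, "Lin", "2023-09-02T08:30:00")],
)

# First run: every row is newer than the initial watermark.
first_batch, watermark = extract_changed_rows(conn, "1970-01-01T00:00:00")

# One record changes in the source; the next run picks up only that record.
conn.execute(
    "UPDATE customers SET name = 'Ada L.', updated_at = '2023-09-03T12:00:00' "
    "WHERE id = 1"
)
second_batch, watermark = extract_changed_rows(conn, watermark)
```

A periodic full reload would instead re-read the whole table on every run. That is simpler, and it also captures deleted rows, which a timestamp filter alone cannot see.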
The Web Scraping Club 0 implied HN points 04 Feb 24
  1. The web scraping course provided by The Web Scraping Club is always free; subscribing with a paid plan is appreciated if you want to support it.
  2. The choice between using Scrapy and Playwright for web scraping depends on factors like anti-bot protection and content loading.
  3. Setting up the environment for building a web scraper with Playwright involves installing Python, Playwright, and browser binaries.