The hottest Data Extraction Substack posts right now

And their main takeaways
The Web Scraping Club 39 implied HN points 28 Jan 24
  1. Using separate proxy credentials for each website makes it easier to detect usage outliers and set per-website thresholds.
  2. When using automation tools like Playwright, be selective about which resources the browser loads to avoid unexpected resource usage.
  3. Set a proxy usage threshold for every scraper, even seemingly trivial ones, to avoid unexpected costs.
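
The three takeaways above can be sketched roughly as follows. This is a minimal illustration, not the post's own code: the names `BLOCKED_RESOURCE_TYPES`, `should_block`, `handle_route`, and `ProxyBudget` are hypothetical, the list of blocked resource types and the megabyte limits are assumptions to tune per target site, and how traffic is measured is left to the caller.

```python
from collections import defaultdict

# Resource types that often are not needed for data extraction.
# This set is an assumption; adjust it for each target website.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


def should_block(resource_type: str) -> bool:
    """Return True when a request of this type should not be loaded."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def handle_route(route) -> None:
    """Playwright route handler: abort heavy requests, let the rest through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()


class ProxyBudget:
    """Track proxy traffic per website and flag sites that pass their limit."""

    def __init__(self, limits_mb, default_limit_mb=100.0):
        self.limits_mb = dict(limits_mb)          # per-site thresholds (hypothetical values)
        self.default_limit_mb = default_limit_mb  # fallback for unlisted sites
        self.used_mb = defaultdict(float)

    def record(self, site: str, megabytes: float) -> None:
        """Add measured traffic for one site."""
        self.used_mb[site] += megabytes

    def exceeded(self, site: str) -> bool:
        """True once the site's recorded usage is above its threshold."""
        limit = self.limits_mb.get(site, self.default_limit_mb)
        return self.used_mb[site] > limit
```

With Playwright's sync API, the handler would be registered before navigation with `page.route("**/*", handle_route)`. Keying the budget by website mirrors the idea of one proxy credential per website: an outlier shows up as a single site crossing its own threshold, and the scraper can stop before costs grow.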
The Web Scraping Club 39 implied HN points 21 Jan 24
  1. Microsoft Playwright is a browser automation tool used for web scraping that supports multiple browsers and can handle dynamic content and complex user interactions.
  2. Browser automation tools like Playwright are crucial for scraping modern websites with dynamic content, interactive elements, and sophisticated front-end frameworks.
  3. Playwright excels at rendering JavaScript, simulating user interactions, and evading anti-scraping measures, but may be slower and require more resources than Scrapy.
!important 3 HN points 01 Nov 23
  1. Laziness can be a superpower, leading to finding clever solutions to avoid tedious tasks.
  2. Using Python and XML extraction, the author automated the process of extracting code snippets from a book.
  3. By leveraging existing elements in the document structure, the author efficiently organized and named thousands of code snippets.
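
As a rough sketch of the approach described above, the following uses Python's standard `xml.etree.ElementTree` to pull code snippets out of an XML document and name them from the surrounding structure. The element names (`chapter`, `title`, `programlisting`), the sample document, and the file-naming scheme are assumptions made for illustration; the post's actual book format and naming rules are not given here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical book structure, used only to demonstrate the idea.
SAMPLE_BOOK = """<book>
  <chapter>
    <title>Getting Started</title>
    <programlisting>print("hello")</programlisting>
    <programlisting>x = 1 + 1</programlisting>
  </chapter>
  <chapter>
    <title>Loops &amp; Lists</title>
    <programlisting>for i in range(3):
    print(i)</programlisting>
  </chapter>
</book>"""


def slugify(text: str) -> str:
    """Turn a chapter title into a lowercase, hyphen-separated name."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def extract_snippets(xml_text: str) -> dict:
    """Map a generated file name to each code snippet found in the document."""
    root = ET.fromstring(xml_text)
    snippets = {}
    for chapter in root.iter("chapter"):
        title = chapter.findtext("title", default="untitled")
        # Number snippets within each chapter so that names stay unique.
        for index, listing in enumerate(chapter.iter("programlisting"), start=1):
            name = f"{slugify(title)}-{index:02d}.txt"
            snippets[name] = listing.text or ""
    return snippets


if __name__ == "__main__":
    for name, code in extract_snippets(SAMPLE_BOOK).items():
        print(name, "->", len(code), "characters")
```

Reusing the chapter title as the name prefix is what lets thousands of snippets organize themselves without manual labeling; writing each entry to disk would be a small extension of the loop in the `__main__` section.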
The Web Scraping Club 0 implied HN points 04 Feb 24
  1. The web scraping course provided by The Web Scraping Club is always free; subscribing to a paid plan is appreciated.
  2. The choice between using Scrapy and Playwright for web scraping depends on factors like anti-bot protection and content loading.
  3. Setting up the environment for building a web scraper with Playwright involves installing Python, Playwright, and browser binaries.