The Web Scraping Club

The Web Scraping Club Substack focuses on web scraping techniques, tools, and legal insights, featuring practical tutorials, interviews, and analyses. It covers anti-detect browsers, legal challenges, data monetization, and reviews of tools like Playwright and Scrapy for bypassing bot protections and building web scrapers.

Web Scraping Techniques, Anti-Bot Bypass Tools, Legal Issues in Web Scraping, Data Monetization, Interviews with Industry Experts, Web Scraping Tools Review, Privacy Concerns, Marketplaces for Web-Scraped Data

The hottest Substack posts of The Web Scraping Club

And their main takeaways
78 implied HN points 08 Feb 24
  1. Data Boutique is a marketplace for legally obtained web-scraped data with a focus on quality and easy accessibility.
  2. Sellers on Data Boutique align their interests with the platform by offering affordable, high-quality data, which encourages more purchases and recurring buyers.
  3. Ensuring data quality on Data Boutique involves embedded checks and a Peer Review program, promoting stackable standard data schemas for wider use cases.
98 implied HN points 18 Jan 24
  1. Kameleo is an anti-detect tool that creates different profiles for scrapers with unique characteristics like OS and browser settings.
  2. Profiles created with Kameleo can have realistic fingerprints mimicking various devices and operating systems.
  3. Using Kameleo can help in bypassing Cloudflare's anti-bot protection measures by generating profiles that seem legitimate.
39 implied HN points 28 Jan 24
  1. Using separate proxy credentials for each website makes it easier to detect outliers and set usage thresholds per website.
  2. Be selective about which resources your browser loads when using automation tools like Playwright, to avoid unexpected resource usage.
  3. Set a threshold on proxy usage for every scraper, even seemingly trivial ones, to avoid unexpected costs.
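A minimal sketch of the two ideas above, offered as an illustration and not as the post's own code: a per-website usage budget and a route handler that skips heavy resources. The names `ProxyBudget`, `block_heavy_resources`, and `BLOCKED_RESOURCE_TYPES` are hypothetical, and the choice of which resource types to block is an assumption that depends on the target site.

```python
from collections import defaultdict

# Resource types that often do not matter for data extraction.
# This selection is an assumption; the strings match the values
# Playwright reports in request.resource_type.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


class ProxyBudget:
    """Tracks bytes sent through a proxy per website and checks a cap."""

    def __init__(self, max_bytes_per_site):
        self.max_bytes_per_site = max_bytes_per_site
        self.used = defaultdict(int)

    def record(self, site, num_bytes):
        # Accumulate usage separately for each website.
        self.used[site] += num_bytes

    def exceeded(self, site):
        # True once the website has reached its threshold.
        return self.used[site] >= self.max_bytes_per_site


def block_heavy_resources(route):
    """Route handler: abort requests for heavy resource types."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()
```

With Playwright's sync API, the handler could be registered with `page.route("**/*", block_heavy_resources)`, and the scraper could stop or alert as soon as `budget.exceeded(site)` returns `True`.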
39 implied HN points 21 Jan 24
  1. Microsoft Playwright is a browser automation tool used for web scraping that supports multiple browsers and can handle dynamic content and complex user interactions.
  2. Browser automation tools like Playwright are crucial for scraping modern websites with dynamic content, interactive elements, and sophisticated front-end frameworks.
  3. Playwright excels at rendering JavaScript, simulating user interactions, and evading anti-scraping measures, but may be slower and require more resources than Scrapy.
19 implied HN points 11 Feb 24
  1. Browser fingerprinting is used as an alternative to cookies and raises privacy concerns due to its unique identification capabilities.
  2. Desktop devices are easier to fingerprint uniquely than mobile devices, with Chrome exposing more detailed configuration information.
  3. Innovative approaches like using WebGPU for web fingerprinting pose privacy risks and may require countermeasures to prevent misuse.
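To illustrate why fingerprinting can identify a browser without cookies, here is a rough sketch that hashes a set of attributes into a single identifier. The function name `fingerprint` and all attribute names and values are hypothetical; real fingerprinting scripts collect many more signals and may combine them differently.

```python
import hashlib
import json


def fingerprint(attributes):
    """Return a stable SHA-256 hex digest for a dict of browser attributes."""
    # Sort keys so the same attributes always serialize identically.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute sets that differ only in the reported GPU.
desktop_a = {
    "platform": "Win32",
    "screen": "2560x1440",
    "timezone": "Europe/Rome",
    "gpu": "hypothetical-gpu-a",
}
desktop_b = dict(desktop_a, gpu="hypothetical-gpu-b")

# A single differing attribute yields a different identifier.
print(fingerprint(desktop_a) == fingerprint(desktop_b))
```

The more attributes a device exposes, the more likely its combination is unique, which is consistent with the observation above about desktop devices.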
0 implied HN points 04 Feb 24
  1. The web scraping course provided by The Web Scraping Club is always free; subscribing to a paid plan is optional but appreciated.
  2. The choice between using Scrapy and Playwright for web scraping depends on factors like anti-bot protection and content loading.
  3. Setting up the environment for building a web scraper with Playwright involves installing Python, Playwright, and browser binaries.
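The setup step above can be sketched as a small script, assuming Python is already installed. The helper names `setup_commands` and `run_setup` are hypothetical, and Chromium as the default browser is an assumption; the commands themselves are the standard `pip install playwright` and `playwright install` invocations.

```python
import subprocess
import sys


def setup_commands(browser="chromium"):
    """Return the commands that install Playwright and one browser binary."""
    return [
        # Install the Playwright Python package into the current interpreter.
        [sys.executable, "-m", "pip", "install", "playwright"],
        # Download the browser binary that Playwright will drive.
        [sys.executable, "-m", "playwright", "install", browser],
    ]


def run_setup(browser="chromium"):
    """Run each setup command, stopping on the first failure."""
    for command in setup_commands(browser):
        subprocess.run(command, check=True)
```

Calling `run_setup()` performs the installation; running it inside a virtual environment keeps the dependencies isolated from the system Python.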
0 implied HN points 25 Jan 24
  1. Kasada is an anti-bot solution that collects data points from browsers and uses AI to determine if a request is legitimate or not.
  2. Scrapers that lack a JavaScript rendering engine, such as those built with Scrapy, are blocked by Kasada's challenges.
  3. One way to bypass Kasada for web scraping is to use tools like undetected-chromedriver, a Selenium Chromedriver patch that avoids triggering anti-bot services.
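A minimal sketch of loading a page with undetected-chromedriver, assuming the `undetected_chromedriver` package and Chrome are installed. The helper `fetch_page_source` and its `driver_factory` parameter are hypothetical, and nothing here guarantees that a given anti-bot service will let the request through.

```python
def fetch_page_source(url, driver_factory=None):
    """Load url in a browser and return the rendered HTML."""
    if driver_factory is None:
        # Imported lazily so this module loads even without the package.
        import undetected_chromedriver as uc
        driver_factory = uc.Chrome

    driver = driver_factory()
    try:
        driver.get(url)
        # page_source holds the DOM after the browser has run JavaScript.
        return driver.page_source
    finally:
        # Always close the browser, even if the page load fails.
        driver.quit()
```

The `driver_factory` parameter exists only so a different Selenium-compatible driver, or a test double, can be swapped in without changing the function.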