The Web Scraping Club

The Web Scraping Club Substack focuses on web scraping techniques, tools, and legal insights, featuring practical tutorials, interviews, and analyses. It covers anti-detect browsers, legal challenges, data monetization, and reviews of tools like Playwright and Scrapy for bypassing bot protections and building web scrapers.

Web Scraping Techniques, Anti-Bot Bypass Tools, Legal Issues in Web Scraping, Data Monetization, Interviews with Industry Experts, Web Scraping Tools Review, Privacy Concerns, Marketplaces for Web-Scraped Data

The hottest Substack posts of The Web Scraping Club

And their main takeaways

The Lab #37: Bypassing Cloudflare with anti-detect browsers - Part 2

98 implied HN points • 18 Jan 24

🕹 Technology Web scraping

Kameleo is an anti-detect tool that creates different profiles for scrapers with unique characteristics like OS and browser settings.
Profiles created with Kameleo can have realistic fingerprints mimicking various devices and operating systems.
Kameleo can help bypass Cloudflare's anti-bot protection by generating profiles that look legitimate.

The Lab #40: start a web data monetization project with Data Boutique

78 implied HN points • 08 Feb 24

🕹 Technology Data Monetization Web scraping Quality control

Data Boutique is a marketplace for legally obtained web-scraped data with a focus on quality and easy accessibility.
Sellers on Data Boutique align their interests with the platform's by offering affordable, high-quality data, which encourages more purchases and repeat buyers.
Ensuring data quality on Data Boutique involves embedded checks and a Peer Review program, promoting stackable standard data schemas for wider use cases.

The Lab #39: Mouse movements in Playwright

58 implied HN points • 01 Feb 24

🕹 Technology Web scraping Automation Programming

Anti-bot protections in web scraping projects can detect unnatural mouse movements.
Playwright has no native support for emulating human-like mouse movements.
Python packages like python_ghost_cursor can be used to create human-like mouse movements for web scraping.
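To illustrate the idea behind human-like mouse movement, here is a minimal sketch that moves the pointer along a randomised cubic Bezier curve instead of a straight line. It is a hand-rolled illustration, not the python_ghost_cursor API: the names `human_like_path` and `move_mouse`, the `jitter` parameter, and the delay range are all assumptions made for this example. Only `page.mouse.move` and `page.wait_for_timeout` come from Playwright's sync API.

```python
import random


def human_like_path(start, end, steps=25, jitter=0.25, seed=None):
    """Return (x, y) points along a randomised cubic Bezier curve.

    Hypothetical helper: the curve starts exactly at `start`, ends exactly
    at `end`, and bends through two randomly displaced control points.
    """
    rng = random.Random(seed)
    (x0, y0), (x3, y3) = start, end
    dx, dy = x3 - x0, y3 - y0

    def control(t):
        # Take a point on the straight line and push it sideways at random.
        return (
            x0 + dx * t + rng.uniform(-jitter, jitter) * dy,
            y0 + dy * t + rng.uniform(-jitter, jitter) * dx,
        )

    (x1, y1), (x2, y2) = control(1 / 3), control(2 / 3)

    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Standard cubic Bezier interpolation.
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points


def move_mouse(page, start, end):
    """Replay the curved path on a Playwright page (sync API)."""
    for x, y in human_like_path(start, end):
        page.mouse.move(x, y)
        # Small random pause between moves; the 5-20 ms range is an assumption.
        page.wait_for_timeout(random.uniform(5, 20))
```

With a Playwright `page` already open, `move_mouse(page, (20, 30), (400, 260))` would trace the curve before a click. Whether this is enough to pass a given anti-bot check depends on the protection in use.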

Monetize your web scraping skills

58 implied HN points • 14 Jan 24

💼 Business Freelancing Marketplaces Web scraping Data Monetization Skills

Freelance your web scraping skills through traditional marketplaces or job boards like Upwork and Fiverr.
Consider selling your web scraping code on the Apify platform to monetize your skills.
Explore selling the data you extract from web scraping on marketplaces like Data Boutique for additional income.

Legal Zyte-geist #2: Web Scraping and AI 2023 Legal Wrap-Up

58 implied HN points • 09 Jan 24

🕹 Technology AI Web scraping Regulation

Legal activity surrounding web scraping and AI increased in 2023.
Key legal issues for web scrapers include copyright infringement and data protection.
The EU's AI Act applies different rules depending on the risk level of an AI system.

How scraping a single website cost thousands of dollars in proxy

39 implied HN points • 28 Jan 24

🕹 Technology Web scraping Data Extraction

Using separate proxy credentials for each website makes it easier to spot usage outliers and to set per-website thresholds.
Be selective about which resources your browser loads when using automation tools like Playwright, to avoid unexpected resource usage.
Set a threshold on proxy usage for every scraper, even seemingly trivial ones, to avoid unexpected costs.
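The last two takeaways can be sketched as follows. This is a minimal illustration under stated assumptions, not the post's actual code: `ProxyBudget`, `ProxyBudgetExceeded`, `block_heavy_resources`, and the set of blocked resource types are hypothetical names and choices. The handler relies only on Playwright's `route.request.resource_type`, `route.abort()`, and `route.continue_()`.

```python
from collections import defaultdict

# Resource types to skip; which ones are safe to drop depends on the site.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


class ProxyBudgetExceeded(RuntimeError):
    """Raised when a site has used more proxy traffic than allowed."""


class ProxyBudget:
    """Track proxy traffic per website and stop once a threshold is passed."""

    def __init__(self, limit_bytes_per_site):
        self.limit = limit_bytes_per_site
        self.used = defaultdict(int)

    def record(self, site, n_bytes):
        # Accumulate traffic for this site and fail fast above the limit.
        self.used[site] += n_bytes
        if self.used[site] > self.limit:
            raise ProxyBudgetExceeded(
                f"{site}: {self.used[site]} bytes used, limit is {self.limit}"
            )


def block_heavy_resources(route):
    """Playwright route handler that aborts requests for heavy resources."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()
```

In a Playwright script the handler would be registered with `page.route("**/*", block_heavy_resources)`, and `budget.record(site, size)` would be called wherever the scraper learns the size of a response, for example from its body length.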

Web scraping from 0 to hero: Microsoft Playwright

39 implied HN points • 21 Jan 24

🕹 Technology Web scraping Data Extraction

Microsoft Playwright is a browser automation tool used for web scraping that supports multiple browsers and can handle dynamic content and complex user interactions.
Browser automation tools like Playwright are crucial for scraping modern websites with dynamic content, interactive elements, and sophisticated front-end frameworks.
Playwright excels at rendering JavaScript, simulating user interactions, and evading anti-scraping measures, but it may be slower and require more resources than Scrapy.

The Lab #36: Bypassing Cloudflare with anti-detect browsers

39 implied HN points • 11 Jan 24

🕹 Technology Web scraping API

Understanding the importance of device fingerprinting in bypassing Cloudflare bot detection.
Using tools like GoLogin to create customized profiles with different device fingerprints.
Testing different levels of device masking to successfully bypass anti-bot protections.

Web scraping from 0 to hero: creating our first Scrapy spider - Part 2

39 implied HN points • 07 Jan 24

🕹 Technology Web scraping

The web scraping course provided by The Web Scraping Club is always free and offers practical articles on complex topics.
Scrapy makes it easy to crawl product category pages by issuing one request for each URL it retrieves.
Using JSON is preferred for extracting and cleaning data in web scraping, offering more stability compared to manual selectors.
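One way to see why JSON can be more stable than hand-written selectors is to read structured data embedded in the page itself. The sketch below assumes the target page ships its product data in a `<script type="application/ld+json">` tag, which not every site does. `JsonLdExtractor` and `extract_json_ld` are hypothetical names, and only the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every JSON-LD script tag in a page."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content arrives here as raw text.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


def extract_json_ld(html):
    """Return a list with one parsed object per JSON-LD block found."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Inside a Scrapy callback this could be called as `extract_json_ld(response.text)`. The fields are then read by key, for example `item["offers"]["price"]`, so a change to the page's CSS classes does not break the extraction.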

The latest papers about browser fingerprinting

19 implied HN points • 11 Feb 24

🕹 Technology Cybersecurity Data Privacy

Browser fingerprinting is used as an alternative to cookies and raises privacy concerns due to its unique identification capabilities.
Desktop devices are easier to fingerprint uniquely than mobile devices, with Chrome exposing more detailed configuration data.
Innovative approaches like using WebGPU for web fingerprinting pose privacy risks and may require countermeasures to prevent misuse.

Web Scraping from 0 to hero: our first scraper with Microsoft Playwright

0 implied HN points • 04 Feb 24

🕹 Technology Web scraping Python Web Development Data Extraction

The web scraping course provided by The Web Scraping Club is always free; subscribing to a paid plan is appreciated but optional.
The choice between using Scrapy and Playwright for web scraping depends on factors like anti-bot protection and content loading.
Setting up the environment for building a web scraper with Playwright involves installing Python, Playwright, and browser binaries.

The Lab #38: Bypassing Kasada for web scraping 2024 edition

0 implied HN points • 25 Jan 24

🕹 Technology Web scraping

Kasada is an anti-bot solution that collects data points from browsers and uses AI to determine if a request is legitimate or not.
Scrapers without a JavaScript rendering engine, such as those built with Scrapy, are blocked by Kasada's challenges.
One way to bypass Kasada for web scraping is to use tools like undetected-chromedriver, a patched Selenium ChromeDriver that avoids triggering anti-bot services.