Most websites worth scraping have elaborate anti-bot countermeasures in place. These exist to protect the site from malicious scripts, but they can impede benign data collection too. When planning your web scraping project, you need to consider how to outsmart anti-scraping techniques to get the data you want.
A premium prebuilt web scraping API or an advanced residential proxy service will handle all of the following anti-bot countermeasures for you. However, when running a scraping tool or spider of your own, you have to set up these configurations yourself.
Circumventing Anti-Bot Countermeasures
If you’re programming a scraper yourself, a lot of the advanced libraries for web scraping include functions for most, if not all, of the following configurations. For example, Python’s Scrapy and Selenium make it just a matter of calling the right function and assigning the appropriate values.
In essence, all of these settings serve one main purpose: masking the fact that your bot is, well, a bot. First and foremost, you need to make sure that it isn’t blatantly obvious that it’s a bot!
Use A Rotating Residential Proxy
A rapid series of requests from a single IP is the most obvious red flag of all. By using a rotating proxy, you’ll cycle through IP addresses, making it seem like all of that traffic is coming from multiple sources.
However, datacenter proxies can be easily identified as such. Any site that has strong countermeasures in place will block the entire subnet range once it detects a datacenter proxy, assuming it doesn’t already have it blocked in advance.
Similarly, free proxies have likely been used for botting before, so the IP addresses in their pools are probably already blacklisted. On top of that, publicly open free proxies carry inherent security hazards of their own.
That’s why your best bet is to use a paid rotating residential proxy service. By using residential IPs, it will look like actual people making the requests.
When a persistent login is needed to access your target site, you can use a sticky session configured for an appropriate timeframe. However, when that isn’t necessary, I advise sticking to rotating on every request. That way you reduce the chances of several requests coming from the same IP in a suspiciously short timeframe.
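As a rough illustration of rotating on every request, here is a minimal sketch using only Python’s standard library. The proxy addresses and credentials are placeholders, and the helper names (next_proxy, opener_for, fetch) are made up for this example. Many paid services instead give you a single gateway address that rotates IPs for you, in which case the pool would hold just that one entry.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; replace them with the addresses your provider gives you.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

_proxy_cycle = itertools.cycle(PROXY_POOL)


def next_proxy():
    """Return the next proxy URL, wrapping around at the end of the pool."""
    return next(_proxy_cycle)


def opener_for(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Fetch url through the next proxy, so consecutive calls use different IPs."""
    opener = opener_for(next_proxy())
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

For a sticky session, you would create one opener with opener_for and reuse it for every request that belongs to the same login, rather than calling next_proxy each time.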
Use Appropriate Geo-Locations
Keep in mind that IP addresses have locational data attached to them. Depending on what you are scraping, certain locations may alter what information is available, if not prevent access entirely.
For example, if you were scraping something off of a Facebook page, you shouldn’t use an IP address coming from China, where Facebook is blocked.
Your proxy provider should advertise what countries their residential IPs are located in and provide the means of selecting what locations to pull from.
Replicate Human Behavior
Take a moment to think about how you normally surf the web.
How often do you go to the exact URL for the subpage you’re looking for, rather than following a link from either the root site, an external site or a search engine?
Do you navigate the site sequentially with exactly the same time delay between clicks?
When you use Google, do you use twenty search operators to refine the results?
These are just a few of the telltale signs of bot activity. To offset them, be sure to:
- Configure appropriate referral sources. This could be direct, search engine, email, social media, or unknown.
- Send requests asynchronously. This should already be implemented in most scraping libraries and frameworks by default, but be sure to double-check.
- Use a randomized time delay between requests.
- Set a rate limit so you neither overload the host server nor cause a suspicious traffic spike that prompts the site to start sending out CAPTCHAs.
- Don’t use too many search operators.
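Two of the points above, the referral source and the randomized delay, can be sketched in a few lines of Python. This is only an illustration: the referrer list, the 2 to 6 second delay window, and the function names are assumptions, and the loop is sequential for simplicity rather than asynchronous. The fetch argument stands in for whatever function your scraper uses to make a request.

```python
import random
import time

# Illustrative referrers; pick ones that make sense for your target site.
REFERERS = [
    "https://www.google.com/",
    "https://www.bing.com/",
    "https://duckduckgo.com/",
]


def random_delay(min_seconds=2.0, max_seconds=6.0):
    """Pick a delay uniformly between the two bounds (this also caps the request rate)."""
    return random.uniform(min_seconds, max_seconds)


def request_headers():
    """Return headers with a randomly chosen Referer."""
    return {"Referer": random.choice(REFERERS)}


def crawl(urls, fetch, sleep=time.sleep):
    """Call fetch(url, headers) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(random_delay())
        results.append(fetch(url, request_headers()))
    return results
```

Because every pause is at least min_seconds long, the delay doubles as a crude rate limit. Passing sleep as a parameter just makes the pauses easy to skip when you test the function.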
Hide Your Digital Fingerprint
Sophisticated anti-bot measures will look at more than just the IP address. A digital fingerprint is made up of many small details that, taken together, make your particular source unique.
Some of the things that form a digital fingerprint are:
- Which browser is being used and its version number.
- The browser’s settings and addons.
- The make and model of the source’s device.
- The OS being used.
- Which fonts are installed.
The best way to obfuscate this information is by changing the user agent between requests, at the same time as you cycle the IP address.
In addition to changing user agents, headless browsers help in this department as well. Using headless browsers also reduces the resources needed on your end to run the bot while simultaneously improving speed.
So, in short, to hide your bot’s fingerprint you should:
- Change user agents with each request, but ensure that none of them are for outdated browsers. Because most browsers will force updates, the ID from a really outdated one is only going to come from an improperly configured bot.
- Use headless browsers for your scraper for optimization and dealing with sites that use AJAX.
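The first point could look something like the sketch below, which picks a random user agent per request and filters out outdated ones. The user agent strings and the minimum version numbers are placeholders chosen for illustration, so refresh them against current browser releases before relying on them. The version check only understands Chrome and Firefox strings.

```python
import random
import re

# Assumed thresholds; raise them as new browser versions are released.
MIN_MAJOR_VERSION = {"Chrome": 120, "Firefox": 120}

# Illustrative user agent strings; the last one is deliberately outdated.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
    "Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36",
]


def is_recent(user_agent):
    """True if the string names a Chrome or Firefox version at or above the threshold."""
    for browser, minimum in MIN_MAJOR_VERSION.items():
        match = re.search(rf"{browser}/(\d+)", user_agent)
        if match:
            return int(match.group(1)) >= minimum
    return False


RECENT_USER_AGENTS = [ua for ua in USER_AGENTS if is_recent(ua)]


def random_headers():
    """Return request headers carrying a randomly chosen, non-outdated user agent."""
    return {"User-Agent": random.choice(RECENT_USER_AGENTS)}
```

Call random_headers() once per request and pass the result to whatever HTTP client you use.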
Watch Out For Honeypots
A honeypot is one of the anti-bot measures that a website may use. This is a decoy system that will provide intentionally incorrect information once triggered, as well as track information on the source that accessed it.
On a website, honeypots usually take the form of HTML links that are hidden from regular users. A spider crawling through the page’s markup, however, will still stumble on them. Because a human visitor can’t normally reach these links, anything that navigates to a honeypot is easily assumed to be a bot.
Some of the things your bot can look for to avoid falling for honeypots are:
- CSS attributes of visibility:hidden or display:none in the link styling.
- Link text color set identical to its background.
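As a minimal sketch of the first check, the snippet below uses Python’s built-in html.parser to collect only the links whose inline style does not hide them. It is deliberately limited: it looks at the style attribute on the link itself, so it will miss links hidden through an external stylesheet, a hidden parent element, or text colored to match the background. Catching those generally means inspecting computed styles in a real browser. The class and function names are made up for this example.

```python
from html.parser import HTMLParser

# Inline style fragments that hide an element, written without whitespace.
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


def is_hidden(style):
    """True if an inline style string hides the element."""
    normalized = "".join(style.lower().split())
    return any(marker in normalized for marker in HIDDEN_MARKERS)


class VisibleLinkParser(HTMLParser):
    """Collect href values from <a> tags whose inline style does not hide them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        style = attributes.get("style") or ""
        if href and not is_hidden(style):
            self.links.append(href)


def visible_links(html):
    """Return the hrefs of all links in html that are not hidden by inline CSS."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```

Your spider would then only queue the URLs returned by visible_links instead of every href on the page.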
As long as you are only scraping publicly available information, web scraping is generally considered legal. However, please always do so respectfully. Sending hundreds of requests to a single site within a few seconds can overwhelm the host server and negatively impact its services.
Similarly, when crawling with a spider, you should respect the site’s robots.txt. It’s a plaintext file saved in the site’s root directory that indicates which file paths the site owner allows spiders to access.
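Python’s standard library can do this check for you through urllib.robotparser. The sketch below parses a small example file from a string so it runs without a network connection. The bot name, the rules, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyScraper"  # hypothetical bot name

# A made-up robots.txt that blocks every crawler from /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(EXAMPLE_ROBOTS_TXT)

# Check each URL before adding it to the crawl queue.
allowed = rules.can_fetch(USER_AGENT, "https://example.com/public/page")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/data")
```

Against a live site, you would instead call set_url() with the address of the site’s robots.txt and then read() to download and parse it.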
Now that you know how to outsmart anti-scraping techniques, all that’s left to do is grab one of the best proxy services for web scraping and start collecting your data right away.