The majority of websites you’d consider scraping for information have elaborate anti-bot countermeasures in place. While this is to protect them from malicious scripts, it can impede benign data collection too. When planning your web scraping project, you need to consider how to outsmart anti-scraping techniques to get the data you want.
A premium prebuilt web scraping API or an advanced residential proxy service will handle all of the following countermeasures for you. However, when running a scraping tool or spider of your own, you have to set up these configurations yourself.
If you’re programming a scraper yourself, a lot of the advanced libraries for web scraping include functions for most, if not all, of the following configurations. For example, Python’s Scrapy and Selenium make it just a matter of calling the right function and assigning the appropriate values.
In essence, all of these settings serve one main purpose: masking the fact that your bot is, well, a bot. First and foremost, you need to make sure it isn't blatantly obvious.
A rapid series of requests from a single IP is the most obvious red flag of all. By using a rotating proxy, you'll cycle through IP addresses, making it seem like the traffic is coming from multiple sources. This can get you safely past many anti-scraping techniques.
However, datacenter proxies can be easily identified as such. Any site that has strong countermeasures in place will block the entire subnet range once it detects a datacenter proxy, assuming it doesn’t already have it blocked in advance.
Similarly, free proxies have likely been used for botting activities before, so the IP addresses in their pools are probably already blacklisted. Then there are the inherent hazards of using publicly available, free proxies on top of that.
That’s why your best bet is to use a paid rotating residential proxy service. By using residential IPs, it will look like actual people are making the requests.
When a persistent login is needed to access your target site, you can use a sticky session configured for an appropriate timeframe. However, when that isn’t necessary, I advise sticking to rotating on every request. That way, you reduce the chances of several requests coming from the same IP in a suspiciously short timeframe.
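Under the hood, rotating on every request can be as simple as cycling through your provider's endpoints. Below is a minimal, stdlib-only Python sketch; the proxy URLs are placeholder assumptions, and a real residential proxy service will give you its own gateway addresses (often a single endpoint that rotates IPs for you):

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints -- substitute your provider's gateway URLs.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

# Cycle through the pool so each request goes out through a different IP.
proxy_cycle = itertools.cycle(PROXY_POOL)

def fetch(url: str, timeout: float = 10.0):
    """Fetch a URL through the next proxy in the rotation."""
    proxy = next(proxy_cycle)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(url, timeout=timeout)
```

A sticky session, by contrast, would simply reuse the same proxy object for every request until the session's timeframe expires.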
Keep in mind that IP addresses have locational data attached to them. Depending on what you are scraping, certain locations may alter what information is available, if not prevent access entirely.
For example, if you were scraping something off of a Facebook page, you shouldn't use an IP address located in China: Facebook is blocked there, so that traffic would look suspicious at best.
Your proxy provider should advertise what countries their residential IPs are located in and provide the means of selecting which locations to pull from.
Take a moment to think about how you normally surf the web.
How often do you go to the exact URL for the subpage you’re looking for, rather than following a link from either the root site, an external site, or a search engine?
Do you navigate the site sequentially with exactly the same time delay between clicks?
When you use Google, do you use twenty search operators to refine the results?
These are just a few of the telltale signs of bot activity. To offset them, be sure to:

- Set a plausible referer header, so your requests look like they followed a link from a search engine or the site itself.
- Randomize the delay between requests instead of clicking through at a fixed interval.
- Vary your navigation path rather than walking the site in strict sequential order.
- Keep search queries simple; don't stack a dozen operators onto every Google search.
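The randomized delay, for instance, is a one-liner with `random.uniform`. A small sketch (the bounds here are arbitrary illustrative values, not recommendations):

```python
import random
import time

def human_delay(min_s: float = 2.0, max_s: float = 7.0) -> float:
    """Sleep for a random interval so requests aren't evenly spaced."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Calling `human_delay()` between requests replaces the suspiciously metronomic timing of a naive loop.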
Sophisticated anti-scraping techniques look at more than just the IP address. A digital fingerprint comprises a bunch of little details that make your particular source unique.
Some of the things that form a digital fingerprint are:

- The user agent string (browser, browser version, and operating system)
- The request headers and their order
- Screen resolution and window size
- Installed fonts and plugins
- Language and timezone settings
The best way to obfuscate this information is by changing the user agent between requests, alongside cycling the IP address.
In addition to changing user agents, headless browsers help in this department as well. Using headless browsers also reduces the resources needed on your end to run the bot while simultaneously improving speed.
So, in short, to hide your bot's fingerprint you should:

- Rotate user agents (and the rest of your request headers) along with your IP addresses.
- Run your scraper through a headless browser.
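User agent rotation is straightforward to sketch. The strings below are real-world examples of common browser user agents, but in practice you'd maintain a larger, regularly updated pool:

```python
import random

# Illustrative user-agent strings for common desktop browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers() -> dict:
    """Build request headers with a randomly chosen user agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
```

Pass the result of `random_headers()` along with each request so the user agent changes whenever the IP does.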
A honeypot is one of the anti-scraping techniques that a website may use. This is a decoy system that will provide intentionally incorrect information once triggered, as well as track information on the source that accessed it.
Honeypots are typically HTML links that aren't visible or reachable for regular users. However, a spider crawling around will stumble on those hidden parts of the site. Since they're normally inaccessible, anything that navigates to a honeypot can safely be assumed to be a bot.
Some of the things your bot can look for to avoid falling for honeypots are:

- Links styled to be invisible (display: none, visibility: hidden, or zero width/height)
- Links colored to blend into the background
- Links that the site's robots.txt explicitly disallows
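Checking inline styles is the simplest of these heuristics. Here's a minimal sketch using only Python's standard library; note that real honeypots may also hide links via CSS classes or external stylesheets, which this won't catch:

```python
from html.parser import HTMLParser

class HoneypotLinkFinder(HTMLParser):
    """Collect links whose inline style suggests they're hidden from users."""

    HIDDEN_MARKERS = ("display:none", "visibility:hidden")

    def __init__(self):
        super().__init__()
        self.suspect_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = attrs.get("style", "").replace(" ", "").lower()
        if any(marker in style for marker in self.HIDDEN_MARKERS):
            self.suspect_links.append(attrs.get("href"))

finder = HoneypotLinkFinder()
finder.feed('<a href="/real">ok</a><a href="/trap" style="display: none">x</a>')
# finder.suspect_links now holds the hidden link(s) to skip.
```

Your crawler would simply exclude anything in `suspect_links` from its queue.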
As long as you are only scraping publicly available information, web scraping is completely legal. However, please always do so respectfully. Sending hundreds of requests to a single site within a few seconds can overwhelm the host server and negatively impact its services.
Similarly, when crawling with a spider, you should respect the site's robots.txt. It's a plaintext file saved in the site's root directory that indicates which file paths the site owner permits crawlers to access.
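Python's standard library can read robots.txt for you via `urllib.robotparser`. In this sketch the rules are fed in as literal lines for illustration; normally you'd point the parser at the live file with `set_url()` and `read()`:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules; in practice, call
# rp.set_url("https://example.com/robots.txt") and rp.read() instead.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check each URL before your spider requests it.
allowed = rp.can_fetch("MyScraper", "https://example.com/public/page")
blocked = rp.can_fetch("MyScraper", "https://example.com/private/data")
```

Gating every request on `can_fetch()` keeps your spider inside the paths the site has made available to crawlers.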