Web scraping is the process of extracting data that is available on the web using a series of automated requests generated by a program.
It is known by a variety of terms, such as screen scraping, web harvesting, and web data extraction. Indexing or crawling by a search engine bot is similar to web scraping. A crawler goes through your information in order to index or rank your website against others, whereas during scraping the data is extracted to replicate it elsewhere or for further analysis.
A well-behaved crawler also follows the instructions that you list in your robots.txt file, whereas a scraper may totally disregard those instructions.
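To make the difference concrete, here is a minimal sketch of the check a well-behaved crawler performs before requesting a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the bot name, and the `is_allowed` helper are assumptions made up for illustration; a scraper simply skips this step.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rules above, `is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data")` is refused, while a URL outside `/private/` is permitted.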
During the process of web scraping, an attacker is looking to extract data from your website. This can range from live scores, weather information, and prices to whole articles. The usual way to extract this data is to send periodic HTTP requests to your server, which in turn sends the web page to the program.
The attacker then parses this HTML and extracts the required data. This process is then repeated for hundreds or thousands of different pages that contain the required data. An attacker might use a specially written program targeting your website or a general-purpose tool that helps scrape a series of pages.
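To understand what you are defending against, it helps to see how little code the parsing step needs. The sketch below uses Python's standard `html.parser` module on a made-up page fragment; the `price` class name, the sample markup, and the helper names are assumptions for illustration, not taken from any real site.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collect the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        # Any closing tag ends the current price element in this simple sketch.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical markup standing in for a page returned by the server.
SAMPLE_PAGE = (
    '<ul><li><span class="price">$10</span></li>'
    '<li><span class="price">$25</span></li></ul>'
)


def extract_prices(html: str) -> list[str]:
    parser = PriceExtractor()
    parser.feed(html)
    return parser.prices
```

Because the extractor depends on a fixed class name, it also hints at why changing your markup can disrupt a scraper.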
Technically, this process may not be illegal, as an attacker is just extracting information that is available to them through a browser, unless the webmaster specifically forbids it in the terms and conditions of the website. This is a gray area, where ethics and morality come into play.
As a webmaster, you should, therefore, be equipped to prevent attackers from getting your data easily. Uncontrolled scraping in the form of an overwhelming number of requests at a time may also lead to a denial of service (DoS) situation, where your server and all services hosted on it become unresponsive.
The companies most often targeted by scrapers are digital publishers (blogs, news sites), e-commerce websites (for prices), directories, classifieds, and airlines and travel sites (for information). Scraping is bad for you as it can lead to a loss of competitive advantage and, therefore, a loss of revenue. In the worst case, your content may be duplicated elsewhere, costing the original source its credibility. From a technological point of view, scraping may put excess pressure on your server, slowing it down and eventually inflating your bills too!
Since we have established that it is good to forbid web scrapers from accessing your website, let us discuss a few ways through which you can take a strong stand against potential attackers. Before we proceed, you must know that anything that is visible on the screen can be scraped and there is no absolute protection. You can, however, make a web scraper's life harder.
Take a Legal Stand
The easiest way to avoid scraping is to take a legal stand, whereby you mention clearly in your terms of service that web scraping is not allowed. For instance, Medium’s terms of service contain the following line:
Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.
You can even sue potential scrapers if you have forbidden scraping in your terms of service. For instance, LinkedIn sued a set of unnamed scrapers last year, saying that extracting user data through automated requests amounts to hacking.
Prevent denial of service (DoS) attacks
Even if you have put up a legal notice prohibiting scraping of your services, a potential attacker may still go ahead with it, causing a denial of service at your servers and disrupting your daily services. You need to be able to defend against such situations.
You can identify suspicious IP addresses and block their requests from reaching your service by filtering them at your firewall. Although this is a manual process, modern cloud service providers give you access to tools that block potential attacks. For instance, if you are hosting your services on Amazon Web Services, AWS Shield would help protect your server from potential attacks.
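The identification step can be partly automated. As a minimal sketch, the function below counts requests per client in an access log and returns the addresses that exceed a threshold. It assumes that each log line starts with the client IP address, as in the common Apache and Nginx log formats; the function name and the threshold are made up for illustration.

```python
from collections import Counter


def find_heavy_hitters(log_lines, threshold):
    """Return the set of IPs that appear in more than `threshold` log lines.

    Assumes the first whitespace-separated field of each line is the client IP.
    """
    counts = Counter(
        line.split(maxsplit=1)[0] for line in log_lines if line.strip()
    )
    return {ip for ip, hits in counts.items() if hits > threshold}
```

The result is only a list of candidates. A busy proxy or a search engine crawler can also generate many requests, so review the addresses before blocking them.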
Use Cross Site Request Forgery (CSRF) tokens
By using CSRF tokens in your application, you'll prevent automated tools from making arbitrary requests to guest URLs. A CSRF token may be present as a session variable or as a hidden form field. To get around a CSRF token, one needs to load and parse the markup and search for the right token before bundling it together with the request. This process requires either programming skills or access to professional tools.
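Most web frameworks ship their own CSRF protection, and you should prefer that where it exists. Purely as an illustration of the mechanism, here is a minimal sketch that stores a random token in the session and compares it with the value submitted in the hidden form field. The `session` dictionary and both function names are assumptions standing in for whatever your framework provides.

```python
import hmac
import secrets


def issue_csrf_token(session: dict) -> str:
    """Create a random token, remember it in the session, and return it
    so it can be embedded in a hidden form field."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_valid_csrf_token(session: dict, submitted: str) -> bool:
    """Check the submitted token against the one stored in the session."""
    expected = session.get("csrf_token")
    if not expected or not submitted:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), submitted.encode())
```

A request that was not preceded by loading the form has no matching token and is rejected, which is the extra step the scraper has to reproduce.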
.htaccess to prevent scraping
.htaccess is a configuration file for your Apache web server, and it can be tweaked to prevent scrapers from accessing your data. The first step is to identify scrapers, which can be done through Google Webmasters or Feedburner. Once you have identified them, you can use many techniques to stop the process of scraping by changing the configuration file.
In general, .htaccess files are not enabled by default on Apache. Once you enable them, Apache interprets the .htaccess files that you place in your directories.
.htaccess files can only be used with Apache, but we also provide equivalents for Nginx and IIS in our examples. A detailed guide on converting rewrite rules for Nginx can be found in the Nginx documentation.
Prevent hotlinking
When your content is scraped, inline links to images and other files are copied directly to the attacker’s site. When the same content is displayed on the attacker’s site, such a resource (image or another file) directly links to your website. This process of displaying a resource that is hosted on your server on a different website is called hotlinking.
When you prevent hotlinking, such an image is not served by your server when it is displayed on a different site. As a result, scraped content cannot display resources hosted on your server.
In Nginx, hotlinking can be prevented by using a location directive in the appropriate configuration file (nginx.conf). In IIS, you need to install URL Rewrite and edit the configuration.
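Whichever server you use, the underlying check is the same: look at the Referer header of a request for an image or file and serve it only if the request comes from your own pages. The sketch below expresses that logic in Python rather than in server configuration. The host names and the function name are placeholders, and allowing requests without a Referer is a design choice you may want to revisit.

```python
from urllib.parse import urlparse

# Hypothetical host names; replace them with the domains you actually serve.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def should_serve_asset(referer, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Decide whether an image or file request should be served."""
    if not referer:
        # Direct visits and privacy tools send no Referer, so allow them here.
        return True
    host = urlparse(referer).hostname
    return host in allowed_hosts
```

Keep in mind that the Referer header is supplied by the client and can be forged, so this stops casual hotlinking rather than a determined attacker.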
Deny or Allow specific IP addresses
If you have identified the IP addresses or patterns of IP addresses that are being used for scraping, you can simply block them through your .htaccess file. You may also selectively allow requests from specific IP addresses that you have allowlisted.
In Nginx, you can use the ngx_http_access_module to selectively allow or deny requests from an IP address. Similarly, in IIS, you can restrict the IP addresses that can access your services by adding a Role in the Server Manager.
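If you prefer to enforce the same rule inside your application, the allow and deny logic can be sketched with Python's standard `ipaddress` module. The networks below come from ranges reserved for documentation and are placeholders, as is the function name. In this sketch an allowlisted address wins over a denied range.

```python
import ipaddress

# Placeholder ranges from the blocks reserved for documentation.
DENIED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.10/32")]


def is_request_allowed(client_ip: str) -> bool:
    """Allow a request unless its IP is in a denied range and not allowlisted."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in ALLOWED_NETWORKS):
        return True
    return not any(address in network for network in DENIED_NETWORKS)
```

Here `203.0.113.10` is served even though the rest of `203.0.113.0/24` is blocked, and addresses outside both lists are served as usual.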
Alternatively, you may limit the number of requests from one IP address, although this may not be useful if an attacker has access to multiple IP addresses. A captcha may also be used in case of abnormal requests from an IP address.
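One common way to limit requests per IP address is a sliding window: remember the timestamps of recent requests from each address and refuse new ones once the window is full. The class below is a minimal in-memory sketch of that idea. The class name and parameters are assumptions, and a real deployment would usually rely on the rate limiting built into the web server or a shared store.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request times

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In a request handler you would call something like `limiter.allow(client_ip, time.monotonic())` and respond with an error or a captcha when it returns `False`.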
You may also want to block access from known cloud hosting and scraping service IP addresses to make sure an attacker is unable to use such a service to scrape your data.
Use honeypots
A “honeypot” is a link to fake content that is invisible to a normal user but present in the HTML, so a program parsing the page will find and follow it. By leading a scraper into such honeypots, you can detect scrapers and make them waste resources by visiting pages that contain no data.
Do not forget to disallow such links in your robots.txt file to make sure a search engine crawler does not end up in such honeypots.
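On the server side, a honeypot only needs a little bookkeeping: any client that requests a honeypot URL gets flagged, and later requests from that client can be blocked or slowed down. Here is a minimal sketch, in which the honeypot path and the class name are made up for illustration.

```python
# Hypothetical honeypot URL; it should also be disallowed in robots.txt.
HONEYPOT_PATHS = {"/special-offers-archive"}


class HoneypotTracker:
    """Flag clients that request a honeypot path."""

    def __init__(self, honeypot_paths):
        self.honeypot_paths = set(honeypot_paths)
        self.flagged_ips = set()

    def record(self, client_ip: str, path: str) -> bool:
        """Record a request and return True if the client is now flagged."""
        if path in self.honeypot_paths:
            self.flagged_ips.add(client_ip)
        return client_ip in self.flagged_ips
```

Once an address is flagged, every later call to `record` for it returns `True`, so the same check can guard your real pages.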
Change DOM structure frequently
Most scrapers parse the HTML that is retrieved from the server. To make it difficult for scrapers to access the required data, you can frequently change the structure of the HTML. Doing so forces an attacker to evaluate the structure of your website again in order to extract the required data.
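One lightweight way to change the markup regularly is to derive class names from a value that changes with every deployment. The sketch below assumes that your templates and your stylesheet are both generated through the same helper. The function names and the idea of a per-deployment salt are assumptions for illustration, not a standard technique from any particular framework.

```python
import hashlib
import html


def obfuscated_class(logical_name: str, deploy_salt: str) -> str:
    """Map a stable logical name to a class name that changes with the salt."""
    digest = hashlib.sha256(f"{deploy_salt}:{logical_name}".encode()).hexdigest()
    # Prefix with a letter so the result is always a valid CSS class name.
    return "c" + digest[:8]


def render_price(price: str, deploy_salt: str) -> str:
    """Render a price element using the class name for this deployment."""
    class_name = obfuscated_class("price", deploy_salt)
    return f'<span class="{class_name}">{html.escape(price)}</span>'
```

A scraper that looks for a fixed class name stops matching after the next deployment, although one that locates data by position or by text patterns is not affected.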
Provide APIs
As Medium’s terms of service show, you can selectively allow data extraction from your site by setting certain rules. One way is to create subscription-based APIs that give access to your data. Through APIs, you would also be able to monitor and restrict usage of the service.
Report attacker to search engines and ISPs
If all else fails, you may report a web scraper to a search engine so that they delist the scraped content, or to the ISPs of the scrapers to make sure they block such requests.
A fight between a webmaster and a scraper is a continuous one and each must stay vigilant to make sure they remain a step ahead of the other.
All the solutions provided in this article can be bypassed by someone with a lot of tenacity and resources (as anything that is visible can be scraped), but it's a good idea to remain careful and keep monitoring traffic to make sure that your services are being used the way you intended.