I recently tried to scrape a website for assets so I could recreate the site as a personal coding exercise. I used wget, flagging particular file suffixes such as jpeg, gif, and png to download the images I needed for the project. However, all I got back was a single file: robots.txt.tmp.
The Robots Exclusion Protocol
So what is this robots.txt file? It allows website owners to dictate which areas of the site web robots are allowed to visit, a concept called the robots exclusion protocol or robots exclusion standard. In practice, this is how it works: a web bot, typically a web crawler systematically browsing the web for indexing purposes, wants to visit a URL. Before it can visit http://mysite.com/home, it first checks in at http://mysite.com/robots.txt to see what pages on this site it has access to, if any. This is a publicly available file — if you take any domain and add the /robots.txt suffix to it you can view the permissions that the website owner is giving to web bots.
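The lookup a well-behaved crawler performs before visiting a page can be sketched in a few lines of Python. This is only an illustration: robots_url_for is a hypothetical helper name, and it simply swaps the path of a page URL for /robots.txt using the standard urllib.parse module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("http://mysite.com/home"))  # http://mysite.com/robots.txt
```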
Now, what does this file contain? A robots.txt file has two primary properties: 1) which robot or robots are being targeted, and 2) which pages they are blocked, or disallowed, from visiting. So you might see something like the following:
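A minimal sketch of such a file, with /home, /about and /contact as assumed example paths rather than rules from any real site, is embedded below as a Python string and checked with the standard urllib.robotparser module. ExampleBot is a made-up bot name.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules: every robot is blocked from three pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /home
Disallow: /about
Disallow: /contact
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow the visit.
home_allowed = parser.can_fetch("ExampleBot", "http://mysite.com/home")
blog_allowed = parser.can_fetch("ExampleBot", "http://mysite.com/blog")
print(home_allowed, blog_allowed)  # False True
```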
This gives us a user agent (the robot we’re targeting) of *, which means ALL robots. Next are the disallow instructions, which tell the targeted bot which pages it is blocked from visiting. You can’t string these instructions together like Disallow: /home /about /contact; each page must get its own separate line. However, if you want to block access to all pages on your site, you can use Disallow: / to accomplish this.
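Both points can be checked with Python's urllib.robotparser. The sketch below uses a hypothetical helper, blocked, and made-up rules. The result for the space-separated line reflects what Python's parser does (it reads the whole value as one literal path); other crawlers may treat such a line differently.

```python
from urllib.robotparser import RobotFileParser


def blocked(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if robots_txt disallows agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# A single "Disallow: /" blocks every page on the site.
block_all = "User-agent: *\nDisallow: /\n"
print(blocked(block_all, "http://mysite.com/about"))  # True

# Several paths on one line are read as a single odd path, so /about stays open.
one_line = "User-agent: *\nDisallow: /home /about /contact\n"
print(blocked(one_line, "http://mysite.com/about"))  # False
```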
There are plenty of reasons to want to allow bots access to your site, one of which is that web crawlers can help improve your SEO and make it easier for users to find your site. So, you might do something like this to allow a specific, desired bot in:
You can include as many User-agent entries as you’d like in your robots.txt file, so if you wanted to prevent all bots except for AdsBot-Google from visiting your site’s pages, you could add the following to the previous example:
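As a sketch of that combined file, the rules below let AdsBot-Google in and shut everyone else out. The exact lines are an assumption (an empty Disallow is one conventional way to say "nothing is blocked"), and SomeOtherBot is a made-up name. Python's urllib.robotparser is used to check the result.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: AdsBot-Google may visit everything, all other bots nothing.
ROBOTS_TXT = """\
User-agent: AdsBot-Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

ads_ok = parser.can_fetch("AdsBot-Google", "http://mysite.com/home")
other_ok = parser.can_fetch("SomeOtherBot", "http://mysite.com/home")
print(ads_ok, other_ok)  # True False
```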
Note that just because a bot reads the restrictions in the robots.txt file does not mean it will obey them; this is especially true of malicious bots. The file is not impervious to bad actors, but it does help significantly in controlling much of the bot traffic to your site.
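The voluntary nature of the check shows up clearly in code: it lives in the bot, not on the server. In the sketch below, polite_visit and fake_fetch are hypothetical names, the /private rule is made up, and fake_fetch stands in for a real HTTP request so the example needs no network.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def polite_visit(
    parser: RobotFileParser,
    agent: str,
    url: str,
    fetch: Callable[[str], str],
) -> Optional[str]:
    """Fetch url only if robots.txt permits it; skip the page otherwise."""
    if not parser.can_fetch(agent, url):
        return None  # the bot chooses to stay out
    return fetch(url)


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request."""
    return f"<contents of {url}>"


parser = RobotFileParser()
parser.parse("User-agent: *\nDisallow: /private\n".splitlines())

url = "http://mysite.com/private"
print(polite_visit(parser, "ExampleBot", url, fake_fetch))  # None
print(fake_fetch(url))  # a bot that skips the check still gets the page
```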