Introduction to robots.txt

In this article we are going to learn about a very important and very simple file: robots.txt. While solving CTF challenges we often find a robots.txt file, which tells search-engine crawlers which pages of the website they may or may not crawl. In simple words, this file defines the rules for crawling particular webpages (note that it is only a request to well-behaved crawlers, not real access control). The basic syntax of this file is as follows:

User-agent: <bot name>
Disallow: <path>
Allow: <path>


Every search-engine crawler identifies itself with its own user-agent string, which makes writing robots.txt rules easier. Some common ones are:


1. Bing : Bingbot

2. DuckDuckGo : DuckDuckBot

3. Yahoo : Slurp

4. Google : Googlebot

5. Baidu : Baiduspider

You can find more user-agents on the internet, but these are the most commonly used.


Now we should know how to use this file effectively.

A rule like the one below means that every user-agent is allowed to crawl the page:

User-agent: *
Allow: /page

Now suppose you want only Googlebot to be able to crawl that page; then the content of the file will look like this:

User-agent: Googlebot
Allow: /page

User-agent: *
Disallow: /page

By doing this, only Googlebot is permitted to crawl that URL. If you want no crawler to access /admin/, you can change the content like this:

User-agent: *

Disallow: /admin/
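A quick way to check how such rules evaluate is Python's standard-library urllib.robotparser. This is just a sketch — the paths being tested are hypothetical:

```python
import urllib.robotparser

# Parse the rules shown above with the standard-library robots.txt parser.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

# Anything under /admin/ is disallowed for every user-agent...
print(rp.can_fetch("*", "/admin/panel"))
# ...while paths outside it stay allowed by default.
print(rp.can_fetch("*", "/index.html"))
```

The first call prints False and the second True, which is exactly why /admin/ entries in robots.txt are the first thing to look at in a CTF: they point you to paths the site owner did not want indexed.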


Above we came across the word directive: a directive is a rule that the user-agent follows while crawling a website.


Now let's consider an example :


User-agent: *

Allow: /haclabs/blog/Muzzybox

Disallow: /haclabs/blog/


This rule means that every user-agent can access the Muzzybox walkthrough, but they cannot access /haclabs/blog/Tr0ll or /haclabs/blog/openetadmin: the specific Allow carves an exception out of the broader Disallow.
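We can verify this with urllib.robotparser as well. One caveat, noted in the comment: Python's parser applies rules in file order (first match wins), while Googlebot uses the most specific (longest) matching path, so the Allow line is listed first here to get the same result from both:

```python
import urllib.robotparser

# The example rules: a specific Allow carved out of a broader Disallow.
# (Python's parser applies rules in file order, so Allow comes first;
# Googlebot instead picks the most specific, i.e. longest, matching path.)
parser = urllib.robotparser.RobotFileParser()
parser.parse([
    "User-agent: *",
    "Allow: /haclabs/blog/Muzzybox",
    "Disallow: /haclabs/blog/",
])

print(parser.can_fetch("*", "/haclabs/blog/Muzzybox"))   # allowed
print(parser.can_fetch("*", "/haclabs/blog/Tr0ll"))      # disallowed
print(parser.can_fetch("*", "/haclabs/blog/openetadmin"))# disallowed
```

Only the Muzzybox path comes back as fetchable; the sibling walkthroughs are blocked by the Disallow on their parent directory.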


Sometimes the user-agent plays a very important role in CTF challenges. For example, if the rule is something like this:


User-agent: Googlebot

Allow: /secret


Now, if our user-agent is different, the server may refuse to show us the page. By intercepting the request with Burp Suite, or by using curl, we can set the User-Agent header to Googlebot, and after making the request we can see the content of /secret easily. (robots.txt itself enforces nothing; it is the server-side check on the User-Agent header that we are bypassing.)
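The trick can be sketched end-to-end in Python. The server below is a made-up stand-in for a CTF box that serves /secret only to Googlebot, and the flag text is invented for the demo; the interesting part is the spoofed User-Agent header, which is the same thing `curl -A "Googlebot" http://target/secret` or editing the header in Burp Suite would do:

```python
import http.server
import threading
import urllib.request

# Hypothetical CTF-style server: /secret is served only when the
# User-Agent header contains "Googlebot"; everyone else gets a 403.
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if self.path == "/secret" and "Googlebot" in ua:
            body, status = b"flag{spoofed-user-agent}", 200
        else:
            body, status = b"403 Forbidden", 403
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Spoof the User-Agent header on our request.
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_address[1]}/secret",
    headers={"User-Agent": "Googlebot"},
)
answer = urllib.request.urlopen(req).read().decode()
print(answer)  # flag{spoofed-user-agent}
server.shutdown()
```

Without the spoofed header the same request would raise an HTTP 403 error, which is exactly the behaviour you see on these challenges before you change your user-agent.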


To find the robots.txt file you can run directory brute-force tools such as dirb, dirbuster, gobuster and many more, or you can simply check for it manually, since it always sits at the web root:

http://192.168.43.248/robots.txt


User-agent: *

Disallow: /haclabs/

Allow: /haclabs/muzzy

This rule clearly states that /haclabs/ should not be crawled by any user-agent, but every user-agent may access /haclabs/muzzy.


So this was a very small guide to robots.txt; you can find many more ways to configure this file.




The reason I wrote this article is that there is a machine on Vulnhub named Inclusiveness. After reading this article completely you will be able to get a reverse shell; then try harder to gain root access!

If you want any hint regarding the machine inclusiveness then contact me at : yash@haclabs.org



© 2020 by HacLabs.