Google Dorking
Date: 15 January 2021
Author: Dhilip Sanjay S
Click here to go to the TryHackMe room.
Crawlers discover content through various means:
Pure Discovery - a URL is visited by the crawler, and information about the content type of the website is returned to the search engine.
Following URLs found from previously crawled sites.
Answer: index
Answer: Crawling
Answer: Keywords
SEO ranking - search engines will prioritise those domains that are easier to index.
This depends on many factors:
How responsive your website is to the different browser types.
How easy it is to crawl your website.
What kind of keywords your website has.
This text file defines the permissions crawlers have on the website.
It can specify which files and directories we do or don't want to be indexed by the crawler (such as an admin panel).
User-agent - Specify the type of "Crawler" that can index your site (the asterisk being a wildcard, allowing all "User-agents").
Allow - Specify the directories or file(s) that the "Crawler" can index.
Disallow - Specify the directories or file(s) that the "Crawler" cannot index.
Sitemap - Provide a reference to where the sitemap is located (this improves SEO, as previously discussed; we'll come to sitemaps in the next task).
You can use Regex-style patterns to allow/disallow content to be indexed by crawlers.
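For example, a minimal robots.txt might look like the following (the paths and domain are illustrative; note that crawlers support wildcard patterns like `*` and the `$` end-anchor, not full regex):

```
# Allow all crawlers by default
User-agent: *
Allow: /

# Keep the admin panel out of search results
Disallow: /admin/

# Wildcard pattern: block any URL ending in .bak
Disallow: /*.bak$

# Point crawlers at the sitemap (improves SEO)
Sitemap: https://example.com/sitemap.xml
```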
Generally, the sitemap is located at /sitemap.xml
Answer: ablog.com/robots.txt
Answer: /sitemap.xml
Answer: User-Agent: Bingbot
Other bots, like Googlebot and msnbot, won't be allowed to index the site.
Answer: Disallow: /dont-index-me/
Answer: .conf
Sitemaps are indicative resources that are helpful for crawlers - they specify the necessary routes to find content on the domain.
Sitemaps are written in XML format.
They show the route to the nested content.
Sitemaps are favourable for search engines - because all the necessary routes to content are already provided in this file.
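A minimal sitemap.xml sketch is shown below (the URLs are illustrative); it follows the standard sitemaps.org schema:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <url> entry gives a route the crawler can follow directly -->
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-01-15</lastmod>
  </url>
  <!-- Nested content is listed explicitly, so the crawler doesn't have to discover it -->
  <url>
    <loc>https://example.com/blog/nested/post</loc>
  </url>
</urlset>
```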
Answer: XML
Answer: Map
Answer: route
Using Google for advanced searching.
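For instance, the following illustrative queries combine Google's search operators: `site:` restricts results to a domain, `filetype:` filters by file extension, and `intitle:` matches words in the page title:

```
site:github.com "password"
filetype:xls budget
intitle:"index of"
```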
Answer: site:bbc.co.uk flood defences
Answer: filetype:
Answer: intitle:login