Sunday, December 8, 2013

Search engine reconnaissance - obtaining usernames from metadata

In my current role as a Senior Security Engineer I am often referred to as the “Lead Penetration Tester” because I have the joy of attempting to infiltrate websites: bypassing or brute-forcing authentication mechanisms, obtaining usernames and passwords, discovering logic flaws, injecting SQL... essentially identifying and exploiting vulnerabilities. This is called Web Application Penetration Testing, and it's what I do: targeting websites that my company owns, with explicit written permission to do so, while utilizing a four-step methodology: reconnaissance, mapping, discovery and exploitation.

This post covers ground in the area of reconnaissance, as taking the time to conduct proper and thorough reconnaissance can greatly enhance the likelihood of a fruitful penetration test. Specifically, this post discusses leveraging the work that search engines have already performed (spidering a site, caching content and indexing pages in a queryable fashion) for the purpose of obtaining valid usernames for both the in-scope application and the organization/company behind it. Note the distinction between application and organization usernames. Application usernames can be completely different from the usernames of an organization's employees; think of a personal email account (application account) contrasted with the Active Directory domain account (company account) of the system administrator working for the company that provides the email service: two separate accounts, two different ways of logging in.

Since this post centers on search engine reconnaissance, and nearly all internet search engines use a robot to spider the web, let's take a moment to note the difference between a well-behaved web crawler (in this context synonymous with bot, search robot or web spider) and its opposite, without delving into the intricacies of bots and the various policies that govern how they work, as this will come into play later on.

A well-behaved web crawler will reference the rules set forth by the administrator of a website, e.g. the robots.txt file if it exists and meta references in the HTML source code:


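As an illustration, a robots.txt file and the equivalent per-page HTML meta tag might look like the following (the directory names here are hypothetical examples, not taken from any real site):

```
# Hypothetical robots.txt served at the site root
User-agent: *
Disallow: /admin/
Disallow: /internal-docs/
```

```html
<!-- Per-page equivalent placed in the HTML <head> -->
<meta name="robots" content="noindex, nofollow">
```

A well-behaved crawler reads these directives and skips the listed paths; nothing technically prevents a client from requesting them anyway.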
A misbehaving web crawler will disregard the aforementioned wishes of the administrator, instead opting to download and parse everything it can from a given site. This is worth mentioning because of the content contained within a robots.txt file: directories that are at the very least interesting and, at the other end of the spectrum, potentially sensitive. To reiterate what this means: website administrators (or default application settings) are sometimes guilty of specifying, in the robots.txt file, the very files and folders that they DO NOT want to be public knowledge. A well-behaved spider like the “googlebot” will honor this; a misbehaving spider will do the opposite.

Using popular search engines to gather information during the reconnaissance phase of a web application penetration test is par for the course. Leveraging the built-in capabilities (search operators) of a search engine can prove very useful for narrowing down results and homing in on the specific information one is seeking. This information can be anything from determining which sites link to the target domain, to identifying directory browsing and cached SSL content, including usernames and authenticated (logged-in) session attributes.

Identifying valid application users is almost always (in my experience) achievable via search engine reconnaissance or through a username-harvesting vulnerability in the web application. It is very difficult to present a web interface that can conveniently let a valid user know she has failed to log in, has successfully logged in or advanced in a multi-factor authentication (MFA) process, or has provided valid or invalid information in an attempt to reset a password, all in a cumulative fashion, without confirming that the provided username is indeed a valid application user. Identifying valid users of the target web application is useful; identifying valid usernames of the target organization (which may require a different approach) can also prove useful, for social engineering and SQL injection attacks to name two.

Organizational usernames typically follow a naming convention, such as first initial followed by last name (“cjohnson”) or first name dot last name (“calvin.johnson”), and determining this is usually a doable task. Once again relying on the work search engines have already performed, it is time to couple that work with the knowledge that metadata is often found in documents that have been published to the web, and that metadata will sometimes include organizational usernames. A penetration tester can discern not only the naming convention the target organization has established, but also a plethora of usernames, depending on circumstances such as the number of employees the target employs, the web footprint and the availability of web-facing documentation (think PDFs). Take for instance the following search query (using a placeholder domain): site:example.com ext:pdf

The “site” operator tells the search engine to search only the specified domain, whereas the “ext” operator specifies that only PDF documents be returned; if there are any indexed PDFs on that domain, the above query would return them. Try replacing the domain with your own organization's website and... if your organization's website has web-facing PDFs, and the PDFs are allowed to be indexed per the rules set forth by your website administrator, then you may see some results. If your company has web-facing PDFs, or other documents like .doc, .docx, .xls, .xlsx, etcetera, and a policy that does not allow its web content to be searched and indexed, try the same search query (altering the “ext” operator as needed) from a misbehaving search engine, one that does not honor robots.txt or the HTML meta tags, and compare results. Download the results and parse the metadata of the documents, looking for the “author” field for instance... notice any valid usernames?
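Cycling through extensions by hand gets tedious; a minimal sketch of generating one query per document type (the domain and extension list below are illustrative assumptions, not from any real engagement) could look like this:

```python
# Build "site:<domain> ext:<extension>" search queries for a list of
# common document extensions. Domain and list are illustrative only.

DOC_EXTENSIONS = ["pdf", "doc", "docx", "xls", "xlsx"]

def build_queries(domain, extensions=DOC_EXTENSIONS):
    """Return one 'site:<domain> ext:<ext>' query string per extension."""
    return ["site:{} ext:{}".format(domain, ext) for ext in extensions]

for query in build_queries("example.com"):
    print(query)
```

Each printed line can then be pasted into (or submitted to) a search engine of your choosing, well-behaved or otherwise.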

After manually performing the above queries, changing the extension operator, downloading files one by one and parsing metadata, one ponders a method of automation. To that end I wrote a script that takes the target website as input and leverages search engines to determine whether the target has documents (DOCs, PDFs, spreadsheets, etc.) on its website. Discovered documents are downloaded and subsequently have their metadata parsed for potentially interesting information such as usernames.
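As a rough sketch of the metadata-parsing step (not the script itself), the snippet below pulls the /Author entry out of raw PDF bytes with a regular expression. This is a simplification: many real documents store metadata in compressed streams or XMP, where a dedicated library or a tool like exiftool is the practical choice.

```python
import re

# Simple PDFs store metadata in an uncompressed information dictionary,
# e.g. "/Author (cjohnson)". A regex catches that easy case; compressed
# or XMP-encoded metadata will NOT be found this way.
AUTHOR_RE = re.compile(rb"/Author\s*\((.*?)\)")

def extract_author(pdf_bytes):
    """Return the /Author string from raw PDF bytes, or None if absent."""
    match = AUTHOR_RE.search(pdf_bytes)
    if match:
        return match.group(1).decode("latin-1")
    return None

# Example against a fabricated metadata fragment:
sample = b"%PDF-1.4\n1 0 obj\n<< /Author (cjohnson) /Producer (Writer) >>\nendobj"
print(extract_author(sample))  # cjohnson
```

Run over a directory of downloaded documents, results like "cjohnson" suggest a naming convention, while results like "Adobe Photoshop" should be discarded as tool names rather than usernames.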

When performing search engine reconnaissance it is important to vet the results and understand what you are looking at as well as what you are looking for; “Adobe Photoshop”, for example, is not a valid username. Keep in mind that while the most popular search engines usually yield the most and best results, they also typically honor the robots.txt file, which can limit those results. That's all for now, lots more ground to cover...