In my current role as a Senior Security Engineer I am often referred to as “Lead Penetration Tester” because I have the joy of attempting to infiltrate websites: bypassing or brute-forcing authentication mechanisms, obtaining usernames and passwords, discovering logic flaws, injecting SQL... essentially identifying and exploiting vulnerabilities. This is called Web Application Penetration Testing. It's what I do, targeting websites that my company owns, with explicit written permission to do so, using a four-step methodology: reconnaissance, mapping, discovery, and exploitation.
This post covers the reconnaissance phase, as taking the time to conduct proper and thorough reconnaissance can greatly enhance the likelihood of a fruitful penetration test. Specifically, this post discusses leveraging the work that search engines have already performed in spidering a site, caching content, and indexing pages in a queryable fashion, for the purpose of obtaining valid usernames of the in-scope application and organization/company. Note the differentiation between application and organization usernames. Application usernames can be completely different from the usernames of employees of an organization; think of a personal email account (application account) contrasted with the Active Directory Domain account (company account) of the system administrator working for the company that provides the email service: two separate accounts, two different ways of logging in.
Since this post is centered around search engine reconnaissance, and nearly all internet search engines make use of a robot to spider the web, let's take a moment to mention the difference between a well-behaved web crawler (in this context synonymous with bot/search robot/web spider) and a misbehaving one, without delving into the intricacies of bots and the various policies that govern how they work. This distinction will come into play later on.
A well-behaved web crawler will reference the rules set forth by the administrator of a website, e.g. the robots.txt file if it exists and meta references in the HTML source code:
<META NAME="ROBOTS"
CONTENT="NOINDEX, NOFOLLOW">
A misbehaving web crawler will disregard the aforementioned wishes of the administrator, instead opting to download and parse everything that it can from a given site. This is worth mentioning because of the content contained within a robots.txt file: directories that are at least interesting and, on the other end of the spectrum, potentially sensitive. To reiterate what this means: website administrators (or default application settings) are sometimes guilty of specifying, in the robots.txt file, the files and folders that they DO NOT want to be public knowledge. A well-behaved spider like the “googlebot” will honor this; a misbehaving spider will do the opposite.
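As a purely hypothetical illustration, a robots.txt file like the following tells well-behaved crawlers to stay out of exactly the paths a tester would want to look at first (the paths are made up for the example):

User-agent: *
Disallow: /admin/
Disallow: /backups/
Disallow: /internal-docs/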
Using popular search engines to gather information during the reconnaissance phase of a web application penetration test is par for the course. Leveraging the built-in capabilities (search operators) of a search engine can prove very useful for narrowing down results and homing in on the specific information one is seeking. This information can be anything from determining what sites have links to the target domain to identifying directory browsing and cached SSL content, including usernames and authenticated (logged in) session attributes.
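A few examples of the kind of operator-driven queries this refers to (example.com is a placeholder, and operator support varies by search engine):

site:example.com intitle:"index of"    (potential directory browsing)
site:example.com inurl:login           (login and authentication pages)
cache:example.com                      (the engine's cached copy of a page)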
Identifying valid application users is almost always (in my experience) achievable via search engine reconnaissance or through a username harvesting vulnerability in the web application. It is very difficult to present a web interface that can conveniently let a valid user know she has failed to log in, has successfully logged in or advanced in the multi-factor authentication (MFA) process, or has provided valid or invalid information in an attempt to reset a password, all in a cumulative fashion, and not confirm that the provided username is indeed a valid application user. Identifying valid users of the target web application is useful; identifying valid usernames of the target organization (which may require a different approach) can also prove useful, for social engineering and SQL injection attacks to name two.
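To make the username harvesting idea concrete, here is a minimal sketch in Python of that kind of response-discrepancy check; the endpoint, parameter names, and error strings are all hypothetical and would differ for any real application:

import requests

# Hypothetical login endpoint and parameter names, purely illustrative.
LOGIN_URL = "https://app.example.com/login"

def probe_username(candidate):
    """Submit a deliberately wrong password and inspect the response.

    If the application words (or sizes) its error differently for an
    unknown user versus a bad password, the response itself confirms
    whether 'candidate' is a valid application username.
    """
    resp = requests.post(LOGIN_URL, data={"username": candidate,
                                          "password": "definitely-wrong"})
    if "unknown user" in resp.text.lower():
        return False
    if "incorrect password" in resp.text.lower():
        return True
    # In practice, fall back to comparing response lengths or status codes.
    return None

for name in ("cjohnson", "calvin.johnson", "notarealuser123"):
    print(name, probe_username(name))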
Organizational usernames typically follow a naming convention, such as first initial followed by last name (“cjohnson”) or first name dot last name (“calvin.johnson”), and determining this is usually a doable task. Once again relying on the work search engines have already performed, it is time to couple that work with the knowledge that metadata is often found in documents that have been published to the web, and that metadata will sometimes include organizational usernames. A penetration tester can discern not only the naming convention that the target organization has established, but also a plethora of usernames, depending on circumstances such as the number of employees the target has, its web footprint, and the availability of web-facing documentation (think PDFs). Take for instance the following search query:
site:example.com ext:pdf
The “site” operator tells the search engine to only search example.com, whereas the “ext” operator specifies to only return results for PDF documents; if there are any PDFs on example.com, the above query would return those results. Try replacing “example.com” with the name of your own organization's website. If your organization's website has web-facing PDFs, and the PDFs are allowed to be indexed per the rules set forth by your website administrator, then you may see some results. If your company has web-facing PDFs, or other documents like .doc, .docx, .xls, .xlsx, etcetera, and a policy that does not allow its web content to be searched and indexed, try the same search query (altering the “ext” operator as needed) from a misbehaving search engine, one that does not honor robots.txt or the HTML meta tags, and compare results. Download the results and parse the metadata of the documents to look for “author”, for instance... note any valid usernames?
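As a rough illustration of that parsing step, the Python snippet below reads the Author field from PDFs that have already been downloaded into a local directory; it assumes the third-party pypdf library is installed, and the directory name is just an example:

import pathlib
from pypdf import PdfReader  # assumes 'pip install pypdf'

# Documents already downloaded from the search results (example path).
for pdf_path in pathlib.Path("downloaded_docs").glob("*.pdf"):
    try:
        info = PdfReader(pdf_path).metadata
    except Exception as exc:
        print(f"{pdf_path.name}: could not parse ({exc})")
        continue
    if info and info.author:
        # Many organizations stamp documents with the creator's username.
        print(f"{pdf_path.name}: author = {info.author}")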
After manually performing the above queries, changing the extension operator, downloading files one by one, and parsing metadata, one ponders a method of automation. To that end I wrote a script that takes the target website as input and proceeds to leverage search engines to determine if the target has documents (DOCs, PDFs, spreadsheets, etc.) on its website. Discovered documents are downloaded and subsequently have their metadata parsed for potentially interesting information such as usernames.
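That script is not reproduced here, but a minimal sketch of its download-and-parse portion, assuming you already have a list of document URLs harvested from the search results and the third-party requests and pypdf libraries, might look something like this:

import io
import requests
from pypdf import PdfReader  # assumes 'pip install requests pypdf'

# URLs gathered from the search-engine results (placeholders for illustration).
document_urls = [
    "https://example.com/reports/annual-report.pdf",
    "https://example.com/docs/whitepaper.pdf",
]

candidate_usernames = set()

for url in document_urls:
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        continue
    try:
        info = PdfReader(io.BytesIO(resp.content)).metadata
    except Exception:
        continue  # skip anything that is not a parseable PDF
    if info and info.author:
        candidate_usernames.add(info.author.strip())

# The collected "author" values often reveal the naming convention,
# e.g. "cjohnson" versus "calvin.johnson".
print(sorted(candidate_usernames))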
When performing search engine reconnaissance it is important to vet the results and understand what you are looking at as well as what you are looking for; “Adobe Photoshop”, for example, is not a valid username. Keep in mind that while the most popular search engines usually yield the most and best results, they also typically honor the robots.txt file, which can limit those results. That's all for now, lots more ground to cover...