The only way to be sure that you are safe is to run your own crawl. You can
manipulate the data any way you like, see all of it at once instead of
guessing which queries to try, and examine the site before it goes "live" on
the Internet.
Start by picking a really good crawler tool. There are many on the net,
ranging from open-source to commercial. Wikipedia has a pretty good list,
and Google also has a decent list ("Downloadable Tools"),
although I tried a couple of these and the output was simply a list of links
found, meant for uploading back to Google, which is the exact opposite
of this exercise! You will definitely need a crawler that saves all of the
response data (not just a list of links), and preferably one that has an easy
way to sift through the data: you are going to have a lot of it, and this
will save you time later.
Look for a bot that submits web form data so that your crawl is as
extensive as possible and, as a bonus, one that does a good job of parsing
JavaScript and other content such as Flash.
Just as a note, Google's crawler does not execute JavaScript or submit form
data but it is always a good idea to stay ahead of the pack. You might
consider throwing a couple different crawlers at your site to cross-reference
their results to ensure the best possible coverage.
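As a rough sketch of that cross-referencing step, the Python below compares
the URL lists exported by two crawlers. The function names and the
normalization rules (lowercase scheme and host, drop the fragment, ignore a
trailing slash) are assumptions made for illustration, not features of any
particular crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Reduce a URL to a comparable form (assumed rules, adjust to taste)."""
    parts = urlsplit(url.strip())
    # Treat "/a/" and "/a" as the same page; keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, keep the query string, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def compare_crawls(urls_a, urls_b):
    """Return the URLs found by both crawlers and by only one of them."""
    a = {normalize(u) for u in urls_a}
    b = {normalize(u) for u in urls_b}
    return {"both": a & b, "only_a": a - b, "only_b": b - a}
```

Anything that lands in "only_a" or "only_b" is a page one crawler missed, and
is worth a manual look before you trust either result.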
After you have your crawler of choice, or while you're waiting for it to
download, gather a bit of information about the website. Start up a browser
and surf the site as thoroughly as possible.
Note whether the application switches
servers or ports (FTP on 21, HTTP on 80, HTTPS on 443), estimate roughly how
many unique pages it has, and identify any areas of the application you will
want to exclude.
If you have a form submission crawler, you may want to omit certain web
forms. Most crawlers expose all of these options, so set them appropriately.
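If your crawler accepts a custom filter, or you want to check its
configuration by hand, the scope rules above could be sketched roughly as
follows. The function name, its parameters, and the example hosts and paths
are hypothetical; real crawlers expose these settings in their own ways.

```python
from urllib.parse import urlsplit

# Default ports for the schemes mentioned above.
DEFAULT_PORTS = {"ftp": 21, "http": 80, "https": 443}


def in_scope(url, allowed_hosts, allowed_ports, excluded_prefixes):
    """Return True if the URL is on an allowed host and port and is not
    under one of the excluded path prefixes."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Fall back to the scheme's default port when none is given explicitly.
    port = parts.port or DEFAULT_PORTS.get(scheme)
    if host not in allowed_hosts or port not in allowed_ports:
        return False
    path = parts.path or "/"
    return not any(path.startswith(prefix) for prefix in excluded_prefixes)
```

For example, with allowed hosts {"www.example.com"}, allowed ports {80, 443},
and excluded prefixes ("/logout", "/admin"), a link to
http://www.example.com:8080/ would be skipped because of its port.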
Do not run the crawler as an authenticated user; i.e. do not use a login
script/macro.
Many sites have a back-end where the user can provide login
credentials and see their account data or preferences, but Google's bot is not
going to be logged in, so neither should you. If you are using a vulnerability
scanner, run it in Crawl-Only mode.
Audit mode will actually attack the
application, but search engine bots do not perform this step, so it is
unnecessary when replicating what a search engine does.
When you're ready, kick off the crawler. Don't be worried as it is simply
visiting all the links as a search engine bot would.
Monitor the bot as much
as possible to make sure that you configured it appropriately. When it
completes, quickly glance through the data and make sure that it found as
many pages as you expected it to find.
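That quick glance can be partly automated. The sketch below assumes the
crawler's results have been exported as a list of records with "url" and
"status" fields; that record layout, the function name, and the 20 percent
tolerance are all assumptions for illustration, so adapt them to whatever
your tool actually produces.

```python
from collections import Counter


def summarize_crawl(records, expected_pages, tolerance=0.2):
    """Count unique pages and status codes, and flag a crawl that found
    noticeably fewer pages than the estimate made while browsing."""
    unique_urls = {r["url"] for r in records}
    by_status = Counter(r["status"] for r in records)
    shortfall = expected_pages - len(unique_urls)
    return {
        "unique_pages": len(unique_urls),
        "by_status": dict(by_status),
        # Complete enough if we are within the tolerance of the estimate.
        "looks_complete": shortfall <= expected_pages * tolerance,
    }
```

A crawl that falls well short of your estimate usually points to a
configuration problem, such as an excluded directory or a port the crawler
was not allowed to follow.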