In general, a so-called web crawler is a computer program that starts with a given URL (or a list of URLs) and then browses the corresponding web content in a methodical, automated way. In most cases the content is visited recursively, meaning the crawler fetches all (sub-)pages of a given web site. Web crawlers save the content of the visited pages for further processing and analysis (e. g. to build up some kind of searchable index).
I have crawled quite a few web sites on request in the past, and in most cases the results were really surprising. Weak web servers and content management systems were just the tip of the iceberg. Simply by crawling the content of a given web site I was able to find test accounts left behind by the site's administrator or designer, "funny and stupid" test or dummy pages that could lead to serious reputation loss, and already infected servers acting as malware distributors, just to name a few examples. In the end I always wondered why the responsible administrator or IT department never did such a basic test themselves. It is so easy to scan your web site, download nearly all of its public content, take a closer look at the crawled data, run it through an ordinary malware scanner, check for weird content and so on. To give you a starting point for your own crawling analysis, the following lines and recommendations are an introduction to crawling your web site using wget.
When it comes to simplicity, wget is a really nice tool for downloading and even for crawling resources from the internet; for more details see http://www.gnu.org/software/wget/. Its simplicity makes it perfectly suitable for an in-depth analysis. The basic usage is as simple as calling wget with a URL.
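As a minimal sketch, such a call could look like this (bitnuts.de, the domain crawled in the next step, stands in for any target):

```shell
# Fetch only the start page (index.html) of the given domain:
wget http://www.bitnuts.de
```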
This downloads just the main (index.html) page of the given domain. To recursively crawl bitnuts.de, call wget with the recursion option (-r) turned on. Because many servers do not want you to download their entire site, they try to prevent this by checking the caller's user-agent string or via robots.txt rules. I recommend changing wget's user-agent string (--user-agent="your user string") and discarding the robots limits (-e robots=off). I also recommend using the options that limit the crawling speed, retrying failed downloads (-t 7) and waiting a few seconds between retrievals (-w 3); this helps make sure you are not added to a blacklist. To make wget use a proxy (e. g. TOR), you must set up an environment variable before using wget. Adjust the http_proxy environment variable
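for example like this (the local proxy address is an assumption; note that wget only speaks HTTP proxies, so routing through TOR usually means pointing it at a local HTTP-to-SOCKS bridge such as Privoxy):

```shell
# Assumed local HTTP proxy on 127.0.0.1:8118 (e. g. Privoxy forwarding to TOR);
# adjust host and port to your own setup:
export http_proxy=http://127.0.0.1:8118/
export https_proxy=http://127.0.0.1:8118/
```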
and turn on the --proxy=on feature in wget. It also makes sense to exclude some file types like ISO images, MP3s or other large files, to speed up crawling without losing time downloading them. Just call wget using its -R option.
You might start crawling your web site using the options recommended above, for example:
wget -r -l 0 -e robots=off -t 7 -w 3 -R 7z,zip,rar,cab,iso,mp3 --waitretry=10 --random-wait --user-agent="Botzilla/1.0 (+http://botzilla.tld/bot.html)" http://www.your-domain.tld

or, with cookie handling and the proxy turned on:

wget -r -l 0 -e robots=off -t 7 -w 3 -R 7z,zip,rar,cab,iso,mp3 --waitretry=10 --random-wait --cookies=on --save-cookies=cookies.txt --proxy=on --user-agent="Botzilla/1.0 (+http://botzilla.tld/bot.html)" http://www.your-domain.tld