Brainstorms and Raves

Notes on Web Design, Development, Standards, Typography, Music, and More

Sunday, 9 October 2005

Behind the Scenes with Apache’s .htaccess

Although I’m a designer and not a programmer or server-side specialist, for a few years I’ve used Apache’s .htaccess to a limited degree for clients' websites, primarily for simple URL redirects and setting up custom error pages. Now that I can use Apache’s .htaccess for my own websites, I’ve been immersed in learning more about how to use this powerful tool conservatively but effectively to redirect URLs and to combat spammers and bad bots. Today’s post provides links to some of the online sources that I’ve found especially helpful.

First, A Word of Warning

Keep in mind that one little typo or incorrect rule within an .htaccess file can cause an internal server error and take your entire website offline. Especially if you’re new to using an .htaccess file, I highly recommend setting up a test directory to work on your .htaccess file. In addition, always make a backup of your .htaccess file before making any changes. That way, if you do happen to make a typo or other error, you can load your backup file again to keep your website up and running while you look for the source of the problem(s).

In addition, many caution those new to .htaccess against getting carried away and creating excessively big .htaccess files. Keep in mind that the server processes this file for each request to your website, so you don’t want to hurt your server’s performance. For those with access to the httpd.conf file on their Apache server, many recommend using that instead of .htaccess, especially for better server performance. Many of us on shared servers, though, don’t have access to it, myself included.

I prefer to think of .htaccess as just one of a variety of approaches and tools for managing URLs (especially URL redirecting), managing custom error pages, and combating bad bots and spammers. It’s a fantastic tool that I’m thrilled to be able to use for my own websites finally, including this one. (About two months ago all of my websites moved to a new server.)
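For readers who haven’t seen one, here’s a minimal sketch of the kind of simple URL redirect I mean. It assumes Apache’s mod_alias is available, and the paths and the example.com domain are placeholders, not rules from my own files.

```apache
# Permanently (301) redirect a single page that has moved.
# /old-page.html and example.com are placeholder names.
Redirect 301 /old-page.html http://www.example.com/new-page.html

# Permanently redirect everything under a renamed directory,
# carrying the rest of the path along with $1.
RedirectMatch 301 ^/old-section/(.*)$ http://www.example.com/new-section/$1
```

As with any rule, try it in a test directory first and confirm the redirect lands where you expect.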

Regarding combating bad bots and spammers, .htaccess is one of several tools and approaches that I use. My goal is to keep things simple and block the bad guys without blocking everyone else. No one single approach can do it all, though, and bad bots and spammers continually work on ways to get past all the blocking approaches discussed online. So far I’m able to block nearly all of the bad bots and spammers, but new ones always come along, so I watch my logs closely, too.

On to some website links that I’ve found especially helpful.

Apache Documentation

First, here are several links to the definitive source for Apache 1.3 and Apache 2.0 specifically related to using .htaccess, especially for redirecting URLs and blocking bad bots and spammers.

Apache 1.3
Apache 2.0

How to Use .htaccess, mod_rewrite, and Related (for Apache)

.htaccess Tools

I’ve been scouring the Internet looking for tools that will check .htaccess files for typos or other potential problems. So far I haven’t found anything, although I did find some tools that will help you create .htaccess rules and test user agent strings. They’re listed below.

Tools to Generate .htaccess Rules

Try one of these tools to generate rules for redirects, hotlink protection, password protection, or blocking bad bots. At the minimum, you can try them out as learning tools to see how something might be handled. Note that they might not handle very complex rules.

Tools to Test .htaccess Rules
  • WannaBrowser
    WannaBrowser is a helpful way to test whether your rewrite rules for user agents work as intended. If you’ve blocked a certain user agent string for a bad bot, for example, you can check that the rule is working properly with this online tool.

Forums Devoted to Apache .htaccess, mod_rewrite, mod_setenvif, and Related

You’ll find enormously helpful tips and troubleshooting help for using .htaccess, mod_rewrite, mod_setenvif, and related Apache features via these forums. You don’t need to subscribe to read most discussions, although you’ll need to sign up to post your questions or comments, and Webmasterworld Forums has subscriber-only areas in addition to their freely available areas.

  • .htaccess Tools Forum
    Helpful forum devoted to all things .htaccess via the .htaccess Tools website.
  • mod-rewrite.com
    Website has a forum all about using mod_rewrite, such as URL handling, access restriction, regular expressions, and more.
  • SitePoint Forums: Apache
    SitePoint’s Apache forum is a busy one, filled with lots of tips, examples, resources, and more.
  • Webmasterworld Forums: Apache Web Server
    This top-notch forum covers .htaccess, mod_rewrite, and other Apache topics. I’ve found countless tips and insight here. Be sure to check out their Charter - Apache Web Server, too, as you’ll find helpful resources there in addition to reviewing their rules for posting.

Using .htaccess to Block Hotlinking, Stop Bandwidth Theft

I absolutely love being able to use .htaccess to prevent other websites from directly linking to my server’s images, CSS, JavaScript, etc. Here are a few tutorials on how to do it, plus a tool for testing the result.

  • Preventing image hotlinking: An improved tutorial
    By Tom Sherman, via Underscorebleach.net, November 21, 2004 (updated September 14, 2005).
  • Smarter Image Hotlinking Prevention
    Prevent others from directly linking to your server’s images, CSS, JavaScript, and other files. By Thomas Scott, via A List Apart, July 13, 2004.
  • Stop Hotlinking and Bandwidth Theft with .htaccess
    Helpful, easy-to-understand tutorial at altlab.com. The approach used in this tutorial is basically what I do for my websites, as I’ve also chosen to send a Forbidden (403) error message.
  • URL Hotlink Checker
    You can test the effectiveness of your website’s hotlink protection with this online tool by entering a complete URL from your website to see if your image can be loaded and hotlinked by a remote server. Via altlab.com.

Note that you might wish to allow certain sites to directly link to a specific image, such as an icon image for your newsfeeds, while still not allowing hotlinking to all your other images. I recently added my newsfeeds-related icon image to a separate directory, and in that directory’s .htaccess file I’ve specified a rule using Apache’s <Files> directive to allow hotlinking to that specific image only. I’m currently testing that to see how it goes for the next few weeks. I prefer that people download the icon to use from their own servers, so if I find other websites abusing the hotlinking for that image, it’s easy enough to individually prevent them from hotlinking to it and make more restrictive rules within that separate directory’s .htaccess file.
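The tutorials above cover the details, but the general shape of a hotlink rule that sends a Forbidden (403) response looks something like the sketch below. It assumes mod_rewrite is enabled for your account; example.com stands in for your own domain, and the list of file extensions is only an illustration, not my exact rule set.

```apache
RewriteEngine On

# Let requests with an empty Referer through, since some browsers
# and proxies never send one.
RewriteCond %{HTTP_REFERER} !^$

# Let requests coming from your own pages through.
# example.com is a placeholder for your domain.
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com(/|$) [NC]

# Any other request for these file types gets a 403 Forbidden.
RewriteRule \.(gif|jpe?g|png|css|js)$ - [F,NC]
```

After uploading a rule like this, a hotlink checker such as the one listed above is a quick way to confirm that remote requests are refused while your own pages still load their images.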

Using .htaccess to Ban Bad Bots and Spammers

Note that some of the Webmasterworld forum links might require a subscription.

Some helpful forum threads:

Weblogs, Wikis, Sites, Sections Devoted to Combating Bad Bots, Spammers

  • Chongqed
    Manni’s weblog (Manfred Heumann) devoted to hunting down and sharing spammer information, wiki spam, email spam, and life in general.
  • chongqed.org
    Another invaluable weblog and wiki devoted to hunting down spammers and sharing info with everyone to fight wiki spam, blog spam, and guestbook spam. Run by Joe (from Texas) and Manni (Manfred Heumann).
  • Spam Chongqing
    Joe’s (from Texas) weblog devoted to hunting down and sharing spammer information.
  • Spam Huntress
    An invaluable weblog and wiki devoted to hunting down spammers, sharing info with everyone to help combat spam and block spammers from your websites.
  • Spam Kings Blog
    News and information about catching and prosecuting spammers, covering topics from the book Spam Kings, by Brian McWilliams.
  • Tom Raftery’s I.T. Views: .htaccess Category
    Tom’s site is also quite helpful with strategies, tips, and links to combat spammers and bad bots.

Thoughts on Dealing with Comment, Referral, Trackback Spam

As I mentioned above, no single approach will be totally effective or even practical in blocking comment spam, referral spam, or trackback spam. Blocking solely by IP address or host quickly becomes impractical, as anyone who’s tried it knows: your ban list grows rapidly, IPs go out of date just as fast, and many of them belong to zombie machines. Blocking by user agent can help, but spammers spoof user agents and you don’t want to block legitimate users. There are known spoofed user agent strings that you can add to your ban list, though, which can help quite a bit. Blocking by referrer can be helpful, but that ban list grows just as quickly as an IP list. Blocking by keywords in referrers and hosts covers most spam referrals and hosts, but I’ve also recently found spammers trying more legitimate-looking domain names. Keep in mind that spammers are always coming up with new ways to get around blocking approaches, too.

Largely for these reasons I’ve found it most effective for my own websites to use a combination of several approaches and tools. Each of my websites is different, though, so I don’t do the same things at each site, although there is certainly some overlap.
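To make the referrer-keyword idea a bit more concrete, here’s a rough sketch of how such a rule can be written with mod_rewrite. The keywords and the spam-example.invalid host are made up for illustration; a real list comes from watching your own logs, and any keyword that’s too broad will block legitimate visitors.

```apache
RewriteEngine On

# Match referrers containing any of a few keywords (hypothetical list).
# [NC] ignores case; [OR] chains this condition to the next one.
RewriteCond %{HTTP_REFERER} (poker|casino|pills) [NC,OR]

# Match one specific referring host (placeholder name).
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?spam-example\.invalid [NC]

# Refuse matching requests with a 403 Forbidden.
RewriteRule .* - [F]
```

Keyword rules like this keep the list short compared with banning referrers one at a time, which is the main reason to prefer them, but they’re only one layer among several.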

Here are some helpful articles on ideas and ways of helping to combat the spammers.

Regular Expressions

Learning even just a little about regular expressions can be quite valuable. Learning more about them can go a long way toward writing leaner mod_rewrite rules and other rules for your .htaccess files.

Robots.txt

Unfortunately, many bots disregard your robots.txt file or never even request it. Good ones will honor it, though, so it’s still worth creating.

For my own websites, as long as the bot or spider behaves itself properly, I typically allow it, but I do have exclusions in my robots.txt file. Known bad bots or spiders and bots or spiders that disregard the rules or behave badly are banned from my website via my .htaccess file.
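A ban by user agent in .htaccess can be sketched roughly as follows, using mod_setenvif together with the Order/Allow/Deny access directives from Apache 1.3 and 2.0. The user agent strings here are placeholders rather than entries from my actual list, and newer Apache versions may need different access-control directives.

```apache
# Flag any request whose User-Agent contains one of these strings
# (case-insensitive). Both names are hypothetical examples.
SetEnvIfNoCase User-Agent "ExampleBadBot" bad_bot
SetEnvIfNoCase User-Agent "EmailHarvester" bad_bot

# Allow everyone by default, then deny the flagged requests.
# With "Order Allow,Deny" a matching Deny overrides the Allow.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Denied requests receive a 403 Forbidden. You can check a rule like this by sending a request with the banned user agent string, for example with a user agent testing tool.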

Here’s some information on how to create and check a robots.txt file for your website.

  • The Web Robots Pages
    Martijn Koster’s website all about robots.txt and the Robots Exclusion standard.
  • Put your robots.txt on a diet
    How to reduce the file size of your robots.txt file by removing duplications, compressing multiple records, and more. Via Webmasterworld.
  • The Robots.txt Our Big Crawl
    Common problems and errors found after researching 2.4 million URLs and 75,000 Robots.txt files. Great insight so you avoid these problems! Via Webmasterworld.
  • Robots.txt Validator
    Check your robots.txt here with this helpful online tool. Via SearchEngineWorld.

Which Bots or User Agents are Good or Bad?

  • Bots, Blogs and News Aggregators Presentation Sources and White Paper
    By Marcus P. Zillman, M.S., A.M.H.A.
  • Information Retrieval Software
    A website devoted to providing information about information retrieval software (including email scrapers, spambots, etc.), search engine robots, and more.
  • List of Bad Bots
    Helpful information here on quite a few user agents, including what type of bot, user agent strings, IP addresses, links to more details, and more. Well done. By Ralf D. Kloth, via kloth.net.
  • List of User-Agents (Spiders, Robots, Crawler, Browser)
    Hundreds listed in these helpful charts that include type of user agent, descriptions and links to information about hundreds of spiders, robots, crawlers, and browsers. Types include: (Client) browser, Link-, bookmark-, server- checking; Downloading tool; Proxy server, web filtering; Robot, crawler, spider; Spam or bad bot. By Andreas Staeding, via psychedlix.com.
  • Project Honey Pot Statistics: Top Spam Harvester User Agents
    Listings by type of user agent, including the page linked here. You’ll also find Robot User Agents, currently active Top 25 Global Spam Harvester List, and more. Via projecthoneypot.org.
  • RSS user agent identifiers
    A helpful list of RSS user agents categorized by Web aggregators and search engines, Desktop readers and aggregators, RSS tools and services. By Philip Shaw, via Code Style.
  • Search Engine Spider Identification: Ultimate short list of banned bots
    Includes helpful links for ways to fend off bad bots and spammers. Via Webmasterworld.
  • Search Engine Robots
    Fabulous listings here with descriptions and links. The categories include: Search engine robots and others, Browsers, Link Checkers, Link monitors and bookmark managers, Validators, FTP clients and download managers, Research projects, Software packages, Offline browsers and other agents, Other miscellaneous agents, Sites that regularly visit, Other useful sites, some fakers. By John A. Fotheringham, via jafsoft.com.
  • Statistics
    An informative and helpful post about how stats logs (primarily AWStats) handle XML feeds and how to sort them out to get a better view of good guys and bad guys. By Tomas Jogin, via Jogin.com, June 15, 2004.
  • System: User Agents
    A searchable directory of user agent strings that includes their source, purpose, and, for most of them, links to more information. You can search the database or paste a user agent string into a form there, too. Fantastic and helpful features. Provided by The Art of Web.
  • User Agent Strings
    Helpful table of user agents with descriptions, opinions on whether they’re legitimate or not, good or bad, etc., and links to more info. Via 50by50.com.

HTTP Error Codes

Most of us probably know what a 404 error is (page not found), but there are lots more server-side error codes. You can create custom error pages with more helpful error messages, adding rules for them within your .htaccess files if you wish, such as a custom 404 message. You can view this website’s custom 404 error message to see what I mean. Here are some helpful sources for more information about error codes.
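For reference, the rules for custom error pages are short. This is only a sketch: the /errors/ directory and file names are made up, so point them at wherever your own error pages live.

```apache
# Serve custom pages for common errors.
# Paths start from the document root; /errors/... is a placeholder location.
ErrorDocument 404 /errors/404.html
ErrorDocument 403 /errors/403.html
ErrorDocument 500 /errors/500.html
```

Using a local path rather than a full http:// URL matters here, because with a full URL Apache sends the visitor a redirect and the original error status code is lost.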

Server Vulnerabilities

Comments


  1. Once again, you are the consummate scout. This permalink definitely belongs with my Apache favorites for the next time I try to wrangle .htaccess.

    Comment by Bob Stein, 05:24 am PDT, 10 October, 2005

  2. Wow - great article, great resource. I appreciate how you briefly mentioned htaccess - but didn’t go into some long-winded example, which just confuses people. Rather, you briefly touched on some valuable htaccess features, and pointed to many great resources.

    Comment by Matthom, 06:50 am PDT, 10 October, 2005

  3. Wow, what an outstanding resource, you are the bomb!

    Comment by Robin, 02:37 pm PDT, 14 October, 2005


This discussion has been closed. Thanks to all who participated.


I Wrote a Book

Deliver First Class Web Sites: 101 Essential Checklists, by Shirley Kaiser. SitePoint Books (July 2006).

http://brainstormsandraves.com/archives/2005/10/09/htaccess/
Page last modified 14 July, 2007 - 10:19pm PDT