URL cruft, and how to remove it

By Oli
At 2:19 PM · Monday, 29 December · 2003
To Coding · Unix · Weblogging

I just posted a comment on non-crufty URLs that turned into a novella over at Whitespace, so here it is rewritten as an article for future reference:

An important but overlooked aspect of websites is the URL, or Uniform Resource Locator. This is the ‘web address’, usually of a page on the internet. While it seems minor, the URL is part of the page’s interface, and some simple rules can make a big difference in ease of use. For people, a URL should be:

  • easy to type
  • easy to remember
  • short if possible
  • ‘hackable’ ie predictable enough to guess
  • and permanent

For search engines there’s really only one important thing:

  • the URL shouldn’t contain troublesome characters, usually ? and &. A ? usually means the URL came from a script, so search engines won’t index it. & is the HTML delimiter for escaped characters, eg & is written in HTML as &amp;. This means URLs with a bare & in them don’t validate (insignificant but annoying).
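
To make the & point concrete, here is a minimal sketch in Python (my choice for illustration only; none of the sites mentioned here use it). It escapes a query-string URL before it goes into an href attribute, using the standard library's html.escape. The function name href_attribute and the example URL are assumptions made up for this sketch.

```python
from html import escape


def href_attribute(url: str) -> str:
    """Return an href="..." attribute with HTML-special characters escaped."""
    # escape() turns & into &amp; (and also escapes <, > and quotes)
    return 'href="{}"'.format(escape(url, quote=True))


crufty = "/index.html?cat=coding&story=00056"
print(href_attribute(crufty))
```

A URL with no ? or & in it passes through unchanged, which is one more small argument for keeping them out in the first place.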

Big sites using a CMS usually have a programmer using mod_rewrite or something to do URL mapping — this is definitely the way to go. Unfortunately mod_rewrite is hard (for us mere mortals anyhow). For sites that don’t use URL mapping, here are some links that might be interesting:

Background Information:

Cool URLs don’t change by Tim Berners-Lee
This article advocates making sure a URL will exist in 2, 20 or even 200 years. URLs only change when some information in them becomes outdated, so the important thing is not to put in anything that might change. This basically leaves you with the date of creation and the page title. Leaving off the file extension is also good, as you can then tell Apache (in .htaccess) which extensions to look for, in order of preference.
URL as UI by Jakob Nielsen
Of course Jakob has something to say about this :-) He advises on domain name choices & avoiding link-rot, and also states that URLs should be shorter than 78 characters. This is the typical line length where email will break a URL over two lines.
The Perfect Blogging System by Matthew Thomas
Matthew lists what he thinks the requirements of a perfect weblog are. His ideal permalink is something like http://domain.name/2003/07/26/keyword.
How to recognize a Weblog tool by its permalinks by Matthew Thomas
This shows the cruft in a variety of weblog software default URLs.
When Good Interfaces Go Crufty by Matthew Thomas
This article is about software and OS interfaces, but is relevant to many things. Highly recommended. Cruft is used to mean “extra stuff that isn’t necessary”.
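
As a rough illustration of two of the ideas above (Matthew's date+keyword permalink and Jakob's 78-character limit), here is a small Python sketch. The function names and the MAX_URL_LENGTH constant are hypothetical, invented for this example; no weblog tool mentioned here works this way internally.

```python
from datetime import date

MAX_URL_LENGTH = 78  # the email line length Jakob mentions


def permalink(base: str, created: date, keyword: str) -> str:
    """Build a date+keyword permalink like http://domain.name/2003/07/26/keyword."""
    return "{}/{:04d}/{:02d}/{:02d}/{}".format(
        base.rstrip("/"), created.year, created.month, created.day, keyword
    )


def fits_on_one_line(url: str) -> bool:
    """True if the URL is shorter than the 78-character limit."""
    return len(url) < MAX_URL_LENGTH


url = permalink("http://domain.name", date(2003, 7, 26), "keyword")
print(url, fits_on_one_line(url))
```

Nothing in the URL depends on the software that produced it, only on the creation date and a keyword, which is what keeps it stable if you change tools.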

General Summary:

Friendly, Lasting URLs by Shirley Kaiser
Shirley’s article is similar to this one but covers a little more, including: more background information, swapping from one URL scheme to another, & redirecting from old URLs to new ones. She wrote it after changing from Blogger to MT.

How to improve URLs:

…for MovableType
Cruft-free URLs in Movable Type, by Mark Pilgrim
Mark details how he changed from MT’s default URLs to cruft-free date+title ones. You can also check out the Dive Into Mark templates. Note that to remove the index filename from category and date archives, you need to replace all links to these archives with manual ones that don’t include index. eg instead of using the <MTArchiveList archive_type="Monthly"> and <$MTArchiveLink$> tags, you’ll have to hard-code the links, eg
(<a href="http://domain.name/year/month/">December 2003 archive</a>)
Howto: Future-proof URLs in Movable Type, by Már Örlygsson
This achieves a similar result, but Már uses regex and hacks the MT source code to do it. For the hard-core ;-)
More on Friendly URLs by Shirley Kaiser
Shirley discusses the pros and cons of various naming schemes. She’s used a combination of Már Örlygsson’s approach and Dave Dribin’s MT-ShortTitle plugin, which utilizes a user-selected part of the title for the filename. Dave has also made a MT source modification called Index Patch to allow Category and Date archives to appear as directories. I’m tempted to use this myself, but want to avoid patching the MT source code at all costs.
…General Methods
Making “clean” URLs with Apache and PHP, by stef*notabene on eVolt
This article explains how to turn query string-type URLs into clean ones, and is aimed at people using PHP (eg /index.html?cat=coding&story=00056 to /coding/00056.html, or even to /coding/goodurls). You’ll need PHP, Apache (.htaccess and ForceType), and some way of storing the lookup tables (maybe a database).
Clean url’s, by Thijs van der Vossen
Thijs details a mod_rewrite rule that forwards URLs lacking a .html extension to the filename+.html, if the file exists.
Apache Content Negotiation and Multiviews
For the hard-core! You could also consider going super-hard-core with the Apache 1.3 URL Rewriting Guide by Ralf S. Engelschall (very scary :-)
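
For anyone who wants to see the logic behind two of the general methods without wading into Apache config, here is a minimal Python sketch. It is not the code from either article. clean_path shows the query-string-to-clean-path mapping from the eVolt example, and resolve_extensionless mimics the idea behind Thijs's rule (serve name.html when name is requested and the file exists). Both function names and the temporary docroot are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def clean_path(crufty_url: str) -> str:
    """Map /index.html?cat=coding&story=00056 to /coding/00056.html."""
    query = parse_qs(urlsplit(crufty_url).query)
    # parse_qs returns a list of values per key; take the first of each
    return "/{}/{}.html".format(query["cat"][0], query["story"][0])


def resolve_extensionless(docroot: Path, url_path: str) -> Optional[Path]:
    """Return the .html file an extension-less URL path should serve, if any."""
    if url_path.endswith(".html"):
        return None  # already has the extension; nothing to rewrite
    candidate = docroot / (url_path.lstrip("/") + ".html")
    return candidate if candidate.is_file() else None


print(clean_path("/index.html?cat=coding&story=00056"))

# Demonstrate the fallback against a throwaway document root
with tempfile.TemporaryDirectory() as tmp:
    docroot = Path(tmp)
    (docroot / "coding").mkdir()
    (docroot / "coding" / "goodurls.html").write_text("<p>hello</p>")
    print(resolve_extensionless(docroot, "/coding/goodurls"))  # the .html file
    print(resolve_extensionless(docroot, "/coding/missing"))   # no such file
```

In real life the second half is the web server's job (mod_rewrite or MultiViews), so treat this purely as a picture of what the server is doing for you.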

The method I use combines Mark Pilgrim’s code with Brad Choate’s MT-KeyValues plugin. If a KeyValue is present, I use it in place of the article title in the URL. So rather than relying on a dirified title (I like long, informative titles) or on keywords as Mark does (I sometimes actually use keywords, for things like UK/USA spelling differences), I can specify exactly what I want. You can see the code I’m using in Read the funky manual. Note that the title of that post is “Read the funky manual” (not RTFM ;-) and it gets its URL slug via a KeyValue of “url=rtfm”. The title of this page is “URL cruft, and how to remove it”, but with a KeyValue of url=urls I end up with http://oli.boblet.net/2003/12/29/urls.
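
The slug logic boils down to "use the url= KeyValue if there is one, otherwise fall back to the title". Here is a minimal Python sketch of that decision. It is not my actual MT template code, and the dirify function here is only a rough approximation of what MT's dirify does (lowercase, strip punctuation, underscores for spaces). Both function names are assumptions for illustration.

```python
import re
from typing import Mapping


def dirify(title: str) -> str:
    """Rough approximation of a 'dirified' title (an assumption, not MT's code)."""
    lowered = title.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)      # drop punctuation
    return re.sub(r"\s+", "_", no_punct.strip())    # whitespace runs -> underscore


def url_slug(title: str, key_values: Mapping[str, str]) -> str:
    """Prefer an explicit url= KeyValue; otherwise fall back to the dirified title."""
    return key_values.get("url") or dirify(title)


base = "http://oli.boblet.net/2003/12/29/"
title = "URL cruft, and how to remove it"
print(base + url_slug(title, {"url": "urls"}))  # explicit KeyValue wins
print(base + url_slug(title, {}))               # falls back to the title
```

The fallback URL is perfectly usable, just a lot longer, which is why I bother with the KeyValue at all.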

Discussion...

Comments (2) · TrackBacks (2)  to  http://www.boblet.net/cgi-bin/mttb-external.cgi/53
1. Comment by Shirley Kaiser  · 8 Jan, 2004 · 12:18 PM

Hi, you might be interested in my posts on this topic, too, although your content has similar links to mine:

Also, Mark Pilgrim’s approach does NOT require hard-coding, as you mention in your post. That’s all automated via his approach with Movable Type.

2. Comment by oli  · 8 Jan, 2004 · 4:36 PM

Hi Shirley,

Thanks for the links - I’m looking forward to checking them out.

Unfortunately I have to disagree with you over Mark using hard-coding. And even better I know I’m right because he told me himself (!) (at least I hope I’m right? ;-) While his method generates clean URLs for entry archives, category and date archives must still be built with a filename. This means that even if you remove the suffix .html from them, you’ll still end up with the rather fugly http://oli.boblet.net/2003/12/index (the date archive for December 2003). I personally want the URL http://oli.boblet.net/2003/12/ to display this archive (whether the file name is index, index.html, or whatever). However, if I try to remove the “index” from the archive naming template section, MT fails with a rebuild error. Basically I’m telling it to make a file with no filename.

Presently I’ve got the fugly version (that includes the trailing /index), but Mark said that to remove it from his site he manually hard-coded all category and date archive links without this. Of course Apache looks for index.html or whatever you specify in the .htaccess file, and displays the file without altering the nice ‘nameless’ URL.

Of course if you know of a cunning way around this (preferably not involving regex plugins ;-) I’d love to hear it! Aha! Your archives lack the trailing index! How did you do that?? Go on, pretty please? ;-)

Update: 2004-01-08T20:25:00+9; PS the relevant comments in Mark’s article are probably my one, Mike Steinbaugh’s, and Mark’s non-informative announcement.

Update: 2004-01-09T08:40:00+9; You are using a different method to Mark’s, and now I understand. The Index Patch you link to would let Mark’s method work without hard-coding, but I’m reluctant to patch the MT source code. Hrm…

Thanks again for your comment!

3. Trackback from SocialTwister  · 29 Jan, 2004 · 2:58 AM

The world we know has quickly changed from one of 10-digit phone numbers to significantly longer URIs, URLs for us common folk. For some members of the web community, especially the blogging and web standards aficionados, this outgrowth of the…

Read more in The Case For Cruft »
4. Trackback from SocialTwister  · 29 Jan, 2004 · 3:00 AM

The world we know has quickly changed from one of 10-digit phone numbers to significantly longer URIs, URLs for us common folk. For some members of the web community, especially the blogging and web standards aficionados, this outgrowth of the…

Read more in The Case For Cruft »