Sunday, May 01, 2005

Crufty URLs and link auto-detection

There’s been plenty of discussion in the past on cruft in URLs: the non-human-readable gibberish which turns URLs that should be simple into long and unreadable monstrosities. This piece at Brainstorms & Raves is as good a place as any to start.

However, there’s one important use case of URLs that these discussions don’t seem to touch on much: auto-detection of links in plain text. This happens more often than you might realise. Mail programs often look for links in messages and turn them into active links as a convenience. Web forums often do this too, so users can post links without needing to know HTML or a forum-specific markup language. So too do some weblog comment systems.

URL detection is often fairly simplistic. A regular expression is matched against each line of the text, looking for "http://" or maybe "://" or even, to catch the case where users enter only the address and not the protocol, just "www.". And then the match is extended forward and back to the nearest whitespace or punctuation. Why punctuation as well as whitespace? Because people often use URLs as they would words: “if you go to www.yahoo.com, you’ll find…”.
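To make that concrete, here is a minimal sketch in Python of the kind of detector described above. The pattern, the set of stop characters and the name find_links are all my own assumptions for illustration; real mail programs and forums each use their own rules, and this version only extends the match forward from the trigger.

```python
import re

# Assumed rule: a link starts at "http://", "https://" or "www." and runs
# until whitespace or one of a few punctuation marks. Real detectors vary.
LINK = re.compile(r"""(?:https?://|www\.)[^\s,!;()<>"']+""")

def find_links(text):
    """Return the substrings a simplistic auto-linker would turn into links."""
    # Trailing full stops, colons and question marks usually belong to the
    # sentence rather than the address, so strip them from the match.
    return [m.group(0).rstrip(".:?") for m in LINK.finditer(text)]

print(find_links("if you go to www.yahoo.com, you'll find"))
# prints ['www.yahoo.com']
```

The comma after the address is treated as the end of the link, which is exactly what you want here and exactly what goes wrong when the comma is really part of the URL.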

So, make life easy for plain-texters. Keep URLs short, so they don’t break across multiple lines: the penalty for this is often that the first fragment of the URL gets linked, but the second fragment doesn’t. And keep punctuation characters — apart from the ubiquitous question-mark — out of URLs. If you don’t, you risk your URLs getting truncated to the first punctuation mark.
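As a rough illustration of that advice, the sketch below flags URLs that are likely to suffer in plain text. The 72-character limit, the list of risky characters and the function name plain_text_risks are assumptions of mine, not taken from any standard or tool; the question mark is deliberately left off the list.

```python
# Assumed limits, chosen only for illustration.
MAX_LENGTH = 72
RISKY_CHARS = set(",!;()<>\"'")

def plain_text_risks(url, max_length=MAX_LENGTH):
    """Return a list of reasons the URL may break when auto-linked."""
    risks = []
    if len(url) > max_length:
        risks.append(f"longer than {max_length} characters, may wrap across lines")
    # Collect each risky punctuation character that appears in the URL.
    found = sorted({ch for ch in url if ch in RISKY_CHARS})
    if found:
        risks.append("contains punctuation: " + " ".join(found))
    return risks
```

A short, punctuation-free address such as http://www.yahoo.com comes back with an empty list; a URL containing a comma or an exclamation mark gets flagged.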

Two case studies, one old and one brand new:

www.moneysavingexpert.com vs www.fool.co.uk:

Moneysavingexpert’s article URLs end in a comma. (Here’s a recent example). When links to articles are posted to Motley Fool’s text-based discussion boards, truncation strikes: the trailing comma is treated as punctuation, not part of the link, and the truncated link leads to a “not found” error.

This led to the odd situation of Motley Fool posters — even the proprietor of Moneysavingexpert himself — carefully warning readers to copy rather than click the link:

I’ve done a full financial assessment of the 60 main credit card reward schemes on the market. If you want to read it then cut and paste the following link http://www.moneysavingexpert.com/cgi-bin/viewnews.cgi?newsid1048180772,56733, (NOTE: DON’T CLICK ON LINK — for some reason it doesn’t pick up the last comma that’s important, cut and paste it or go to www.moneysavingexpert.com and click the top guide.)

(Martin Lewis in http://boards.fool.co.uk/Message.asp?mid=7812817)

This was eventually fixed at the Moneysavingexpert end: their system still generates URLs with trailing commas, but now redirects URLs missing the trailing comma to the correct pages:

I’ve had a lot of emails from people who have posted links to moneysavingexpert.com on here which don’t work. The key is links to articles on the site all end with a comma that the fool system (as well as other boards and e-mail softwares) misses out — so it just links to an error page.

We’ve tried a fix and now the links should work even if the comma isn’t hyperlinked.

(Martin Lewis in http://boards.fool.co.uk/Message.asp?mid=8990407)
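
I don’t know how Moneysavingexpert implemented the fix, but the general idea of tolerating a missing trailing comma might look something like this sketch. The function resolve_article_path, the known_paths set and the choice of a 301 status are all hypothetical.

```python
def resolve_article_path(requested, known_paths):
    """Return (status, path) for a requested article path.

    known_paths is a hypothetical set of canonical paths, each of which
    ends in a comma.
    """
    if requested in known_paths:
        return 200, requested
    # An auto-linker may have dropped the trailing comma: try restoring it.
    repaired = requested + ","
    if repaired in known_paths:
        return 301, repaired  # assumed: permanent redirect to the real page
    return 404, None
```

A request for the truncated path is redirected to the canonical one, while a genuinely unknown path still gets a “not found” response.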

MSN Spaces vs Manila comments:

MSN Spaces, Microsoft’s new community/blogging tool, generates permanent links for articles which include an exclamation mark in the middle of the URL. This doesn’t play well with, for example, Manila’s comment system, which auto-links URLs. See for example this comment on Robert Scoble’s weblog:

Robert, I applaud your stance. My comment on Ballmer’s memo is over on my blog at http://spaces.msn.com/members/gcoupe/Blog/cns!1pfnKMM_BORf8-PhonbrwGoA!542.entry

(Geoff Coupe on http://scoblecomments.scripting.com/comments?u=1011&p=9919)

The truncated link leads to the root of the poster’s blog, not to the post in question; a misdirected link which is gradually going further and further out of date.

This is a surprisingly poor choice of URL construction on the part of MSN Spaces: not only are the permanent link URLs packed with crufty gibberish, but they also invite breakage when they’re used in plain-text contexts.
