Wednesday, August 03, 2005

Gaming the system: hidden ads and comment spam

There’s an interesting shift in spam on the web: Google and other search engines now have so much power that spam is increasingly being targeted at search engines, rather than at humans. Links are important in raising your position in Google’s rankings, so the more links you can throw out to yourself, the higher you go.

One twist on this that seems to be increasing recently: spam that’s visible only to search engines. CSS makes it relatively easy to include elements on a page which are made invisible to readers: one way to achieve this is to position the elements outside the page boundary.

A recent high-profile case was this story, about hidden articles on the WordPress website:

These articles are designed specifically to game the Google Adwords program, written by a third-party about high-cost advertising keywords like asbestos, mesothelioma, insurance, debt consolidation, diabetes, and mortgages.
The twist in this scheme: hoist these hidden articles up the Google rankings by linking to them from the very highly ranked WordPress home page. Arve Bersvendsen describes how:

The key here being the -9000px text indent: This makes the link invisible to human visitors with CSS, and visible to every search engine on the planet.
After a community outcry, the articles and the hidden links were removed.

More recently, The Republic of Geektronica discussed BlogSpot spam blogs:

A large percentage (maybe up to a third) of all Blogspot blogs are spam-logs—sites created to increase the Google ranking of some other site (which is itself usually a Google-spamming site). The ultimate purpose of these spamlogs is usually to drive traffic to a commission-paying pharmacy, pr0n, or casino site.
BlogSpot spam, despite Blogger’s protestations otherwise, appears endemic. In a quick spin through ten “next blog” clicks, I found two obvious spam blogs: leftists bunting, which seems to mix autogenerated text with spammy links; kaar028, which links from the article titles and stuffs the bodies full of keywords.

Geektronica continues:

Spammers are becoming less obvious by creating posts that link to actual news articles (complete with excerpts); by all appearances, these blogs are just like scores of real blogs. But if you look at the code of the page, there are tons of external spam links, cleverly hidden by CSS. […] With this additional layer of subterfuge, it’s remotely possible that someone will even link to [such a] blog from their highly-ranked site.

[Note: the original post links to an example of a blog using this trick, which has since been removed by Blogger.]
So, while CSS has been an enormous boon to the web, in allowing web designers enormous flexibility and expressiveness, it's also handed a valuable weapon to spammers: you can never be sure that what you see is the same as what a machine sees.

Earlier this week I spotted a new example. This comment, on Accordion Guy’s blog, looks innocuous enough. But take a look at the source:

Good...<div style="position: absolute; top: -1000px; left: -1000px; visibility: hidden;">The true fast way to enjoy and catch luck is Free online poker. <A href="http://online-poker-rooms.t35.com/z1.html"><strong><font size="+2">Hundreds fans come onto Free online poker constantly. </font></strong></A>. Invite your friends about Free online poker immediately and to get true real cash together. </div>
Yep: there’s spam there, safely tucked out of sight off the top-left of the page.
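
A rough sketch of how a comment system might flag this sort of thing, assuming Python and its standard-library HTML parser. The pattern list and the names `HIDING_PATTERNS`, `HiddenStyleFinder` and `find_hidden_styles` are invented for illustration; a determined spammer has plenty of other ways to hide content, so treat this as a heuristic rather than a defence.

```python
import re
from html.parser import HTMLParser

# Heuristic patterns for inline styles that push content out of view.
# Illustrative assumptions only, not an exhaustive list.
HIDING_PATTERNS = [
    re.compile(r"visibility\s*:\s*hidden", re.I),
    re.compile(r"display\s*:\s*none", re.I),
    # Large negative offsets or indents, e.g. "top: -1000px" or "text-indent: -9000px".
    re.compile(r"(?:top|left|text-indent)\s*:\s*-\d{3,}px", re.I),
]


class HiddenStyleFinder(HTMLParser):
    """Collects (tag, style) pairs whose inline style matches a hiding pattern."""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style" and value and any(p.search(value) for p in HIDING_PATTERNS):
                self.suspicious.append((tag, value))


def find_hidden_styles(html):
    """Return every (tag, style) pair in `html` that looks like a hiding trick."""
    finder = HiddenStyleFinder()
    finder.feed(html)
    finder.close()
    return finder.suspicious


if __name__ == "__main__":
    comment = (
        'Good...<div style="position: absolute; top: -1000px; '
        'left: -1000px; visibility: hidden;">hidden text</div>'
    )
    for tag, style in find_hidden_styles(comment):
        print(f"suspicious <{tag}> with style: {style}")
```

The same patterns would also catch the -9000px text indent from the WordPress case, since that trick relies on an inline or stylesheet rule of the same shape. Rules kept in an external stylesheet are out of reach of a check like this.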

It would seem that Blogware doesn’t properly sanitize HTML in comments, and lets the style attribute through. That’s a dangerous practice: comments come from outside the system, so they should not be trusted. Mark Pilgrim talked about the dangers of untrusted HTML back in 2003; although he was talking about HTML in RSS feeds, the points he raises and the suggestions he makes are just as valid for comments:

HTML is nasty. Arbitrary HTML can carry nasty payloads: scripts, ActiveX objects, remote image web bugs, and arbitrary CSS styles that [...] can take over the entire screen.

Still, dealing with arbitrary HTML is not impossible. [...] I offer this advice:

  • Strip script tags. This almost goes without saying. [...]
  • Strip embed tags.
  • Strip object tags.
  • Strip frameset tags.
  • Strip frame tags.
  • Strip iframe tags.
  • Strip meta tags, which can be used to hijack a page and redirect it to a remote URL.
  • Strip link tags, which can be used to import additional style definitions.
  • Strip style tags, for the same reason.
  • Strip style attributes from every single remaining tag. [...]
Alternatively, you can simply strip all but a known subset of tags. Many comment systems work this way. You’ll still need to strip style attributes though, even from the known good tags.
A quick play around with comment previewing suggests that Blogger does quite well on these, disallowing all the tags above and more, although it’s not clear whether it strips the style attribute entirely or only filters specific style properties. Blogware does quite poorly, rejecting the <script> tag but appearing to allow everything else. Unless Blogware performs more stringent validation or stripping on submit than it does on preview, it’s handing malicious commenters quite an arsenal to work with.
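
For the curious, here is a minimal sketch of the "known subset of tags" approach Pilgrim describes, again assuming Python's standard-library parser. The allowlist contents and the names `ALLOWED`, `CommentSanitizer` and `sanitize` are my own illustrative choices, not anything Blogger or Blogware is known to use. Anything not on the list is dropped, which takes care of the style attribute along with everything else.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: tag name -> attributes permitted on that tag.
ALLOWED = {
    "a": {"href"},
    "b": set(), "i": set(), "em": set(), "strong": set(),
    "p": set(), "br": set(), "blockquote": set(),
}
DROP_CONTENT = {"script", "style"}  # discard these tags and the text inside them
VOID = {"br"}                       # tags that never get a closing tag
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class CommentSanitizer(HTMLParser):
    """Re-emits only allowlisted tags and attributes; all other markup is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # unknown tag: drop the tag itself, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # this is where style, onclick, etc. disappear
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # refuse javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    """Return `html` reduced to the allowlisted tags and attributes."""
    parser = CommentSanitizer()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('Nice <div style="top: -1000px">post <b>indeed</b></div>'))
```

Note that the text inside a disallowed tag is kept, so a hidden spam block like the one above would end up in plain view, where a human can spot and delete it. This sketch makes no attempt to balance unclosed tags; a real comment system would need to handle that too.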

Categories: Spam

Comments:

You don't seem to have trackbacks, so I'll shamelessly point to a post on my blog where I've gone off on a serious tangent, starting here. :)