Gaming the system: hidden ads and comment spam
One twist on this that seems to be increasing recently: spam that’s visible only to search engines. CSS makes it relatively easy to include elements on a page which are made invisible to readers: one way to achieve this is to position the elements outside the page boundary.
A recent high-profile case was this story, about hidden articles on the Wordpress website:
These articles are designed specifically to game the Google Adwords program, written by a third-party about high-cost advertising keywords like asbestos, mesothelioma, insurance, debt consolidation, diabetes, and mortgages.
The twist in this scheme: hoist these hidden articles up the Google rankings by linking to them from the very-highly-ranked Wordpress home page. Arve Bersvendsen describes how:
The key here being the -9000px text indent: This makes the link invisible to human visitors with CSS, and visible to every search engine on the planet.
After a community outcry, the articles and the hidden links were removed.
More recently, The Republic of Geektronica discussed BlogSpot spam blogs:
A large percentage (maybe up to a third) of all Blogspot blogs are spam-logs—sites created to increase the Google ranking of some other site (which is itself usually a Google-spamming site). The ultimate purpose of these spamlogs is usually to drive traffic to a commission-paying pharmacy, pr0n, or casino site.
BlogSpot spam, despite Blogger’s protestations otherwise, appears endemic. In a quick spin through ten “next blog” clicks, I found two obvious spam blogs: leftists bunting, which seems to mix autogenerated text with spammy links; kaar028, which links from the article titles and stuffs the bodies full of keywords.
Spammers are becoming less obvious by creating posts that link to actual news articles (complete with excerpts); by all appearances, these blogs are just like scores of real blogs. But if you look at the code of the page, there are tons of external spam links, cleverly hidden by CSS. […] With this additional layer of subterfuge, it’s remotely possible that someone will even link to [such a] blog from their highly-ranked site.
So, while CSS has been an enormous boon to the web, in allowing web designers enormous flexibility and expressiveness, it’s also handed a valuable weapon to spammers: you can never be sure that what you see is the same as what a machine sees.
[Note: the original post links to an example of a blog using this trick, which has since been removed by Blogger.]
Earlier this week I spotted a new example. This comment, on Accordion Guy’s blog, looks innocuous enough. But take a look at the source:
Good...<div style="position: absolute; top: -1000px; left: -1000px; visibility: hidden;">The true fast way to enjoy and catch luck is Free online poker. <A href="http://online-poker-rooms.t35.com/z1.html"><strong><font size="+2">Hundreds fans come onto Free online poker constantly. </font></strong></A>. Invite your friends about Free online poker immediately and to get true real cash together. </div>
Yep: there’s spam there, safely tucked out of sight off the top-left of the page.
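The hiding tricks mentioned in this post (a large negative text indent, absolute positioning far off the page, visibility: hidden) all show up in inline style attributes, so they can be flagged mechanically. What follows is only a rough sketch using Python's standard html.parser module. The names find_hidden_elements and looks_hidden, and the 500px threshold, are assumptions made for illustration; it does not look at external stylesheets or style blocks.

```python
import re
from html.parser import HTMLParser

# Assumed threshold: a pixel offset this far negative is treated as "off the page".
OFFSCREEN_PX = 500


def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    props = {}
    for decl in style.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props


def looks_hidden(style):
    """Return True if an inline style uses one of the hiding tricks above."""
    props = parse_style(style)
    if props.get("visibility") == "hidden" or props.get("display") == "none":
        return True
    for name in ("text-indent", "top", "left"):
        match = re.fullmatch(r"(-?\d+(?:\.\d+)?)px", props.get(name, ""))
        if match and float(match.group(1)) <= -OFFSCREEN_PX:
            return True
    return False


class HiddenStyleFinder(HTMLParser):
    """Collect (tag, style) pairs for elements whose inline style hides them."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style" and value and looks_hidden(value):
                self.hits.append((tag, value))


def find_hidden_elements(html):
    finder = HiddenStyleFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

Run against the comment source shown above, this would flag the div; run against a link styled with text-indent: -9000px, it would flag the link. A spammer who moves the rules into a stylesheet or uses other units slips straight past it, which is one reason stripping style outright is the safer policy for untrusted input.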
It would seem that Blogware doesn’t properly sanitize HTML in comments, allowing the style attribute through. A dangerous practice, given that comments come from outside the system and so should not be trusted. Mark Pilgrim talked about the dangers of untrusted HTML back in 2003; although he’s talking about HTML in RSS feeds, the points he raises and the suggestions he makes are just as valid for comments:
HTML is nasty. Arbitrary HTML can carry nasty payloads: scripts, ActiveX objects, remote image web bugs, and arbitrary CSS styles that [...] can take over the entire screen.
Still, dealing with arbitrary HTML is not impossible. [...] I offer this advice:
- Strip script tags. This almost goes without saying. [...]
- Strip embed tags.
- Strip object tags.
- Strip frameset tags.
- Strip frame tags.
- Strip iframe tags.
- Strip meta tags, which can be used to hijack a page and redirect it to a remote URL.
- Strip link tags, which can be used to import additional style definitions.
- Strip style tags, for the same reason.
- Strip style attributes from every single remaining tag. [...]
Alternatively, you can simply strip all but a known subset of tags. Many comment systems work this way. You’ll still need to strip style attributes though, even from the known good tags.
A quick play around with comment previewing suggests that Blogger does quite well on these, disallowing all the tags above and more, although it’s not clear if it disallows the style attribute completely or whether it simply disallows or allows specific style properties. Blogware does quite poorly, rejecting the <script> tag but appearing to allow everything else. Unless Blogware performs more stringent validation or stripping on submit than it does on preview, it’s handing malicious commenters quite an arsenal to work with.
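To make the "known subset of tags" advice concrete, here is a minimal sketch of an allow-list comment sanitizer built on Python's standard html.parser module. It is not how Blogger or Blogware work internally (this post has no visibility into either). The allow-list, the permitted URL schemes, and the names CommentSanitizer and sanitize are all assumptions chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list: tag -> attributes permitted on that tag.
# "style" is never listed, so style attributes are always stripped.
ALLOWED = {
    "a": {"href"}, "b": set(), "i": set(), "em": set(),
    "strong": set(), "p": set(), "blockquote": set(), "code": set(),
}
# Elements whose text content is code rather than prose, so it is dropped too.
DROP_CONTENT = {"script", "style"}
# Assumed set of acceptable link schemes.
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class CommentSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skipping += 1
            return
        if self.skipping or tag not in ALLOWED:
            return  # drop the tag itself; any text inside is still kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skipping = max(0, self.skipping - 1)
        elif not self.skipping and tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = CommentSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Every tag in the quoted list (script, embed, object, frameset, frame, iframe, meta, link, style) is dropped simply because it is not in the allow-list. Note the side effect on a comment like the poker one: with the div and its style attribute gone, the spam text is no longer hidden, so a human moderator can see it and delete it. A production system would more likely rely on a maintained sanitizing library than on a hand-rolled parser like this.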