About

What this site does

This site uses a combination of Yahoo Pipes, Wordpress and the Yaab autoblogging plugin to plunder the Wikipedia:Recent Changes pages for entries that contain offensive language.

The Yahoo Pipes feed takes the RSS feed from the Wikipedia page, and filters it for certain offensive terms. The Yaab Wordpress plugin checks the RSS output from the pipe and publishes each item as a separate blog post.
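For the curious, here is a rough Python sketch of the kind of filtering step described above. It is not the actual pipe: the term list, the sample feed and the function name filter_items are all made up for illustration, and the real work is done by Yahoo Pipes and Yaab rather than by any code of mine.

```python
import xml.etree.ElementTree as ET

# Placeholder terms (an assumption); the real filter list is not shown here.
OFFENSIVE_TERMS = ["poop", "stupid"]

# A tiny hand-written feed standing in for the Recent Changes RSS output.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Recent changes</title>
    <item><title>Cheese</title><description>Added a citation</description></item>
    <item><title>Moon</title><description>the moon is STUPID</description></item>
  </channel>
</rss>"""


def filter_items(rss_text, terms=OFFENSIVE_TERMS):
    """Return the feed items whose title or description contains any term."""
    root = ET.fromstring(rss_text)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Case-insensitive substring search across both fields.
        haystack = (title + " " + description).lower()
        if any(term in haystack for term in terms):
            matches.append({"title": title, "description": description})
    return matches


# Each matching item would then become its own blog post.
for post in filter_items(SAMPLE_RSS):
    print(post["title"], "->", post["description"])
```

Only the second item in the sample feed gets through the filter, so only that one would be published as a post.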

What is the point?

1) Well, you could use it to identify and fix certain bits of vandalism on Wikipedia. Indeed, I have done this in the past. However, there is at least one interesting bot, Cluebot, in operation on Wikipedia that seems to be doing a better job, and with much more sophisticated logic.

2) It’s an opportunity for me to play around with some interesting services to aid my understanding of them.

3) There is a certain puerile delight to be gained from dipping into the output of the Wikipedia vandal, who typically seems to be a bored schoolboy reluctantly doing some research for homework, and who stumbles upon the marvellous discovery that you can edit the Internet when you’re supposed to be studying.

Known flaws

This is version 0.1. There are a lot of issues to be dealt with. In no particular order, they are:

1) Styling: the mark-up on the content doesn’t fit well with the standard Wordpress template. Most posts therefore look a bit horrible.

2) Logic: currently the Yahoo Pipes filter does a simple search for strings representing a handful of offensive words. This produces a number of false positives, as when someone edits a page containing the word “Scunthorpe” or “Widow Twankey”, or indeed when someone edits a page that legitimately contains the relevant terms. It also misses a lot of other offensive vandalism, largely because there are limitless ways of being offensive, and because vandals do not always spell correctly!

3) Completeness: apart from logic issues, there are problems with how often the feed is updated, caching issues and so on.

4) Duplication issues

5) Apparent removal from Google index as of 28 January 2009

6) Working out why the autoblogger disappeared on 27 January 2009. It might have been something to do with leaving a half-edited post unsaved.

7) Completing this list.
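
The false positives mentioned under Logic above are easy to reproduce. The sketch below is Python rather than anything running on this site, and the one-word term list and both function names are assumptions chosen for illustration. It compares a plain substring search with a whole-word search.

```python
import re

# Hypothetical term list; the site's real list is not published here.
TERMS = ["wank"]


def substring_match(text, terms=TERMS):
    """Flag text if any term appears anywhere, even inside a longer word."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def whole_word_match(text, terms=TERMS):
    """Flag text only if a term appears as a standalone word."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


for sample in ["Widow Twankey", "what a wank"]:
    print(sample, substring_match(sample), whole_word_match(sample))
```

The substring search flags both samples, while the whole-word search flags only the second. Whole-word matching is no cure-all, though: it still flags pages that legitimately contain a term, and it misses misspellings and longer variants of a word just as before.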