blog.kfish.org

My name is Conrad Parker, and I live in Kyoto, Japan. I work with Renesas in Tokyo, designing the Linux multimedia architecture for a new line of mobile processors; and for Wikimedia Foundation, working on Ogg integration for Mozilla Firefox. I am also working towards a PhD in Computer Science at Kyoto University. Free software projects include the Sweep sound editor and the Annodex media system, and various smaller ones that you can read about here.

Follow me on Twitter: @conradparker.

Monday, 9 March 2009

The economics of Twitter spam

Recently more and more people have reported that they are being followed by spammers on Twitter. It's easy to track this problem: just search for #spam. Being followed by a Twitter spammer isn't like being stalked by a murderer; actually in the current environment, these guys are a fairly benign parasite that can work in your favor. So let's look at the economics of Twitter spam.

The upside for spammers is the usual obvious SEO shite: you've got something useless to peddle (yourself, your scam, your illegitimate business selling poor copies of pretentious luxury goods, your legitimate business selling enhancement placebos to suckers); you spend your time trying to defile fine and upstanding web pages with links to your pathetic piece of virtual real estate; Twitter comes along and your primitive brain realizes it can post its links there. You follow people so that they get a notification in their email pointing to your Twitter feed. Maybe they read it, maybe they click the tinyurl-obscured link. You cream yourself if they choose to follow you, because then they'll get all your spam, and you'll look more legit by having actual followers (like, real people from outside your cluster of bots and morons).

Now, what's the upside for normal humans in being followed by these scum?

Knowledge is work, a means for putting food on the table; information is power, a means for taking food from others.

Following as many people as you can on Twitter is a useful way to stay in front of your game: you know what people are up to, you see trends evolve, you get notice of articles before they're syndicated, you watch news unfold in your little niche of the world. And of course, the more people that follow you, the further your own message spreads: how great you are, how you're beating the system, how your pretentious beautiful designs and products can uplift and empower.

So there's an incentive to increase both the number of people you follow and the number of people who follow you. The first is easy; you just find people and press their button. The second is more difficult: you need to say something worthwhile in your tweets. Sometimes, not always, people will reciprocate when you follow them -- (SEO tip here!:) it helps if your own tweets are interesting.

However, there is a 2000 following limit: you can't follow more than 2000 people until you have 2000 followers. So, if you want to expand your reach into the info-verse, every follower counts -- even those spambots. So, now, these guys have evolved a little symbiotic, parasitic relationship with their hosts (you). You feel the first bite when they follow, but it feeds your ego. All you need is followers! no-one's going to do background checks on your popularity!

Relevance ranking anyone?

There's more to it though: Twitter search is currently being rolled out across the default user interface, and various bloggers are describing Twitter as a "search engine" (apparently that's the appropriate noun to describe someone that collects ideas). Twitter search is currently a realtime feed of query matches (the zeitgeist! *fap* *fap* *fap*) with no relevance ranking. As the search feature gains usage, people will want relevant results to more complex queries. An obviously useful ranking input is the number of followers that a Twit has. These spambots will make you appear relevant!

We can follow this down silly paths -- eg. the more you tweet, the more spambot-followers you get, the more ranking relevance you have. The spammers introduce an incentive to posting often, and that mechanism has positive feedback.

More useful ranking mechanisms are things like reply frequency and analysis of re-tweets. Re-tweets are interesting to track because you can find the users who originate popular ideas: give them the microphone, dammit.

Action items

So there's an imbalance in the Twitter economy. Spammers are using Twitter and the environment encourages it.

Wishlist for Twitter:

  • Track how often users are blocked; warn those who are blocked often, then auto-ban them.
  • Add user-initiated "Report spammer" buttons.
  • Implement detection of spammer clusters and auto-ban them.

Action items for Twitter users:

  • Block spammers on Twitter.
  • Block spammers on Twitter.
  • Block spammers on Twitter.

Please rant about how much you love the symbiotic parasitic relationship with your spambot-followers!

Sunday, 1 March 2009

Random code: Pretty printing durations in Haskell

Recently I've really enjoyed reading blog posts which just explain a little bit of code, so that's what this is. I had this code lying around from a few months ago so I added some context and links. It combines two of my favourite things: Annodex and Haskell!

YouTube's video offset syntax

Some time last year, YouTube introduced a feature which allows you to specify a hyperlink that plays a video from a given time offset. If you used the syntax on a random video site, it would look like this:

http://www.example.com/player.html#t=3m54s

The syntax for this is very close to that which we use in Annodex for Temporal URIs, now running on Archive.org (and soon on Wikipedia):

http://www.example.com/video.ogv?t=3:54

Two differences:

1. YouTube uses a fragment instead of a query parameter.

A fragment is something starting with '#' that tells the client to jump to a particular offset in the document -- in general the fragment text is never seen by the server. In the case of YouTube the HTML page contains JavaScript that tells the embedded Flash video player to seek to the offset in the video.

Fragments are useful in this use case, where you are instructing the embedding web page to play the video from a given time offset. How it actually retrieves the video from the network is not specified, but importantly there is no requirement for the embedding web page to be reloaded.

(This distinction between fragments and queries is part of the W3C Media Fragments WG discussion on syntax.)

2. YouTube's syntax uses unit markers h, m, s to separate the parts of the time, whereas our specification uses colon-separated fields, the kind of notation common in industrial equipment (and clock radios).

Perhaps one advantage of the format YouTube have chosen is readability: sometimes it is difficult to read times such as 03:36:14.

http://www.example.com/video.ogv?t=3:54
http://www.example.com/video.ogv?t=00:03:54.000
http://www.example.com/video.ogv?t=npt:00:03:54.000
http://www.example.com/video.ogv?t=smpte-25:00:03:54::0

We had a recent discussion about these issues in the Media Fragments WG: Action-28: updated syntax document with time formats. I'm pretty happy with the syntax we have settled on, allowing for both readable short timestamps and more accurate long ones.
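
To make the two styles concrete, here is a minimal sketch of parsing each into a number of seconds. The names parseUnits, parseColons and splitOn are made up for this post, not part of Annodex or any spec, and the sketch deliberately ignores fractional seconds and the npt: and smpte prefixes:

```haskell
import Data.Char (isDigit)

-- Hypothetical helper: parse a unit-marker offset such as "1h3m54s"
-- into seconds. Returns Nothing on anything it does not recognise.
-- It does not enforce the order of the units.
parseUnits :: String -> Maybe Integer
parseUnits "" = Nothing
parseUnits str = go str
  where
    go "" = Just 0
    go s = case span isDigit s of
      (ds@(_:_), u:rest) -> do
        mult <- lookup u [('h', 3600), ('m', 60), ('s', 1)]
        more <- go rest
        Just (read ds * mult + more)
      _ -> Nothing

-- Hypothetical helper: parse a colon-separated offset such as "3:54" or
-- "00:03:54" into seconds. Fractional seconds are not handled here.
parseColons :: String -> Maybe Integer
parseColons s
  | length fields > 3 = Nothing
  | all valid fields  = Just (foldl (\acc f -> acc * 60 + read f) 0 fields)
  | otherwise         = Nothing
  where
    fields = splitOn ':' s
    valid f = not (null f) && all isDigit f

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (a, [])     -> [a]
  (a, _:rest) -> a : splitOn c rest
```

Both parseUnits "3m54s" and parseColons "3:54" come out as Just 234, which is the point: the two notations carry the same information.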

Pretty printing of durations

Anyway, I was bored so I hacked up a sweet fold to display the format used by YouTube.

Haskell hackers use folds like C programmers use for loops; the Haskell wiki page Fold is a beautiful introduction to the topic. My favourite Web 1.0 interactive visualization of a left fold is at foldl.com (and also be sure to check out its companion site for right folds, foldr.com).

Here's a concise fold that gets us most of the way to the right syntax:

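> -- Each label names the remainder left after dividing by the paired number:
> -- dividing hours by 24 leaves hours, dividing days by 365 leaves days.
> ts = [("ms", 1000), ("s", 60), ("m", 60), ("h", 24), ("d", 365)]
>
> -- The quotient left over after the last step is the number of years.
> duz ms = show years ++ "y" ++ ss
>   where (ss, years) = foldl (\(acc, x) (s, y) -> (show (rem x y) ++ s ++ acc, quot x y)) ("", ms) ts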

Yeah, concise. Read it slow! If it were in C or Python, that one-liner would be a 5 or 10 line loop.

You might say that you use the fold function to iterate through a list of time units, and at each step of the iteration you do an integer division by the unit, label the remainder, and pass the quotient on to the next step of the iteration. A real Haskell programmer, however, might say something like "you fold the duration quotiently through the units, labelling into the syntax!", with much wringing of hands and wishful glances for abstract ponies. Fold is a verb, because functions are alive! Quotiently is not a word.
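
If the fold is hard to picture, one way to watch it work is to swap foldl for scanl, which keeps every intermediate accumulator instead of only the last. This is just a sketch: the names units, step and steps are mine, and the labels name the remainder at each step (hours for the division by 24, days for the division by 365).

```haskell
-- Unit labels paired with how many of that unit make up the next one up.
-- The label choice here is an assumption of this sketch.
units :: [(String, Integer)]
units = [("ms", 1000), ("s", 60), ("m", 60), ("h", 24), ("d", 365)]

-- One step of the fold: label the remainder, pass the quotient along.
step :: (String, Integer) -> (String, Integer) -> (String, Integer)
step (acc, x) (label, size) = (show (rem x size) ++ label ++ acc, quot x size)

-- scanl is foldl that returns the accumulator after every step.
steps :: Integer -> [(String, Integer)]
steps ms = scanl step ("", ms) units
```

For 234000 milliseconds, steps yields ("", 234000), ("0ms", 234), ("54s0ms", 3), ("3m54s0ms", 0), ("0h3m54s0ms", 0) and finally ("0d0h3m54s0ms", 0): the string grows on the left while the quotient shrinks on the right.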

A problem with duz (apart from the crappy name) is that it prints every unit even when it is zero, so 3m54s comes out padded with a leading 0y and a trailing 0ms. The next implementation of duration strips the leading and trailing zeroes:

> dur ms = years:rest
>   where (rest, years) = foldl (\(ds, x) y -> ((rem x y):ds, quot x y)) ([], ms) [1000, 60, 60, 24, 365]
>
> duration ms = concat $ map (\(n, s) -> show n ++ s) (takeWhile (not . zero) $ dropWhile zero labelled)
>   where labelled = zip (dur ms) ["y", "d", "h", "m", "s", "ms"]
>         zero (n, _) = (n==0)

eg. to display the duration of 2^32 milliseconds:

*Main> duration (2^32)
"49d17h2m47s296ms"

*Main> duration 3600000
"1h"
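
One thing to be aware of: takeWhile (not . zero) stops at the first zero it meets after the leading run, so a duration with a zero in the middle gets cut short (one hour and five seconds prints as "1h"). If you would rather keep the interior zeroes, a possible variant uses dropWhileEnd from Data.List to trim only the tail. The names parts and durationKeep are hypothetical, chosen so as not to clash with the code above:

```haskell
import Data.List (dropWhileEnd)

-- Split milliseconds into [years, days, hours, minutes, seconds, ms].
parts :: Integer -> [Integer]
parts ms = years : rest
  where
    (rest, years) = foldl (\(ds, x) y -> (rem x y : ds, quot x y)) ([], ms)
                          [1000, 60, 60, 24, 365]

-- Hypothetical variant of duration: trims zeroes only at the two ends,
-- so zeroes between non-zero units are kept.
durationKeep :: Integer -> String
durationKeep ms = concatMap (\(n, s) -> show n ++ s) trimmed
  where
    labelled = zip (parts ms) ["y", "d", "h", "m", "s", "ms"]
    zero (n, _) = n == 0
    trimmed = dropWhileEnd zero (dropWhile zero labelled)
```

With this, durationKeep 3605000 gives "1h0m5s", while durationKeep (2^32) still gives "49d17h2m47s296ms". Whether "1h0m5s" or "1h5s" is the nicer output is a matter of taste; this sketch just keeps the zeroes.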

Fold is a generic list processing device; if you want to limit the amount of the list that is processed, you can use functions like takeWhile and dropWhile. These will take, or drop, elements from the list as long as some criterion is satisfied; you can use them both together to trim both the start and end of the list. Of course you can use these on the input list to limit what data is processed; but because Haskell evaluates lazily, you can also use these on the output list to limit how much of the processing is actually done (like in duration above). The bits of the evaluation that don't really need to get done, aren't: the idea of doing them is written down (on a "thunk") and thrown away. Burn your todo lists! Be lazy lazy lazy! Haskell rules. Do you like verbs?
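
A tiny sketch of that laziness, separate from the duration code (the names risky, safePrefix and smallSquares are made up for illustration):

```haskell
-- A list whose last element would crash the program if it were evaluated.
risky :: [Integer]
risky = [1, 2, 0, error "never evaluated"]

-- takeWhile stops at the first 0, so the error element is never forced.
safePrefix :: [Integer]
safePrefix = takeWhile (/= 0) risky

-- The same idea on an infinite list: only the needed prefix is computed.
smallSquares :: [Integer]
smallSquares = takeWhile (< 50) (map (^ 2) [1 ..])
```

Here safePrefix is [1,2] and smallSquares is [1,4,9,16,25,36,49]; neither the error nor the rest of the infinite list is ever touched.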
