I’ve written before about Amazon Web Services (AWS), and the Amazon Mechanical Turk (AMT) service in particular. This month, I’m going to tell you about my experiences with AMT to solve a problem that’s been vexing me and my pal Ryan for some time now.
But first, a little background. The original Mechanical Turk was a chess-playing “machine” built in 1769, which featured the head and torso of a man dressed in Turkish robes and a turban, hence the name. Using its mechanical arms and hands, the Turk would move pieces on the chessboard. About 1820, it was revealed that the machine was cleverly constructed to let a small person sit inside, providing both the intelligence and motive power behind the Turk.
AWS is an umbrella for a number of hosted services, including its cloud-based storage system, Amazon Simple Storage Service (S3); its on-demand computing environment, Amazon Elastic Compute Cloud (EC2); and, of course, AMT, which is described as an “on-demand workforce.” The latest addition to AWS is Elastic MapReduce. Although the name is aimed at the developer audience, in simple terms, Elastic MapReduce makes the speed of “divide and conquer” computing available on a metered basis. MapReduce is the approach Google uses to return search results at blinding speed: a search request is mapped across multiple computers, and the results are then reduced into a single list. Amazon is a real leader in this area, and it pays to stay abreast of what it’s developing at AWS.
The AMT is sort of a human-based version of MapReduce. Take a repetitive task and, using technology, divide it among lots of people. Do you have 10,000 pictures and want to know which ones have a mountain in them? A perfect job for AMT. Anytime you have a task that requires human input, but would take one person a long time to complete, AMT is something to consider. Let me tell you my story.
Ryan “Fritz” Holznagel is a friend of mine. He’s also a past Jeopardy Tournament of Champions winner and generally a mensch. In 1998, after leaving Lycos, he started an “encyclopedia of famous people” called Who2 (www.who2.com). I’ve known Fritz since 1994 and have served as the Who2 tech guru since 2002 or thereabouts. It’s a fun hobby site for both of us, since its ad-based income, while pretty good, isn’t enough to support either of us full-time. It’s one way of making sure my hands-on technical skills don’t completely dissolve.
In addition to a good capsule biography and basic facts about a given famous person, each profile on Who2 features “Four Good Links,” selected by human editors as the four best sites available to provide additional details about that individual. And that’s where my story begins.
Like everyone else, we’d like to be the first result when someone searches for any of the 3,800-plus people that Ryan and our other editors have profiled. One factor search engines consider is how many sites link to yours, which is (to some extent) a measure of your “authority.” Alas, lots of people link to Wikipedia, not so many to Who2. So we wanted to ask the owners of the roughly 13,600 Web pages we link to via Four Good Links to, please, link back to us. The problem? How do you find the email address for the owner of each of those pages?
Several years ago, before AMT existed, I tried to create a program that would automatically sift through all those pages, and, using some heuristics, come up with an email address. It didn’t work all that well, because the problem really needs a real, live human being to determine whether an email address on a Web page is likely to belong to the person who owns that page.
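That old script is long gone, but the heuristic approach looked something like this sketch (a hypothetical reconstruction, not the original code): scrape every email-shaped string from a page and prefer mailto: links, which are a little more likely to belong to the page’s owner.

```python
import re
from collections import Counter

# A simple (and, as I learned, inadequate) heuristic for guessing
# the owner's email address from a page's HTML.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def guess_owner_email(html):
    # mailto: links are the strongest signal available to a program
    mailtos = re.findall(r"mailto:([\w.+-]+@[\w-]+\.[\w.-]+)", html)
    if mailtos:
        return Counter(mailtos).most_common(1)[0][0]
    # fall back to the most frequent email-shaped string anywhere on the page
    plain = EMAIL_RE.findall(html)
    return Counter(plain).most_common(1)[0][0] if plain else None

page = '<p>Contact <a href="mailto:jane@example.com">Jane</a> for details.</p>'
print(guess_owner_email(page))  # jane@example.com
```

The trouble, of course, is that a program can’t tell the site owner’s address from a webmaster alias, an advertiser, or a fan who left a comment. That judgment call is exactly what you need a human for.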
This is the sort of problem for which AMT is perfectly suited: lots of relatively simple tasks, each of which requires a human. I started by creating a Human Intelligence Task (HIT), basically a Web page that displays information and requests human input. In my case, each HIT asked a worker to look up the email address of the owner of a single Web page.
It’s pretty simple to create a HIT, since Amazon provides sample templates to start with, and my task matched up with its “data collection” sample template very well. I ended up with a page that showed some instructions, a link to the Web page in question, and a box for the human worker (“Turker” in their parlance) to enter their answer. There are spaces in the template where you specify the piece(s) of information that will change with each HIT, to be replaced by actual data from a file you supply. In my case, I had a title and a Web address (URL) for each page that I wanted a Turker to review for an email address.
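Conceptually, the template mechanism is plain variable substitution. Here’s a rough sketch of the idea (the placeholder names “title” and “url” are my own choices for illustration, not anything Amazon dictates):

```python
from string import Template

# A stand-in for an AMT "data collection" template: fixed instructions
# plus placeholders that get filled in from each row of your data file.
hit_template = Template(
    "Find the email address of the owner of this Web page:\n"
    "  ${title}\n"
    "  ${url}\n"
    "Enter the address (or 'none') in the box below."
)

# One dictionary per HIT, with keys matching the placeholders
rows = [
    {"title": "Abraham Lincoln Biography",
     "url": "http://www.example.com/lincoln"},
]

for row in rows:
    print(hit_template.substitute(row))
```

Each row of your uploaded data produces one HIT, with the blanks filled in; everything else on the page stays the same.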
You also have to specify information about the task itself. The obvious one is how much you’re willing to pay for each completed task. For example, if my “find the email address of the owner of this Web page” task takes two minutes (30 HITs an hour), then paying a Turker 25 cents to complete the task works out to paying about $7.50 an hour. Amazon recommends paying on this equivalent hourly basis.
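The back-of-the-envelope math is simple enough to sketch:

```python
# Hourly-rate arithmetic from the example above: a two-minute task
# means 30 HITs an hour, so a 25-cent reward is $7.50 an hour.
def hourly_rate(reward_dollars, minutes_per_hit):
    hits_per_hour = 60 / minutes_per_hit
    return reward_dollars * hits_per_hour

print(hourly_rate(0.25, 2))            # 7.5
print(f"${hourly_rate(0.05, 2):.2f}")  # the 5-cent rate I used in my test
```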
Fortunately, you can review results before authorizing payment on a particular HIT. Obviously, a detailed review of thousands of results is difficult, so AMT lets you assign a task to multiple users. For example, I could assign each of my HITs (a single Web page) to three Turkers, and pay only if two or more agreed on the answer to my question. If you want high accuracy, this approach can make sense.
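The “pay only if two or more agree” review boils down to a majority vote over each HIT’s answers, something like this sketch:

```python
from collections import Counter

# Accept a HIT's answer only when at least `quorum` Turkers agree;
# otherwise flag it (None) for manual review or rejection.
def majority_answer(answers, quorum=2):
    if not answers:
        return None
    answer, count = Counter(a.strip().lower() for a in answers).most_common(1)[0]
    return answer if count >= quorum else None

print(majority_answer(["jane@example.com", "JANE@example.com", "none"]))
# jane@example.com
print(majority_answer(["a@x.com", "b@y.com", "none"]))
# None -- no two workers agreed
```

At three assignments per HIT, this triples your cost, so it’s a trade-off between accuracy and budget.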
Once the template is ready, you’ve set up a positive balance in your account, and you’ve uploaded your data in comma-separated-value (CSV) form, you’re ready to go. I started with a small test of 100 HITs, paying $0.05 for each and assigning each task to only one Turker. Amazon recommends starting small, and I agree. Although it’s a powerful system, there’s a bit of trial and error in finding what will give you the best results.
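The data file itself is nothing exotic: a header row whose column names match the template’s placeholders, then one row per HIT. A minimal sketch (again, the “title” and “url” names are my illustration):

```python
import csv
import io

# Two of the pages to check, as (title, url) pairs -- hypothetical examples
pages = [
    ("Abraham Lincoln Biography", "http://www.example.com/lincoln"),
    ("Cleopatra: Queen of the Nile", "http://www.example.com/cleopatra"),
]

# Build the CSV in memory; in practice you'd write it to a file and
# upload it through the AMT requester interface
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["title", "url"])  # header must match the template placeholders
writer.writerows(pages)
print(buf.getvalue())
```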
How did my quest for email addresses turn out? Tune in next month.