MTurk Quality Experiment #1

As I mentioned earlier on Twitter (nolar), I have been experimenting with Amazon MTurk. Here are the preliminary findings on the quality of the answers for one specific task: translating a text from English to Russian.

Initially I wanted to name this article “Zoo of Mechanical Turkeys”, but I changed my mind for the sake of better SEO (I mean relevancy, not the traffic and popularity chased by dark SEO). I don’t know why MTurk is so strongly associated with turkeys in my mind, since it has nothing to do with either the birds or the nationality.

I guess the results will show why. But before I give you the results, I want to describe what the task is and why it is so unusual.

Pathetic intentions

MTurk is a service for human intelligence tasks, i.e. tasks that computers cannot yet perform successfully: recognition of images, emotions, storylines, and meaning; item classification; text writing; paraphrasing; abstract writing; etc.

My task was all about translation. I hope none of you would say that computers can translate texts. If you have ever seen what they produce as a translation, you would not.

So, when you want a human translation, for example from English to Russian, you probably want it to be made by native Russian speakers. Unfortunately, MTurk has one significant disadvantage here: it works mostly within the U.S. If you are in the U.S., you can transfer your earned U.S. money to your U.S. bank account, and everything is fine. If you are outside of the U.S., well, sorry then. You can only spend your money in the Amazon online shop, which is not a universal solution, since you will have additional expenses for shipment (you are abroad, remember?) and customs fees. As a result, there are few Russian or other non-U.S. workers on MTurk who could translate into their native language; maybe only a few of those who are based in the U.S. but originally come from the country you are interested in.

Nevertheless, I decided to give it a try, mostly to test the MTurk API with the boto library, but also to see how bad things are.

Introduction to MTurk

First, I would like to explain how MTurk works in general and what the terms are.

As I said above, the main entity is the HIT (Human Intelligence Task). Everything revolves around HITs. There are two roles in MTurk: requester and worker. The former creates a HIT, the latter performs it.

A HIT consists of a few questions, and each question consists of structured content (the question itself) and the type of the expected result, i.e. the answer. An answer can be a selection of one or several options, a number, or free text. For translation it can obviously be free text only.

A HIT can also have a few qualifications attached. A qualification is a filter on the workers who may take this specific HIT. Qualifications can be standardized system-wide, such as the number or percentage of a worker’s approved or rejected tasks, and so on. Or they can be your own qualifications, assigned either manually or after a worker passes a test. Qualification values are usually measured on a scale of 0 to 100, though this is not required, and you require a worker’s value to be above or below a threshold.
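To make the idea of a qualification as a threshold filter more concrete, here is a minimal sketch in plain Python. It does not call MTurk or boto at all: the `Requirement` class, the `is_eligible` function, and the qualification names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One qualification requirement attached to a HIT (hypothetical model)."""
    name: str
    threshold: int
    at_least: bool = True  # True: value must be >= threshold; False: <= threshold


def is_eligible(worker_scores, requirements):
    """Return True only if the worker satisfies every requirement."""
    for req in requirements:
        value = worker_scores.get(req.name)
        if value is None:
            return False  # the worker was never granted this qualification
        if req.at_least and value < req.threshold:
            return False
        if not req.at_least and value > req.threshold:
            return False
    return True


# Illustrative requirements: one system-wide style metric, one custom one.
requirements = [
    Requirement("percent_approved", 90),
    Requirement("russian_translation", 80),
]

print(is_eligible({"percent_approved": 97, "russian_translation": 85}, requirements))
print(is_eligible({"percent_approved": 97}, requirements))
```

The second worker is filtered out simply because they have no value for the custom qualification, which matches the "assigned manually or after passing a test" case described above.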

The workflow from the requester side is as follows. The requester creates a HIT filled with questions and qualifications, and specifies the number of possible assignments, the expiration time, and other parameters. Then they monitor this HIT, either manually through the web interface or automatically through the API, to approve or reject the answers. Note that answers are automatically approved after a timeout that the requester specifies when creating the HIT.
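The effect of the auto-approval timeout can be sketched as a tiny model. This is not the MTurk API, only an illustration under my own assumptions: the function name `assignment_status`, its parameters, and the status strings are invented here.

```python
from datetime import datetime, timedelta


def assignment_status(submitted_at, now, auto_approval_delay, decision=None):
    """Status of one submitted answer in this simplified model.

    `decision` is the requester's explicit choice ("approved" or "rejected"),
    assumed to have been made before the timeout; None means no decision yet.
    """
    if decision in ("approved", "rejected"):
        return decision
    if now - submitted_at >= auto_approval_delay:
        return "approved"  # nobody reviewed it in time, so it is auto-approved
    return "submitted"     # still waiting for the requester's review


submitted = datetime(2011, 8, 10, 23, 0)    # an answer arrives late in the evening
next_morning = submitted + timedelta(hours=9)

# With a one-hour delay the answer is approved while the requester sleeps.
print(assignment_status(submitted, next_morning, timedelta(hours=1)))
# With a three-day delay it is still waiting for a manual review.
print(assignment_status(submitted, next_morning, timedelta(days=3)))
```

In this model the only thing that protects an unreviewed answer from being approved is the length of the delay, so a short delay effectively means "approve everything".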

The workflow from the worker side is as follows. The worker searches the system for HITs they are qualified for (or by keyword, or whatever). Then they look at the selected HIT, and if they want to and can, they accept it. After accepting, they have some time (specified by the requester) to answer the HIT and submit the answer. Once it is submitted, they can only wait until their answer is approved or rejected.

To be more precise, there are also HIT types (as seen by the requester) and HIT groups (as seen by the worker). But they are just a way to organize a flow of similar HITs with the same parameters but different questions. In our case they are not very important, so we will ignore them here.

First tries

After registering both as a requester and as a worker, and topping up my MTurk balance (for some reason it is neither merged nor connected with the usual AWS billing system), I tried to create one simple HIT like this:

The first answers were a lesson for me. Humans are cheaters, what a surprise! If you have not stated something explicitly, it will be interpreted in a totally different way than you expected. Even if only 1 out of 100 possible ways is unacceptable, all 100 answers will be done in that single unacceptable way.

The second attempt was to clarify that I needed a human translation, not a machine one. It looked like this, and it did not help much:

The other mistake I made was setting the auto-approval time to one hour. It means that if I have not approved or rejected an answer explicitly within one hour, it becomes approved automatically. This is totally wrong if you want to sleep or to be offline for a while.

Experiment

So, after a few short experiments with these HITs, I ended up with one very clearly formulated, nicely formatted, straightforward task:

The code for this task can be viewed here: https://gist.github.com/1139295

Just to measure how the price affects the quality, I launched this same task in two variants: one for $0.05 per assignment, the other for $0.50 per assignment (i.e. 10 times more!).

Results

I launched it for a long run, with lots of assignments. Finally, here are the statistics on approved and rejected answers, followed by some funny translations (they may be interesting for my Russian friends).

Price	Answers	Approved (count)	Approved (%)	Rejected (count)	Rejected (%)
$0.05	40	4.5	11.3%	35.5	88.7%
$0.50	72	11	15.3%	61	84.7%

As you can see, no matter how much you pay, the quality is the same: about 10-15% of the answers are acceptable. A higher price just gets you more answers in absolute numbers, which means you have more to choose from.
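For anyone who wants to recheck the arithmetic, here is a small sketch that recomputes the approval rates from the counts in the table. The dictionary layout and the helper name `approval_rate` are mine; only the numbers come from the experiment.

```python
# Counts taken from the table above: price -> (answers, approved).
results = {
    "$0.05": (40, 4.5),
    "$0.50": (72, 11),
}


def approval_rate(answers, approved):
    """Share of approved answers, in percent."""
    return 100.0 * approved / answers


for price, (answers, approved) in results.items():
    rate = approval_rate(answers, approved)
    print(f"{price}: {approved} of {answers} approved ({rate:.2f}%)")

# Paying 10 times more changes the approval rate only slightly.
cheap = approval_rate(*results["$0.05"])
expensive = approval_rate(*results["$0.50"])
print(f"rate ratio: {expensive / cheap:.2f}")
```

The price grows tenfold while the approval rate grows by a factor well below two, which is the whole point of the table.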

For those of you who can read Russian, here are some funny translations:

i just want to more that text for translations because to translate the text my Russian. must good so i like this so much.

Привет! Я-пробка текстового сообщения, которое надо перевести с английского на русский язык. ….

I know Russian very well i have been in Russia more than 20 times [and then they attached the usual machine translation]

Я из России. так что эта задача была для меня легко.

human bening is emotions it must be very important. any feelings is best feelings human benign [don’t worry, I didn’t get it either – nolar]

This one is a piece of art (especially its interpretation of “web developer”):

Здравствулте!! Я текстовое сообщение испытания, котор нужно перевести от английского к русскому. Если вы спрашиваете мне, то я был рожден в разуме шального проявителя паутины, который испытывает MTurk API для того чтобы начать очень перспективнейшее обслуживание более поздно. // There should be more English to Russian hits, its the only alternative language I know

And just in case you wish to make your own estimate of what you are going to get there, here are the results as downloaded from the MTurk requester interface: for $0.05 and for $0.50 per assignment. Note that some answers were actually approved automatically when I forgot to check them, but I have marked them as rejected in these files.

Quality control

So now I am thinking about how to improve the quality of the results. Luckily, I do not have to pay for rejected answers. But the question remains: how do you determine which results are acceptable and which are not? Even if you had a selection or numeric answer, you could not just compare it to the average value, since with 90% wrong answers the average will be wrong too. And with free-text answers, things get really complicated.

One way I see, for translation tasks only, is to run machine translations yourself and then compare each worker’s answer with the machine output. You only need to know which machine translators are available to the workers. Or, even better, you can just track the workers who give exactly the same answer, word for word. Most likely it came from a machine.
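Both checks can be sketched with the Python standard library alone. This is only an illustration under assumptions of mine: the function names, the normalization rule, the 0.9 similarity threshold, and the sample answers are all invented, and the machine output is assumed to have been collected by you beforehand.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop punctuation so that trivial edits do not hide a copy."""
    return " ".join(re.findall(r"\w+", text.lower()))


def group_identical(answers):
    """Return groups of worker IDs whose normalized answers are exactly the same."""
    groups = defaultdict(list)
    for worker_id, text in answers.items():
        groups[normalize(text)].append(worker_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def similarity(a, b):
    """Similarity ratio between two texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def looks_machine_made(answer, machine_outputs, threshold=0.9):
    """True if the answer is nearly identical to any known machine translation."""
    return any(similarity(answer, m) >= threshold for m in machine_outputs)


# Hypothetical sample data, keyed by worker ID.
answers = {
    "w1": "Привет! Я тестовое сообщение.",
    "w2": "привет, я тестовое  сообщение",
    "w3": "Здравствуйте! Я пробный текст.",
}
machine_outputs = ["Привет! Я тестовое сообщение"]

print(group_identical(answers))
print([w for w, text in answers.items() if looks_machine_made(text, machine_outputs)])
```

The threshold is a guess and would need tuning on real answers: too low and honest translations of a short text get flagged, too high and a machine translation with one corrected word slips through.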

A harder way is to analyze the text itself. For example, feed it to a grammar checker (not just a spell checker, but one that also checks semantics, punctuation, etc.). Machine translations will be detectable by their many errors. In Russian, these will be incomprehensible forms of nouns and verbs and the wrong gender of objects. For this, of course, you need some kind of semantic parser.

But the most common way to check whether an answer is good is to have other humans review it. So you can split the workers into two roles: translators and verifiers (or correctors). The only non-obvious thing here is the quality of the verifiers’ work. How do you verify the verifiers?

Anyway, you can use qualifications (created by you for each language) and assign them to workers manually. After running for some time, you will end up with a pool of trusted workers who can perform your tasks with predictable results. But this works only if you have a flow, a significant flow, of these tasks to keep your crowd close. If the tasks are occasional, the workers will just ignore them and the effort required to qualify.

Other advice

A few more small pieces of advice for those starting with MTurk:

Do not set auto-approval to less than a day. Prefer at least a 3-day or 7-day auto-approval, so that you can manually review the answers but still be able to… well, relax and be offline for a weekend. A 30-day auto-approval time is okay too, but be loyal to your good workers: approve their results quickly, so that they get their money as soon as possible. Cheaters can wait. Who cares about cheaters, after all?

Formulate the question as precisely as possible. Give exact instructions on what the workers should do and how you will judge their work. In the case of translations, for example, say how precise the translation should be (word for word, or just keeping the meaning) and what subject it is about (computers, law, etc.). If possible, run a few test HITs with different formulations and decorations of the same question, and ask the workers for feedback in an additional field.

Pay an adequate price for the task. If the tasks (in a flow) are of varying complexity or size, split them into a few different types, each with its own price, qualification requirements, etc. Or just split the tasks into smaller ones, if that is possible.

Never trust the workers. Always verify their work, one way or another. Do not allow cheaters to penetrate MTurk too deep. This “elastic workforce” is an ecosystem after all.

Further plans

As for now, I am going to continue the experiments with MTurk. First, I would like to see how the workers handle translation into English, the native language of most of them. The problem here is that I am not a native English speaker myself, so I cannot judge how good or bad a translation is (for example, whether it was made by another English-as-a-second-language speaker).

For this I am going to create a second kind of task, which will take a translation and feed it to the verifiers I mentioned above. They can rate each translation on a scale of 0..10 (or 0..100, who cares). The system obviously becomes more complicated and requires some automatic handling. Luckily, I am a programmer.
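One possible shape of that automatic handling, as a rough sketch: collect several verifier ratings per translation and decide by the median, which a single careless or cheating verifier cannot move much. Everything here is an assumption for illustration, including the function names, the minimum of three votes, and the acceptance threshold of 7 on the 0..10 scale.

```python
from statistics import median


def aggregate_ratings(ratings, min_votes=3):
    """Median rating per translation, or None while there are too few votes."""
    return {
        translation_id: (median(scores) if len(scores) >= min_votes else None)
        for translation_id, scores in ratings.items()
    }


def decide(ratings, threshold=7, min_votes=3):
    """Map each translation to True (accept), False (reject), or None (pending)."""
    return {
        translation_id: (None if score is None else score >= threshold)
        for translation_id, score in aggregate_ratings(ratings, min_votes).items()
    }


# Hypothetical ratings on the 0..10 scale, keyed by translation ID.
ratings = {
    "t1": [8, 9, 0],   # one verifier disagrees with the other two
    "t2": [2, 3, 9],
    "t3": [10],        # not enough votes yet
}

print(aggregate_ratings(ratings))
print(decide(ratings))
```

The median only helps while the honest verifiers are a majority, so this does not answer the question of how to verify the verifiers; it just limits the damage from one bad vote.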

The other direction of improvement is to build a team of workers. Actually, I have already started with those whose answers were approved: I rate them on a scale of 0..100, mostly with values of 80..100, based simply on the correctness of their punctuation and the number of typos.

Still, there is a possibility that MTurk in its current form, with U.S.-only workers, will not be enough. I may end up building my own HIT platform, with blackjack and… And worldwide, really worldwide, not like Amazon claims it is. And I will host it on AWS. What a cruel irony ;-)
