
Research Spotlight: A Promising Crowdsourcing Research Experiment

March 21, 2017 by Stephen D. Voran

The results are in—when it comes to speech intelligibility testing, the crowd and human subjects in a lab have a lot in common. Initial results from a crowdsourced speech intelligibility test designed by ITS align closely with similar testing done in the lab. Since crowdsourced testing requires significantly less time and money than laboratory testing, this could help ITS quickly gather large amounts of high-quality speech intelligibility data as part of a more efficient overall test plan.

ITS conducts speech intelligibility testing to identify strengths and weaknesses in new telecommunications offerings. While intelligibility is necessary in any telecom system, it can be especially important for telecommunications that support public-safety operations. Getting a clear message through on the first try can be critical and may even be a matter of life and death. Protective gear, harsh noise conditions, and the necessity of hands-free operation can present serious intelligibility challenges to the fire, medical, and law enforcement personnel who protect us.

For years ITS conducted various types of subjective testing in tightly controlled laboratory conditions, with sound-isolated chambers, professional sound equipment, and individually recruited and supervised listeners. It was quite impressive—and also very costly. These tests allowed ITS to sort through emerging telecom options to find those that sound better or work better in some respect.

The lab approach is classic in science and engineering—you control everything that you possibly can so that you can attribute the variation in results to the thing that you are trying to measure. But an equally iconic rule in statistics says that more data is better. Turning away from the lab and deploying a subjective test to the uncontrolled, self-selecting, anonymous crowd of workers at the Mechanical Turk is one way to get more data. But would that data be worth anything at all?

NTIA Technical Memo TM-17-523: A Crowdsourced Speech Intelligibility Test that Agrees with, and Has Higher Repeatability than, Lab Tests, released in February 2017, describes ITS’s first crowdsourced speech intelligibility test and demonstrates that this method has great potential.

The Mechanical Turk allows requesters to outsource microwork (small, well-defined tasks that make up a large virtual project) to willing workers. For this experiment, the workers play ITS recordings and click or tap on one of six options to indicate which word they heard. Sometimes it’s easy, sometimes it’s a challenge, and sometimes the sound is so bad that the worker just has to guess. But this variability is the whole point.

The workers are actually performing a long-standing standardized speech intelligibility test protocol, the Modified Rhyme Test. The number of correct words selected provides the basis for a speech intelligibility rating—more correct words means higher intelligibility—and we calculate this for each of the 56 different telecom conditions represented by the 22,000 recordings in this test. The recordings are standard sound files that play in a worker’s Web browser. Presentation order is random and names are scrambled, so there is no basis for man or machine to successfully cheat: a worker has to listen and select a word.
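For readers who want to see the arithmetic, here is a minimal sketch of how per-condition intelligibility could be tallied from raw trial records. It is illustrative only, not the actual ITS analysis code, and the record fields and condition names are hypothetical.

```python
from collections import defaultdict

def score_conditions(trials):
    """Compute a simple intelligibility score for each telecom condition.

    `trials` is an iterable of dicts with hypothetical fields:
      'condition' -- which telecom condition the recording represents
      'correct'   -- True if the worker selected the word that was actually spoken
    The score here is just the fraction of correct word selections per condition;
    the published analysis may apply further adjustments (e.g., for guessing).
    """
    totals = defaultdict(int)
    right = defaultdict(int)
    for t in trials:
        totals[t["condition"]] += 1
        if t["correct"]:
            right[t["condition"]] += 1
    return {c: right[c] / totals[c] for c in totals}

# Tiny example with two hypothetical condition names
trials = [
    {"condition": "analog_FM", "correct": True},
    {"condition": "analog_FM", "correct": False},
    {"condition": "vocoder_A", "correct": True},
    {"condition": "vocoder_A", "correct": True},
]
print(score_conditions(trials))  # {'analog_FM': 0.5, 'vocoder_A': 1.0}
```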

The number of correct words selected also lets us check to see that the workers are taking the task seriously, and so far we have had 100% compliance (along with the expected small number of “technical difficulties”).
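One simple way to automate that check (again, a sketch under assumptions, not the procedure from the memo): each trial offers six word choices, so a worker answering at random would land near a 17% success rate, and anyone hovering near chance probably is not really listening.

```python
def flag_inattentive_workers(worker_trials, chance_rate=1/6, margin=0.15):
    """Flag workers whose overall success rate sits suspiciously close to chance.

    `worker_trials` maps a worker ID to a list of booleans (correct/incorrect).
    `chance_rate` is 1/6 because each Modified Rhyme Test trial offers six words.
    The margin is an illustrative threshold, not a value taken from the memo.
    """
    flagged = []
    for worker, results in worker_trials.items():
        rate = sum(results) / len(results)
        if rate < chance_rate + margin:
            flagged.append((worker, rate))
    return flagged

# Example: one attentive worker, one near-random clicker
print(flag_inattentive_workers({
    "worker_A": [True] * 80 + [False] * 20,   # 80% correct
    "worker_B": [True] * 18 + [False] * 82,   # 18% correct, near chance
}))  # [('worker_B', 0.18)]
```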

We made 1,536 short (several-minute) tasks available, and 345 different workers polished them off in under 6 hours. It’s hardly fair, but we just can’t refrain from comparing this with the 36 listeners and the weeks of time we invested in a similar test in the lab. But what of the data produced by this self-selecting crew, who could be working in airports, or while tending to crying babies, using low-quality earbuds or tiny built-in speakers?

It turns out that large numbers can overcome a lot of obstacles. Of course the range of results is large, but the average word success rate of the 345 crowd workers is only 4% lower than that of the 36 listeners who worked in our lab. And better still, thanks to a clever design and a modest bit of data filtering, we can pull out a complete set of data produced by 152 workers with an average word success rate that is a smidge higher (0.2%) than that of the lab listeners.

So the crowd is certainly up to the task. But the larger goal in all of this is to measure the speech intelligibility of the 56 telecom conditions. If we treat our lab results as the “truth,” the crowd results are statistically equivalent for 55 of the 56 conditions. The final condition differs only in a borderline sense, and we are comfortable saying the crowd and lab results are essentially equivalent.
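The memo documents the actual statistical comparison; purely as an illustration, a per-condition check could look something like the two-proportion z-test below, with hypothetical counts standing in for the real data.

```python
from math import sqrt

def two_proportion_z(correct_a, total_a, correct_b, total_b):
    """Two-proportion z-statistic for comparing lab vs. crowd word-success counts
    on a single telecom condition. Illustrative only; the memo's analysis may
    use a different statistical procedure."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    p_pool = (correct_a + correct_b) / (total_a + total_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# Hypothetical counts for one condition: lab listeners vs. crowd workers
z = two_proportion_z(correct_a=310, total_a=360, correct_b=820, total_b=1000)
print(abs(z) < 1.96)  # True here: no significant difference at the 5% level
```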

Since the results are not identical, it might be tempting to muse on whether one is somehow more correct than the other. Tempting yes, but probably not worthwhile. It really boils down to philosophy. The lab gives great control, but logistics and expense mean that we will always be limited to a relatively small number of listeners. By tapping into the crowd we can incorporate a large number of listeners and their respective uncontrolled environments. And it’s also true that when taken in aggregate, those multiple uncontrolled environments could more accurately reflect real use cases than the lab can.

But we won’t be abandoning our labs any time soon. They still serve multiple valuable functions. For one thing, they offer quiet spaces where small differences far out on the thin edge of human hearing are audible—and for some of our testing, this is essential. In addition, labs may still be our best option for quality of experience (QoE) testing. While different groups have studied how to best deploy QoE tests to the crowd, the fact remains that there is no correct answer, as there is in speech intelligibility testing. All personal opinions of QoE are potentially valid ones. Without a correct answer, we lose a key tool for vetting and filtering the participation of the masses.

To be sure, we remain excited about the potential to efficiently gather large amounts of high-quality speech intelligibility data from the crowd.