How Team Diversity Can Improve Consensus Processes and Reduce Bias in NLP

If a tool that’s supposed to be universal only works for a small portion of the population, is it truly a tool that works? 

Even as artificial intelligence research has advanced, and technologies like voice recognition software, chatbots, and search engines now help make our lives easier, there is still a gap. 

Technology was supposed to be an equalizer, but it hasn’t been as inclusive as expected. 

If tech products continue to be developed by a narrow demographic, there’s a danger that this gap could grow even further, especially as machine learning becomes common practice in the tech industry.  

It’s no surprise, then, that one of the biggest controversies in artificial intelligence and machine learning is diversity and inclusion.

When it comes to reducing bias, there are many areas within the machine learning process that could benefit from more diversity — from modeling to writing the algorithms — but all of it begins with high-quality training data. 

What is consensus and how does it work?

One quality control method that has proved effective for significantly reducing bias is to use consensus during the data labeling process. 

As its name suggests, consensus requires different people to label the same data set, and an item is only used once a set level of agreement among the labelers is reached.

Say you wanted to train a machine to recognize pictures of apples. Labelers going through the pictures should tag each apple, but in some cases an apple will be labeled incorrectly. 

With consensus, if labelers disagree on an image, that image is removed or flagged for review. This makes it far more likely that every image kept in the data set is labeled correctly, resulting in higher data set accuracy. 

In simpler terms, if 100 people look at an image of a cat, and say it’s a cat, it’s most probably a cat. 
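The post doesn’t specify how agreement is computed, but the idea above can be sketched as a simple majority vote with a configurable agreement threshold. The function name and the 75% default are illustrative assumptions, not Supahands’ actual implementation:

```python
from collections import Counter

def consensus_label(labels, threshold=0.75):
    """Return the majority label if enough labelers agree, else None.

    `threshold` is the illustrative fraction of labelers who must
    agree before the item is accepted into the training set.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    if votes / len(labels) >= threshold:
        return label
    return None  # no consensus: discard the item or route it for review

# 100 labelers all say "cat" -> it's most probably a cat
print(consensus_label(["cat"] * 100))                      # cat
# 2 of 3 agree: below a 0.75 threshold, so the item is dropped
print(consensus_label(["apple", "apple", "orange"]))       # None
```

Lowering the threshold (say, to 0.6) would keep the two-out-of-three item — which is exactly the kind of setting a consensus tool needs to expose.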

However, some data labeling tasks may not be as clear cut as identifying apples or cats. For example, sentiment analysis is more subjective: one person may consider a statement negative, while another sees it as neutral. 

And if these labelers are labeling data for more complex types of sentiment analysis — for example, judging a person’s emotion based on voice data — there can be even more differences. So much of sentiment analysis depends on cultural contexts and local language nuances. 
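The post doesn’t prescribe a metric for this, but a standard way to quantify how much labelers agree beyond chance is an inter-annotator agreement statistic such as Fleiss’ kappa. Here is a minimal sketch, assuming every item is rated by the same number of labelers:

```python
from collections import Counter

def fleiss_kappa(label_matrix):
    """Fleiss' kappa for a list of items, each a list of labels
    from the same number of raters. 1.0 = perfect agreement,
    ~0 = agreement no better than chance."""
    n = len(label_matrix[0])                    # raters per item
    N = len(label_matrix)                       # number of items
    categories = sorted({l for item in label_matrix for l in item})

    # n_ij: how many raters assigned category j to item i
    counts = [Counter(item) for item in label_matrix]

    # Observed agreement per item, averaged over all items
    p_bar = sum(
        (sum(c ** 2 for c in ct.values()) - n) / (n * (n - 1))
        for ct in counts
    ) / N

    # Chance agreement from the overall category proportions
    p_j = [sum(ct[cat] for ct in counts) / (N * n) for cat in categories]
    p_e = sum(p ** 2 for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Three labelers rate four statements for sentiment (illustrative data)
ratings = [
    ["pos", "pos", "pos"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "neg"],
    ["pos", "neg", "neu"],
]
print(round(fleiss_kappa(ratings), 4))
```

A low kappa on a sentiment task is often a sign of genuine cultural or linguistic ambiguity rather than careless labeling — which is why the agreement threshold needs to be tunable per task.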

This is why it’s important to have the right tools for using consensus — ones that let you set the percentage of agreement required. 

And this is also why it’s important to have diverse data labelers, especially for NLP.

Doing consensus the right way

A few things matter when using consensus as part of the QC process for data labeling, and it helps to work with a partner who has experience running consensus workflows. 

At Supahands, using consensus means distributing the same set of data to at least three labelers, typically from diverse backgrounds and demographics. At the same time, the Supahands system allows for randomization with specific settings, for example, directing the task through multiple layers of consensus depending on the quality threshold required.  

Supahands can also assist with other configurations, such as randomly selecting data labelers within a specific age group across three different Southeast Asian countries, or drawing labelers from a wider age range within a single country. 

A partner who has experience with using consensus would be able to assist you with the tools and settings that you need to ensure high quality training data. 

When it comes to NLP and sentiment analysis, it’s vital to have a diverse workforce — for both ethical and quality control reasons. And this is where working with a partner like Supahands, which already has a workforce spread across the Southeast Asian region, can be beneficial. 

The importance of developing inclusive models

In order to develop a product that functions for all its users, developers must write algorithms that take diversity into account. 

Imagine automated subtitling software that cannot capture female voices. Imagine running sentiment analysis on a specific product and missing out on half the population’s reviews. It’s a dismal thought that these dysfunctional products are still very much the norm. 

But the first step towards developing more inclusive models starts with preparing accurate, less biased training data sets. To do that, start by working with a data labeling team that has built-in diversity.

Interested to see what we can do for you when it comes to preparing inclusive NLP training data sets? Get in touch!

