Looking to Expand throughout Southeast Asia? Here’s an Efficient Way to Get the NLP Training Datasets You Need

The Internet economy is booming in Southeast Asia and forward-thinking companies are looking to expand into the region. As in any other market, understanding and catering to the local audience is a must. Customer experience, user feedback and happiness are all important factors in ensuring a successful expansion. 

Many companies use natural language processing (NLP) as part of this customer experience (CX) optimization process. And although NLP is a rapidly growing market, resources, specifically training datasets, are still limited. Most of the available training datasets are in English, while some libraries also cover Chinese or European languages.

Although English may be the most widely spoken language in the world, it’s just one of many. And in Southeast Asia, countries like Malaysia and Singapore have applied local nuances to English, to the point that it can sometimes sound like a very different language. In fact, these countries have given their locally spoken English a colloquial name: Singlish in Singapore and Manglish in Malaysia.

Companies looking to conduct sentiment analysis in Southeast Asia have realised that their typical tools cannot simply be scaled up for use in this extremely diverse region. Thai is written in its own script, and Vietnamese, although Latin-based, relies on diacritics that fall outside the ISO basic Latin alphabet.

In one study where existing methods for sentiment analysis in English were run on Thai textual data, performance was “far from satisfactory”. Just think about how well Google Translate performs and you’ll realize that when there’s a language gap, a lot is lost in translation. In the case of understanding human sentiment, context is key.

The ideal scenario is to obtain human-labeled training datasets for sentiment analysis or semantic understanding. This makes it possible to get a clearer picture of what customers are saying. 

A high-cost endeavour?

Considering the number of Southeast Asian countries and the number of languages that are spoken within the region, this can sound like a costly feat.

Some CX experts might think that it’s necessary to engage multiple companies to conduct data labeling in order to cover the multitude of languages that are spoken in Southeast Asia, or that they need to set up multiple local teams to prepare the training data.

Beyond language, there’s also diversity in age, gender and race, each of which can produce textual nuances. Context, slang, localised grammar and emoji use all need to be captured in order to represent the spectrum of sentiments accurately.

If cost is a concern, companies might have to conduct their expansion in stages, starting in one country before slowly expanding across the region. This can result in missed opportunities, because there is no sure way to know which country to enter first.

Sure, you can be strategic about it, but who’s to say what might happen? 

Scaling up your business efficiently in Southeast Asia

One way to lower the cost of NLP in Southeast Asia is by working with the right partner. Rather than setting up multiple local teams, or engaging multiple data labeling companies, businesses looking to expand within the Southeast Asian region could look for a data labeling partner that already has diversity built into its operations.

Besides having the right tools for data labeling, this partner should have considerable experience within the market, as well as an already distributed workforce that can provide the diversity required to capture the nuances in textual data. 

A local partner with a diversified workforce spread out within the region can save both money and time. You won’t have to spend time searching for and vetting multiple vendors in each country. And since you’ll only be working with one local partner, you will likely be able to get a better deal.

This partner should also have sufficient experience to have the right quality control methods in place, such as consensus among multiple annotators, so that your NLP training datasets are accurate in all languages.
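To make the idea of consensus concrete, here is a minimal Python sketch of one common approach: several annotators label the same text, and a label is accepted only if enough of them agree. The function name, the sample labels and the 60% agreement threshold are all illustrative assumptions, not a description of any particular vendor’s process.

```python
from collections import Counter


def consensus_label(labels, min_agreement=0.6):
    """Majority-vote consensus over one item's annotator labels.

    Returns (label, agreement). The label is None when the most common
    label falls short of min_agreement (a hypothetical threshold), which
    flags the item for another round of review.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    top_label, top_count = Counter(labels).most_common(1)[0]
    agreement = top_count / len(labels)
    if agreement >= min_agreement:
        return top_label, agreement
    return None, agreement


# Made-up example: three annotators per customer review.
annotations = {
    "review_1": ["positive", "positive", "negative"],
    "review_2": ["negative", "neutral", "positive"],
}
results = {item: consensus_label(votes) for item, votes in annotations.items()}
```

In this sketch, review_1 is accepted as positive because two of three annotators agree, while review_2 gets no label and would be sent back for review. Keeping the threshold above 50% means a tie can never be accepted as a consensus.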

If you’re looking to expand to Southeast Asia

Some CX experts might think that expansion into Southeast Asia has to happen in stages, simply because of the costs incurred by regional diversity. However, it doesn’t have to be this way. 

By working with the right data labeling partner, it’s possible to expand throughout the region at the same time. This partner will enable you to obtain accurate NLP training data for all the different regional languages, and help you drive towards a high-quality customer experience.

The market in Southeast Asia is large and, at this time, still a level playing field. This means the best time to strike is now. And the best way to do it is all at once.

Keen to know more about how we can help you get affordable NLP training datasets in Southeast Asian languages? Get in touch!

