AI data collection and annotation guide for speech and NLP:
Best practices for creating high-quality training data

Introduction

Implementing an artificial intelligence solution that accurately understands spoken or written language is no easy task. Human language is complex and nuanced: a system must handle homophones and homonyms, interpret cultural references and idioms, and discern sarcasm and tone. The phrases “I want a cheeseburger so bad” and “my cheeseburger was so bad” sound a lot alike, but mean very different things.

Whether you are introducing a digital assistant, an automated speech recognition (ASR) system, a customer-support chatbot, or any other type of speech-to-text or natural language processing (NLP) app, the right training data makes all the difference. To create an effective machine learning model you need to collect the right volume and variety of data, in your target languages and dialects, gathered from native speakers. For an ASR system, you need to collect data in the optimal acoustical environment for your particular application. Similarly, for an NLP model or chatbot, data should be gathered from various sources and contexts to simulate the ‘noise’ of diverse and nuanced conversational environments. Then you need to properly annotate the data so your application can understand the difference between a bad cheeseburger and a bad cheeseburger craving.

Data collection considerations for speech and NLP
Data transcription and annotation considerations for speech and NLP
How LXT can help

Key findings

AI investment

More than half of organizations surveyed (56%) are spending between $1 million and $50 million on AI annually, and 15% are spending $51 million or more.
Over a third of high-revenue companies are spending between $51 million and $100 million on AI annually.
AI strategies are primarily driven by innovation and growth needs, where AI enables businesses to scale and innovate faster, and to secure competitive advantage.
Efficiency and productivity gains are seen as the most dominant problems that AI can solve across industries. Improved analytics and business expansion are also high priorities.

AI and maturity levels

40% of organizations rate themselves within the three highest levels of AI maturity according to Gartner’s AI Maturity Model: Operational, Systemic, and Transformational.
Companies in the Systemic and Transformational levels of the AI maturity model are using AI to scale and drive competitive advantage and product innovation. They are also budgeting higher amounts overall for AI programs, and are using both supervised and semi-supervised machine learning methods.
AI investment and AI maturity correlate, with a quarter of AI maturing organizations spending $51 million or more on AI, compared to just 8% of experimenters (those organizations in the awareness and experimental stages of AI adoption).
Regarding the drivers behind AI strategies, businesses at the Transformational end of the scale have already experienced the benefits of better risk management and delivering on customer needs. Now, they are looking to scale up and accelerate product innovation. AI Experimenters are focused on driving innovation and managing large volumes of data.
AI Maturing organizations rely more on semi- and fully-supervised machine learning, while AI-nascent organizations report a greater use of unsupervised machine learning.
Maturing organizations view quality training data as an essential contributor to AI success.
When asked about the benefits experienced as a result of high-quality training data for AI, Experimenters see efficiency and agility gains, while Maturing organizations report accelerated time to market and improved competitive advantage.
Two-thirds of all respondents expect their need for training data to increase over the next five years, and AI Maturing organizations indicate the highest need to increase their training data budgets over this timeframe.

AI trends by industry

The financial services industry is leading the way, with 43% of organizations having reached Systemic or Transformational levels of AI maturity.
The tech industry follows the financial services industry, with 22% of organizations having reached the highest levels of AI maturity.
Respondents state that efficiency and productivity gains are the most common goals of AI strategies for their industries as a whole, except for financial services companies, which say that improved analytics is the main goal.
The tech, retail, manufacturing/automotive and professional services industries are deploying AI to innovate and advance product development, while financial services companies are driven by competitive advantage and risk management.
Text data is the leading type used across all industries; future data types include audio, user behavior and video.

01 Data collection considerations for speech and NLP

Data collection is the first and most important stage of the AI/ML lifecycle. The quality of your AI model is only as good as the quality of your training data. The following factors are important when collecting data for speech and NLP solutions:

Establish clear collection guidelines. Ensure that clear instructions are provided to all individuals involved in collecting data for your AI application. As the collection progresses, you may need to adjust the guidelines accordingly.
Collect data from a variety of sources. Cast a wide net. If you are implementing a text chatbot, gather data from an array of sources: social media apps, peer review sites, emails, website inquiries, etc. Each source brings a unique style of language and provides a different context.
Ensure demographic representation. Be sure your dataset fully reflects your target markets and users. Include a variety of age groups, genders, ethnicities, educational backgrounds, and professions to improve results and mitigate bias. Include a wide range of languages and regional dialects to maximize coverage.
Understand your audience. If your app targets specific constituencies like doctors, lawyers, or gamers, include data samples from those groups to incorporate the right jargon into your solution.
Collect speech data in the right setting. If you are implementing a voice recognition app, capture audio in an environment that reflects real-world scenarios. When developing an in-car speech-recognition system, for example, record drivers speaking in various real-life conditions—with the car windows down, with the radio on, with emergency sirens in the background, etc.
Ensure data privacy and security. Make sure that the data you collect for your speech and NLP solutions complies with relevant privacy laws such as GDPR and CCPA.
Establish the optimal size of the training dataset. Dataset size affects the accuracy of the model as well as computational requirements and training effort. As a general rule of thumb, you need to collect ten data points for each parameter in your model. (For speech, a data point corresponds to a frame of audio, about 20 milliseconds long.)
Prioritize data quality. When collecting data for your speech or NLP application, focus on data accuracy and consistency. Completeness is also important: the dataset should encompass a broad and representative range of scenarios that the model could encounter once it is in production. Irrelevant data should be removed, as it could have a negative effect on performance.
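
To make the dataset-size rule of thumb above more concrete, the short Python sketch below converts a parameter count into an approximate number of hours of audio. It uses only the two figures given in this guide (ten data points per parameter, one data point per 20-millisecond frame) and assumes, for simplicity, that frames do not overlap. The function name and the example model sizes are hypothetical and chosen purely for illustration; treat the output as a rough planning estimate, not a guarantee of model quality.

```python
FRAME_SECONDS = 0.020       # one data point = one ~20 ms audio frame (from the guide)
POINTS_PER_PARAMETER = 10   # rule of thumb from the guide


def required_audio_hours(num_parameters: int) -> float:
    """Estimate hours of audio implied by the ten-points-per-parameter rule.

    Assumes non-overlapping frames, which is a simplification.
    """
    frames_needed = num_parameters * POINTS_PER_PARAMETER
    return frames_needed * FRAME_SECONDS / 3600


if __name__ == "__main__":
    # Hypothetical model sizes, used only to show the arithmetic.
    for params in (1_000_000, 10_000_000, 100_000_000):
        hours = required_audio_hours(params)
        print(f"{params:>11,} parameters -> about {hours:,.1f} hours of audio")
```

Under these assumptions, a model with one million parameters would call for roughly 56 hours of audio, and the estimate grows linearly with the parameter count.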

02 Data transcription and annotation considerations for speech and NLP

Data transcription is simply the process of converting speech to text. Data annotation is the process of labeling the text to provide context or meaning. Well-annotated data is fundamental for understanding intent (interpreting commands, wake words, queries, etc.), discerning emotions and sentiment, identifying and tagging individual speakers, and understanding language nuances and subtleties.

As in data collection, clear guidelines and data quality, privacy and security are also important factors in the annotation process.

Here are some additional factors to consider:

Use humans to transcribe and label voice recordings. Some AI systems can transcribe and annotate speech well, but proficiency depends on the language and the availability of digital resources in that language variety. Today, even for models developed with an abundance of input data, humans are still much better than machines at understanding subtle language distinctions. A “human-in-the-loop” approach provides greater accuracy and reliability.
Use a large team of annotators with different backgrounds. A large and diverse team can help you mitigate annotation bias and improve data accuracy while providing important cultural context.
Take a multi-pass approach to data labeling. By having multiple annotators label the same data point you can further reduce biases and improve accuracy.
Separate data into discrete training and testing sets to guard against overfitting. An overfitted model performs exceptionally well on its training dataset but poorly on unseen data. Validating your AI model on a distinct dataset reveals this gap so you can correct it.
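
As an illustration of the last two points, the Python sketch below shows one simple way to combine labels from several annotators (a majority vote) and to split the resulting examples into disjoint training and testing sets. This is a minimal sketch, not a prescribed workflow: the function names, the "craving" and "complaint" labels, the agreement threshold, and the 80/20 split are all assumptions made for the example.

```python
import random
from collections import Counter


def majority_label(labels, min_agreement=0.5):
    """Return the most common label, or None when agreement is too low.

    A label wins only if more than `min_agreement` of annotators chose it,
    so ties and empty inputs are flagged for another review pass.
    """
    if not labels:
        return None
    top_label, votes = Counter(labels).most_common(1)[0]
    return top_label if votes / len(labels) > min_agreement else None


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items and return disjoint (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Hypothetical annotations: each utterance was labeled by several annotators.
    annotations = {
        "I want a cheeseburger so bad": ["craving", "craving", "complaint"],
        "my cheeseburger was so bad": ["complaint", "complaint", "complaint"],
        "so bad": ["craving", "complaint"],  # tie, needs another pass
    }
    resolved = {text: majority_label(votes) for text, votes in annotations.items()}
    for text, label in resolved.items():
        print(f"{text!r}: {label or 'needs review'}")

    labeled = [(text, label) for text, label in resolved.items() if label]
    train, test = split_train_test(labeled, test_fraction=0.5)
    print(len(train), "training example(s),", len(test), "testing example(s)")
```

Items that fail the agreement threshold are returned as None rather than forced into a label, so they can be routed back to annotators instead of adding noise to the training set.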

By following these proven best practices you can create high-quality training data, build accurate and reliable models, and make the most of your AI investments.

03 How LXT can help

LXT can help ensure your Speech and NLP solutions get the high-quality data they need to deliver the optimal experience for your customers. Whether you are looking to benchmark your solution or expand into new languages, our custom data collection services and human-in-the-loop transcription, data annotation, and audio annotation services can help you accelerate time-to-market, improve the accuracy of your models, and improve user experiences. Our global expertise spans more than 145 countries and over 1000 language locales. Contact one of our experts today.

Ready to discuss your data collection needs?

Contact our experts today
