AI data collection and annotation guide for speech and NLP:
Best practices for creating high-quality training data
Introduction
Implementing an artificial intelligence solution that accurately understands spoken or written language is no easy task. Human language is complex and nuanced: a system has to handle homophones and homonyms, interpret cultural references and idioms, and discern sarcasm and tone. The phrases “I want a cheeseburger so bad” and “my cheeseburger was so bad” sound a lot alike but mean very different things.
Whether you are introducing a digital assistant, an automatic speech recognition (ASR) system, a customer-support chatbot, or any other type of speech-to-text or natural language processing (NLP) app, the right training data makes all the difference. To create an effective machine learning model, you need to collect the right volume and variety of data, in your target languages and dialects, gathered from native speakers. For an ASR system, you need to collect data in the acoustic environment that best matches your particular application. Similarly, for an NLP model or chatbot, data should be gathered from various sources and contexts to simulate the ‘noise’ of diverse and nuanced conversational environments. Then you need to properly annotate the data so your application can understand the difference between a bad cheeseburger and a bad cheeseburger craving.
01 Data collection considerations for speech and NLP
- Establish clear collection guidelines. Ensure that clear instructions are provided to all individuals involved in collecting data for your AI application. As the collection progresses, you may need to adjust the guidelines accordingly.
- Collect data from a variety of sources. Cast a wide net. If you are implementing a text chatbot, gather data from an array of sources: social media apps, peer review sites, emails, website inquiries, etc. Each source brings a unique style of language and provides a different context.
- Ensure demographic representation. Be sure your dataset reflects your target markets and users. Include a variety of age groups, genders, ethnicities, educational backgrounds, and professions to improve results and mitigate bias. Include a wide range of languages and regional dialects to maximize coverage.
- Understand your audience. If your app targets specific constituencies like doctors, lawyers, or gamers, include data samples from those groups to incorporate the right jargon into your solution.
- Collect speech data in the right setting. If you are implementing a speech recognition app, capture audio in an environment that reflects real-world scenarios. When developing an in-car speech recognition system, for example, record drivers speaking in various real-life conditions: with the car windows down, with the radio on, with emergency sirens in the background, etc.
- Ensure data privacy and security. Verify that the data you collect for your speech and NLP solutions complies with relevant privacy laws such as GDPR and CCPA.
- Establish the optimal size of the training dataset. Dataset size affects the accuracy of the model as well as computational requirements and training effort. As a rough rule of thumb, plan to collect about ten data points for each parameter in your model. (For speech, a data point corresponds to a frame of audio, about 20 milliseconds long.)
- Prioritize data quality. When collecting data for your speech or NLP application, focus on data accuracy and consistency. Completeness is also important: the dataset should cover a broad and representative range of the scenarios the model could encounter once it’s in production. Remove irrelevant data, as it could have a negative effect on performance.
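The rule of thumb in the dataset-size bullet above can be turned into a quick estimate. The sketch below is only illustrative: the function names and the one-million-parameter model are hypothetical, and it assumes non-overlapping 20-millisecond frames, so treat the result as a rough planning figure rather than a requirement.

```python
# Rough planning estimate based on the rule of thumb above.
# Assumptions (not from a specific toolkit): frames do not overlap,
# and each frame counts as one data point.

FRAME_SECONDS = 0.020        # about 20 milliseconds per frame
POINTS_PER_PARAMETER = 10    # ten data points per model parameter


def frames_needed(num_parameters: int) -> int:
    """Number of audio frames suggested by the rule of thumb."""
    return POINTS_PER_PARAMETER * num_parameters


def audio_hours_needed(num_parameters: int) -> float:
    """Convert the suggested frame count into hours of recorded speech."""
    return frames_needed(num_parameters) * FRAME_SECONDS / 3600


if __name__ == "__main__":
    # Hypothetical model with one million parameters.
    params = 1_000_000
    print(frames_needed(params))                  # 10000000
    print(round(audio_hours_needed(params), 1))   # 55.6
```

For this hypothetical model, the estimate comes to roughly 56 hours of speech. Real projects usually adjust this figure for the variety of speakers, dialects, and acoustic conditions they need to cover.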
02 Data transcription and annotation considerations for speech and NLP
Data transcription is simply the process of converting speech to text. Data annotation is the process of labeling the text to provide context or meaning. Well-annotated data is fundamental for understanding intent (interpreting commands, wake words, queries, etc.), discerning emotions and sentiment, identifying and tagging individual speakers, and understanding language nuances and subtleties.
As with data collection, clear guidelines, data quality, privacy, and security are important factors in the annotation process.
Here are some additional factors to consider:
- Use humans to transcribe and label voice recordings. Some AI systems can transcribe and annotate speech well, but their proficiency depends on the language and on the availability of digital resources for that language variety. Today, even for models developed with an abundance of input data, humans are still much better at understanding subtle language distinctions than machines. A “human-in-the-loop” approach provides greater accuracy and reliability.
- Use a large team of annotators with different backgrounds. A large and diverse team can help you mitigate annotation bias and improve data accuracy while providing important cultural context.
- Take a multi-pass approach to data labeling. By having multiple annotators label the same data point, you can further reduce biases and improve accuracy.
- Separate data into discrete training and testing sets to detect overfitting. An overfitted model performs exceptionally well on its training dataset but poorly on unseen data. Evaluating your AI model on a distinct dataset that it never saw during training reveals this gap, so you can correct it before deployment.
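The multi-pass labeling and train/test separation bullets above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a prescribed workflow: the function names, the sentiment labels, and the speaker IDs are hypothetical, majority voting is only one of several ways to combine annotator labels, and splitting by speaker is an added assumption intended to keep one person's recordings from appearing in both sets.

```python
import hashlib
from collections import Counter
from typing import List, Optional, Tuple


def aggregate_labels(labels: List[str]) -> Tuple[Optional[str], float]:
    """Majority vote over the labels several annotators gave one item.

    Returns (winning_label, agreement_share). On a tie the label is None,
    so the item can be routed to an additional reviewer.
    """
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    share = top_count / len(labels)
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, share
    return top_label, share


def assign_split(speaker_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically place all of a speaker's data in one split.

    Hashing the (hypothetical) speaker ID gives a stable value in [0, 1),
    so the same speaker always lands in the same set across runs.
    """
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


if __name__ == "__main__":
    # Three annotators label "my cheeseburger was so bad".
    label, share = aggregate_labels(["negative", "negative", "positive"])
    print(label, round(share, 2))   # negative 0.67

    for speaker in ["speaker-001", "speaker-002", "speaker-003"]:
        print(speaker, assign_split(speaker))
```

A low agreement share is a useful signal in its own right: items where annotators disagree often point to ambiguous guidelines or to genuinely ambiguous language.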
By following these proven best practices, you can create high-quality training data, build accurate and reliable models, and make the most of your AI investments.