Collecting quality training data: A how-to guide for better AI
Introduction
Artificial intelligence (AI) has become a top priority for companies seeking to drive digital transformation and gain a competitive edge. Demand for AI will grow significantly over the next few years, and companies that pass on it risk being left behind by their competitors.
An AI solution’s effectiveness and accuracy are directly linked to the quality of the data used to train its machine learning algorithms. Ensuring the availability of quality data begins with incorporating a data collection plan into the data strategy that a company creates to support its AI initiatives.
In this guide, we will discuss topics such as why data is essential to AI, what quality data looks like, how data collection fits into the lifecycle of AI, and how to create and sustain unbiased AI.
Table of Contents
- Why data is essential to effective AI
- How data fits in the AI solution lifecycle
- Data collection plans: What they are and how to build them
- How to collect data for AI
- Different types of data for AI
- Improving AI performance
- Signs that it is time to retrain
- Reducing bias through data collection
- The benefits of a data collection partner
- What to look for in a data collection partner
- Building a custom data collection process
01 Why data is essential to effective AI
Many AI solutions are powered by a machine learning algorithm that informs every decision. These decisions are based on the algorithm’s processing of information (its training data), refined by a team of people who iteratively annotate data or correct the algorithm’s incorrect responses to help improve its accuracy. Put another way, a machine learning algorithm has the raw potential to be trained based on the data to which it’s been exposed. But not just any data will do.
When bringing an AI solution to market, you’re placing your confidence in the solution’s ability to make sound decisions for your company and customers. As a business or technology lead, you must have confidence that your algorithms are making decisions that are accurate, reliable, and defensible.
To this end, algorithms must be trained using quality data. More specifically, the data must meet the following guidelines:
- Complete: Data sets must contain all essential information, with no missing values.
- Timely: Data must be updated to reflect current market conditions.
- Consistent: Data must not change as it moves across a company’s network into different storage locations.
- Distinct: There should not be any duplication or overlapping of values across data sets.
- Accurate: Data must reflect actual, real-world scenarios backed by verifiable sources.
- GDPR compliant: All data and personally identifiable information must abide by the guidelines outlined in the EU’s security and privacy regulations.
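Several of these checks (completeness, distinctness, type consistency) can be automated early in the pipeline. The snippet below is a minimal sketch using pandas; the column names and example values are hypothetical, and a real pipeline would also validate freshness, cross-system consistency, and source provenance.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> dict:
    """Run simple completeness and distinctness checks on a dataset."""
    return {
        # Complete: fraction of missing values per column
        "missing_by_column": df.isna().mean().to_dict(),
        # Distinct: count of fully duplicated rows
        "duplicate_rows": int(df.duplicated().sum()),
        # Consistent: dtype per column, to compare across storage locations
        "dtypes": {col: str(dt) for col, dt in df.dtypes.items()},
    }

# Hypothetical example: a small passenger manifest
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Ada", None],
    "age": [36, 45, 36, 29],
    "ticket_price": [120.0, 95.5, 120.0, 88.0],
})
print(basic_quality_report(df))
```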
Once data is collected, it’s essential that you label it clearly and accurately so that your algorithm can make sense of it. While this guide isn’t focused on the details of labeling data, labeling is another important element in the effective training of supervised and semi-supervised machine learning algorithms.
02 How data fits in the AI solution lifecycle
The AI solution life cycle
- Prototyping an AI solution
- Collecting the data
- Labeling and organizing data
- Using the data to train an algorithm
- Deploying the AI solution
- Collecting live data
- Using live data to improve the user experience
Because data plays such an integral part in the life cycle, one of the first things you should do is formulate a data collection plan. This will help ensure that you have enough quality, relevant data to develop your solution, get it quickly to market, and sustain it over the long term.
03 Data collection plans: What they are and how to build them
Crystallize your problem and solution
Determine how much data you need
When quantifying how much data you need to collect, consider the following factors:
- The complexity of the problem: In a speech recognition scenario, for example, the amount of data you need is driven by the number of languages and dialects you’ll support. With computer vision solutions, consider the complexity of the scenario in which the solution will operate. For example, if you’re building an autonomous vehicle, you must account for roadways, other cars, pedestrians, construction zones, and other hazards. There are also numerous scenarios and edge cases to consider, as well as the algorithm’s ability to process and synthesize data from a car’s various sensors, including camera, LiDAR, radar, and ultrasound.
- The complexity of your algorithm: Non-linear algorithms, in which data elements can’t be arranged in a linear or sequential order, require more data to clearly map out the relationships between different data points. Common uses of non-linear algorithms include tracking anomalies in financial transactions, segmenting consumer behavior, and identifying patterns in inventory based on sales activity.
- The number of features you are training for: A feature is a measurable property of the object you’re trying to analyze. On a passenger manifest, for example, features might include the name, age, gender, ticket price, and seat number.
- How much data you have on hand: An algorithm trained on a curated, diverse dataset of external and in-house data may provide greater value than one trained solely on in-house data. It may therefore be advisable to curate additional data from external sources and train the algorithm with a dataset containing features that are tangentially related to the business problem.
The amount of data needed to get your AI solution from pilot to production varies by use case. An experienced data provider can help you navigate the data collection process.
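There is no universal formula, but one widely cited heuristic (sometimes called the “rule of 10”) suggests collecting at least ten examples per feature, scaled up for each class you’re predicting. The sketch below is purely illustrative; the numbers are hypothetical assumptions, not guidance from any particular provider, and complex non-linear models usually demand far more data.

```python
def rough_sample_estimate(num_features: int,
                          classes: int = 2,
                          samples_per_feature: int = 10) -> int:
    """Back-of-the-envelope dataset size estimate.

    A heuristic, not a guarantee: rare edge cases and complex,
    non-linear models typically require far larger datasets.
    """
    return num_features * samples_per_feature * classes

# Hypothetical example: 5 features (name, age, gender, ticket price,
# seat number) and a binary outcome
print(rough_sample_estimate(num_features=5))  # -> 100
```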
04 How to collect data for AI
Once you’ve developed a data collection plan, it’s time to start collecting data. In some cases you may not have enough in-house data on hand, what you do have may be improperly formatted, or removing errors to improve its quality may not be cost-effective. In such cases, consider one of the following options:
- Generate in-house data: Automatically generate data from line-of-business apps and websites; from sensors embedded across your company’s physical spaces and assets; or by scraping social media, product review sites, and other online platforms.
- Collect primary or custom data: Conduct focus groups or surveys with end users, or scrape data from the web. Either way, this involves generating a unique dataset that’s tailored to your business problem.
- Export data from one algorithm to another: Use one algorithm as the foundation to train another. This collection method can save time and money, but it only works when moving from a general context to one that’s more specific.
- Generate synthetic data: Rather than using data containing personally identifiable information, which raises security and privacy concerns, you can generate synthetic data that replaces all of the personal data with random, anonymized values while approximating the same statistical relationships (see the sketch after this list). One limitation is that synthetic data may not reflect the full scope of the problem you’re trying to solve.
- Use open source datasets: Open source datasets can help accelerate your training, and there are several online providers of open source datasets (such as Kaggle and Data.gov). When using an open source dataset, be sure to consider its relevance, any security and privacy concerns, and its reliability and potential for bias.
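As a minimal illustration of the synthetic data idea, the sketch below fits a simple statistical model (a multivariate normal) to two numeric columns of a dataset and samples anonymized records with roughly the same correlation. Real synthetic data pipelines use far more sophisticated generators; the columns here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" data: age and annual spend for 1,000 customers
real = np.column_stack([
    rng.normal(40, 12, 1000),     # age
    rng.normal(2000, 600, 1000),  # annual spend
])

# Fit a simple generative model: mean vector + covariance matrix
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample synthetic records that approximate the same relationships,
# with no one-to-one link back to any real customer
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print("real corr:     ", np.corrcoef(real, rowvar=False)[0, 1])
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1])
```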
05 Different types of data for AI
Audio data
Audio data covers a range of sounds, including music, sounds made by animals and other objects, human sounds (e.g. coughs, sneezes, or snores), and other background noises.
Audio data is used in training algorithms for a variety of uses, including virtual assistants, smart car systems, smart home devices and appliances, voice bots, and voice recognition-enabled security systems.
Speech data
When it comes to capturing human language, it’s best to capture the audio in a live environment that reflects the scenario in which your AI solution will be used. For example, when capturing audio for an in-car solution, it’s best to record the driver speaking while they are driving.
If budget or time constraints prevent you from taking this approach, do your best to approximate background noise and other aspects of the environment in which customers will use your solution.
And when recording speech data in foreign languages, be sure to account for the many dialects. If different dialects aren’t recorded, variations in accent and pronunciation can make it difficult for an algorithm to understand a command or lead to inaccurate interpretations, whether for the entire user base or only a fraction of it.
Once speech is recorded, it is often paired with a text transcription version of the recording.
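A common way to store these pairings is a simple manifest that links each audio file to its transcription and basic metadata. The format below is a hypothetical illustration, not a standard; field names and file layouts vary by toolchain.

```python
import json

# Hypothetical manifest entries pairing recordings with transcripts
manifest = [
    {
        "audio_path": "recordings/driver_0001.wav",
        "transcript": "navigate to the nearest charging station",
        "language": "en-US",
        "dialect": "General American",
        "environment": "in-car, engine noise",
    },
    {
        "audio_path": "recordings/driver_0002.wav",
        "transcript": "call home",
        "language": "en-GB",
        "dialect": "Received Pronunciation",
        "environment": "in-car, radio on",
    },
]

# One JSON object per line is easy to stream during training
with open("speech_manifest.jsonl", "w", encoding="utf-8") as f:
    for entry in manifest:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```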
There are three types of speech data:
- Scripted speech: Scripted speech data is focused on the different ways in which the same set of words might be pronounced, based on different accents, dialects, and speech mannerisms. Typical uses include voice commands and wake words.
- Scenario-based speech: Scenario-based speech data is a step closer to natural language collection. The goal is to capture what a person would say, and how they would say it, in a particular situation, such as speaking a command to an audio assistant or mobile app.
- Unscripted/conversational speech: Unscripted or conversational speech data is usually generated by recording two or more people talking about a particular topic. The primary goal is to train algorithms on the dynamics of multi-speaker conversations, such as changes of topic, the flow of a conversation, unspoken assumptions between speakers, and multiple people talking at once.
Image/video/gesture recognition data
Visual data—video, still imagery, and gesture recognition imagery—is used in a wide variety of scenarios, such as robotics, gaming, autonomous vehicles, in-store inventory management, and defect detection systems. Imagery can be captured using a phone or camera, but if you’re doing it at scale (with multiple people taking pictures), everyone should use the same equipment.
In addition to the images, visual datasets include annotations describing what the algorithm should be looking for, and where it’s located in the image. This “ground truth” helps an algorithm understand the sorts of patterns it should look for and detect.
Image and video datasets must contain hundreds, if not thousands, of high-quality images. There must also be a wide variety of images. In addition to creating your own custom dataset, consider one of the many public sources of imagery data, such as ImageNet and MS COCO (Common Objects in Context).
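As a concrete illustration of ground truth, the snippet below shows a single bounding-box annotation in the spirit of the COCO format. It is a simplified, hypothetical fragment; consult the actual COCO documentation for the full schema.

```python
# A simplified, COCO-style annotation for one object in one image;
# "bbox" is [x, y, width, height] in pixels from the top-left corner
annotation = {
    "image": {"id": 1, "file_name": "street_0001.jpg",
              "width": 1920, "height": 1080},
    "annotations": [
        {
            "id": 101,
            "image_id": 1,
            "category": "pedestrian",    # what the algorithm should find
            "bbox": [704, 312, 86, 215], # where it is in the image
        }
    ],
}
```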
Text data
Text data can be gathered from a variety of sources, including documents, receipts, handwritten notes, and chatbot intent data. It is used to develop AI solutions that understand human language in text form, most notably in chatbots, search engines, virtual assistants, and optical character recognition.
Conversational AI systems, such as audio assistants and chatbots, also require large amounts of high-quality data in a variety of languages and dialects so they can be used by customers around the world.
06 Improving AI performance
Algorithm training isn’t over once an AI solution has been released to market. Data is a representation of real life, and over time the data used to train an algorithm becomes a less accurate reflection of current market conditions. This phenomenon is known as model drift, of which there are two types. Both require continued retraining of an algorithm.
- Concept drift happens when the relationship between the training data and the AI solution’s outputs changes, either suddenly or gradually. For example, a retailer might use historical customer data to train an algorithm, but when a monumental shift in consumer behavior occurs, the algorithm’s predictions will no longer reflect reality.
- Data drift, sometimes called covariate drift, happens when the input data used to train an algorithm no longer reflects the actual input data encountered in production. Typical causes include changes brought about by seasonality, demographic shifts, or an algorithm being used in a new geography.
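Data drift can often be detected by comparing the distribution of a feature in the training data against its distribution in live production data. Below is a minimal sketch using a two-sample Kolmogorov–Smirnov test from SciPy; the feature and alert threshold are hypothetical, and production systems typically monitor many features with more robust methods.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Hypothetical feature: customer age at training time vs. in production
train_ages = rng.normal(40, 12, 5000)
live_ages = rng.normal(47, 12, 5000)  # demographics have shifted

# Two-sample KS test: a small p-value suggests the distributions differ
stat, p_value = ks_2samp(train_ages, live_ages)

ALPHA = 0.01  # hypothetical alert threshold
if p_value < ALPHA:
    print(f"Possible data drift detected (KS={stat:.3f}, p={p_value:.2e})")
else:
    print("No significant drift detected on this feature")
```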
07 Signs that it is time to retrain
Establish a benchmark
Monitor feedback
08 Reducing bias through data collection
Humans will inescapably introduce bias when training an algorithm. For example, in the test of a widely-used facial recognition solution, the ACLU found that the software “incorrectly matched 28 members of Congress, identifying them as other people who have been arrested for a crime.” Of particular note, the false matches were disproportionately of people of color.
Following are some of the different types of biases that can materialize:
- Bias with pre-processing: You don’t have domain expertise, don’t fully understand the data, or don’t have a sufficient understanding of the variables.
- Feature engineering bias: A machine learning model’s treatment of an attribute, or set of attributes (e.g. social status, gender, ethnic characteristics), negatively impacts its results or predictions.
- Data selection bias: A training dataset isn’t large or representative enough, leading to a misrepresentation of the actual population.
- Model training bias: There’s an inconsistency between the actual and trained model results.
- Model validation bias: A machine learning model’s performance hasn’t been sufficiently assessed using testing data that’s separate from the training data.
Careful data collection and human intervention can mitigate these biases, but doing so takes planning ahead, an understanding of the domain in question, and an awareness of how the presence or absence of a data point can skew the AI solution’s results.
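One simple, illustrative check for data selection bias is to compare subgroup proportions in a training dataset against reference proportions for the target population; large gaps suggest the dataset misrepresents who the solution will actually serve. The group labels, figures, and threshold below are hypothetical assumptions only.

```python
from collections import Counter

# Hypothetical subgroup labels attached to each training example
training_groups = ["group_a"] * 800 + ["group_b"] * 150 + ["group_c"] * 50

# Hypothetical share of each subgroup in the real target population
population_share = {"group_a": 0.55, "group_b": 0.30, "group_c": 0.15}

counts = Counter(training_groups)
total = sum(counts.values())

print(f"{'group':<10}{'train %':>10}{'target %':>10}{'gap':>8}")
for group, target in population_share.items():
    observed = counts.get(group, 0) / total
    gap = observed - target
    flag = "  <-- check" if abs(gap) > 0.10 else ""  # hypothetical threshold
    print(f"{group:<10}{observed:>9.1%}{target:>9.1%}{gap:>+8.1%}{flag}")
```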
At a more fundamental level, by working with people from a diverse range of backgrounds, collaboration styles, and worldviews, you can mitigate biases and develop responsible AI solutions.
09 The benefits of a data collection partner
Cost savings
Bias reduction
Expertise
Time to market acceleration
10 What to look for in a data collection partner
Quality
Customization
Scalability
Agility
Speed
Privacy and security
Ethical crowdsourcing
11 Building a custom data collection process
In some cases, you may need to build out a custom data collection process, such as when expanding to a different geography, launching a new business initiative, or conducting business in a market with especially sensitive security and privacy concerns. Creating a data collection plan that’s tailored to your needs enables you to collect the right amount of data that’s specific to your brand or target audience.
In addition, you can account for environmental factors that are unique to a particular industry or country and gain the level of visibility and insight that are necessary to build an explainable AI solution. No matter your challenges, an experienced data collection partner can help you build a custom data collection process that meets your needs.
Learn more about LXT’s custom data collection services here.