Data Quality

Data as the foundation for machine learning applications

Lisa Moore
October 14, 2021

The increasing adoption and advancement of cloud technology are driving unprecedented access to vast amounts of data. Cloud technology has also accelerated machine learning (ML) and artificial intelligence (AI) systems, giving companies the power to crunch expansive volumes of data. These ML and AI solutions now deliver meaningful insights that guide business decisions and automate operations, sales, and marketing processes. Modern AI/ML and analytical platforms are architected to ingest and evaluate data attributes at scale. As a result, the number and variety of input attributes that feed into these systems are far less constrained by size or volume than they once were. Tech-savvy companies that take advantage of this breadth will reap the benefits.

Our resident expert, Lisa Moore, answers pressing questions about how companies can leverage AI and ML for advanced applications to serve their clients better. For a glossary of the terms used in this Q&A, download our whitepaper, “A marketer’s guide to AI and machine learning.”

MEET THE EXPERT

Lisa Moore

Account Director

With over 25 years of data industry experience, Lisa has a deep knowledge and understanding of actionable data and predictive outcomes. She is passionate about architecting data-driven solutions that fit customer needs and helping customers exceed their business goals. She is an avid listener, has an insatiable sense of curiosity, and loves to network with like-minded professionals.

Why invest in AI/ML?

If you’re reading this, you probably already know why you need to invest in these technologies. But for the uninitiated, or those who want to make a business case, let’s look. The benefits of AI-powered analytics are significant; McKinsey and Co. reported deep learning techniques could enable the creation of between $3.5 trillion and $5.8 trillion in value each year.¹ AI and ML can help companies with everything from autonomous testing to audience insights, predictive marketing, and business intelligence by using algorithms, data, and predictive models to analyze and detect patterns in data. When used correctly, these tools help businesses create a better product that will optimize customer experience and create a robust competitive advantage.

How can companies optimize AI/ML to provide enhanced insights to their customers?

For effective ML, successful organizations tap into a wide variety of data sourced from inside and outside their organizations. Investing in high-quality third-party data with an extensive array of attributes is one of the best ways to gain deep knowledge of your customers and prospects. ML models tend to improve as you feed in more relevant data attributes. Third-party elements make strong ML predictors because business compilation rules require the triangulation of multiple independent sources to activate a feature for sale. As a result of this large volume of input sources, third-party data is highly accurate, deep, and more expansive than data gathered in-house or sourced directly from the customer.

The wider the breadth of data used for decision-making, the richer the resulting insights. So, data scientists develop algorithms that train and tune ML models to improve data quality and model performance. It boils down to this: “quality data in, quality insights out.”

A good reason to work with a third-party data provider is that they deliver attributes in an ML-ready format. Features are normalized, fixed-field, and essentially free of the pitfalls and inconsistencies that plague some first-party data sets. When it comes down to it, first-party data only offers a partial view of customers and prospects. Third-party data delivers dimensional knowledge of your customers: lifestyle, behaviors, demographics, affluence, firmographics, and more.

What firmographic or demographic data is critical for AI/ML solutions?

The best third-party data is comprehensive, accurate, updated frequently, and easily integrated into your systems and products. In addition, using third-party data sources gives an organization an edge because marketers better understand how customers behave and engage across channels by combining third-party data with first-party data. When investing in third-party data, seek actionable attributes that are not already in your first-party data repository.

These use cases demonstrate how data drives ML success:

Incomplete contact information: Suppose your first-party data only has an email address. By reverse appending your hashed email with personally identifiable information (PII) like first name, last name, postal address, or company name, you capture the individual’s identity, which becomes the key that allows you to recognize whether this individual is a customer. Once you recognize the individual, you gain first-party insight into their interactions with other parts of the organization and can fill in missing data for your identity graph. In addition, the now-complete PII record opens the opportunity to append hundreds of additional data points from third-party providers.
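As a rough illustration of the reverse-append idea, the sketch below normalizes an email, hashes it, and looks it up in a reference file keyed by hashed email. Everything here is an assumption for illustration: the use of SHA-256, the `REFERENCE` table, and the field names are hypothetical, and a real provider will specify its own hashing and matching rules.

```python
import hashlib


def hash_email(email: str) -> str:
    """Trim and lowercase an email, then return its SHA-256 hex digest.

    SHA-256 is an assumption; use whatever your data provider requires.
    """
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical third-party reference file, keyed by hashed email.
REFERENCE = {
    hash_email("pat.lee@example.com"): {
        "first_name": "Pat",
        "last_name": "Lee",
        "postal_code": "30301",
        "company": "Example Co",
    },
}


def reverse_append(record: dict, reference: dict) -> dict:
    """Return a copy of the record with any missing reference fields filled in."""
    match = reference.get(hash_email(record["email"]), {})
    enriched = dict(record)
    for field, value in match.items():
        # Keep first-party values when they already exist.
        enriched.setdefault(field, value)
    return enriched


first_party = {"email": "  Pat.Lee@Example.com "}
enriched = reverse_append(first_party, REFERENCE)
```

Normalizing before hashing matters because the same address typed with different casing or stray spaces would otherwise produce different hashes and fail to match.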

Add static attributes to predict behavior better: Static attributes like date of birth and gender make solid feature investments because, in general, these attributes offer near-perfect coverage, and they don’t change over time. Features with comprehensive coverage are fantastic predictors.

Enrich patient data for better healthcare outcomes: The COVID pandemic accelerated the long-overdue rise of data-driven healthcare approaches. In healthcare, one of the hottest analytical trends is understanding Social Determinants of Health. According to the CDC, Social Determinants of Health (SDOH) are conditions where people live, learn, work, and play that affect a wide range of health and quality-of-life risks and outcomes. Unfortunately, these facets of well-being have been historically absent in most healthcare data lakes. Enriching patient data in a privacy-safe and HIPAA-compliant way with SDOH fields like socioeconomic status, lifestyle and behaviors, education, and POI data (which identifies the proximity to business locations that support care, like clinics, food banks, churches, hospitals, etc.) leads to predictions that result in improved treatment outcomes and healthier communities.

Maximize political donations and mission engagement: If you are working on a political campaign, the number of registered voters, their political party affiliation, and the affluence of the individuals (socioeconomic status, income, and net worth) are relevant to supporting fundraising initiatives.

Optimize Account-Based Marketing strategies: Winning ABM requires understanding your target’s business and each contact’s role, peers, and reporting relationships in their organization. To put this powerful technique to work, you need:

Organization firmographics, including corporate linkage, SIC, revenue, and employee size.
Contact-level data that provides the requisite insight into buying groups and key decision-makers.

It is impossible to capture the depth of this kind of firmographic data via a lead form. The fastest, most accurate way to realize this information is through third-party enrichment.

What criteria should companies look for when choosing a data provider? 

All datasets are flawed. That’s why going the extra mile to vet and interrogate data quality from your chosen partner is so important.

Offline data is the crown jewel of third-party data. Winning third-party data is rooted in offline information: behavior, actions, and transactions. As a result, offline third-party datasets are the fuel for rock-solid, scalable analytics and predictions. Offline data has a long history of success, dating back to the original US Census in 1790. Today, more than 230 years later, Census data sets the stage for public and social policy.

The questions to ask yourself when curating third-party data for AI and ML are: “How was this data collected? What does this data tell me about my customer that I don’t already know? Is this data unique from the first-party data that I am already capturing? Is there sufficient coverage? Is it updated and timely?”

To identify which third-party vendor will best be suited to inform your ML/AI analytics, ask these questions:

Is this data deterministic? When you make predictions, you want to have the most accurate information possible. Primary sources of reliable data include:

1. Voter Registration
2. Utility connects
3. Professional licenses
4. Vehicle Registrations
5. Real estate and deed information
6. Known purchase transactions and subscriptions
7. SEC Filings
8. New Business Filings
9. Bankruptcy
10. Domain registration
11. Change of Address

Does this data have sufficient coverage? Depth in your data leads to reliable attributes that perform consistently over the long haul and are reliable predictors. You might ask the provider: How many sources does your organization rely on to build its data? Is each record linked to an identifiable individual? Leveraging multiple sources leads to comprehensive coverage and high accuracy. Date of Birth, for example, is a highly sought-after attribute because everyone has one! A reputable consumer data partner for age should match your first-party data asset at a rate of 70-90%.
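One simple way to put numbers on coverage is to measure two things: how often an attribute is populated (fill rate) and what share of your first-party records a vendor can match (match rate). The sketch below is a minimal example under assumed inputs; the function names, the record layout, and the sample IDs are hypothetical, and the 70% floor is taken from the 70-90% range mentioned above.

```python
def fill_rate(records, field):
    """Share of records with a non-empty value for the given field."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def match_rate(first_party_ids, vendor_ids):
    """Share of first-party IDs that also appear in the vendor file."""
    first_party = set(first_party_ids)
    if not first_party:
        return 0.0
    return len(first_party & set(vendor_ids)) / len(first_party)


# Hypothetical sample: 10 first-party IDs, 8 of which the vendor can match.
rate = match_rate(range(10), range(2, 20))
meets_benchmark = rate >= 0.70  # lower bound of the 70-90% range

# Hypothetical sample: only one of four records has a date of birth.
dob_fill = fill_rate(
    [{"dob": "1980-01-01"}, {"dob": ""}, {"dob": None}, {}],
    "dob",
)
```

Running both checks on a sample file before signing a contract gives you a concrete basis for comparing vendors.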

A well-known adage in B2B lead generation is that the more information you ask for, the lower your response rate. Just like in dating, it makes sense to get to know someone before you share all the details of your life. Most digital B2B relationships start with capturing an email. Appending firmographic data to a B2B email contact will speak volumes about an individual’s motivation and the resources their organization may have to support investment in your solution.

What is the recency? How current is this information, and how often is it updated? Like milk on a shelf in your fridge, data begins to degrade as soon as it is acquired. Stale or flawed data in your ML models will undermine the accuracy of predictions.
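A recency check can be as simple as flagging records that have not been updated within a chosen window. The sketch below assumes each record carries a `last_updated` date; that field name, the sample records, and the 365-day default are illustrative assumptions, not a recommended standard.

```python
from datetime import date


def stale_records(records, as_of, max_age_days=365):
    """Return records whose last_updated date is older than max_age_days."""
    return [
        r for r in records
        if (as_of - r["last_updated"]).days > max_age_days
    ]


# Hypothetical records with an assumed last_updated field.
records = [
    {"id": 1, "last_updated": date(2021, 9, 1)},
    {"id": 2, "last_updated": date(2019, 5, 1)},
]

# A fixed as_of date keeps the example reproducible.
stale = stale_records(records, as_of=date(2021, 10, 14))
```

The right window depends on the attribute: a change of address goes stale quickly, while a date of birth does not.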