Defining data quality with SLAs

What does data quality even mean anyway?

Almost anyone working in or around software engineering or infrastructure has heard of SLAs (Service Level Agreements) at some point. Even so, they're often not fully understood. In this post, we'll break down the SLA concept and show how it can be applied to data quality.

At Bigeye, we believe SLAs can help answer a really big question for both data teams and the data consumers who depend on them: what does “data quality” mean exactly?

—your friendly neighborhood data engineer

One deceptively hard part of keeping data quality high is agreeing on what high-quality data means in the first place.

Data consumers often have an intuitive sense of what “good” data is. But that intuition is rarely quantified and almost never documented. On top of that, the rate of change modern data platforms enable makes data quality a moving target.

Whatever “perfect” data quality might mean in theory isn’t achievable in practice.

To make this concrete, imagine we build high-end keyboards and buy mechanical key switches from a supplier. How clicky is too clicky? How mushy is too mushy? If only a couple of keys out of 10,000 are off, is that okay? Or do we get a refund?

We don’t expect our supplier to magically know what perfect clickiness means to us, and we probably don’t expect an entire batch to be perfect down to the last switch in a batch of 10,000. But we can agree on what’s good enough to do business and what isn’t. These definitions and tolerances can be captured by an SLA.

Just like the keyboard-switch-makers and the fancy-keyboard-makers, data engineers and data consumers need to agree on the practical definition of data quality for a given use case, such as a dashboard, a scheduled export, or an ML model. That definition should crisply define how data quality will be measured, and what will happen if the standard isn’t met. But that’s easier said than done.

Without a crisp definition of quality, ambiguity can create tensions and disagreements between the data team and data consumers.

For example, an analyst might ask, “how fast can the transactions data be ready for querying?” But what they really mean is, “how soon will it be correct for this analysis I’m doing on recent order behavior?” The distinction matters when the dataset has late-arriving updates, like cancellations of orders that have already been placed. In that case, is the number for last month’s revenue usable before the 30-day window for merchandise returns has closed?

This ambiguity can lead to analysis that isn’t correct because the data consumer didn’t ask exactly the right question, and the data engineer didn’t know the context to understand what their real need was. If somebody points out some discrepancy after the analysis has been shared around, it could cause headaches for both the analyst and the data engineer.

This is where the SLA comes in.

Telecom operators started using SLAs in the 1980s, and now they can be found in lots of places, like Google’s Site Reliability Engineering function. They were designed to clarify and document expectations between a service provider and their users.

Similarly, the data team is effectively a service provider to both internal and external data consumers, and SLAs can bring the same level of accountability to that relationship. The only difference for internal SLAs is that there’s usually no fine or refund for missed targets, since internal teams rarely pay one another directly.

The most commonly used SLAs I saw on Uber’s data platform were for freshness: the data in a given table might be guaranteed to be no more than 48 hours delayed, for example. These were often paired with a completeness metric (a longer topic to save for a dedicated future post). This gave consumers a clear expectation of how recent the data in their queries and dashboards should be, and if they needed it faster, it prompted a conversation with data engineering.

SLIs and SLOs and SLAs, oh my!

Service Level Indicators (SLIs) measure specific aspects of performance. In software engineering, SLIs might be “web page response time in milliseconds” or “gigs of storage capacity remaining.” For a data engineer, SLIs might be “hours since dataset refreshed” or “percentage of values that match a UUID regex.” When building an SLA for a specific use case, the SLIs should be chosen based on what data the use case relies on. If an ML model can tolerate some null IDs but not too many, the rate of null IDs is a great SLI to include in the SLA for that model.
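To make the data-engineering SLIs concrete, here’s a minimal sketch of computing both of them in Python. The sample timestamps and ID values are hypothetical stand-ins for a real table:

```python
import re
from datetime import datetime, timezone

# Hypothetical refresh timestamp and "current" time for a dataset.
last_refresh = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)

# SLI 1: hours since the dataset refreshed.
hours_since_refresh = (now - last_refresh).total_seconds() / 3600

# SLI 2: percentage of values that match a UUID regex.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)
ids = [
    "123e4567-e89b-12d3-a456-426614174000",
    "not-a-uuid",
    "123e4567-e89b-12d3-a456-426614174001",
]
uuid_match_rate = sum(bool(UUID_RE.match(v)) for v in ids) / len(ids) * 100

print(hours_since_refresh)        # 12.0
print(round(uuid_match_rate, 1))  # 66.7
```

In practice these values would come from your warehouse metadata and a profiling query rather than in-memory lists, but the measurements themselves are this simple: each SLI is just a number you can track over time.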

Service Level Objectives (SLOs) give each SLI a target range. In software engineering, it could be “99% of pages in the last 7 days returned in under 90ms.” Going back to our data engineering SLIs, the relevant SLOs could be “less than 6 hours since the dataset refreshed” or “at least 99.9% of values match a UUID regex.” As long as an SLI is within the range set by its SLO, it’s considered acceptable for the target use case, and that aspect of its parent SLA is being met.
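An SLO is then just a target range checked against each SLI. Here's a small sketch, using hypothetical SLI values and the two objectives quoted above:

```python
# Each SLO gives an SLI a target range (thresholds from the examples above).
slos = {
    "hours_since_refresh": lambda v: v < 6,      # less than 6 hours old
    "uuid_match_rate":     lambda v: v >= 99.9,  # at least 99.9% match
}

# Hypothetical current SLI measurements.
slis = {"hours_since_refresh": 12.0, "uuid_match_rate": 99.95}

# An SLI within its SLO's range is acceptable for the target use case.
results = {name: check(slis[name]) for name, check in slos.items()}
print(results)  # {'hours_since_refresh': False, 'uuid_match_rate': True}
```

Here the freshness objective is missed while the UUID objective is met, so that aspect of the parent SLA is not being met and a conversation (or an alert) is warranted.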

We see six major steps to putting SLAs to work on data quality.

Implementing SLAs can sometimes take cultural change. No software can magically create alignment between data engineers and users — but we’re trying to make it easier by adding support for creating SLAs right inside Bigeye. Removing the manual effort needed to create and track them in spreadsheets takes away one big barrier to adoption that could prevent a data team from getting started.
