Working with Datasets and Indicators
A primer on datasets and indicators, and how to use them to measure data quality
Introduction
The central constructs in the Lariat framework for Data Quality are Datasets and Indicators.
A dataset is a unique combination of a Dataset Name, a Schema, and a Source ID.
An indicator is a calculation on that dataset, performed periodically.
Multiple Datasets may share the same Source, such as multiple tables with differing schemas in a production warehouse.
Similarly, multiple Datasets may share the same Schema but have different Sources, such as staging and prod versions of a table.
Names and Schemas are generally inferred by Lariat integrations. For example, the Snowflake integration uses the information_schema
to determine table names and schemas that are mirrored in Lariat.
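For illustration only, a metadata query of roughly this shape could surface the needed names and types (the actual query Lariat issues is an internal detail, and PUBLIC is just an example schema name):

```sql
-- A sketch of the kind of metadata query such an integration might run
-- against Snowflake's information_schema; not Lariat's actual query.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'PUBLIC'
ORDER BY table_name, ordinal_position;
```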
The Source, however, is always provided by the user, most often as a source_id parameter when configuring an integration. The choice of a source_id is important since a given Indicator calculation may only operate on a single source_id. A source_id must be unique within an organization.
Although the above examples refer to data warehouses, the notion of a Dataset is decoupled from any assumption about where the data lives. The below are all valid Datasets in Lariat:
A table in a relational database
An S3 path containing date-partitioned Parquet files
An in-memory pandas DataFrame
Building Indicators
After a Dataset has been registered with Lariat, you may build Indicators on top of it.
Indicators are simply numeric values calculated from a dataset. They are designed with a few properties in mind that make them easy to layer on top of your data stack.
Periodic
Indicator Queries run:
On a schedule you define
Over a window of data you define
With customizable offsets that let you account for upstream lag
Read-only
An indicator execution fetches a result set and computes a numeric value from that result set without modifying any data.
This value may be tagged with any dimensions retrievable from the same result set.
Composable
If multiple Indicators are defined on top of overlapping result sets, Lariat will attempt to compose them into the lowest possible number of queries to execute against your datasets.
This lets you express Indicator queries at a level of simplicity that is meaningful to your team, rather than overloading them into complex SQL statements.
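For instance, an avg and a max Indicator defined over the same dataset and window could, in principle, be served by a single scan. A hypothetical composed query might look like the following (the table and column names are illustrative, not from Lariat):

```sql
-- Hypothetical composition: two separately defined Indicators,
-- avg(latency_ms) and max(latency_ms) over the same window,
-- answered by one scan instead of two.
SELECT AVG(latency_ms) AS avg_latency_ms,
       MAX(latency_ms) AS max_latency_ms
FROM events
WHERE event_ts >= 1704067200 AND event_ts < 1704153600;  -- window bounds added at evaluation time
```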
Several Indicators can be activated by default using the Dataset Configuration page at /datasets/:id/configure.
Working with Timestamps
In order to calculate Indicators, Lariat requires that registered Datasets possess at least one member of their schema that can be interpreted as a timestamp. The timestamp field ensures that Indicator calculations are performed over bounded windows of the data, and that these calculations advance forward in time as new data arrives.
A valid timestamp field is a numeric field in the schema representing a UNIX Epoch in seconds. However, in the event that such a field doesn't exist, users may instruct Lariat to derive it from existing fields on the Dataset Configuration page:
In the above example we are using an arithmetic operator to convert eventtimemilli, a UNIX epoch in milliseconds, to a UNIX epoch in seconds. You may use any operator or function to aid this conversion, as long as it is supported by your underlying data source (for example, to_unixtime in Presto/Trino, or date_part in Snowflake, Postgres, and others).
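As a minimal sketch, the derivation from the example above amounts to an expression like this (my_table is an illustrative table name; the exact SQL the UI generates may differ):

```sql
-- Converting eventtimemilli (milliseconds) to a seconds-granularity
-- UNIX epoch with an arithmetic operator.
SELECT eventtimemilli / 1000 AS event_ts
FROM my_table;

-- From a native timestamp column, a source-supported function works too,
-- e.g. to_unixtime(event_time) in Presto/Trino,
-- or date_part('epoch', event_time) in Postgres.
```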
Indicator Requirements
An indicator calculates a scalar value from a dataset, with the option for grouping and filtering by any field that is part of the dataset's schema. As mentioned above, indicators require a timestamp field to be defined on the schema, so that calculations may be limited to an appropriate subset of data.
Basic Indicator Patterns
The form builder section in the Create Indicator page enables common indicator definitions to be quickly specified from the fields below; a SQL sketch of how these fields combine follows the list.
Dataset: The dataset which the indicator will be computed on
Column: The field in the dataset schema which the Indicator refers to
Calculation: One of sum, avg, min, max, count, or countDistinct
Group By: Fields in the schema that you wish this indicator calculation to be grouped on. Fields specified under Group By will be available as dimensions for graphing and alerting.
Filters: Boolean operations that limit the result set on which the Indicator will be calculated. These might commonly be used to discard null or test values, and prevent them from skewing indicator calculations.
Timestamp Field: The field in the schema to use as the timestamp for deciding the window of data under observation. See Lookback Period, Evaluation Interval, and Offset
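Taken together, these fields correspond to a query shaped roughly like the following sketch (orders, amount, country, is_test, and order_ts are illustrative names; the SQL Lariat actually generates may differ):

```sql
-- Sketch of a form-built Indicator with
--   Calculation: sum(amount), Group By: country,
--   Filters: is_test = false, Timestamp Field: order_ts
SELECT country, SUM(amount) AS indicator_value
FROM orders
WHERE is_test = false
  AND order_ts >= 1704067200 AND order_ts < 1704153600  -- window bounds added by Lariat
GROUP BY country;
```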
Lookback Period, Evaluation Interval, and Offset
Every indicator definition requires a Lookback Period and an Evaluation Interval. Optionally, users may specify an Offset for the evaluation time as well.
The Lookback Period specifies the duration of the time window in the data over which the Indicator should be calculated. It is what you would use, for example, to distinguish a count over the last hour from a count over the last day. The Lookback Period refers specifically to the Timestamp field, as described under Working with Timestamps.
The Evaluation Interval specifies how frequently the Indicator should be evaluated.
The Offset is an optional setting to delay each evaluation by a fixed amount of time. The beginning and end of the Lookback Period for each evaluation interval are determined prior to applying this offset, meaning that the Offset can be used to account for any systemic delays in the arrival of new data.
As a whole, you may think of these controls as appending an additional WHERE clause to your indicator query, specifying the start and end timestamps of the data under observation.
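As a sketch, with a seconds-granularity timestamp field named event_ts (an illustrative name), each evaluation effectively appends something like:

```sql
-- Effective window predicate appended at evaluation time; the bounds are
-- derived from the scheduled evaluation time, Lookback Period, and Offset.
WHERE event_ts >= <window_start_epoch>
  AND event_ts <  <window_end_epoch>
```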
The Create Indicator screen shows you the Next 5 Evaluations that will be scheduled for this indicator.
By selecting an appropriate Lookback Period and Evaluation Interval, you can specify whether calculations should operate on overlapping or non-overlapping windows of data.
The below specifies a calculation over a Lookback Period spanning 1 day, as timestamped by the ddate field. Since the evaluation interval is every day at midnight, the Indicator produces a daily calculation over strictly non-overlapping windows of data.
In the below example, since the evaluation interval is Every hour, the Indicator performs a rolling hourly calculation on overlapping windows of data, while always observing a day's worth of data.
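To make the two schedules concrete, the following sketch (with illustrative dates) shows the windows each would observe:

```sql
-- Lookback: 1 day, evaluated every day at midnight (non-overlapping):
--   eval 2024-01-02 00:00 UTC -> ddate in [2024-01-01 00:00, 2024-01-02 00:00)
--   eval 2024-01-03 00:00 UTC -> ddate in [2024-01-02 00:00, 2024-01-03 00:00)
--
-- Lookback: 1 day, evaluated every hour (overlapping):
--   eval 2024-01-02 00:00 UTC -> ddate in [2024-01-01 00:00, 2024-01-02 00:00)
--   eval 2024-01-02 01:00 UTC -> ddate in [2024-01-01 01:00, 2024-01-02 01:00)
```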
To understand the utility of Offset, contrast this with the below definition.
If we expect a maximum of a 10-minute delay in the arrival of data with new ddate values, setting Offset the evaluation time by: 10 Minutes delays each evaluation until we can be certain the Indicator is calculated on a "whole" window of data.
Note that this is different from specifying the Evaluation Interval to be 10 minutes past the hour. Since Lookback Periods are calculated based on the requested evaluation times, the window of observed data would then include timestamps between :00 UTC and :10 UTC, and this data may not have arrived in our dataset yet.
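A worked sketch of the difference, using illustrative clock times and the 1-day Lookback from above:

```sql
-- With a 1-day Lookback and up to a 10-minute data delay:
--
-- Offset the evaluation time by: 10 Minutes
--   scheduled eval 01:00 UTC, actually runs 01:10 UTC
--   -> still observes ddate in [01:00 yesterday, 01:00 today), fully arrived
--
-- Evaluation Interval at 10 minutes past the hour (no Offset)
--   scheduled eval 01:10 UTC
--   -> observes ddate in [01:10 yesterday, 01:10 today), where data
--      timestamped between 01:00 and 01:10 today may not have arrived yet
```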
Using SQL
The SQL Editor in the Create Indicator page allows for free-form entry of a SQL query to evaluate against your dataset. This is helpful when your query cannot be sufficiently expressed using just the form builder, and is suitable for complex expressions such as CASE statements or custom calculations.
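For example, a sketch of a null-rate Indicator using a CASE expression (events and user_id are illustrative names, not from this doc):

```sql
-- Fraction of rows in the window with a NULL user_id. The timestamp
-- predicate is omitted because Lariat appends it at evaluation time
-- (see below).
SELECT SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
       AS null_user_id_rate
FROM events;
```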
As a rule of thumb, if the query can be executed directly against your dataset using the native database driver, it can also be validated and run as an Indicator. This means that the SQL editor supports common ANSI SQL constructs such as WHERE and GROUP BY, as well as the specific aggregations and functions that may be supported by your data source (such as KURTOSIS in Snowflake).
The SQL Query defined in the editor is subject to the same evaluation logic specified under Lookback Period, Evaluation Interval, and Offset. Hence users should avoid including timestamp ranges in the SQL query, as these will be inferred and applied at indicator evaluation time.
The SQL Query defined in the editor is also subject to passing validation (via the Validate button) prior to indicator creation.
SQL Query Validation asserts a few things about your proposed query to provide a reasonable guarantee that it can be successfully executed by Lariat agents. The exact validation logic varies depending on the data source the query will operate on, but generally speaking Lariat tests for:
Scalar Computation: Is the query extracting a singular numeric value from the dataset? (Or, where a GROUP BY is specified, a singular numeric value per group?)
Compatibility with the dataset Schema: Do the named fields exist in the schema, and do they have the appropriate types and casting for the requested operation? For example, SUM on a string field without a CAST would fail validation.
Dialect Validity: Is the SQL valid for the chosen data source dialect?
Should the above conditions not be satisfied, the SQL editor will display an appropriate error message.
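For instance, a query along these lines would pass the type check where a bare SUM on the string column would fail (amount_str and orders are illustrative names):

```sql
-- Casting a string column to a numeric type so the SUM validates.
SELECT SUM(CAST(amount_str AS DOUBLE)) AS total_amount
FROM orders;
```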
Defining custom Datasets
Lariat allows you to define your own datasets derived from existing datasets captured by Lariat's integrations. This can be a useful shorthand for common columnar operations that need to be performed in several Indicators.
For example, if one wanted to use day_of_week as a grouping key for several indicators on a dataset, they may create a new dataset, registering day_of_week as a separate column.
Now day_of_week is a first-order part of the schema for the MyTableWithDayOfWeek dataset, and the new column is available for Indicator creation.
At indicator evaluation time, custom datasets are expanded as "subqueries" (or their equivalent) in the wrapping indicator query. It is important to remember that custom datasets created in Lariat are not actually materialized in your data source. They are intended as a convenience to aid indicator creation and support DRY principles. They should not be used in situations where a long-term materialized representation of new columns or aggregations is required. Where you do require materialization, it is preferable to use your database-native method to do so, and then register the resulting materialized view / table in Lariat as another dataset.
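As an illustrative sketch, an Indicator counting rows by day_of_week on the custom dataset might expand to something like the following (MyTable, ddate, and the Snowflake DAYOFWEEK function are assumptions for this example; the exact expansion depends on your source's dialect):

```sql
-- The custom dataset is inlined as a subquery at evaluation time;
-- nothing is materialized in the underlying source.
SELECT day_of_week, COUNT(*) AS indicator_value
FROM (
    SELECT t.*, DAYOFWEEK(t.ddate) AS day_of_week
    FROM MyTable t
) AS MyTableWithDayOfWeek
GROUP BY day_of_week;
```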