Working with Datasets and Indicators
A primer on datasets and indicators, and how to use them to measure data quality
Introduction
The central constructs in the Lariat framework for Data Quality are Datasets and Indicators.
A dataset is a unique combination of a Dataset Name, a Schema, and a Source ID.
An indicator is a calculation on that dataset, performed periodically.
Multiple Datasets may share the same Source, such as multiple tables with differing schemas in a production warehouse.
Similarly, multiple Datasets may share the same Schema but have different Sources, such as staging and prod versions of a table.
Names and Schemas are generally inferred by Lariat integrations. For example, the Snowflake integration uses the information_schema
to determine table names and schemas that are mirrored in Lariat.
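For illustration only, a metadata query of roughly this shape could surface the needed names and types (the actual query Lariat issues is an internal detail, and PUBLIC is just an example schema name):

```sql
-- A sketch of the kind of metadata query such an integration might run
-- against Snowflake's information_schema; not Lariat's actual query.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'PUBLIC'
ORDER BY table_name, ordinal_position;
```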
The Source, however, is always provided by the user, most often as a source_id parameter when configuring an integration. The choice of a source_id is important since a given Indicator calculation may only operate on a single source_id. A source_id must be unique within an organization.
Although the above examples refer to data warehouses, the notion of a Dataset is decoupled from any assumption about where the data lives. The below are all valid Datasets in Lariat:
A table in a relational database
An S3 path containing date-partitioned Parquet files
An in-memory pandas DataFrame
Building Indicators
After a Dataset has been registered with Lariat, you may build Indicators on top of it.
Indicators are simply numeric values calculated from a dataset. They are designed with a few properties in mind that make them easy to layer on top of your data stack.
Periodic
Indicator Queries run:
On a schedule you define
Over a window of data you define
With customizable offsets that let you account for upstream lag
Read-only
An indicator execution fetches a result set and computes a numeric value from that result set without modifying any data.
This value may be tagged with any dimensions retrievable from the same result set.
Composable
If multiple Indicators are defined on top of overlapping result sets, Lariat will attempt to compose them into the lowest possible number of queries to execute against your datasets.
This lets you express Indicator queries at a level of simplicity that is meaningful to your team, rather than overloading them into complex SQL statements.
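For instance, an avg and a max Indicator defined over the same dataset and window could, in principle, be served by a single scan. A hypothetical composed query might look like the following (the table and column names are illustrative, not from Lariat):

```sql
-- Hypothetical composition: two separately defined Indicators,
-- avg(latency_ms) and max(latency_ms) over the same window,
-- answered by one scan instead of two.
SELECT AVG(latency_ms) AS avg_latency_ms,
       MAX(latency_ms) AS max_latency_ms
FROM events
WHERE event_ts >= 1704067200 AND event_ts < 1704153600;  -- window bounds added at evaluation time
```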
Several Indicators can be activated by default using the Dataset Configuration page at /datasets/:id/configure.
Working with Timestamps
In order to calculate Indicators, Lariat requires that registered Datasets possess at least one member of their schema that can be interpreted as a timestamp. The timestamp field ensures that Indicator calculations are performed over bounded windows of the data, and that these calculations advance forward in time as new data arrives.
A valid timestamp field is a numeric field in the schema representing a UNIX Epoch in seconds. However, in the event that such a field doesn't exist, users may instruct Lariat to derive it from existing fields on the Dataset Configuration page:
In the above example we are using an arithmetic operator to convert eventtimemilli, a UNIX epoch in milliseconds, to a UNIX epoch in seconds. You may use any operator or function to aid this conversion, as long as it is supported by your underlying data source (for example, to_unixtime in Presto/Trino, or date_part in Snowflake, Postgres, and others).
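As a minimal sketch, the derivation from the example above amounts to an expression like this (my_table is an illustrative table name; the exact SQL the UI generates may differ):

```sql
-- Converting eventtimemilli (milliseconds) to a seconds-granularity
-- UNIX epoch with an arithmetic operator.
SELECT eventtimemilli / 1000 AS event_ts
FROM my_table;

-- From a native timestamp column, a source-supported function works too,
-- e.g. to_unixtime(event_time) in Presto/Trino,
-- or date_part('epoch', event_time) in Postgres.
```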
Indicator Requirements
An indicator calculates a scalar value from a dataset, with the option for grouping and filtering by any field that is part of the dataset's schema. As mentioned above, indicators require a timestamp field to be defined on the schema, so that calculations may be limited to an appropriate subset of data.
Basic Indicator Patterns
The form builder section in the Create Indicator page enables common indicator definitions to be quickly specified from the fields below; a SQL sketch of how these fields combine follows the list.
Dataset: The dataset which the indicator will be computed on
Column: The field in the dataset schema which the Indicator refers to
Calculation: One of sum, avg, min, max, count, or countDistinct
Group By: Fields in the schema that you wish this indicator calculation to be grouped on. Fields specified under Group By will be available as dimensions for graphing and alerting.
Filters: Boolean operations that limit the result set on which the Indicator will be calculated. These might commonly be used to discard null or test values, and prevent them from skewing indicator calculations.
Timestamp Field: The field in the schema to use as the timestamp for deciding the window of data under observation. See Lookback Period, Evaluation Interval, and Offset
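Taken together, these fields correspond to a query shaped roughly like the following sketch (orders, amount, country, is_test, and order_ts are illustrative names; the SQL Lariat actually generates may differ):

```sql
-- Sketch of a form-built Indicator with
--   Calculation: sum(amount), Group By: country,
--   Filters: is_test = false, Timestamp Field: order_ts
SELECT country, SUM(amount) AS indicator_value
FROM orders
WHERE is_test = false
  AND order_ts >= 1704067200 AND order_ts < 1704153600  -- window bounds added by Lariat
GROUP BY country;
```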
Lookback Period, Evaluation Interval, and Offset
Every indicator definition requires a Lookback Period and an Evaluation Interval. Optionally, users may specify an Offset for the evaluation time as well.
The Lookback Period specifies the duration of the time window in the data over which the Indicator should be calculated. It is what you would use, for example, to distinguish a count over the last hour from a count over the last day. The Lookback Period refers specifically to the Timestamp field, as described under Working with Timestamps.
The Evaluation Interval specifies how frequently the Indicator should be evaluated.
The Offset is an optional setting to delay each evaluation by a fixed amount of time. The beginning and end of the Lookback Period for each evaluation interval are determined prior to applying this offset, meaning that the Offset can be used to account for any systemic delays in the arrival of new data.
As a whole, you may think of these controls as appending an additional WHERE clause to your indicator query, specifying the start and end timestamps of the data under observation.
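As a sketch, with a seconds-granularity timestamp field named event_ts (an illustrative name), each evaluation effectively appends something like:

```sql
-- Effective window predicate appended at evaluation time; the bounds are
-- derived from the scheduled evaluation time, Lookback Period, and Offset.
WHERE event_ts >= <window_start_epoch>
  AND event_ts <  <window_end_epoch>
```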
The Create Indicator screen shows you the Next 5 Evaluations that will be scheduled for this indicator.
By selecting an appropriate Lookback Period and Evaluation Interval, you can specify whether calculations should operate on overlapping or non-overlapping windows of data.
The below specifies a calculation over a Lookback Period spanning 1 day, as timestamped by the ddate field. Since the evaluation interval is every day at midnight, the Indicator produces a daily calculation over strictly non-overlapping windows of data.
In the below example, since the evaluation interval is Every hour, the Indicator performs a rolling hourly calculation on overlapping windows of data, while always observing a day's worth of data.
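To make the two schedules concrete, the following sketch (with illustrative dates) shows the windows each would observe:

```sql
-- Lookback: 1 day, evaluated every day at midnight (non-overlapping):
--   eval 2024-01-02 00:00 UTC -> ddate in [2024-01-01 00:00, 2024-01-02 00:00)
--   eval 2024-01-03 00:00 UTC -> ddate in [2024-01-02 00:00, 2024-01-03 00:00)
--
-- Lookback: 1 day, evaluated every hour (overlapping):
--   eval 2024-01-02 00:00 UTC -> ddate in [2024-01-01 00:00, 2024-01-02 00:00)
--   eval 2024-01-02 01:00 UTC -> ddate in [2024-01-01 01:00, 2024-01-02 01:00)
```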
To understand the utility of Offset, contrast this with the below definition.
If we expect a maximum of a 10-minute delay in the arrival of data with new ddate values, setting Offset the evaluation time by: 10 Minutes delays each evaluation until we can be certain the Indicator is calculated on a "whole" window of data.
Note that this is different from specifying the Evaluation Interval to be 10 minutes past the hour. Since Lookback Periods are calculated based on the requested evaluation times, the window of observed data would then include timestamps between :00 UTC and :10 UTC, and this data may not have arrived in our dataset yet.
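A worked sketch of the difference, using illustrative clock times and the 1-day Lookback from above:

```sql
-- With a 1-day Lookback and up to a 10-minute data delay:
--
-- Offset the evaluation time by: 10 Minutes
--   scheduled eval 01:00 UTC, actually runs 01:10 UTC
--   -> still observes ddate in [01:00 yesterday, 01:00 today), fully arrived
--
-- Evaluation Interval at 10 minutes past the hour (no Offset)
--   scheduled eval 01:10 UTC
--   -> observes ddate in [01:10 yesterday, 01:10 today), where data
--      timestamped between 01:00 and 01:10 today may not have arrived yet
```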
Using SQL
The SQL Editor in the Create Indicator page allows for free-form entry of a SQL query to evaluate against your dataset. This is helpful when your query cannot be sufficiently expressed using just the form builder, and is suitable for complex expressions such as CASE statements or custom calculations.
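For example, a sketch of a null-rate Indicator using a CASE expression (events and user_id are illustrative names, not from this doc):

```sql
-- Fraction of rows in the window with a NULL user_id. The timestamp
-- predicate is omitted because Lariat appends it at evaluation time
-- (see below).
SELECT SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
       AS null_user_id_rate
FROM events;
```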
As a rule of thumb, if the query can be executed directly against your dataset using the native database driver, it can also be validated and run as an Indicator. This means that the SQL editor supports common ANSI SQL constructs such as WHERE and GROUP BY, as well as the specific aggregations and functions that may be supported by your data source (such as KURTOSIS in Snowflake).
The SQL Query defined in the editor is subject to the same evaluation logic specified under Lookback Period, Evaluation Interval, and Offset. Hence users should avoid including timestamp ranges in the SQL query, as these will be inferred and applied at indicator evaluation time.
The SQL Query defined in the editor is also subject to passing validation (via the Validate button) prior to indicator creation.
SQL Query Validation asserts a few things about your proposed query to provide a reasonable guarantee that it can be successfully executed by Lariat agents. The exact validation logic varies depending on the data source the query will operate on, but generally speaking Lariat tests for:
Scalar Computation: Is the query extracting a singular numeric value from the dataset? (Or, where a GROUP BY is specified, a singular numeric value per group?)
Compatibility with the dataset Schema: Do the named fields exist in the schema, and do they have the appropriate types and casting for the requested operation? For example, SUM on a string field without a CAST would fail validation.
Dialect Validity: Is the SQL valid for the chosen data source dialect?
Should the above conditions not be satisfied, the SQL editor will display an appropriate error message.
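For instance, a query along these lines would pass the type check where a bare SUM on the string column would fail (amount_str and orders are illustrative names):

```sql
-- Casting a string column to a numeric type so the SUM validates.
SELECT SUM(CAST(amount_str AS DOUBLE)) AS total_amount
FROM orders;
```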
Defining custom Datasets
Lariat allows you to define your own datasets derived from existing datasets captured by Lariat's integrations. This can be a useful shorthand for common columnar operations that need to be performed in several Indicators.
For example, if one wanted to use day_of_week as a grouping key for several indicators on a dataset, they may create a new dataset, registering day_of_week as a separate column.
Now day_of_week is a first-order part of the schema for the MyTableWithDayOfWeek dataset, and the new column is available for Indicator creation.
At indicator evaluation time, custom datasets are expanded as "subqueries" (or their equivalent) in the wrapping indicator query. It is important to remember that custom datasets created in Lariat are not actually materialized in your data source. They are intended as a convenience to aid indicator creation and support DRY principles. They should not be used in situations where a long-term materialized representation of new columns or aggregations is required. Where you do require materialization, it is preferable to use your database-native method to do so, and then register the resulting materialized view / table in Lariat as another dataset.
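As an illustrative sketch, an Indicator counting rows by day_of_week on the custom dataset might expand to something like the following (MyTable, ddate, and the Snowflake DAYOFWEEK function are assumptions for this example; the exact expansion depends on your source's dialect):

```sql
-- The custom dataset is inlined as a subquery at evaluation time;
-- nothing is materialized in the underlying source.
SELECT day_of_week, COUNT(*) AS indicator_value
FROM (
    SELECT t.*, DAYOFWEEK(t.ddate) AS day_of_week
    FROM MyTable t
) AS MyTableWithDayOfWeek
GROUP BY day_of_week;
```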