# Working with Datasets and Indicators

### Introduction

The central constructs in the Lariat framework for Data Quality are Datasets and Indicators.

A dataset is a unique combination of a Dataset Name, a Schema, and a Source ID.

An indicator is a calculation on that dataset that is performed periodically.

Multiple Datasets may share the same Source, such as multiple tables with differing schemas in a production warehouse.

Similarly, multiple Datasets may share the same Schema but different Sources, such as staging and production versions of a table.

Names and Schemas are generally inferred by Lariat integrations. For example, the Snowflake integration uses the `information_schema` to determine table names and schemas that are mirrored in Lariat.

The Source, however, is always provided by the user, most often as a `source_id` parameter when configuring an integration. The choice of a `source_id` is important, since a given Indicator calculation may only operate on a single `source_id`. A `source_id` must be unique within an organization.

Although the above examples refer to data warehouses, the notion of a Dataset is decoupled from any assumption about where the data lives. The below are all valid Datasets in Lariat:

* A table in a relational database
* An S3 path containing date-partitioned Parquet files
* An in-memory pandas DataFrame

### Building Indicators

After a Dataset has been registered with Lariat, you may build Indicators on top of it.

Indicators are simply numeric values calculated from a dataset. They are designed with a few properties in mind that make them easy to layer on top of your data stack.

#### Periodic

Indicator Queries run:

* On a schedule you define
* Over a window of data you define
* With customizable offsets that let you account for upstream lag

#### Read-only

* An indicator execution fetches a result set and computes a numeric value from that result set without modifying any data.
* This value may be tagged with any dimensions retrievable from the same result set.

#### Composable

* If multiple Indicators are defined on top of overlapping result sets, Lariat will attempt to compose them into the smallest possible number of queries to execute against your datasets.
* This lets you express Indicator queries at a level of simplicity that is meaningful to your team, rather than overloading them into complex SQL statements.
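As an illustrative sketch (not Lariat's actual planner output; the table and field names are hypothetical), two Indicators defined over the same dataset, window, and filters might be composed into a single query:

```sql
-- Indicator A: row count over the window
-- Indicator B: average order amount over the same window
-- Composed into one scan of the shared result set:
SELECT
  COUNT(*)    AS indicator_a_row_count,
  AVG(amount) AS indicator_b_avg_amount
FROM orders
-- WHERE <window bounds appended at evaluation time>
```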

Several Indicators can be activated by default using the Dataset Configuration page at `/datasets/:id/configure`.

#### Working with Timestamps

In order to calculate Indicators, Lariat requires that registered Datasets possess at least one member of their schema that can be interpreted as a timestamp. The timestamp field ensures that Indicator calculations are performed over bounded windows of the data, and that these calculations advance forward in time as new data arrives.

A valid timestamp field is a numeric field in the schema representing a UNIX epoch in seconds. However, in the event that such a field doesn't exist, users may instruct Lariat to derive it from existing fields on the Dataset Configuration page:

<figure><img src="/files/5N5FfCqHd8XTSdeoUQ3y" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/V7mMLFPti6UpKaIiMZFp" alt=""><figcaption></figcaption></figure>

In the above example we are using an arithmetic operator to convert `eventtimemilli`, a UNIX epoch in milliseconds, to a UNIX epoch in seconds. You may use any operator or function to aid this conversion, as long as it is supported by your underlying data source (for example, `to_unixtime` in Presto/Trino, or `date_part` in Snowflake, Postgres, and others).
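For instance, a minimal sketch of derived-timestamp expressions (the `created_at` column and the Presto/Trino dialect are assumptions for illustration):

```sql
-- Arithmetic conversion: UNIX epoch in milliseconds -> epoch in seconds
eventtimemilli / 1000

-- If the dataset instead exposes a native timestamp column, a
-- source-supported function can derive epoch seconds (Presto/Trino):
to_unixtime(created_at)
```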

#### Indicator Requirements

An indicator calculates a scalar value from a dataset, with the option for grouping and filtering by any field that is part of the dataset's schema. As mentioned above, indicators require a timestamp field to be defined on the schema, so that calculations may be limited to an appropriate subset of data.

#### Basic Indicator Patterns

The form builder section in the Create Indicator page enables common indicator definitions to be quickly specified (a rough SQL equivalent is sketched after this list).

* Dataset: The dataset which the indicator will be computed on
* Column: The field in the dataset schema which the Indicator refers to
* Calculation: One of `sum`, `avg`, `min`, `max`, `count`, or `countDistinct`
* Group By: Fields in the schema that you wish this indicator calculation to be grouped on. Fields specified under `Group By` will be available as dimensions for graphing and alerting.
* Filters: Boolean operations that limit the result set on which the Indicator will be calculated. These might commonly be used to discard `null` or test values, and prevent them from skewing indicator calculations.
* Timestamp Field: The field in the schema to use as the timestamp for deciding the window of data under observation. See [#lookback-period-evaluation-interval-and-offset](#lookback-period-evaluation-interval-and-offset "mention")
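As a rough mental model, a form-built Indicator corresponds to a query shaped like the sketch below (table and field names are hypothetical; the timestamp bounds are added by Lariat at evaluation time):

```sql
-- Dataset: orders | Column: amount | Calculation: avg
-- Group By: region | Filter: amount IS NOT NULL | Timestamp Field: event_ts
SELECT region, AVG(amount) AS avg_amount
FROM orders
WHERE amount IS NOT NULL
  -- AND event_ts >= <window start> AND event_ts < <window end>
GROUP BY region
```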

#### Lookback Period, Evaluation Interval, and Offset

Every indicator definition requires a Lookback Period and an Evaluation Interval. Optionally, users may specify an Offset for the evaluation time as well.

The **Lookback Period** specifies the duration of the time window in the data over which the Indicator should be calculated. It is what you would use, for example, to distinguish a count `over the last hour` from a count `over the last day`. The Lookback Period refers specifically to the Timestamp field, as described under [#working-with-timestamps](#working-with-timestamps "mention").

The **Evaluation Interval** specifies how frequently the Indicator should be evaluated.

The **Offset** is an optional setting to delay each evaluation by a fixed amount of time. The beginning and end of the Lookback Period for each evaluation interval is determined **prior** to applying this offset, meaning that the Offset can be used to account for any systemic delays in the arrival of new data.

As a whole, you may think of these controls as appending an additional `WHERE` clause to your indicator query, specifying the start and end timestamps of the data under observation.
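For example, a 1-day Lookback Period evaluated at midnight UTC on 2024-01-02 would behave roughly like the following (a sketch; the `event_ts` field and dataset name are illustrative):

```sql
SELECT COUNT(*)
FROM my_dataset
WHERE event_ts >= 1704067200  -- 2024-01-01 00:00:00 UTC (window start)
  AND event_ts <  1704153600  -- 2024-01-02 00:00:00 UTC (window end)
```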

The Create Indicator screen shows you the Next 5 Evaluations that will be scheduled for this indicator.

By selecting an appropriate Lookback Period and Evaluation Interval, you can specify whether calculations should operate on **overlapping** or **non-overlapping** windows of data.

The below specifies a calculation over a Lookback Period spanning 1 day, as timestamped by the `ddate` field. Since the evaluation interval is `every day at midnight`, the Indicator produces a daily calculation over strictly non-overlapping windows of data.

<figure><img src="/files/wRCuuPBmlafGjhzdBBAY" alt=""><figcaption></figcaption></figure>

In the below example, since the evaluation interval is `Every hour` , the Indicator performs a rolling hourly calculation on overlapping windows of data, while always observing a day's worth of data.

<figure><img src="/files/3UNgDu7MQoYFrUfy9TgA" alt=""><figcaption></figcaption></figure>
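To make the two behaviors concrete, these are the windows each configuration would observe (times illustrative):

```sql
-- 1-day Lookback, evaluated every day at midnight (non-overlapping):
--   run Tue 00:00 -> ddate in [Mon 00:00, Tue 00:00)
--   run Wed 00:00 -> ddate in [Tue 00:00, Wed 00:00)

-- 1-day Lookback, evaluated every hour (rolling, overlapping):
--   run Tue 03:00 -> ddate in [Mon 03:00, Tue 03:00)
--   run Tue 04:00 -> ddate in [Mon 04:00, Tue 04:00)
```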

To understand the utility of `Offset`, contrast this with the below definition:

<figure><img src="/files/PVTAtqr5suYmWdl5wWcV" alt=""><figcaption></figcaption></figure>

If we expect a maximum of a 10-minute delay in the arrival of data with new `ddate` values, `Offset the evaluation time by: 10 Minutes` can delay evaluation to the point when we are certain the Indicator will be calculated on a "whole" window of data.

Note that this is different from specifying the `Evaluation Interval` to be 10 minutes past the hour. Since Lookback Periods are calculated based on the requested evaluation times, the window of observed data would include timestamps between `:00 UTC` and `:10 UTC`, and this data may not have arrived in our dataset yet.
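A worked comparison of the two approaches for an hourly Indicator with a 1-hour Lookback Period (times illustrative):

```sql
-- Option A: evaluate at 01:00 UTC with Offset = 10 minutes
--   query actually runs at 01:10, but the window is computed from
--   the scheduled time: ddate in [00:00, 01:00)  -- a "whole" window
--
-- Option B: evaluate at 01:10 UTC with no offset
--   window: ddate in [00:10, 01:10)  -- rows near 01:10 may not
--   have arrived in the dataset yet
```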

#### Using SQL

The SQL Editor in the Create Indicator page allows for free-form entry of a SQL query to evaluate against your dataset. This is helpful when your query cannot be sufficiently expressed using just the form builder, and is suitable for complex expressions such as `CASE` statements or custom calculations.

As a rule of thumb, if the query can be executed directly against your dataset using the native database driver, it can also be validated and run as an Indicator. This means that the SQL editor supports common ANSI SQL constructs such as `WHERE` and `GROUP BY`, as well as the specific aggregations and functions that may be supported by your data source (such as `KURTOSIS` in Snowflake).

The SQL Query defined in the editor is subject to the same evaluation logic specified under [#lookback-period-evaluation-interval-and-offset](#lookback-period-evaluation-interval-and-offset "mention"). Hence, users should avoid including timestamp ranges in the SQL query, as these will be reliably inferred at indicator evaluation time.
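A hypothetical free-form Indicator query (table and field names assumed for illustration) that the form builder could not express, computing a failure rate per region with a `CASE` expression; note that it deliberately omits any timestamp range:

```sql
SELECT region,
       AVG(CASE WHEN status = 'failed' THEN 1.0 ELSE 0.0 END) AS failure_rate
FROM orders
GROUP BY region
```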

The SQL Query defined in the editor is also subject to passing validation (via the `Validate` button) prior to indicator creation.

SQL Query Validation asserts a few things about your proposed query to provide a reasonable guarantee that it can be successfully executed by Lariat agents. The exact validation logic varies depending on the data source the query will operate on but generally speaking Lariat tests for:

* Scalar Computation: Is the query extracting a singular numeric value from the dataset? (Or where a `GROUP BY` is specified, a singular numeric value per group)
* Compatibility with the dataset Schema: Do the named fields exist in the schema, and do they have the appropriate types and casting for the requested operation? For example, `SUM` on a `string` field without a `CAST` would fail validation.
* Dialect validity: Is the SQL valid for the chosen data source's dialect?

Should the above conditions not be satisfied, the SQL editor will display an appropriate error message.
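For example (hypothetical table and column), a query that would fail the type-compatibility check, alongside a corrected version:

```sql
-- Fails validation: SUM over a string-typed column
SELECT SUM(amount_str) FROM orders;

-- Passes validation: explicit cast to a numeric type
SELECT SUM(CAST(amount_str AS DOUBLE)) FROM orders;
```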

### Defining custom Datasets

Lariat allows you to define your own datasets derived from existing datasets captured by Lariat's integrations. This can be a useful shorthand for common columnar operations that need to be performed in several Indicators.

For example, if you wanted to use `day_of_week` as a grouping key for several indicators on a dataset, you could create a new dataset that registers `day_of_week` as a separate column.

<figure><img src="/files/EOhmEO4kjQzX2YmxjKlt" alt=""><figcaption></figcaption></figure>

Now `day_of_week` is a first-order part of the schema for the `MyTableWithDayOfWeek` dataset, and the new column is available for Indicator creation:

<figure><img src="/files/2xN6tHOVoHf9Mjot4pHd" alt=""><figcaption></figcaption></figure>

At indicator evaluation time, custom datasets are expanded as subqueries (or their equivalent) in the wrapping indicator query. It is important to remember that custom datasets created in Lariat are not actually materialized in your data source. They are intended as a convenience to aid indicator creation and keep definitions DRY. They should not be used in situations where a long-term materialized representation of new columns or aggregations is required. If you do require materialization, it is preferable to use your database-native method to create it, and then register the resulting materialized view or table in Lariat as another dataset.
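Schematically (Presto/Trino syntax and the `event_ts` field assumed for illustration), an Indicator over the custom dataset might expand like this at evaluation time:

```sql
SELECT day_of_week, COUNT(*) AS row_count
FROM (
  -- The custom dataset is inlined as a subquery, never materialized:
  SELECT *, day_of_week(from_unixtime(event_ts)) AS day_of_week
  FROM mytable
) AS MyTableWithDayOfWeek
-- WHERE <window bounds on the timestamp field>
GROUP BY day_of_week
```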

