S3 Object Storage

Instructions for installation and configuration of S3 Object Storage agent


Last updated 1 year ago

We recommend following the installation instructions in the UI. You can find them at app.lariatdata.com by opening the Integrations tab and selecting "Add new integration".

To describe what the install entails, we outline below how you could do this outside of the UI. If you are interested in the code that powers the install, take a look at the lariat-data/install-aws-s3-agent repository on GitHub.

What do you need for the install?

  • docker

  • Your Cloud Account ID and region

  • Your own Cloud account keys

Note: If you are running the install outside of the UI, you will also need your Lariat API Key, your Lariat Application Key, and the Lariat-generated Cloud access keys (e.g. an AWS access key and secret key).

Installation command

If using the UI, copy the installation command and fill in the unpopulated fields.

Here is what the command looks like:

docker run -it \
--mount type=bind,source=/local/path/to/config/s3_agent.yaml,target=/workspace/s3_agent.yaml,readonly \
-e AWS_REGION={YOUR_AWS_REGION} \
-e AWS_ACCOUNT_ID={YOUR_AWS_ACCOUNT_ID} \
-e AWS_ACCESS_KEY_ID="$(aws configure get aws_access_key_id)" \
-e AWS_SECRET_ACCESS_KEY="$(aws configure get aws_secret_access_key)" \
-e AWS_SESSION_TOKEN="$(aws configure get aws_session_token)" \
-e LARIAT_TMP_AWS_ACCESS_KEY_ID={PREFILLED_BY_UI} \
-e LARIAT_TMP_AWS_SECRET_ACCESS_KEY={PREFILLED_BY_UI} \
-e LARIAT_API_KEY={PREFILLED_BY_UI} \
-e LARIAT_APPLICATION_KEY={PREFILLED_BY_UI} \
-e LARIAT_EVENT_NAME=sns_s3_trigger \
-e LARIAT_PAYLOAD_SOURCE=s3 \
lariatdata/install-aws-s3-agent:latest install

If you do not have access to the UI and need LARIAT_TMP_AWS_ACCESS_KEY_ID and LARIAT_TMP_AWS_SECRET_ACCESS_KEY, reach out to support@lariatdata.com.
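As a rough illustration, the installation command above can also be assembled as an argument list in Python, which sidesteps shell line-continuation problems. This is only a sketch: the helper name build_install_command, the example values, and the choice to treat AWS_SESSION_TOKEN as optional are assumptions made for illustration, not part of the Lariat installer. The flags, image name, and environment variable names are copied from the command above.

```python
import shlex

# Assumption for this sketch: the session token only exists when temporary
# AWS credentials are in use, so an empty value is allowed and skipped.
OPTIONAL_VARS = {"AWS_SESSION_TOKEN"}


def build_install_command(config_path, env):
    """Hypothetical helper: build the docker argv for the install command."""
    # Reject placeholders such as {PREFILLED_BY_UI} and empty required values.
    unfilled = [
        name for name, value in env.items()
        if value.startswith("{") or (not value and name not in OPTIONAL_VARS)
    ]
    if unfilled:
        raise ValueError("unfilled values: " + ", ".join(sorted(unfilled)))

    argv = [
        "docker", "run", "-it",
        "--mount",
        f"type=bind,source={config_path},target=/workspace/s3_agent.yaml,readonly",
    ]
    for name, value in env.items():
        if value:  # empty optional values are left out entirely
            argv += ["-e", f"{name}={value}"]
    argv += ["lariatdata/install-aws-s3-agent:latest", "install"]
    return argv


# Example values are placeholders, not real credentials.
example_env = {
    "AWS_REGION": "us-east-1",
    "AWS_ACCOUNT_ID": "123456789012",
    "AWS_ACCESS_KEY_ID": "EXAMPLE_KEY_ID",
    "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET",
    "AWS_SESSION_TOKEN": "",
    "LARIAT_TMP_AWS_ACCESS_KEY_ID": "EXAMPLE_TMP_KEY_ID",
    "LARIAT_TMP_AWS_SECRET_ACCESS_KEY": "EXAMPLE_TMP_SECRET",
    "LARIAT_API_KEY": "EXAMPLE_API_KEY",
    "LARIAT_APPLICATION_KEY": "EXAMPLE_APP_KEY",
    "LARIAT_EVENT_NAME": "sns_s3_trigger",
    "LARIAT_PAYLOAD_SOURCE": "s3",
}

command = build_install_command("/local/path/to/config/s3_agent.yaml", example_env)
print(shlex.join(command))  # a single shell-safe line you can inspect before running
```

The resulting list can be passed to subprocess.run once you have reviewed it. Building an argument list rather than a shell string means a stray space after a line-continuation backslash cannot break the command.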

The config below matches an S3 path such as:

s3://your_bucket_name/s3-bucket-prefix/partition_seperated_var_1=1010/partition_seperated_var_2=some_source/fixed_value_1/2024-10/partition_seperated_var_3=05

The config is stored in your object storage at a path with the following format: lariat-s3-default-config followed by a timestamp.

your_bucket_name:
  - prefix: "s3-bucket-prefix"
    key_val_partition_separator: "="
    suffix_template: "{partition_seperated_var_1}/{partition_seperated_var_2}/fixed_value_1/<unpartitioned_var_1>/{partition_seperated_var_3}"
    file_type: parquet
    name: my_dataset_name
    columns:
      string:
        - string_column_for_stats_1
        - string_column_for_stats_2
      number:
        - num_column_for_stats_1
        - num_column_for_stats_2
    dimensions:
      - dimension_field_1
      - dimension_field_2
    timestamp:
      timestamp_col_name_1:
        column: "{unpartitioned_var_1}-{partition_seperated_var_3}"
        format: "%Y-%m-%d"
        timezone: "UTC"
      timestamp_col_name_2:
        column: timestamp_column
        format: "unixtime"
        primary: True
source_id: unique-source-id-name

The above is the structure of the configuration YAML.

Below are descriptions for each field, and a demonstrative fully filled in YAML.

  • your_bucket_name: The name of the bucket tracked by the monitoring agent.

  • prefix (s3-bucket-prefix in the example above): Only objects under this prefix are monitored (i.e. objects whose paths follow the pattern s3://your-bucket-name/s3-bucket-prefix).

  • key_val_partition_separator: The separator used in object keys to denote key-value pairs. Any character is allowed here. (E.g. if you have an S3 path like s3://my-bucket/partner_id=1234, the key_val_partition_separator for {partner_id} would be "=".) N.B.: This isn't the "/" file directory partition separator.

  • suffix_template: The variable patterns that you want to track objects for, including any partition variables that need to be captured.

    • {partition_seperated_var_1}: When wrapped in braces, the pattern being matched is a key-value partition pair separated by key_val_partition_separator. The variable will be named partition_seperated_var_1 and can be referred to as such when naming columns and defining timestamps.

    • fixed_value_1: When there are no braces or angle brackets, this part of the suffix represents a static part of the object key.

    • <unpartitioned_var_1>: When wrapped in angle brackets, this represents a variable that doesn't have a partition key but that we still want to name. It can be referred to by that name when naming columns and defining timestamps.

  • file_type: The following file types are currently supported:

    • jsonl - Line-separated JSON (every line is a JSON record)

    • json - JSON file

    • csv - CSV file

    • parquet - Parquet file

  • name: The name used to refer to this dataset.

  • columns: The columns expected in the dataset for which we want to track statistics.

    • string: The columns listed here are the columns of type string in the object.

    • number: The columns listed here are the columns of type number in the object.

  • dimensions: The granularity at which the above statistics are computed. These are the columns we want to group and filter by.

  • timestamp: Time is a first-class citizen in the Lariat platform. You can construct a time field so that you can view statistics as time series.

    • timestamp_col_name_1: The name of the timestamp column to be defined. You can construct it either by combining existing fields or by using a field directly.

      • column: Definition of the column. There are two ways to define a column:

        • col_name - Put in a column name directly. You will be able to specify a timezone and format.

        • combine columns - Combine columns like so: "{partition_seperated_var_1}-{partition_seperated_var_2}", and specify a format. For example, if the value of partition_seperated_var_1 is 2023-02 and the value of partition_seperated_var_2 is 10, you can specify a format of %Y-%m-%d to construct a time field.

      • primary: If specified and set to True, this will be the primary timestamp used for dashboarding and downstream analytics. If not set, the first defined timestamp is treated as primary.

  • source_id: The unique source_id representing the agent. It is used to make sure we don't conflate data across regions or permission boundaries. N.B.: Make sure no other S3 object storage configuration shares the same source_id.
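To make the suffix_template rules above more concrete, here is a minimal Python sketch of how such a template could be turned into a regular expression and matched against an object key. It is an illustration under stated assumptions, not the agent's actual matching code: the function names template_to_regex and match_key, the example variable names month and day, and the decision to accept any trailing path below the template (such as a file name) are all hypothetical.

```python
import re


def template_to_regex(suffix_template, separator="="):
    """Hypothetical helper: compile a suffix_template into a regex."""
    parts = []
    for segment in suffix_template.split("/"):
        if segment.startswith("{") and segment.endswith("}"):
            # {name}: a key-value pair such as name=value in the object key.
            name = segment[1:-1]
            parts.append(
                f"{re.escape(name)}{re.escape(separator)}(?P<{name}>[^/]+)"
            )
        elif segment.startswith("<") and segment.endswith(">"):
            # <name>: a bare value with no partition key, captured by name.
            name = segment[1:-1]
            parts.append(f"(?P<{name}>[^/]+)")
        else:
            # No braces or angle brackets: a static part of the key.
            parts.append(re.escape(segment))
    # Assumption: anything below the templated path (e.g. a file name) is allowed.
    return re.compile("/".join(parts) + r"(?:/.*)?$")


def match_key(key, prefix, suffix_template, separator="="):
    """Return the captured variables for an object key, or None if no match."""
    base = prefix.rstrip("/") + "/"
    if not key.startswith(base):
        return None
    match = template_to_regex(suffix_template, separator).match(key[len(base):])
    return match.groupdict() if match else None


example_key = (
    "s3-bucket-prefix/partner_id=1234/fixed_value_1/2024-10/day=05/part-0000.parquet"
)
captured = match_key(
    example_key, "s3-bucket-prefix", "{partner_id}/fixed_value_1/<month>/{day}"
)
print(captured)  # {'partner_id': '1234', 'month': '2024-10', 'day': '05'}
```

The captured names are what the field descriptions mean by variables that "can be referred to as such when naming columns and defining timestamps".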

Once setup is complete and the configuration has been defined, you can start inspecting events on the platform.

At app.lariatdata.com, select the "Object Events" option from the sidebar. The screenshot below shows what the sidebar looks like.

format (set on each timestamp entry): A time format code that extends the 1989 C standard, for example %Y-%m-%d. A special "unixtime" format is also supported, which parses an integer Unix timestamp.
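As a rough sketch of how the column, format, and primary settings of a timestamp entry could fit together, the Python below builds a datetime from either a combined column definition or a single column, and picks the primary timestamp using the rule described in this page (an entry with primary set to True, otherwise the first one defined). The helper names build_timestamp and primary_timestamp are hypothetical, and the sketch assumes UTC throughout rather than honouring the timezone field.

```python
from datetime import datetime, timezone


def build_timestamp(column, fmt, values):
    """Hypothetical helper: turn a timestamp column definition into a datetime."""
    # A definition containing braces combines several captured values;
    # otherwise it names a single column directly.
    raw = column.format(**values) if "{" in column else str(values[column])
    if fmt == "unixtime":
        # Special format: an integer number of seconds since the Unix epoch.
        return datetime.fromtimestamp(int(raw), tz=timezone.utc)
    # Assumption: the value is interpreted as UTC.
    return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)


def primary_timestamp(timestamps):
    """Return the name of the primary timestamp entry."""
    for name, spec in timestamps.items():
        if spec.get("primary"):
            return name
    # No entry is marked primary: fall back to the first one defined.
    return next(iter(timestamps))


values = {
    "partition_seperated_var_1": "2023-02",
    "partition_seperated_var_2": "10",
    "timestamp_column": 1700000000,
}
combined = build_timestamp(
    "{partition_seperated_var_1}-{partition_seperated_var_2}", "%Y-%m-%d", values
)
print(combined.isoformat())  # 2023-02-10T00:00:00+00:00

timestamps = {
    "timestamp_col_name_1": {"format": "%Y-%m-%d"},
    "timestamp_col_name_2": {"format": "unixtime", "primary": True},
}
print(primary_timestamp(timestamps))  # timestamp_col_name_2
```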

Installer source code: https://github.com/lariat-data/install-aws-s3-agent