Data modelling


Datasets are the primary unit of storage in Parseable. All MELT (Metrics, Events, Logs, Traces) data is ingested into one dataset or another.

Think of a dataset as a logical grouping of similar telemetry data. Proper dataset planning ensures fast queries and optimal storage compression.

Every dataset is identified by a unique name and has an assigned schema, which can be dynamic (inferred from incoming data) or static (explicitly defined). Role-based access control, alerts, retention, and notifications are supported at the dataset level.

Mapping sources to datasets

As SREs and DevOps engineers, you often deal with multiple data sources generating telemetry data. These sources can include applications, infrastructure components, and third-party services. Each of these sources can produce different types of data with varying schemas.

When debugging or investigating issues, it's common to correlate data from multiple sources. For example, you might want to correlate application logs with infrastructure metrics to identify performance bottlenecks.

Hence it is crucial to thoughtfully map data sources to datasets in Parseable. This mapping ensures that related data is stored together, making it easier to query and analyze.

A data source is anything that generates data, e.g. agents like FluentBit, FluentD, Vector, LogStash, agents from the OTel ecosystem, or the application itself. While ingesting data, you specify the dataset name the data should be sent to; this is how sources are mapped to datasets. Technically, it is possible to map any number of data sources to any number of datasets. Parseable allows this to provide flexibility for varying use cases.
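
For example, with the HTTP ingestion API the target dataset is named on each request. Below is a minimal Python sketch, assuming a Parseable instance at localhost:8000 with the default admin credentials; the X-P-Stream header carries the dataset name (see the Ingestion section for agent-specific configuration).

```python
import requests

PARSEABLE_URL = "http://localhost:8000"   # assumed local Parseable instance
AUTH = ("admin", "admin")                 # assumed default credentials

# A batch of events from one source, sent to the hypothetical `frontend-logs`
# dataset. The X-P-Stream header names the dataset the events go to.
events = [
    {"level": "info", "message": "user signed in", "service": "frontend"},
    {"level": "error", "message": "payment timeout", "service": "frontend"},
]

resp = requests.post(
    f"{PARSEABLE_URL}/api/v1/ingest",
    json=events,
    headers={"X-P-Stream": "frontend-logs"},
    auth=AUTH,
)
resp.raise_for_status()
```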

However, it is important to think critically about how you map data sources to datasets. Too many unrelated columns in a dataset can lead to poor compression and slower query performance. On the other hand, too many datasets can lead to increased complexity in managing the data.

When deciding how to map sources to datasets, consider the following:

  • Schema similarity: If the sources have similar schemas, it is better to map them to a single dataset. This allows for better compression and faster query performance. Similar here means the fields match for 80 percent or more of the events (a rough sketch of this check follows after this list). If the schemas are too different, it is better to create a separate dataset for each source.

  • Query patterns: If you frequently query across multiple sources, it is better to map them to a single dataset. This allows you to query the data easily without having to join multiple datasets.

  • Data retention: If the sources have different data retention requirements, it is better to create separate datasets for each source. This allows you to set different retention policies for each dataset.

  • Data ownership: If different teams own different sources, it is better to create separate datasets for each source. This allows you to set different access controls for each dataset and manage the data better.
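
As a rough way to apply the 80 percent guideline from the schema similarity point above, you can compare the field sets of sample events from two sources before deciding whether they should share a dataset. This is a plain-Python sketch with illustrative sample events; it approximates the rule by measuring field overlap rather than per-event matching.

```python
def field_overlap(sample_a: list[dict], sample_b: list[dict]) -> float:
    """Fraction of all observed fields that the two sources share."""
    fields_a = set().union(*(event.keys() for event in sample_a))
    fields_b = set().union(*(event.keys() for event in sample_b))
    return len(fields_a & fields_b) / max(len(fields_a | fields_b), 1)

# Illustrative sample events from two hypothetical sources.
checkout_logs = [{"ts": 1, "level": "info", "service": "checkout", "message": "ok"}]
cart_logs = [{"ts": 2, "level": "warn", "service": "cart", "message": "slow", "latency_ms": 812}]

if field_overlap(checkout_logs, cart_logs) >= 0.8:
    print("schemas are similar enough: map both sources to one dataset")
else:
    print("schemas diverge: consider a separate dataset per source")
```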

Let's understand this with some examples:

  • Kubernetes infrastructure logs: Kubernetes infrastructure logs (e.g. kubelet, kube-proxy) can be mapped to a single dataset. This allows you to query the logs across all the Kubernetes components easily. Since these logs have a similar schema, they fit well into a single dataset.

  • Application logs with similar schema: If you have logs from multiple applications that log in a common format (for example, go-log), you can create a single dataset for all of them. This allows you to query the logs across applications easily.

  • Application logs with different schemas: If you have logs from multiple applications that have completely different schemas, you can create a separate dataset for each application. This allows you to enforce a specific schema for each dataset and query them independently.

  • Aggregated data: If you have aggregated data (e.g. metrics, traces) that you want to store, you can create a separate dataset for that. This allows you to query the aggregated data separately from the raw logs.

Beyond 800 columns in a dataset, consider splitting the dataset into multiple datasets based on schema similarity or query patterns. Beyond 1000 columns, the server will reject the ingestion request with an error.
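
To keep an eye on how wide a dataset has grown, you can fetch its schema and count the columns. The sketch below assumes the same local instance as above and a GET schema endpoint under /api/v1/logstream; the response shape (a "fields" list) is an assumption, so verify it against the API reference.

```python
import requests

PARSEABLE_URL = "http://localhost:8000"   # assumed local Parseable instance
AUTH = ("admin", "admin")                 # assumed default credentials
DATASET = "frontend-logs"                 # hypothetical dataset name

# Fetch the dataset's current schema and count its columns.
resp = requests.get(f"{PARSEABLE_URL}/api/v1/logstream/{DATASET}/schema", auth=AUTH)
resp.raise_for_status()
schema = resp.json()

column_count = len(schema.get("fields", []))   # "fields" key is an assumption
if column_count > 800:
    print(f"{DATASET} has {column_count} columns: consider splitting the dataset")
```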

Adding data to Parseable

Identify the data source

Refer to the ingestion guides for various data sources in the Ingestion section.

Create or use an existing dataset

You can create a dataset using the "Create Dataset" button on the Datasets page. You'll be prompted to enter the dataset name, schema type, and partition column.

You can set the schema type to static (the schema has to be explicitly provided when the dataset is created) or dynamic (the server infers the schema from the incoming data). Once set, the schema type cannot be changed. Read more about schema types in the Dataset Schema section.

The partition column is an optional field. If you want to partition the dataset on a specific field, you can specify that field here. If you don't specify a partition field, Parseable uses the internal p_timestamp field as the partition field. Read more about partitioning in the Partitioning section.
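
The same can be done programmatically. The sketch below assumes the logstream creation endpoint and a header-based way to set the partition field; the header name is an assumption for illustration, so confirm it against the API reference before relying on it.

```python
import requests

PARSEABLE_URL = "http://localhost:8000"   # assumed local Parseable instance
AUTH = ("admin", "admin")                 # assumed default credentials

# Create a hypothetical `payments-logs` dataset, partitioned on a `source_time`
# field. The partition header name is an assumption; check the API reference.
resp = requests.put(
    f"{PARSEABLE_URL}/api/v1/logstream/payments-logs",
    headers={"X-P-Time-Partition": "source_time"},
    auth=AUTH,
)
resp.raise_for_status()
```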

Dataset vs index

Traditional indices in systems like Elasticsearch are built to ingest textual data, index each field, and allow for fast search and retrieval. This works well for pure search use cases, where you want to search for specific keywords or phrases in the data.

But applying this concept to huge volumes of observability data (logs, metrics, traces) is not practical. Observability data is often structured, semi-structured or unstructured, and indexing every field can lead to excessive storage costs for little to no performance gain.

Parseable datasets are designed to handle large volumes of observability data efficiently. They focus on optimal storage compression and fast retrieval via query, rather than indexing every field. This allows Parseable to handle high cardinality data, such as logs with many unique fields, without the performance and storage overhead of traditional indices.

Dataset schema

A schema defines the fields in an event and their types. Parseable supports two schema types: dynamic and static. You choose the schema type while creating the dataset. Additionally, if you want to enforce a specific schema, you'll need to send that schema at the time of creating the dataset.

Dynamic

Datasets have a dynamic schema by default. This means you don't need to define a schema for a dataset. The Parseable server detects the schema from the first event and, as subsequent events introduce new fields, updates the internal schema accordingly.

Log data formats evolve over time, so many users prefer the dynamic schema approach, where they don't have to worry about schema changes and can keep ingesting events into a given dataset.

With a dynamic schema, Parseable doesn't allow changing the type of an existing column once it is set. For example, if a column is detected as string in the first event, it can't be changed to int or timestamp in a later event. If you'd like to enforce a specific schema, please use a static schema.
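
To make this concrete, here is a sketch of dynamic schema evolution against the ingestion API, assuming the same local instance and dataset naming as above: a new field in a later event extends the schema, while a type change on an existing column is not accepted.

```python
import requests

PARSEABLE_URL = "http://localhost:8000"   # assumed local Parseable instance
AUTH = ("admin", "admin")                 # assumed default credentials
HEADERS = {"X-P-Stream": "app-logs"}      # hypothetical dataset name

def ingest(event: dict) -> requests.Response:
    """Send a single event to the dataset named in HEADERS."""
    return requests.post(f"{PARSEABLE_URL}/api/v1/ingest",
                         json=[event], headers=HEADERS, auth=AUTH)

# First event: `status` is inferred as a string column.
ingest({"message": "boot complete", "status": "ok"})

# Later event with a new field: the schema grows to include `latency_ms`.
ingest({"message": "request served", "status": "ok", "latency_ms": 42})

# Later event that turns `status` into an int: not accepted, because an
# existing column's type can't change under a dynamic schema.
resp = ingest({"message": "request served", "status": 200})
print(resp.status_code, resp.text)
```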

Static

In some cases, you may want to enforce a specific schema for a dataset. You can do this by setting the static schema type while creating the dataset. This schema is enforced for all the events ingested into the dataset. You'll need to provide the schema in the form of a JSON object with field names and their types, with the create dataset API call. The following types are supported in the schema: string, int, float, datetime, date, boolean.
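
As a sketch, creating a static-schema dataset over the API might look like the following; the flag header name and the exact shape of the schema body are assumptions for illustration, so check the create dataset API reference.

```python
import requests

PARSEABLE_URL = "http://localhost:8000"   # assumed local Parseable instance
AUTH = ("admin", "admin")                 # assumed default credentials

# Field names and types for a hypothetical `audit-logs` dataset, using the
# types listed above. The header name and body layout are assumptions.
schema = {
    "fields": [
        {"name": "message",   "data_type": "string"},
        {"name": "level",     "data_type": "string"},
        {"name": "duration",  "data_type": "float"},
        {"name": "timestamp", "data_type": "datetime"},
    ]
}

resp = requests.put(
    f"{PARSEABLE_URL}/api/v1/logstream/audit-logs",
    json=schema,
    headers={"X-P-Static-Schema-Flag": "true"},
    auth=AUTH,
)
resp.raise_for_status()
```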

FAQ

Some common questions related to datasets are answered below. If you have any other questions, please reach out to us on Slack or GitHub Discussions.
