Catalog of Data Observability tools

By Louise de Leyritz from Castor (www.castordoc.com)

As data proliferates in modern organizations, the technologies we use to move it around have become more intricate. Data pipelines are now so complex that businesses have a hard time identifying the root cause of data issues, which leads to tremendous productivity losses.

This explains the explosion of data quality and observability tools (internal, open-source, and SaaS) over the past two years. This trend is not going to stop, so we'd rather bring visibility and structure to it sooner than later.

At Castor, we believe the first step toward structuring the Data Observability tools market is more transparency. For that reason, we put together a list of all the Observability tools we have heard of.

💡
This list is still exploratory and may contain errors or lack information. Please reach out to us if you notice anything wrong: louise@castordoc.com

Get started with Data Observability Tools

📢
In-depth analysis and evolution: read the full breakdown by generation and market analysis of data quality here.

Deeper dive into Data Observability

What does each column in the benchmark below mean?

Deployment support: Does the solution support a SaaS deployment model, an open-source model, or both?

Monitoring framework: What approach does the solution use to investigate data? Do they use:

  • A pipeline testing framework allowing data engineers to test binary statements. For example, whether all the values in a column are unique, or whether the schema matches a certain expression. When tests fail, the data is labeled as "bad". The data engineering team can thus diagnose poor quality data and take the necessary steps to resolve issues.
  • An anomaly detection framework, where the solution scans data assets, collects statistics from this data, and watches for changes in the behavior of these statistics.
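
To make the pipeline testing approach concrete, here is a minimal sketch in plain Python. It is not taken from any of the tools listed in this catalog: the function names (`column_is_unique`, `schema_matches`) and the sample `orders` rows are made up for illustration. Each test is a binary statement that either passes or fails.

```python
from typing import Any

Row = dict[str, Any]


def column_is_unique(rows: list[Row], column: str) -> bool:
    """Binary test: True when no value in `column` appears twice."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def schema_matches(rows: list[Row], expected: dict[str, type]) -> bool:
    """Binary test: True when every row has exactly the expected columns and types."""
    return all(
        row.keys() == expected.keys()
        and all(isinstance(row[name], kind) for name, kind in expected.items())
        for row in rows
    )


# Hypothetical sample data: the third row reuses order_id 2.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 35.5},
    {"order_id": 2, "amount": 12.0},
]

results = {
    "order_id is unique": column_is_unique(orders, "order_id"),
    "schema matches": schema_matches(orders, {"order_id": int, "amount": float}),
}

# Tests that fail are what gets the data labeled as "bad".
failed = [name for name, passed in results.items() if not passed]
```

Here the uniqueness test fails while the schema test passes, so the data engineering team knows which property of the data to look at first.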

Threshold setting: Are the alert thresholds automatically set by the solution based on patterns observed in your data, or can you set the thresholds for your metrics manually?
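
The difference between the two threshold styles can be sketched in a few lines. This is only an illustration under simple assumptions: the automatic bounds are taken as the mean plus or minus three standard deviations of past values, which is one common heuristic and not necessarily what any listed vendor does. The names `automatic_thresholds` and `is_anomalous` and the sample row counts are hypothetical.

```python
from statistics import mean, stdev


def automatic_thresholds(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Derive (low, high) bounds from past values: mean +/- k standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def is_anomalous(value: float, low: float, high: float) -> bool:
    """True when the value falls outside the [low, high] band."""
    return value < low or value > high


# Hypothetical history of a metric, e.g. daily row counts of a table.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

# Manual thresholds: picked by a person, independent of the history.
manual_low, manual_high = 900, 1100

# Automatic thresholds: derived from the observed pattern.
auto_low, auto_high = automatic_thresholds(daily_row_counts)

todays_count = 1060
flagged_by_manual = is_anomalous(todays_count, manual_low, manual_high)
flagged_by_auto = is_anomalous(todays_count, auto_low, auto_high)
```

With this sample history, the automatic band is tighter than the hand-picked one, so a count of 1060 is flagged by the automatic thresholds but not by the manual ones. Recomputing the bounds on a sliding window of recent values would be one simple way to keep them updated as data changes.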

Interface type: Is there a no-code interface? Is there a command-line tool used to run tests on datasets to find invalid, missing, or unexpected data?

High cardinality support: Does the solution provide an easy way to add hundreds or thousands of data streams to be monitored? If the solution doesn't support high cardinality, monitoring must be enabled stream by stream, which takes more time.

Monitoring frequency: How often does the observability tool monitor the data assets? Can it provide the relevant insights in real time and identify data issues just as they are happening, with the possibility to stop the pipeline and prevent bad data from loading?

Automated features: Does the solution offer any of the following automation features?

  • Automated threshold setting: The solution uses machine learning algorithms to detect patterns in the datasets. This way, thresholds are automatically defined, flagging unusual data points.
  • Automated threshold updating: As data changes, the alert thresholds are automatically updated based on forecasts of the data quality metrics.
  • Automated circuit breaker: Can the solution automatically stop the pipeline from running when it identifies bad data?
  • Automatic filtering-out: Can the solution automatically filter out and quarantine bad data that has been identified, ensuring bad data doesn't flow into downstream applications?
  • Auto-resolution: When the solution detects bad data, can it automatically fix the issue?
  • Dynamic pipeline runs: Can the solution run different versions of pipelines based on changing data inputs, parameters, or model scores? The purpose here is to improve pipeline performance.
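
As a rough sketch of how a circuit breaker and a quarantine step might fit together, consider the example below. Everything in it is an assumption made for illustration: the names `CircuitBreakerTripped`, `quarantine`, and `load_batch`, the validity rule, and the 25% tolerance are not taken from any tool in this catalog.

```python
from typing import Any, Callable

Row = dict[str, Any]


class CircuitBreakerTripped(Exception):
    """Raised to stop the pipeline when a batch contains too much bad data."""


def quarantine(rows: list[Row], is_valid: Callable[[Row], bool]) -> tuple[list[Row], list[Row]]:
    """Split a batch into rows that may continue downstream and rows set aside."""
    good = [row for row in rows if is_valid(row)]
    bad = [row for row in rows if not is_valid(row)]
    return good, bad


def load_batch(
    rows: list[Row],
    is_valid: Callable[[Row], bool],
    max_bad_ratio: float = 0.25,  # hypothetical tolerance
) -> tuple[list[Row], list[Row]]:
    """Quarantine bad rows; stop the run entirely if too many rows are bad."""
    good, bad = quarantine(rows, is_valid)
    bad_ratio = len(bad) / len(rows) if rows else 0.0
    if bad_ratio > max_bad_ratio:
        raise CircuitBreakerTripped(f"{len(bad)} of {len(rows)} rows failed validation")
    return good, bad


def amount_is_valid(row: Row) -> bool:
    """Hypothetical rule: amounts must be present and non-negative."""
    return row["amount"] is not None and row["amount"] >= 0


mostly_good = [{"amount": 10}, {"amount": 25}, {"amount": 7}, {"amount": 3}, {"amount": None}]
mostly_bad = [{"amount": -5}, {"amount": None}, {"amount": 12}, {"amount": 8}]

# One bad row out of five: it is quarantined and the rest continues downstream.
good_rows, quarantined_rows = load_batch(mostly_good, amount_is_valid)

# Two bad rows out of four: the circuit breaker stops the pipeline.
try:
    load_batch(mostly_bad, amount_is_valid)
    pipeline_stopped = False
except CircuitBreakerTripped:
    pipeline_stopped = True
```

The design choice here is that a small share of bad rows is filtered out silently, while a large share is treated as a sign that something upstream is broken and the whole run should halt.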

Data sources: Which data sources can the tool connect to?

Integrations: Does the solution integrate with other tools, such as a data catalog?

Alert destinations: Which applications does the tool send alerts to when a threshold is reached? Slack, PagerDuty? Can it be customized to the notification system your team uses?

Security: Is your data in safe hands with the solution? Does the solution have access to your data or PII? Is the solution SOC 2 compliant?

Metrics categories tracked: Which metrics are tracked by the solution? Data freshness, distribution, volume?

  • Freshness: Detect any glitch in the refresh schedule of your data. Say your data is meant to be refreshed every hour, and it hasn't been refreshed for 4 hours. You'll be made aware of that with a solution that monitors freshness.
  • Distribution: Metrics relating to distributions include the range of the data, the variance, the mean, the kurtosis, and many others. If any of these metrics reaches a critical threshold, the observability solution sends an alert to the platform you use.
  • Volume: A volume issue occurs when your data is incomplete. For example, you're expecting to find one million rows in your table, but you only have 100,000. Monitoring data volume will alert you when such a thing happens.
  • Schema: Schema deals with changes that have been performed on the data. Has a table been added, removed, modified? That's all related to the schema.
  • Format: Get alerted when your data is not in the expected format.
  • Nulls and blanks: The solution alerts you when some cells are null when they shouldn't be.
  • Custom metrics: The solution lets you define the metrics you want to track yourself.
  • System: System metrics come from your pipeline executions, such as duration and resource utilization.
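
A few of the metric categories above can be sketched as small checks. This is an illustrative sketch only: the function names, the sample values (which reuse the hourly refresh and one-million-row examples from the list), and the alert cut-offs of 0.9 and 0.1 are assumptions, not the behavior of any listed tool.

```python
from datetime import datetime, timedelta
from typing import Any, Optional


def is_stale(last_refresh: datetime, now: datetime, expected_interval: timedelta) -> bool:
    """Freshness: True when the data has gone longer than expected without a refresh."""
    return now - last_refresh > expected_interval


def volume_ratio(actual_rows: int, expected_rows: int) -> float:
    """Volume: share of the expected rows that actually arrived."""
    return actual_rows / expected_rows


def null_rate(values: list[Optional[Any]]) -> float:
    """Nulls and blanks: share of values that are None or empty strings."""
    if not values:
        return 0.0
    missing = sum(1 for value in values if value is None or value == "")
    return missing / len(values)


now = datetime(2022, 1, 1, 12, 0)
last_refresh = datetime(2022, 1, 1, 8, 0)  # four hours before `now`

alerts = []
# Data is meant to be refreshed every hour.
if is_stale(last_refresh, now, expected_interval=timedelta(hours=1)):
    alerts.append("freshness")
# One million rows expected, 100,000 received; 0.9 is a hypothetical cut-off.
if volume_ratio(actual_rows=100_000, expected_rows=1_000_000) < 0.9:
    alerts.append("volume")
# Two of four values are missing; 0.1 is a hypothetical cut-off.
if null_rate(["alice", None, "", "bob"]) > 0.1:
    alerts.append("nulls")
```

In this example all three checks raise an alert. In practice the cut-offs would come from the threshold setting approach the tool supports, manual or automated.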

Root cause analysis: Root cause analysis features allow your organization to identify the cause of a data issue. These include:

  • Lineage: Data lineage includes data origin, what happens to it, and where it moves over time. It is the process of understanding, recording, and visualizing data as it flows from data sources to consumption.
  • Correlation between metrics: Tracking correlations between metrics and events helps you pin down the cause of data issues.
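
To illustrate how lineage helps with root cause analysis, the sketch below walks a small, made-up dependency graph upstream from a broken dashboard. The table names, the `upstream` mapping, and the `upstream_tables` helper are all hypothetical; real tools build this graph from query logs or pipeline metadata.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it reads from.
upstream = {
    "revenue_dashboard": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}


def upstream_tables(lineage: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first walk listing every asset `start` depends on, nearest first."""
    seen: list[str] = []
    queue = deque(lineage.get(start, []))
    while queue:
        table = queue.popleft()
        if table in seen:
            continue
        seen.append(table)
        queue.extend(lineage.get(table, []))
    return seen


# Assets whose own checks are currently failing (hypothetical).
failing = {"orders_clean", "orders_raw"}

dependencies = upstream_tables(upstream, "revenue_dashboard")
suspects = [table for table in dependencies if table in failing]

# The failing asset furthest upstream is a reasonable first place to look.
likely_root_cause = suspects[-1] if suspects else None
```

Because the walk lists nearer dependencies first, the last failing asset in `suspects` is the one furthest from the dashboard, here `orders_raw`.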

Data Observability Benchmark and Key Features

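Columns: Name, Link, Deployment support, Monitoring framework, Threshold setting, Interface type, High cardinality support, Real-time data monitoring, Automated features, Data sources, Integrations, Alert destinations, Security, Metrics categories tracked, Root cause analysis, Community, Github

Bigeye

  • Deployment support: SaaS, On-premises
  • Monitoring framework: Anomaly detection, Pipeline testing
  • Threshold setting: Automated, Manual
  • Interface type: No-code
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated threshold updating, Automated circuit-breaker, Quarantine bad data
  • Data sources: Main data warehouses, Main databases
  • Alert destinations: Gmail, Slack, PagerDuty, APIs
  • Security: Certified SOC 2 compliant, Data stays in your environment
  • Metrics categories tracked: Freshness, Outliers, Formats, Distribution, Volume, Custom metrics, Nulls & blanks

Soda

  • Deployment support: Cloud, open-source
  • Monitoring framework: Anomaly detection, Pipeline testing
  • Threshold setting: Automated, Manual
  • Interface type: No-code, Command-line tool
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated circuit-breaker, Quarantine bad data, Automated threshold updating
  • Data sources: Main data warehouses, Main data lakes
  • Integrations: Collibra, Tableau, Looker, Alation
  • Alert destinations: Slack, E-mail, webhooks for alerts & incidents
  • Security: Data stays in your environment
  • Metrics categories tracked: Freshness, Volume, Formats, Schema, Custom metrics, Nulls & blanks
  • Root cause analysis: Automated failed row analysis

Databand

  • Deployment support: SaaS, Open source core
  • Monitoring framework: Pipeline testing, Anomaly detection
  • Threshold setting: Manual, Automated
  • Interface type: Command-line tool, No-code
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated threshold updating
  • Data sources: Main data warehouses, Main databases, Main data lakes
  • Alert destinations: Slack, PagerDuty, OpsGenie, Custom
  • Security: Data stays in your environment
  • Metrics categories tracked: Schema, Formats, Distribution, Custom metrics, Outliers, System, Freshness, Data ingestion rate
  • Root cause analysis: Lineage

Monte Carlo

  • Deployment support: SaaS, Cloud
  • Monitoring framework: Anomaly detection, Pipeline testing
  • Threshold setting: Automated, Manual
  • Interface type: No-code, Command-line tool
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated threshold updating, Automated circuit-breaker, Quarantine bad data
  • Data sources: Main data lakes, Main data warehouses, Main databases, 500+
  • Integrations: Looker, Tableau, Periscope, Chartio, Alation, Atlan, Amundsen, Dbt, Datadog, Mode, PowerBI, Prefect, Datahub
  • Alert destinations: Slack, PagerDuty, Webhooks, Opsgenie, Custom, Teams, Mattermost
  • Security: Certified SOC 2 compliant, HIPAA, GDPR, PCI, CCPA, SOC 2 compliant, Data stays in your environment
  • Metrics categories tracked: Freshness, Volume, Distribution, Schema, Outliers, Custom metrics, Correlation across metrics, Nulls & blanks, Formats
  • Root cause analysis: Lineage, Correlation across metrics

Cito

  • Deployment support: SaaS, Cloud, VPC
  • Monitoring framework: Anomaly detection
  • Threshold setting: Automated, Manual
  • Interface type: No-code, API
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated threshold updating
  • Data sources: Main data warehouses
  • Integrations: Tableau, Dbt, Mode, PowerBI, Looker, Metabase
  • Alert destinations: Slack, E-mail
  • Security: Data stays in your environment
  • Metrics categories tracked: Nulls & blanks, Schema, Freshness, Outliers, Distribution, Custom metrics, Formats, Volume
  • Root cause analysis: Lineage, SQL Code Accessibility, Column-level lineage

Great Expectations

  • Deployment support: Open-source, cloud product coming soon
  • Monitoring framework: Pipeline testing, Anomaly detection
  • Threshold setting: Automated, Manual
  • Interface type: Command-line tool, Python notebooks, Python library
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated threshold updating, Automated circuit-breaker, Quarantine bad data, Auto-resolution
  • Data sources: Main data warehouses, Main data lakes, Main databases
  • Integrations: Atlan, Dbt, Dagster, Astronomer, Prefect, Pandas, Kedro, Flyte, Datahub, Marquez
  • Alert destinations: Slack, PagerDuty, Opsgenie, E-mail
  • Security: Data stays in your environment
  • Metrics categories tracked: Nulls & blanks, Correlation across metrics, Multivariate feature checks, Outliers, Freshness, Custom metrics, Volume, Distribution, Schema

Sifflet

  • Deployment support: SaaS, On-premises
  • Monitoring framework: Anomaly detection, Pipeline testing
  • Threshold setting: Automated, Manual
  • Interface type: No-code, API
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated threshold updating
  • Data sources: Main data warehouses, SQL server
  • Integrations: Tableau, Looker, Datadog, Dbt
  • Alert destinations: Slack, Gmail, APIs, PagerDuty
  • Security: Data stays in your environment
  • Metrics categories tracked: Freshness, Volume, Outliers, Formats, Distribution, Schema, Lineage, Nulls & blanks, Custom metrics
  • Root cause analysis: Lineage

Validio

  • Deployment support: SaaS, Deployed in the customer cloud environment
  • Monitoring framework: Anomaly detection, Pipeline testing
  • Threshold setting: Automated, Manual
  • Interface type: No-code
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated threshold updating, Automated circuit-breaker, Auto-resolution
  • Data sources: Main data warehouses, Main databases
  • Alert destinations: Slack, PagerDuty, E-mail
  • Security: Data stays in your environment
  • Metrics categories tracked: Freshness, Distribution, Outliers, Volume, Data ingestion rate, Schema, Formats, Multivariate feature checks

Lightup

  • Deployment support: SaaS, Managed on-prem, Fully on-prem
  • Monitoring framework: Testing, Anomaly detection
  • Threshold setting: Automated, Manual
  • Interface type: No-code, SQL, API
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated threshold updating
  • Data sources: Main data warehouses, Main databases, Main data lakes
  • Alert destinations: Slack, Teams, PagerDuty, E-mail, APIs, Mattermost, Webhooks, Flock
  • Security: ISAE 3000 compliant, Data stays in your environment, Certified SOC 2 compliant
  • Metrics categories tracked: Volume, Freshness, Schema, Distribution, Formats, Correlation across metrics, Custom metrics
  • Root cause analysis: Correlation across metrics, Lineage

Lantern

  • Deployment support: SaaS
  • Monitoring framework: Anomaly detection
  • Threshold setting: Automated
  • Interface type: No-code
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting
  • Data sources: Main data warehouses
  • Alert destinations: Slack, E-mail
  • Metrics categories tracked: Distribution, Volume

Metaplane

  • Deployment support: SaaS, VPC
  • Monitoring framework: Anomaly detection
  • Threshold setting: Automated, Manual
  • Interface type: No-code
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting, Automated threshold updating
  • Data sources: Main data warehouses, Main databases
  • Integrations: Dbt, Looker, Tableau, Mode, PowerBI
  • Alert destinations: Slack, PagerDuty, Opsgenie, Teams
  • Security: Certified SOC 2 compliant, Data stays in your environment
  • Metrics categories tracked: Freshness, Outliers, Distribution, Volume, Schema, Custom metrics, Nulls & blanks, Formats
  • Root cause analysis: Lineage, Correlation across metrics

Datafold

  • Deployment support: SaaS
  • Monitoring framework: Anomaly detection
  • Threshold setting: Automated
  • Interface type: No-code
  • High cardinality support / real-time data monitoring: Yes
  • Automated features: Automated threshold setting
  • Data sources: Main data warehouses
  • Alert destinations: Slack, PagerDuty, E-mail, Webhooks
  • Metrics categories tracked: Freshness, Outliers, Distribution

Acceldata

  • Monitoring framework: Pipeline testing
  • Data sources: Main data warehouses, Main data lakes
  • Metrics categories tracked: Distribution, Schema
  • Root cause analysis: Correlation across metrics

Anomalo

  • Deployment support: SaaS, Deployed in the customer cloud environment
  • Monitoring framework: Anomaly detection
  • Threshold setting: Automated
  • Automated features: Automated threshold setting
  • Data sources: Main data warehouses

Marquez

  • Deployment support: open-source
  • Monitoring framework: Testing
  • Threshold setting: Manual
  • Interface type: Command-line tool

Amundsen

  • Root cause analysis: Lineage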

Additional benchmark resources