By Louise de Leyritz from Castor (www.castordoc.com)
As data proliferates in modern organizations, the technologies we use to move it around have become more intricate. Data pipelines are now so complex that businesses struggle to identify the root cause of data issues, which leads to tremendous productivity losses.
This explains the explosion of data quality and observability tools (internal, open-source, and SaaS) over the past two years. The trend is not going to stop, and we'd rather bring visibility and structure to it sooner rather than later.
At CastorDoc, we believe the first step to structuring the Data Observability tools market is more transparency. For that reason, we put together a list of all the Observability tools we have heard of.
Get started with Data Observability Tools
Deeper dive into Data Observability
Deployment support: Does the solution support a SaaS deployment model, an open-source model, or both?
Monitoring framework: What approach does the solution use to investigate data? Does it use:
- A pipeline testing framework allowing data engineers to test binary statements. For example, whether all the values in a column are unique, or whether the schema matches a certain expression. When tests fail, the data is labeled as "bad". The data engineering team can thus diagnose poor quality data and take the necessary steps to resolve issues.
- An anomaly detection framework, where the solution scans data assets, collects statistics from them, and watches for changes in the behavior of these statistics. A minimal sketch contrasting the two approaches follows this list.
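The sketch below illustrates both approaches on a pandas DataFrame; the column names (`user_id`, `amount`), the history snapshots, and the threshold are hypothetical, not any vendor's actual API.

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3, 3], "amount": [10.0, 12.5, 9.8, 250.0]})

# Pipeline testing: binary assertions. A failed test labels the batch as "bad".
def run_pipeline_tests(df: pd.DataFrame) -> dict:
    return {
        "user_id_is_unique": df["user_id"].is_unique,
        "amount_is_not_null": df["amount"].notna().all(),
        "schema_matches": list(df.columns) == ["user_id", "amount"],
    }

# Anomaly detection: compare today's statistics against their historical behavior.
def detect_anomalies(df: pd.DataFrame, history: list, z_threshold: float = 3.0) -> bool:
    current_mean = df["amount"].mean()
    baseline = pd.Series([snapshot["amount_mean"] for snapshot in history])
    # Flag the batch if today's mean deviates too far from the historical mean.
    return abs(current_mean - baseline.mean()) > z_threshold * baseline.std()

print(run_pipeline_tests(df))
print(detect_anomalies(df, history=[{"amount_mean": 11.0}, {"amount_mean": 10.5}, {"amount_mean": 11.2}]))
```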
Threshold setting: Are the alert thresholds automatically set by the solution based on patterns observed in your data, or can you set the thresholds for your metrics manually?
Interface type: Is there a no-code interface? Is there a command-line tool used to run tests on datasets to find invalid, missing, or unexpected data?
High cardinality support: Does the solution provide an easy way to add hundreds or thousands of data streams to be monitored? If the solution doesn't support high cardinality, monitoring must be enabled stream by stream, which takes more time.
Monitoring frequency: How often does the observability tool monitor the data assets? Can it provide the relevant insights in real-time and identify data issues just as they are happening, with a possibility to stop the pipeline and prevent bad data from loading?
Automated features: Does the solution offer any of the following automation features? (A sketch of the circuit-breaker and quarantine steps follows this list.)
- Automated threshold setting: The solution uses machine learning algorithms to detect patterns in the datasets. This way, thresholds are automatically defined, flagging unusual data points.
- Automated thresholds updating: As data changes, the alert thresholds are automatically updated based on data quality metrics forecast.
- Automated circuit breaker: Can the solution automatically stop the pipeline from running when it identifies bad data?
- Automatic filtering-out: Can the solution automatically filter out and quarantine bad data that has been identified, ensuring bad data doesn't go in downstream applications?
- Auto-resolution: When the solution detects bad data, can it automatically fix the issue?
- Dynamic pipeline runs: Can the solution run different versions of pipelines based on changing data inputs, parameters, or model scores? The purpose here is to improve pipeline performance.
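To make the circuit-breaker and quarantine ideas concrete, here is a minimal sketch; the null-rate rule, the quarantine file, and the load step are hypothetical placeholders, not any vendor's actual API.

```python
import pandas as pd

class BadDataError(Exception):
    """Raised to break the circuit and stop the pipeline run."""

def load_to_warehouse(df: pd.DataFrame) -> None:
    # Placeholder for the real load step (e.g. a warehouse COPY or INSERT).
    print(f"Loaded {len(df)} clean rows")

def quarantine_and_check(df: pd.DataFrame, max_bad_rate: float = 0.05) -> pd.DataFrame:
    # Filter out and quarantine rows with missing values so they never reach
    # downstream applications.
    bad_rows = df[df.isna().any(axis=1)]
    good_rows = df.dropna()
    bad_rows.to_csv("quarantined_rows.csv", index=False)  # hypothetical quarantine sink

    # Circuit breaker: stop the pipeline entirely if too much of the batch is bad.
    bad_rate = len(bad_rows) / max(len(df), 1)
    if bad_rate > max_bad_rate:
        raise BadDataError(f"{bad_rate:.1%} of rows failed checks; halting the pipeline")
    return good_rows

batch = pd.DataFrame({"user_id": [1, 2, None], "amount": [10.0, None, 9.5]})
try:
    load_to_warehouse(quarantine_and_check(batch))
except BadDataError as err:
    print(f"Pipeline stopped: {err}")
```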
Data sources: Which data sources can the tool connect to?
Integrations: Does the solution integrate with other tools, such as a data catalog?
Alert destinations: Which applications does the tool send alerts to when a threshold is reached? Slack, PagerDuty? Can alerting be customized to the notification system your team uses?
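As a minimal sketch of a custom alert destination, posting a threshold alert to a Slack incoming webhook looks roughly like this; the webhook URL and the message are placeholders.

```python
import json
import urllib.request

def send_slack_alert(message: str, webhook_url: str) -> None:
    # Slack incoming webhooks accept a JSON payload with a "text" field.
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

send_slack_alert(
    "Freshness alert: table `orders` has not been refreshed for 4 hours",
    webhook_url="https://hooks.slack.com/services/XXX/YYY/ZZZ",  # placeholder URL
)
```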
Security: Is your data in safe hands with the solution? Does the solution have access to your data or PII? Is the solution SOC 2 compliant?
Metrics categories tracked: Which metrics does the solution track? Data freshness, distribution, volume? (A sketch of how a few of these might be computed follows this list.)
- Freshness: Detect any glitch in the refresh schedule of your data. Say your data is meant to be refreshed every hour, but it hasn't been refreshed for 4 hours: a solution that monitors freshness will make you aware of it.
- Distribution: Metrics relating to distributions include the range of the data, the variance, the mean, the kurtosis, and many others. If any of these metrics reach a critical threshold, the observability solution sends an alert to the platform you use.
- Volume: A volume issue occurs when your data is incomplete. For example, you're expecting to find one million rows in your table, but you only have 100,000. Monitoring data volume will alert you when such a thing happens.
- Schema: Schema deals with changes that have been performed on the data. Has a table been added, removed, modified? That's all related to the schema.
- Format: Get alerted when your data is not in the expected format.
- Nulls and blanks: The solution alerts you when some cells are null when they shouldn't be.
- Custom metrics: Define the metrics you want to track yourself; the solution makes it possible.
- System: System metrics come from your pipeline executions, such as duration and resource utilization.
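To make these categories concrete, here is a minimal sketch of how freshness, volume, null, distribution, and schema metrics might be computed for a single table, assuming a pandas DataFrame with hypothetical `updated_at` and `amount` columns:

```python
from datetime import datetime, timezone
import pandas as pd

def collect_quality_metrics(df: pd.DataFrame) -> dict:
    now = datetime.now(timezone.utc)
    return {
        # Freshness: how long since the table last received new data.
        "minutes_since_last_update": (now - df["updated_at"].max()).total_seconds() / 60,
        # Volume: is the table as big as expected?
        "row_count": len(df),
        # Nulls and blanks: share of missing values per column.
        "null_rate": df.isna().mean().to_dict(),
        # Distribution: summary statistics whose drift can be alerted on.
        "amount_mean": df["amount"].mean(),
        "amount_std": df["amount"].std(),
        # Schema: the current column layout, to diff against the previous run.
        "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }

df = pd.DataFrame({
    "amount": [10.0, 12.5, None],
    "updated_at": pd.to_datetime(["2024-05-01 10:00", "2024-05-01 11:00", "2024-05-01 12:00"], utc=True),
})
print(collect_quality_metrics(df))
```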
Root cause analysis: Root cause analysis features allow your organization to identify the cause of a data issue. These include:
- Lineage: Data lineage covers where data originates, what happens to it, and where it moves over time. It is the process of understanding, recording, and visualizing data as it flows from data sources to consumption (see the sketch after this list).
- Correlation between metrics: Tracking correlations between metrics and events helps you pinpoint the cause of data issues.
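Lineage-based root cause analysis essentially walks a dependency graph upstream from the failing asset. A minimal sketch, with a hypothetical set of table dependencies:

```python
# Map each table to the upstream tables it is built from (hypothetical pipeline).
upstream = {
    "revenue_dashboard": ["orders_enriched"],
    "orders_enriched": ["orders_raw", "customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}

def upstream_candidates(failing_table: str) -> list:
    """Return every upstream asset that could be the root cause of an issue."""
    seen, stack = [], [failing_table]
    while stack:
        table = stack.pop()
        for parent in upstream.get(table, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

# If freshness breaks on the dashboard, these are the assets to inspect first.
print(upstream_candidates("revenue_dashboard"))
# ['orders_enriched', 'orders_raw', 'customers_raw']
```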
Name | Link | Deployment support | Monitoring framework | Threshold setting | Interface type | High cardinality support | Real-time data monitoring | Automated features | Data sources | Integrations | Alert destinations | Security | Metrics categories tracked | Root cause analysis | Community | GitHub |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Bigeye |  | SaaS, On-premises | Anomaly detection, Pipeline testing | Automated, Manual | No-code | Yes |  | Automated threshold setting, Automated threshold updating, Automated circuit-breaker, Quarantine bad data | Main data warehouses, Main databases |  | Gmail, Slack, PagerDuty, APIs | Certified SOC 2 compliant, Data stays in your environment | Freshness, Outliers, Formats, Distribution, Volume, Custom metrics, Nulls & blanks |  |  |  |
Soda |  | Cloud, Open-source | Anomaly detection, Pipeline testing | Automated, Manual | No-code, Command-line tool | Yes |  | Automated threshold setting, Automated circuit-breaker, Quarantine bad data, Automated threshold updating | Main data warehouses, Main data lakes | Collibra, Tableau, Looker, Alation | Slack, E-mail, Webhooks for alerts & incidents | Data stays in your environment | Freshness, Volume, Formats, Schema, Custom metrics, Nulls & blanks | Automated failed row analysis |  |  |
Databand |  | SaaS, Open source core | Pipeline testing, Anomaly detection | Manual, Automated | Command-line tool, No-code | Yes |  | Automated threshold setting, Automated threshold updating | Main data warehouses, Main databases, Main data lakes |  | Slack, PagerDuty, OpsGenie, Custom | Data stays in your environment | Schema, Formats, Distribution, Custom metrics, Outliers, System, Freshness, Data ingestion rate | Lineage |  |  |
Monte Carlo |  | SaaS, Cloud | Anomaly detection, Pipeline testing | Automated, Manual | No-code, Command-line tool | Yes |  | Automated threshold setting, Automated threshold updating, Automated circuit-breaker, Quarantine bad data | Main data lakes, Main data warehouses, Main databases, 500+ | Looker, Tableau, Periscope, Chartio, Alation, Atlan, Amundsen, Dbt, Datadog, Mode, PowerBI, Prefect, Datahub | Slack, PagerDuty, Webhooks, Opsgenie, Custom, Teams, Mattermost | Certified SOC 2 compliant, HIPAA, GDPR, PCI, CCPA, Data stays in your environment | Freshness, Volume, Distribution, Schema, Outliers, Custom metrics, Correlation across metrics, Nulls & blanks, Formats | Lineage, Correlation across metrics |  |  |
Cito |  | SaaS, Cloud, VPC | Anomaly detection | Automated, Manual | No-code, API | Yes |  | Automated threshold setting, Automated threshold updating | Main data warehouses | Tableau, Dbt, Mode, PowerBI, Looker, Metabase | Slack, E-mail | Data stays in your environment | Nulls & blanks, Schema, Freshness, Outliers, Distribution, Custom metrics, Formats, Volume | Lineage, SQL code accessibility, Column-level lineage |  |  |
Great Expectations |  | Open-source, Cloud product coming soon | Pipeline testing, Anomaly detection | Automated, Manual | Command-line tool, Python notebooks, Python library | Yes |  | Automated threshold setting, Automated threshold updating, Automated circuit-breaker, Quarantine bad data, Auto-resolution | Main data warehouses, Main data lakes, Main databases | Atlan, Dbt, Dagster, Astronomer, Prefect, Pandas, Kedro, Flyte, Datahub, Marquez | Slack, PagerDuty, Opsgenie, E-mail | Data stays in your environment | Nulls & blanks, Correlation across metrics, Multivariate feature checks, Outliers, Freshness, Custom metrics, Volume, Distribution, Schema |  |  |  |
Sifflet |  | SaaS, On-premises | Anomaly detection, Pipeline testing | Automated, Manual | No-code, API | Yes |  | Automated threshold setting, Automated threshold updating | Main data warehouses, SQL Server | Tableau, Looker, Datadog, Dbt | Slack, Gmail, APIs, PagerDuty | Data stays in your environment | Freshness, Volume, Outliers, Formats, Distribution, Schema, Lineage, Nulls & blanks, Custom metrics | Lineage |  |  |
Validio |  | SaaS, Deployed in the customer cloud environment | Anomaly detection, Pipeline testing | Automated, Manual | No-code | Yes |  | Automated threshold setting, Automated threshold updating, Automated circuit-breaker, Auto-resolution | Main data warehouses, Main databases |  | Slack, PagerDuty, E-mail | Data stays in your environment | Freshness, Distribution, Outliers, Volume, Data ingestion rate, Schema, Formats, Multivariate feature checks |  |  |  |
Lightup |  | SaaS, Managed on-prem, Fully on-prem | Testing, Anomaly detection | Automated, Manual | No-code, SQL, API | Yes |  | Automated threshold setting, Automated threshold updating | Main data warehouses, Main databases, Main data lakes |  | Slack, Teams, PagerDuty, E-mail, APIs, Mattermost, Webhooks, Flock | ISAE 3000 compliant, Data stays in your environment, Certified SOC 2 compliant | Volume, Freshness, Schema, Distribution, Formats, Correlation across metrics, Custom metrics | Correlation across metrics, Lineage |  |  |
Lantern |  | SaaS | Anomaly detection | Automated | No-code | Yes |  | Automated threshold setting | Main data warehouses |  | Slack, E-mail |  | Distribution, Volume |  |  |  |
Metaplane |  | SaaS, VPC | Anomaly detection | Automated, Manual | No-code | Yes |  | Automated threshold setting, Automated threshold updating | Main data warehouses, Main databases | Dbt, Looker, Tableau, Mode, PowerBI | Slack, PagerDuty, Opsgenie, Teams | Certified SOC 2 compliant, Data stays in your environment | Freshness, Outliers, Distribution, Volume, Schema, Custom metrics, Nulls & blanks, Formats | Lineage, Correlation across metrics |  |  |
Datafold |  | SaaS | Anomaly detection | Automated | No-code | Yes |  | Automated threshold setting | Main data warehouses |  | Slack, PagerDuty, E-mail, Webhooks |  | Freshness, Outliers, Distribution |  |  |  |
Acceldata |  |  | Pipeline testing |  |  |  |  |  | Main data warehouses, Main data lakes |  |  |  | Distribution, Schema | Correlation across metrics |  |  |
Anomalo |  | SaaS, Deployed in the customer cloud environment | Anomaly detection | Automated |  |  |  | Automated threshold setting | Main data warehouses |  |  |  |  |  |  |  |
Marquez |  | Open-source | Testing | Manual | Command-line tool |  |  |  |  | Amundsen |  |  |  | Lineage |  |  |