By Louise de Leyritz from CastorDoc
More data, more tools, more people = more data catalogs
Companies are opening up their analytics to more of their employees. Regardless of data literacy, most departments in large companies now use data. As a result, there is a growing need to improve trust in, and understanding of, data resources and infrastructure.
This explains the explosion of data catalogs (internal, open-source, and SaaS) over the past five years. The trend is not going to stop, so we'd rather bring visibility and structure to the market sooner rather than later.
At CastorDoc, we believe the first step toward structuring the data catalog market is more transparency. For that reason, we put together a list of all the catalog tools we have heard of.
Feature definition
For this benchmark, we're assuming that core capabilities like search, extraction, documentation, tagging, and discovery are standard fare in data catalogs. Instead, we're focusing on the standout features that can truly distinguish one catalog from another. Let us know if you feel that we are missing a feature. You will find a definition of each feature below:
Role-Based Access Controls: A system that grants or restricts data access based on a user's role within an organization.
Metadata Analytics: The analysis of aggregated data about other data, to uncover patterns and insights. This can include reports about unused data assets, for example.
Metadata Bulk Edit: The capability to make changes to metadata attributes across multiple data assets simultaneously.
Automated PII Tagging: The process of automatically identifying and marking personally identifiable information within datasets.
Social Data Discovery: A feature that enables users to explore and interact with data assets that colleagues or various teams within the organization are actively using and endorsing.
Column Lineage: The tracking of data's origin and transformations at the column level within databases or data models.
Definition Propagation: The automatic updating and synchronization of data definitions across multiple data assets and systems.
Personalized Views: Customized displays of data or interfaces tailored to individual user preferences or roles. This allows different roles to only see the information that is relevant to them.
Chrome Extension: A small software program that can be installed in the Chrome browser to extend its functionality. This allows users to access the data catalog without having to switch tooling.
Two-Way Sync: The continuous synchronization of data between two different locations, ensuring that each reflects the most recent version. This allows every tool to be a source of truth for documentation: whether stakeholders check the documentation in dbt, the data catalog, BI tools, etc., the definitions will always agree. Instead of having a single source of truth, a two-way sync ensures that all your tools act as a source of truth.
Slack Integration: The ability to connect and let users interact with the data catalog through Slack.
Teams Integration: The ability to connect and let users interact with the data catalog through Teams.
Natural Language Search: Search functionalities enhanced by artificial intelligence to provide more accurate and context-aware results.
AI Documentation: The automatic generation, enhancement, or maintenance of documentation using artificial intelligence.
AI for SQL: AI technologies applied to SQL for optimizing queries, generating code, or interpreting natural language requests into SQL commands.
AI Assistant: An AI-powered tool that provides users with assistance in various tasks through natural language interaction.
Business Glossary: A centralized repository of business terms and definitions, often linked to data assets for clarity and consistency.
Knowledge Map: A visual representation or framework that organizes and displays the relationships and flows between different metrics and KPIs. A sort of data lineage, but for metrics.
Advanced Tag Management: The ability to create, assign, manage, and search for tags within a data catalog, facilitating better organization and retrieval of data assets.
Advanced Search Filtering: Enhanced search capabilities that allow users to narrow down search results using multiple criteria and filters, improving the relevance of search outcomes.
Table Popularity & Frequent Users: Metrics that track and display the usage frequency of data tables and identify the most active users, providing insights into the most valuable and frequently accessed data assets.
Rich Text: The capability to format text within the data catalog's user interface, allowing for better presentation and readability of data documentation and metadata.
API-Based Ingestion: The process of importing metadata and other relevant data into the data catalog using application programming interfaces (APIs), enabling automation and integration with other systems.
On-Premise Metadata Extractor: A tool or service that extracts metadata from various data sources within an on-premises environment, as opposed to cloud-based sources.
SQL Editor: A built-in feature that allows users to write, edit, and execute SQL queries directly within the data catalog, facilitating data exploration and management.
Data Quality Integrations: Connections between the data catalog and data quality tools or services, enabling the assessment and monitoring of data quality within the catalog.
Policy and Workflow: Features that enable the creation, management, and enforcement of data governance policies and the automation of data-related workflows.
Multi-Tenant Infrastructure: An architecture that allows multiple customers or user groups (tenants) to use the same data catalog instance while keeping each tenant's data isolated and secure.
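To make two of the definitions above more concrete, here is a minimal sketch of how Definition Propagation could follow Column Lineage. It is an illustration only, not how any catalog in this benchmark implements the feature. The function name `propagate_definitions`, the column names, and the dictionary formats are all hypothetical, and the sketch assumes every lineage edge is a plain copy of the upstream column, so the upstream definition still applies downstream.

```python
from collections import deque


def propagate_definitions(definitions, lineage):
    """Copy definitions from documented columns to undocumented downstream columns.

    definitions: dict mapping a column name to its description (hypothetical format)
    lineage: dict mapping an upstream column to a list of its downstream columns
    """
    result = dict(definitions)
    # Start from every column that already has a definition.
    queue = deque(definitions)
    while queue:
        column = queue.popleft()
        for child in lineage.get(column, []):
            # Never overwrite a definition that already exists downstream.
            if child not in result:
                result[child] = result[column]
                queue.append(child)
    return result


# Hypothetical example: one documented source column, two undocumented copies.
definitions = {
    "raw.orders.customer_id": "Unique identifier of the customer who placed the order.",
}
lineage = {
    "raw.orders.customer_id": ["staging.orders.customer_id"],
    "staging.orders.customer_id": ["marts.revenue.customer_id"],
}
propagated = propagate_definitions(definitions, lineage)
```

In this sketch, existing downstream definitions are never overwritten. A real tool would likely also need to decide what to do when a column is transformed rather than copied.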
This is an attempt at classifying the tools on the market. If anything seems wrong, if the feature list seems off, or if you don't see your data catalog and would like it included, please reach out: louise@castordoc.com
Feature | Classification | Collibra | Alation | Atlan | CastorDoc | Informatica | Data World | Dataedo | OvalEdge | Purview | Octopai | Acryl | Secoda | Select Star | Metaphor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Role-Based Access Controls | Data Governance | ||||||||||||||
Metadata Analytics | Data Governance | ||||||||||||||
Metadata Bulk Edit | Data Governance | ||||||||||||||
Automated PII Tagging | Data Governance | ||||||||||||||
Advanced Tag Management | Data Governance | ||||||||||||||
Policy and Workflow | Data Governance | ||||||||||||||
Multi-Tenant Infrastructure | Data Governance | ||||||||||||||
Social Data Discovery | Data Discovery | ||||||||||||||
Advanced Search Filtering | Data Discovery | ||||||||||||||
Table Popularity & Frequent Users | Data Discovery | ||||||||||||||
Column Lineage | Data Lineage | ||||||||||||||
Cross Platform Lineage (ETL → Data Warehouse → BI tools) | Data Lineage | ||||||||||||||
Definition Propagation | Data Lineage | ||||||||||||||
Personalized Views | User Experience | ||||||||||||||
Chrome Extension | User Experience | ||||||||||||||
Rich Text | User Experience | ||||||||||||||
SQL Editor | User Experience | ||||||||||||||
Two-Way Sync | Integrations | ||||||||||||||
Slack Integration | Integrations | ||||||||||||||
API-Based Ingestion | Integrations | ||||||||||||||
On-Premise Metadata Extractor | Integrations | ||||||||||||||
Data Quality Integrations | Integrations | ||||||||||||||
Natural Language Search | AI features | ||||||||||||||
AI Documentation | AI features | ||||||||||||||
AI for SQL | AI features | ||||||||||||||
AI Assistant | AI features | ||||||||||||||
Business Glossary | Knowledge Management | ||||||||||||||
Knowledge Map | Knowledge Management | ||||||||||||||
More Resources
Data Catalog Pricing Guide:
Data Catalog Template:
Data Catalog RFI template:
Data Catalog ROI calculator:
FAQ
Do You Need a Data Catalog?
If you're having trouble finding data: A data catalog brings together information from different data sources, making it easier for users to search, discover, and access the specific data they require. Without a catalog, users may waste time navigating through databases and platforms to locate the datasets they need.
If you're unsure which datasets to use: A data catalog often provides features like data quality scores, user reviews, and additional annotations. These features help users identify the datasets that align with their goals, leading to better decisions and analytical outcomes.
If you have too many data sources at your disposal: In many organizations, data is scattered across various locations such as on-premises databases, cloud storage systems, or third-party platforms. A data catalog consolidates metadata from all these sources into a single view, making it easier for users to explore all data options and select the most suitable source for their requirements.
If your data environment has never been properly documented: Undocumented data leads to chaos and inefficiency. A data catalog not only helps organize your data but also ensures it is documented. It stores information about data lineage, owners, and definitions, giving everyone in the organization a shared understanding of the origin, purpose, and characteristics of each dataset.
If you need to comply with data regulations such as GDPR, CCPA, or others: It becomes essential to understand where personal data is stored, how it is used, and who has access to it. A data catalog can track this metadata, making it easier for organizations to demonstrate compliance and ensure that sensitive data is handled appropriately.
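As an illustration of the compliance point above, the sketch below shows one simple way a catalog could flag columns that may hold personal data, based only on column names. It is a simplified assumption, not the method of any specific tool. The patterns in `PII_PATTERNS`, the function `tag_pii_columns`, and the column names are hypothetical, and a production system would likely also inspect sampled values.

```python
import re

# Hypothetical name-based patterns; real tools would likely use richer rules.
PII_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "name": re.compile(r"(first|last|full)[-_]?name", re.IGNORECASE),
}


def tag_pii_columns(columns):
    """Return a dict mapping each suspicious column name to its PII labels."""
    tags = {}
    for column in columns:
        matches = [
            label for label, pattern in PII_PATTERNS.items() if pattern.search(column)
        ]
        if matches:
            tags[column] = matches
    return tags


# Hypothetical column names from a customer table.
tags = tag_pii_columns(["user_email", "order_total", "last_name", "phone_number"])
```

Name-based matching is cheap and easy to audit, but it misses personal data stored in columns with generic names, so the resulting tags are best treated as suggestions for a data steward to review.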