Evaluating Modern Data Governance Catalogs — Part 1

Learn how to evaluate a Data Catalog solution for your organization

Shravan Deolalikar
Infostrux Engineering Blog


Here at Infostrux, we have been keeping tabs on the growing field of modern data catalogs, focusing primarily on products that integrate with technology reflecting DataOps and engineering best practices: the Snowflake Cloud Data Warehouse, DBT (Data Build Tool), third-party SaaS integration tools such as Fivetran or Airbyte, and BI tools such as Tableau or Power BI. Before I get into the details of how we evaluate some of these products for our clients, I think some history will nicely frame the discussion.

I worked in an Enterprise Data Integration team at a US healthcare insurance company early in my career. At the time, I didn't fully understand the gravity of some of the problems my team was trying to address. In hindsight, I have a much greater appreciation for them.

Healthcare insurance providers in the late 90s and 2000s suffered from data silos, as many companies had multiple departments and systems that were heterogeneous and largely disconnected. This limited companies' ability to have a comprehensive view of patient information and insurance claims, as well as an understanding of data quality. Insurance providers often had to deal with incomplete and inaccurate data due to inconsistent data integration and a lack of standardization across departments and systems. Aging technical infrastructure and legacy systems further compounded these data integration issues.

The changing legal landscape of the late 1990s presented even more challenges for healthcare providers with the 1996 passage of the Health Insurance Portability and Accountability Act (HIPAA). The new privacy and security requirements had to be addressed in both the business and technical architecture. Protected health information (PHI) had to be safeguarded to preserve its confidentiality, integrity, and availability.

The healthcare industry adopted file formats like HL7 and X12 because they set standards for exchanging, integrating, sharing, and retrieving electronic health information. On many projects, I would be paired with business analysts who held in-depth contextual business knowledge of these files while I developed ETL processes. These complex data structures provided some efficiencies for exchanging data and metadata. Even so, formalized metadata collection processes and technology were unavoidable in the industry.

Understanding and Discoverability

Given the siloed and fractured technical landscape of systems in healthcare insurance companies, metadata collection techniques help create a unified view of data across departments and systems. By gathering and organizing metadata in a data catalog, an organization can give its users an easy way to search, discover, and understand relevant data assets.

Standardization and Data Quality

Metadata collection techniques help enforce and maintain data standards across the organization. Data catalogs store information about data formats, such as HL7 and X12, and their corresponding data elements, ensuring consistency and adherence to industry standards. This, in turn, improves data quality and facilitates data integration efforts.

Compliance and Regulations

Metadata collection and data catalogs are vital in managing the complex requirements of regulations like HIPAA. By collecting and maintaining information about data lineage and data processing (such as source-to-target mapping), data catalogs help organizations demonstrate compliance with privacy and security requirements. Additionally, data catalogs track metadata related to protected health information (PHI), ensuring the appropriate safeguards are in place to protect the confidentiality, integrity, and availability of PHI data.
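
To make “source-to-target mapping” concrete, here is a minimal sketch of the kind of mapping record a catalog might maintain for a PHI field; the systems, field names, and attributes are illustrative rather than taken from any particular product:

```python
# A hypothetical source-to-target mapping record of the kind a data
# catalog can store to support lineage and HIPAA compliance audits.
# All system, field, and attribute names are illustrative.
mapping_record = {
    "source": {"system": "claims_mainframe", "field": "MBR_SSN"},
    "target": {"system": "edw.members", "column": "member_ssn"},
    "transformation": "reformat as XXX-XX-XXXX; tokenize at rest",
    "classification": "PHI",          # drives privacy safeguards
    "last_validated": "2023-04-01",   # evidence for compliance reviews
}
```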

Data Integration

Metadata collection techniques streamline the integration of data from different systems and formats, such as HL7 and X12, by providing essential information about the data’s structure, relationships, and semantics. This enables more efficient and accurate mapping of data elements across systems, reducing risk and errors and speeding up integration efforts.

Modernization Efforts

While not generally discussed as a benefit of metadata collection and data governance tooling, these capabilities also help enterprises with complex legacy systems, such as long-established health insurance providers, by providing insights into data dependencies, data usage patterns, and system architectures. These insights can then inform planning for new technical solutions on modern infrastructure.

Evaluation Criteria

Whatever your strategic business drivers for implementing a data catalog, and whether you operate in the healthcare space or not, selecting the right one and crafting an implementation plan are complex matters.

At Infostrux, we have helped clients with this process and have explored the current landscape of offerings. The Forrester Wave report “Enterprise Data Catalogs for DataOps” is a good place to start understanding some of the strongest contenders in the space. While these products are marketed as “data catalogs,” many of the platforms address other aspects of data governance, such as policy management, business glossaries, data lineage, and, in some cases, even data quality management.

To evaluate data catalogs, we generally focus on various categories and capabilities fundamental to resilient data architecture. The tool should enable your organization’s capabilities across these concerns.

Discoverability

Discoverability refers to the ease with which users can find, access, and understand relevant information or resources within a system. In data management, discoverability encompasses the ability to locate and comprehend datasets, data elements, and their associated metadata in an efficient and user-friendly manner.

User experience features are one area we focus on:

  • Term Referencing — Is there an easy way to link and navigate through terms in the tool? Can you traverse relationships between terms and entities with ease? Are synonyms or related terms presented conveniently to the user?
  • Personalized UX — Can landing pages for users be personalized? Does the tool support personas, or user-type customizations?
  • Search Functionality — Are the search features intuitive, and do they allow users to find data sets, data elements, and metadata based on keywords, tags, or other attributes? Additionally, does the search functionality allow filtering for approved terms or “vetted” data elements via a data governance workflow?
  • Data classification and organization — Does the tool support classification and organization of data assets through categorization, tagging, and creating hierarchical structures? Is it intuitive and easy for a user to present a specific view of a taxonomy, ontology, or logical domain?
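
To ground the search and classification questions above, here is a toy Python sketch of keyword- and tag-based search over catalog entries, including a filter for governance-approved assets. The entry structure and fields are hypothetical; real catalogs back this with full-text indexing and relevance ranking:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A hypothetical catalog entry; real products store far more."""
    name: str
    description: str
    tags: set = field(default_factory=set)
    approved: bool = False  # set by a governance workflow

def search(entries, keyword=None, tag=None, approved_only=False):
    """Filter entries by keyword, tag, and approval status."""
    results = []
    for e in entries:
        if approved_only and not e.approved:
            continue
        if keyword and keyword.lower() not in (e.name + e.description).lower():
            continue
        if tag and tag not in e.tags:
            continue
        results.append(e)
    return results

entries = [
    CatalogEntry("claims_summary", "Monthly claims rollup", {"claims", "finance"}, True),
    CatalogEntry("member_raw", "Raw member feed (X12)", {"members", "raw"}),
]
print([e.name for e in search(entries, keyword="claims", approved_only=True)])
# ['claims_summary']
```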

Addressability

Addressability refers to the ability to uniquely identify and access specific components, resources, and elements within your data ecosystem. Data assets could be BI dashboards, tables in the data warehouse, metadata, and datasets. It is important that assets are referenced in a consistent, standardized manner (one common identifier scheme is sketched after the list below).

  • Collaboration and communication — Can datasets be uniquely identified and referenced? Can these unique data sets be easily shared?
  • Integration with other systems and tools — Does the tool allow for addressability via integration with other data management and analytics tools, such as data integration tooling, data warehouses, or BI platforms?
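
One common approach to addressability is URN-style identifiers. The sketch below is modeled loosely on the URN scheme used by open-source catalogs such as DataHub; the helper function itself is hypothetical:

```python
# Build a stable, unique identifier for a warehouse table. The format is
# modeled loosely on DataHub's dataset URNs; the helper is illustrative.
def dataset_urn(platform: str, database: str, schema: str,
                table: str, env: str = "PROD") -> str:
    return (f"urn:li:dataset:(urn:li:dataPlatform:{platform},"
            f"{database}.{schema}.{table},{env})")

print(dataset_urn("snowflake", "analytics", "claims", "claims_summary"))
# urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.claims.claims_summary,PROD)
```

Because the identifier encodes platform, fully qualified name, and environment, any tool in the ecosystem can reference and share the same asset without ambiguity.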

Reliability

Reliability refers to the degree to which users can trust and rely on the information, resources, or components within a system. In the context of data catalogs and data management, reliability speaks to the ability to establish confidence in the accuracy, consistency, and quality of data assets and metadata. This is often referred to as “trustability.”

Data Governance workflow support is key in establishing trust in assets presented in the data catalog. Data Stewards will want to vet data assets for definitions and data quality before they are made available for consumption by knowledge workers.

  • Metadata Content Management — Is there support for a configurable data governance workflow? Does the product have the ability to define and implement workflows to manage content (approval/ratification) and enforce policies, business rules, and standards? Additionally, will other users in the ecosystem be notified of new or modified governed entities?
  • Metadata Life Cycle Management — Closely related to aspects of metadata content management, does the product provide lifecycle information in terms of evaluation for acceptance and general use? Is the retirement of a data asset clearly communicated by the platform?
  • Classification of Sensitive Data — Can datasets be identified via a data classification policy? Can the proper security controls be put on data assets in the platform, such as row-level or column-level obfuscation or tokenization? (A sketch follows this list.)
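
To illustrate what column-level protection downstream of a catalog classification can look like, the sketch below applies a Snowflake masking policy through the official Snowflake Python connector. The account, credentials, table, and policy names are illustrative, and masking policies require Snowflake Enterprise Edition or higher:

```python
import snowflake.connector  # official Snowflake Python connector

# Sketch: apply column-level masking once a catalog classifies a column
# as PHI. Account, credentials, table, and policy names are illustrative.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", role="SYSADMIN"
)
cur = conn.cursor()

# Only privileged roles see the raw value; everyone else gets a mask.
cur.execute("""
    CREATE MASKING POLICY IF NOT EXISTS ssn_mask AS (val STRING)
    RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('PHI_READER') THEN val
           ELSE 'XXX-XX-XXXX' END
""")
cur.execute("ALTER TABLE members MODIFY COLUMN ssn SET MASKING POLICY ssn_mask")
conn.close()
```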

On top of Data Governance Workflow support, features relating to data lineage are important in ensuring reliability in your data ecosystem.

  • Lineage and Logic Description — Is lineage visible for data assets? Do links between nodes have a logic description?
  • Lineage Enrichment — Can comments or annotations be added to the transformation logic descriptions on edges?
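
As a minimal sketch of lineage with logic descriptions and enrichment, the example below models assets as nodes and attaches a human-readable transformation description and steward annotations to each edge. All names are illustrative:

```python
# A toy lineage graph: nodes are data assets, and each edge carries a
# logic description plus annotations that stewards can enrich over time.
lineage_edges = [
    {
        "from": "raw.claims_837",  # X12 837 claim feed
        "to": "staging.claims_parsed",
        "logic": "parse X12 segments into columns",
        "annotations": ["verified by data steward 2023-05-02"],
    },
    {
        "from": "staging.claims_parsed",
        "to": "marts.claims_summary",
        "logic": "aggregate by member and month; exclude test claims",
        "annotations": [],
    },
]

def upstream(asset: str) -> list:
    """Return the immediate upstream assets feeding `asset`."""
    return [e["from"] for e in lineage_edges if e["to"] == asset]

print(upstream("marts.claims_summary"))  # ['staging.claims_parsed']
```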

Collaborative features are also important in establishing trust in data assets. The ability to see social activity around a data asset alerts users to the trustworthiness of the asset and the overall metadata life cycle management. These features help set some of the current offerings in the data catalog space apart from the rest.

  • Commenting and Discussion Forums — Can the platform support comments and discussions for agile data governance? Do users get alerted when comments are added?

Self-Describing

“Self-describing” in the context of data management and data catalogs refers to the ability of the tool to provide sufficient information about the data ecosystem structure, function, or usage without the need for external documentation or resources.

  • Automated Metadata Loading — Does the tool provide automated metadata loading from the data ecosystem, such as integration tools, transformation tools like DBT, and cloud data warehouses like Snowflake? (A minimal harvesting sketch follows this list.)
  • Manual Metadata Loading — Does the tool allow for manually uploading metadata?
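
As one concrete example of automated metadata loading, DBT writes a manifest.json artifact to the target/ directory on every compile or run, and its nodes section carries model and column descriptions a catalog can harvest. The sketch below assumes a compiled DBT project in the working directory:

```python
import json

# Sketch: harvest model and column descriptions from DBT's manifest
# artifact, which DBT writes to target/manifest.json on compile/run.
with open("target/manifest.json") as f:
    manifest = json.load(f)

for unique_id, node in manifest["nodes"].items():
    if node["resource_type"] != "model":
        continue  # skip tests, seeds, snapshots, etc.
    print(unique_id, "-", node.get("description") or "(no description)")
    for col_name, col in node.get("columns", {}).items():
        print("   ", col_name, "-", col.get("description", ""))
```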

Security

Security implies the basic data governance controls essential to properly protecting metadata and providing discretionary access to it. The areas we focus on are policy management, user management, and user monitoring; governance workflows also fall into this category.

  • Policy Management — Can assets be associated with a certain data policy? How are data policies presented to the user? Are policies intuitive to work with, and do they provide the necessary alerting to other users?
  • User Management — Does the tool use Role-Based Access Control (RBAC)? Does it support Single Sign-On and Multi-Factor Authentication? Is user provisioning and de-provisioning straightforward and intuitive? Does the tool allow permissions and assets to be associated with customizable roles? Does it offer fine-grained access controls over data sets? Is user activity monitored and auditable?
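
To ground the role-based access control questions, here is a toy sketch of associating permissions with customizable roles; real platforms layer SSO, MFA, and audit logging on top of a model like this:

```python
# Toy RBAC model: role names and permission sets are illustrative.
ROLE_PERMISSIONS = {
    "data_steward": {"read", "edit", "approve"},
    "analyst": {"read"},
}

def can(role: str, action: str) -> bool:
    """Check whether a role is allowed to perform an action."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert can("data_steward", "approve")
assert not can("analyst", "edit")
```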

Conclusion

The criteria I have shared with you are not exhaustive. Certainly, many other concerns fall under these categories during an evaluation. I have focused on evaluating the technology itself and the capabilities it will bring to your overall data architecture.

Evaluating your company’s culture, organizational structure, and data strategy is important in establishing requirements for a metadata management strategy and implementation. The technology vendor itself and its commitment to successful customer implementations are another set of concerns we will explore in a follow-up.

At Infostrux, we have helped clients evaluate and implement data governance tooling and workflows. Get in touch if you need help evaluating and implementing a data catalog solution or help with other data management initiatives.

Try Infostrux Solutions’ open-source offerings in our GitHub. We are a Snowflake-focused consulting business changing the world of data.
