Anatomy of a Modern Data Solution on Snowflake — Data Modeling and Documentation

Learn about modeling approaches for Snowflake and what to consider when selecting one.

Milan Mosny
Infostrux Engineering Blog

--


This blog is the fifth in the “Anatomy of a Modern Data Solution on Snowflake” series. It talks about the fourth element of a solution — data modeling and documentation.

Why is data modeling important? A good model does at least two things:

  • It brings utmost clarity to the data that the solution makes available for analysis.
  • It allows the reuse of the business logic that turns raw data from the sources into data that is easy to slice and dice or otherwise analyze.

Clarity is important for the trustworthiness of the overall solution. It prevents questions like “Why is my dashboard showing 5 when yours is showing 7?” and its many variants. Even a seemingly simple concept such as “sales” needs to be well defined and often clarified with the business to understand its purpose. Is it sales before taxes (for salespeople) or after taxes (for people dealing with cash flow)? Is it booked, shipped, or delivered? Is it before or after discounts? How are invoice-level discounts attributed to individual products? What do we do with credits for future purchases? How about returns and voids? Depending on the answers to these and many other questions, one can easily come up with hundreds of definitions of revenue, no two of which are the same.
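
To make the ambiguity concrete, here is a minimal sketch of two equally plausible “sales” queries. The order_lines table and its columns are hypothetical, but the two queries answer the underlying questions differently and will happily return different numbers:

```sql
-- "Sales" as booked, before discounts and before taxes:
SELECT SUM(gross_amount) AS sales
FROM order_lines
WHERE order_status = 'BOOKED';

-- "Sales" as shipped, after line-level discounts, before taxes, excluding returns:
SELECT SUM(gross_amount - discount_amount) AS sales
FROM order_lines
WHERE order_status = 'SHIPPED'
  AND line_type <> 'RETURN';
```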

Reusing business logic guarantees that the same well-defined and clarified metric (such as revenue) or dimension attribute is computed the same way and that the computation does not have to be coded twice. It allows us to standardize which source tables and columns are used to come up with the answer and how data from multiple systems are combined. This helps especially with older legacy systems, where the business logic within the source system may have changed over time and different tables and columns may provide different answers to the same question. Similarly, newer, more complex systems or services may have similar issues due to bugs or misunderstandings. By modeling a concept in one place, and thus following the DRY principle, we can afford to do a bit more exploration and analysis to find the right answer. The added bonus is that there is a single place where our misunderstandings of the source systems, or their bugs, can be fixed.
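
As a minimal sketch of what that single place might look like (all schema, table, and column names here are hypothetical), the agreed-upon definition can live in one view or model that every downstream report selects from:

```sql
-- One place that encodes the agreed-upon definition of net revenue.
-- Downstream models and dashboards select from this view instead of re-deriving it.
CREATE OR REPLACE VIEW analytics.fct_net_revenue AS
SELECT
    order_line_id,
    order_id,
    order_date,
    product_id,
    gross_amount
      - line_discount_amount
      - allocated_invoice_discount    -- invoice-level discounts pre-allocated to lines
      AS net_revenue_before_tax
FROM staging.order_lines
WHERE order_status IN ('SHIPPED', 'DELIVERED')  -- agreed: revenue is recognized at shipment
  AND line_type <> 'RETURN';                    -- agreed: returns are handled in a separate model
```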

Business and Technical Drivers

When deciding on modeling approaches and technologies, consider the following:

  • Fit to use cases/usage patterns — for example, integrating many data sources could be a good fit for Data Vault, a need to compute many metrics based on simpler ones may lead to a star schema combined with a layered approach, and a need for type 2 slowly changing dimensions for the vast majority of dimensions could indicate wide tables. Often, a combination of approaches is needed to yield the desired results.
  • Fit to the visualization tooling — some tools, for example Power BI, work very well with star schemas, while others, for example Sigma, are happier with wide tables.
  • Data team capabilities and preferences — your team could be deep into metadata-driven ETL, or they may prefer simple SQL transformations. They may prefer low-code or no-code solutions, or they may prefer (and be well versed in) everything-as-code. Depending on the solution's needs, one may choose paths that include non-trivial training and up-skilling or re-skilling of your team.

Other considerations are common to many other elements of the solution and could be important when choosing the approach, tooling or technology:

  • Latency and frequency of updates
  • Security and governance
  • Multitenancy
  • Location requirements
  • SLAs and Reliability
  • Support and Maintenance
  • Disaster recovery
  • Development and deployment process
  • True cost-to-own (TCO)

They were described in more detail in previous blogs.

Activities and Deliverables

Choosing the right approach is usually the first step. Once the approach is clear, building the models follows. The model build does not necessarily have to be an explicit or manual activity. For example, we can choose a code-first approach, where the resulting logical or physical models are derived from the transformation code through automation. Or we can generate first-cut physical models (or even the transformation implementation) from conceptual models and some additional metadata. As long as the models provide clarity and reuse, the job is done. Model building is often part of an agile process and does not require complete models for the whole solution to be ready before implementation starts. The modeling can be done one use case at a time.

The deliverables (whether created manually or via automation, and whether defined explicitly or derived from other elements of the solution) often include:

  • Business Glossary
  • Conceptual Data Model
  • Logical Data Model
  • Physical Data Model

The implementation creates the transformations that (given our assumption of a Modern Data Stack) run almost exclusively in Snowflake.

Approaches, Best Practices and Technologies

Modeling

Snowflake itself does not constrain modeling approaches in a significant way. This differs from some other cloud data warehousing platforms that impose performance penalties when working with, for example, joins. Any of the following seems to work well (a minimal star-schema sketch follows the list):

  • Normalized schemas following typical entity/relationship models
  • Data Vault
  • Star Schemas
  • Wide Tables
  • Layered Architectures
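
As an illustration of the star-schema option, here is a minimal sketch of one fact table and one type 2 dimension. The marts schema and all table and column names are assumptions for the sake of illustration:

```sql
-- A conformed customer dimension with type 2 (slowly changing) history.
CREATE TABLE IF NOT EXISTS marts.dim_customer (
    customer_key   NUMBER AUTOINCREMENT,  -- surrogate key
    customer_id    VARCHAR,               -- natural key from the source system
    customer_name  VARCHAR,
    region         VARCHAR,
    valid_from     TIMESTAMP_NTZ,         -- type 2 validity interval
    valid_to       TIMESTAMP_NTZ
);

-- The fact table records measures at order-line grain and points at the dimensions.
CREATE TABLE IF NOT EXISTS marts.fct_sales (
    order_line_id  VARCHAR,
    customer_key   NUMBER,                -- references dim_customer.customer_key
    order_date     DATE,
    net_revenue    NUMBER(18, 2)
);
```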

Best Practices

Some of the best practices to keep in mind:

  • Naming Conventions — defining a good set and enforcing it leads to data models that are easy to understand and use and are, therefore, faster to work with.
  • Conformance to the Conceptual Model — similarly, having a clear definition of the business domains helps with consistency and leads to fewer questions when working with or using the model and, consequently, higher speed.
  • Clear Documentation — to guarantee clarity, documentation of the models is necessary.
  • Data Contracts — enforcing data contracts leads to fewer bugs and fewer breakages downstream (a sketch of such a check follows this list).
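
As a minimal sketch of a data contract check, reusing the hypothetical fct_net_revenue model from earlier, the contract “every published order line appears exactly once and always has a key” can be enforced with a query that must return zero rows:

```sql
-- Any rows returned here are contract violations and should fail the pipeline.
SELECT order_line_id, COUNT(*) AS occurrences
FROM analytics.fct_net_revenue
GROUP BY order_line_id
HAVING order_line_id IS NULL
    OR COUNT(*) > 1;
```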

Modeling Tools and Technologies

Simple docs and spreadsheets can go a long way as long as they are used consistently and, ideally, are machine-readable.

Dedicated tools such as ERWin can be used for larger and more complex models.

Data catalogs, such as data.world, Collibra, Alation, Atlan or the open-source OpenMetadata, allow for documenting and sharing all levels of data models.

Some transformation tools, such as dbt, can produce data model documentation. The metadata produced by such tools can be further used by automation to verify best practices, such as naming conventions, conformance to conceptual models, presence of documentation or conformance to data contracts.
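
For example, a naming-convention check can be as simple as a query over Snowflake's INFORMATION_SCHEMA. The database name and the dim_/fct_ prefix convention below are assumptions for the sake of illustration:

```sql
-- Flag tables in the marts schema that do not follow a dim_* / fct_* naming convention.
SELECT table_schema, table_name
FROM analytics_db.information_schema.tables
WHERE table_schema = 'MARTS'
  AND table_type = 'BASE TABLE'
  AND NOT (table_name ILIKE 'DIM_%' OR table_name ILIKE 'FCT_%');
```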

Transformations — Snowflake

Snowflake itself provides multiple ways to transform data:

  • Views — the usual SQL views common to many database platforms.
  • Materialized Views — an optimization over views (with some fairly strong restrictions).
  • Dynamic Tables — a newer addition that allows for declarative specification of the transformation in SQL (very much like views) but, under the hood, implements incremental (when possible) materialization of that transformation. This is an amazing feature that makes data pipelines in Snowflake an order of magnitude simpler.
  • Streams and Tasks — built-in change tracking similar to CDC (streams) combined with job scheduling (tasks) can be used to build complex and efficient data transformation pipelines. A sketch of a dynamic table and a stream/task pipeline follows this list.
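
Here is a minimal sketch of both options. The warehouse, schema, and table names (including the task's target table) are hypothetical:

```sql
-- Dynamic table: declare the transformation once; Snowflake keeps it refreshed
-- (incrementally where possible) within the requested target lag.
CREATE OR REPLACE DYNAMIC TABLE analytics.fct_daily_sales
  TARGET_LAG = '15 minutes'
  WAREHOUSE  = transform_wh
AS
SELECT order_date, product_id, SUM(net_revenue_before_tax) AS net_revenue
FROM analytics.fct_net_revenue
GROUP BY order_date, product_id;

-- Streams and tasks: process only changed rows on a schedule.
CREATE OR REPLACE STREAM staging.order_lines_stream ON TABLE staging.order_lines;

CREATE OR REPLACE TASK analytics.process_order_line_changes
  WAREHOUSE = transform_wh
  SCHEDULE  = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('STAGING.ORDER_LINES_STREAM')
AS
  INSERT INTO analytics.order_line_changes   -- hypothetical target table
  SELECT * FROM staging.order_lines_stream;

-- Tasks are created suspended; remember to resume them.
ALTER TASK analytics.process_order_line_changes RESUME;
```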

Transformations — Third-party software and services

There are literally dozens, too many to list, so the following is not meant to be exhaustive or ordered in any way. On the commercial side, the usual suspects are dbt Cloud, Coalesce, Rivery, Matillion, Talend, ETLeap, and Ascend.io. On the open-source side, dbt Core is our favorite.

There are newcomers fighting for the spotlight every year. If you’ve seen anything worthwhile, please leave a comment.

An interesting feature of some newcomers and incumbents (for example, Cube or dbt Cloud) is support for semantic layers, which can simplify the definition and reuse of metrics.

Conclusion

The previous blogs showed us what it takes to set up a Snowflake account, load the data and set up visualization.

This blog talks about data modeling on Snowflake. We have listed the considerations that need to be taken into account before selecting the right approach and touched on the ways to get it done.

At this point in our journey (account, EL, modeling, documentation, and visualization), we have a fairly decent data analytics solution. Of course, it can be improved further. Here's what it does well and not so well across these dimensions:

  • Capabilities: Descriptive — Canned Reports and Ad-hoc, Diagnostic, Predictive, Prescriptive
  • Trustworthiness: Documentation, Availability, Data freshness, Low bug frequency, No surprises.
  • Data Quality: Completeness, Integrity, Validity, Uniqueness, Consistency, Accuracy, and Conformity to standards
  • Agility: Medium
  • Security and Governance: Basic

Modeling and related documentation helped with agility thanks to reuse, trustworthiness (documentation) and some data quality (consistency, conformity to standards).

A lack of automated tests and observability still hurts many aspects of trustworthiness, data quality and agility.

In the following blog posts, we explore how the additional elements (Orchestration, Automated Tests, DevEx, CI/CD, Data Governance, Metadata Management, Data Quality and Observability) improve our solution considerably.

Thank you for reading to the end. I hope you enjoyed the blog. Any feedback is welcome — please leave a comment!

I’m Milan Mosny, CTO at Infostrux Solutions. You can follow me here on Infostrux Medium Blog or LinkedIn. I write about Snowflake, data engineering and architecture and occasionally about other topics dear to my heart.
