Anatomy of a Modern Data Solution on Snowflake — Introduction

Learn how to build capable, trusted, and genuinely agile data solutions.

Milan Mosny
Infostrux Engineering Blog


This is the first in a series of blog posts discussing modern data solutions on Snowflake. To keep the discussion focused, we'll first make some assumptions about a typical current data solution. Then, we'll go over the criteria by which we can measure the “goodness” of a data solution: capabilities, trustworthiness, agility, and security and compliance. Finally, we'll cover the individual elements of a typical solution. Examples are Snowflake Account Setup, Extract and Load, Visualization, and Data Modelling, but there are many more.

Assumptions

For this blog, we'll assume we are building a greenfield solution. This is not always the case — many Snowflake projects start as migrations, which pose their own constraints and requirements. However, even migrations need a well-defined target architecture, and many points from this blog remain relevant. This is especially true of migrations of the “redesign” or “rearchitect” variety, as opposed to “lift and shift”.

The second assumption we are going to make is that of the Modern Data Stack approach:

  • A cloud data warehouse, in our case Snowflake
  • ELT (as opposed to ETL)
  • SaaS or otherwise ready-made tooling for extract and load, transformations, orchestration, and reverse ETL
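The ELT point above is worth a small illustration: unlike ETL, we land the raw data first and transform it inside the warehouse. This is a minimal sketch in plain Python, where in-memory lists stand in for raw and curated schemas (the names and the toy transformation are hypothetical, not part of the series):

```python
# ELT: extract and load raw data as-is, then transform inside the "warehouse".
raw_zone = []      # stands in for a raw/landing schema
curated_zone = []  # stands in for a transformed/curated schema

def load(records):
    """E+L: land source records untouched, preserving the original payload."""
    raw_zone.extend(records)

def transform():
    """T: rebuild curated records from the raw ones, inside the warehouse."""
    curated_zone.clear()
    for r in raw_zone:
        curated_zone.append({"email": r["email"].strip().lower()})

load([{"email": "  Ada@Example.com "}])
transform()
print(curated_zone)  # [{'email': 'ada@example.com'}]
```

Because the raw payload is kept untouched, transformations can be rerun or redefined later without re-extracting from the source — one of the main practical advantages of ELT.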

The third assumption is that of scope. For now, we'll end our journey without venturing into more advanced data management topics such as Reference and Master Data Management or Data Ethics.

The fourth assumption is that of a reasonable simplicity and a single solution. We'll refrain from venturing into Data Mesh, Data Fabric, or other ways to organize data work among many teams. However, many points made here are applicable in those settings.

Criteria

What makes a data solution “good”? Many criteria can be considered, and their importance will vary from solution to solution. Here, we focus on the four we find to be most common:

  • Capabilities
  • Trustworthiness
  • Agility
  • Security and Compliance

Capabilities

Capabilities are about what a solution can do. The typical capabilities of an analytics solution are often described as follows:

  • Descriptive — the solution describes what has happened in the past. The usual forms are Canned Reports or Dashboards, which answer a particular business question but may still allow slicing, dicing, or drilling down for more detail, and Ad-hoc Dashboarding or Reporting, which enables data-savvy business users to answer most of their questions without involving the data team.
  • Diagnostic — the solution allows data scientists or analysts to gain deeper insights into the business, including explaining why something happened.
  • Predictive — the solution can make forecasts and predictions of the future based on historical data.
  • Prescriptive — the solution can suggest appropriate actions based on historical data, forecasts, and predictions.

Trustworthiness

One of the common reasons data projects fail is data that cannot be trusted. Here are a few dimensions of trust worth keeping in mind:

  • Good documentation — accurate and accessible
  • Availability — up and running when it’s needed, recovers from disasters gracefully
  • Data freshness — reasonable frequency and latency of data refresh
  • Low bug frequency — bugs happen, but they should not happen often
  • No surprises — know about issues before our users do, and let them know
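The “data freshness” and “no surprises” points lend themselves to simple automated checks: if we know when a table was last refreshed, we can alert before users notice stale data. A minimal sketch in Python, with a hypothetical one-hour SLA (in a real solution, the last-loaded timestamp would come from warehouse metadata or an observability tool):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLA: data must be no more than one hour old.
MAX_LAG = timedelta(hours=1)

def is_fresh(last_loaded_at, now=None, max_lag=MAX_LAG):
    """Return True if the last refresh is within the allowed lag."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_lag

# A table refreshed 30 minutes ago meets the SLA; one refreshed
# two hours ago should trigger an alert before users notice.
now = datetime.now(timezone.utc)
print(is_fresh(now - timedelta(minutes=30), now))  # True
print(is_fresh(now - timedelta(hours=2), now))    # False
```

A check like this, run on a schedule and wired to a notification channel, is often the cheapest way to deliver on the “no surprises” promise.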

Perhaps one of the most important aspects of trustworthiness is data quality, to which we’ll dedicate a separate section.

Data Quality

Here, we follow a standard set of data quality dimensions:

  • Completeness — no missing data.
  • Integrity — primary keys and foreign keys work as expected, well-defined lineage.
  • Validity — the data is available for alignment; i.e., the same data, such as state/province or country, can be joined across data sets.
  • Uniqueness — a single record represents a single real-world entity or its state. This is a much stronger requirement than just guaranteeing no duplicate records.
  • Consistency — the information matches across the system; the same fields contain the same information.
  • Accuracy — real-world entity correspondence. This is a very strong requirement. Note that measuring accuracy against the source systems does not guarantee accuracy if the source systems contain inaccurate data.
  • Conformity to standards — internal or external, as required by the solution.
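Several of these dimensions translate directly into automated checks. The sketch below shows completeness, uniqueness, and conformity checks over a handful of in-memory rows; the table and column names are hypothetical, and in practice such checks would run as warehouse tests (e.g., in a transformation framework) rather than in application code:

```python
# Hypothetical sample rows with deliberate quality issues.
rows = [
    {"customer_id": 1, "country": "CA"},
    {"customer_id": 2, "country": "US"},
    {"customer_id": 2, "country": "US"},   # duplicate key
    {"customer_id": 3, "country": None},   # missing value
]

def completeness(rows, column):
    """Fraction of rows with a non-null value in `column`."""
    return sum(r[column] is not None for r in rows) / len(rows)

def is_unique(rows, key):
    """True if no two rows share the same value of `key`."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def conforms(rows, column, allowed):
    """True if every non-null value of `column` is in the allowed domain."""
    return all(r[column] in allowed for r in rows if r[column] is not None)

print(completeness(rows, "country"))             # 0.75
print(is_unique(rows, "customer_id"))            # False
print(conforms(rows, "country", {"CA", "US"}))   # True
```

Note that uniqueness as defined in the list above is stronger than what `is_unique` checks: two distinct records can still describe the same real-world entity, which is why entity resolution is a separate discipline.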

Agility

The agility of a solution indicates how much time and effort it takes to produce an answer to a business question. The modern data stack, if executed well, shines here. Agility is often overlooked, yet it can make or break a solution. If the team needs weeks or even multiple days (as opposed to hours) to deliver answers, chances are the answers are no longer relevant. A good data solution needs to match the speed of the business.

Security and Compliance

Security is a non-negotiable requirement, and the solution must provide proper authentication and authorization. Modern governance sets up policies and procedures that make permissions as open as possible, giving all users the broadest possible (but still governed and secure) access, thus allowing the solution to deliver the most value.

Some of the criteria mentioned above do overlap. For example, low bug frequency may overlap with availability, and availability may coincide with data freshness. However, they are not subsumed by each other, and all contribute to the “goodness” of the solution.

Elements

In this series, we’ll address the following elements:

  • Snowflake Account
  • Extract & Load
  • Visualization
  • Data Modeling and Documentation
  • Automated Tests
  • Orchestration
  • DevEx
  • CI/CD
  • Data Governance
  • Metadata Management
  • Data Quality and Observability

Not all elements need to be present to make up a solution. One can get away with just a Snowflake account, some extract & load, and visualization. However, each element brings unique qualities that improve the solution against one or more of the criteria. Executed well, the result is a solution that covers all of them.

For each element, we'll talk about:

  • Business Drivers — what we should consider when choosing how to approach the element
  • Activities and Deliverables — what it takes to deliver this particular solution element
  • Approaches and Best Practices — a selection of known methodologies and techniques for delivering the element
  • Technology — tools that can help with the implementation

What’s Next

In the next post, we'll talk about setting up a Snowflake account (or accounts). Stay tuned.

Thank you for reading to the end. I hope you enjoyed the start of the series. Any feedback is welcome — please leave a comment!

I’m Milan Mosny, CTO at Infostrux Solutions. You can follow me here on the Infostrux Medium Blog or on LinkedIn. I write about Snowflake, data engineering, and architecture, and occasionally about other topics dear to my heart.
