Anatomy of a Modern Data Solution on Snowflake — Snowflake Account

Learn how to set up a Snowflake account for development and production.

Milan Mosny
Infostrux Engineering Blog


This blog is the second in the “Anatomy of a Modern Data Solution on Snowflake” series. It talks about the first element of a solution — Snowflake account setup.

The primary purpose of a Snowflake account setup is to create the main structures (accounts, databases, schemas, warehouses) that will hold and process your data, and to put proper security mechanisms in place so that by the time data starts appearing in Snowflake, it is already well protected.

Business and Technical Drivers

When deciding on approaches to data architecture and the overall structure of Snowflake objects — account(s), databases, schemas, etc. — there are a couple of requirements worth considering. Perhaps not all of them will influence the final setup, but it could be worthwhile to understand them beforehand so that the data architecture and the Snowflake object structure can be mapped to the requirements.

  • Required capabilities, use cases, and usage patterns — perhaps including the domains and data sources the solution will serve. For example, it may make sense to encapsulate a simple tactical solution that solves a single business problem, such as providing visibility into SaaS license usage, in a single database, while larger, more generic solutions, such as enterprise data warehouses or complex multitenant analytics, may benefit from more complex structures spanning multiple databases or even accounts.
  • Security and governance requirements — Snowflake makes applying RBAC to individual Snowflake objects very easy. Accounts, databases, and schemas represent a natural hierarchy that can simplify RBAC design and maintenance.
  • Team structure and product and project hierarchies — a good setup will make it easy to delineate lines of ownership.
  • Development process — setup may need to support different environments (dev, QA, staging, pre-prod, prod).

The cloud and region of the account(s) could be driven by:

  • Geographical distribution of users and dev team members
  • On-prem or cloud location of the upstream and downstream systems
  • Multitenancy requirements

The approach to Snowflake account setup and maintenance (automated vs. manual) can be driven by the team's current and future capabilities.

Activities and Deliverables

Data Architecture

Data architecture defines the main data stores and data flows. A nice thing about Snowflake is that it is very permissive and does not dictate a particular architectural approach. It supports data lake architectures, data warehouse architectures, data science-driven architectures, and combinations of all. (It also allows for all the usual data modeling approaches, from normalized entities through traditional Kimball star schemas to wide tables.)

As a simplified example of a layered data architecture, data may land in a raw layer, be cleansed in a clean layer, and then be normalized into subject-area models ready for consumption; this is the layout the object design below follows.

Account/Database/Schema Design

Once the architecture is selected, the next step is to map data stores to accounts, databases, schemas, and tables/views and set up a naming convention for the objects. Here's a blog that talks about this in more detail. At this point, we do have to keep the use cases in mind, but don’t have to have full conceptual/logical/physical models. The purpose of this step is to come up with an overall structure that will hold the actual tables/views later.

The example below shows a single account with databases and schemas corresponding to the data architecture described above, instantiated for two environments — DEV and PROD. A simpler database architecture can yield a more straightforward solution with a single database and a couple of schemas. More complex approaches may be needed to enable multitenant solutions with replication to multiple regions, automatic failover, and other strict non-functional requirements.

  • PROD_RAW / DEV_RAW: SALESFORCE and QUICKBOOKS schemas
  • PROD_CLEAN / DEV_CLEAN: SALESFORCE and QUICKBOOKS schemas
  • PROD_NORMALIZE / DEV_NORMALIZE: SALES, ACCOUNTING, ... schemas
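
As a rough illustration, the structure above could be created with plain SQL. The database, schema, and warehouse names below simply follow the example layout and naming convention and are placeholders rather than anything prescribed (DEV shown; the PROD_* databases follow the same pattern):

    -- One database per architecture layer and environment
    CREATE DATABASE IF NOT EXISTS DEV_RAW;
    CREATE SCHEMA IF NOT EXISTS DEV_RAW.SALESFORCE;
    CREATE SCHEMA IF NOT EXISTS DEV_RAW.QUICKBOOKS;

    CREATE DATABASE IF NOT EXISTS DEV_CLEAN;
    CREATE SCHEMA IF NOT EXISTS DEV_CLEAN.SALESFORCE;
    CREATE SCHEMA IF NOT EXISTS DEV_CLEAN.QUICKBOOKS;

    CREATE DATABASE IF NOT EXISTS DEV_NORMALIZE;
    CREATE SCHEMA IF NOT EXISTS DEV_NORMALIZE.SALES;
    CREATE SCHEMA IF NOT EXISTS DEV_NORMALIZE.ACCOUNTING;

    -- A warehouse per environment keeps DEV and PROD compute (and spend) separate
    CREATE WAREHOUSE IF NOT EXISTS DEV_WH
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE;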

Security Design

The second design task that's necessary before the data hits Snowflake is security design. At this point, we should define at least a basic RBAC structure following Snowflake's best practices to secure access to the data, and define the main object access roles and functional roles. Below is an example of what that may look like for a data architecture and object design similar to the ones above. Again, this could be much simpler for setups with a few users or more involved for complex organizations.
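
As a sketch of what such a design can translate to, the statements below follow the commonly recommended Snowflake pattern of per-object access roles composed into functional roles, which are in turn granted to SYSADMIN. The role names (PROD_NORMALIZE_RO, DEV_RAW_RW, ANALYST, DATA_ENGINEER) are illustrative, not prescribed:

    USE ROLE SECURITYADMIN;

    -- Access roles: one per database (or schema) and access level
    CREATE ROLE IF NOT EXISTS PROD_NORMALIZE_RO;  -- read-only on PROD_NORMALIZE
    CREATE ROLE IF NOT EXISTS DEV_RAW_RW;         -- read-write on DEV_RAW

    -- Functional roles: mapped to job functions and composed of access roles
    CREATE ROLE IF NOT EXISTS ANALYST;
    CREATE ROLE IF NOT EXISTS DATA_ENGINEER;

    GRANT ROLE PROD_NORMALIZE_RO TO ROLE ANALYST;
    GRANT ROLE DEV_RAW_RW TO ROLE DATA_ENGINEER;

    -- Roll functional roles up to SYSADMIN so administrators retain visibility
    GRANT ROLE ANALYST TO ROLE SYSADMIN;
    GRANT ROLE DATA_ENGINEER TO ROLE SYSADMIN;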

Snowflake Account(s) Creation

Creating an account is, of course, the first implementation step. There are a couple of choices that need to be made based on the requirements above:

  • Snowflake edition — most clients opt for Enterprise, but Standard, Business Critical, and Virtual Private Snowflake are also available.
  • Cloud and region to host it — your choice of AWS, Azure, or GCP, with many regions to choose from.
  • Payment mechanism — some cloud providers, for example, AWS, let users create a Snowflake account paid through the cloud provider agreement, which is beneficial if you need to consume committed spend from a previously signed contract. Another option is to sign a capacity deal directly with Snowflake, which usually comes with a discount over a pay-as-you-go setup.
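
If your organization already has an organization-level Snowflake account, additional accounts can also be created in SQL with the ORGADMIN role. The account name, admin details, edition, and region below are placeholders that follow from the choices above:

    USE ROLE ORGADMIN;

    CREATE ACCOUNT analytics_prod
      ADMIN_NAME = first_admin
      ADMIN_PASSWORD = '<choose-a-strong-password>'
      EMAIL = 'admin@example.com'
      MUST_CHANGE_PASSWORD = TRUE
      EDITION = ENTERPRISE
      REGION = AWS_US_WEST_2;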

Account Security and Resource Monitoring

At this point, we are ready to set up:

  • Network Access Policies — whitelisting IP addresses so that the Snowflake account is accessible only from your network
  • SSO authentication — this enables centralized user management
  • Resource Monitors — an account-wide setup that puts guardrails on the spending to prevent accidental over-usage.
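
A minimal sketch of these guardrails in SQL is below. The IP range, credit quota, and object names are placeholders to adapt to your environment, and SSO configuration details depend on your identity provider:

    USE ROLE ACCOUNTADMIN;

    -- Network policy: only allow connections from the corporate IP range
    CREATE NETWORK POLICY corp_only ALLOWED_IP_LIST = ('203.0.113.0/24');
    ALTER ACCOUNT SET NETWORK_POLICY = corp_only;

    -- SSO is typically configured as a SAML2 security integration
    -- (CREATE SECURITY INTEGRATION ... TYPE = SAML2 ...) with values from your IdP

    -- Resource monitor: notify at 80% of the monthly credit quota, suspend at 100%
    CREATE RESOURCE MONITOR monthly_guardrail WITH
      CREDIT_QUOTA = 100
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS ON 80 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND;
    ALTER ACCOUNT SET RESOURCE_MONITOR = monthly_guardrail;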

Snowflake Objects Creation

With the account(s) in place, we can start creating the main objects, such as roles, databases, and schemas, and grant appropriate permissions based on our object and security designs.
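
Continuing with the illustrative names from the earlier sketches, the privilege grants that tie the access roles to the objects could look roughly like this (future grants keep newly created objects covered automatically):

    USE ROLE SECURITYADMIN;

    -- Read-only access to the normalized PROD layer
    GRANT USAGE ON DATABASE PROD_NORMALIZE TO ROLE PROD_NORMALIZE_RO;
    GRANT USAGE ON ALL SCHEMAS IN DATABASE PROD_NORMALIZE TO ROLE PROD_NORMALIZE_RO;
    GRANT SELECT ON ALL TABLES IN DATABASE PROD_NORMALIZE TO ROLE PROD_NORMALIZE_RO;
    GRANT SELECT ON FUTURE TABLES IN DATABASE PROD_NORMALIZE TO ROLE PROD_NORMALIZE_RO;

    -- Read-write access to the raw DEV layer
    GRANT USAGE ON DATABASE DEV_RAW TO ROLE DEV_RAW_RW;
    GRANT USAGE, CREATE TABLE ON ALL SCHEMAS IN DATABASE DEV_RAW TO ROLE DEV_RAW_RW;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN DATABASE DEV_RAW TO ROLE DEV_RAW_RW;

    -- Let the functional roles use the environment warehouse
    GRANT USAGE ON WAREHOUSE DEV_WH TO ROLE DATA_ENGINEER;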

Approaches and Best Practices

There are many standard approaches to data architecture. The latest version of DAMA DMBoK2 is an excellent place to start for an overview, but less formal approaches can be taken (and often are). Snowflake's flexibility gives us quite a bit of leeway when choosing an architecture and allows us to pick the most straightforward approach for the given use case or workload.

Snowflake's security best practices can be found in Snowflake's community pages or documentation.

There are two main approaches to Snowflake object creation and lifecycle:

  • Manual — using UI or SQL to create and maintain objects manually. This is acceptable only for very simple setups, as it tends to get out of hand quickly, leading to mistakes that are difficult to find and even more difficult to fix.
  • Automation — using code or dedicated software to maintain the object structures. Code is very powerful but requires a knowledgeable team or a team willing to learn. Combining code with proper version control and CI/CD enables good governance. Dedicated software often relies on coding skills to represent schema changes but provides a framework for their application, simplifying CI/CD and governance processes.

Technology

This is not a complete list, nor does it cover everything Infostrux can work with. Instead, it is a selection of tools I am somewhat familiar with.

  • Diagramming tools such as Visio or Lucidchart and document editors such as MS Word or Google Docs are often sufficient for data architecture, object design, and security design.
  • Terraform and/or Pulumi for managing Snowflake objects is a good choice for the coding approach. The advantage is that these systems can take care not only of Snowflake but also of the rest of the solution in the cloud, including cloud storage, infrastructure, extractors, and their integration with Snowflake. Because they span both the cloud and Snowflake, they allow for entirely automatic creation and management of objects like Snowflake integrations and pipes.
  • SQL scripting — maintaining the state of the high-level objects such as databases, schemas, roles, and warehouses through simple SQL scripts is a possibility. However, proper versioning and application of the scripts is not easy. It’s a reasonable solution for simpler setups of high-level objects that do not undergo many changes. It is assumed that the individual tables and views are managed differently. (dbt would be one example that takes care of the lower-level objects gracefully.)
  • schemachange — an open-source, Python-based tool maintained under the Snowflake Labs organization. It provides a mechanism to track which version of objects is already deployed in which environment. The user supplies scripts that move objects to the next version (e.g., adding/renaming/removing columns of tables) or rebuild them regardless of which version they are at (e.g., stored procedures); a small example follows this list. It's rather simplistic — it cannot generate the scripts automatically, and it doesn't provide support to revert to the previous version. We had some trouble with reliability, but perhaps it has been fixed since then.
  • Flyway and Liquibase — traditional, mature, and stable tools for database state management. However, they can't take care of the cloud portion of the solution.
  • DevOps suites, such as dataops.live — a fully managed choice, often opinionated and somewhat less flexible. It can be expensive, but for larger teams building more complex solutions that fit its use cases, it can be an excellent choice, especially from the productivity point of view.
  • Security management suites such as Immuta — almost essential for complex security environments. They manage intricate RBAC structures and security policies that would be cumbersome to maintain in code.
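
To illustrate the SQL-scripting and schemachange styles mentioned above: with schemachange, each change is a plain SQL file whose name encodes its version (the tool's convention is a V prefix and a double underscore), and the tool records which versions have already been applied to each environment. A hypothetical change script, reusing the illustrative names from earlier sketches, could be as simple as:

    -- V1.0.1__add_quickbooks_schema.sql (hypothetical versioned change script)
    CREATE SCHEMA IF NOT EXISTS DEV_RAW.QUICKBOOKS;
    GRANT USAGE ON SCHEMA DEV_RAW.QUICKBOOKS TO ROLE DEV_RAW_RW;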

We often suggest a Terraform/GitHub/GitHub Actions combination, as it provides flexibility, expressive power, versioning, auditability, and reasonable governance.

Conclusion

This blog shows what needs to happen before the data starts flowing into Snowflake. Ideally, we set up the data stores (which are rather difficult to change once downstream systems start using them) and basic security so that when the data hits the stores, we have a good handle on who can see or change what.

Thank you for reading to the end. I hope you enjoyed the second blog of the series. Any feedback is welcome — please leave a comment!

I’m Milan Mosny, CTO at Infostrux Solutions. You can follow me here on Infostrux Medium Blog or LinkedIn. I write about Snowflake, data engineering, and architecture and occasionally about other topics dear to my heart.
