Anatomy of a Modern Data Solution on Snowflake — Extract and Load

Learn about extract and load (EL) for Snowflake and what to consider when designing a solution.

Milan Mosny
Infostrux Engineering Blog

--


This blog is the third in the "Anatomy of a Modern Data Solution on Snowflake" series. It talks about the second element of a solution — how to extract and load data into Snowflake in the context of a Modern Data Stack.

The purpose of this exercise is to extract data from the source systems and load it into Snowflake with no, or only truly minimal, transformations. We often call this step ingest. The approach allows us to follow a couple of important principles (if we are so inclined):

  • Separation of concerns — this is a well-known software development (and systems design) principle that allows us to create simpler solutions, which means less development and maintenance effort, leading to a higher speed of delivery.
  • Functional data warehousing — the term is perhaps not that well known, but the approach is often followed. The idea is that all transformations downstream of the load can be recomputed at any time, yielding substantially the same result. This is possible if the load preserves historical changes at the data source and/or the history of the loads themselves (a minimal sketch follows this list). A cloud warehouse technology that separates storage and compute (such as Snowflake) and can use as much compute as needed to run the recomputation helps. The advantage is that we don't have to maintain the state of the warehouse or write complex update and backfill procedures when we discover a bug or a design flaw, or when we introduce changes that expand the coverage of downstream structures.
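
To make this concrete, here is a minimal sketch of an append-only landing table that preserves the history of loads so downstream models can always be recomputed. The database, schema, table and column names are hypothetical:

```sql
-- Hypothetical append-only landing table: every load is kept,
-- so downstream transformations can be recomputed at any time.
CREATE TABLE IF NOT EXISTS raw.crm.customers_loads (
    record        VARIANT,                                    -- source record as-is
    _source_file  STRING,                                     -- file the record came from
    _loaded_at    TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()   -- load timestamp
);

-- Downstream models derive the "current state" from the history
-- instead of updating state in place.
CREATE OR REPLACE VIEW raw.crm.customers_latest AS
SELECT record
FROM raw.crm.customers_loads
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY record:id ORDER BY _loaded_at DESC
) = 1;
```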

Business and Technical Drivers

When deciding on approaches and technologies, consider the following:

  • Data source type support — does the EL tech you are considering support the data sources you wish to work with now and in the future? Current technologies such as Fivetran, Airbyte or Rivery support hundreds of different data sources, but they may still not support yours. It is worthwhile to look at the roadmaps for these technologies or speak directly to the providers. The space is fast-moving, and new data sources are added weekly, if not daily. Can the EL tech support custom sources, and how much work is it to add one?
  • Effort to connect a data source — how much effort does it take to fully set up a connector, including security, logging, auditing, monitoring and other non-functional constraints?
  • CDC or history of changes — a technology that captures change data or full history without us having to put in extra effort will impact how fast we can go and how easy it is to maintain the solution.
  • Latency and frequency of updates — latency measures the time from when a change occurs in the source system to the time that change is visible in the data solution. Many solutions do not require latency in minutes or seconds, but some do. What latency does the technology allow, and is it sufficient? Can it run frequently enough and at specified times?
  • Security and governance — different commercial and open-source providers have different standards for handling data in transit and at rest (as they often stage the data on their servers), how they connect to the source systems and how they handle connection secrets. Can the tech support our governance policies?
  • Multitenancy — if this is a requirement for our solution, can the tech handle this? Multitenancy often requires the ability to parameterize and automate.
  • The geographical location of the users, service and data — some privacy and compliance requirements may pose strict constraints on this.
  • SLAs and Reliability — what are the historical SLAs of the commercial providers, and/or how well is the open-source option built? What happens if the data source schema changes? Does the tech support retries and/or self-healing?
  • Support/Maintenance — What happens if there is an issue that is not resolved automatically? Do the provider’s support policies fit our SLA requirements?
  • Disaster recovery — what happens if the tech goes down? What does it take to redo the setup? Does the solution support automation, even for setting up secrets, or does it support other means of recovery?
  • Development and deployment process — is there a need to support different environments, perhaps dedicated to separate branches or developers? Does the solution support our development processes (or vice versa)? Does the technology support parameterization of various aspects, including connections and secrets, and automation via APIs or machine-readable formats?
  • Total cost of ownership (TCO) — development, maintenance and runtime costs all combined. Each of these can be substantial.

Activities and Deliverables

Architecture and technology selection is often half of the battle. The architecture must consider all the aspects mentioned in the business and technical drivers above. It could be as simple as drawing a box with the Fivetran logo, arrows coming in from different data sources, and an arrow going to Snowflake. Or it can be a complex, Terraform-driven, CI/CD-managed, self-healing, multitenant, multi-region cloud solution with strict governance and security.

Implementation could be as simple as filling out a form on Fivetran or Airbyte or writing complex code and configuration scripts.

Once the implementation is in place, the data starts flowing and is available for analysis or further processing.

Approaches and Best Practices

Snowflake

Snowflake provides many mechanisms to load the data from the cloud storage:

  • COPY INTO — a true batch processing approach (a sketch of COPY INTO and Snowpipe follows this list).
  • External Tables — a perhaps underused approach that, for certain formats and use cases, can be very efficient. External tables allow for partitioning and can be set up to honor the partitioning scheme when queried with proper WHERE predicates. A big advantage of external tables is zero load latency, since the data is queryable as soon as it lands in cloud storage and there is no separate load step.
  • Iceberg Tables — these can use external or internal metadata catalogs and can provide significant performance improvement over external tables.
  • Snowpipe — automated micro-batch processing triggered by file upload.
  • Snowpipe Streaming — a row-based streaming approach that writes rows directly to tables without staging files, allowing for latency measured in seconds.
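
As an illustration of the first and fourth options, the sketch below stages JSON files from cloud storage, loads them in batch with COPY INTO, then wraps the same COPY in a Snowpipe for continuous ingestion. This is a minimal sketch: the bucket URL, storage integration and object names are placeholders, and the target is the hypothetical landing table from the earlier sketch.

```sql
-- Hypothetical external stage over cloud storage (S3 in this example).
CREATE STAGE IF NOT EXISTS raw.crm.customers_stage
  URL = 's3://example-bucket/crm/customers/'
  STORAGE_INTEGRATION = example_s3_integration
  FILE_FORMAT = (TYPE = JSON);

-- Batch load: COPY INTO picks up any files from the stage not yet loaded.
-- _loaded_at is omitted and falls back to its column default.
COPY INTO raw.crm.customers_loads (record, _source_file)
FROM (
  SELECT $1, METADATA$FILENAME
  FROM @raw.crm.customers_stage
);

-- Continuous load: the same COPY wrapped in a Snowpipe, triggered by
-- cloud storage event notifications (AUTO_INGEST requires that setup).
CREATE PIPE IF NOT EXISTS raw.crm.customers_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO raw.crm.customers_loads (record, _source_file)
FROM (
  SELECT $1, METADATA$FILENAME
  FROM @raw.crm.customers_stage
);
```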

Also of note is Snowflake's ability to load data from sources other than cloud storage:

  • External functions — allow the user to invoke AWS Lambda functions, Azure Functions or other remote services (behind an API gateway) as SQL functions. The implementation can reach out directly to the data source and extract the data. It is a rather complex approach, but possible when support for highly custom data sources is required.
  • External network access — in public preview at the time of writing, external network access allows Snowpark UDFs written in Python, Java or other languages to reach out to external data sources and APIs to extract the data (a sketch follows this list).
  • Snowpark Container Services — allow containerized workloads to run inside Snowflake, performing almost any function, including extract and load. Snowpark Container Services were announced at Summit 2023 and, at the time of writing, are in private preview only.
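
A hedged sketch of the external network access option: a network rule and an external access integration open egress to a placeholder API host, and a Snowpark Python UDF calls it and returns the payload as a VARIANT. All names, the endpoint and the absence of authentication are illustrative only.

```sql
-- Placeholder host: a real setup would also create SECRET objects for auth.
CREATE OR REPLACE NETWORK RULE example_api_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('api.example.com');

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION example_api_integration
  ALLOWED_NETWORK_RULES = (example_api_rule)
  ENABLED = TRUE;

-- Snowpark Python UDF that pulls data from the external API.
CREATE OR REPLACE FUNCTION fetch_example_customers()
  RETURNS VARIANT
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.10'
  PACKAGES = ('requests')
  EXTERNAL_ACCESS_INTEGRATIONS = (example_api_integration)
  HANDLER = 'fetch'
AS
$$
import requests

def fetch():
    # Placeholder endpoint; real usage would add auth, paging and retries.
    response = requests.get('https://api.example.com/customers')
    response.raise_for_status()
    return response.json()
$$;

-- The result can then be landed as-is, for example:
-- INSERT INTO raw.crm.customers_loads (record) SELECT fetch_example_customers();
```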

Third-party software and services

The usual suspects in the Modern Data Stack space are Fivetran, Airbyte (the hosted service), Matillion Data Loader / ETL, Boomi, Stitch, CData, and Rivery. Still, new players appear all the time; there are dozens of providers.

Full-featured traditional or newer ETL suites can also help and often provide Snowflake-specific destinations.

On the open-source side, Airbyte and Meltano (built on Singer, singer.io) are perhaps the best known, but other solutions exist.

Scripting / Custom Development

As a last resort, if there is nothing readily available that fits the requirements, it's always possible to do a bit of custom development and host (and orchestrate) the results using any of the myriad methods available. For example, on AWS, we can use EC2, ECS, EKS, Lambda and many other services.
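
For instance, a custom extractor running on any of those services could write its output to local files and hand them to Snowflake using the standard PUT and COPY INTO commands (via SnowSQL or any Snowflake driver). A minimal sketch, reusing the hypothetical landing table from above; the stage name and file path are placeholders:

```sql
-- Named internal stage that receives files from the custom extractor.
CREATE STAGE IF NOT EXISTS raw.crm.custom_extracts;

-- Run by the custom job after each extract (file path is a placeholder).
PUT file:///tmp/customers_2024-01-01.json @raw.crm.custom_extracts
  AUTO_COMPRESS = TRUE;

COPY INTO raw.crm.customers_loads (record, _source_file)
FROM (
  SELECT $1, METADATA$FILENAME
  FROM @raw.crm.custom_extracts
)
FILE_FORMAT = (TYPE = JSON);
```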

Best Practices

Many of the best practices follow directly from the business and technology drivers and amount to simply ensuring the requirements are satisfied.

Separation of concerns and functional data warehousing were mentioned above.

Having said all of that, perhaps the most important best practice is not to lose data, especially historical data. It is often better to load as much data as possible than to try to guess what may or may not become important in the future. There are performance-driven exceptions, of course, but Snowflake pushes the point at which the performance and cost benefits of loading only a subset of the data outweigh the flexibility, agility and development-effort benefits of loading everything far into the realm of hundreds of terabytes or even petabytes. For many use cases, simply loading everything is the way to go.

Conclusion

The previous blog in this series showed what it takes to set up a Snowflake account and prepare it for data loading.

This blog shows what needs to happen to get data flowing into Snowflake. We listed the considerations to take into account before selecting the right approach and touched on the actual ways to implement the task. Approaches range from simply filling out a form in Fivetran to complex custom builds. The best practices often follow from the separation of concerns and functional data warehousing principles. Snowflake lets us load (and keep) as much (historical) data as possible.

Thank you for reading to the end. I hope you enjoyed the blog. Any feedback is welcome — please leave a comment!

I’m Milan Mosny, CTO at Infostrux Solutions. You can follow me here on Infostrux Medium Blog or LinkedIn. I write about Snowflake, data engineering and architecture and occasionally about other topics dear to my heart.
