A Machine Learning Feature Store implementation in Snowflake
A comprehensive overview of Machine Learning Feature Stores and the key differences between Online and Offline Feature Stores

This post provides a comprehensive overview of Feature Stores, outlining the key differences between Online and Offline Feature Stores. Additionally, we’ll explore how Feature Stores can be integrated with Snowflake to enable robust machine-learning workflows. You’ll gain practical insights into various options, including AWS SageMaker Feature Store, FEAST, and Dataiku, to choose the best approach for your needs.
Please note that at Infostrux, we have implemented an in-house solution for Model Registry and Feature Store that is exclusively built on Snowflake and does not require any additional third-party environments. We will describe this solution in a future blog post.
What is Snowflake?
Snowflake is a cloud-based data warehouse that can store and analyze your data records in one place. It can automatically scale up/down its compute resources to load, integrate, and analyze data.
What are features used for?
Features are the attributes or properties a model uses during training and inference to make predictions. For example, to predict house prices, features might include the area of the house, location, number of rooms, type of house, and age of the house. The model uses these attributes to predict a house price. To train the model, you might take historical values (from the last few months or years) from your database or a public house sales database. For example, given the features of a 2-room, 716 sqft apartment in downtown Vancouver built in 2009, the model might predict a price of $785,000.
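To make this concrete, here is a minimal scikit-learn sketch; the feature values and prices below are made up purely for illustration:

```python
# A minimal, illustrative sketch: the same feature values drive both
# training and prediction. All numbers here are made up.
from sklearn.linear_model import LinearRegression

# hypothetical historical sales: [sqft, number_of_rooms, age_in_years]
X_train = [[716, 2, 14], [1050, 3, 5], [880, 2, 30], [1300, 4, 10]]
y_train = [785_000, 1_150_000, 690_000, 1_420_000]  # made-up sale prices

model = LinearRegression().fit(X_train, y_train)
# predict a price for a 2-room, 716 sqft, 14-year-old unit
print(model.predict([[716, 2, 14]]))
```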
What is feature engineering?
Feature engineering is a crucial step in preparing data for machine learning models. It involves investigating the various attributes of the data, selecting relevant features, transforming them, handling missing values, and creating new features by combining existing ones.
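As a generic illustration of these steps (the column names and values below are hypothetical, and this is not our layered approach):

```python
# An illustrative pandas sketch of common feature-engineering steps.
import pandas as pd

raw = pd.DataFrame({
    "sqft": [716, 1050, None],
    "year_built": [2009, 2018, 1993],
    "num_bedrooms": [2, 3, 2],
    "num_baths": [1, 2, 1],
})

features = raw.copy()
features["sqft"] = features["sqft"].fillna(features["sqft"].median())       # handle missing values
features["age_years"] = 2023 - features["year_built"]                        # derive a new feature
features["rooms_total"] = features["num_bedrooms"] + features["num_baths"]   # combine existing features
```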
At Infostrux, we typically use Snowflake, DBT, and a Layered model to create features. However, in this blog post, I won’t be going into detail on our specific approach to feature engineering.
Why do I need to manage features?
Over time, you may enhance your model, for example by splitting the number of rooms into the number of baths and the number of bedrooms. The predictions should now be more accurate, but you must update the features and the dataset used for training and prediction. As the model evolves, you will run many experiments, some successful and some that regress the results. In addition, you may have other models, such as one predicting house rental prices, which will have another set of features and historical values. To manage your models, model versions, and model history, you would use a model registry. In the same way, you would like to manage your features so you know which features relate to which model version and which features were used for a specific period. This is useful for avoiding errors when running a model, sharing features between models, rolling back models or features, and debugging historical predictions. Also, for governance reasons, a customer may ask you to explain a historical prediction, and you should be able to tell which model and features were used.
Offline Feature Store
A feature store is a centralized repository for storing and managing features in machine learning (ML) models. It can be considered a database designed to store and manage features, related metadata, and historical information. This can include raw features, derived features, engineered features, and information about how features were generated and what models they were used in.
Offline feature stores are commonly used for training machine learning models and are usually designed to handle large volumes of data. They may also provide features such as data versioning, data lineage, and data quality control to ensure that the data used in training is accurate, up-to-date, and reliable.
Snowflake is an excellent option for a Machine Learning Offline Feature Store: it is designed to store and query exactly the kind of large historical datasets that model training requires.
Online Feature Store
For the Online Feature Store, we need near-real-time single-row queries, which an analytics data warehouse is typically not suited for. We will show a few 3rd party Online Feature Store solutions later. Snowflake's Unistore (currently in private preview) might be a good option: Hybrid Tables, the new Snowflake table type that powers Unistore, provide fast single-row operations, which means teams can build transactional business apps directly on Snowflake.
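As a rough sketch of the access pattern an online store must serve, here is a single-row point lookup via Snowpark; the connection parameters and the HOUSE_FEATURES_ONLINE table are placeholders:

```python
# A minimal Snowpark sketch of the kind of single-row point lookup an
# online feature store must serve quickly. All names are placeholders.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# On a Hybrid Table with a primary key on HOUSE_ID, this point lookup
# can be served with transactional, single-row performance.
row = session.sql(
    "SELECT * FROM HOUSE_FEATURES_ONLINE WHERE HOUSE_ID = 42"
).collect()
```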
Feature Store Products
There are a few feature store products on the market, and we will briefly review them. There is no single right option. At Infostrux, we learn about the customer's environment and requirements and find the best solution for their use cases. We have also implemented a Feature Store solution based only on Snowflake, without the need for an AWS account or a 3rd party tool. In a future blog post, I will give more details about our Snowflake solution.
FEAST for Snowflake’s Feature Store management
Feast (Feature Store) is an open-source tool for managing machine learning features. It utilizes your existing storage environments, such as Snowflake, GCP, and AWS, to create both offline and online feature stores. For instance, if you already use Snowflake and AWS, FEAST can leverage Snowflake for offline storage and AWS resources (DynamoDB+API Gateway) for online storage and serving.
If you are familiar with Python and Snowpark, installing and using the FEAST API should be straightforward. While FEAST is a recommended solution for feature stores within Snowflake, one disadvantage is that it still needs to be integrated into the broader machine-learning pipeline: it does not provide feature engineering, model registry, or model monitoring capabilities.
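To give a feel for the API, here is a hedged sketch of a Feast feature view backed by a Snowflake table; the entity, table, and column names are hypothetical, and details vary across Feast versions:

```python
# A hedged sketch of a Feast feature view over a Snowflake table.
# Names are hypothetical; API details vary across Feast versions.
from datetime import timedelta
from feast import Entity, FeatureView, Field
from feast.infra.offline_stores.snowflake_source import SnowflakeSource
from feast.types import Float32, Int64

house = Entity(name="house", join_keys=["HOUSE_ID"])

house_source = SnowflakeSource(
    database="ANALYTICS",
    schema="FEATURES",
    table="HOUSE_FEATURES",
    timestamp_field="EVENT_TIMESTAMP",
)

house_features = FeatureView(
    name="house_features",
    entities=[house],
    ttl=timedelta(days=365),
    schema=[
        Field(name="SQFT", dtype=Int64),
        Field(name="NUM_BEDROOMS", dtype=Int64),
        Field(name="NUM_BATHS", dtype=Int64),
        Field(name="AGE_YEARS", dtype=Float32),
    ],
    source=house_source,
)
```

After registering these definitions with `feast apply`, training data can be retrieved with `get_historical_features`, and `feast materialize` loads the online store.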
For more info about integrating FEAST with Snowflake, see the Feast documentation for the Snowflake offline store.
Dataiku for Snowflake's Feature Store management
Dataiku is a platform for AI.
It includes data preparation, visualization, machine learning, DataOps, MLOps, analytics apps, collaboration, governance, explainability, and security.
Dataiku helps you store, serve, and catalogue your feature values.
Using the Dataiku UI and API, you can create, view, monitor, and use your ML Feature Groups.
If you have chosen Dataiku as your AI platform, you can use Snowpark within your Dataiku notebook to execute SQL queries against Snowflake.
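For example, inside a Dataiku notebook you might read a Snowflake-backed dataset or push a query down to Snowflake; the dataset and connection names below are hypothetical:

```python
# A hedged sketch inside a Dataiku notebook. "house_features" is a
# hypothetical Dataiku dataset backed by a Snowflake connection
# named "snowflake_conn".
import dataiku
from dataiku import SQLExecutor2

# read a Snowflake-backed dataset as a pandas DataFrame
features_df = dataiku.Dataset("house_features").get_dataframe()

# or push a SQL query down to Snowflake
executor = SQLExecutor2(connection="snowflake_conn")
recent = executor.query_to_df("SELECT * FROM HOUSE_FEATURES WHERE AGE_YEARS < 10")
```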
More info: https://blog.dataiku.com/set-up-feature-store-with-dataiku
https://knowledge.dataiku.com/latest/mlops-o16n/feature-store/tutorial-building-feature-store.html
Amazon SageMaker Feature Store
Amazon SageMaker Feature Store is a repository designed for storing, updating, retrieving, and sharing machine learning (ML) features. Key capabilities include ingesting data from many sources, searching for and discovering features, ensuring feature consistency between training and inference, standardizing feature definitions, and integrating with Amazon SageMaker Pipelines.
The Feature Store supports both online and offline modes. In online mode, features are read with low latency and used for high throughput predictions. In offline mode, large data streams are fed to an offline store, which can be used for training and batch inference.
To use the Feature Store, you can ingest data from various sources, such as Amazon Kinesis Data Firehose or Amazon SageMaker Data Wrangler. The Feature Store tags and indexes features for easy discovery and browsing, allowing teams to determine whether a feature is helpful for a particular model.
To ensure feature consistency, the Feature Store makes the same features available for training and inference. During training, models use a complete data set, while inference needs to happen in milliseconds and usually requires a subset of the data. SageMaker Feature Store allows models to access the same features for both training runs (usually done offline and in batches) and for real-time inference.
The Feature Store also standardizes feature definitions, making it easier to reuse features across different applications. It integrates with Amazon SageMaker Pipelines so that automated machine learning workflows can create, search for, discover, and reuse features.
The AWS Feature Store API supports:
- online and offline Feature Stores
- ingestion (batch and streaming)
- Feature Group operations: add, describe, tag, and share by setting permissions
- searching by feature group name, description, record identifier name, creation date, and tags
- joins across different FeatureGroups for real-time inference
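For real-time inference, a single record is fetched from the online store through the runtime client; this is a hedged sketch with a hypothetical feature group name:

```python
# A hedged sketch of a low-latency online read from SageMaker Feature Store.
# The feature group name and record identifier are hypothetical.
import boto3

runtime = boto3.client("sagemaker-featurestore-runtime")
record = runtime.get_record(
    FeatureGroupName="house-features",
    RecordIdentifierValueAsString="42",  # the house being scored
)
# flatten the response into a {feature_name: value} dict
features = {f["FeatureName"]: f["ValueAsString"] for f in record["Record"]}
```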
How do we integrate Snowflake and AWS SageMaker Feature Store?
We use Snowflake to store our data. From there, we have two options for feature engineering:
- We can use AWS Data Wrangler to perform feature engineering and create feature stores. This involves storing offline features on S3 and online features on low-latency storage, such as DynamoDB.
- Alternatively, we can perform feature engineering on Snowflake with the help of DBT and use the final dataset as the source for the AWS Feature Store. We typically recommend this approach to customers who are already using AWS SageMaker.
Either way, the result is a set of carefully engineered features that can be used to train machine learning models and make accurate predictions.
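Here is a hedged sketch of the second option: reading the engineered dataset from Snowflake and ingesting it into SageMaker Feature Store. Connection parameters, table and column names, the S3 bucket, and the IAM role are placeholders, and DataFrame dtypes may need casting before ingestion:

```python
# A hedged sketch: read engineered features from Snowflake, then register
# and ingest them into SageMaker Feature Store. All names are placeholders.
import time
import pandas as pd
import sagemaker
import snowflake.connector
from sagemaker.feature_store.feature_group import FeatureGroup

# read the final, engineered dataset from Snowflake
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="ANALYTICS", schema="FEATURES",
)
df = pd.read_sql("SELECT * FROM HOUSE_FEATURES", conn)
df["EVENT_TIME"] = time.time()  # SageMaker requires an event-time feature

fg = FeatureGroup(name="house-features", sagemaker_session=sagemaker.Session())
fg.load_feature_definitions(data_frame=df)  # infer definitions from the DataFrame
fg.create(
    s3_uri="s3://<offline-store-bucket>/house-features",
    record_identifier_name="HOUSE_ID",
    event_time_feature_name="EVENT_TIME",
    role_arn="<execution-role-arn>",
    enable_online_store=True,  # populate both the online and offline stores
)
fg.ingest(data_frame=df, max_workers=4, wait=True)
```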
The main advantage of using AWS SageMaker Feature Store is that it integrates with the entire AWS ML toolset and pipeline.
Datalake and Snowflake
Some of our customers have both a datalake and a data warehouse (Snowflake): all data sources are stored in the datalake (such as AWS S3), and only the data needed for analytics is pushed into Snowflake.
In such cases, some customers may prefer to run their ML models on top of the datalake, making a datalake Feature Store solution a good fit. One option is the AWS SageMaker Feature Store, while another example is Databricks.
Databricks supports both data storage and MLOps functionality, including the creation of feature tables for model training and inference. The Feature Store UI in Databricks provides discoverability, allowing you to browse and search for existing features. The lineage functionality records the data sources used to create a feature table and provides access to the models, notebooks, jobs, and endpoints that use each feature, ensuring consistency between inference and training. Databricks also integrates with model scoring and serving, making model deployment and updates easier. Finally, the Feature Store supports point-in-time lookups for time series and event-based use cases that require point-in-time correctness.
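As an illustration, here is a hedged sketch using the Databricks Feature Store client; it runs only on Databricks, the table and column names are hypothetical, and newer workspaces use the databricks.feature_engineering API instead:

```python
# A hedged sketch of creating a feature table with the Databricks
# Feature Store client. Table and column names are hypothetical.
from databricks.feature_store import FeatureStoreClient

# "spark" is predefined in Databricks notebooks
features_df = spark.table("analytics.house_features")

fs = FeatureStoreClient()
fs.create_table(
    name="feature_db.house_features",
    primary_keys=["house_id"],
    timestamp_keys=["event_time"],  # enables point-in-time lookups
    df=features_df,                 # a Spark DataFrame of engineered features
    description="House features for price-prediction models",
)
```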
Conclusion
Effective management of features is crucial for the success of the machine learning pipeline. Without proper automation, MLOps, and a clear understanding of the history of your models and features, errors will likely occur. When deciding how to manage features for your use cases, there are a few things to consider.
Firstly, determine whether online and offline feature stores are necessary for your needs. Secondly, consider the current tools and environments in which the development and analytics groups operate. Lastly, evaluate how the feature store will integrate with your existing MLOps and ML pipeline.
It is important to remember that the field of ML is advancing rapidly, and there may be a new service that could be a better fit for your use case. Therefore, staying up-to-date with the latest developments in the domain is crucial for effective feature management.
Follow our open-source efforts through our GitHub.