Snowflake LLM Use-Case-Based Learning — A Beginner's Course — Part 1

Learn how to implement LLM projects with Snowflake

Mehdi Sidi Boumedine
Infostrux Engineering Blog

--

Introduction

Large Language Models (LLMs) is a term that has recently gained popularity in the tech field. An LLM is a neural network that learns from large amounts of text data and generates natural-language outputs. LLMs have been used for text summarization, translation, text generation, and conversational agents.

The recent breakthrough is undoubtedly due to the impressive performance of OpenAI's ChatGPT. Since then, companies have started to believe in the incredible potential of AI-powered language processing and generation.

However, despite the hype around LLMs, most people are still confused about what it takes to implement simple use cases such as sentiment analysis, text classification, or the more attractive automatic text generation. And doing it within the Snowflake ecosystem adds a bit of complexity to the equation.

At Infostrux, we strive to build Machine Learning and MLOps capabilities on top of Snowflake's cloud data platform. If you want to know more about Infostrux's work and our services, please check out our website and look at our other blog posts here.

In this tutorial, we will quickly go through some LLM concepts; then we'll walk through the steps required to build Snowflake user-defined functions that we can use for LLM inference.

What will we do?

We will

  1. Import and serialize Hugging Face Machine Learning models
  2. Register them in a Snowflake account as functions used for inference
  3. Execute the inference on a sample of data

Before that, we will prepare our Snowflake environment:

  1. Create a stage
  2. Populate tables with sample data

What will the result look like?

We will have something like

SELECT generate_text(field)
FROM my_table;

The result will be a list of records holding the output of the inference function (generate_text in this example).

What will we be using?

Hugging Face and the Transformers library

To train a Large Language Model with the power of ChatGPT or BERT, one must collect, prepare, and process a massive amount of text data, and provision the required processing resources and time. That means huge effort and cost for a single training pass and for the subsequent ones (and a carbon footprint, if you will).

Fortunately, one can re-use a pre-trained model. Huggingface.co 🤗 is a huge repository of pre-trained Machine Learning models and related services.

We're interested here in the Transformers library from Hugging Face. Transformers provides access to a wide range of Large Language Model capabilities.

By the way, I strongly recommend reading the Hugging Face explainer about Models, Checkpoints, and Architectures if you're unfamiliar with these concepts. In this tutorial, we'll use a single term, model, but to understand why sentiment analysis can be done with either distilbert-base-uncased-finetuned-sst-2-english or simply with BERT, please check the following link: https://huggingface.co/learn/nlp-course/chapter1/4#architecture-vs-checkpoints

How simple is it to use transformers?

Using Transformers on your local machine can be as simple as the following code; making it usable in Snowflake requires a few additional steps.

from transformers import pipeline
sentiment_analysis_model = pipeline("sentiment-analysis", model='distilbert-base-uncased-finetuned-sst-2-english')
results=sentiment_analysis_model("I love the way we implement ML models at Infostrux")
print(results)

Output:

[{'label': 'POSITIVE', 'score': 0.9992141723632812}]

Disclaimer:

As our purpose is simplicity, we will not fine-tune our models. Pre-trained models are already helpful on their own, but fine-tuning makes them even better adapted to your industry and use cases.

We use a simple username/password authentication method for the same reasons. We recommend using other methods when implementing in your environments.

The joblib Python library

Once a Hugging Face model is created through the Transformers pipeline function, we must serialize it to the file system. That means storing our model in a file so it can be loaded back when required. We will then upload this file to a Snowflake stage so that a function can use it.
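As a minimal local illustration of that round trip (reusing the sentiment-analysis pipeline from the snippet above), serializing and re-loading a model looks like this:

import joblib
from transformers import pipeline

# Build a pipeline, persist it to disk, then load it back; our Snowflake UDF
# will perform the same load step later, except the file will come from a stage
sentiment_analysis_model = pipeline("sentiment-analysis", model='distilbert-base-uncased-finetuned-sst-2-english')
joblib.dump(sentiment_analysis_model, 'sentiment-analysis.joblib')

restored_model = joblib.load('sentiment-analysis.joblib')
print(restored_model("The serialization round trip works"))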

Let's visualize it:

Pre-trained model from Hugging Face to a Snowflake stage

Snowpark

Snowflake's Snowpark library will handle our interactions with Snowflake. Specifically, it gives us a method to upload our serialized model to a Snowflake stage, and it lets us register a User Defined Function (UDF) in the Snowflake database; that final function is what we'll use for inference.

Let's complete our diagram:

Create a Function leveraging the pre-trained model

We'll orchestrate these components in a single Python script that takes us from the Hugging Face repository to a ready-to-use UDF, apart from a few preparations such as importing libraries and creating stages.

Next, let's take a glance at our use cases and describe the overall process before seeing how to instantiate it for each one.

Our use cases:

In Part I, we will address the following use cases in a simple way: no training, no fine-tuning; just pull from Hugging Face and push to Snowflake.

  • Sentiment analysis: we've already had a glimpse of it in the first example above
  • Text summarization: get the meaning of a text in far fewer words

The Implementation Process:

To create a Snowflake function (UDF) for ML inference out of a Hugging Face pre-trained model, you need to proceed with the following steps. Don't worry about the Python snippets; we'll put all the pieces together in a single script so it all makes sense.

  1. Create a stage in Snowflake. The stage will contain the serialized models (you can do it once in the context of this tutorial)
USE ROLE SYSADMIN;
CREATE DATABASE ML_DEMO_DB;
CREATE SCHEMA ML_DEMO_DB.ML_DEMO_SCHEMA;
CREATE WAREHOUSE SNOWPARK_DEMO;
USE SCHEMA ML_DEMO_DB.ML_DEMO_SCHEMA;
CREATE STAGE my_pretrained_models_stage;

2. Create the sample tables (we’ll see them in the examples)

3. Install the transformers library (you only need to do this once per environment). Please note that I've used version 4.14.1, as using a version different from what is available on Snowflake could lead to unpredictable results.

pip3 install transformers==4.14.1

4. Write the Python code on your local/virtual machine (I've used a Debian-based AWS workspace; let me know in the comments if you use something else and face challenges).

a. Import the pipeline function

b. Using the pipeline function, create your model based on a pre-trained one. The pipeline will use a default pre-trained model if you don't specify one, but you still have the option to choose the exact model you want (we'll see that in one of the examples).
Below is an example of creating a sentiment-analysis model from a pre-trained one. Of course, this part of the code will change depending on our use case.

from transformers import pipeline
sentiment_analysis_model = pipeline("sentiment-analysis")

c. Using the joblib library, serialize your newly imported model (i.e., convert it and save it to your local storage).

import joblib
joblib.dump(sentiment_analysis_model, 'sentiment-analysis.joblib')

d. Now, our model is ready to be uploaded to a Snowflake stage. This step is necessary so that our inference UDF can import and use the pre-trained model.

from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "your_snowflake_account_locator",
    "user": "your_username",
    "password": "your_password",  # you'll work on improving the security of this script
    "role": "SYSADMIN",  # you would apply the least-privilege principle to choose the role
    "warehouse": "SNOWPARK_DEMO",
    "database": "ML_DEMO_DB",
    "schema": "ML_DEMO_SCHEMA"
}).create()

session.file.put(
    'sentiment-analysis.joblib',
    stage_location='@ML_DEMO_DB.ML_DEMO_SCHEMA.my_pretrained_models_stage',
    overwrite=True,
    auto_compress=False
)
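As noted in the disclaimer above, username/password is used here only for simplicity. If you prefer key-pair authentication, a minimal sketch could look like the following (assuming an unencrypted PKCS#8 private key file named rsa_key.p8 that has been registered for your Snowflake user):

from cryptography.hazmat.primitives import serialization
from snowflake.snowpark import Session

# Load the private key from the (hypothetical) rsa_key.p8 file
with open("rsa_key.p8", "rb") as key_file:
    private_key = serialization.load_pem_private_key(key_file.read(), password=None)

# Snowflake expects the key as DER-encoded bytes
private_key_der = private_key.private_bytes(
    encoding=serialization.Encoding.DER,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

session = Session.builder.configs({
    "account": "your_snowflake_account_locator",
    "user": "your_username",
    "private_key": private_key_der,  # replaces the "password" entry
    "role": "SYSADMIN",
    "warehouse": "SNOWPARK_DEMO",
    "database": "ML_DEMO_DB",
    "schema": "ML_DEMO_SCHEMA"
}).create()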

e. Using the cachetools library, create a function that loads the serialized model from a file. Why? Models can be huge and take some time to load. By caching the result, the output of our read_model function is stored in memory, so whenever we need the same model again, the function fetches it from memory instead of loading it from storage.

import sys
import cachetools

@cachetools.cached(cache={})
def read_model():
    import_dir = sys._xoptions.get("snowflake_import_directory")
    if import_dir:
        # Load the model from the UDF's import directory
        return joblib.load(f'{import_dir}/sentiment-analysis.joblib')

f. Register the function that does the inference in Snowflake using the Snowpark library. Registration is explained in the "Snowpark" paragraph above.

We make sure to declare the packages our function uses and to import our serialized model from the stage.

import pandas as pd
from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import PandasSeriesType, StringType

@pandas_udf(
    name='ML_DEMO_DB.ML_DEMO_SCHEMA.get_sentiment',
    session=session,
    is_permanent=True,
    replace=True,
    imports=[
        '@ML_DEMO_DB.ML_DEMO_SCHEMA.my_pretrained_models_stage/sentiment-analysis.joblib'
    ],
    input_types=[PandasSeriesType(StringType())],
    return_type=PandasSeriesType(StringType()),
    stage_location='@ML_DEMO_DB.ML_DEMO_SCHEMA.my_pretrained_models_stage',
    packages=['cachetools==4.2.2', 'transformers==4.14.1']
)
def get_sentiment(sentences):
    # Load the sentiment analysis model from the stage,
    # using the caching mechanism
    sentiment_analysis_model = read_model()
    # Apply the model to each sentence
    predictions = []
    for sentence in sentences:
        result = sentiment_analysis_model(sentence)
        predictions.append(result)
    return pd.Series(predictions)

5. Run the Python script

6. Open a Snowflake worksheet and run your function on the data we pre-loaded. Voilà!

create table hotel_reviews (review varchar) ;
insert into hotel_reviews values ('I would not recommend this hotel to anyone. The room phone is broken so if you need anything you hv to go to reception. Lots of mosquitoes and they dun provide any repellent or spray. Bad service during check in and overall very dissapointed with the hotel'),
('I booked for a weekend.really i was impressed by the cleanliness and the organisation .... The services were very good.the food was delicious and varied presented in an Amazing and fabulous way....even the hôtel staff were cheerful and kind.. That is why i think to Book another Time with my friends'),
('The location Large swimming pool Main dishes are good for dinner
Old facilities Lack of maintenance (elevator, bathrooms, balconies sliding doors.) Lack of choice for breakfast Low desserts quality'),
('beach, pool and the games (pool, airhockey etc) were nice to have available. the room was spacious and nice
gosh, I did not expect to find an "all inclusive hotel"... with noisy music and al that "fun" when booking into a Golden Tulip (but apparently that was my misperception..)')
;

select review as PARAGRAPH, GET_SENTIMENT(review) as RESULT from hotel_reviews;

As promised, here’s a link to the full Python script.

So… that's fantastic; I love seeing that I can call an inference function from within Snowflake on Snowflake data. This opens up many applications for historical data, like translation, checking whether two sentences have the same meaning, text summarization, and text classification.

Let's see one more of them.

Text Summarization Use Case:

Suppose you have thousands of text pages stored in your database, and you want to build a summary automatically for SEO or indexing purposes.

Tip: to avoid struggling when looking for the best model for your use case, visit the Hugging Face models page and click on the tag that matches your use case.

Pick a popular model on the page filtered by the "summarization" tag and check its description. Some model pages also give you the opportunity to try the inference online.

Even though popularity is not enough to prove a model's reliability, the facebook/bart-large-cnn model, for example, was the most popular and has a good working example, but it didn't perform well with my own texts.

So I found the philschmid/bart-large-cnn-samsum model, which looked particularly promising. Let's go through the steps to get our inference working in a Snowflake account.

First, if you haven't run the previous example, or if you wish to go through the end-to-end process again, repeat steps 1 to 3 from "The Implementation Process" section above.

Then, continue with Step 4, instantiated with our new use case values.

from transformers import pipeline
summarize = pipeline('summarization', model='philschmid/bart-large-cnn-samsum')

Now, let’s dump our model to a local folder, then upload it to a Snowflake stage.

# save the model to a local folder
joblib.dump(summarize, 'summarization.joblib')

# upload our model to a Snowflake stage
session = Session.builder.configs({
    "account": "your_snowflake_account_locator",
    "user": "your_username",
    "password": "your_password",  # you'll work on improving the security of this script
    "role": "SYSADMIN",  # you would apply the least-privilege principle to choose the role
    "warehouse": "SNOWPARK_DEMO",
    "database": "ML_DEMO_DB",
    "schema": "ML_DEMO_SCHEMA"
}).create()

session.file.put(
    'summarization.joblib',
    stage_location='@ML_DEMO_DB.ML_DEMO_SCHEMA.my_pretrained_models_stage',
    overwrite=True,
    auto_compress=False
)

The next step is to register a UDF (a Snowflake User Defined Function) using the pandas_udf decorator, passing the model we uploaded to the stage as an import.

Remember that we also use the cachetools library to make loading the model inside the function faster.

@cachetools.cached(cache={})
def read_model():
    import_dir = sys._xoptions.get("snowflake_import_directory")
    if import_dir:
        # Load the model from the UDF's import directory
        return joblib.load(f'{import_dir}/summarization.joblib')

@pandas_udf(
    name='ML_DEMO_DB.ML_DEMO_SCHEMA.get_text_summary',
    session=session,
    is_permanent=True,
    replace=True,
    imports=[
        '@ML_DEMO_DB.ML_DEMO_SCHEMA.my_pretrained_models_stage/summarization.joblib'
    ],
    input_types=[PandasSeriesType(StringType())],
    return_type=PandasSeriesType(StringType()),
    stage_location='@ML_DEMO_DB.ML_DEMO_SCHEMA.my_pretrained_models_stage',
    packages=['cachetools==4.2.2', 'transformers==4.14.1']
)
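# A sketch of the inference function itself, mirroring the sentiment-analysis
# UDF above; the exact body can be found in the full script linked below
def get_text_summary(texts):
    # Load the summarization model from the stage, using the caching mechanism
    summarization_model = read_model()
    # Apply the model to each text and collect the summaries
    summaries = []
    for text in texts:
        result = summarization_model(text)
        summaries.append(result)
    return pd.Series(summaries)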

Our Python script is all set; let’s give it a go.

[our script is running; it may take a couple of minutes]

Now, we should have our get_text_summary function ready to use. Before we test it, let's put some data in a table.

We just need some data to play with. In this example, we'll have only one row of data because of the length of the texts, but multiple rows also work.

The full script for registering a text summarization function can be found here.

CREATE TABLE TEXT_TO_SUMMARIZE (long_text varchar) ;
INSERT INTO TEXT_TO_SUMMARIZE VALUES ('The history of artificial intelligence (AI) began in antiquity, with myths, stories and rumors of artificial beings endowed with intelligence or consciousness by master craftsmen. The seeds of modern AI were planted by philosophers who attempted to describe the process of human thinking as the mechanical manipulation of symbols. This work culminated in the invention of the programmable digital computer in the 1940s, a machine based on the abstract essence of mathematical reasoning. This device and the ideas behind it inspired a handful of scientists to begin seriously discussing the possibility of building an electronic brain.
The field of AI research was founded at a workshop held on the campus of Dartmouth College, USA during the summer of 1956.[1] Those who attended would become the leaders of AI research for decades. Many of them predicted that a machine as intelligent as a human being would exist in no more than a generation, and they were given millions of dollars to make this vision come true.[2]
Eventually, it became obvious that commercial developers and researchers had grossly underestimated the difficulty of the project.[3] In 1974, in response to the criticism from James Lighthill and ongoing pressure from congress, the U.S. and British Governments stopped funding undirected research into artificial intelligence, and the difficult years that followed would later be known as an "AI winter". Seven years later, a visionary initiative by the Japanese Government inspired governments and industry to provide AI with billions of dollars, but by the late 1980s the investors became disillusioned and withdrew funding again.
Investment and interest in AI boomed in the first decades of the 21st century when machine learning was successfully applied to many problems in academia and industry due to new methods, the application of powerful computer hardware, and the collection of immense data sets.');

Time to run our inference function.
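You can run it from a Snowflake worksheet exactly as we did for the sentiment example. Here is a sketch of the same call issued through our Snowpark session, assuming the TEXT_TO_SUMMARIZE table was created in the ML_DEMO_DB.ML_DEMO_SCHEMA schema:

# Call the summarization UDF on our sample table through the Snowpark session
# (running the equivalent SELECT in a Snowflake worksheet works just as well)
results = session.sql(
    "SELECT long_text AS PARAGRAPH, GET_TEXT_SUMMARY(long_text) AS RESULT "
    "FROM ML_DEMO_DB.ML_DEMO_SCHEMA.TEXT_TO_SUMMARIZE"
).collect()
print(results)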

And here is the whole summary if you want to see how good it is.

[{'summary_text': 'The history of artificial intelligence (AI) began in antiquity with myths, stories and rumors of artificial beings endowed with intelligence or consciousness by master craftsmen. The field of AI research was founded at a workshop held on the campus of Dartmouth College, USA during the summer of 1956. In 1974, in response to the criticism from James Lighthill, the U.S. and British governments stopped funding undirected research into artificial intelligence. Seven years later, a visionary initiative by the Japanese Government inspired governments and industry to provide AI with billions of dollars, but by the late 1980s the investors became disillusioned and withdrew funding.'}]

I'm always amazed to see that. It takes 45 seconds for the function to return something, so there is still work to be done to improve our inference function's performance, but we can already leverage pre-trained models that are ready to transfer to Snowflake.

In part two, we'll see a couple of other use cases, but this time we’ll do a bit of tuning to turn ineffective pre-trained models into useful ones.

Thank you for reading; I hope you’ve learned something today with me. I’m Mehdi Sidi Boumedine, Senior Data Engineer at Infostrux. Feel free to connect on LinkedIn. Also, find more interesting articles by subscribing to Infostrux Engineering Blog. Stay tuned for part II.
