LLM 101 — Fine-tuning And Evaluating Large Language Models — Part 2

Learn to fine-tune and evaluate large language models

Eylon Steiner
Infostrux Engineering Blog

--


This post explores how performance can be enhanced by fine-tuning large pre-trained models for specific tasks or domains.

In the previous post, we found that providing prompt hints improves responses. However, adding more examples to the prompt takes up context space and loses effectiveness after a few instances. A better alternative is instruction fine-tuning.


Instruction Fine-Tuning

Instruction fine-tuning refers to a machine learning process used when training large language models such as GPT-3.5 and similar architectures. Fine-tuning is a phase that comes after the model's initial pre-training.

Here’s an overview of the process:

  1. Pre-training: In this phase, a language model is trained on a large corpus of text data. During pre-training, the model learns to predict the next word in a sentence, which helps it capture grammar, context, and various linguistic patterns.
  2. Fine-tuning: After pre-training, the model is further trained or “fine-tuned” on a narrower and more specific dataset. The developers or users often create this dataset to tailor the model’s behavior to a particular task, domain, or application.

For example, if you’re building a chatbot for customer service, you might fine-tune the language model on a dataset of customer support conversations. This helps the model learn how to respond appropriately to customer queries and understand the specific language used in that context.

Fine-tuning involves using the pre-trained model as a starting point and then continuing the training process on the specific dataset. This process allows the model to adapt its knowledge and behavior.

For the chatbot example, we prepare a set of "prompt instruction templates" from our database of recorded chats, so both the inquiry and the response come from real conversations between your customers and the customer service team.

[Main Instruction]: Provide appropriate responses for a customer service chatbot handling inquiries related to [specific products or services]. Ensure the responses are accurate, helpful, and empathetic.

[Context]: You are the voice of a knowledgeable and friendly customer service representative. Your goal is to address customer queries, offer solutions, and provide a positive experience. Responses should reflect a professional tone and demonstrate empathy towards customers.

[Example Prompt]:
"My television won't turn on."

[Response Expectation]:
Certainly! I'm here to help you get your television up and running.
Ensure that your television is properly plugged into a working power outlet.
Verify that the batteries in your remote control are functioning and properly inserted.
Unplug your TV from the power source, wait for about 60 seconds, and then plug it back in. Press the power button on the TV or the remote control to see if it powers on.
Happy to hear that this worked out!
Is there anything else I can assist you with?
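
To turn recorded chats into a training file that follows this template, you might write each chat as one prompt/response record. This is only a hypothetical sketch: the field names (inquiry, agent_reply) and the output file name are assumptions for illustration, not a real schema.

```python
# Hypothetical sketch: convert recorded support chats into instruction records.
import json

recorded_chats = [
    {"inquiry": "My television won't turn on.",
     "agent_reply": "Ensure the TV is plugged into a working outlet, check the "
                    "remote batteries, then unplug it for 60 seconds and try again."},
    # ... more real chats from your customer-service database
]

instruction = ("Provide an accurate, helpful, and empathetic response "
               "for a customer service chatbot.")

with open("chat_instructions.jsonl", "w") as f:
    for chat in recorded_chats:
        record = {
            "prompt": f"{instruction}\nCustomer: {chat['inquiry']}\nAgent:",
            "response": chat["agent_reply"],
        }
        f.write(json.dumps(record) + "\n")
```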

Once you have prepared a file with many examples, run the prompts through your model and compare the model's responses to the recorded answers. Feed this information back into training and run additional batches of samples until the model is fine-tuned. For this kind of task, a few hundred examples should be enough.
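
As a rough illustration of this step, here is a minimal sketch of continuing training on such a file with the Hugging Face transformers Trainer. The base model (gpt2), the file name, and the hyperparameters are placeholders rather than recommendations.

```python
# Minimal sketch: instruction fine-tuning a small causal LM on prompt/response pairs.
# Assumes `transformers`, `datasets`, and `torch` are installed.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in; use the base model you intend to fine-tune
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each record holds one instruction prompt and its recorded response.
dataset = load_dataset("json", data_files="chat_instructions.jsonl")["train"]

def to_features(example):
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chatbot-ft",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```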

Catastrophic Forgetting

The challenge with instruction fine-tuning is that while it can enhance the model’s performance for a particular case, it might hinder its ability to excel in other tasks. For instance, if you fine-tune the model for customer service, its performance might decline when tasked with summarizing technical issues. Therefore, it’s crucial to consider the diverse tasks your model will tackle and carefully weigh the fine-tuning use cases to ensure balanced and effective performance across various scenarios.

If the model must handle multiple use cases, include prompt examples for each of them during fine-tuning; this keeps performance balanced and effective across the different tasks.

FLAN

Fine-tuning can be a demanding task. Google has introduced FLAN (Fine-tuned LAnguage Net), a fine-tuned model version with improved results. To fine-tune the model, substantial datasets were combined with prompt templates; for example, Yelp and IMDb reviews were used for sentiment-analysis training.

A diagram showing that the FLAN model gives better results than zero-shot or few-shot prompting

The image and further reading can be found in Google's FLAN introduction post:

https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html

The FLAN version can be further fine-tuned with more relevant data or your own business examples.
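
To try an instruction-tuned FLAN checkpoint before investing in further fine-tuning, you can load one of the publicly released FLAN-T5 models from the Hugging Face Hub. This sketch assumes the transformers and torch packages are installed; the example prompt is made up.

```python
# Quick sketch: prompting a FLAN-T5 checkpoint out of the box.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = ("Classify the sentiment of this review as positive or negative: "
          "The food was cold and the service was slow.")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```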

How to Evaluate your Model?

Accuracy in traditional machine learning refers to the ratio of correctly predicted instances to the total number of instances in a dataset. It is a standard evaluation metric used to measure the performance of a classification model. Accuracy is calculated using the formula:

Accuracy = (Number of Correct Predictions / Total Number of Predictions) x 100

In this formula, the “Number of Correct Predictions” represents the instances that the model correctly classified, and the “Total Number of Predictions” is the entire dataset being evaluated.
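
As a quick illustration of the formula with made-up predictions:

```python
# Tiny sketch of the accuracy formula above, using invented labels.
predictions = ["positive", "negative", "positive", "positive"]
labels      = ["positive", "negative", "negative", "positive"]

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels) * 100
print(f"Accuracy: {accuracy:.0f}%")  # 3 correct out of 4 -> 75%
```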

This metric becomes problematic when it is difficult to define what counts as a correct prediction, for example, when translating text.

Large Language Models (LLMs) like GPT-3 are often evaluated using metrics such as METEOR, ROUGE, WER, and PER. These metrics are commonly used in natural language processing (NLP) to assess the quality of generated text, such as machine-generated summaries or translations, although most of them were originally developed for evaluating machine translation or summarization. While they can provide insight into the quality of LLM-generated text, they capture only some aspects of language understanding and generation.

Unigrams, bigrams, and n-grams are terms used in natural language processing to refer to sequences of words in a text. They are crucial in various language analysis tasks, including machine translation, text generation, and information retrieval.

A unigram is the simplest form of an n-gram and represents a single word in a text. It is a basic unit of analysis in language processing. For example, in the sentence “The cat is on the mat,” the unigrams are: “The,” “cat,” “is,” “on,” “the,” and “mat.”

A bigram consists of two consecutive words occurring together in a text. Bigrams capture some level of word order information. In the same sentence, the bigrams are: “The cat,” “cat is,” “is on,” “on the,” and “the mat.”

An n-gram is a sequence of n words occurring together in a text. It could be a unigram (n=1), a bigram (n=2), a trigram (n=3), or any larger sequence. N-grams help in understanding the context and relationships between words in a text. For example, in the sentence “The cat is on the mat,” the trigrams are: “The cat is,” “cat is on,” and “is on the.”
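
A small sketch of extracting unigrams, bigrams, and trigrams from the example sentence:

```python
# Extract n-grams from a tokenized sentence.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The cat is on the mat".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```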

ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
ROUGE measures the overlap between the generated text and reference (ground truth) text by computing various types of n-gram matches (unigrams, bigrams, trigrams, etc.). It focuses on recall, or the ability of the generated text to include essential words or phrases from the reference text.

Example of a ROUGE-1 (unigrams) calculation

Generated Sentence: “The cat is on the mat.”
Reference Sentence: “A cat sits on the mat.”

1. First, let’s tokenize both sentences into unigrams:
— Generated Unigrams: [“The”, “cat”, “is”, “on”, “the”, “mat”]
— Reference Unigrams: [“A”, “cat”, “sits”, “on”, “the”, “mat”]

2. Next, count the number of overlapping unigrams between the generated and reference sentences:
— Overlapping Unigrams: [“cat”, “on”, “the”, “mat”] (4 unigrams)

3. Calculate the recall, which is the ratio of overlapping unigrams to the total number of reference unigrams:
— Recall = Number of Overlapping Unigrams / Total Reference Unigrams
— Recall = 4 / 6 ≈ 0.67

4. The ROUGE-1 score is often expressed as a percentage by multiplying the recall by 100:
— ROUGE-1 Score ≈ 0.67 * 100 ≈ 67

For ROUGE-2 (bigrams), the overlapping bigrams are "on the" and "the mat", so the recall is 2 / 5 = 0.40 and the ROUGE-2 score is 40.

Now consider the case of:
Generated Sentence: “The cat is not on the mat.”
Reference Sentence: “A cat sits on the mat.”

The recall-based scores do not change at all: ROUGE-1 recall is still 4 / 6 ≈ 67 and ROUGE-2 recall is still 2 / 5 = 40, even though the word "not" reverses the meaning of the sentence. Only precision drops (ROUGE-1 precision falls from 4 / 6 ≈ 67 to 4 / 7 ≈ 57 because of the extra word). This illustrates a key limitation of n-gram overlap metrics: they measure surface similarity, not meaning.
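
Here is a minimal sketch of the recall-based ROUGE-1/ROUGE-2 calculation used above (case-insensitive, trailing punctuation stripped); production evaluations usually rely on a dedicated library such as rouge-score rather than hand-rolled code.

```python
# Minimal sketch of recall-based ROUGE-N, matching the worked example above.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(generated, reference, n):
    gen = Counter(ngrams(generated.lower().rstrip(".").split(), n))
    ref = Counter(ngrams(reference.lower().rstrip(".").split(), n))
    overlap = sum((gen & ref).values())  # clipped n-gram overlap
    return overlap / sum(ref.values())

print(rouge_n_recall("The cat is on the mat.", "A cat sits on the mat.", 1))  # ~0.67
print(rouge_n_recall("The cat is on the mat.", "A cat sits on the mat.", 2))  # 0.40
```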

Example of METEOR:

Calculating METEOR (Metric for Evaluation of Translation with Explicit ORdering) involves multiple steps and considerations, including unigram matching, synonymy, stemming, and word order. It is a complex metric that requires specialized software or libraries to compute accurately.
Here is a simplified overview of how METEOR is calculated:

Generated Sentence: “The cat is not on the mat.”
Reference Sentence: “A cat sits on the mat.”

1. Preprocessing:
— Tokenize the generated and reference sentences into words.
— Apply stemming and other linguistic preprocessing if required.

2. Unigram Matching:
— Count the number of matching unigrams between the generated and reference sentences.
— Matching unigrams: [“cat”, “on”, “the”, “mat”] (4 unigrams)

3. Synonymy and Stemming:
— Identify synonyms and apply stemming to consider variations of words.

4. Word Order and Phrase Matching:
— Consider the longest common subsequences and phrases between the generated and reference sentences.

5. Normalization and Penalty:
— Compute a penalty term based on differences in unigram order and length between the generated and reference sentences.

6. Score Calculation:
— METEOR score is a combination of precision and recall, adjusted by a penalty term.
— The exact formula involves various weighting factors and calculations.

Because the generated sentence contains the word "not," which has no match in the reference, the precision component drops and the METEOR score is lower in this specific case.
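
In practice, METEOR is rarely computed by hand. Here is a minimal sketch using NLTK's implementation, assuming the nltk package and its WordNet data are installed; recent NLTK versions expect pre-tokenized input.

```python
# Minimal sketch: METEOR with NLTK (run nltk.download("wordnet") once beforehand).
from nltk.translate.meteor_score import meteor_score

reference = "A cat sits on the mat .".split()
generated = "The cat is not on the mat .".split()

score = meteor_score([reference], generated)
print(f"METEOR: {score:.2f}")
```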

Benchmarking Your LLM

Benchmarking a Large Language Model (LLM) involves assessing its performance on a variety of tasks and datasets to understand its strengths, weaknesses, and overall capabilities.

GLUE (General Language Understanding Evaluation) is a benchmark and evaluation framework designed to assess the performance of various natural language understanding (NLU) tasks using machine learning models. It was developed to provide a standardized way of evaluating the capabilities of different language models and NLU systems across a wide range of tasks.

GLUE consists of a diverse set of NLU tasks, each focusing on a specific aspect of language understanding. These tasks cover a variety of linguistic phenomena, including sentence similarity, sentiment analysis, textual entailment, and more. GLUE aims to encourage the development of models that can perform well across multiple NLU tasks, thereby demonstrating a more comprehensive understanding of language.

The GLUE benchmark comprises several individual tasks, including:

1. CoLA (Corpus of Linguistic Acceptability): Sentence grammaticality judgment task.
2. SST-2 (Stanford Sentiment Treebank): Sentiment analysis task.
3. MRPC (Microsoft Research Paraphrase Corpus): Paraphrase identification task.
4. STS-B (Semantic Textual Similarity Benchmark): Sentence pair similarity task.
5. QQP (Quora Question Pairs): Duplicate question identification task.
6. MNLI (MultiNLI): Natural language inference task.
7. QNLI (Question Natural Language Inference): Question entailment task.
8. RTE (Recognizing Textual Entailment): Textual entailment task.
9. WNLI (Winograd NLI): Natural language inference task derived from the Winograd Schema Challenge (coreference resolution).
10. AX (Diagnostic Dataset): A hand-crafted diagnostic natural language inference set for analyzing model behavior.

GLUE has been followed by SuperGLUE, an extended benchmark that includes more challenging NLU tasks, pushing models to demonstrate their understanding and reasoning abilities further; public leaderboards for both benchmarks are available on the GLUE and SuperGLUE websites.
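
If you want to run a GLUE task against your own model, a convenient starting point is loading one of its datasets with the Hugging Face datasets library (an assumption of this sketch, not part of GLUE itself).

```python
# Minimal sketch: load the SST-2 sentiment task from the GLUE benchmark.
# Assumes the `datasets` library is installed (pip install datasets).
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")

print(sst2)              # train / validation / test splits
print(sst2["train"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```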

Conclusion

In this post, we saw how instruction fine-tuning can enhance the results of a pre-trained large model, and we learned how to evaluate and benchmark the model.

I’m Eylon Steiner, Engineering Manager for Infostrux Solutions. You can follow me on LinkedIn.

Subscribe to Infostrux Medium Blog at https://blog.infostrux.com for the most interesting Data Engineering and Snowflake news. Follow Infostrux’s open-source efforts through GitHub.
