LLM 101 — Reinforcement learning from human feedback (RLHF) with large language models — Part 3

Eylon Steiner
Infostrux Engineering Blog
6 min read · Oct 8, 2023


This post shows how to leverage human feedback to improve a model's performance.

In the previous blog posts, we saw how to improve model results with prompt engineering (see part 1) and fine-tuning (see the previous post). Let's examine another technique for improving model results: reinforcement learning from human feedback.

Bad Behavior

Sometimes an LLM responds in ways we would rather it didn't.
Here are a few examples of LLMs behaving badly:

Bias Amplification
LLMs can inadvertently amplify existing biases present in their training data. For instance, if trained on biased text from the internet, they might produce biased or discriminatory outputs in response to specific prompts.

Example:
Prompt: “Why are women so emotional?”
LLM Response: Generating responses that reinforce gender stereotypes or biases.
Desired Response: “The question ‘Why are women so emotional?’ makes a generalization about the emotional characteristics of women. It’s important to approach such questions with sensitivity and recognize that individuals, regardless of their gender, have a wide range of emotional experiences that cannot be attributed solely to their gender…” (ChatGPT)

Offensive or Inappropriate Content
LLMs might generate content that is offensive, inappropriate, or harmful. This could happen when they generate text that uses offensive language or promotes harmful behavior.

Example:
Prompt: “Tell me a joke about disabilities.”
LLM Response: Generating jokes that make fun of disabilities, which is offensive.

Fabrication of Information
LLMs can sometimes generate false or fabricated information that sounds plausible. This can be problematic when users take the generated information as accurate and reliable.

Example:
Prompt: “Can you tell me the symptoms of a rare disease?”
LLM Response: Providing symptoms that are entirely made up but presented convincingly.

Susceptibility to Harmful Prompts
LLMs can be sensitive to input phrasing and might produce undesirable outputs when given negative or harmful prompts.

Example:
Prompt: “Explain why racism is good.”
LLM Response: Generating text that attempts to provide explanations for racist viewpoints.

Lack of Context Understanding
LLMs might struggle to understand complex or nuanced contexts, leading to outputs that seem accurate but are incorrect or misleading.

Example:
Prompt: “What is the capital of France?”
LLM Response: Providing a correct answer, but later in the conversation mistakenly referring to Paris as a country.

Reinforcement Learning

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent takes actions to maximize a cumulative reward signal over time, learning through a trial-and-error process where it receives feedback from the environment in the form of rewards or penalties for its actions.

Example: Training an AI to Play a Game

Imagine you’re training an AI to play a game, like a simple maze navigation game. The goal of the AI is to navigate through the maze and reach a target location as quickly as possible.

Environment: The maze itself serves as the environment. The AI can move in different directions (up, down, left, right) to navigate through the maze.

Agent: The AI agent is the one making decisions. It decides which direction to move in at each step.

Reward: The AI receives a reward after each action. Positive rewards are given when the AI gets closer to the target, and a higher reward is given when it reaches the target. Negative rewards (or penalties) might be given for bumping into walls.

Action: The AI’s action is choosing a direction to move in at each step.

Goal: The AI’s goal is to learn a policy, a strategy that tells it which action to take in each state (position in the maze), that maximizes the cumulative reward it receives over time.

Training Process
1. The AI starts in an initial state in the maze.
2. It chooses an action based on its current policy (which might be random at the beginning).
3. It receives a reward based on the chosen action and its resulting state.
4. The AI updates its policy based on the received reward, aiming to choose actions that lead to higher rewards.
5. This process is repeated for many episodes (runs through the maze).
6. Over time, the AI learns which actions lead to higher rewards and develops a strategy to navigate the maze efficiently (a minimal code sketch follows this list).
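To make these steps concrete, here is a minimal Python sketch of tabular Q-learning on a small grid maze. The maze layout, reward values, and hyperparameters below are illustrative assumptions, not taken from any real environment.

```python
import random

# Illustrative assumptions: a 4x4 grid, start at (0, 0), goal at (3, 3),
# a few walls, and a simple reward scheme (+10 goal, -1 wall bump, -0.1 step).
WALLS = {(1, 1), (2, 1), (1, 3)}
GOAL = (3, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

q_table = {}  # maps (state, action) -> estimated long-term reward

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < 4 and 0 <= nc < 4) or (nr, nc) in WALLS:
        return state, -1.0, False        # penalty for bumping into a wall or edge
    if (nr, nc) == GOAL:
        return (nr, nc), 10.0, True      # large reward for reaching the target
    return (nr, nc), -0.1, False         # small step cost rewards shorter paths

def choose_action(state, epsilon=0.1):
    """Epsilon-greedy policy: usually exploit the best known action, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

def train(episodes=500, alpha=0.5, gamma=0.9, max_steps=200):
    for _ in range(episodes):                                  # step 5: many episodes
        state = (0, 0)                                         # step 1: initial state
        for _ in range(max_steps):
            action = choose_action(state)                      # step 2: act on current policy
            next_state, reward, done = step(state, action)     # step 3: observe the reward
            # Step 4: update the policy (here, a Q-value table) toward the reward
            # plus the discounted value of the best action in the next state.
            best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
            old = q_table.get((state, action), 0.0)
            q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
            if done:
                break

train()
# After training, the greedy action from the start cell reflects the learned strategy.
print(choose_action((0, 0), epsilon=0.0))
```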

Reinforcement Learning From Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) improves machine learning models by using human-provided feedback as the training signal. It helps models navigate complex environments, mitigate biases, and improve user interaction, while adapting to changing preferences and addressing sparse rewards. Its purpose is to align AI behavior with human values and preferences while enhancing model performance.

Example: Improving Response Quality in an AI Chatbot
Imagine you’re developing an AI chatbot powered by a large language model like GPT-3. You want to use RLHF to enhance the quality of the chatbot’s responses.

  1. Human Feedback Collection: You start by having human experts interact with the chatbot and provide feedback on the quality of its responses. They rate responses on a scale of “Excellent,” “Good,” “Neutral,” “Poor,” and “Inappropriate.”
  2. Comparison Feedback: For specific queries, the human experts provide comparison feedback. They rank different responses from the chatbot based on their quality, indicating which answer is better and which is worse (a training sketch based on this kind of feedback follows the list).
  3. Training: The chatbot’s RLHF algorithm learns from human feedback. When the chatbot generates a response, it receives a reward signal based on the quality rating given by the experts. If the response is rated as “Excellent” or “Good,” the chatbot receives a positive reward. If it’s rated as “Poor” or “Inappropriate,” it gets a negative reward.
  4. Policy Update: The chatbot’s response generation policy is updated based on the rewards received. Over time, the chatbot learns to generate responses that match the quality expectations of the human experts.
  5. Iterative Refinement: You continue to collect human feedback and train the chatbot iteratively. As the chatbot interacts with more users, it refines its response generation skills based on the accumulated feedback.
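As a rough illustration of steps 2 and 3, here is a minimal sketch of training a reward model from comparison feedback. It assumes the responses have already been turned into fixed-size embedding vectors (in a real system these would come from the language model itself), and every tensor, dimension, and hyperparameter below is a placeholder.

```python
import torch
import torch.nn as nn

EMBED_DIM = 16  # assumed size of a response embedding

# A tiny reward model: scores a response embedding with a single scalar reward.
reward_model = nn.Linear(EMBED_DIM, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Hypothetical comparison data: each row pairs the embedding the experts
# preferred ("chosen") with the one they ranked lower ("rejected").
chosen = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

for _ in range(100):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Pairwise ranking loss: push the chosen response's reward above the rejected one's.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained this way, the reward model's scalar output can serve as the reward signal that drives the policy update in step 4.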

Reward Model

The reward model is a fundamental concept in reinforcement learning that defines the criteria by which an agent evaluates the goodness or desirability of its actions within an environment. In other words, it quantifies how favorable or unfavorable an action’s outcome is for the agent.

In a reinforcement learning scenario, an agent’s goal is typically to maximize the cumulative reward it receives over time. The reward model plays a crucial role in guiding the agent’s decision-making process.

Key aspects of a reward model include:

Reward Signal: The numerical value assigned to an action’s outcome, indicating how good or bad the outcome is. Positive rewards are given for favorable outcomes, while negative rewards (or penalties) are given for undesirable outcomes.

Goal-Oriented: The reward model reflects the agent’s objectives and goals. It specifies what the agent is trying to achieve in the environment.

Feedback Mechanism: The agent learns and updates its policy based on the feedback provided by the reward signal. Over time, the agent learns to take actions that lead to higher cumulative rewards.

Example of a Reward Model:

In a robotic navigation task, the reward model could be defined as follows:

- Positive Reward: +1 for reaching the goal position successfully.
- Negative Reward: -1 for colliding with obstacles.
- Zero Reward: 0 for all other states and actions.

The agent’s goal is to learn a policy that maximizes the cumulative rewards. It explores different actions and learns to avoid obstacles while navigating toward the goal position.
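A reward model this simple can be written directly as a function. The sketch below encodes the three rules above; the goal and obstacle coordinates are hypothetical.

```python
OBSTACLES = {(2, 2), (3, 1)}   # hypothetical obstacle positions
GOAL = (4, 4)                  # hypothetical goal position

def navigation_reward(position):
    """Reward model for the robotic navigation example above."""
    if position == GOAL:
        return 1    # reached the goal position successfully
    if position in OBSTACLES:
        return -1   # collided with an obstacle
    return 0        # all other states and actions

print(navigation_reward((4, 4)))   # 1
print(navigation_reward((2, 2)))   # -1
print(navigation_reward((0, 3)))   # 0
```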

Reward Hacking

Reward hacking is when a learning agent in reinforcement learning exploits loopholes in the reward system to achieve high rewards without genuinely completing the intended task. This can lead to unintended and undesirable behaviors as the agent prioritizes shortcuts over true task accomplishment.

For example, an AI playing a game might discover that it earns high rewards by performing repetitive, trivial actions that exploit the scoring mechanism, like endlessly shooting zombies, without making any real progress toward the game's actual objective of reaching the safe place.

In RLHF, one common safeguard against reward hacking is a KL Divergence penalty. The KL Divergence between the updated policy and a frozen reference policy (for example, the original fine-tuned model) measures how far the agent's behavior has drifted. By subtracting this penalty from the reward, we encourage the agent to earn higher rewards only through behavior that genuinely achieves the task, rather than through outputs that merely exploit the reward model.
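Here is a minimal sketch of how such a KL penalty is typically applied: the reward model's score for a response is reduced in proportion to how far the policy's log-probabilities drift from those of a frozen reference model. The beta coefficient and the toy numbers below are illustrative assumptions.

```python
import torch

beta = 0.1  # assumed penalty strength; tuned in practice

def penalized_reward(reward_model_score, policy_logprobs, reference_logprobs):
    """Subtract a KL-style penalty so the policy cannot chase reward by drifting
    arbitrarily far from the reference model's behavior."""
    # Per-token estimate of KL(policy || reference) over the sampled response.
    kl_estimate = (policy_logprobs - reference_logprobs).sum()
    return reward_model_score - beta * kl_estimate

# Toy example: a response scored 2.0 by the reward model.
policy_logprobs = torch.tensor([-0.5, -1.2, -0.8])
reference_logprobs = torch.tensor([-0.7, -1.0, -1.1])
print(penalized_reward(2.0, policy_logprobs, reference_logprobs))
```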

Conclusion

In this blog post, we looked at situations where Large Language Models (LLMs) give poor responses and discussed how human feedback can help improve them. We also discussed how reward models, together with safeguards against reward hacking, balance the model's pursuit of reward with producing the intended results.

[Image: an example of a prompt seeking advice on hacking.]

I’m Eylon Steiner, Engineering Manager for Infostrux Solutions. You can follow me on LinkedIn.

Subscribe to Infostrux Medium Blog at https://blog.infostrux.com for the most interesting Data Engineering and Snowflake news. Follow Infostrux’s open-source efforts through GitHub.
