Introduction

Imagine you're new to a city and trying to find the best restaurant for dinner. Each time you eat out, you rate your experience. Sometimes, you return to restaurants you enjoy (exploitation), while other times, you try new places (exploration). Over time, you learn which restaurants consistently provide the best meals, helping you make better dining choices.

This everyday scenario mirrors the core principles of Reinforcement Learning (RL). In this blog post, we'll explore three fundamental concepts in RL: the mechanics of rewards, the mathematics of discounting, and the strategic balance between exploration and exploitation.

Let's delve into these concepts, examining their technical foundations and practical implications using our restaurant-finding scenario as a running example.


The Mechanics of Rewards in Reinforcement Learning

Defining Rewards

Rewards are numerical feedback signals that the environment sends to the agent after each action, and they are the only feedback the agent receives to judge whether an action was good or bad. In our restaurant scenario, the reward could be your satisfaction rating after each meal, say on a scale from 1 to 10.

The cumulative reward at time step t, often called the return, is the total reward the agent collects from that point onward. In our example, it would be the sum of the satisfaction ratings from all the meals still ahead of you.

Formally, the cumulative reward R(τ), the return of the trajectory τ starting at time step t, can be written as:

R(τ) = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + …

where r_{t+k} is the reward received k steps after time step t.
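
To make this concrete, here is a minimal Python sketch, using made-up satisfaction ratings rather than anything from a real dataset, that sums the rewards collected along a short trajectory of meals:

```python
# Made-up satisfaction ratings (1-10), one per meal, in the order they happened.
ratings = [7, 4, 9, 6, 8]

def cumulative_reward(rewards):
    """Undiscounted return: simply the sum of all rewards along the trajectory."""
    return sum(rewards)

print(cumulative_reward(ratings))  # 34
```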

However, this approach assumes that all rewards, regardless of when they are received, are equally valuable, which isn't typically the case in real-world scenarios. In reality, rewards that occur sooner are often more valuable and predictable than those received in the distant future. This is where reward discounting comes into play, allowing the agent to weigh short-term and long-term rewards differently.

Reward Discounting

To account for the varying importance of rewards over time, we introduce a discount factor γ, a value between 0 and 1 (commonly between 0.95 and 0.99). The discount factor controls how the agent trades off short-term and long-term rewards: the closer γ is to 1, the more future rewards count, while a γ close to 0 makes the agent care almost exclusively about immediate rewards.

The formula for the discounted cumulative reward at time step t then becomes:

R(τ) = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + γ³·r_{t+4} + …

In our restaurant example, a high γ means you still care a lot about future meals, so you might be willing to make a longer or pricier trip to a well-reviewed restaurant for a better experience later. A low γ means you weight tonight's convenience much more heavily and are more likely to stick with nearby, familiar options.
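
As a rough sketch with the same made-up ratings as before, the snippet below computes the discounted return and shows how the choice of γ changes how much future meals matter:

```python
# Made-up satisfaction ratings (1-10), one per meal.
ratings = [7, 4, 9, 6, 8]

def discounted_return(rewards, gamma):
    """R(τ) = r_1 + γ·r_2 + γ²·r_3 + ... for a finite trajectory."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(round(discounted_return(ratings, gamma=0.99), 2))  # 33.29: later meals still count almost fully
print(round(discounted_return(ratings, gamma=0.50), 2))  # 12.5: mostly tonight's meal matters
```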

Exploration and Exploitation Tradeoff

In RL, agents face the challenge of balancing two competing behaviours: exploration, trying new actions to gather more information about the environment, and exploitation, choosing the actions that currently appear to give the highest reward.

Given that RL operates under the reward hypothesis—the idea that the goal is to maximize cumulative reward—agents might be tempted to exploit known actions repeatedly. However, this can trap the agent in suboptimal behaviour, preventing it from exploring potentially better actions.

To avoid this, a balance between exploration and exploitation is necessary. The agent must explore enough to discover better strategies while exploiting its current knowledge to achieve high rewards. Striking this balance is key to successful learning in RL.

In our scenario:

If you only ate at your favourite restaurant (pure exploitation), you might miss out on discovering even better places. Conversely, if you always tried new restaurants (pure exploration), you might sit through too many mediocre meals.
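
One simple way to strike this balance is an ε-greedy rule: with a small probability ε you explore a random restaurant, otherwise you exploit the one with the best average rating so far. The sketch below illustrates the idea; the restaurant names, ratings, and the value of ε are invented for this example rather than taken from any real recommendation system.

```python
import random

restaurants = ["Nonna's", "Spice Route", "The Harbour"]  # invented names
totals = {name: 0.0 for name in restaurants}             # sum of ratings per restaurant
visits = {name: 0 for name in restaurants}               # number of meals eaten there

def choose_restaurant(epsilon=0.1):
    """Pick a restaurant: explore with probability ε, otherwise exploit the best average rating."""
    untried = [name for name in restaurants if visits[name] == 0]
    if untried:                       # make sure every place is tried at least once
        return random.choice(untried)
    if random.random() < epsilon:     # exploration: random choice
        return random.choice(restaurants)
    # exploitation: highest average rating observed so far
    return max(restaurants, key=lambda name: totals[name] / visits[name])

def record_meal(name, rating):
    """Update the running estimate after a meal (rating on a 1-10 scale)."""
    totals[name] += rating
    visits[name] += 1

# One night: choose where to eat, then record how much you enjoyed it.
tonight = choose_restaurant()
record_meal(tonight, rating=8)
```

A small ε keeps most nights at proven favourites while still leaving room to stumble onto something better; larger values of ε make the diner more adventurous at the cost of more mediocre meals.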

Conclusion

Rewards and discounting give a reinforcement learning agent a way to evaluate actions over time. However, the agent also relies on a policy, which serves as its decision-making framework: the policy dictates how the agent chooses actions based on the current state, ultimately guiding it toward maximizing long-term reward. Next, we’ll explore how policies are formed and optimized in RL.


If you enjoyed this blog, please click the ❤️ button, share it with your peers, and subscribe for more content. Your support helps spread the knowledge and grow our community.
