Dlow shuffle part 1 and 2 together


How many loops around the cycle should comprise an episode? In my Blackjack environment, I considered one round of Blackjack to be one episode. This means that there will usually be 1–3 State/Action/Reward tuples per episode, because the agent will likely make only 1–3 stand/hit decisions per round of Blackjack (or more on rare occasions).

Luckily, we had the notion of a “round” in Blackjack to help define an episode. However, other contexts, such as applying Reinforcement Learning to predict the stock market, do not have “rounds” to help define episodes. There are no starting and stopping points in the stock market, so you will have to get creative when defining episodes! For these reasons, predicting the stock market using Reinforcement Learning would be considered a continuous task, and cracking Blackjack would be considered an episodic task.

In the next article, we will dive into exactly how our Reinforcement Learning algorithm will direct our agent in using these State/Action/Reward tuples to optimize its policy.

Why Re-Build Our Blackjack Environment Using OpenAI Gym?

The Blackjack game we set up in Part 1 does not accurately model the Reinforcement Learning cycle.


The cycle above implies that this loop will go on indefinitely, so where and when does the actual learning happen?

A single cycle can be represented as a sequence: State → Action → Reward.

As we do loops around this cycle, we can record these State/Action/Reward tuples. Going through some number “n” of loops around the cycle and recording State/Action/Reward tuples as we go along is called an episode.

After we do our desired number of loops (let’s say 50), our agent will go through the State/Action/Reward tuples and update its policy accordingly.
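To make the idea of an episode concrete, here is a minimal sketch of looping around the cycle and recording State/Action/Reward tuples. Everything in it is a hypothetical illustration rather than the environment from this series: `run_episode`, `toy_reset`, `toy_step`, and `toy_policy` are made-up names, the "card" drawn on a hit is fixed at 5 so the run is deterministic, and the dealer's play is left out entirely.

```python
def run_episode(reset, step, policy, max_steps=50):
    """Loop around the cycle and record (state, action, reward) tuples."""
    episode = []
    state = reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        if done:
            break
        state = next_state
    return episode


# --- Hypothetical toy environment, for illustration only ---

def toy_reset():
    """Start every episode with a player hand worth 12."""
    return 12


def toy_step(state, action):
    """Return (next_state, reward, done) for one decision.

    Assumptions: a hit always adds 5, busting costs 100, and standing
    ends the episode with reward 0 because dealer play is omitted.
    """
    if action == "stand":
        return state, 0, True
    new_total = state + 5
    if new_total > 21:
        return new_total, -100, True
    return new_total, 0, False


def toy_policy(state):
    """A fixed rule: hit below 17, otherwise stand."""
    return "hit" if state < 17 else "stand"


episode = run_episode(toy_reset, toy_step, toy_policy)
print(episode)
```

The returned list is what the agent would review after the episode ends in order to update its policy; how that update works is not covered here.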

Reward: The feedback that the agent receives for its action in some given state. This is defined by the programmer to help show the agent which outcomes are preferable, and the agent can start to adjust its actions accordingly. In our version of Blackjack, win = +$100, lose = -$100, and tie = $0. The logic for determining a win/loss/tie is in the environment; the rewards are the numeric values assigned to each of those results.

How These Components Work Together In Blackjack

A round of Blackjack begins: two cards are dealt to both the player and the dealer, and the agent sees only its own cards and one of the dealer’s cards.

The environment models this by sending the agent an initial state (player hand value + dealer up-card value).
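The reward values and the initial state described here can be sketched in a few lines. This is only an illustration under stated assumptions, not the environment built in this series: the names `REWARDS`, `draw_card`, `initial_state`, and `reward_for` are hypothetical, cards are drawn with replacement, aces always count as 11, and only the dealer's visible card is dealt.

```python
import random

# Reward values from the article: win = +100, lose = -100, tie = 0.
REWARDS = {"win": 100, "lose": -100, "tie": 0}

# Simplified card values (assumption): 2-10, three face cards worth 10,
# and an ace that always counts as 11.
CARD_VALUES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]


def draw_card(rng):
    """Draw one card value, with replacement (an infinite-deck assumption)."""
    return rng.choice(CARD_VALUES)


def initial_state(rng):
    """Deal the player two cards and return what the agent can see:
    (player hand value, dealer up-card value).

    The dealer's hidden card is not dealt in this sketch.
    """
    player_hand_value = draw_card(rng) + draw_card(rng)
    dealer_up_card_value = draw_card(rng)
    return (player_hand_value, dealer_up_card_value)


def reward_for(result):
    """Translate a result decided by the environment into a reward."""
    return REWARDS[result]


rng = random.Random(0)  # seeded so repeated runs deal the same cards
print(initial_state(rng))
print(reward_for("win"), reward_for("lose"), reward_for("tie"))
```

Keeping the win/loss/tie decision separate from the reward table mirrors the split described above: the environment decides the result, and the programmer decides what each result is worth.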

Actions: The options available to the agent in interacting with some given state. In our version of Blackjack, the available actions are hit and stand.

State: The “situations” that help make up the environment. In our version of Blackjack, a state will consist of the player’s hand value and the dealer’s up-card value. The player/agent can only see these two things when making a decision.
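One possible way to write the state and action definitions down as data types is shown here. The class and function names (`Action`, `State`, `available_actions`) are assumptions made for illustration and are not taken from the environment in this series.

```python
from enum import Enum
from typing import List, NamedTuple


class Action(Enum):
    """The two options available to the agent."""
    HIT = "hit"
    STAND = "stand"


class State(NamedTuple):
    """The only two things the agent can see when making a decision."""
    player_hand_value: int
    dealer_up_card_value: int


def available_actions(state: State) -> List[Action]:
    """In this version of Blackjack, every state offers the same two actions."""
    return [Action.HIT, Action.STAND]


state = State(player_hand_value=15, dealer_up_card_value=10)
print(state, [action.value for action in available_actions(state)])
```

Because `State` is a tuple, it can be used directly as a dictionary key, which is convenient if a policy is later stored as a lookup table from states to actions.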

Environment: The set of all possible situations the agent can interact with, the actions available in each situation, and the outcomes (rewards + punishments) associated with each of these situations. In Blackjack, this is the set of all possible player hands, dealer up-cards, player actions (hit or stand), and results (win/lose/tie).

Agent: The AI abstraction that is carrying out the learning process.

To move forward with our mission to “crack” Blackjack using Reinforcement Learning, we must understand the image above. Below, I have broken down the image into its components and their equivalents in Blackjack: