Course 10: Stochastic dynamic programming

Author: OptimizationCity Group

# Introduction

The only difference between stochastic dynamic programming and deterministic dynamic programming is that in stochastic dynamic programming, with the determination of the state and decision policy of each step, the state of the next step is not determined, but only its distribution function can be determined. In fact, what is known here is the distribution function of the states of the next stage. The general structure of probabilistic dynamic programming is shown in the figure below.

In order to learn how to use these formulas in stochastic dynamic programming, we will solve the examples.

# Examples

Example:

A statistician claims to have found a way to win a series of races. His friends do not believe this claim and made a big bet with him that he cannot start the race with three coins and end up with five coins. In each game round, the participant can participate with any number of coins. If he wins, he wins as many and if he loses, he loses as many. The probability of this expert winning is estimated to be 2/3 per round. Assuming that such an estimate is correct, how many coins would this statistician have to participate in each round of a three-round game?

Solution:

The number of stages (periods) is equal to 3.

Decision variable: xn is the number of coins which he participates in each period of the game.

State: The number of coins this player has at each stage.

The objective function of the problem is to maximize the probability of winning the player, which is as follows:

According to the above, the results of the calculations are as follows:

In the above table, the calculation results for n=3 are given. If s = 0, it means that the player has no coins to play at this stage and therefore the player is a loser. The same conclusion is true for s equal to 1 and 2. If s=3, that means the player has three coins to play. If he plays with two or more coins, he wins with probability 0.66, and since he has more than five coins, he wins the bet. If s=4, the player only needs to play with more than one coin, and in case of winning (with probability 0.66), he will have at least five coins. If he has five or more coins, he doesn’t need to play at this stage and definitely wins the game (with probability 1).

if x2=0 and s=0, it means that there is no coin left for the player to play, so the probability of him winning is zero. If s=1 and x2=0, the player does not play in stage 2 and enters stage 3 with one coin, and the probability of him winning with five coins is zero. The same discussion holds for s=2. If s=3 and x2=0, by not playing in stage 2, the player can win by playing more than two coins with probability 0.66.

If x2=1 and s=0, it means that the player does not have any coin to play, so he cannot play at all, and therefore has been used for this case. If s=1 and x2=1, the player has one coin and bets with one coin at this stage. In this case, he loses with a probability of 0.33 and has no coins in stage 3, and wins with a probability of 0.66 and has two coins, which in any case has no chance of winning the bet. The mathematical expression of this case is as follows:

If s=2 and x2=1, the probability of winning the game is calculated as follows:

If s=3 and x2=1, the probability of winning the game is calculated as follows:

If s=4 and x2=1, the probability of winning the game is calculated as follows:

Other values can be calculated in a similar way.

The details of the calculations in the above table are given below.

Suppose s=3 and x1=0, the probability of winning the game is calculated as follows:

Suppose s=3 and x1=1, the probability of winning the game is calculated as follows:

Other values are calculated in similar ways.

The results of the last three tables can be summarized as follows:

According to the above policy, this player wins the bet with a probability of 20/27.

Example:

A player is going to play three times, each time he can win any amount between zero and his balance. In the same way, in case of loss, he loses the money he bet. At the beginning of the game, he has 75 dollars and his goal is to have 100 dollars at the end. What should be the optimal policy to maximize the probability of having exactly \$100 at the end of the game?

Solution:

sn is the balance of the player in the nth step.

If s3<50, even if he wins the bet, his balance does not reach 100 dollars, so he is a loser and his probability of winning is zero.

If the amount of the bet is not such that he has 100 dollars in case of winning, the probability of him winning is zero. But if he bets the remaining amount of \$100 (i.e. s3-100), he will win the game with a probability of 0.5.

If his balance is equal to 100, if he bets, he is a loser in any case because if he bets x3*>0, his balance will no longer be \$100 (100-x3* or 100+x3*).

If the player bets s3-100, he will lose with probability 0.5, and in this case he has exactly \$100. The summary of the results for n=3 is given in the following table:

For n=2, the calculations are as follows:

If he bets the entire balance in the third stage, he will have less than 50 dollars, which is a loser (0≤s2<25).

The calculations of stage 2 are as follows:

The calculations of stage 1 are as follows:

Based on the above calculations, the optimal solution is as follows:

Example:

A factory is trying to prepare its production plan for the next two months. The sales demand function is as shown in the table below.

The maximum production capacity is 4 units per month. The production cost of each product is 20 \$ and its selling price is 36 \$. The cost of maintaining each product unit is 2 \$ per month. If a product is still in stock at the end of the second month, it will be considered unusable and be sold at half price. If the inventory at the beginning of the first period is zero, propose the production schedule to maximize profit using dynamic programming.

Solution:

Stage (n): We have stages as many months as planning (n=1,2).

State variable (Sn): The number of goods remaining from the previous month at the beginning of each month.

Decision variable (Xn): number of goods to be produced in month n.

Stage 2:

Considering that the maximum production capacity per month is 4 and on the other hand, there is a possibility that none of the goods will be sold, therefore, the maximum value of s2 will be 4, i.e. s2=0,1,2,3,4. The maximum production in the second month can be 4, i.e. x2=0,1,2,3,4.

According to the available information, the amount of profit should be determined as follows:

In the table of the second stage, as an example, suppose we want to calculate f2(0,1).

Consider a case where there are no items left in the warehouse from the previous stage. In this period, we plan to produce one. According to the sales demand table, the probability of selling one product is equal to 1-0.9=0.1, and therefore the possible sales value will be equal to 0.9*36\$=32.4\$. Also, considering that if this one goods produced in this stage is not sold (the probability of which is 0.1), then it will be sold at the price of 18\$=36\$/2. Therefore, the sales of unused product is equal to 0.1*18\$=1.8\$. On the other hand, if this product is not sold, it must be stored. The probability of storing or the probability of not selling the product is equal to 0.1. The cost of storing per each product is also equal to 2\$. As a result, the cost of storage will be 0.2\$=0.1*2\$ for this one product. According to the formula provided to calculate the net profit, we have (x2=1,s=0):

Now consider a case where there is no product left from the previous stage (s2=0) and we intend to produce two units in this stage (x2=2). According to the table of the monthly sales demand, the probability that both of these products will be sold is equal to 0.4 + 0.1 + 0.1 = 0.6, and the probability that only one unit will be sold is equal to 0.3. Note that the probability of both products being sold, that is, that the amount of demand is at least 2 units (2 or 3 or 4 units) is equal to 0.6. Therefore, the revenue from the sale of one unit is 0.3*1*36\$=10.8\$ and revenue from the sale of two units is 0.6*2*36\$=43.2\$. Then the revenue will be equal to:

The value of unsold product is also obtained according to the probability of not selling one or two units as follows:

The probability that no product is sold or that exactly one product remains in stock is 0.1, and the probability that exactly one product is sold is 0.3. According to the mentioned possibilities, the cost of storage is obtained as follows:

As a result, the total net profit is(x2=2,s=0):

In the following, the calculations of the net profit in each of the states in stage 2 are presented in the table:

Regarding the calculation of f2(1,3), considering that three units are produced, but since one unit is in the warehouse from the previous period, therefore, it should be possible to sell or not sell 3+1=4 units..

Stage 1:

In this stage, there is no product in the warehouse and s1=0. But considering that the maximum production capacity is four units and the probability of sale demand is up to four units, thus the decision variable can be set from zero to four, that is, x1=0,1,2,3,4. At this stage, the sale of unsold product is not calculated. This is because if the any product remains in the inventory, it is transferred to the next period, and in practice, no remaining products are unsold in this stage. According to the sold or nonsold of the products in this stage, the level of inventory in the stage 2 is calculated.

The details of the calculations in the table of the first stage are as follows:

At the beginning of this period, there is no inventory and we do not intend to produce in this stage. We go to the second stage with empty inventory. According to the table of the second stage, the value of the optimal policy in the state s2=0 is equal to 22\$, which is The value of the first stage, i.e. zero, is added.

In the calculation of f1(0,1), we will enter the second stage with a different inventory level, depending on whether one unit may or may not be sold. If one unit produced in the first stage is sold (with a probability of 1-0.1=0.9), we enter the second stage with empty inventory, and the optimal policy value in s2=0 is equal to 22\$, and it will produce net profit equivalent to 22\$*0.9=19.8\$. if one unit is not sold in the first stage (with a probability of 0.1), the level of inventory is one unit and the value of the optimal policy in s2=1 is equal to 42\$ and finally the profit is equal to 0.1*42\$=4.2\$.

Here we calculate f1(0,2). If two units produced in the first stage are sold (with a probability of 1-0.1-0.3=0.6), we enter the second stage with zero level of inventory. It has an optimal value of 22\$ and the profit is equal to 22\$*0.6=13.2\$. Now, if only one unit is sold in the first stage (with a probability of 0.3), then we go to the second stage with one unit in the inventory. The optimal value is equal to 42\$ and the net profit is equal to 0.3*42\$=12.6\$. if none of the two units produced in the first stage are sold (with a probability of 0.1), in which case we go to the second stage with the same two units. it has the optimal value of 62\$ and the net profit is equal to 0.1*62\$=6.2\$.

The calculation for f1(0,3) and f1(0,4) is done in a similar way.

Now, to determine the optimal solution, considering that x1*=3. With possible demand; 0, 1, 2 or all 3 units of the product in the first stage may not be sold, so we start the second stage according to the table of the first stage with different case of sale. The following table  shows the details of the optimal solution.