OverflowError as I try to use the value-iteration algorithm with mdptoolbox

I set up a simple MDP for a board that has 4 possible states and 4 possible actions. The board and reward setup looks as follows:



[Image: board layout with states S1-S4 and the associated rewards]



Here S4 is the goal state and S2 is the absorbing state. I have defined the transition probability matrices and the reward matrix in the code below, which I wrote to compute the optimal value function for this MDP. But when I run the code, I get the error OverflowError: cannot convert float infinity to integer, and I cannot understand the reason for it.



import mdptoolbox
import numpy as np

# Transition probabilities, shape (A, S, S): one 4x4 matrix per action.
transitions = np.array([
    # action 1 (Right)
    [[0.1, 0.7, 0.1, 0.1],
     [0.3, 0.3, 0.3, 0.1],
     [0.1, 0.2, 0.2, 0.5],
     [0.1, 0.1, 0.1, 0.7]],
    # action 2 (Down)
    [[0.1, 0.4, 0.4, 0.1],
     [0.3, 0.3, 0.3, 0.1],
     [0.4, 0.1, 0.4, 0.1],
     [0.1, 0.1, 0.1, 0.7]],
    # action 3 (Left)
    [[0.4, 0.3, 0.2, 0.1],
     [0.2, 0.2, 0.4, 0.2],
     [0.5, 0.1, 0.3, 0.1],
     [0.1, 0.1, 0.1, 0.7]],
    # action 4 (Top)
    [[0.1, 0.4, 0.4, 0.1],
     [0.3, 0.3, 0.3, 0.1],
     [0.4, 0.1, 0.4, 0.1],
     [0.1, 0.1, 0.1, 0.7]],
])

rewards = np.array([
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [1, 1, 1, 1],
])

vi = mdptoolbox.mdp.ValueIteration(transitions, rewards, discount=0.5)
vi.setVerbose()
vi.run()

print("Value function:")
print(vi.V)

print("Policy function:")
print(vi.policy)


If I change the discount from 0.5 to 1, it works fine. What could be the reason for value iteration not working with a discount of 0.5, or any other fractional value?



Update: It looks like there is some issue with my reward matrix; I have not been able to write it the way I intended, because if I change some values in the reward matrix, the error disappears.
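
For reference, a basic validity check on the inputs (a sketch; I am assuming mdptoolbox.util.check only verifies shapes and row-stochasticity) should not complain here, since the transition rows sum to 1 and a 4x4 reward array is an acceptable (S, A) shape either way:

import mdptoolbox.util

# Using the transitions and rewards arrays defined above.
# check() raises an exception if the transition matrices are not square and
# stochastic, or if the reward array has an incompatible shape. A 4x4 reward
# array is a valid (S, A) shape regardless of whether its rows mean states or
# actions, so this alone cannot catch the mix-up.
mdptoolbox.util.check(transitions, rewards)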

python dynamic-programming markov-chains stochastic mdptoolbox

edited Nov 22 '18 at 6:37 by Suhail Gupta
asked Nov 21 '18 at 11:56 by Suhail Gupta

1 Answer

It turned out that the reward matrix I had defined was incorrect. To match the reward setup shown in the picture above, it should be of shape (S, A) as described in the documentation, where each row corresponds to a state (S1 through S4) and each column corresponds to an action (A1 through A4). The new reward matrix looks as follows:



# Reward matrix of shape (S, A): rows are states S1..S4, columns are actions A1..A4.
rewards = np.array([
    [-1, -1, -1, -1],
    [-100, -100, -100, -100],
    [-1, -1, -1, -1],
    [1, 1, 1, 1],
])


It works fine with this, but I am still not sure what was happening internally that led to the overflow error.
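
My best guess at the mechanism (a sketch only; I am assuming ValueIteration computes a Puterman-style bound on the number of iterations whenever discount < 1, and the library's exact formula may differ): with the original, mis-oriented reward matrix every row contained the same set of values, so the first Bellman backup assigned the same value to every state. The span of that first update is therefore 0, the bound divides by that span and becomes infinite, and the final cast to an integer raises the OverflowError.

import numpy as np

discount, epsilon = 0.5, 0.01

# Original, mis-oriented reward matrix from the question.
R_bad = np.array([
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [1, 1, 1, 1],
])

# First Bellman backup from V0 = 0: Q(s, a) = R(s, a), so V1(s) = max_a R(s, a).
V1 = R_bad.max(axis=1)                   # [1, 1, 1, 1] -- identical for every state
span = np.float64(V1.max() - V1.min())   # 0.0

# A Puterman-style iteration bound divides by this span; with span == 0 the
# intermediate value is infinite (its sign is irrelevant here).
with np.errstate(divide="ignore"):
    bound = np.log(epsilon * (1 - discount) / discount / span) / np.log(discount)

try:
    max_iter = int(np.ceil(bound))       # this cast is where the error comes from
except OverflowError as exc:
    print(exc)                           # cannot convert float infinity to integer

With the corrected (S, A) reward matrix the per-state maxima are [-1, -100, -1, 1], so the span is non-zero and the bound stays finite. That would also explain why discount=1 avoids the error: the iteration bound presumably only applies when discount < 1, so with discount=1 the user-supplied max_iter is used directly.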

answered Nov 22 '18 at 9:57 by Suhail Gupta