Fast subsetting in Pandas in Python











up vote
1
down vote

favorite












I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.



time    electrode
0 1
1 43
2 45
3 12
4 7


In each loop I need to subset the data, as such:



num_electrodes = 
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
start = data['time'][index]
end = data['time'][index+window_size]
window_data = data[(data['time'] >= start) & (data['time'] < end)]
num_electrodes.append(len(window_data['electrode'].unique()))


The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.



window_data = data[(data['time'] >= start) & (data['time'] < end)]


Is there any good alternative to this?










share|improve this question
























  • Is your time series string / datetime / timedelta / something else?
    – jpp
    Nov 12 at 18:28












  • When do you augment index and from how many?
    – B. M.
    Nov 12 at 19:50












  • @jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
    – Rotavator
    Nov 12 at 21:01










  • @B.M. i actually put a simplified example here, i just want to know how to faster subset my data
    – Rotavator
    Nov 12 at 21:01















up vote
1
down vote

favorite












I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.



time    electrode
0 1
1 43
2 45
3 12
4 7


In each loop I need to subset the data, as such:



num_electrodes = 
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
start = data['time'][index]
end = data['time'][index+window_size]
window_data = data[(data['time'] >= start) & (data['time'] < end)]
num_electrodes.append(len(window_data['electrode'].unique()))


The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.



window_data = data[(data['time'] >= start) & (data['time'] < end)]


Is there any good alternative to this?










share|improve this question
























  • Is your time series string / datetime / timedelta / something else?
    – jpp
    Nov 12 at 18:28












  • When do you augment index and from how many?
    – B. M.
    Nov 12 at 19:50












  • @jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
    – Rotavator
    Nov 12 at 21:01










  • @B.M. i actually put a simplified example here, i just want to know how to faster subset my data
    – Rotavator
    Nov 12 at 21:01













up vote
1
down vote

favorite









up vote
1
down vote

favorite











I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.



time    electrode
0 1
1 43
2 45
3 12
4 7


In each loop I need to subset the data, as such:



num_electrodes = 
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
start = data['time'][index]
end = data['time'][index+window_size]
window_data = data[(data['time'] >= start) & (data['time'] < end)]
num_electrodes.append(len(window_data['electrode'].unique()))


The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.



window_data = data[(data['time'] >= start) & (data['time'] < end)]


Is there any good alternative to this?










share|improve this question















I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.



time    electrode
0 1
1 43
2 45
3 12
4 7


In each loop I need to subset the data, as such:



num_electrodes = 
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
start = data['time'][index]
end = data['time'][index+window_size]
window_data = data[(data['time'] >= start) & (data['time'] < end)]
num_electrodes.append(len(window_data['electrode'].unique()))


The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.



window_data = data[(data['time'] >= start) & (data['time'] < end)]


Is there any good alternative to this?







python pandas performance indexing subset






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 12 at 18:35









jpp

82.4k194796




82.4k194796










asked Nov 12 at 18:25









Rotavator

567




567












  • Is your time series string / datetime / timedelta / something else?
    – jpp
    Nov 12 at 18:28












  • When do you augment index and from how many?
    – B. M.
    Nov 12 at 19:50












  • @jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
    – Rotavator
    Nov 12 at 21:01










  • @B.M. i actually put a simplified example here, i just want to know how to faster subset my data
    – Rotavator
    Nov 12 at 21:01


















  • Is your time series string / datetime / timedelta / something else?
    – jpp
    Nov 12 at 18:28












  • When do you augment index and from how many?
    – B. M.
    Nov 12 at 19:50












  • @jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
    – Rotavator
    Nov 12 at 21:01










  • @B.M. i actually put a simplified example here, i just want to know how to faster subset my data
    – Rotavator
    Nov 12 at 21:01
















Is your time series string / datetime / timedelta / something else?
– jpp
Nov 12 at 18:28






Is your time series string / datetime / timedelta / something else?
– jpp
Nov 12 at 18:28














When do you augment index and from how many?
– B. M.
Nov 12 at 19:50






When do you augment index and from how many?
– B. M.
Nov 12 at 19:50














@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01




@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01












@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01




@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01












3 Answers
3






active

oldest

votes

















up vote
1
down vote













Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.



Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.



The min accounts for non-unique indices.



Then use .iloc to select that range.



That should speed it up by quite a bit.






share|improve this answer























  • i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
    – Rotavator
    Nov 12 at 18:45






  • 1




    You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
    – John H
    Nov 12 at 18:57










  • i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
    – Rotavator
    Nov 12 at 21:02










  • if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
    – John H
    Nov 12 at 21:06




















up vote
0
down vote













Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :



size=10**6
window_size=5
electrodes = np.random.randint(0,64,size)
electrodes_by_5 = electrodes.reshape(-1,window_size)

nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)


Output :



In [463]: electrodes[:10]
Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])

In [464]: electrodes_by_5[:2]
Out[464]:
array([[13, 13, 23, 20, 5],
[30, 9, 6, 28, 11]])

In [465]: nb_electrodes[:2]
Out[465]: array([4, 5])





share|improve this answer






























    up vote
    0
    down vote













    So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.






    share|improve this answer





















      Your Answer






      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "1"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














       

      draft saved


      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53268002%2ffast-subsetting-in-pandas-in-python%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      1
      down vote













      Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.



      Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.



      The min accounts for non-unique indices.



      Then use .iloc to select that range.



      That should speed it up by quite a bit.






      share|improve this answer























      • i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
        – Rotavator
        Nov 12 at 18:45






      • 1




        You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
        – John H
        Nov 12 at 18:57










      • i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
        – Rotavator
        Nov 12 at 21:02










      • if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
        – John H
        Nov 12 at 21:06

















      up vote
      1
      down vote













      Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.



      Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.



      The min accounts for non-unique indices.



      Then use .iloc to select that range.



      That should speed it up by quite a bit.






      share|improve this answer























      • i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
        – Rotavator
        Nov 12 at 18:45






      • 1




        You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
        – John H
        Nov 12 at 18:57










      • i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
        – Rotavator
        Nov 12 at 21:02










      • if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
        – John H
        Nov 12 at 21:06















      up vote
      1
      down vote










      up vote
      1
      down vote









      Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.



      Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.



      The min accounts for non-unique indices.



      Then use .iloc to select that range.



      That should speed it up by quite a bit.






      share|improve this answer














      Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.



      Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.



      The min accounts for non-unique indices.



      Then use .iloc to select that range.



      That should speed it up by quite a bit.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Nov 12 at 21:04

























      answered Nov 12 at 18:32









      John H

      1,174315




      1,174315












      • i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
        – Rotavator
        Nov 12 at 18:45






      • 1




        You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
        – John H
        Nov 12 at 18:57










      • i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
        – Rotavator
        Nov 12 at 21:02










      • if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
        – John H
        Nov 12 at 21:06




















      • i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
        – Rotavator
        Nov 12 at 18:45






      • 1




        You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
        – John H
        Nov 12 at 18:57










      • i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
        – Rotavator
        Nov 12 at 21:02










      • if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
        – John H
        Nov 12 at 21:06


















      i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
      – Rotavator
      Nov 12 at 18:45




      i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
      – Rotavator
      Nov 12 at 18:45




      1




      1




      You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
      – John H
      Nov 12 at 18:57




      You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
      – John H
      Nov 12 at 18:57












      i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
      – Rotavator
      Nov 12 at 21:02




      i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
      – Rotavator
      Nov 12 at 21:02












      if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
      – John H
      Nov 12 at 21:06






      if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
      – John H
      Nov 12 at 21:06














      up vote
      0
      down vote













      Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :



      size=10**6
      window_size=5
      electrodes = np.random.randint(0,64,size)
      electrodes_by_5 = electrodes.reshape(-1,window_size)

      nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)


      Output :



      In [463]: electrodes[:10]
      Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])

      In [464]: electrodes_by_5[:2]
      Out[464]:
      array([[13, 13, 23, 20, 5],
      [30, 9, 6, 28, 11]])

      In [465]: nb_electrodes[:2]
      Out[465]: array([4, 5])





      share|improve this answer



























        up vote
        0
        down vote













        Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :



        size=10**6
        window_size=5
        electrodes = np.random.randint(0,64,size)
        electrodes_by_5 = electrodes.reshape(-1,window_size)

        nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)


        Output :



        In [463]: electrodes[:10]
        Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])

        In [464]: electrodes_by_5[:2]
        Out[464]:
        array([[13, 13, 23, 20, 5],
        [30, 9, 6, 28, 11]])

        In [465]: nb_electrodes[:2]
        Out[465]: array([4, 5])





        share|improve this answer

























          up vote
          0
          down vote










          up vote
          0
          down vote









          Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :



          size=10**6
          window_size=5
          electrodes = np.random.randint(0,64,size)
          electrodes_by_5 = electrodes.reshape(-1,window_size)

          nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)


          Output :



          In [463]: electrodes[:10]
          Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])

          In [464]: electrodes_by_5[:2]
          Out[464]:
          array([[13, 13, 23, 20, 5],
          [30, 9, 6, 28, 11]])

          In [465]: nb_electrodes[:2]
          Out[465]: array([4, 5])





          share|improve this answer














          Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :



          size=10**6
          window_size=5
          electrodes = np.random.randint(0,64,size)
          electrodes_by_5 = electrodes.reshape(-1,window_size)

          nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)


          Output :



          In [463]: electrodes[:10]
          Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])

          In [464]: electrodes_by_5[:2]
          Out[464]:
          array([[13, 13, 23, 20, 5],
          [30, 9, 6, 28, 11]])

          In [465]: nb_electrodes[:2]
          Out[465]: array([4, 5])






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 12 at 20:02

























          answered Nov 12 at 19:45









          B. M.

          11.8k11934




          11.8k11934






















              up vote
              0
              down vote













              So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.






              share|improve this answer

























                up vote
                0
                down vote













                So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.






                share|improve this answer























                  up vote
                  0
                  down vote










                  up vote
                  0
                  down vote









                  So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.






                  share|improve this answer












                  So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 12 at 22:54









                  Rotavator

                  567




                  567






























                       

                      draft saved


                      draft discarded



















































                       


                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53268002%2ffast-subsetting-in-pandas-in-python%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      How to change which sound is reproduced for terminal bell?

                      Can I use Tabulator js library in my java Spring + Thymeleaf project?

                      Title Spacing in Bjornstrup Chapter, Removing Chapter Number From Contents