Fast subsetting in Pandas in Python

up vote
1
down vote

favorite

I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.

time    electrode

 0          1

 1          43

 2          45

 3          12

 4          7

In each loop I need to subset the data, as such:

num_electrodes = 

window_size = 5

index = 0

while index < len(data['time']) - interval_size:

    start = data['time'][index]

    end = data['time'][index+window_size]

    window_data = data[(data['time'] >= start) & (data['time'] < end)]

    num_electrodes.append(len(window_data['electrode'].unique()))

The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.

window_data = data[(data['time'] >= start) & (data['time'] < end)]

Is there any good alternative to this?

edited Nov 12 at 18:35

jpp

82.4k194796

asked Nov 12 at 18:25

Rotavator

567

Is your time series string / datetime / timedelta / something else?
– jpp
Nov 12 at 18:28

When do you augment index and from how many?
– B. M.
Nov 12 at 19:50

@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01

@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01

add a comment |

up vote
1
down vote

favorite

time    electrode

 0          1

 1          43

 2          45

 3          12

 4          7

In each loop I need to subset the data, as such:

num_electrodes = 

window_size = 5

index = 0

while index < len(data['time']) - interval_size:

    start = data['time'][index]

    end = data['time'][index+window_size]

    window_data = data[(data['time'] >= start) & (data['time'] < end)]

    num_electrodes.append(len(window_data['electrode'].unique()))

The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.

window_data = data[(data['time'] >= start) & (data['time'] < end)]

Is there any good alternative to this?

edited Nov 12 at 18:35

jpp

82.4k194796

asked Nov 12 at 18:25

Rotavator

567

Is your time series string / datetime / timedelta / something else?
– jpp
Nov 12 at 18:28

When do you augment index and from how many?
– B. M.
Nov 12 at 19:50

@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01

@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01

add a comment |

up vote
1
down vote

favorite

time    electrode

 0          1

 1          43

 2          45

 3          12

 4          7

In each loop I need to subset the data, as such:

num_electrodes = 

window_size = 5

index = 0

while index < len(data['time']) - interval_size:

    start = data['time'][index]

    end = data['time'][index+window_size]

    window_data = data[(data['time'] >= start) & (data['time'] < end)]

    num_electrodes.append(len(window_data['electrode'].unique()))

The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.

window_data = data[(data['time'] >= start) & (data['time'] < end)]

Is there any good alternative to this?

edited Nov 12 at 18:35

jpp

82.4k194796

asked Nov 12 at 18:25

Rotavator

567

time    electrode

 0          1

 1          43

 2          45

 3          12

 4          7

In each loop I need to subset the data, as such:

num_electrodes = 

window_size = 5

index = 0

while index < len(data['time']) - interval_size:

    start = data['time'][index]

    end = data['time'][index+window_size]

    window_data = data[(data['time'] >= start) & (data['time'] < end)]

    num_electrodes.append(len(window_data['electrode'].unique()))

The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.

window_data = data[(data['time'] >= start) & (data['time'] < end)]

Is there any good alternative to this?

python pandas performance indexing subset

edited Nov 12 at 18:35

jpp

82.4k194796

asked Nov 12 at 18:25

Rotavator

567

edited Nov 12 at 18:35

jpp

82.4k194796

asked Nov 12 at 18:25

Rotavator

567

edited Nov 12 at 18:35

jpp

82.4k194796

edited Nov 12 at 18:35

jpp

82.4k194796

edited Nov 12 at 18:35

jpp

82.4k194796

asked Nov 12 at 18:25

Rotavator

567

asked Nov 12 at 18:25

Rotavator

567

asked Nov 12 at 18:25

Rotavator

567

Is your time series string / datetime / timedelta / something else?
– jpp
Nov 12 at 18:28

When do you augment index and from how many?
– B. M.
Nov 12 at 19:50

@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01

@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01

add a comment |

Is your time series string / datetime / timedelta / something else?
– jpp
Nov 12 at 18:28

When do you augment index and from how many?
– B. M.
Nov 12 at 19:50

@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01

@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01

Is your time series string / datetime / timedelta / something else?
– jpp
Nov 12 at 18:28

When do you augment index and from how many?
– B. M.
Nov 12 at 19:50

@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01

@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01

add a comment |

3 Answers
3

active

oldest

votes

up vote
1
down vote

Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.

Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.

The min accounts for non-unique indices.

Then use .iloc to select that range.

That should speed it up by quite a bit.

edited Nov 12 at 21:04

answered Nov 12 at 18:32

John H

1,174315

i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45

1

You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57

i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02

if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06

add a comment |

up vote
0
down vote

Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :

size=10**6

window_size=5

electrodes = np.random.randint(0,64,size)

electrodes_by_5 = electrodes.reshape(-1,window_size)



nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)

Output :

In [463]: electrodes[:10]

Out[463]: array([13, 13, 23, 20,  5, 30,  9,  6, 28, 11])



In [464]: electrodes_by_5[:2]

Out[464]: 

array([[13, 13, 23, 20,  5],

       [30,  9,  6, 28, 11]])



In [465]: nb_electrodes[:2]

Out[465]: array([4, 5])

edited Nov 12 at 20:02

answered Nov 12 at 19:45

B. M.

11.8k11934

add a comment |

up vote
0
down vote

So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.

answered Nov 12 at 22:54

Rotavator

567

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53268002%2ffast-subsetting-in-pandas-in-python%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
1
down vote

Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.

Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.

The min accounts for non-unique indices.

Then use .iloc to select that range.

That should speed it up by quite a bit.

edited Nov 12 at 21:04

answered Nov 12 at 18:32

John H

1,174315

i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45

1

You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57

i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02

if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06

add a comment |

up vote
1
down vote

Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.

Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.

The min accounts for non-unique indices.

Then use .iloc to select that range.

That should speed it up by quite a bit.

edited Nov 12 at 21:04

answered Nov 12 at 18:32

John H

1,174315

i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45

1

You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57

i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02

if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06

add a comment |

up vote
1
down vote

Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.

Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.

The min accounts for non-unique indices.

Then use .iloc to select that range.

That should speed it up by quite a bit.

edited Nov 12 at 21:04

answered Nov 12 at 18:32

John H

1,174315

Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.

Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.

The min accounts for non-unique indices.

Then use .iloc to select that range.

That should speed it up by quite a bit.

edited Nov 12 at 21:04

answered Nov 12 at 18:32

John H

1,174315

edited Nov 12 at 21:04

answered Nov 12 at 18:32

John H

1,174315

answered Nov 12 at 18:32

John H

1,174315

answered Nov 12 at 18:32

John H

1,174315

i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45

1

You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57

i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02

if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06

add a comment |

i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45

1

You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57

i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02

if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06

i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45

You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57

i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02

if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06

add a comment |

up vote
0
down vote

Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :

size=10**6

window_size=5

electrodes = np.random.randint(0,64,size)

electrodes_by_5 = electrodes.reshape(-1,window_size)



nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)

Output :

In [463]: electrodes[:10]

Out[463]: array([13, 13, 23, 20,  5, 30,  9,  6, 28, 11])



In [464]: electrodes_by_5[:2]

Out[464]: 

array([[13, 13, 23, 20,  5],

       [30,  9,  6, 28, 11]])



In [465]: nb_electrodes[:2]

Out[465]: array([4, 5])

edited Nov 12 at 20:02

answered Nov 12 at 19:45

B. M.

11.8k11934

add a comment |

up vote
0
down vote

Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :

size=10**6

window_size=5

electrodes = np.random.randint(0,64,size)

electrodes_by_5 = electrodes.reshape(-1,window_size)



nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)

Output :

In [463]: electrodes[:10]

Out[463]: array([13, 13, 23, 20,  5, 30,  9,  6, 28, 11])



In [464]: electrodes_by_5[:2]

Out[464]: 

array([[13, 13, 23, 20,  5],

       [30,  9,  6, 28, 11]])



In [465]: nb_electrodes[:2]

Out[465]: array([4, 5])

edited Nov 12 at 20:02

answered Nov 12 at 19:45

B. M.

11.8k11934

add a comment |

up vote
0
down vote

Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :

size=10**6

window_size=5

electrodes = np.random.randint(0,64,size)

electrodes_by_5 = electrodes.reshape(-1,window_size)



nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)

Output :

In [463]: electrodes[:10]

Out[463]: array([13, 13, 23, 20,  5, 30,  9,  6, 28, 11])



In [464]: electrodes_by_5[:2]

Out[464]: 

array([[13, 13, 23, 20,  5],

       [30,  9,  6, 28, 11]])



In [465]: nb_electrodes[:2]

Out[465]: array([4, 5])

edited Nov 12 at 20:02

answered Nov 12 at 19:45

B. M.

11.8k11934

Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :

size=10**6

window_size=5

electrodes = np.random.randint(0,64,size)

electrodes_by_5 = electrodes.reshape(-1,window_size)



nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)

Output :

In [463]: electrodes[:10]

Out[463]: array([13, 13, 23, 20,  5, 30,  9,  6, 28, 11])



In [464]: electrodes_by_5[:2]

Out[464]: 

array([[13, 13, 23, 20,  5],

       [30,  9,  6, 28, 11]])



In [465]: nb_electrodes[:2]

Out[465]: array([4, 5])

edited Nov 12 at 20:02

answered Nov 12 at 19:45

B. M.

11.8k11934

edited Nov 12 at 20:02

answered Nov 12 at 19:45

B. M.

11.8k11934

answered Nov 12 at 19:45

B. M.

11.8k11934

answered Nov 12 at 19:45

B. M.

11.8k11934

add a comment |

up vote
0
down vote

So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.

answered Nov 12 at 22:54

Rotavator

567

add a comment |

up vote
0
down vote

So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.

answered Nov 12 at 22:54

Rotavator

567

add a comment |

up vote
0
down vote

So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.

answered Nov 12 at 22:54

Rotavator

567

So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.

answered Nov 12 at 22:54

Rotavator

567

answered Nov 12 at 22:54

Rotavator

567

answered Nov 12 at 22:54

Rotavator

567

answered Nov 12 at 22:54

Rotavator

567

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Cfrgtkky