Fast subsetting in Pandas in Python
up vote
1
down vote
favorite
I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.
time electrode
0 1
1 43
2 45
3 12
4 7
In each loop I need to subset the data, as such:
num_electrodes =
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
start = data['time'][index]
end = data['time'][index+window_size]
window_data = data[(data['time'] >= start) & (data['time'] < end)]
num_electrodes.append(len(window_data['electrode'].unique()))
The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.
window_data = data[(data['time'] >= start) & (data['time'] < end)]
Is there any good alternative to this?
python pandas performance indexing subset
add a comment |
up vote
1
down vote
favorite
I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.
time electrode
0 1
1 43
2 45
3 12
4 7
In each loop I need to subset the data, as such:
num_electrodes =
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
start = data['time'][index]
end = data['time'][index+window_size]
window_data = data[(data['time'] >= start) & (data['time'] < end)]
num_electrodes.append(len(window_data['electrode'].unique()))
The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.
window_data = data[(data['time'] >= start) & (data['time'] < end)]
Is there any good alternative to this?
python pandas performance indexing subset
Is yourtime
series string /datetime
/timedelta
/ something else?
– jpp
Nov 12 at 18:28
When do you augment index and from how many?
– B. M.
Nov 12 at 19:50
@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01
@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.
time electrode
0 1
1 43
2 45
3 12
4 7
In each loop I need to subset the data, as such:
num_electrodes =
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
start = data['time'][index]
end = data['time'][index+window_size]
window_data = data[(data['time'] >= start) & (data['time'] < end)]
num_electrodes.append(len(window_data['electrode'].unique()))
The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.
window_data = data[(data['time'] >= start) & (data['time'] < end)]
Is there any good alternative to this?
python pandas performance indexing subset
I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.
time electrode
0 1
1 43
2 45
3 12
4 7
In each loop I need to subset the data, as such:
num_electrodes =
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
start = data['time'][index]
end = data['time'][index+window_size]
window_data = data[(data['time'] >= start) & (data['time'] < end)]
num_electrodes.append(len(window_data['electrode'].unique()))
The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.
window_data = data[(data['time'] >= start) & (data['time'] < end)]
Is there any good alternative to this?
python pandas performance indexing subset
python pandas performance indexing subset
edited Nov 12 at 18:35
jpp
82.4k194796
82.4k194796
asked Nov 12 at 18:25
Rotavator
567
567
Is yourtime
series string /datetime
/timedelta
/ something else?
– jpp
Nov 12 at 18:28
When do you augment index and from how many?
– B. M.
Nov 12 at 19:50
@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01
@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01
add a comment |
Is yourtime
series string /datetime
/timedelta
/ something else?
– jpp
Nov 12 at 18:28
When do you augment index and from how many?
– B. M.
Nov 12 at 19:50
@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01
@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01
Is your
time
series string / datetime
/ timedelta
/ something else?– jpp
Nov 12 at 18:28
Is your
time
series string / datetime
/ timedelta
/ something else?– jpp
Nov 12 at 18:28
When do you augment index and from how many?
– B. M.
Nov 12 at 19:50
When do you augment index and from how many?
– B. M.
Nov 12 at 19:50
@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01
@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01
@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01
@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01
add a comment |
3 Answers
3
active
oldest
votes
up vote
1
down vote
Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.
Set your df's index to the time series, then use df.index.get_loc(beginning_window)
and min(df.index.get_loc(beginning_window+window+1)) -1
to get your index range.
The min accounts for non-unique indices.
Then use .iloc to select that range.
That should speed it up by quite a bit.
i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45
1
You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57
i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02
if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06
add a comment |
up vote
0
down vote
Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set
can be faster than np.unique
:
size=10**6
window_size=5
electrodes = np.random.randint(0,64,size)
electrodes_by_5 = electrodes.reshape(-1,window_size)
nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)
Output :
In [463]: electrodes[:10]
Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])
In [464]: electrodes_by_5[:2]
Out[464]:
array([[13, 13, 23, 20, 5],
[30, 9, 6, 28, 11]])
In [465]: nb_electrodes[:2]
Out[465]: array([4, 5])
add a comment |
up vote
0
down vote
So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.
add a comment |
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.
Set your df's index to the time series, then use df.index.get_loc(beginning_window)
and min(df.index.get_loc(beginning_window+window+1)) -1
to get your index range.
The min accounts for non-unique indices.
Then use .iloc to select that range.
That should speed it up by quite a bit.
i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45
1
You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57
i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02
if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06
add a comment |
up vote
1
down vote
Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.
Set your df's index to the time series, then use df.index.get_loc(beginning_window)
and min(df.index.get_loc(beginning_window+window+1)) -1
to get your index range.
The min accounts for non-unique indices.
Then use .iloc to select that range.
That should speed it up by quite a bit.
i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45
1
You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57
i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02
if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06
add a comment |
up vote
1
down vote
up vote
1
down vote
Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.
Set your df's index to the time series, then use df.index.get_loc(beginning_window)
and min(df.index.get_loc(beginning_window+window+1)) -1
to get your index range.
The min accounts for non-unique indices.
Then use .iloc to select that range.
That should speed it up by quite a bit.
Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.
Set your df's index to the time series, then use df.index.get_loc(beginning_window)
and min(df.index.get_loc(beginning_window+window+1)) -1
to get your index range.
The min accounts for non-unique indices.
Then use .iloc to select that range.
That should speed it up by quite a bit.
edited Nov 12 at 21:04
answered Nov 12 at 18:32
John H
1,174315
1,174315
i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45
1
You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57
i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02
if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06
add a comment |
i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45
1
You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57
i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02
if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06
i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45
i will try this... the problem is i am not only choosing index + 5 as i gave in this example, but need to find the end of the window based on the actual time value, so this involves finding the index of this time point
– Rotavator
Nov 12 at 18:45
1
1
You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57
You can find the index of that time, once you set your index to the time column by using df.index.get_loc(time).
– John H
Nov 12 at 18:57
i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02
i don't know the specific time though, i need to find the last time within the time window, then find the index. and to make things more complicated, more than one electrode can fire at the same time, so the times are not unique values
– Rotavator
Nov 12 at 21:02
if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06
if you don't know where you want to start your window, no one else can help you determine that. choose randomly or go through every possibility. updated the answer to accommodate non-unique indices
– John H
Nov 12 at 21:06
add a comment |
up vote
0
down vote
Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set
can be faster than np.unique
:
size=10**6
window_size=5
electrodes = np.random.randint(0,64,size)
electrodes_by_5 = electrodes.reshape(-1,window_size)
nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)
Output :
In [463]: electrodes[:10]
Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])
In [464]: electrodes_by_5[:2]
Out[464]:
array([[13, 13, 23, 20, 5],
[30, 9, 6, 28, 11]])
In [465]: nb_electrodes[:2]
Out[465]: array([4, 5])
add a comment |
up vote
0
down vote
Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set
can be faster than np.unique
:
size=10**6
window_size=5
electrodes = np.random.randint(0,64,size)
electrodes_by_5 = electrodes.reshape(-1,window_size)
nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)
Output :
In [463]: electrodes[:10]
Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])
In [464]: electrodes_by_5[:2]
Out[464]:
array([[13, 13, 23, 20, 5],
[30, 9, 6, 28, 11]])
In [465]: nb_electrodes[:2]
Out[465]: array([4, 5])
add a comment |
up vote
0
down vote
up vote
0
down vote
Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set
can be faster than np.unique
:
size=10**6
window_size=5
electrodes = np.random.randint(0,64,size)
electrodes_by_5 = electrodes.reshape(-1,window_size)
nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)
Output :
In [463]: electrodes[:10]
Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])
In [464]: electrodes_by_5[:2]
Out[464]:
array([[13, 13, 23, 20, 5],
[30, 9, 6, 28, 11]])
In [465]: nb_electrodes[:2]
Out[465]: array([4, 5])
Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set
can be faster than np.unique
:
size=10**6
window_size=5
electrodes = np.random.randint(0,64,size)
electrodes_by_5 = electrodes.reshape(-1,window_size)
nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)
Output :
In [463]: electrodes[:10]
Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])
In [464]: electrodes_by_5[:2]
Out[464]:
array([[13, 13, 23, 20, 5],
[30, 9, 6, 28, 11]])
In [465]: nb_electrodes[:2]
Out[465]: array([4, 5])
edited Nov 12 at 20:02
answered Nov 12 at 19:45
B. M.
11.8k11934
11.8k11934
add a comment |
add a comment |
up vote
0
down vote
So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.
add a comment |
up vote
0
down vote
So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.
add a comment |
up vote
0
down vote
up vote
0
down vote
So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.
So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.
answered Nov 12 at 22:54
Rotavator
567
567
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53268002%2ffast-subsetting-in-pandas-in-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Is your
time
series string /datetime
/timedelta
/ something else?– jpp
Nov 12 at 18:28
When do you augment index and from how many?
– B. M.
Nov 12 at 19:50
@jpp the data is normalized to start at timepoint '0' and then continues to timepoint ~3600 minutes, with increments of ~0.001 minutes.
– Rotavator
Nov 12 at 21:01
@B.M. i actually put a simplified example here, i just want to know how to faster subset my data
– Rotavator
Nov 12 at 21:01