Cumulative apply within window defined by other columns
I am trying to apply a function, cumulatively, to values that lie within a window defined by the 'start' and 'finish' columns. That is, 'start' and 'finish' define the interval where each value is 'active'; for each row, I want the sum of all values that are 'active' at that row's start time.
Here is a 'brute force' example that does what I am after - is there a more elegant, faster or more memory-efficient way of doing this?
import pandas as pd

df = pd.DataFrame(data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
# Cross join via a constant key, keep the (x, y) pairs where row y is
# active at row x's start time, then sum the active values per start time.
df['dummy'] = 1
df = df.merge(df, on=['dummy'], how='left')
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()
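As an aside (not part of the original question): on pandas 1.2+ the dummy-key trick can be replaced by a built-in cross join, e.g.
df = df.merge(df, how='cross')  # same _x / _y suffixes, no 'dummy' column needed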
Originally, df is:
start finish val
0 1 3 100
1 2 4 200
2 3 6 300
3 4 6 400
4 5 6 500
The result I am after is:
1 100
2 300
3 500
4 700
5 1200
python pandas
asked Nov 12 at 15:25
rinspy

2 Answers
Using numpy broadcasting. Unfortunately it is still an O(n*m) solution, but it should be faster than the groupby merge. So far, based on my tests, piRSquared's solution has the best performance.
import numpy as np

s1 = df['start'].values
s2 = df['finish'].values
# Row y is active at row x's start time when start_y <= start_x < finish_y.
np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
Out[44]: array([ 100,  300,  500,  700, 1200], dtype=int64)
Some timings:
# df = pd.concat([df] * 1000)
%timeit merged(df)
1 loop, best of 3: 5.02 s per loop
%timeit npb(df)
1 loop, best of 3: 283 ms per loop
%timeit PIR(df)
100 loops, best of 3: 9.8 ms per loop
def merged(df):
    # The OP's baseline: cross join, filter to active rows, group-sum.
    df['dummy'] = 1
    df = df.merge(df, on=['dummy'], how='left')
    df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
    val = df.groupby('start_x')['val_y'].sum()
    return val

def npb(df):
    # Broadcast version of the same filter-and-sum.
    s1 = df['start'].values
    s2 = df['finish'].values
    return np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
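If memory is the concern (the boolean mask above is n-by-n), a chunked variant of the same broadcast bounds peak memory; this is a sketch under the same active-at-start definition, not part of the original answer:
def npb_chunked(df, chunk=1024):
    # Process query rows in blocks so the mask is at most chunk x n.
    s1 = df['start'].values
    s2 = df['finish'].values
    v = df['val'].values
    out = np.empty(len(df), dtype=v.dtype)
    for lo in range(0, len(df), chunk):
        t = s1[lo:lo + chunk, None]   # this block's query times
        mask = (s1 <= t) & (s2 > t)   # intervals active at each query time
        out[lo:lo + chunk] = (mask * v).sum(axis=1)
    return out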
answered Nov 12 at 15:37, edited Nov 12 at 15:49
W-B (accepted)
I don't think the dataframe used for the timing would give a proper output for Pir's method.
– Dark
Nov 12 at 16:17
@Dark - exactly, the problem is that there are only 6 groups, so .sum(level), which is a hidden groupby(level=0).sum(), is really fast.
– jezrael
Nov 12 at 16:19
I think it is best to create new, more general sample data and then test... But the difference is huge, so the comprehension will possibly still be faster, just not so radically.
– jezrael
Nov 12 at 16:21
numba
import numpy as np
from numba import njit

@njit
def pir_numba(S, F, V):
    # Accumulate each value onto the integer time points it is active for,
    # then slice off the leading zeros before the first start time.
    mn = S.min()
    mx = F.max()
    out = np.zeros(mx)
    for s, f, v in zip(S, F, V):
        out[s:f] += v
    return out[mn:]

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])
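For the sample frame above this should reproduce the desired totals; the dtype is float because out is allocated with np.zeros:
array([ 100.,  300.,  500.,  700., 1200.])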
np.bincount
s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))
array([ 100., 300., 500., 700., 1200.])
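The one-liner is dense; here is a step-by-step equivalent (variable names are illustrative, not from the answer, and the - s.min() shift generalizes the hard-coded i - 1, which assumes the starts begin at 1):
# One entry per (interval, time point): [s, f) covers s, s+1, ..., f-1.
times = np.concatenate([np.arange(si, fi) for si, fi in zip(s, f)])
# Each value is counted once per time point it is active for.
weights = v.repeat(f - s)
# Sum the weights landing on each time point, shifted so the first
# active time falls in bin 0.
np.bincount(times - s.min(), weights)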
Comprehension
This depends on the index being unique:
pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).sum(level=0)
1 100
2 300
3 500
4 700
5 1200
dtype: int64
With no dependence on the index:
pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).sum(level=0)
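For a quick sanity check on the sample frame, the variants can be compared against the desired result (a sketch; merged and npb are the helpers from the accepted answer, and merged gets a copy since it adds a 'dummy' column):
expected = np.array([100, 300, 500, 700, 1200])
assert np.array_equal(merged(df.copy()).values, expected)
assert np.array_equal(npb(df), expected)
assert np.allclose(pir_numba(*[df[c].values for c in ['start', 'finish', 'val']]), expected)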
answered Nov 12 at 15:42, edited Nov 12 at 16:50
piRSquared
Added the timing and got shocked :-)
– W-B
Nov 12 at 15:49
wow, it is really interesting...
– jezrael
Nov 12 at 15:52
Can you add more general timings, maybe with a graph?
– jezrael
Nov 12 at 16:23
Oh okay, @rinspy please add the expected output for the case where an interval is repeated. I don't think the intervals are all unique.
– Dark
Nov 12 at 16:26
@piRSquared - the comprehension solution assumes that the start/finish intervals are integers and are all close together. If the intervals are sparse it gets very inefficient.
– rinspy
Nov 12 at 16:29