Cumulative apply within window defined by other columns
I am trying to apply a function, cumulatively, to values that lie within a window defined by the 'start' and 'finish' columns. So 'start' and 'finish' define the interval during which each value is 'active'; for each row, I want the sum of all values 'active' at that row's start time.
Here is a 'brute force' example that does what I am after - is there a more elegant, faster, or more memory-efficient way of doing this?
import pandas as pd

df = pd.DataFrame(data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
df['dummy'] = 1
# Cross join the frame with itself, then keep row pairs whose
# half-open interval [start_y, finish_y) covers start_x.
df = df.merge(df, on=['dummy'], how='left')
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()
Originally, df is:
start finish val
0 1 3 100
1 2 4 200
2 3 6 300
3 4 6 400
4 5 6 500
The result I am after is:
1 100
2 300
3 500
4 700
5 1200
python pandas
asked 2 days ago by rinspy
2 Answers
Accepted answer (score 5) by W-B, answered 2 days ago (edited)
Using numpy broadcasting. Unfortunately this is still an O(n*m) solution, but it should be faster than the groupby. So far, based on my tests, Pir's solution has the best performance.
import numpy as np

s1 = df['start'].values
s2 = df['finish'].values
# Row j is active at start_i when start_j <= start_i < finish_j.
np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
Out[44]: array([ 100,  300,  500,  700, 1200], dtype=int64)
Some timings:
#df = pd.concat([df]*1000)
%timeit merged(df)
1 loop, best of 3: 5.02 s per loop
%timeit npb(df)
1 loop, best of 3: 283 ms per loop
%timeit PIR(df)
100 loops, best of 3: 9.8 ms per loop
def merged(df):
    # The question's brute-force cross join.
    df['dummy'] = 1
    df = df.merge(df, on=['dummy'], how='left')
    df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
    val = df.groupby('start_x')['val_y'].sum()
    return val

def npb(df):
    # Same predicate as merged(), vectorized with a broadcast mask.
    s1 = df['start'].values
    s2 = df['finish'].values
    return np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
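Since the question also asks about memory, the same broadcast can be evaluated in blocks so the boolean mask never exceeds chunk x n elements. This is a sketch of that idea (not part of the original answer; npb_chunked and chunk are names made up here), under the same half-open [start, finish) semantics:

import numpy as np

def npb_chunked(df, chunk=1024):
    s1 = df['start'].values
    s2 = df['finish'].values
    v = df['val'].values
    out = np.empty(len(s1), dtype=v.dtype)
    # Process the query points block by block: the mask is at most chunk x n.
    for lo in range(0, len(s1), chunk):
        hi = min(lo + chunk, len(s1))
        mask = (s1 <= s1[lo:hi, None]) & (s2 > s1[lo:hi, None])
        out[lo:hi] = (mask * v).sum(axis=1)
    return out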
I don't think the dataframe used for the timing would give proper output for Pir's method.
– Dark
2 days ago
@Dark - exactly, the problem is that there are only a few groups, so .sum(level), which is a hidden groupby(level=0).sum(), is really fast.
– jezrael
2 days ago
I think it is best to create new, more general sample data and then test... But the difference is huge, so the comprehension may well be faster, though perhaps not so radically.
– jezrael
2 days ago
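To illustrate jezrael's point: the benchmark frame just repeats the same five rows, so however large it gets, the final reduction only ever sees a handful of groups (a minimal sketch, reusing df from the question):

import pandas as pd

df = pd.DataFrame([[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
big = pd.concat([df] * 1000)     # 5000 rows...
print(big['start'].nunique())    # ...but still only 5 distinct start values to group on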
Answer (score 6) by piRSquared, answered 2 days ago (edited)
numba

from numba import njit
import numpy as np

@njit
def pir_numba(S, F, V):
    mn = S.min()
    mx = F.max()
    out = np.zeros(mx)
    # Add each value onto every integer point of its half-open [s, f) window.
    for s, f, v in zip(S, F, V):
        out[s:f] += v
    return out[mn:]

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])
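The returned array covers the integer time points S.min() through F.max() - 1. A small wrapper (hypothetical, not from the answer; relies on the definitions above) to attach those labels:

# Label the numba output with the time points it covers.
res = pd.Series(
    pir_numba(*[df[c].values for c in ['start', 'finish', 'val']]),
    index=np.arange(df['start'].min(), df['finish'].max()))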
np.bincount

s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
# Flatten every interval into its integer points (shifted to 0-based bins)
# and weight each point by the row's value.
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))

array([ 100.,  300.,  500.,  700., 1200.])
Comprehension
This depends on the index being unique.
pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).sum(level=0)
1 100
2 300
3 500
4 700
5 1200
dtype: int64
With no dependence on the index:
pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).sum(level=0)
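Note: Series.sum(level=0) was deprecated in pandas 1.3 and removed in 2.0; on current pandas the same comprehension works with the explicit groupby it used to dispatch to (a minimal adaptation of the code above):

pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).groupby(level=0).sum()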
Added the timing and got shocked :-)
– W-B
2 days ago
wow, it is really interesting...
– jezrael
2 days ago
Can you add more general timings, maybe with a graph?
– jezrael
2 days ago
Oh ok, @rinspy please add the output for the case where an interval is repeated. I don't think the intervals are all unique.
– Dark
2 days ago
@piRSquared - the comprehension solution assumes that the start/finish intervals are integers and are all close together. If the intervals are sparse it gets very inefficient.
– rinspy
2 days ago
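For the sparse case rinspy describes, an event-based sweep over the interval endpoints avoids materializing every integer point and runs in O(n log n). This is a sketch of that alternative (not from the thread; sweep is a name made up here), with the same half-open [start, finish) semantics as the question:

import numpy as np
import pandas as pd

def sweep(df):
    s = df['start'].values
    f = df['finish'].values
    v = df['val'].values
    # +val when an interval opens, -val when it closes, in event-time order.
    times = np.unique(np.concatenate([s, f]))
    delta = np.zeros(len(times))
    np.add.at(delta, np.searchsorted(times, s), v)
    np.add.at(delta, np.searchsorted(times, f), -v)
    active = delta.cumsum()   # total active value at each event time
    # Read off the running total at each row's start time.
    return pd.Series(active[np.searchsorted(times, s)], index=s)

On the question's data this returns 100, 300, 500, 700, 1200 indexed by start, matching the expected result, and it never allocates more than a few arrays of length 2n.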