Cumulative apply within window defined by other columns
I am trying to apply a function, cumulatively, to values that lie within a window defined by the 'start' and 'finish' columns. That is, 'start' and 'finish' define the interval where each value is 'active'; for each row, I want the sum of all values that are 'active' at that row's start time.
Here is a 'brute force' example that does what I am after - is there a more elegant, faster or more memory-efficient way of doing this?
import pandas as pd

df = pd.DataFrame(data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
# Cross join via a constant key, keep the (x, y) pairs where row y is
# active at row x's start time, then sum the active values per start time.
df['dummy'] = 1
df = df.merge(df, on=['dummy'], how='left')
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()
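As an aside (not part of the original question): on pandas 1.2+ the dummy-key trick can be replaced by a built-in cross join, e.g.
df = df.merge(df, how='cross')  # same _x / _y suffixes, no 'dummy' column needed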
Originally, df is:
start finish val
0 1 3 100
1 2 4 200
2 3 6 300
3 4 6 400
4 5 6 500
The result I am after is:
1 100
2 300
3 500
4 700
5 1200
python pandas
asked Nov 12 at 15:25
rinspy

2 Answers
Using numpy broadcasting. Unfortunately it is still an O(n*m) solution, but it should be faster than the groupby merge. So far, based on my tests, piRSquared's solution has the best performance.
import numpy as np

s1 = df['start'].values
s2 = df['finish'].values
# Row y is active at row x's start time when start_y <= start_x < finish_y.
np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
Out[44]: array([ 100,  300,  500,  700, 1200], dtype=int64)
Some timings:
# df = pd.concat([df] * 1000)
%timeit merged(df)
1 loop, best of 3: 5.02 s per loop
%timeit npb(df)
1 loop, best of 3: 283 ms per loop
%timeit PIR(df)
100 loops, best of 3: 9.8 ms per loop
def merged(df):
    # The OP's baseline: cross join, filter to active rows, group-sum.
    df['dummy'] = 1
    df = df.merge(df, on=['dummy'], how='left')
    df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
    val = df.groupby('start_x')['val_y'].sum()
    return val

def npb(df):
    # Broadcast version of the same filter-and-sum.
    s1 = df['start'].values
    s2 = df['finish'].values
    return np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
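If memory is the concern (the boolean mask above is n-by-n), a chunked variant of the same broadcast bounds peak memory; this is a sketch under the same active-at-start definition, not part of the original answer:
def npb_chunked(df, chunk=1024):
    # Process query rows in blocks so the mask is at most chunk x n.
    s1 = df['start'].values
    s2 = df['finish'].values
    v = df['val'].values
    out = np.empty(len(df), dtype=v.dtype)
    for lo in range(0, len(df), chunk):
        t = s1[lo:lo + chunk, None]   # this block's query times
        mask = (s1 <= t) & (s2 > t)   # intervals active at each query time
        out[lo:lo + chunk] = (mask * v).sum(axis=1)
    return out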
answered Nov 12 at 15:37, edited Nov 12 at 15:49
W-B (accepted)
I don't think the dataframe used for the timing would give a proper output for Pir's method.
– Dark
Nov 12 at 16:17
@Dark - exactly, the problem is that there are only 6 groups, so .sum(level), which is a hidden groupby(level=0).sum(), is really fast.
– jezrael
Nov 12 at 16:19
I think it is best to create new, more general sample data and then test... But the difference is huge, so the comprehension will possibly still be faster, just not so radically.
– jezrael
Nov 12 at 16:21
numba
import numpy as np
from numba import njit

@njit
def pir_numba(S, F, V):
    # Accumulate each value onto the integer time points it is active for,
    # then slice off the leading zeros before the first start time.
    mn = S.min()
    mx = F.max()
    out = np.zeros(mx)
    for s, f, v in zip(S, F, V):
        out[s:f] += v
    return out[mn:]

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])
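For the sample frame above this should reproduce the desired totals; the dtype is float because out is allocated with np.zeros:
array([ 100.,  300.,  500.,  700., 1200.])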
np.bincount
s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))
array([ 100., 300., 500., 700., 1200.])
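The one-liner is dense; here is a step-by-step equivalent (variable names are illustrative, not from the answer, and the - s.min() shift generalizes the hard-coded i - 1, which assumes the starts begin at 1):
# One entry per (interval, time point): [s, f) covers s, s+1, ..., f-1.
times = np.concatenate([np.arange(si, fi) for si, fi in zip(s, f)])
# Each value is counted once per time point it is active for.
weights = v.repeat(f - s)
# Sum the weights landing on each time point, shifted so the first
# active time falls in bin 0.
np.bincount(times - s.min(), weights)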
Comprehension
This depends on the index being unique:
pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).sum(level=0)
1 100
2 300
3 500
4 700
5 1200
dtype: int64
With no dependence on the index:
pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).sum(level=0)
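For a quick sanity check on the sample frame, the variants can be compared against the desired result (a sketch; merged and npb are the helpers from the accepted answer, and merged gets a copy since it adds a 'dummy' column):
expected = np.array([100, 300, 500, 700, 1200])
assert np.array_equal(merged(df.copy()).values, expected)
assert np.array_equal(npb(df), expected)
assert np.allclose(pir_numba(*[df[c].values for c in ['start', 'finish', 'val']]), expected)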
answered Nov 12 at 15:42, edited Nov 12 at 16:50
piRSquared
Added the timing and got shocked :-)
– W-B
Nov 12 at 15:49
wow, it is really interesting...
– jezrael
Nov 12 at 15:52
Can you add more general timings, maybe with a graph?
– jezrael
Nov 12 at 16:23
Oh okay, @rinspy please add the expected output for the case where an interval is repeated. I don't think the intervals are all unique.
– Dark
Nov 12 at 16:26
@piRSquared - the comprehension solution assumes that the start/finish intervals are integers and are all close together. If the intervals are sparse it gets very inefficient.
– rinspy
Nov 12 at 16:29