Cumulative apply within window defined by other columns
I am trying to apply a function, cumulatively, to values that lie within a window defined by the 'start' and 'finish' columns. So 'start' and 'finish' define the interval during which each value is 'active'; for each row, I want the sum of all values 'active' at that row's start time.
Here is a 'brute force' example that does what I am after - is there a more elegant, faster, or more memory-efficient way of doing this?
import pandas as pd

df = pd.DataFrame(data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
df['dummy'] = 1
# Cross join the frame with itself, then keep row pairs whose
# half-open interval [start_y, finish_y) covers start_x.
df = df.merge(df, on=['dummy'], how='left')
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()
Originally, df is:
start finish val
0 1 3 100
1 2 4 200
2 3 6 300
3 4 6 400
4 5 6 500
The result I am after is:
1 100
2 300
3 500
4 700
5 1200
python pandas
asked 2 days ago by rinspy
2 Answers
Accepted answer (score 5) by W-B, answered 2 days ago (edited)
Using numpy broadcasting. Unfortunately this is still an O(n*m) solution, but it should be faster than the groupby. So far, based on my tests, Pir's solution has the best performance.
import numpy as np

s1 = df['start'].values
s2 = df['finish'].values
# Row j is active at start_i when start_j <= start_i < finish_j.
np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
Out[44]: array([ 100,  300,  500,  700, 1200], dtype=int64)
Some timings:
#df = pd.concat([df]*1000)
%timeit merged(df)
1 loop, best of 3: 5.02 s per loop
%timeit npb(df)
1 loop, best of 3: 283 ms per loop
%timeit PIR(df)
100 loops, best of 3: 9.8 ms per loop
def merged(df):
    # The question's brute-force cross join.
    df['dummy'] = 1
    df = df.merge(df, on=['dummy'], how='left')
    df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
    val = df.groupby('start_x')['val_y'].sum()
    return val

def npb(df):
    # Same predicate as merged(), vectorized with a broadcast mask.
    s1 = df['start'].values
    s2 = df['finish'].values
    return np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
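Since the question also asks about memory, the same broadcast can be evaluated in blocks so the boolean mask never exceeds chunk x n elements. This is a sketch of that idea (not part of the original answer; npb_chunked and chunk are names made up here), under the same half-open [start, finish) semantics:

import numpy as np

def npb_chunked(df, chunk=1024):
    s1 = df['start'].values
    s2 = df['finish'].values
    v = df['val'].values
    out = np.empty(len(s1), dtype=v.dtype)
    # Process the query points block by block: the mask is at most chunk x n.
    for lo in range(0, len(s1), chunk):
        hi = min(lo + chunk, len(s1))
        mask = (s1 <= s1[lo:hi, None]) & (s2 > s1[lo:hi, None])
        out[lo:hi] = (mask * v).sum(axis=1)
    return out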
I don't think the dataframe used for the timing would give proper output for Pir's method.
– Dark
2 days ago
@Dark - exactly, the problem is that there are only a few groups, so .sum(level), which is a hidden groupby(level=0).sum(), is really fast.
– jezrael
2 days ago
I think it is best to create new, more general sample data and then test... But the difference is huge, so the comprehension may well be faster, though perhaps not so radically.
– jezrael
2 days ago
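To illustrate jezrael's point: the benchmark frame just repeats the same five rows, so however large it gets, the final reduction only ever sees a handful of groups (a minimal sketch, reusing df from the question):

import pandas as pd

df = pd.DataFrame([[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
big = pd.concat([df] * 1000)     # 5000 rows...
print(big['start'].nunique())    # ...but still only 5 distinct start values to group on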
Answer (score 6) by piRSquared, answered 2 days ago (edited)
numba

from numba import njit
import numpy as np

@njit
def pir_numba(S, F, V):
    mn = S.min()
    mx = F.max()
    out = np.zeros(mx)
    # Add each value onto every integer point of its half-open [s, f) window.
    for s, f, v in zip(S, F, V):
        out[s:f] += v
    return out[mn:]

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])
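The returned array covers the integer time points S.min() through F.max() - 1. A small wrapper (hypothetical, not from the answer; relies on the definitions above) to attach those labels:

# Label the numba output with the time points it covers.
res = pd.Series(
    pir_numba(*[df[c].values for c in ['start', 'finish', 'val']]),
    index=np.arange(df['start'].min(), df['finish'].max()))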
np.bincount

s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
# Flatten every interval into its integer points (shifted to 0-based bins)
# and weight each point by the row's value.
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))

array([ 100.,  300.,  500.,  700., 1200.])
Comprehension
This depends on the index being unique.
pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).sum(level=0)
1 100
2 300
3 500
4 700
5 1200
dtype: int64
With no dependence on the index:
pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).sum(level=0)
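Note: Series.sum(level=0) was deprecated in pandas 1.3 and removed in 2.0; on current pandas the same comprehension works with the explicit groupby it used to dispatch to (a minimal adaptation of the code above):

pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).groupby(level=0).sum()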
Added the timing and got shocked :-)
– W-B
2 days ago
wow, it is really interesting...
– jezrael
2 days ago
Can you add more general timings, maybe with a graph?
– jezrael
2 days ago
Oh ok, @rinspy please add the output for the case where an interval is repeated. I don't think the intervals are all unique.
– Dark
2 days ago
@piRSquared - the comprehension solution assumes that the start/finish intervals are integers and are all close together. If the intervals are sparse it gets very inefficient.
– rinspy
2 days ago
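For the sparse case rinspy describes, an event-based sweep over the interval endpoints avoids materializing every integer point and runs in O(n log n). This is a sketch of that alternative (not from the thread; sweep is a name made up here), with the same half-open [start, finish) semantics as the question:

import numpy as np
import pandas as pd

def sweep(df):
    s = df['start'].values
    f = df['finish'].values
    v = df['val'].values
    # +val when an interval opens, -val when it closes, in event-time order.
    times = np.unique(np.concatenate([s, f]))
    delta = np.zeros(len(times))
    np.add.at(delta, np.searchsorted(times, s), v)
    np.add.at(delta, np.searchsorted(times, f), -v)
    active = delta.cumsum()   # total active value at each event time
    # Read off the running total at each row's start time.
    return pd.Series(active[np.searchsorted(times, s)], index=s)

On the question's data this returns 100, 300, 500, 700, 1200 indexed by start, matching the expected result, and it never allocates more than a few arrays of length 2n.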