Dataframe iterate rows eliminate when condition met

I have a big Dataframe, here is the sample data:

df['length']

353.216  

353.514  

273.559  

274.199  

353.813  

354.116

I want to iterate over the rows and compare row i+1 with row i: if the difference is less than 2, the row should stay; otherwise the whole row should be filtered out. I tried Boolean indexing: diff = abs(df['length']).diff() < 2 and then df_clean = df[diff]

I want to get rid of all 'abnormal' rows. I know that every row i+1 should be within a ±2 range.
The problem is that there can be more than one abnormal row in sequence. I want to get rid of 273.559 and 274.199 (in this case), but because the difference between them is less than 2, I would need to iterate over all the rows twice. Adding a for loop that iterates over and over again doesn't seem like the best approach to me. Are there any good solutions?
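To make the problem concrete, here is a minimal sketch of what a single diff() pass does on the sample data. The DataFrame construction and the names mask_as_written, mask_abs_diff, kept_as_written and kept_abs_diff are assumptions made for illustration; only the diff expression itself comes from the attempt above.

```python
import pandas as pd

df = pd.DataFrame({"length": [353.216, 353.514, 273.559, 274.199, 353.813, 354.116]})

# Expression as written in the question: abs() is applied to the values,
# not to the differences, so a large negative jump still passes "< 2".
mask_as_written = abs(df["length"]).diff() < 2

# Variant with abs() applied to the differences themselves.
mask_abs_diff = df["length"].diff().abs() < 2

kept_as_written = df.loc[mask_as_written, "length"].tolist()
kept_abs_diff = df.loc[mask_abs_diff, "length"].tolist()

print(kept_as_written)  # [353.514, 273.559, 274.199, 354.116]
print(kept_abs_diff)    # [353.514, 274.199, 354.116]
```

In both variants the first row is lost (its diff is NaN) and 274.199 survives, because each row is compared with its direct neighbour rather than with the last 'normal' row.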

Edit: My Output should look as follows:

df_clean_data ['length']

353.216  

353.514  

353.813  

354.116

Thank you in advance
Ziga

edited Nov 21 '18 at 13:45

asked Nov 21 '18 at 13:18

Ziga

Can you explain better exactly what you want as output?

– Matina G
Nov 21 '18 at 13:21

Why only 273.559 and 274.199? There are other contiguous elements that are less than 2 apart from their neighbours, like 353.216 and 353.514

– yatu
Nov 21 '18 at 13:27

273.559 should be eliminated (diff = 273.559 - 353.514 = -79.955), and 274.199 should also be eliminated, as its difference to the other 'normal' values exceeds 2 (diff = 274.199 - 353.514 = -79.315)

– Ziga
Nov 21 '18 at 13:31

Please reformulate your question if you want any help; what you are trying to do seems quite unclear

– yatu
Nov 21 '18 at 13:36

python pandas

3 Answers

The key to success is a function working almost like diff():

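def mark(x):
    global prevX                # last value that was kept
    difr = abs(x - prevX)       # distance to the last kept value
    result = difr >= 2          # True means "drop this row"
    if not result:
        prevX = x               # move the reference only for kept rows
    return result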

It differs from diff() in two ways:

This function uses a global variable prevX ("previous x"),
which must initially hold the first length (the program has to
set it).

prevX is updated to the current x only if the difference
is less than 2. So, in this respect, rows that are to be
deleted are skipped when comparing.

The initial step is to set prevX to the 1st length:

prevX = df.loc[0, 'length']

And the actual processing is performed with a single instruction:

df.drop(df[df['length'].apply(mark)].index, inplace=True)

A bit of explanation:

df['length'].apply(mark) generates a boolean Series. True means "this row
is to be deleted". To see how it works, execute this command on its own
(before dropping), then set prevX again, because mark changes it.

df[...].index gives the index values of these rows.

df.drop deletes the rows with the given index values (in place).

So the whole script is like below:

import pandas as pd

def mark(x):
    global prevX
    difr = abs(x - prevX)
    result = difr >= 2  # drop when the difference is 2 or more
    if not result:
        prevX = x
    return result

data = {'length': [353.216, 353.514, 273.559, 274.199, 353.813, 354.116]}
df = pd.DataFrame(data)
prevX = df.loc[0, 'length']
df.drop(df[df['length'].apply(mark)].index, inplace=True)

The result is:

    length
0  353.216
1  353.514
4  353.813
5  354.116

Alternative: If you want the result in another DataFrame, delete
inplace=True and assign the result to the target variable.
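The same idea can also be sketched without a global variable. The helper name make_marker, its threshold parameter and the variable names to_drop and df_clean are hypothetical, introduced only for this illustration; the comparison logic follows the mark function above.

```python
import pandas as pd


def make_marker(start, threshold=2):
    """Build a marker function that remembers the last kept value (hypothetical helper)."""
    prev = start

    def mark(x):
        nonlocal prev
        drop = abs(x - prev) >= threshold  # True means "drop this row"
        if not drop:
            prev = x  # move the reference only for kept rows
        return drop

    return mark


df = pd.DataFrame({"length": [353.216, 353.514, 273.559, 274.199, 353.813, 354.116]})

# Series.apply calls the marker once per value, in row order.
to_drop = df["length"].apply(make_marker(df["length"].iloc[0])).astype(bool)

# Keep the original df untouched and store the filtered rows separately.
df_clean = df[~to_drop]
print(df_clean)
```

Because the reference value lives inside the closure, every call to make_marker starts from a fresh state, so nothing has to be reset between runs. Using iloc[0] instead of loc[0, 'length'] is a choice made here so the sketch does not rely on the index starting at 0.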

edited Nov 21 '18 at 18:20

answered Nov 21 '18 at 18:15

Valdi_Bo

That's an amazing solution, thank you. One more question: what exactly does 'inplace=True' do?

– Ziga
Nov 22 '18 at 6:52

Without inplace, the DataFrame with dropped rows is only the return value of the function, and the df involved is not changed. But when you use inplace=True, the result is saved in this df.

– Valdi_Bo
Nov 22 '18 at 8:29
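A small sketch of that difference, using a made-up three-row DataFrame (the data and the names result, len_before and returned are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"length": [353.216, 273.559, 353.514]})

# Without inplace: drop returns a new DataFrame and leaves df as it was.
result = df.drop([1])
len_before = len(df)

# With inplace=True: df itself is modified and the call returns None.
returned = df.drop([1], inplace=True)

print(len(result), len_before, returned, len(df))  # 2 3 None 2
```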

You have to iterate over your dataframe's rows like this, because there can be several rows to filter out between two valid values:

ref_row = df.iloc[0]  # First line or first value you want to set as reference
valid_rows_indexes = []  # Store valid lines indexes
for index, row in df.iterrows():  # Iterate over rows
    if abs(ref_row['length'] - row['length']) < 2:
        valid_rows_indexes.append(index)  # Append valid line index
        ref_row = row  # Set this row as new reference value
df_clean_data = df.loc[valid_rows_indexes]  # Filter dataframe

Hope this helps.
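For a big DataFrame, the same loop can be run over the plain column values instead of iterrows(), which avoids building a Series for every row. This is a sketch under that assumption, not a measured comparison; the function name keep_mask and its threshold parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def keep_mask(values, threshold=2.0):
    """Return a boolean array marking values within `threshold` of the last kept value."""
    keep = np.zeros(len(values), dtype=bool)
    if len(values) == 0:
        return keep
    ref = values[0]  # first value is the initial reference
    for i, v in enumerate(values):
        if abs(v - ref) < threshold:
            keep[i] = True
            ref = v  # this value becomes the new reference
    return keep


df = pd.DataFrame({"length": [353.216, 353.514, 273.559, 274.199, 353.813, 354.116]})
df_clean_data = df[keep_mask(df["length"].to_numpy())]
print(df_clean_data)
```

The boolean array is positional, so it works regardless of what the index labels are.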

answered Nov 21 '18 at 13:56

Clem G.

Your question is not crystal clear, but here is a suggestion based on what I understood.

Sort the DataFrame on that column (length).

Use a for loop to check the difference.

If you want to keep a record, add it to a new DataFrame.

Use the new DataFrame.

Another way, since you have a big DataFrame:

Sort the DataFrame on that column (length).

Create a new column.

Use a for loop to check the difference.

If you don't want a record, write np.nan in the new column.

Remove all records that contain np.nan in the new column.
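The second list can be sketched roughly as follows. Two assumptions are made here and are not part of the answer: the sort step is left out, because the comparison in the question depends on the original row order, and the helper column is named checked purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"length": [353.216, 353.514, 273.559, 274.199, 353.813, 354.116]})

checked = []
ref = df["length"].iloc[0]  # first value is the initial reference
for value in df["length"]:
    if abs(value - ref) < 2:
        checked.append(value)   # record is wanted: keep its value
        ref = value
    else:
        checked.append(np.nan)  # record is not wanted: mark it with NaN

df["checked"] = checked  # the new helper column

# Remove the marked records, then discard the helper column again.
df_clean = df.dropna(subset=["checked"]).drop(columns="checked")
print(df_clean)
```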

answered Nov 21 '18 at 13:59

Anuprita


