Cfrgtkky

Question

Just curious on the behavior of 'where' and why you would use it over 'loc'.

If I create a dataframe:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})

And then apply the 'where' function:

df2 = df.where(df['Goals']>10)

I get the following, which keeps the rows where Goals > 10 but replaces every other row with NaN:

  Gender  Goals    ID  Run Distance
0      m   12.0   1.0         234.0
1      m   23.0   2.0          35.0
2      m   56.0   3.0          77.0
3    NaN    NaN   NaN           NaN
4    NaN    NaN   NaN           NaN
5    NaN    NaN   NaN           NaN
6    NaN    NaN   NaN           NaN
7    NaN    NaN   NaN           NaN
8    NaN    NaN   NaN           NaN
9      m   34.0  10.0         123.0

If however I use the 'loc' function:

df2 = df.loc[df['Goals']>10]

It returns the dataframe subsetted without the NaN values:

  Gender  Goals  ID  Run Distance
0      m     12   1           234
1      m     23   2            35
2      m     56   3            77
9      m     34  10           123

So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?

Related: Pandas mask / where methods versus NumPy np.where. Summary: Pandas where rarely outperforms (or is more readable versus) the more popular NumPy np.where, so the former is often irrelevant. — Feb 27 at 15:10
Thank you jpp. Interesting question by you and response by 'ead'. I will look at numpy for using 'where'. — Feb 27 at 15:28
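The comparison raised in the comments can be sketched on a single column. This is only an illustration, assuming NumPy and pandas are installed; the variable names via_pandas and via_numpy are made up for the example. Series.where keeps the original values where the condition holds and returns a Series with its index, while np.where takes both branches explicitly and returns a plain array.

```python
import numpy as np
import pandas as pd

# The Goals column from the question, as a standalone Series
goals = pd.Series([12, 23, 56, 7, 8, 0, 4, 2, 1, 34], name='Goals')
cond = goals > 10

# pandas: keep values where cond is True, otherwise use 0; result is a Series
via_pandas = goals.where(cond, 0)

# NumPy: choose between two explicit branches; result is an ndarray without an index
via_numpy = np.where(cond, goals, 0)

print(type(via_pandas).__name__)  # Series
print(type(via_numpy).__name__)   # ndarray
print(via_pandas.tolist() == via_numpy.tolist())  # True
```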

score 7 · Accepted Answer · 2019-02-27 08:28:29Z

Think of loc as a filter - give me only the parts of the df that conform to a condition.

where originally comes from numpy. It checks each element against a condition and gives you back an object of the same shape as the original, keeping the value where the condition holds and putting NaN everywhere else. A nice feature of where is that you can also get back something other than NaN, e.g. df2 = df.where(df['Goals']>10, other=0), to replace values that don't meet the condition with 0 (note the integer 0, not the string '0'):

   ID  Run Distance  Goals Gender
0   1           234     12      m
1   2            35     23      m
2   3            77     56      m
3   0             0      0      0
4   0             0      0      0
5   0             0      0      0
6   0             0      0      0
7   0             0      0      0
8   0             0      0      0
9  10           123     34      m
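This also explains the floats (12.0, 1.0, ...) in the question's output: NaN is a float, so an integer column is upcast when some of its values are replaced with NaN. A minimal sketch on the Goals column alone, assuming a standard pandas install; the names with_nan and with_zero are illustrative only.

```python
import pandas as pd

# The Goals column from the question, as a standalone Series
goals = pd.Series([12, 23, 56, 7, 8, 0, 4, 2, 1, 34], name='Goals')
cond = goals > 10

with_nan = goals.where(cond)       # non-matching values become NaN
with_zero = goals.where(cond, 0)   # non-matching values become the integer 0

print(with_nan.dtype)    # float64, because NaN forces an upcast
print(with_zero.dtype)   # int64, because no NaN is introduced
print(len(with_nan) == len(goals))  # True: where never drops rows
```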

Also, while where only replaces values based on a condition, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column labels, while iloc uses integer positions. Note that a label slice in loc includes its end label. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:

  Gender  Goals
0      m     12
1      m     23
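The label versus position distinction can be sketched as follows. This assumes the question's df with its default integer index. The columns are looked up with Index.get_indexer so the example does not depend on column order, and the names by_label, by_position and one_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})

# loc: label-based; the slice 0:1 includes the end label, so two rows come back
by_label = df.loc[0:1, ['Gender', 'Goals']]

# iloc: position-based; translate the column names to integer positions first
col_pos = df.columns.get_indexer(['Gender', 'Goals'])
by_position = df.iloc[0:2, col_pos]   # end position is exclusive, so 0:2 is needed
one_row = df.iloc[0:1, col_pos]       # the same slice 0:1 yields a single row here

print(len(by_label), len(by_position), len(one_row))  # 2 2 1
```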

That is super helpful, thank you. So 'loc' filters, and 'where' is more for where you want to change values that do not fit the condition to something else. Perfect, thank you! — Feb 27 at 8:15

score 6 · Accepted Answer · 2019-02-27 08:18:10Z

According to the docs, DataFrame.where replaces the rows that do not satisfy the condition. The replacement is NaN by default, but you can specify another value:

df2 = df.where(df['Goals']>10)
print(df2)
     ID  Run Distance  Goals Gender
0   1.0         234.0   12.0      m
1   2.0          35.0   23.0      m
2   3.0          77.0   56.0      m
3   NaN           NaN    NaN    NaN
4   NaN           NaN    NaN    NaN
5   NaN           NaN    NaN    NaN
6   NaN           NaN    NaN    NaN
7   NaN           NaN    NaN    NaN
8   NaN           NaN    NaN    NaN
9  10.0         123.0   34.0      m

df2 = df.where(df['Goals']>10, 100)
print(df2)
    ID  Run Distance  Goals Gender
0    1           234     12      m
1    2            35     23      m
2    3            77     56      m
3  100           100    100    100
4  100           100    100    100
5  100           100    100    100
6  100           100    100    100
7  100           100    100    100
8  100           100    100    100
9   10           123     34      m
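In practice the replacement is often applied to one column rather than the whole frame. A small sketch, assuming the same Goals values as above; mask is the counterpart of where that replaces the values where the condition is True. The names kept and inverse are illustrative.

```python
import pandas as pd

# The Goals column from the question, as a standalone Series
goals = pd.Series([12, 23, 56, 7, 8, 0, 4, 2, 1, 34], name='Goals')

# where: keep values > 10, replace everything else with 100
kept = goals.where(goals > 10, 100)

# mask: replace values > 10 with 100, keep everything else
inverse = goals.mask(goals > 10, 100)

print(kept.tolist())     # [12, 23, 56, 100, 100, 100, 100, 100, 100, 34]
print(inverse.tolist())  # [100, 100, 100, 7, 8, 0, 4, 2, 1, 100]
```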

Another syntax is called boolean indexing and is used to filter rows: it keeps only the rows that match the condition and removes the rest.

df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]

print(df2)
   ID  Run Distance  Goals Gender
0   1           234     12      m
1   2            35     23      m
2   3            77     56      m
9  10           123     34      m
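The relationship between the two approaches can be checked directly. A minimal sketch, assuming the question's df: both boolean-indexing spellings give the same frame, and dropping the NaN rows from the where result leaves the same rows, although the numeric columns stay float after the round trip through NaN. The names filtered_loc, filtered_plain and via_where are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})
cond = df['Goals'] > 10

filtered_loc = df.loc[cond]          # boolean indexing through loc
filtered_plain = df[cond]            # the shorter alternative spelling
via_where = df.where(cond).dropna()  # mask first, then drop the all-NaN rows

print(filtered_loc.equals(filtered_plain))  # True
print(list(via_where.index))                # [0, 1, 2, 9]
print(via_where['Goals'].dtype)             # float64 (upcast by the NaN step)
```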

With loc it is also possible to filter rows by condition and columns by name(s) at the same time:

s = df.loc[df['Goals']>10, 'ID']
print(s)
0     1
1     2
2     3
9    10
Name: ID, dtype: int64

df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print(df2)
   ID Gender
0   1      m
1   2      m
2   3      m
9  10      m

That makes a lot of sense, thank you very much. Also thanks for the tip on the alternative! — Feb 27 at 8:12

score 5 · Accepted Answer · 2019-02-27 08:11:17Z · answered by Casti

loc retrieves only the rows that match the condition.

where returns the whole dataframe, replacing the rows that don't match the condition (with NaN by default).

Great, thank you. 'Where' is a lot more useful than originally thought! – ScoutEU, Feb 27 at 8:12

Python Pandas - difference between 'loc' and 'where'?
