Merge two dataframes on string column - compound string column

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}

I am trying to merge two huge dataframes (4+ millions each) that have the following structure:

Dataframe A:

     date    Fruit        a    b    c    d

     01      "apple"      0    3    5    1

     03      "apple"      8    2    7    2

     02      "banana"     1    4    3    5

     04      "banana"     3    5    2    6

     03      "pineapple"  2    6    4    6

     05      "pineapple"  3    5    7    9

Dataframe B:

     date   Fruits                         x    y    z 

     01     "apple, pear, strawberry"      a    n    q 

     02     "banana, apple, coconut"       b    m    p 

     03     "pineapple, pear, banana"      c    s    o

     04     "banana, apple, coconut"       d    f    v 

     05     "pineapple, pear, banana"      r    ñ    t

What I am trying to achieve is a third dataframe with the following structure:

Dataframe C:

     date   Fruit        a    b    c    d    x    y    z

     01     "apple"      0    3    5    1    a    n    q

     03     "apple"      0    3    5    1    0    0    0

     02     "banana"     1    4    3    5    b    m    p

     04     "banana"     1    4    3    5    d    f    v

     03     "pineapple"  2    6    4    6    c    s    o

     05     "pineapple"  2    6    4    6    r    ñ    t

      ...

I had already tried something like:

test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)

But my kernel breaks, I also had tried using a for cycle to expand the fruit column of B dataframe into a 'fruit-b' column, keeping the data from the other B columns and then merging between the date column and the 'fruit-B' columns, but the time of execution is too high.

Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?

Fruit and Fruits columns type is string.

edited Nov 23 '18 at 14:11

asked Nov 22 '18 at 23:13

Santiago Hernàndez

185

What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

– smci
Nov 23 '18 at 1:03

1

df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

– smci
Nov 23 '18 at 1:13

@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

– Santiago Hernàndez
Nov 23 '18 at 14:11

add a comment |

I am trying to merge two huge dataframes (4+ millions each) that have the following structure:

Dataframe A:

     date    Fruit        a    b    c    d

     01      "apple"      0    3    5    1

     03      "apple"      8    2    7    2

     02      "banana"     1    4    3    5

     04      "banana"     3    5    2    6

     03      "pineapple"  2    6    4    6

     05      "pineapple"  3    5    7    9

Dataframe B:

     date   Fruits                         x    y    z 

     01     "apple, pear, strawberry"      a    n    q 

     02     "banana, apple, coconut"       b    m    p 

     03     "pineapple, pear, banana"      c    s    o

     04     "banana, apple, coconut"       d    f    v 

     05     "pineapple, pear, banana"      r    ñ    t

What I am trying to achieve is a third dataframe with the following structure:

Dataframe C:

     date   Fruit        a    b    c    d    x    y    z

     01     "apple"      0    3    5    1    a    n    q

     03     "apple"      0    3    5    1    0    0    0

     02     "banana"     1    4    3    5    b    m    p

     04     "banana"     1    4    3    5    d    f    v

     03     "pineapple"  2    6    4    6    c    s    o

     05     "pineapple"  2    6    4    6    r    ñ    t

      ...

I had already tried something like:

test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)

Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?

Fruit and Fruits columns type is string.

edited Nov 23 '18 at 14:11

asked Nov 22 '18 at 23:13

Santiago Hernàndez

185

What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

– smci
Nov 23 '18 at 1:03

1

df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

– smci
Nov 23 '18 at 1:13

@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

– Santiago Hernàndez
Nov 23 '18 at 14:11

add a comment |

I am trying to merge two huge dataframes (4+ millions each) that have the following structure:

Dataframe A:

     date    Fruit        a    b    c    d

     01      "apple"      0    3    5    1

     03      "apple"      8    2    7    2

     02      "banana"     1    4    3    5

     04      "banana"     3    5    2    6

     03      "pineapple"  2    6    4    6

     05      "pineapple"  3    5    7    9

Dataframe B:

     date   Fruits                         x    y    z 

     01     "apple, pear, strawberry"      a    n    q 

     02     "banana, apple, coconut"       b    m    p 

     03     "pineapple, pear, banana"      c    s    o

     04     "banana, apple, coconut"       d    f    v 

     05     "pineapple, pear, banana"      r    ñ    t

What I am trying to achieve is a third dataframe with the following structure:

Dataframe C:

     date   Fruit        a    b    c    d    x    y    z

     01     "apple"      0    3    5    1    a    n    q

     03     "apple"      0    3    5    1    0    0    0

     02     "banana"     1    4    3    5    b    m    p

     04     "banana"     1    4    3    5    d    f    v

     03     "pineapple"  2    6    4    6    c    s    o

     05     "pineapple"  2    6    4    6    r    ñ    t

      ...

I had already tried something like:

test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)

Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?

Fruit and Fruits columns type is string.

edited Nov 23 '18 at 14:11

asked Nov 22 '18 at 23:13

Santiago Hernàndez

185

I am trying to merge two huge dataframes (4+ millions each) that have the following structure:

Dataframe A:

     date    Fruit        a    b    c    d

     01      "apple"      0    3    5    1

     03      "apple"      8    2    7    2

     02      "banana"     1    4    3    5

     04      "banana"     3    5    2    6

     03      "pineapple"  2    6    4    6

     05      "pineapple"  3    5    7    9

Dataframe B:

     date   Fruits                         x    y    z 

     01     "apple, pear, strawberry"      a    n    q 

     02     "banana, apple, coconut"       b    m    p 

     03     "pineapple, pear, banana"      c    s    o

     04     "banana, apple, coconut"       d    f    v 

     05     "pineapple, pear, banana"      r    ñ    t

What I am trying to achieve is a third dataframe with the following structure:

Dataframe C:

     date   Fruit        a    b    c    d    x    y    z

     01     "apple"      0    3    5    1    a    n    q

     03     "apple"      0    3    5    1    0    0    0

     02     "banana"     1    4    3    5    b    m    p

     04     "banana"     1    4    3    5    d    f    v

     03     "pineapple"  2    6    4    6    c    s    o

     05     "pineapple"  2    6    4    6    r    ñ    t

      ...

I had already tried something like:

test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)

Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?

Fruit and Fruits columns type is string.

python pandas dataframe merge

edited Nov 23 '18 at 14:11

asked Nov 22 '18 at 23:13

Santiago Hernàndez

185

edited Nov 23 '18 at 14:11

asked Nov 22 '18 at 23:13

Santiago Hernàndez

185

edited Nov 23 '18 at 14:11

asked Nov 22 '18 at 23:13

Santiago Hernàndez

185

asked Nov 22 '18 at 23:13

Santiago Hernàndez

185

asked Nov 22 '18 at 23:13

Santiago Hernàndez

185

What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

– smci
Nov 23 '18 at 1:03

1

df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

– smci
Nov 23 '18 at 1:13

@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

– Santiago Hernàndez
Nov 23 '18 at 14:11

add a comment |

What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

– smci
Nov 23 '18 at 1:03

1

df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

– smci
Nov 23 '18 at 1:13

@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

– Santiago Hernàndez
Nov 23 '18 at 14:11

What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

– smci
Nov 23 '18 at 1:03

df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

– smci
Nov 23 '18 at 1:13

@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

– Santiago Hernàndez
Nov 23 '18 at 14:11

add a comment |

1 Answer
1

active

oldest

votes

Use:

print (df_A)



   date      Fruit  a  b  c  d

0     1      apple  0  3  5  1

1     3      apple  8  2  7  2

2     2     banana  1  4  3  5

3     4     banana  3  5  2  6

4     3  pineapple  2  6  4  6

5     5  pineapple  3  5  7  9



print (df_B)



   date                   Fruits  x  y  z

0     1  apple, pear, strawberry  a  n  q

1     2   banana, apple, coconut  b  m  p

2     3  pineapple, pear, banana  c  s  o

3     4   banana, apple, coconut  d  f  v

4     5  pineapple, pear, banana  r  ñ  t

import pandas as pd

import numpy as np



# Split the strings into list.

df_B.Fruits = df_B.Fruits.str.split(', ')



# reindex and repeat on length of list

temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)



temp['Fruit'] = np.concatenate(df_B.Fruits.values)



df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)



print (df_C)



   date      Fruit  a  b  c  d  x  y  z

0     1      apple  0  3  5  1  a  n  q

1     3      apple  8  2  7  2  0  0  0

2     2     banana  1  4  3  5  b  m  p

3     4     banana  3  5  2  6  d  f  v

4     3  pineapple  2  6  4  6  c  s  o

5     5  pineapple  3  5  7  9  r  ñ  t

edited Nov 23 '18 at 0:55

answered Nov 23 '18 at 0:33

Abhi

2,539422

1

Thanks it's working!

– Santiago Hernàndez
Nov 23 '18 at 20:15

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53438964%2fmerge-two-dataframes-on-string-column-compound-string-column%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Use:

print (df_A)



   date      Fruit  a  b  c  d

0     1      apple  0  3  5  1

1     3      apple  8  2  7  2

2     2     banana  1  4  3  5

3     4     banana  3  5  2  6

4     3  pineapple  2  6  4  6

5     5  pineapple  3  5  7  9



print (df_B)



   date                   Fruits  x  y  z

0     1  apple, pear, strawberry  a  n  q

1     2   banana, apple, coconut  b  m  p

2     3  pineapple, pear, banana  c  s  o

3     4   banana, apple, coconut  d  f  v

4     5  pineapple, pear, banana  r  ñ  t

import pandas as pd

import numpy as np



# Split the strings into list.

df_B.Fruits = df_B.Fruits.str.split(', ')



# reindex and repeat on length of list

temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)



temp['Fruit'] = np.concatenate(df_B.Fruits.values)



df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)



print (df_C)



   date      Fruit  a  b  c  d  x  y  z

0     1      apple  0  3  5  1  a  n  q

1     3      apple  8  2  7  2  0  0  0

2     2     banana  1  4  3  5  b  m  p

3     4     banana  3  5  2  6  d  f  v

4     3  pineapple  2  6  4  6  c  s  o

5     5  pineapple  3  5  7  9  r  ñ  t

edited Nov 23 '18 at 0:55

answered Nov 23 '18 at 0:33

Abhi

2,539422

1

Thanks it's working!

– Santiago Hernàndez
Nov 23 '18 at 20:15

add a comment |

Use:

print (df_A)



   date      Fruit  a  b  c  d

0     1      apple  0  3  5  1

1     3      apple  8  2  7  2

2     2     banana  1  4  3  5

3     4     banana  3  5  2  6

4     3  pineapple  2  6  4  6

5     5  pineapple  3  5  7  9



print (df_B)



   date                   Fruits  x  y  z

0     1  apple, pear, strawberry  a  n  q

1     2   banana, apple, coconut  b  m  p

2     3  pineapple, pear, banana  c  s  o

3     4   banana, apple, coconut  d  f  v

4     5  pineapple, pear, banana  r  ñ  t

import pandas as pd

import numpy as np



# Split the strings into list.

df_B.Fruits = df_B.Fruits.str.split(', ')



# reindex and repeat on length of list

temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)



temp['Fruit'] = np.concatenate(df_B.Fruits.values)



df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)



print (df_C)



   date      Fruit  a  b  c  d  x  y  z

0     1      apple  0  3  5  1  a  n  q

1     3      apple  8  2  7  2  0  0  0

2     2     banana  1  4  3  5  b  m  p

3     4     banana  3  5  2  6  d  f  v

4     3  pineapple  2  6  4  6  c  s  o

5     5  pineapple  3  5  7  9  r  ñ  t

edited Nov 23 '18 at 0:55

answered Nov 23 '18 at 0:33

Abhi

2,539422

1

Thanks it's working!

– Santiago Hernàndez
Nov 23 '18 at 20:15

add a comment |

Use:

print (df_A)



   date      Fruit  a  b  c  d

0     1      apple  0  3  5  1

1     3      apple  8  2  7  2

2     2     banana  1  4  3  5

3     4     banana  3  5  2  6

4     3  pineapple  2  6  4  6

5     5  pineapple  3  5  7  9



print (df_B)



   date                   Fruits  x  y  z

0     1  apple, pear, strawberry  a  n  q

1     2   banana, apple, coconut  b  m  p

2     3  pineapple, pear, banana  c  s  o

3     4   banana, apple, coconut  d  f  v

4     5  pineapple, pear, banana  r  ñ  t

import pandas as pd

import numpy as np



# Split the strings into list.

df_B.Fruits = df_B.Fruits.str.split(', ')



# reindex and repeat on length of list

temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)



temp['Fruit'] = np.concatenate(df_B.Fruits.values)



df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)



print (df_C)



   date      Fruit  a  b  c  d  x  y  z

0     1      apple  0  3  5  1  a  n  q

1     3      apple  8  2  7  2  0  0  0

2     2     banana  1  4  3  5  b  m  p

3     4     banana  3  5  2  6  d  f  v

4     3  pineapple  2  6  4  6  c  s  o

5     5  pineapple  3  5  7  9  r  ñ  t

edited Nov 23 '18 at 0:55

answered Nov 23 '18 at 0:33

Abhi

2,539422

Use:

print (df_A)



   date      Fruit  a  b  c  d

0     1      apple  0  3  5  1

1     3      apple  8  2  7  2

2     2     banana  1  4  3  5

3     4     banana  3  5  2  6

4     3  pineapple  2  6  4  6

5     5  pineapple  3  5  7  9



print (df_B)



   date                   Fruits  x  y  z

0     1  apple, pear, strawberry  a  n  q

1     2   banana, apple, coconut  b  m  p

2     3  pineapple, pear, banana  c  s  o

3     4   banana, apple, coconut  d  f  v

4     5  pineapple, pear, banana  r  ñ  t

import pandas as pd

import numpy as np



# Split the strings into list.

df_B.Fruits = df_B.Fruits.str.split(', ')



# reindex and repeat on length of list

temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)



temp['Fruit'] = np.concatenate(df_B.Fruits.values)



df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)



print (df_C)



   date      Fruit  a  b  c  d  x  y  z

0     1      apple  0  3  5  1  a  n  q

1     3      apple  8  2  7  2  0  0  0

2     2     banana  1  4  3  5  b  m  p

3     4     banana  3  5  2  6  d  f  v

4     3  pineapple  2  6  4  6  c  s  o

5     5  pineapple  3  5  7  9  r  ñ  t

edited Nov 23 '18 at 0:55

answered Nov 23 '18 at 0:33

Abhi

2,539422

edited Nov 23 '18 at 0:55

answered Nov 23 '18 at 0:33

Abhi

2,539422

answered Nov 23 '18 at 0:33

Abhi

2,539422

answered Nov 23 '18 at 0:33

Abhi

2,539422

1

Thanks it's working!

– Santiago Hernàndez
Nov 23 '18 at 20:15

add a comment |

1

Thanks it's working!

– Santiago Hernàndez
Nov 23 '18 at 20:15

Thanks it's working!

– Santiago Hernàndez
Nov 23 '18 at 20:15

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Cfrgtkky