Merge two dataframes on string column - compound string column





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}







3















I am trying to merge two huge dataframes (4+ millions each) that have the following structure:



Dataframe A:



     date    Fruit        a    b    c    d
01 "apple" 0 3 5 1
03 "apple" 8 2 7 2
02 "banana" 1 4 3 5
04 "banana" 3 5 2 6
03 "pineapple" 2 6 4 6
05 "pineapple" 3 5 7 9


Dataframe B:



     date   Fruits                         x    y    z 
01 "apple, pear, strawberry" a n q
02 "banana, apple, coconut" b m p
03 "pineapple, pear, banana" c s o
04 "banana, apple, coconut" d f v
05 "pineapple, pear, banana" r ñ t


What I am trying to achieve is a third dataframe with the following structure:



Dataframe C:



     date   Fruit        a    b    c    d    x    y    z
01 "apple" 0 3 5 1 a n q
03 "apple" 0 3 5 1 0 0 0
02 "banana" 1 4 3 5 b m p
04 "banana" 1 4 3 5 d f v
03 "pineapple" 2 6 4 6 c s o
05 "pineapple" 2 6 4 6 r ñ t
...


I had already tried something like:



test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)


But my kernel breaks, I also had tried using a for cycle to expand the fruit column of B dataframe into a 'fruit-b' column, keeping the data from the other B columns and then merging between the date column and the 'fruit-B' columns, but the time of execution is too high.



Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?



Fruit and Fruits columns type is string.










share|improve this question

























  • What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

    – smci
    Nov 23 '18 at 1:03






  • 1





    df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

    – smci
    Nov 23 '18 at 1:13













  • @smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

    – Santiago Hernàndez
    Nov 23 '18 at 14:11


















3















I am trying to merge two huge dataframes (4+ millions each) that have the following structure:



Dataframe A:



     date    Fruit        a    b    c    d
01 "apple" 0 3 5 1
03 "apple" 8 2 7 2
02 "banana" 1 4 3 5
04 "banana" 3 5 2 6
03 "pineapple" 2 6 4 6
05 "pineapple" 3 5 7 9


Dataframe B:



     date   Fruits                         x    y    z 
01 "apple, pear, strawberry" a n q
02 "banana, apple, coconut" b m p
03 "pineapple, pear, banana" c s o
04 "banana, apple, coconut" d f v
05 "pineapple, pear, banana" r ñ t


What I am trying to achieve is a third dataframe with the following structure:



Dataframe C:



     date   Fruit        a    b    c    d    x    y    z
01 "apple" 0 3 5 1 a n q
03 "apple" 0 3 5 1 0 0 0
02 "banana" 1 4 3 5 b m p
04 "banana" 1 4 3 5 d f v
03 "pineapple" 2 6 4 6 c s o
05 "pineapple" 2 6 4 6 r ñ t
...


I had already tried something like:



test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)


But my kernel breaks, I also had tried using a for cycle to expand the fruit column of B dataframe into a 'fruit-b' column, keeping the data from the other B columns and then merging between the date column and the 'fruit-B' columns, but the time of execution is too high.



Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?



Fruit and Fruits columns type is string.










share|improve this question

























  • What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

    – smci
    Nov 23 '18 at 1:03






  • 1





    df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

    – smci
    Nov 23 '18 at 1:13













  • @smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

    – Santiago Hernàndez
    Nov 23 '18 at 14:11














3












3








3


0






I am trying to merge two huge dataframes (4+ millions each) that have the following structure:



Dataframe A:



     date    Fruit        a    b    c    d
01 "apple" 0 3 5 1
03 "apple" 8 2 7 2
02 "banana" 1 4 3 5
04 "banana" 3 5 2 6
03 "pineapple" 2 6 4 6
05 "pineapple" 3 5 7 9


Dataframe B:



     date   Fruits                         x    y    z 
01 "apple, pear, strawberry" a n q
02 "banana, apple, coconut" b m p
03 "pineapple, pear, banana" c s o
04 "banana, apple, coconut" d f v
05 "pineapple, pear, banana" r ñ t


What I am trying to achieve is a third dataframe with the following structure:



Dataframe C:



     date   Fruit        a    b    c    d    x    y    z
01 "apple" 0 3 5 1 a n q
03 "apple" 0 3 5 1 0 0 0
02 "banana" 1 4 3 5 b m p
04 "banana" 1 4 3 5 d f v
03 "pineapple" 2 6 4 6 c s o
05 "pineapple" 2 6 4 6 r ñ t
...


I had already tried something like:



test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)


But my kernel breaks, I also had tried using a for cycle to expand the fruit column of B dataframe into a 'fruit-b' column, keeping the data from the other B columns and then merging between the date column and the 'fruit-B' columns, but the time of execution is too high.



Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?



Fruit and Fruits columns type is string.










share|improve this question
















I am trying to merge two huge dataframes (4+ millions each) that have the following structure:



Dataframe A:



     date    Fruit        a    b    c    d
01 "apple" 0 3 5 1
03 "apple" 8 2 7 2
02 "banana" 1 4 3 5
04 "banana" 3 5 2 6
03 "pineapple" 2 6 4 6
05 "pineapple" 3 5 7 9


Dataframe B:



     date   Fruits                         x    y    z 
01 "apple, pear, strawberry" a n q
02 "banana, apple, coconut" b m p
03 "pineapple, pear, banana" c s o
04 "banana, apple, coconut" d f v
05 "pineapple, pear, banana" r ñ t


What I am trying to achieve is a third dataframe with the following structure:



Dataframe C:



     date   Fruit        a    b    c    d    x    y    z
01 "apple" 0 3 5 1 a n q
03 "apple" 0 3 5 1 0 0 0
02 "banana" 1 4 3 5 b m p
04 "banana" 1 4 3 5 d f v
03 "pineapple" 2 6 4 6 c s o
05 "pineapple" 2 6 4 6 r ñ t
...


I had already tried something like:



test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)


But my kernel breaks, I also had tried using a for cycle to expand the fruit column of B dataframe into a 'fruit-b' column, keeping the data from the other B columns and then merging between the date column and the 'fruit-B' columns, but the time of execution is too high.



Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?



Fruit and Fruits columns type is string.







python pandas dataframe merge






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 23 '18 at 14:11







Santiago Hernàndez

















asked Nov 22 '18 at 23:13









Santiago HernàndezSantiago Hernàndez

185




185













  • What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

    – smci
    Nov 23 '18 at 1:03






  • 1





    df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

    – smci
    Nov 23 '18 at 1:13













  • @smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

    – Santiago Hernàndez
    Nov 23 '18 at 14:11



















  • What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

    – smci
    Nov 23 '18 at 1:03






  • 1





    df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

    – smci
    Nov 23 '18 at 1:13













  • @smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

    – Santiago Hernàndez
    Nov 23 '18 at 14:11

















What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

– smci
Nov 23 '18 at 1:03





What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.

– smci
Nov 23 '18 at 1:03




1




1





df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

– smci
Nov 23 '18 at 1:13







df_B.Fruits is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"

– smci
Nov 23 '18 at 1:13















@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

– Santiago Hernàndez
Nov 23 '18 at 14:11





@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.

– Santiago Hernàndez
Nov 23 '18 at 14:11












1 Answer
1






active

oldest

votes


















0














Use:



print (df_A)

date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9

print (df_B)

date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t




import pandas as pd
import numpy as np

# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')

# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)

temp['Fruit'] = np.concatenate(df_B.Fruits.values)

df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)

print (df_C)

date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t





share|improve this answer





















  • 1





    Thanks it's working!

    – Santiago Hernàndez
    Nov 23 '18 at 20:15












Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53438964%2fmerge-two-dataframes-on-string-column-compound-string-column%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0














Use:



print (df_A)

date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9

print (df_B)

date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t




import pandas as pd
import numpy as np

# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')

# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)

temp['Fruit'] = np.concatenate(df_B.Fruits.values)

df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)

print (df_C)

date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t





share|improve this answer





















  • 1





    Thanks it's working!

    – Santiago Hernàndez
    Nov 23 '18 at 20:15
















0














Use:



print (df_A)

date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9

print (df_B)

date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t




import pandas as pd
import numpy as np

# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')

# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)

temp['Fruit'] = np.concatenate(df_B.Fruits.values)

df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)

print (df_C)

date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t





share|improve this answer





















  • 1





    Thanks it's working!

    – Santiago Hernàndez
    Nov 23 '18 at 20:15














0












0








0







Use:



print (df_A)

date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9

print (df_B)

date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t




import pandas as pd
import numpy as np

# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')

# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)

temp['Fruit'] = np.concatenate(df_B.Fruits.values)

df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)

print (df_C)

date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t





share|improve this answer















Use:



print (df_A)

date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9

print (df_B)

date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t




import pandas as pd
import numpy as np

# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')

# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)

temp['Fruit'] = np.concatenate(df_B.Fruits.values)

df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)

print (df_C)

date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 23 '18 at 0:55

























answered Nov 23 '18 at 0:33









AbhiAbhi

2,539422




2,539422








  • 1





    Thanks it's working!

    – Santiago Hernàndez
    Nov 23 '18 at 20:15














  • 1





    Thanks it's working!

    – Santiago Hernàndez
    Nov 23 '18 at 20:15








1




1





Thanks it's working!

– Santiago Hernàndez
Nov 23 '18 at 20:15





Thanks it's working!

– Santiago Hernàndez
Nov 23 '18 at 20:15




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53438964%2fmerge-two-dataframes-on-string-column-compound-string-column%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Biblatex bibliography style without URLs when DOI exists (in Overleaf with Zotero bibliography)

ComboBox Display Member on multiple fields

Is it possible to collect Nectar points via Trainline?