Parsing big XML files efficiently
I'm dealing with XML files that are 4GB+ in size and wondering how best to parse them. Right now I run into memory issues, so I'm looking for a way to avoid loading the whole file into memory, perhaps by going through it in batches.
My current code uses lxml and iterates over the repeating elements. Namespaces are stripped up front:
from lxml import etree, objectify
import pandas as pd

file = 'some_huge_file.xml'

if file.lower().endswith('.xml'):
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(file, parser)  # loads the whole document into memory
    root = tree.getroot()

    #### strip namespaces
    for elem in root.iter():
        # skip comments and processing instructions, whose tag is not a string
        if not hasattr(elem.tag, 'find'):
            continue
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i + 1:]
    objectify.deannotate(root, cleanup_namespaces=True)
    ####

    data = [{
        'Element1': tp.findtext('element1'),
        'Element2': tp.findtext('element2'),
        'Element3': tp.findtext('element3'),
    } for tp in tree.xpath('//mainelement/subelement')]

    df = pd.DataFrame(data)
    print(df)
Furthermore, I need to split the element values, as they are space-separated. However, I only need specific values, so I'm wondering whether I can pick them out during parsing instead of splitting the columns on spaces afterwards.
xml example:
<mainelement>
    <subelement tc="00:00:00:000" ms="0">
        <element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
        <element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
        <element3>-0.001501 0.008416 0.000098 0.005241 0.005301</element3>
    </subelement>
    <subelement tc="00:00:00:001" ms="1">
        <element1>-0.503814 0.005664 -0.070326 -0.860926 -0.503720</element1>
        <element2>-0.044658 0.046381 0.909291 -0.033390 0.049348</element2>
        <element3>-0.000000 -0.000000 -0.000000 -0.005217 0.007849</element3>
    </subelement>
    <subelement tc="00:00:00:002" ms="2">
        <element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
        <element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
        <element3>-0.010139 0.001048 0.026847 -0.010139 0.001048</element3>
    </subelement>
    <subelement tc="00:00:00:003" ms="3">
        <element1>0.856813 0.494664 0.003921 0.023356 0.868762</element1>
        <element2>-0.030966 0.059466 1.388910 -0.152636 -0.008650</element2>
        <element3>0.001048 0.026847 -0.010139 0.001048 0.035846</element3>
    </subelement>
</mainelement>
python lxml
I don't know about Python XML handling, but in the Java world there are two approaches: SAX and DOM. DOM stands for "Document Object Model" and means that the whole XML file is loaded into memory so that it can be queried afterwards. Once loaded, it is very quick, but it can consume huge amounts of memory. SAX, by contrast, streams through the XML and fires an event whenever a tag, attribute, or piece of content is reached. This might take quite a while, but it takes almost no memory. So, when you say "efficient", do you mean speed or memory efficiency?
– Dominique
Nov 22 '18 at 12:29
Speed is not important at this point, I'd say. I just need to get the data into a database for now, and if I try to read all the elements and values I need, Python just crashes or errors out (and macOS starts force-quitting applications ;) ).
– Chrisvdberge
Nov 22 '18 at 12:37
Did you check this?
– Andersson
Nov 22 '18 at 12:41
Possible duplicate of xml parsing in python for big data
– stovfl
Nov 22 '18 at 13:31
Those links were quite helpful. The possible duplicate pointed in the right direction but didn't provide a clear, concrete answer, so I added the code I came up with as an answer to this question for clarity.
– Chrisvdberge
Nov 23 '18 at 12:05
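For reference, the event-driven (SAX) style described in the first comment is also available in Python's standard library as `xml.sax`. The following is only a minimal sketch of that idea, not code from this thread: the `SubelementHandler` class, the choice to keep just the first value of `element1`, and the inline `SAMPLE` data are assumptions made for illustration.

```python
import xml.sax


class SubelementHandler(xml.sax.ContentHandler):
    """Collect the 'tc' attribute and the first value of <element1> per <subelement>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._tc = None
        self._buffer = None  # list of text chunks while inside <element1>

    def startElement(self, name, attrs):
        if name == 'subelement':
            self._tc = attrs.get('tc')
        elif name == 'element1':
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == 'element1':
            first = ''.join(self._buffer).split()[0]
            self.rows.append((self._tc, float(first)))
            self._buffer = None


# Hypothetical sample in the shape of the question's XML.
SAMPLE = b"""<mainelement>
  <subelement tc="00:00:00:000" ms="0">
    <element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
  </subelement>
  <subelement tc="00:00:00:001" ms="1">
    <element1>-0.503814 0.005664 -0.070326 -0.860926 -0.503720</element1>
  </subelement>
</mainelement>"""

handler = SubelementHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.rows)
```

For a file on disk, `xml.sax.parse(path, handler)` would take the place of `parseString`. Only the handler's own state is kept in memory, at the cost of tracking by hand which element the parser is currently inside.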
asked Nov 22 '18 at 12:13
Chrisvdberge
1 Answer
Based on the links posted in the comments, I came up with the following, which iterates over the file and splits the values as it goes. It works fine:
from lxml import etree
import pandas as pd

file = 'some_huge_file.xml'

time = []
data1_element1_x = []
data1_element1_y = []
data1_element2_x = []
data1_element2_y = []
data2_element1_x = []
data2_element1_y = []
data2_element2_x = []
data2_element2_y = []

if file.lower().endswith('.xml'):
    # iterparse reports each element once its closing tag has been read
    for event, elem in etree.iterparse(file):
        if elem.tag == "subelement":
            time.append(elem.get('tc'))
            for child in elem:
                if child.tag == "element1":
                    # split() without arguments tolerates leading or repeated spaces
                    split_data = child.text.split()
                    data1_element1_x.append(float(split_data[0]))
                    data1_element1_y.append(float(split_data[1]))
                    data2_element1_x.append(float(split_data[2]))
                    data2_element1_y.append(float(split_data[3]))
                elif child.tag == "element2":
                    split_data = child.text.split()
                    data1_element2_x.append(float(split_data[0]))
                    data1_element2_y.append(float(split_data[1]))
                    data2_element2_x.append(float(split_data[2]))
                    data2_element2_y.append(float(split_data[3]))
            # release the processed subelement's children and text
            elem.clear()

df = pd.DataFrame({
    'Time': time,
    'Data1_element1_x': data1_element1_x,
    'Data1_element1_y': data1_element1_y,
    'Data1_element2_x': data1_element2_x,
    'Data1_element2_y': data1_element2_y,
    'Data2_element1_x': data2_element1_x,
    'Data2_element1_y': data2_element1_y,
    'Data2_element2_x': data2_element2_x,
    'Data2_element2_y': data2_element2_y
})
print(df)
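The lists above still grow with the size of the file. Since the stated goal is to get the data into a database, one possible variation is to yield one row per `subelement` and write the rows in fixed-size batches. This is a rough sketch under several assumptions, not a tested solution for the real files: it uses the standard library's `xml.etree.ElementTree.iterparse` (similar in spirit to lxml's) so that it runs without extra packages, an SQLite target, and the hypothetical names `local_name`, `iter_rows`, `load_in_batches`, `samples`, and `wanted`.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET
from itertools import islice


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rpartition('}')[2]


def iter_rows(source, wanted=(0, 1)):
    """Yield (tc, element1 values..., element2 values...) per <subelement>."""
    path = []  # stack of currently open elements, used to find the parent
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            path.append(elem)
            continue
        path.pop()
        if local_name(elem.tag) != 'subelement':
            continue
        values = {local_name(c.tag): (c.text or '').split() for c in elem}
        row = [elem.get('tc')]
        for name in ('element1', 'element2'):
            parts = values.get(name, [])
            row.extend(float(parts[i]) for i in wanted)  # keep only wanted positions
        yield tuple(row)
        if path:
            path[-1].remove(elem)  # detach from the parent so memory stays flat


def load_in_batches(source, conn, batch_size=1000):
    """Insert rows into a hypothetical 'samples' table, batch_size rows at a time."""
    conn.execute('CREATE TABLE IF NOT EXISTS samples '
                 '(tc TEXT, e1_a REAL, e1_b REAL, e2_a REAL, e2_b REAL)')
    rows = iter_rows(source)
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        conn.executemany('INSERT INTO samples VALUES (?, ?, ?, ?, ?)', batch)
        conn.commit()
        total += len(batch)
    return total


# Hypothetical sample in the shape of the question's XML.
SAMPLE = b"""<mainelement>
  <subelement tc="00:00:00:000" ms="0">
    <element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
    <element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
  </subelement>
  <subelement tc="00:00:00:002" ms="2">
    <element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
    <element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
  </subelement>
</mainelement>"""

conn = sqlite3.connect(':memory:')
inserted = load_in_batches(io.BytesIO(SAMPLE), conn, batch_size=1)
print(inserted, conn.execute('SELECT * FROM samples').fetchall())
```

With a real file, the path string can be passed in place of the `io.BytesIO` object. Matching on the local tag name also makes the separate namespace-stripping pass from the question unnecessary in this sketch.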
answered Nov 23 '18 at 12:04
Chrisvdberge