Nested XML with all data as attributes to pandas dataframe, only final entry is being imported
Edit: Issue resolved, updated code below.
I've spent several days scouring the internet trying to figure this out and I can't seem to get anywhere.
Basically I have a heavily nested XML file with all the info I want to extract included as attributes, example below:
<TotalAllocationResultDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DtdRelease="0" DtdVersion="4" xsi:noNamespaceSchemaLocation="total-allocation-result-document.xsd">
<DocumentIdentification v="A_R-GB-NI-Q-BASE-------190101-01"/>
<DocumentVersion v="1"/>
<DocumentType v="A25"/>
<SenderIdentification codingScheme="A01" v="10X1001A1001A57U"/>
<SenderRole v="A07"/>
<ReceiverIdentification codingScheme="A01" v="10XNITSO-12345-O"/>
<ReceiverRole v="A04"/>
<CreationDateTime v="2018-11-13T12:55:08Z"/>
<BidTimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Domain codingScheme="A01" v="10YDOM-1010A024Y"/>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888599"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00001-GB-NI"/>
<BiddingParty codingScheme="A01" v="xxxxxxxxxx"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="2.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888602"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00004-GB-NI"/>
<BiddingParty codingScheme="A01" v="yyyyyyyyyy"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="1.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
</TotalAllocationResultDocument>
I'm trying to get this data into a pandas dataframe, with columns something like:
SenderIdentification | DocumentIdentification | InArea | BidID | AllocatedQty
The code I've got is based on a load of different sources, but looks like:
import pandas as pd
import xml.etree.cElementTree as et
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml =et.parse("example.xml")
dfcols = ['SenderIdentification', 'DocumentIdentification', 'InArea', 'BidID', 'AllocatedQty']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
for SenderIdentificationFind in parsed_xml.findall('./SenderIdentification'):
SenderIdentification = SenderIdentificationFind.get('v')
for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')
for InAreaFind in parsed_xml.findall('./AllocationTimeSeries/InArea'):
InArea = InAreaFind.get('v')
for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/BidIdentification'):
BidID = BidIDFind.get('v')
for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/Period/Interval/Qty'):
AllocatedQty = BidIDFind.get('v')
df_xml = df_xml.append(
pd.Series([SenderIdentification, DocumentIdentification, InArea, BidID, AllocatedQty], index=dfcols),
ignore_index=True)
return df_xml
main()
df_xml = main()
This is creating a dataframe, but only seems to be pulling data from the last "AllocationTimeSeries" entry in the XML, and seems to have far more entries than the total number of "AllocationTimeSeries" entries in the document (eg. I would expect two unique rows based on this example data, but I'm actually seeing 12 identical rows).
I'm quite confused so any help as to how to get to the dataframe I'm after would be appreciated.
I've also tried the approach of using an XLST file to simplify the XLS file to an easier format to parse, but I don't really have the knowledge of XLS to get that to work either.
Thanks!
Code that works:
import pandas as pd
import lxml.etree as et
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = et.parse("example.xml")
dfcols = ['DocumentIdentification', 'DocumentVersion', 'CreationDateTime', 'BidTimeInterval', 'BidIdentification', 'BiddingParty', 'OutArea', 'InArea', 'ContractIdentification', 'BidPrice', ' BidQuantity', 'ClearingPrice', 'AllocatedQuantity']
df_xml = pd.DataFrame(columns=dfcols)
root = parsed_xml.getroot()
result = len(root.findall("./AllocationTimeSeries"))+1
for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')
for DocumentVersionFind in parsed_xml.findall('./DocumentVersion'):
DocumentVersion = DocumentVersionFind.get('v')
for CreationDateTimeFind in parsed_xml.findall('./CreationDateTime'):
CreationDateTime = CreationDateTimeFind.get('v')
for BidTimeIntervalFind in parsed_xml.findall('./BidTimeInterval'):
BidTimeInterval = BidTimeIntervalFind.get('v')
for i in range(1,result):
BidIdentificationString = './AllocationTimeSeries[%d]/BidIdentification'%(i)
BiddingPartyString = './AllocationTimeSeries[%d]/BiddingParty'%(i)
InAreaString = './AllocationTimeSeries[%d]/InArea'%(i)
OutAreaString = './AllocationTimeSeries[%d]/OutArea'%(i)
ContractIdentificationString = './AllocationTimeSeries[%d]/ContractIdentification'%(i)
BidPriceString = './AllocationTimeSeries[%d]/Period/Interval/BidPriceAmount'%(i)
BidQuantityString = './AllocationTimeSeries[%d]/Period/Interval/BidQty'%(i)
ClearingPriceString = './AllocationTimeSeries[%d]/Period/Interval/PriceAmount'%(i)
AllocatedQuantityString = './AllocationTimeSeries[%d]/Period/Interval/Qty'%(i)
for BidIdentificationFind in parsed_xml.findall(BidIdentificationString):
BidIdentification = BidIdentificationFind.get('v')
for BiddingPartyFind in parsed_xml.findall(BiddingPartyString):
BiddingParty = BiddingPartyFind.get('v')
for InAreaFind in parsed_xml.findall(InAreaString):
InArea = InAreaFind.get('v')
for OutAreaFind in parsed_xml.findall(OutAreaString):
OutArea = OutAreaFind.get('v')
for ContractIdentificationFind in parsed_xml.findall(ContractIdentificationString):
ContractIdentification = ContractIdentificationFind.get('v')
for BidPriceFind in parsed_xml.findall(BidPriceString):
BidPrice = BidPriceFind.get('v')
for BidQuantityFind in parsed_xml.findall(BidQuantityString):
BidQuantity = BidQuantityFind.get('v')
for ClearingPriceFind in parsed_xml.findall(ClearingPriceString):
ClearingPrice = ClearingPriceFind.get('v')
for AllocatedQuantityFind in parsed_xml.findall(AllocatedQuantityString):
AllocatedQuantity = AllocatedQuantityFind.get('v')
df_xml = df_xml.append(
pd.Series([DocumentIdentification, DocumentVersion, CreationDateTime, BidTimeInterval, BidIdentification, BiddingParty, OutArea, InArea, ContractIdentification, BidPrice, BidQuantity, ClearingPrice, AllocatedQuantity], index=dfcols),
ignore_index=True)
return df_xml
main()
df_xml = main()
df_xml
python xml pandas
add a comment |
Edit: Issue resolved, updated code below.
I've spent several days scouring the internet trying to figure this out and I can't seem to get anywhere.
Basically I have a heavily nested XML file with all the info I want to extract included as attributes, example below:
<TotalAllocationResultDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DtdRelease="0" DtdVersion="4" xsi:noNamespaceSchemaLocation="total-allocation-result-document.xsd">
<DocumentIdentification v="A_R-GB-NI-Q-BASE-------190101-01"/>
<DocumentVersion v="1"/>
<DocumentType v="A25"/>
<SenderIdentification codingScheme="A01" v="10X1001A1001A57U"/>
<SenderRole v="A07"/>
<ReceiverIdentification codingScheme="A01" v="10XNITSO-12345-O"/>
<ReceiverRole v="A04"/>
<CreationDateTime v="2018-11-13T12:55:08Z"/>
<BidTimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Domain codingScheme="A01" v="10YDOM-1010A024Y"/>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888599"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00001-GB-NI"/>
<BiddingParty codingScheme="A01" v="xxxxxxxxxx"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="2.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888602"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00004-GB-NI"/>
<BiddingParty codingScheme="A01" v="yyyyyyyyyy"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="1.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
</TotalAllocationResultDocument>
I'm trying to get this data into a pandas dataframe, with columns something like:
SenderIdentification | DocumentIdentification | InArea | BidID | AllocatedQty
The code I've got is based on a load of different sources, but looks like:
import pandas as pd
import xml.etree.cElementTree as et
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml =et.parse("example.xml")
dfcols = ['SenderIdentification', 'DocumentIdentification', 'InArea', 'BidID', 'AllocatedQty']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
for SenderIdentificationFind in parsed_xml.findall('./SenderIdentification'):
SenderIdentification = SenderIdentificationFind.get('v')
for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')
for InAreaFind in parsed_xml.findall('./AllocationTimeSeries/InArea'):
InArea = InAreaFind.get('v')
for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/BidIdentification'):
BidID = BidIDFind.get('v')
for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/Period/Interval/Qty'):
AllocatedQty = BidIDFind.get('v')
df_xml = df_xml.append(
pd.Series([SenderIdentification, DocumentIdentification, InArea, BidID, AllocatedQty], index=dfcols),
ignore_index=True)
return df_xml
main()
df_xml = main()
This is creating a dataframe, but only seems to be pulling data from the last "AllocationTimeSeries" entry in the XML, and seems to have far more entries than the total number of "AllocationTimeSeries" entries in the document (eg. I would expect two unique rows based on this example data, but I'm actually seeing 12 identical rows).
I'm quite confused so any help as to how to get to the dataframe I'm after would be appreciated.
I've also tried the approach of using an XLST file to simplify the XLS file to an easier format to parse, but I don't really have the knowledge of XLS to get that to work either.
Thanks!
Code that works:
import pandas as pd
import lxml.etree as et
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = et.parse("example.xml")
dfcols = ['DocumentIdentification', 'DocumentVersion', 'CreationDateTime', 'BidTimeInterval', 'BidIdentification', 'BiddingParty', 'OutArea', 'InArea', 'ContractIdentification', 'BidPrice', ' BidQuantity', 'ClearingPrice', 'AllocatedQuantity']
df_xml = pd.DataFrame(columns=dfcols)
root = parsed_xml.getroot()
result = len(root.findall("./AllocationTimeSeries"))+1
for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')
for DocumentVersionFind in parsed_xml.findall('./DocumentVersion'):
DocumentVersion = DocumentVersionFind.get('v')
for CreationDateTimeFind in parsed_xml.findall('./CreationDateTime'):
CreationDateTime = CreationDateTimeFind.get('v')
for BidTimeIntervalFind in parsed_xml.findall('./BidTimeInterval'):
BidTimeInterval = BidTimeIntervalFind.get('v')
for i in range(1,result):
BidIdentificationString = './AllocationTimeSeries[%d]/BidIdentification'%(i)
BiddingPartyString = './AllocationTimeSeries[%d]/BiddingParty'%(i)
InAreaString = './AllocationTimeSeries[%d]/InArea'%(i)
OutAreaString = './AllocationTimeSeries[%d]/OutArea'%(i)
ContractIdentificationString = './AllocationTimeSeries[%d]/ContractIdentification'%(i)
BidPriceString = './AllocationTimeSeries[%d]/Period/Interval/BidPriceAmount'%(i)
BidQuantityString = './AllocationTimeSeries[%d]/Period/Interval/BidQty'%(i)
ClearingPriceString = './AllocationTimeSeries[%d]/Period/Interval/PriceAmount'%(i)
AllocatedQuantityString = './AllocationTimeSeries[%d]/Period/Interval/Qty'%(i)
for BidIdentificationFind in parsed_xml.findall(BidIdentificationString):
BidIdentification = BidIdentificationFind.get('v')
for BiddingPartyFind in parsed_xml.findall(BiddingPartyString):
BiddingParty = BiddingPartyFind.get('v')
for InAreaFind in parsed_xml.findall(InAreaString):
InArea = InAreaFind.get('v')
for OutAreaFind in parsed_xml.findall(OutAreaString):
OutArea = OutAreaFind.get('v')
for ContractIdentificationFind in parsed_xml.findall(ContractIdentificationString):
ContractIdentification = ContractIdentificationFind.get('v')
for BidPriceFind in parsed_xml.findall(BidPriceString):
BidPrice = BidPriceFind.get('v')
for BidQuantityFind in parsed_xml.findall(BidQuantityString):
BidQuantity = BidQuantityFind.get('v')
for ClearingPriceFind in parsed_xml.findall(ClearingPriceString):
ClearingPrice = ClearingPriceFind.get('v')
for AllocatedQuantityFind in parsed_xml.findall(AllocatedQuantityString):
AllocatedQuantity = AllocatedQuantityFind.get('v')
df_xml = df_xml.append(
pd.Series([DocumentIdentification, DocumentVersion, CreationDateTime, BidTimeInterval, BidIdentification, BiddingParty, OutArea, InArea, ContractIdentification, BidPrice, BidQuantity, ClearingPrice, AllocatedQuantity], index=dfcols),
ignore_index=True)
return df_xml
main()
df_xml = main()
df_xml
python xml pandas
add a comment |
Edit: Issue resolved, updated code below.
I've spent several days scouring the internet trying to figure this out and I can't seem to get anywhere.
Basically I have a heavily nested XML file with all the info I want to extract included as attributes, example below:
<TotalAllocationResultDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DtdRelease="0" DtdVersion="4" xsi:noNamespaceSchemaLocation="total-allocation-result-document.xsd">
<DocumentIdentification v="A_R-GB-NI-Q-BASE-------190101-01"/>
<DocumentVersion v="1"/>
<DocumentType v="A25"/>
<SenderIdentification codingScheme="A01" v="10X1001A1001A57U"/>
<SenderRole v="A07"/>
<ReceiverIdentification codingScheme="A01" v="10XNITSO-12345-O"/>
<ReceiverRole v="A04"/>
<CreationDateTime v="2018-11-13T12:55:08Z"/>
<BidTimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Domain codingScheme="A01" v="10YDOM-1010A024Y"/>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888599"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00001-GB-NI"/>
<BiddingParty codingScheme="A01" v="xxxxxxxxxx"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="2.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888602"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00004-GB-NI"/>
<BiddingParty codingScheme="A01" v="yyyyyyyyyy"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="1.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
</TotalAllocationResultDocument>
I'm trying to get this data into a pandas dataframe, with columns something like:
SenderIdentification | DocumentIdentification | InArea | BidID | AllocatedQty
The code I've got is based on a load of different sources, but looks like:
import pandas as pd
import xml.etree.cElementTree as et
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml =et.parse("example.xml")
dfcols = ['SenderIdentification', 'DocumentIdentification', 'InArea', 'BidID', 'AllocatedQty']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
for SenderIdentificationFind in parsed_xml.findall('./SenderIdentification'):
SenderIdentification = SenderIdentificationFind.get('v')
for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')
for InAreaFind in parsed_xml.findall('./AllocationTimeSeries/InArea'):
InArea = InAreaFind.get('v')
for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/BidIdentification'):
BidID = BidIDFind.get('v')
for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/Period/Interval/Qty'):
AllocatedQty = BidIDFind.get('v')
df_xml = df_xml.append(
pd.Series([SenderIdentification, DocumentIdentification, InArea, BidID, AllocatedQty], index=dfcols),
ignore_index=True)
return df_xml
main()
df_xml = main()
This is creating a dataframe, but only seems to be pulling data from the last "AllocationTimeSeries" entry in the XML, and seems to have far more entries than the total number of "AllocationTimeSeries" entries in the document (eg. I would expect two unique rows based on this example data, but I'm actually seeing 12 identical rows).
I'm quite confused so any help as to how to get to the dataframe I'm after would be appreciated.
I've also tried the approach of using an XLST file to simplify the XLS file to an easier format to parse, but I don't really have the knowledge of XLS to get that to work either.
Thanks!
Code that works:
import pandas as pd
import lxml.etree as et
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = et.parse("example.xml")
dfcols = ['DocumentIdentification', 'DocumentVersion', 'CreationDateTime', 'BidTimeInterval', 'BidIdentification', 'BiddingParty', 'OutArea', 'InArea', 'ContractIdentification', 'BidPrice', ' BidQuantity', 'ClearingPrice', 'AllocatedQuantity']
df_xml = pd.DataFrame(columns=dfcols)
root = parsed_xml.getroot()
result = len(root.findall("./AllocationTimeSeries"))+1
for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')
for DocumentVersionFind in parsed_xml.findall('./DocumentVersion'):
DocumentVersion = DocumentVersionFind.get('v')
for CreationDateTimeFind in parsed_xml.findall('./CreationDateTime'):
CreationDateTime = CreationDateTimeFind.get('v')
for BidTimeIntervalFind in parsed_xml.findall('./BidTimeInterval'):
BidTimeInterval = BidTimeIntervalFind.get('v')
for i in range(1,result):
BidIdentificationString = './AllocationTimeSeries[%d]/BidIdentification'%(i)
BiddingPartyString = './AllocationTimeSeries[%d]/BiddingParty'%(i)
InAreaString = './AllocationTimeSeries[%d]/InArea'%(i)
OutAreaString = './AllocationTimeSeries[%d]/OutArea'%(i)
ContractIdentificationString = './AllocationTimeSeries[%d]/ContractIdentification'%(i)
BidPriceString = './AllocationTimeSeries[%d]/Period/Interval/BidPriceAmount'%(i)
BidQuantityString = './AllocationTimeSeries[%d]/Period/Interval/BidQty'%(i)
ClearingPriceString = './AllocationTimeSeries[%d]/Period/Interval/PriceAmount'%(i)
AllocatedQuantityString = './AllocationTimeSeries[%d]/Period/Interval/Qty'%(i)
for BidIdentificationFind in parsed_xml.findall(BidIdentificationString):
BidIdentification = BidIdentificationFind.get('v')
for BiddingPartyFind in parsed_xml.findall(BiddingPartyString):
BiddingParty = BiddingPartyFind.get('v')
for InAreaFind in parsed_xml.findall(InAreaString):
InArea = InAreaFind.get('v')
for OutAreaFind in parsed_xml.findall(OutAreaString):
OutArea = OutAreaFind.get('v')
for ContractIdentificationFind in parsed_xml.findall(ContractIdentificationString):
ContractIdentification = ContractIdentificationFind.get('v')
for BidPriceFind in parsed_xml.findall(BidPriceString):
BidPrice = BidPriceFind.get('v')
for BidQuantityFind in parsed_xml.findall(BidQuantityString):
BidQuantity = BidQuantityFind.get('v')
for ClearingPriceFind in parsed_xml.findall(ClearingPriceString):
ClearingPrice = ClearingPriceFind.get('v')
for AllocatedQuantityFind in parsed_xml.findall(AllocatedQuantityString):
AllocatedQuantity = AllocatedQuantityFind.get('v')
df_xml = df_xml.append(
pd.Series([DocumentIdentification, DocumentVersion, CreationDateTime, BidTimeInterval, BidIdentification, BiddingParty, OutArea, InArea, ContractIdentification, BidPrice, BidQuantity, ClearingPrice, AllocatedQuantity], index=dfcols),
ignore_index=True)
return df_xml
main()
df_xml = main()
df_xml
python xml pandas
Edit: Issue resolved, updated code below.
I've spent several days scouring the internet trying to figure this out and I can't seem to get anywhere.
Basically I have a heavily nested XML file with all the info I want to extract included as attributes, example below:
<TotalAllocationResultDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DtdRelease="0" DtdVersion="4" xsi:noNamespaceSchemaLocation="total-allocation-result-document.xsd">
<DocumentIdentification v="A_R-GB-NI-Q-BASE-------190101-01"/>
<DocumentVersion v="1"/>
<DocumentType v="A25"/>
<SenderIdentification codingScheme="A01" v="10X1001A1001A57U"/>
<SenderRole v="A07"/>
<ReceiverIdentification codingScheme="A01" v="10XNITSO-12345-O"/>
<ReceiverRole v="A04"/>
<CreationDateTime v="2018-11-13T12:55:08Z"/>
<BidTimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Domain codingScheme="A01" v="10YDOM-1010A024Y"/>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888599"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00001-GB-NI"/>
<BiddingParty codingScheme="A01" v="xxxxxxxxxx"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="2.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888602"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00004-GB-NI"/>
<BiddingParty codingScheme="A01" v="yyyyyyyyyy"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="1.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
</TotalAllocationResultDocument>
I'm trying to get this data into a pandas dataframe, with columns something like:
SenderIdentification | DocumentIdentification | InArea | BidID | AllocatedQty
The code I've got is based on a load of different sources, but looks like:
import pandas as pd
import xml.etree.cElementTree as et
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml =et.parse("example.xml")
dfcols = ['SenderIdentification', 'DocumentIdentification', 'InArea', 'BidID', 'AllocatedQty']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
for SenderIdentificationFind in parsed_xml.findall('./SenderIdentification'):
SenderIdentification = SenderIdentificationFind.get('v')
for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')
for InAreaFind in parsed_xml.findall('./AllocationTimeSeries/InArea'):
InArea = InAreaFind.get('v')
for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/BidIdentification'):
BidID = BidIDFind.get('v')
for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/Period/Interval/Qty'):
AllocatedQty = BidIDFind.get('v')
df_xml = df_xml.append(
pd.Series([SenderIdentification, DocumentIdentification, InArea, BidID, AllocatedQty], index=dfcols),
ignore_index=True)
return df_xml
main()
df_xml = main()
This is creating a dataframe, but only seems to be pulling data from the last "AllocationTimeSeries" entry in the XML, and seems to have far more entries than the total number of "AllocationTimeSeries" entries in the document (eg. I would expect two unique rows based on this example data, but I'm actually seeing 12 identical rows).
I'm quite confused so any help as to how to get to the dataframe I'm after would be appreciated.
I've also tried the approach of using an XLST file to simplify the XLS file to an easier format to parse, but I don't really have the knowledge of XLS to get that to work either.
Thanks!
Code that works:
import pandas as pd
import lxml.etree as et
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = et.parse("example.xml")
dfcols = ['DocumentIdentification', 'DocumentVersion', 'CreationDateTime', 'BidTimeInterval', 'BidIdentification', 'BiddingParty', 'OutArea', 'InArea', 'ContractIdentification', 'BidPrice', ' BidQuantity', 'ClearingPrice', 'AllocatedQuantity']
df_xml = pd.DataFrame(columns=dfcols)
root = parsed_xml.getroot()
result = len(root.findall("./AllocationTimeSeries"))+1
for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')
for DocumentVersionFind in parsed_xml.findall('./DocumentVersion'):
DocumentVersion = DocumentVersionFind.get('v')
for CreationDateTimeFind in parsed_xml.findall('./CreationDateTime'):
CreationDateTime = CreationDateTimeFind.get('v')
for BidTimeIntervalFind in parsed_xml.findall('./BidTimeInterval'):
BidTimeInterval = BidTimeIntervalFind.get('v')
for i in range(1,result):
BidIdentificationString = './AllocationTimeSeries[%d]/BidIdentification'%(i)
BiddingPartyString = './AllocationTimeSeries[%d]/BiddingParty'%(i)
InAreaString = './AllocationTimeSeries[%d]/InArea'%(i)
OutAreaString = './AllocationTimeSeries[%d]/OutArea'%(i)
ContractIdentificationString = './AllocationTimeSeries[%d]/ContractIdentification'%(i)
BidPriceString = './AllocationTimeSeries[%d]/Period/Interval/BidPriceAmount'%(i)
BidQuantityString = './AllocationTimeSeries[%d]/Period/Interval/BidQty'%(i)
ClearingPriceString = './AllocationTimeSeries[%d]/Period/Interval/PriceAmount'%(i)
AllocatedQuantityString = './AllocationTimeSeries[%d]/Period/Interval/Qty'%(i)
for BidIdentificationFind in parsed_xml.findall(BidIdentificationString):
BidIdentification = BidIdentificationFind.get('v')
for BiddingPartyFind in parsed_xml.findall(BiddingPartyString):
BiddingParty = BiddingPartyFind.get('v')
for InAreaFind in parsed_xml.findall(InAreaString):
InArea = InAreaFind.get('v')
for OutAreaFind in parsed_xml.findall(OutAreaString):
OutArea = OutAreaFind.get('v')
for ContractIdentificationFind in parsed_xml.findall(ContractIdentificationString):
ContractIdentification = ContractIdentificationFind.get('v')
for BidPriceFind in parsed_xml.findall(BidPriceString):
BidPrice = BidPriceFind.get('v')
for BidQuantityFind in parsed_xml.findall(BidQuantityString):
BidQuantity = BidQuantityFind.get('v')
for ClearingPriceFind in parsed_xml.findall(ClearingPriceString):
ClearingPrice = ClearingPriceFind.get('v')
for AllocatedQuantityFind in parsed_xml.findall(AllocatedQuantityString):
AllocatedQuantity = AllocatedQuantityFind.get('v')
df_xml = df_xml.append(
pd.Series([DocumentIdentification, DocumentVersion, CreationDateTime, BidTimeInterval, BidIdentification, BiddingParty, OutArea, InArea, ContractIdentification, BidPrice, BidQuantity, ClearingPrice, AllocatedQuantity], index=dfcols),
ignore_index=True)
return df_xml
main()
df_xml = main()
df_xml
python xml pandas
python xml pandas
edited Nov 20 '18 at 16:36
rtunnicliffe
asked Nov 19 '18 at 21:22
rtunnicliffertunnicliffe
13
13
add a comment |
add a comment |
0
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53382820%2fnested-xml-with-all-data-as-attributes-to-pandas-dataframe-only-final-entry-is%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53382820%2fnested-xml-with-all-data-as-attributes-to-pandas-dataframe-only-final-entry-is%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown