Python Read Xml How to Get Value From Tag

Processing XML in Python — ElementTree

A Beginner's Guide

Deepesh Nair

Learn how you can parse, explore, modify and populate XML files with the Python ElementTree package, for loops and XPath expressions. As a data scientist, you'll find that understanding XML is powerful for both web-scraping and general practice in parsing a structured document.

Extensible Markup Language (XML) is a markup language which encodes documents by defining a set of rules in both machine-readable and human-readable format. Extended from SGML (Standard Generalized Markup Language), it lets us describe the structure of the document. In XML, we can define custom tags. We can also use XML as a standard format to exchange information.

  • XML documents have sections, called elements, defined by a beginning and an ending tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and end-tag, if there are any, are the element's content. Elements can contain markup, including other elements, which are called "child elements".
  • The largest, top-level element is called the root, which contains all other elements.
  • Attributes are name–value pairs that exist within a start-tag or empty-element tag. An XML attribute can only have a single value and each attribute can appear at most once on each element.

Here's a snapshot of movies.xml that we will be using for this tutorial:

    <?xml version="1.0"?>
    <collection>
        <genre category="Action">
            <decade years="1980s">
                <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                    <format multiple="No">DVD</format>
                    <year>1981</year>
                    <rating>PG</rating>
                    <description>
                    'Archaeologist and adventurer Indiana Jones
                    is hired by the U.S. government to find the Ark of the Covenant before the Nazis.'
                    </description>
                </movie>
                <movie favorite="True" title="THE KARATE KID">
                    <format multiple="Yes">DVD,Online</format>
                    <year>1984</year>
                    <rating>PG</rating>
                    <description>None provided.</description>
                </movie>
                <movie favorite="False" title="Back 2 the Future">
                    <format multiple="False">Blu-ray</format>
                    <year>1985</year>
                    <rating>PG</rating>
                    <description>Marty McFly</description>
                </movie>
            </decade>
            <decade years="1990s">
                <movie favorite="False" title="X-Men">
                    <format multiple="Yes">dvd, digital</format>
                    <year>2000</year>
                    <rating>PG-13</rating>
                    <description>Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.</description>
                </movie>
                <movie favorite="True" title="Batman Returns">
                    <format multiple="No">VHS</format>
                    <year>1992</year>
                    <rating>PG13</rating>
                    <description>NA.</description>
                </movie>
                <movie favorite="False" title="Reservoir Dogs">
                    <format multiple="No">Online</format>
                    <year>1992</year>
                    <rating>R</rating>
                    <description>WhAtEvER I Want!!!?!</description>
                </movie>
            </decade>
        </genre>

        <genre category="Thriller">
            <decade years="1970s">
                <movie favorite="False" title="ALIEN">
                    <format multiple="Yes">DVD</format>
                    <year>1979</year>
                    <rating>R</rating>
                    <description>"""""""""</description>
                </movie>
            </decade>
            <decade years="1980s">
                <movie favorite="True" title="Ferris Bueller's Day Off">
                    <format multiple="No">DVD</format>
                    <year>1986</year>
                    <rating>PG13</rating>
                    <description>Funny movie about a funny guy</description>
                </movie>
                <movie favorite="FALSE" title="American Psycho">
                    <format multiple="No">blue-ray</format>
                    <year>2000</year>
                    <rating>Unrated</rating>
                    <description>psychopathic Bateman</description>
                </movie>
            </decade>
        </genre>

Introduction to ElementTree

The XML tree structure makes navigation, modification, and removal relatively simple programmatically. Python has a built in library, ElementTree, that has functions to read and manipulate XMLs (and other similarly structured files).

First, import ElementTree. It's a common practice to use the alias ET:

    import xml.etree.ElementTree as ET

Parsing XML Data

In the XML file provided, there is a basic collection of movies described. The only problem is the data is a mess! There have been a lot of different curators of this collection and everyone has their own way of entering data into the file. The main goal in this tutorial will be to read and understand the file with Python, then fix the problems.

First you need to read in the file with ElementTree.

    tree = ET.parse('movies.xml')
    root = tree.getroot()
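
If you don't have movies.xml on disk yet, you can still follow along by parsing XML from a string. The sketch below assumes a trimmed-down, hypothetical stand-in for the file (the `SAMPLE` string is not the full dataset) and shows how `ET.fromstring()` relates to `ET.parse()`:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for movies.xml (not the full file).
SAMPLE = """<?xml version="1.0"?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="THE KARATE KID">
                <format multiple="Yes">DVD,Online</format>
                <year>1984</year>
            </movie>
        </decade>
    </genre>
</collection>"""

# ET.fromstring() returns the root Element directly, with no tree wrapper.
root = ET.fromstring(SAMPLE)

# Wrap the root in an ElementTree when you need tree-level methods like .write().
tree = ET.ElementTree(root)

print(root.tag)                # collection
print(tree.getroot() is root)  # True
```

`ET.parse()` gives you the tree first and the root second; `ET.fromstring()` goes the other way around.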

Now that you have initialized the tree, you should look at the XML and print out values in order to understand how the tree is structured.

    root.tag

    'collection'

At the top level, you see that this XML is rooted in the collection tag.

    root.attrib

    {}

For Loops

You can easily iterate over subelements (commonly called "children") in the root by using a simple "for" loop.

    for child in root:
        print(child.tag, child.attrib)

    genre {'category': 'Action'}
    genre {'category': 'Thriller'}
    genre {'category': 'Comedy'}

Now you know that the children of the root collection are all genre. To designate the genre, the XML uses the attribute category. There are Action, Thriller, and Comedy movies according to the genre element.
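
Besides looping, an element also behaves much like a list of its children, so you can use `len()` and indexing. A minimal sketch, assuming a tiny hypothetical sample rather than the real file:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-genre sample, just to show list-like access.
SAMPLE = """<collection>
    <genre category="Action"><decade years="1980s" /></genre>
    <genre category="Thriller"><decade years="1970s" /></genre>
</collection>"""
root = ET.fromstring(SAMPLE)

genre_count = len(root)                  # number of direct children
first_genre = root[0].get("category")    # index like a list
nested_years = root[1][0].get("years")   # chain indexes to go deeper
categories = [genre.get("category") for genre in root]

print(genre_count, first_genre, nested_years, categories)
```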

Typically it is helpful to know all the elements in the entire tree. One useful function for doing that is root.iter().

    [elem.tag for elem in root.iter()]

    ['collection',
     'genre',
     'decade',
     'movie',
     'format',
     'year',
     'rating',
     'description',
     'movie',
     .
     .
     .
     .
     'movie',
     'format',
     'year',
     'rating',
     'description']
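
A long list of repeated tags is hard to scan. One option, sketched here on a small hypothetical sample, is to count each tag with `collections.Counter` so you get a compact summary of the tree:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the real file has many more movies.
SAMPLE = """<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie title="A"><year>1981</year></movie>
            <movie title="B"><year>1984</year></movie>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(SAMPLE)

# root.iter() walks the root itself and every descendant, in document order.
tag_counts = Counter(elem.tag for elem in root.iter())
print(tag_counts)
```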

There is a helpful way to see the whole document. If you pass the root into the .tostring() method, you can return the whole document. Within ElementTree, this method takes a slightly strange form.

Since ElementTree is a powerful library that can interpret more than just XML, you must specify both the encoding and decoding of the document you are displaying as the string.
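
As a rough illustration of the encode-then-decode pattern, the sketch below serializes a tiny hypothetical tree two ways. With `encoding="utf8"` the result is a `bytes` object that you decode yourself; passing `encoding="unicode"` asks `ET.tostring()` for a `str` directly:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-genre tree, only for demonstration.
root = ET.fromstring('<collection><genre category="Action" /></collection>')

as_bytes = ET.tostring(root, encoding="utf8")     # bytes: needs .decode() to print nicely
as_text = ET.tostring(root, encoding="unicode")   # str: no decoding step needed

print(type(as_bytes).__name__)
print(as_bytes.decode("utf8"))
print(type(as_text).__name__)
print(as_text)
```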

You can expand the use of the iter() function to help with finding particular elements of interest. root.iter() will list all subelements under the root that match the element specified. Here, you will list all attributes of the movie element in the tree:

    for movie in root.iter('movie'):
        print(movie.attrib)

    {'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
    {'favorite': 'True', 'title': 'THE KARATE KID'}
    {'favorite': 'False', 'title': 'Back 2 the Future'}
    {'favorite': 'False', 'title': 'X-Men'}
    {'favorite': 'True', 'title': 'Batman Returns'}
    {'favorite': 'False', 'title': 'Reservoir Dogs'}
    {'favorite': 'False', 'title': 'ALIEN'}
    {'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
    {'favorite': 'FALSE', 'title': 'American Psycho'}
    {'favorite': 'False', 'title': 'Batman: The Movie'}
    {'favorite': 'True', 'title': 'Easy A'}
    {'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
    {'favorite': 'False', 'title': 'Ghostbusters'}
    {'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}

XPath Expressions

Many times elements will not have attributes, they will only have text content. Using the attribute .text, you can print out this content.

Now, print out all the descriptions of the movies.

    for description in root.iter('description'):
        print(description.text)

    'Archaeologist and adventurer Indiana Jones
    is hired by the U.S. government to find the Ark of the Covenant before the Nazis.'
    None provided.
    Marty McFly
    Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.
    NA.
    WhAtEvER I Want!!!?!
    """""""""
    Funny movie about a funny guy
    psychopathic Bateman
    What a joke!
    Emma Stone = Hester Prynne
    Tim (Rudd) is a rising executive who "succeeds" in finding the perfect guest, IRS employee Barry (Carell), for his boss' monthly event, a so-called "dinner for idiots," which offers certain
    advantages to the exec who shows up with the biggest buffoon.
    Who ya gonna call?
    Robin Hood slaying
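
Text pulled from `.text` keeps whatever newlines and indentation were in the file, and it is `None` for an empty element. Here is a minimal sketch of one way to tidy it up; `clean_text` is a hypothetical helper and the sample XML is made up for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one multi-line description and one empty description.
SAMPLE = """<decade>
    <movie title="Indiana Jones">
        <description>
        'Archaeologist and adventurer Indiana Jones
        is hired by the U.S. government.'
        </description>
    </movie>
    <movie title="Untitled">
        <description />
    </movie>
</decade>"""
root = ET.fromstring(SAMPLE)

def clean_text(elem):
    # .text is None for an empty element, so fall back to "".
    # split() + join collapses newlines and indentation into single spaces.
    return " ".join((elem.text or "").split())

descriptions = [clean_text(d) for d in root.iter("description")]
print(descriptions)
```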

Printing out the XML is helpful, but XPath is a query language used to search through an XML quickly and easily. Understanding XPath is critically important to scanning and populating XMLs. ElementTree supports a limited subset of XPath through methods such as .findall(), which matches elements relative to the element it is called on: a bare tag name matches only that element's immediate children, while a path expression can reach deeper into the tree.

Here, you will search the tree for movies that came out in 1992:

    for movie in root.findall("./genre/decade/movie[year='1992']"):
        print(movie.attrib)

    {'favorite': 'True', 'title': 'Batman Returns'}
    {'favorite': 'False', 'title': 'Reservoir Dogs'}
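
To see how the starting point and the path shape affect what `.findall()` returns, here is a small sketch on a hypothetical sample (the genre placement in `SAMPLE` is made up for the example). It compares a full path, a `.//` descendant search, and a bare tag name:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not the real movies.xml.
SAMPLE = """<collection>
    <genre category="Action">
        <decade years="1990s">
            <movie title="Batman Returns"><year>1992</year></movie>
            <movie title="X-Men"><year>2000</year></movie>
        </decade>
    </genre>
    <genre category="Thriller">
        <decade years="1990s">
            <movie title="Reservoir Dogs"><year>1992</year></movie>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(SAMPLE)

# Full path from the root, with a child-text predicate attached to the tag.
by_path = [m.get("title") for m in root.findall("./genre/decade/movie[year='1992']")]

# ".//" searches all descendants, so intermediate tags need not be spelled out.
anywhere = [m.get("title") for m in root.findall(".//movie[year='1992']")]

# A bare tag name only matches direct children of the element it is called on.
direct = root.findall("movie")

print(by_path, anywhere, direct)
```

The last call returns an empty list because the root's direct children are genre elements, not movies.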

The function .findall() always begins at the element specified. This type of function is extremely powerful for a "find and replace". You can even search on attributes!

Now, print out only the movies that are available in multiple formats (an attribute).

    for form in root.findall("./genre/decade/movie/format[@multiple='Yes']"):
        print(form.attrib)

    {'multiple': 'Yes'}
    {'multiple': 'Yes'}
    {'multiple': 'Yes'}
    {'multiple': 'Yes'}
    {'multiple': 'Yes'}

Brainstorm why, in this case, the print statement returns the "Yes" values of multiple. Think about how the "for" loop is defined.

Tip: use '..' as the last step of an XPath expression to return the parent element of the current element.

    for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']/.."):
        print(movie.attrib)

    {'favorite': 'True', 'title': 'THE KARATE KID'}
    {'favorite': 'False', 'title': 'X-Men'}
    {'favorite': 'False', 'title': 'ALIEN'}
    {'favorite': 'False', 'title': 'Batman: The Movie'}
    {'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
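
Elements do not store a reference to their parent, so another common approach is to build a child-to-parent lookup yourself. A minimal sketch on a hypothetical sample, where `parent_map` is just an illustrative name:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one multi-format movie.
SAMPLE = """<decade years="1980s">
    <movie title="THE KARATE KID"><format multiple="Yes">DVD,Online</format></movie>
    <movie title="Back 2 the Future"><format multiple="No">Blu-ray</format></movie>
</decade>"""
root = ET.fromstring(SAMPLE)

# Map every child element to its parent by walking the whole tree once.
parent_map = {child: parent for parent in root.iter() for child in parent}

multi_titles = [
    parent_map[fmt].get("title")
    for fmt in root.iter("format")
    if fmt.get("multiple") == "Yes"
]
print(multi_titles)  # ['THE KARATE KID']
```

The map is a snapshot: if you move or remove elements afterwards, rebuild it before relying on it again.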

Modifying an XML

Earlier, the movie titles were an absolute mess. Now, print them out again:

    for movie in root.iter('movie'):
        print(movie.attrib)

    {'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
    {'favorite': 'True', 'title': 'THE KARATE KID'}
    {'favorite': 'False', 'title': 'Back 2 the Future'}
    {'favorite': 'False', 'title': 'X-Men'}
    {'favorite': 'True', 'title': 'Batman Returns'}
    {'favorite': 'False', 'title': 'Reservoir Dogs'}
    {'favorite': 'False', 'title': 'ALIEN'}
    {'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
    {'favorite': 'FALSE', 'title': 'American Psycho'}
    {'favorite': 'False', 'title': 'Batman: The Movie'}
    {'favorite': 'True', 'title': 'Easy A'}
    {'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
    {'favorite': 'False', 'title': 'Ghostbusters'}
    {'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}

Fix the '2' in Back 2 the Future. That should be a find and replace problem. Write code to find the title 'Back 2 the Future' and save it as a variable:

    b2tf = root.find("./genre/decade/movie[@title='Back 2 the Future']")
    print(b2tf)

    <Element 'movie' at 0x10ce00ef8>

Notice that using the .find() method returns an element of the tree. Much of the time, it is more useful to edit the content within an element.

Modify the title attribute of the Back 2 the Future element variable to read "Back to the Future". Then, print out the attributes of your variable to see your change. You can easily do this by accessing the attribute of an element and then assigning a new value to it:

    b2tf.attrib["title"] = "Back to the Future"
    print(b2tf.attrib)

    {'favorite': 'False', 'title': 'Back to the Future'}
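
One thing to watch for: `.find()` returns `None` when nothing matches, and assigning to `None.attrib` raises an error. The sketch below wraps the find-and-replace in a hypothetical helper, `rename_movie`, that guards against that case; the sample XML is made up:

```python
import xml.etree.ElementTree as ET

# Hypothetical single-movie sample.
root = ET.fromstring(
    '<decade years="1980s">'
    '<movie favorite="False" title="Back 2 the Future" />'
    '</decade>'
)

def rename_movie(parent, old_title, new_title):
    """Rename the first <movie> child whose title matches; report success."""
    # Note: a title containing a single quote would break this simple predicate.
    movie = parent.find(f"./movie[@title='{old_title}']")
    if movie is None:               # .find() returns None when nothing matches
        return False
    movie.set("title", new_title)   # same effect as movie.attrib["title"] = ...
    return True

renamed = rename_movie(root, "Back 2 the Future", "Back to the Future")
missing = rename_movie(root, "No Such Film", "Whatever")
print(renamed, missing, root.find("movie").attrib)
```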

Write out your changes back to the XML so they are permanently fixed in the document. Print out your movie attributes again to make sure your changes worked. Use the .write() method to do this:

    tree.write("movies.xml")

    tree = ET.parse('movies.xml')
    root = tree.getroot()

    for movie in root.iter('movie'):
        print(movie.attrib)

    {'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
    {'favorite': 'True', 'title': 'THE KARATE KID'}
    {'favorite': 'False', 'title': 'Back to the Future'}
    {'favorite': 'False', 'title': 'X-Men'}
    {'favorite': 'True', 'title': 'Batman Returns'}
    {'favorite': 'False', 'title': 'Reservoir Dogs'}
    {'favorite': 'False', 'title': 'ALIEN'}
    {'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
    {'favorite': 'FALSE', 'title': 'American Psycho'}
    {'favorite': 'False', 'title': 'Batman: The Movie'}
    {'favorite': 'True', 'title': 'Easy A'}
    {'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
    {'favorite': 'False', 'title': 'Ghostbusters'}
    {'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}
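
If you would rather not overwrite your only copy while experimenting, you can write to a different path and read it back to check the result. This is a sketch under assumptions: the file name `movies_fixed.xml` and the temporary directory are hypothetical, and the `encoding` and `xml_declaration` arguments are optional extras that make the output start with an XML declaration:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical one-movie tree standing in for the edited document.
root = ET.fromstring('<collection><movie title="Back to the Future" /></collection>')
tree = ET.ElementTree(root)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movies_fixed.xml")  # hypothetical file name
    tree.write(path, encoding="utf-8", xml_declaration=True)

    # Peek at the first line of the file that was written.
    with open(path, encoding="utf-8") as handle:
        first_line = handle.readline().strip()

    # Parse the file again to confirm the change survived the round trip.
    reloaded = ET.parse(path).getroot()
    reloaded_title = reloaded.find("movie").get("title")

print(first_line)
print(reloaded_title)
```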

Fixing Attributes

The multiple attribute is incorrect in some places. Use ElementTree to fix the designator based on how many formats the movie comes in. First, print the format attribute and text to see which parts need to be fixed.

    for form in root.findall("./genre/decade/movie/format"):
        print(form.attrib, form.text)

    {'multiple': 'No'} DVD
    {'multiple': 'Yes'} DVD,Online
    {'multiple': 'False'} Blu-ray
    {'multiple': 'Yes'} dvd, digital
    {'multiple': 'No'} VHS
    {'multiple': 'No'} Online
    {'multiple': 'Yes'} DVD
    {'multiple': 'No'} DVD
    {'multiple': 'No'} blue-ray
    {'multiple': 'Yes'} DVD,VHS
    {'multiple': 'No'} DVD
    {'multiple': 'Yes'} DVD,digital,Netflix
    {'multiple': 'No'} Online,VHS
    {'multiple': 'No'} Blu_Ray

There is some work that needs to be done on this tag.

You can use regex to find commas; that will tell whether the multiple attribute should be "Yes" or "No". Adding and modifying attributes can be done easily with the .set() method.

    import re

    for form in root.findall("./genre/decade/movie/format"):
        # Search for the commas in the format text
        match = re.search(',', form.text)
        if match:
            form.set('multiple', 'Yes')
        else:
            form.set('multiple', 'No')

    # Write out the tree to the file again
    tree.write("movies.xml")

    tree = ET.parse('movies.xml')
    root = tree.getroot()

    for form in root.findall("./genre/decade/movie/format"):
        print(form.attrib, form.text)

    {'multiple': 'No'} DVD
    {'multiple': 'Yes'} DVD,Online
    {'multiple': 'No'} Blu-ray
    {'multiple': 'Yes'} dvd, digital
    {'multiple': 'No'} VHS
    {'multiple': 'No'} Online
    {'multiple': 'No'} DVD
    {'multiple': 'No'} DVD
    {'multiple': 'No'} blue-ray
    {'multiple': 'Yes'} DVD,VHS
    {'multiple': 'No'} DVD
    {'multiple': 'Yes'} DVD,digital,Netflix
    {'multiple': 'Yes'} Online,VHS
    {'multiple': 'No'} Blu_Ray
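
A regex is not strictly needed for a single comma. As an alternative sketch (not the approach used above), you could split the text on commas and count the pieces, which also tolerates empty elements and stray spaces. The sample data here is hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with deliberately wrong "multiple" values.
SAMPLE = """<decade>
    <movie title="A"><format multiple="Yes">DVD</format></movie>
    <movie title="B"><format multiple="No">Online,VHS</format></movie>
    <movie title="C"><format multiple="False">dvd, digital</format></movie>
</decade>"""
root = ET.fromstring(SAMPLE)

for fmt in root.iter("format"):
    # Split on commas and drop empty pieces and surrounding whitespace.
    formats = [part.strip() for part in (fmt.text or "").split(",") if part.strip()]
    fmt.set("multiple", "Yes" if len(formats) > 1 else "No")

result = [(f.text, f.get("multiple")) for f in root.iter("format")]
print(result)
```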

Moving Elements

Some of the data has been placed in the wrong decade. Use what you have learned about XML and ElementTree to find and fix the decade data errors.

It will be useful to print out both the decade tags and the year tags throughout the document.

    for decade in root.findall("./genre/decade"):
        print(decade.attrib)
        for year in decade.findall("./movie/year"):
            print(year.text)

    {'years': '1980s'}
    1981
    1984
    1985
    {'years': '1990s'}
    2000
    1992
    1992
    {'years': '1970s'}
    1979
    {'years': '1980s'}
    1986
    2000
    {'years': '1960s'}
    1966
    {'years': '2010s'}
    2010
    2011
    {'years': '1980s'}
    1984
    {'years': '1990s'}
    1991
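
Spotting mismatches by eye works for a short file, but you can also let Python do the comparison. The sketch below assumes every decade label has the form "1990s" and every year is a four-digit number; the sample and the `misfiled` list are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one misfiled movie.
SAMPLE = """<collection>
    <genre category="Action">
        <decade years="1990s">
            <movie title="X-Men"><year>2000</year></movie>
            <movie title="Batman Returns"><year>1992</year></movie>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(SAMPLE)

misfiled = []
for decade in root.iter("decade"):
    for movie in decade.findall("movie"):
        year = int(movie.findtext("year"))
        expected = f"{year // 10 * 10}s"   # 2000 -> "2000s", 1992 -> "1990s"
        if expected != decade.get("years"):
            misfiled.append((movie.get("title"), decade.get("years"), expected))

print(misfiled)
```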

The two years that are in the wrong decade are the movies from the 2000s. Figure out what those movies are, using an XPath expression.

    for movie in root.findall("./genre/decade/movie[year='2000']"):
        print(movie.attrib)

    {'favorite': 'False', 'title': 'X-Men'}
    {'favorite': 'FALSE', 'title': 'American Psycho'}

You have to add a new decade tag, the 2000s, to the Action genre in order to move the X-Men data. The ET.SubElement() function can be used to add this tag as the last child of the Action genre.

    action = root.find("./genre[@category='Action']")
    new_dec = ET.SubElement(action, 'decade')
    new_dec.attrib["years"] = '2000s'

Now append the X-Men movie to the 2000s and remove it from the 1990s, using .append() and .remove(), respectively.

    xmen = root.find("./genre/decade/movie[@title='X-Men']")
    dec2000s = root.find("./genre[@category='Action']/decade[@years='2000s']")
    dec2000s.append(xmen)
    dec1990s = root.find("./genre[@category='Action']/decade[@years='1990s']")
    dec1990s.remove(xmen)
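
The same three steps (create the decade if needed, append, remove) could be wrapped in a small helper so that other misfiled movies can be moved the same way. This is only a sketch: `move_movie` is a hypothetical function, and the sample uses American Psycho purely as an illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one movie filed under the wrong decade.
SAMPLE = """<collection>
    <genre category="Thriller">
        <decade years="1980s">
            <movie title="American Psycho"><year>2000</year></movie>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(SAMPLE)

def move_movie(genre, title, old_years, new_years):
    """Move one movie between decades of a genre, creating the target if needed."""
    old_decade = genre.find(f"./decade[@years='{old_years}']")
    movie = old_decade.find(f"./movie[@title='{title}']")
    new_decade = genre.find(f"./decade[@years='{new_years}']")
    if new_decade is None:   # compare with None; an element without children is falsy
        new_decade = ET.SubElement(genre, "decade", {"years": new_years})
    new_decade.append(movie)   # attach to the new parent first...
    old_decade.remove(movie)   # ...then detach from the old one

thriller = root.find("./genre[@category='Thriller']")
move_movie(thriller, "American Psycho", "1980s", "2000s")
print(ET.tostring(root, encoding="unicode"))
```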

Build XML Documents

Nice, so you were able to essentially move an entire movie to a new decade. Save your changes back to the XML.

    tree.write("movies.xml")

    tree = ET.parse('movies.xml')
    root = tree.getroot()

    print(ET.tostring(root, encoding='utf8').decode('utf8'))
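
The same building blocks also let you create a document from scratch rather than editing an existing one. A minimal sketch, with element names borrowed from movies.xml and values chosen only for illustration:

```python
import xml.etree.ElementTree as ET

# Start with a root element, then hang children off it with SubElement.
collection = ET.Element("collection")
genre = ET.SubElement(collection, "genre", category="Action")
decade = ET.SubElement(genre, "decade", years="2000s")
movie = ET.SubElement(decade, "movie", favorite="False", title="X-Men")

# Keyword arguments become attributes; .text holds the element's content.
ET.SubElement(movie, "format", multiple="Yes").text = "dvd, digital"
ET.SubElement(movie, "year").text = "2000"

xml_text = ET.tostring(collection, encoding="unicode")
print(xml_text)
```

Wrapping `collection` in `ET.ElementTree(collection)` would then let you save it with `.write()`.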

Conclusion

ElementTree is an important Python library that allows you to parse and navigate an XML document. Using ElementTree breaks down the XML document in a tree structure that is easy to work with. When in doubt, print it out (print(ET.tostring(root, encoding='utf8').decode('utf8'))). Use this helpful print statement to view the entire XML document at once.

References

  • Original Post as published by Steph Howson: DataCamp
  • Python 3 Documentation: ElementTree
  • Wikipedia: XML

Source: https://towardsdatascience.com/processing-xml-in-python-elementtree-c8992941efd2
