revlis.nl
Stash of notes about OSS, OSes, virtualization, dev hobby projects &c
October 27, 2023 — 17:16
XPath, the “XML Path Language”, is a query language for XML that can also be used on the HTML DOM, for example to select ‘div’ elements with a specific class. That makes it handy when scraping web pages, e.g. with Selenium or Playwright. It works similarly to CSS selectors.
The syntax and examples below are all XPath 1.0, since that version is supported by practically all tools and libs. Version 2.0 adds more types, functions and operators (there are also 3.0 and 3.1).
Syntax
child:: (or '/') selects immediate children
descendant:: selects all descendants (recursive)
descendant-or-self:: (or '//') selects the context node and all of its descendants
@ (short for attribute::) selects an attribute
text() selects element text
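To make the difference between the child and descendant axes concrete, here is a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree, which only implements a subset of XPath 1.0 (abbreviated paths and simple predicates, no explicit axis names like child::), and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document: one div directly under the root,
# and a second div nested deeper inside it.
doc = ET.fromstring(
    "<root>"
    "<div id='outer'><p><div id='inner'/></p></div>"
    "</root>"
)

# A plain step like 'div' uses the child axis: immediate children only.
children = [d.get("id") for d in doc.findall("div")]

# './/div' looks through all descendants of the context node.
descendants = [d.get("id") for d in doc.findall(".//div")]

print(children)     # ['outer']
print(descendants)  # ['outer', 'inner']
```

Libraries with full XPath 1.0 support should accept the long forms (child::div, descendant::div) as well.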
Examples
Select the ‘title’ attribute of a div with class ‘myclass’
html: <div class="myclass" title="My Title">
xpath: //div[@class="myclass"]/@title
returns: ‘My Title’
Select the link with id ‘my_id’ and then its text
html: <a id="my_id">foo bar</a>
xpath: //a[@id="my_id"]/descendant::text()
returns: ‘foo bar’
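The two examples can also be tried from Python. This is a rough sketch using only the standard library's xml.etree.ElementTree, which handles the element and predicate part of the query but not the trailing /@title or text() steps, so those are done in Python instead. The wrapping <body> element is an assumption added only to get one well-formed document.

```python
import xml.etree.ElementTree as ET

# The two snippets from the examples, wrapped in a made-up <body>
# element so they parse as a single well-formed document.
page = ET.fromstring(
    "<body>"
    '<div class="myclass" title="My Title"></div>'
    '<a id="my_id">foo bar</a>'
    "</body>"
)

# Roughly //div[@class="myclass"]/@title
div = page.find(".//div[@class='myclass']")
title = div.get("title")

# Roughly //a[@id="my_id"]/descendant::text()
link = page.find(".//a[@id='my_id']")
text = "".join(link.itertext())

print(title)  # My Title
print(text)   # foo bar
```

Real-world HTML is often not well-formed XML, so an HTML-aware parser or a browser driver is usually the better fit there.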
Testing
Queries can be tested from the CLI with ‘xmllint’ (apt install libxml2-utils)
# html file:
xmllint --html --xpath '//a[@class="Results"]/@title' example.html
# actual xml, from curl:
curl 'http://restapi.adequateshop.com/api/Traveler?page=1' | \
xmllint --xpath '/TravelerinformationResponse/travelers/Travelerinformation/name' -
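For a quick check without xmllint, the same kind of path can be tried in Python. A small sketch, assuming the standard library's xml.etree.ElementTree and a hypothetical sample document: only the element names are taken from the XPath above, the values are invented and this is not the actual API response.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names follow the XPath above,
# the names themselves are invented for illustration.
xml_text = """<TravelerinformationResponse>
  <travelers>
    <Travelerinformation><name>Alice</name></Travelerinformation>
    <Travelerinformation><name>Bob</name></Travelerinformation>
  </travelers>
</TravelerinformationResponse>"""

root = ET.fromstring(xml_text)

# root already is <TravelerinformationResponse>, so the relative
# path starts at its children instead of at '/'.
names = [n.text for n in root.findall("travelers/Travelerinformation/name")]

print(names)  # ['Alice', 'Bob']
```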