XPath Basics
XPath is a powerful tool for selecting nodes in XML and HTML documents. In this article, I've compiled the basics of working with XPath and examples of its application.
What is XPath?
XPath, or XML Path Language, is a query language for selecting nodes from XML or HTML documents. It lets you write patterns that describe the structure of the document and returns every node that matches them.
XPath Syntax Basics
Selecting Nodes by Tag Name
Simple XPath expressions allow selecting elements by tag name. For example:
- //h1 selects all top-level <h1> headings on the page
- //p selects all <p> paragraphs
- //img selects all <img> images
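To try these expressions from Python, here is a minimal sketch that assumes the third-party lxml library is installed. The sample page and the variable names are made up for illustration.

```python
from lxml import html  # third-party package, assumed installed (pip install lxml)

# Hypothetical sample page, used only for illustration.
PAGE = """
<html><body>
  <h1>Title</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <img src="logo.png" alt="Logo">
</body></html>
"""

tree = html.fromstring(PAGE)

# Each call returns a list of matching elements in document order.
headings = tree.xpath("//h1")
paragraphs = tree.xpath("//p")
images = tree.xpath("//img")

print(len(headings), len(paragraphs), len(images))
```

The leading // means "anywhere in the document", so nesting depth does not matter for these matches.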
Selecting Nodes by Attribute
You can select elements by attribute or attribute value:
- //*[@class="highlighted"] selects elements whose class attribute is exactly "highlighted"
- //a[@href] selects all <a> links that have an href attribute
- //img[@alt="Logo"] selects images with the alternative text "Logo"
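The attribute selectors above can be checked with a small sketch, again assuming lxml. The markup is hypothetical and includes near-misses to show what is not matched.

```python
from lxml import html  # third-party package, assumed installed

# Hypothetical sample page, used only for illustration.
PAGE = """
<html><body>
  <p class="highlighted">Important</p>
  <p class="highlighted large">Also styled, but the attribute value differs</p>
  <a href="https://example.com">Link</a>
  <a name="anchor-only">No href here</a>
  <img src="logo.png" alt="Logo">
  <img src="photo.png" alt="Photo">
</body></html>
"""

tree = html.fromstring(PAGE)

# @class="highlighted" compares the whole attribute value,
# so class="highlighted large" is not matched.
highlighted = tree.xpath('//*[@class="highlighted"]')

# [@href] only tests that the attribute exists.
links = tree.xpath("//a[@href]")

# [@alt="Logo"] requires an exact value match.
logos = tree.xpath('//img[@alt="Logo"]')

print(len(highlighted), len(links), len(logos))
```

If elements carry several classes, a test such as contains(@class, "highlighted") is a common workaround, although it also matches class names that merely contain that substring.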
Selecting Nodes by Position
Selecting nodes based on their position:
- //ul/li[1] selects the first <li> list item in each unordered list <ul>
- //table/tr[last()] selects the last <tr> row in each <table>
- //ol/li[position() <= 3] selects the first three <li> list items in each ordered list <ol>
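Positional predicates are evaluated per parent, which the following sketch tries to make visible. It assumes lxml. The markup is invented, and the table is written without <tbody> so that //table/tr matches the rows directly.

```python
from lxml import html  # third-party package, assumed installed

# Hypothetical sample page, used only for illustration.
PAGE = """
<html><body>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
  <table><tr><td>r1</td></tr><tr><td>r2</td></tr></table>
  <ol><li>1</li><li>2</li><li>3</li><li>4</li></ol>
</body></html>
"""

tree = html.fromstring(PAGE)

# XPath positions start at 1, and [1] is applied inside each <ul> separately.
first_items = [li.text for li in tree.xpath("//ul/li[1]")]

# last() is the index of the final <tr> child of each <table>.
last_rows = [tr.text_content() for tr in tree.xpath("//table/tr[last()]")]

# position() allows ranges instead of a single index.
first_three = [li.text for li in tree.xpath("//ol/li[position() <= 3]")]

print(first_items, last_rows, first_three)
```

Browsers insert a <tbody> element into tables when they build the DOM, so an expression copied from DevTools may need adjusting against the raw HTML.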
Selecting Nodes by Relationship
Navigating up and down the document tree:
- //div[@class="content"]/* selects all child elements of the <div> with the class "content"
- //p/.. selects the parent elements of all <p> paragraphs
- //h1/following-sibling::p selects all <p> paragraphs that are siblings after the <h1> heading
- //section//img selects all <img> images that are descendants of <section>
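A short sketch of the relationship selectors above, assuming lxml and a made-up page. It prints tag names so the direction of each step is easy to see.

```python
from lxml import html  # third-party package, assumed installed

# Hypothetical sample page, used only for illustration.
PAGE = """
<html><body>
  <div class="content"><h2>Sub</h2><p>Inside div</p></div>
  <section>
    <h1>Heading</h1>
    <p>After heading 1</p>
    <div><img src="a.png" alt="A"></div>
    <p>After heading 2</p>
  </section>
</body></html>
"""

tree = html.fromstring(PAGE)

# /* steps down one level: direct children only.
children = [el.tag for el in tree.xpath('//div[@class="content"]/*')]

# /.. steps up one level; each parent appears once in the result.
parents = [el.tag for el in tree.xpath("//p/..")]

# following-sibling:: stays on the same level, after the context node.
siblings = [p.text for p in tree.xpath("//h1/following-sibling::p")]

# // between two steps matches descendants at any depth.
nested_images = [img.get("src") for img in tree.xpath("//section//img")]

print(children, parents, siblings, nested_images)
```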
Predicates and Functions
Using predicates and functions to refine selection:
- //p[contains(text(),"scrapy")] selects <p> paragraphs containing the text "scrapy"
- //a[starts-with(@href,"https")] selects <a> links whose href attribute starts with "https"
- //ul[count(li) > 10] selects <ul> lists containing more than 10 <li> list items
- //img[string-length(@alt) > 0] selects <img> images with a non-empty alt attribute
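The function-based predicates can be exercised with the sketch below, which assumes lxml and an invented page. It also shows one XPath 1.0 subtlety: contains(text(), ...) looks only at the first text node of an element, whereas contains(., ...) looks at all of its text.

```python
from lxml import html  # third-party package, assumed installed

# Hypothetical sample page, used only for illustration.
long_list = "<ul>" + "<li>item</li>" * 11 + "</ul>"
PAGE = f"""
<html><body>
  <p>We use scrapy for crawling.</p>
  <p>Tools such as <b>lxml</b> and scrapy help.</p>
  <p>Nothing relevant here.</p>
  <a href="https://example.com">secure</a>
  <a href="http://example.com">plain</a>
  {long_list}
  <ul><li>only</li><li>two</li></ul>
  <img src="a.png" alt="Logo">
  <img src="b.png" alt="">
  <img src="c.png">
</body></html>
"""

tree = html.fromstring(PAGE)

# text() yields several text nodes for the second <p>; only the first
# one ("Tools such as ") is tested, so that paragraph is missed.
by_first_text = tree.xpath('//p[contains(text(),"scrapy")]')

# "." is the full string value of the element, nested tags included.
by_full_text = tree.xpath('//p[contains(.,"scrapy")]')

secure_links = tree.xpath('//a[starts-with(@href,"https")]')
long_lists = tree.xpath("//ul[count(li) > 10]")

# A missing alt attribute has string length 0, just like alt="".
images_with_alt = tree.xpath("//img[string-length(@alt) > 0]")

print(len(by_first_text), len(by_full_text), len(secure_links),
      len(long_lists), len(images_with_alt))
```

String comparisons in these functions are case-sensitive, so "Scrapy" would not match "scrapy".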
Using XPath with lxml and BeautifulSoup
Parsing HTML with BeautifulSoup
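In Python, XPath expressions are usually evaluated with lxml. BeautifulSoup does not support XPath itself; it offers its own search methods such as find_all, which cover simple tag lookups like //h1 and //p. Example of working with HTML:
from bs4 import BeautifulSoup

html_content = "<html><body><h1>Example</h1><p>This is a paragraph.</p></body></html>"
# Parse with Python's built-in HTML parser
soup = BeautifulSoup(html_content, "html.parser")

# find_all returns a list of every matching tag
headings = soup.find_all("h1")
paragraphs = soup.find_all("p")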
Extracting Text and Attributes
Extracting data with BeautifulSoup:
# .text returns the text inside the tag
heading_text = headings[0].text
paragraph_text = paragraphs[0].text
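BeautifulSoup's find_all is not XPath, so for comparison here is a sketch of the same extraction written with actual XPath under lxml. It assumes lxml is installed. The <a> link is an addition to the sample markup, included only so there is an attribute to extract.

```python
from lxml import html  # third-party package, assumed installed

# Same sample as above, plus a hypothetical link so an attribute can be read.
html_content = (
    "<html><body><h1>Example</h1><p>This is a paragraph.</p>"
    '<a href="https://example.com">Link</a></body></html>'
)
tree = html.fromstring(html_content)

# /text() selects text nodes; the result is a list of strings.
heading_text = tree.xpath("//h1/text()")[0]

# string(...) returns the full text of the first matching node as one string.
paragraph_text = tree.xpath("string(//p)")

# /@href selects attribute values directly.
link_targets = tree.xpath("//a/@href")

print(heading_text, paragraph_text, link_targets)
```

Indexing with [0] raises IndexError when nothing matches, so real pages usually need a check for an empty result first.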
Simple Tips
- When creating XPath selectors, test them in the browser's DevTools first, for example with $x("//h1") in the console.
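- Clean up malformed or inconsistent markup (such as unclosed tags) before applying XPath, because the parser's repairs can change the tree your expressions run against.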
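- Keep XPath expressions reliable and maintainable: prefer short paths anchored on stable attributes such as id or class over long positional chains that break when the layout changes.
- Cache downloaded HTML and reuse parsed trees and XPath results when the same pages or queries are processed repeatedly.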
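One way to apply the caching tip is sketched below, assuming lxml: expressions are compiled once with etree.XPath and reused for every page. The extract helper, the constant names, and the sample pages are hypothetical.

```python
from lxml import etree, html  # third-party package, assumed installed

# Compile each expression once at import time instead of on every call.
FIRST_HEADING = etree.XPath("string(//h1)")
LINK_TARGETS = etree.XPath("//a/@href")


def extract(page_source):
    """Hypothetical helper: parse one page and apply the compiled expressions."""
    tree = html.fromstring(page_source)
    return {
        "title": FIRST_HEADING(tree).strip(),
        "links": list(LINK_TARGETS(tree)),
    }


# Hypothetical pages, used only for illustration.
pages = [
    "<html><body><h1> Home </h1><a href='/about'>About</a></body></html>",
    "<html><body><h1>About</h1><a href='/'>Home</a><a href='/team'>Team</a></body></html>",
]

results = [extract(page) for page in pages]
print(results)
```

Whether this helps measurably depends on the workload. For a handful of pages, download and parsing time usually dominates the cost of compiling expressions.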