XPath Basics

XPath is a powerful tool for selecting nodes in XML and HTML documents. In this article, I've compiled the basics of working with XPath and examples of its application.

What is XPath?

XPath, or XML Path Language, is a query language for selecting nodes from XML or HTML documents. It allows specifying patterns that match the structure of the document and returning all elements matching those patterns.

XPath Syntax Basics

Selecting Nodes by Tag Name

Simple XPath expressions allow selecting elements by tag name. For example:

  • //h1 selects all <h1> headings on the page, at any depth
  • //p selects all paragraphs <p>
  • //img selects all images <img>
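
The tag-name expressions above can be tried out with lxml. This is a minimal sketch that assumes lxml is installed; the sample markup and variable names are made up for illustration and are not from any real page.

```python
from lxml import html

# Hypothetical sample page, invented for this illustration.
doc = html.fromstring(
    "<html><body>"
    "<h1>Title</h1>"
    "<div><h1>Nested title</h1><p>First</p></div>"
    "<p>Second</p>"
    "<img src='a.png'>"
    "</body></html>"
)

# "//" searches the whole document, so nesting depth does not matter.
headings = doc.xpath("//h1")
paragraphs = doc.xpath("//p")
images = doc.xpath("//img")

heading_texts = [h.text for h in headings]
print(heading_texts, len(paragraphs), len(images))
```

Note that the <h1> nested inside the <div> is matched as well, because // descends through every level of the tree.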

Selecting Nodes by Attribute

You can select elements by attribute or attribute value:

  • //*[@class="highlighted"] selects elements whose class attribute is exactly "highlighted"
  • //a[@href] selects all links <a> with the href attribute
  • //img[@alt="Logo"] selects images with the alternative text "Logo"
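
A short sketch of the attribute selectors above, again assuming lxml and made-up markup. It also shows a common XPath 1.0 idiom for matching one class among several, since an @class comparison is a plain string comparison and does not match an element with class "highlighted big".

```python
from lxml import html

# Hypothetical markup for illustration only.
doc = html.fromstring(
    "<html><body>"
    "<p class='highlighted'>A</p>"
    "<p class='highlighted big'>B</p>"
    "<a href='/home'>Home</a>"
    "<a name='anchor'>No link</a>"
    "<img src='logo.png' alt='Logo'>"
    "</body></html>"
)

# Exact string comparison: only class="highlighted" matches.
exact = doc.xpath('//*[@class="highlighted"]')

# Token match: pad the class list with spaces and look for " highlighted ".
token = doc.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " highlighted ")]'
)

links = doc.xpath("//a[@href]")        # <a> elements that have an href
logos = doc.xpath('//img[@alt="Logo"]')
hrefs = doc.xpath("//a/@href")         # the attribute values themselves

print(len(exact), len(token), len(links), len(logos), hrefs)
```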

Selecting Nodes by Position

Selecting nodes based on their position:

  • //ul/li[1] selects the first list item <li> in each unordered list <ul>
  • //table/tr[last()] selects the last row <tr> directly under each table <table> (rows wrapped in <tbody> are not direct children and are not matched)
  • //ol/li[position() <= 3] selects the first three list items <li> in each ordered list <ol>
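
Positional predicates apply per parent, which is easy to misread. The sketch below assumes lxml and uses an invented XML snippet; it contrasts //ul/li[1] (first item of every list) with (//ul/li)[1] (first item overall).

```python
from lxml import etree

# Hypothetical document for illustration; parsed as XML to keep the tree exact.
doc = etree.fromstring(
    "<root>"
    "<ul><li>a1</li><li>a2</li></ul>"
    "<ul><li>b1</li><li>b2</li></ul>"
    "<ol><li>1</li><li>2</li><li>3</li><li>4</li></ol>"
    "<table><tr><td>first</td></tr><tr><td>last</td></tr></table>"
    "</root>"
)

# [1] is evaluated within each <ul>, so every list contributes its first item.
first_in_each = [li.text for li in doc.xpath("//ul/li[1]")]

# Parentheses apply [1] to the combined result instead.
first_overall = [li.text for li in doc.xpath("(//ul/li)[1]")]

last_row = doc.xpath("//table/tr[last()]/td/text()")
first_three = [li.text for li in doc.xpath("//ol/li[position() <= 3]")]

print(first_in_each, first_overall, last_row, first_three)
```

XPath positions start at 1, not 0.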

Selecting Nodes by Relationship

Navigating up and down the document tree:

  • //div[@class="content"]/* selects all child elements of each <div> with the class "content"
  • //p/.. selects the parent elements of all paragraphs <p>
  • //h1/following-sibling::p selects all paragraphs <p> that follow an <h1> heading as its siblings
  • //section//img selects all images <img> that are descendants of <section>
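
The relationship expressions above can be checked against a small tree. A minimal sketch, assuming lxml; the markup is hypothetical.

```python
from lxml import html

# Hypothetical markup for illustration only.
doc = html.fromstring(
    "<html><body>"
    "<div class='content'><h1>Title</h1><p>Intro</p><span>Note</span><p>More</p></div>"
    "<section><div><img src='deep.png'></div></section>"
    "</body></html>"
)

# Direct children of the content <div>, whatever their tag.
children = [el.tag for el in doc.xpath('//div[@class="content"]/*')]

# ".." steps up to the parent; results are de-duplicated node sets.
parents = [el.tag for el in doc.xpath("//p/..")]

# Later siblings of <h1> that are <p>; the <span> in between is skipped.
siblings = [el.text for el in doc.xpath("//h1/following-sibling::p")]

# "//" after <section> reaches images at any depth below it.
nested = doc.xpath("//section//img/@src")

print(children, parents, siblings, nested)
```

Both paragraphs share one parent, so //p/.. returns that <div> once rather than twice.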

Predicates and Functions

Using predicates and functions to refine selection:

  • //p[contains(text(),"scrapy")] selects paragraphs <p> whose text contains "scrapy" (the match is case-sensitive)
  • //a[starts-with(@href,"https")] selects links <a> whose href attribute starts with "https"
  • //ul[count(li) > 10] selects lists <ul> containing more than 10 list items <li>
  • //img[string-length(@alt) > 0] selects images <img> with a non-empty alt attribute
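
A sketch of these functions in use, assuming lxml (which implements XPath 1.0) and invented markup. One detail worth seeing in practice: in XPath 1.0, contains(text(), ...) only inspects the first text node of the element, while contains(., ...) tests the element's full text. The list threshold is lowered to 2 here only to keep the sample short.

```python
from lxml import html

# Hypothetical markup for illustration only.
doc = html.fromstring(
    "<html><body>"
    "<p>Built with scrapy</p>"
    "<p>Built with <b>lxml</b> and scrapy</p>"
    "<p>Nothing here</p>"
    "<a href='https://example.com'>secure</a>"
    "<a href='http://example.com'>plain</a>"
    "<ul><li>1</li><li>2</li><li>3</li></ul>"
    "<img src='a.png' alt='Chart'><img src='b.png' alt=''><img src='c.png'>"
    "</body></html>"
)

# Only the first text node of each <p> is tested.
first_text_node = doc.xpath('//p[contains(text(),"scrapy")]')

# "." is the full string value of the element, including nested tags.
any_text = doc.xpath('//p[contains(., "scrapy")]')

secure = [a.text for a in doc.xpath('//a[starts-with(@href,"https")]')]
long_lists = doc.xpath("//ul[count(li) > 2]")

# Empty and missing alt attributes both have string length 0.
alts = doc.xpath("//img[string-length(@alt) > 0]/@alt")

print(len(first_text_node), len(any_text), secure, len(long_lists), alts)
```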

Using XPath with lxml and BeautifulSoup

Parsing HTML with BeautifulSoup

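BeautifulSoup is a popular Python library for parsing HTML, but it does not evaluate XPath expressions itself. Its find_all method covers simple tag lookups comparable to //h1 or //p. Example of working with HTML: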

from bs4 import BeautifulSoup

html_content = "<html><body><h1>Example</h1><p>This is a paragraph.</p></body></html>"
soup = BeautifulSoup(html_content, "html.parser")

# find_all returns a list of every matching tag, similar in effect to //h1 and //p
headings = soup.find_all("h1")
paragraphs = soup.find_all("p")
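
For actual XPath queries, lxml can parse the same markup. A minimal sketch, assuming lxml is installed; it reuses the html_content string from the example and runs the //h1 and //p expressions from earlier in the article.

```python
from lxml import html

html_content = "<html><body><h1>Example</h1><p>This is a paragraph.</p></body></html>"
tree = html.fromstring(html_content)

# xpath() returns a list of matching elements.
headings = tree.xpath("//h1")
paragraphs = tree.xpath("//p")

# text_content() gathers all text inside an element.
heading_text = headings[0].text_content()

# string(...) makes XPath itself return the text of the first match.
paragraph_text = tree.xpath("string(//p)")

print(heading_text, "|", paragraph_text)
```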

Extracting Text and Attributes

Extracting data with BeautifulSoup:

# .text returns the text inside the first matching tag
heading_text = headings[0].text
paragraph_text = paragraphs[0].text
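
Attributes can be read in a similar way. The sketch below uses lxml rather than BeautifulSoup so that the extraction stays in XPath; the markup, including the link and its attributes, is hypothetical.

```python
from lxml import html

# Hypothetical markup for illustration only.
doc = html.fromstring(
    "<html><body><h1>Example</h1>"
    "<a href='/docs' title='Docs'>Read the docs</a></body></html>"
)

# text() selects text nodes directly, so no element object is needed.
heading_text = doc.xpath("//h1/text()")[0]

# Either select the element and call get(), or select the attribute in XPath.
link = doc.xpath("//a")[0]
link_href = link.get("href")
link_titles = doc.xpath("//a/@title")

# get() returns None for an attribute that is not present.
missing = link.get("rel")

print(heading_text, link_href, link_titles, missing)
```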

Simple Tips

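  • Test XPath selectors in the browser DevTools before putting them in code; in the console, $x("//h1") evaluates an expression against the current page.
  • Clean up malformed or inconsistent markup before applying XPath, for example by parsing it with a lenient HTML parser.
  • Prefer short expressions anchored on stable attributes such as id over long positional paths, which break when the page layout changes.
  • Cache downloaded HTML and parsed trees, and reuse XPath results instead of re-evaluating them, to improve performance.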
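
One way to act on the caching tip, sketched with lxml: compile an expression once with etree.XPath and keep parsed trees around instead of re-parsing. The pages dictionary and its contents are made up for illustration; in a real scraper the markup would come from downloaded responses.

```python
from lxml import etree, html

# Compile once; the resulting object is callable on any parsed tree.
find_hrefs = etree.XPath("//a/@href")

# Hypothetical pages standing in for downloaded HTML.
pages = {
    "home": "<html><body><a href='/about'>About</a></body></html>",
    "about": "<html><body><a href='/'>Home</a><a href='/contact'>Contact</a></body></html>",
}

# Parse each page once and keep the tree for later queries.
trees = {name: html.fromstring(markup) for name, markup in pages.items()}

# Reuse the compiled expression across all cached trees.
hrefs = {name: find_hrefs(tree) for name, tree in trees.items()}

print(hrefs)
```

Whether this is worth the extra bookkeeping depends on how many documents and expressions are involved; for a handful of pages a plain tree.xpath(...) call is usually enough.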