XPath (XML Path Language)

1. Introduction to XPath

XPath (XML Path Language) is a query language for selecting nodes from an XML document.

References:
XML Path Language (XPath) Version 1.0
How XPath Works
http://www.zvon.org/xxl/XPathTutorial/General_chi/examples.html
XPath online tester: https://www.freeformatter.com/xpath-tester.html

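1.1. Testing Tools

Two command-line tools are recommended for testing XPath: xmllint (usually preinstalled) and xmlstarlet (installed separately, but more powerful).

Basic XPath syntax resembles locating a file in a file system. Here is a simple example of testing an XPath with xmllint: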

$ cat 1.xml
<A>
  <B>
    <C>xxx</C>
  </B>
  <D>
  </D>
</A>
$ xmllint --xpath '/A/B/C' 1.xml
<C>xxx</C>

Besides the command-line tools above, a browser makes it easy to test XPath against HTML documents. For details, see: https://stackoverflow.com/questions/22571267/how-to-verify-an-xpath-expression-in-chrome-developers-tool-or-firefoxs-firebug
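The same query can also be tried from a script. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which implements only a subset of XPath. The document string repeats 1.xml inline, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Same content as 1.xml above, written inline without indentation.
doc = "<A><B><C>xxx</C></B><D></D></A>"
root = ET.fromstring(doc)  # root is the <A> element

# ElementTree paths are relative to the element they are called on,
# so the absolute XPath /A/B/C becomes ./B/C when evaluated from <A>.
matches = root.findall("./B/C")
serialized = [ET.tostring(m, encoding="unicode") for m in matches]
print(serialized)
```

For full XPath 1.0 support (axes, functions, numeric comparisons), xmllint or xmlstarlet remain the better choice.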

1.2. Basic Examples

Suppose file.xml contains:

<?xml version = "1.0"?>
<rooms>
  <room>
    <!-- This is a list of student -->
    <student rollno = "393">
      <firstname>Dinkar</firstname>
      <lastname>Kad</lastname>
      <marks>85</marks>
    </student>
    <student rollno = "493">
      <firstname>Vaneet</firstname>
      <lastname>Gupta</lastname>
      <marks>95</marks>
    </student>
  </room>
  <room>
    <!-- This is another list of student -->
    <student rollno = "593">
      <firstname>Jasvir</firstname>
      <lastname>Singh</lastname>
      <marks>90</marks>
    </student>
    <student rollno = "693">
      <firstname>William</firstname>
      <lastname>Shakespeare</lastname>
      <marks>70</marks>
    </student>
  </room>
</rooms>

Some XPath expressions and their results:

+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| XPath (1st line)                                                                              | XPath results                      | Note                 |
| Unabbreviated XPath syntax (2nd line)                                                         |                                    |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student                                                                           | <student rollno="393">             |                      |
| /child::rooms/child::room/child::student                                                      |   <firstname>Dinkar</firstname>    |                      |
|                                                                                               |   <lastname>Kad</lastname>         |                      |
|                                                                                               |   <marks>85</marks>                |                      |
|                                                                                               | </student>                         |                      |
|                                                                                               | <student rollno="493">             |                      |
|                                                                                               |   <firstname>Vaneet</firstname>    |                      |
|                                                                                               |   <lastname>Gupta</lastname>       |                      |
|                                                                                               |   <marks>95</marks>                |                      |
|                                                                                               | </student>                         |                      |
|                                                                                               | <student rollno="593">             |                      |
|                                                                                               |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
|                                                                                               | <student rollno="693">             |                      |
|                                                                                               |   <firstname>William</firstname>   |                      |
|                                                                                               |   <lastname>Shakespeare</lastname> |                      |
|                                                                                               |   <marks>70</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room[1]/student[1]                                                                     | <student rollno="393">             | Index starts with 1  |
| /child::rooms/child::room[position() = 1]/child::student[position() = 1]                      |   <firstname>Dinkar</firstname>    |                      |
|                                                                                               |   <lastname>Kad</lastname>         |                      |
|                                                                                               |   <marks>85</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room[2]/student[1]                                                                     | <student rollno="593">             |                      |
| /child::rooms/child::room[position() = 2]/child::student[position() = 1]                      |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[1]                                                                        | <student rollno="393">             |                      |
| /child::rooms/child::room/child::student[position() = 1]                                      |   <firstname>Dinkar</firstname>    |                      |
|                                                                                               |   <lastname>Kad</lastname>         |                      |
|                                                                                               |   <marks>85</marks>                |                      |
|                                                                                               | </student>                         |                      |
|                                                                                               | <student rollno="593">             |                      |
|                                                                                               |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| (/rooms/room/student[1])[2]                                                                   | <student rollno="593">             | Note the parentheses |
| (/child::rooms/child::room/child::student[position() = 1])[position() = 2]                    |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| (/rooms/room/student)[1]                                                                      | <student rollno="393">             |                      |
| (/child::rooms/child::room/child::student)[position() = 1]                                    |   <firstname>Dinkar</firstname>    |                      |
|                                                                                               |   <lastname>Kad</lastname>         |                      |
|                                                                                               |   <marks>85</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[firstname="William"]                                                      | <student rollno="693">             |                      |
| /child::rooms/child::room/child::student[child::firstname="William"]                          |   <firstname>William</firstname>   |                      |
|                                                                                               |   <lastname>Shakespeare</lastname> |                      |
|                                                                                               |   <marks>70</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[@rollno = 593]                                                            | <student rollno="593">             | Note the @           |
| /child::rooms/child::room/child::student[attribute::rollno=593]                               |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[@rollno > 500 and @rollno < 600]                                          | <student rollno="593">             |                      |
| /child::rooms/child::room/child::student[attribute::rollno > 500 and attribute::rollno < 600] |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[@rollno > 500][@rollno < 600]                                             | <student rollno="593">             |                      |
| /child::rooms/child::room/child::student[attribute::rollno > 500][attribute::rollno < 600]    |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| //marks                                                                                       | <marks>85</marks>                  |                      |
| /descendant-or-self::node()/child::marks                                                      | <marks>95</marks>                  |                      |
|                                                                                               | <marks>90</marks>                  |                      |
|                                                                                               | <marks>70</marks>                  |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| (//marks)[1]                                                                                  | <marks>85</marks>                  |                      |
| (/descendant-or-self::node()/child::marks)[1]                                                 |                                    |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| (//marks)[2]                                                                                  | <marks>95</marks>                  |                      |
| (/descendant-or-self::node()/child::marks)[2]                                                 |                                    |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| //student/@rollno                                                                             | rollno="393"                       |                      |
| /descendant-or-self::node()/child::student/attribute::rollno                                  | rollno="493"                       |                      |
|                                                                                               | rollno="593"                       |                      |
|                                                                                               | rollno="693"                       |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
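The difference between student[1] and (...)[1] in the table above can be checked with a short script. This is a sketch using Python's standard xml.etree.ElementTree, which supports only a subset of XPath: paths are written relative to the root element, parenthesized grouping is replaced by indexing the returned Python list, and attribute values are compared as strings ('593') rather than numbers. The inline document is a trimmed copy of file.xml, and the helper rollnos is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of file.xml: only the parts needed for these queries.
doc = """
<rooms>
  <room>
    <student rollno="393"><firstname>Dinkar</firstname></student>
    <student rollno="493"><firstname>Vaneet</firstname></student>
  </room>
  <room>
    <student rollno="593"><firstname>Jasvir</firstname></student>
    <student rollno="693"><firstname>William</firstname></student>
  </room>
</rooms>
"""
root = ET.fromstring(doc)


def rollnos(elements):
    """Return the rollno attribute of each element."""
    return [e.get("rollno") for e in elements]


# /rooms/room/student[1]: the first student of EACH room.
first_per_room = rollnos(root.findall("./room/student[1]"))

# (/rooms/room/student)[1]: the first student overall. ElementTree has no
# grouping syntax, so slice the Python list instead (0-based).
first_overall = rollnos(root.findall("./room/student")[:1])

# (/rooms/room/student[1])[2]: the second item of the per-room result.
second_of_firsts = rollnos(root.findall("./room/student[1]")[1:2])

# Predicates on a child element's text and on an attribute value.
william = rollnos(root.findall("./room/student[firstname='William']"))
by_attr = rollnos(root.findall("./room/student[@rollno='593']"))

print(first_per_room, first_overall, second_of_firsts, william, by_attr)
```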

Reference:
http://www.tutorialspoint.com/xpath/

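2. XPath Syntax

XPath syntax resembles locating a file in a file system. Location steps are separated by /, and each location step has three parts:
1. an axis;
2. a node test;
3. zero or more predicates.

The axis and the node test are separated by ::, and each predicate is enclosed in square brackets. The basic syntax of a location step is therefore: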

axis::node-test[predicate 1][predicate 2]

For example, the XPath below contains three location steps; the parts of the last step are labeled:

/child::rooms/child::room/child::student[attribute::rollno = 593]
                             ^       ^              ^
                             |       |              |
                           axis   node-test      predicate
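The anatomy of a location step can be made concrete with a small parser. The sketch below is illustrative only: split_step and the STEP pattern are hypothetical names, not part of any XPath library, and the code assumes unabbreviated steps, no / inside predicates, and no nested square brackets.

```python
import re

# Hypothetical pattern for one unabbreviated step:
#   axis::node-test[predicate 1][predicate 2]...
STEP = re.compile(
    r"^(?P<axis>[a-z-]+)::(?P<test>[^\[\]]+)(?P<preds>(?:\[[^\[\]]*\])*)$"
)


def split_step(step):
    """Split one location step into (axis, node test, list of predicates)."""
    m = STEP.match(step)
    if m is None:
        raise ValueError(f"not a simple unabbreviated step: {step!r}")
    predicates = re.findall(r"\[([^\[\]]*)\]", m.group("preds"))
    return m.group("axis"), m.group("test"), predicates


path = "/child::rooms/child::room/child::student[attribute::rollno = 593]"

# Naive split on "/": fine here because no predicate contains a slash.
steps = [split_step(s) for s in path.strip("/").split("/")]
for axis, test, predicates in steps:
    print(axis, test, predicates)
```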

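2.1. Abbreviated Syntax

If the axis is child, the axis name and the following :: can be omitted; if the axis is attribute, the axis name and the following :: can be abbreviated to @. The following two XPaths are therefore equivalent: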

/child::rooms/child::room/child::student[attribute::rollno = 593]
/rooms/room/student[@rollno = 593]

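Table 1 lists the abbreviated forms in XPath.

Table 1: Abbreviated forms in XPath
+------------------------------+------------------+
| Unabbreviated form           | Abbreviated form |
+------------------------------+------------------+
| child::                      | (omitted)        |
| attribute::                  | @                |
| /descendant-or-self::node()/ | //               |
| self::node()                 | .                |
| parent::node()               | ..               |
+------------------------------+------------------+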

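2.2. Axes

Table 2 lists the axes supported by XPath and what each one selects.

Table 2: Axes in XPath
+--------------------+-------------------------------------------------------------------------+
| Axis Name          | Result                                                                  |
+--------------------+-------------------------------------------------------------------------+
| ancestor           | All ancestors (parent, grandparent, etc.) of the current node.          |
| ancestor-or-self   | All ancestors of the current node, plus the current node itself.        |
| attribute          | All attributes of the current node.                                     |
| child              | All children of the current node.                                       |
| descendant         | All descendants (children, grandchildren, etc.) of the current node.    |
| descendant-or-self | All descendants of the current node, plus the current node itself.      |
| following          | All nodes after the current node's end tag in document order.           |
| following-sibling  | All siblings after the current node.                                    |
| namespace          | All namespace nodes of the current node.                                |
| parent             | The parent of the current node.                                         |
| preceding          | All nodes before the current node's start tag, excluding its ancestors. |
| preceding-sibling  | All siblings before the current node.                                   |
| self               | The current node itself.                                                |
+--------------------+-------------------------------------------------------------------------+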

2.2.1. ancestor

The ancestor axis selects all ancestors (parent, grandparent, etc.) of the current node.

Suppose we have the following XML file:

<A>
  <B>
    <C>
      <D>
      </D>
    </C>
  </B>
</A>

Some examples using ancestor:

+----------------------+------------+
| XPath                | Result     |
+----------------------+------------+
| /A/B/C/D/ancestor::* | <A>        |
|                      |   <B>      |
|                      |     <C>    |
|                      |       <D>  |
|                      |       </D> |
|                      |     </C>   |
|                      |   </B>     |
|                      | </A>       |
|                      | <B>        |
|                      |   <C>      |
|                      |     <D>    |
|                      |     </D>   |
|                      |   </C>     |
|                      | </B>       |
|                      | <C>        |
|                      |   <D>      |
|                      |   </D>     |
|                      | </C>       |
+----------------------+------------+
| /A/B/C/D/ancestor::B | <B>        |
|                      |   <C>      |
|                      |     <D>    |
|                      |     </D>   |
|                      |   </C>     |
|                      | </B>       |
+----------------------+------------+
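Python's standard xml.etree.ElementTree does not implement the ancestor axis, but as a rough sketch the axis can be emulated by walking a child-to-parent map. The names parent_of and ancestors are hypothetical, and the inline document is the A/B/C/D file above without whitespace.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<A><B><C><D/></C></B></A>")

# ElementTree elements do not store a parent pointer, so build a
# child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(element):
    """Return the ancestors of element in document order (outermost first)."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element)
    return chain[::-1]


d = root.find("./B/C/D")
all_ancestors = [e.tag for e in ancestors(d)]            # /A/B/C/D/ancestor::*
only_b = [e.tag for e in ancestors(d) if e.tag == "B"]   # /A/B/C/D/ancestor::B
print(all_ancestors, only_b)
```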

2.2.2. attribute

The attribute axis selects all attributes of the current node. attribute:: can be abbreviated to @.

Suppose we have the following XML file:

<AAA>
    <BBB id = "b1"/>
    <BBB id = "b2"/>
    <BBB name = "bbb"/>
    <BBB/>
</AAA>

Some examples using attribute:

+----------------+---------------------+-----------------------------------------------+
| XPath          | Results             | Note                                          |
+----------------+---------------------+-----------------------------------------------+
| //@id          | id="b1"             | Select all id attributes                      |
|                | id="b2"             |                                               |
+----------------+---------------------+-----------------------------------------------+
| //BBB[@id]     | <BBB id="b1"/>      | Select BBB elements which have attribute id   |
|                | <BBB id="b2"/>      |                                               |
+----------------+---------------------+-----------------------------------------------+
| //BBB[@name]   | <BBB name="bbb"/>   | Select BBB elements which have attribute name |
+----------------+---------------------+-----------------------------------------------+
| //BBB[@*]      | <BBB id="b1"/>      | Select BBB elements which have any attribute  |
|                | <BBB id="b2"/>      |                                               |
|                | <BBB name="bbb"/>   |                                               |
+----------------+---------------------+-----------------------------------------------+
| //BBB[not(@*)] | <BBB/>              | Select BBB elements without an attribute      |
+----------------+---------------------+-----------------------------------------------+

Reference: http://zvon.org/xxl/XPathTutorial/Output/example5.html
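The attribute examples above can be approximated in Python. This is a sketch with the standard xml.etree.ElementTree module: its findall returns elements only, never attribute nodes, so //@id is emulated by reading attribute values, and the [@*] and [not(@*)] tests are done in plain Python here. Variable names are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<AAA><BBB id="b1"/><BBB id="b2"/><BBB name="bbb"/><BBB/></AAA>'
)

with_id = [b.attrib for b in root.findall(".//BBB[@id]")]       # //BBB[@id]
with_name = [b.attrib for b in root.findall(".//BBB[@name]")]   # //BBB[@name]

# //@id: collect the attribute values instead of attribute nodes.
id_values = [e.get("id") for e in root.iter() if "id" in e.attrib]

# //BBB[@*] and //BBB[not(@*)]: filtered in Python in this sketch.
any_attr = [b.attrib for b in root.iter("BBB") if b.attrib]
no_attr_count = sum(1 for b in root.iter("BBB") if not b.attrib)

print(with_id, with_name, id_values, any_attr, no_attr_count)
```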

2.2.3. parent

The parent axis selects the parent of the current node.

Suppose we have the following XML file:

<A>
  <B>
    <C>xxx</C>
  </B>
  <B>
  </B>
  <D>
    <C>yyy</C>
  </D>
</A>

Some examples using parent:

+---------------+--------------+
| XPath         | Result       |
+---------------+--------------+
| //C/parent::* | <B>          |
|               |   <C>xxx</C> |
|               | </B>         |
|               | <D>          |
|               |   <C>yyy</C> |
|               | </D>         |
+---------------+--------------+
| //C/parent::B | <B>          |
|               |   <C>xxx</C> |
|               | </B>         |
+---------------+--------------+
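Since parent::node() abbreviates to .., the first query above can also be written //C/.. and tried with Python's standard xml.etree.ElementTree, which accepts the .. step. In this sketch the name test of parent::B is applied in Python, and the inline document is the file above without whitespace.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<A><B><C>xxx</C></B><B></B><D><C>yyy</C></D></A>")

# //C/parent::* written in abbreviated form as .//C/..
parents = [p.tag for p in root.findall(".//C/..")]

# //C/parent::B: keep only the parents named B.
b_parents = [p.tag for p in root.findall(".//C/..") if p.tag == "B"]

print(parents, b_parents)
```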

2.3. Node Tests

The most common node test is simply an XML tag name, for example:

/child::rooms/child::room/child::student[attribute::rollno = 593]
          ^            ^            ^
          |            |            |
       node-test    node-test    node-test

Note: * is a special node test, described below.

Reference: https://www.w3.org/TR/xpath/#node-tests

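2.3.1. node() vs. * vs. @*

Table 3: node() vs. * vs. @*
+---------------+--------------------+--------------------------------------------------+
| XPath Syntax  | Abbreviated Syntax | Meaning                                          |
+---------------+--------------------+--------------------------------------------------+
| child::node() | node()             | Selects all children of the context node         |
| child::*      | *                  | Selects all element children of the context node |
| attribute::*  | @*                 | Selects all attributes of the context node       |
+---------------+--------------------+--------------------------------------------------+

Suppose we have the following XML file: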

<A>foo
  <!-- this is comments -->
  bar
  <B>xxx</B>
  baz
  <B>xxx</B>
  quz
</A>

Two tests comparing node() and *:

+-----------+-------------------------------------+
| XPath     | Result                              |
+-----------+-------------------------------------+
| /A/node() | Text='foo                           |
|           |   '                                 |
|           | Comment='<!-- this is comments -->' |
|           | Text='                              |
|           |   bar                               |
|           |   '                                 |
|           | Element='<B>xxx</B>'                |
|           | Text='                              |
|           |   baz                               |
|           |   '                                 |
|           | Element='<B>xxx</B>'                |
|           | Text='                              |
|           |   quz                               |
|           | '                                   |
+-----------+-------------------------------------+
| /A/*      | Element='<B>xxx</B>'                |
|           | Element='<B>xxx</B>'                |
+-----------+-------------------------------------+
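The distinction between node() and * mirrors the node types of a DOM tree. As a sketch, Python's standard xml.dom.minidom exposes text, comment, and element children side by side, so both selections can be imitated by filtering childNodes. The document string is the file above with whitespace removed, and the variable names are illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<A>foo<!-- this is comments -->bar<B>xxx</B>baz<B>xxx</B>quz</A>"
)
a = doc.documentElement

# /A/node(): every child node, whatever its type.
all_children = [type(n).__name__ for n in a.childNodes]

# /A/*: element children only.
element_children = [
    n.tagName for n in a.childNodes if n.nodeType == n.ELEMENT_NODE
]

print(all_children)
print(element_children)
```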

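2.4. Predicates

A predicate is an expression enclosed in square brackets (expressions are described later). It further filters the nodes that have been selected.

Suppose we have the following XML file: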

<AAA>
  <BBB>foo</BBB>
  <BBB>bar</BBB>
  <BBB>baz</BBB>
  <BBB>quz</BBB>
</AAA>

Some examples using predicates:

+----------------------------------------+------------------+---------------+
| XPath                                  | Result           | Note          |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[1]                            | <BBB>foo</BBB>   |               |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[position()=1]                 | <BBB>foo</BBB>   | Same as above |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[last()]                       | <BBB>quz</BBB>   |               |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[position()=1 or position()=2] | <BBB>foo</BBB>   |               |
|                                        | <BBB>bar</BBB>   |               |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[text()='foo']                 | <BBB>foo</BBB>   |               |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[.='foo']                      | <BBB>foo</BBB>   | Same as above |
+----------------------------------------+------------------+---------------+
| /AAA[BBB='bar']                        | <AAA>            |               |
|                                        |   <BBB>foo</BBB> |               |
|                                        |   <BBB>bar</BBB> |               |
|                                        |   <BBB>baz</BBB> |               |
|                                        |   <BBB>quz</BBB> |               |
|                                        | </AAA>           |               |
+----------------------------------------+------------------+---------------+
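Several of these predicates also work in Python's standard xml.etree.ElementTree, which understands a small set of predicate forms. The sketch below assumes Python 3.7 or later (needed for the [.='text'] form). The position()=1 or position()=2 case is written as a list slice because ElementTree has no or operator. The helper texts is a hypothetical name.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<AAA><BBB>foo</BBB><BBB>bar</BBB><BBB>baz</BBB><BBB>quz</BBB></AAA>"
)


def texts(elements):
    """Return the text content of each element."""
    return [e.text for e in elements]


first = texts(root.findall("./BBB[1]"))          # /AAA/BBB[1]
last = texts(root.findall("./BBB[last()]"))      # /AAA/BBB[last()]
by_text = texts(root.findall("./BBB[.='foo']"))  # /AAA/BBB[.='foo']

# /AAA/BBB[position()=1 or position()=2], emulated with a slice.
first_two = texts(root.findall("./BBB")[:2])

print(first, last, by_text, first_two)
```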

2.5. Expressions

An XPath expression returns either a node-set, a string, a Boolean, or a number.

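2.5.1. Operators

Table 4 lists the operators that can be used in expressions.

Table 4: Operators in XPath expressions
+----------+------------------------------+---------------------------+-------------------------------------------------------+
| Operator | Description                  | Example                   | Return Value                                          |
+----------+------------------------------+---------------------------+-------------------------------------------------------+
| ()       | Grouping                     |                           |                                                       |
| |        | Union of two node-sets       | //book | //cd             | Returns a node-set with all book and cd elements      |
| +        | Addition                     | 6 + 4                     | 10                                                    |
| -        | Subtraction                  | 6 - 4                     | 2                                                     |
| *        | Multiplication               | 6 * 4                     | 24                                                    |
| div      | Division                     | 8 div 4                   | 2                                                     |
| =        | Equal                        | price=9.80                | true if price is 9.80                                 |
| !=       | Not equal                    | price!=9.80               | true if price is not 9.80                             |
| <        | Less than                    | price<9.80                | true if price is less than 9.80                       |
| <=       | Less than or equal to        | price<=9.80               | true if price is less than or equal to 9.80           |
| >        | Greater than                 | price>9.80                | true if price is greater than 9.80                    |
| >=       | Greater than or equal to     | price>=9.80               | true if price is greater than or equal to 9.80        |
| or       | Logical or                   | price=9.80 or price=9.70  | true if price is 9.80 or 9.70                         |
| and      | Logical and                  | price>9.00 and price<9.90 | true if price is greater than 9.00 and less than 9.90 |
| mod      | Modulus (division remainder) | 5 mod 2                   | 1                                                     |
+----------+------------------------------+---------------------------+-------------------------------------------------------+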

For the use of grouping, i.e. parentheses (), see section 1.2: an expression without grouping (such as /rooms/room/student[1]) and one with grouping (such as (/rooms/room/student)[1]) give different results.
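The div and mod operators behave slightly differently from their nearest Python counterparts. In XPath 1.0, numbers are floating point, and mod returns the remainder of a truncating division, so the result takes the sign of the left operand. The sketch below models this in Python; xpath_div and xpath_mod are hypothetical helper names, and division by zero is not handled.

```python
import math


def xpath_div(a, b):
    # div is floating-point division; this sketch assumes b is not zero.
    return a / b


def xpath_mod(a, b):
    # mod is the remainder of a truncating division, so the result takes
    # the sign of the left operand, like math.fmod.
    return math.fmod(a, b)


print(xpath_div(8, 4))   # 8 div 4
print(xpath_mod(5, 2))   # 5 mod 2
print(xpath_mod(-5, 2))  # -5 mod 2 is -1, whereas Python's -5 % 2 is 1
```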

2.6. Functions

2.6.1. Node Set Functions

Table 5 lists the node-set functions.

Table 5: XPath Node Set Functions
+-------------------+-------------------------------------------------------------------------------------------------------+
| Node Set Function | Description                                                                                           |
+-------------------+-------------------------------------------------------------------------------------------------------+
| count             | Returns the number of nodes in the node-set argument.                                                 |
| id                | Selects elements by their unique ID.                                                                  |
| last              | Returns the context size of the expression evaluation context.                                        |
| local-name        | Returns the local part of the expanded name of the first node (in document order) in the argument.    |
| name              | Returns a QName string for the expanded name of the first node (in document order) in the argument.   |
| namespace-uri     | Returns the namespace URI of the expanded name of the first node (in document order) in the argument. |
| position          | Returns the context position of the expression evaluation context.                                    |
+-------------------+-------------------------------------------------------------------------------------------------------+

2.6.2. String Functions

Table 6 lists the string functions.

Table 6: XPath String Functions
+------------------+-----------------------------------------------------------------------------------------------------------------------------+
| String Function  | Description                                                                                                                 |
+------------------+-----------------------------------------------------------------------------------------------------------------------------+
| concat           | Returns the concatenation of the arguments.                                                                                 |
| contains         | Returns true if the first string contains the second string; otherwise false.                                               |
| normalize-space  | Strips leading and trailing whitespace and collapses each inner whitespace run to one space.                                |
| starts-with      | Returns true if the first string starts with the second string; otherwise false.                                            |
| string           | Converts an object to a string.                                                                                             |
| string-length    | Returns the number of characters in the string.                                                                             |
| substring        | Returns the substring of the first argument at the 1-based position (second argument) and length (third argument).          |
| substring-after  | Returns the part of the first string that follows the first occurrence of the second string.                                |
| substring-before | Returns the part of the first string that precedes the first occurrence of the second string.                               |
| translate        | Replaces each character of the first string that occurs in the second with the character at the same position in the third. |
+------------------+-----------------------------------------------------------------------------------------------------------------------------+
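The less obvious string functions are easier to follow with a model implementation. The Python functions below are a sketch of the XPath 1.0 behavior, not an actual XPath engine: the function names are hypothetical, substring handles integer arguments only, and the sample inputs are the ones used in the XPath 1.0 recommendation.

```python
import re


def substring(s, start, length):
    # Positions are 1-based; simplified to integer arguments.
    begin = max(start, 1)
    end = start + length
    return s[begin - 1:max(end - 1, begin - 1)]


def substring_before(s, sep):
    index = s.find(sep)
    return s[:index] if index >= 0 else ""


def substring_after(s, sep):
    index = s.find(sep)
    return s[index + len(sep):] if index >= 0 else ""


def normalize_space(s):
    # XML whitespace is space, tab, carriage return and line feed.
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def translate(s, src, dst):
    table = {}
    for i, ch in enumerate(src):
        # The first occurrence of a character in src wins; characters with
        # no counterpart in dst are deleted (mapped to None).
        table.setdefault(ord(ch), dst[i] if i < len(dst) else None)
    return s.translate(table)


print(substring("12345", 2, 3))
print(substring_before("1999/04/01", "/"))
print(substring_after("1999/04/01", "/"))
print(normalize_space("  a \n b  "))
print(translate("--aaa--", "abc-", "ABC"))
```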

2.6.3. Boolean Functions

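Table 7 lists the Boolean functions.

Table 7: XPath Boolean Functions
+------------------+------------------------------------------------------------------------------------------------------------+
| Boolean Function | Description                                                                                                |
+------------------+------------------------------------------------------------------------------------------------------------+
| boolean          | Converts the argument to a Boolean.                                                                        |
| false            | Returns false.                                                                                             |
| lang             | Returns true if the xml:lang language of the context node is the argument language or a sublanguage of it. |
| not              | Returns true if the argument is false; otherwise false.                                                    |
| true             | Returns true.                                                                                              |
+------------------+------------------------------------------------------------------------------------------------------------+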

2.6.4. Number Functions

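Table 8 lists the number functions.

Table 8: XPath Number Functions
+-----------------+---------------------------------------------------------------------------------------+
| Number Function | Description                                                                           |
+-----------------+---------------------------------------------------------------------------------------+
| ceiling         | Returns the smallest integer that is not less than the argument.                      |
| floor           | Returns the largest integer that is not greater than the argument.                    |
| number          | Converts the argument to a number.                                                    |
| round           | Returns the integer closest to the argument; ties round toward positive infinity.     |
| sum             | Returns the sum of all nodes in the node-set, converting each node to a number first. |
+-----------------+---------------------------------------------------------------------------------------+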
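As a rough illustration, sum(//marks) and count(//student) over file.xml from section 1.2 can be imitated in Python. The standard xml.etree.ElementTree module does not evaluate XPath functions, so the aggregation is done in Python. The inline document keeps only the marks values of file.xml, and xpath_round is a hypothetical helper showing how round differs from Python's built-in round on ties.

```python
import math
import xml.etree.ElementTree as ET

# Only the marks of file.xml are kept in this trimmed copy.
root = ET.fromstring(
    "<rooms>"
    "<room><student><marks>85</marks></student>"
    "<student><marks>95</marks></student></room>"
    "<room><student><marks>90</marks></student>"
    "<student><marks>70</marks></student></room>"
    "</rooms>"
)

# sum(//marks): convert each node's text to a number, then add.
total = sum(float(m.text) for m in root.iter("marks"))

# count(//student)
student_count = len(root.findall(".//student"))


def xpath_round(x):
    # Ties go toward positive infinity; NaN, infinities and negative zero
    # are not handled in this sketch.
    return math.floor(x + 0.5)


print(total, student_count)
print(xpath_round(2.5), round(2.5))  # XPath-style rounding vs Python's round
```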

3. Tips

3.1. Extracting All Hyperlinks from a File

To extract all hyperlinks from a file:

$ xmllint --html --xpath '//@href' index.html
$ xmllint --html --xpath '//a/@href' index.html    # only hyperlinks in a elements, e.g. <a href="xx"></a>
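Where xmllint is not available, a similar result can be obtained with Python's standard html.parser module. This is a sketch rather than an XPath evaluation: HrefCollector and collect_hrefs are hypothetical names, the sample HTML is made up, and the output is the bare attribute values rather than xmllint's href="..." form.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href attribute values, optionally from one tag name only."""

    def __init__(self, only_tag=None):
        super().__init__()
        self.only_tag = only_tag
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if self.only_tag is not None and tag != self.only_tag:
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def collect_hrefs(html, only_tag=None):
    parser = HrefCollector(only_tag)
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = (
    '<html><head><link href="style.css"></head>'
    '<body><a href="a.html">A</a><a href="b.html">B</a></body></html>'
)
all_links = collect_hrefs(sample)     # roughly //@href
a_links = collect_hrefs(sample, "a")  # roughly //a/@href
print(all_links, a_links)
```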

Author: cig01

Created: <2017-04-09 Sun>

Last updated: <2018-01-05 Fri>
