XPath (XML Path Language)
1. Introduction to XPath
XPath (XML Path Language) is a query language for selecting nodes from an XML document.
References:
XML Path Language (XPath) Version 1.0
How XPath Works
http://www.zvon.org/xxl/XPathTutorial/General_chi/examples.html
XPath online tester: https://www.freeformatter.com/xpath-tester.html
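1.1. Testing tools
Two command-line tools are handy for testing XPath expressions: xmllint (usually preinstalled on the system) and xmlstarlet (installed separately, and more powerful).
Basic XPath syntax resembles locating a file in a file system. Here is a simple example that tests an XPath expression with xmllint:
    $ cat 1.xml
    <A>
      <B>
        <C>xxx</C>
      </B>
      <D>
      </D>
    </A>
    $ xmllint --xpath '/A/B/C' 1.xml
    <C>xxx</C>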
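The same query can also be tried from Python. The following is a minimal sketch that uses the standard-library xml.etree.ElementTree module, which implements only a subset of XPath; the document is inlined as a string (an assumption made to keep the sketch self-contained) rather than read from 1.xml.

```python
import xml.etree.ElementTree as ET

# Same content as 1.xml above, inlined so the sketch is self-contained.
XML = "<A><B><C>xxx</C></B><D></D></A>"

root = ET.fromstring(XML)  # root is the <A> element

# ElementTree paths are relative to the element they are called on,
# so the absolute XPath /A/B/C becomes ./B/C when starting from <A>.
matches = root.findall("./B/C")

for node in matches:
    print(ET.tostring(node, encoding="unicode"))  # prints <C>xxx</C>
```

Note that the path is written relative to the root element; ElementTree does not evaluate absolute paths the way xmllint does.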
Besides the command-line tools above, a browser makes it easy to test XPath against HTML documents; for details see: https://stackoverflow.com/questions/22571267/how-to-verify-an-xpath-expression-in-chrome-developers-tool-or-firefoxs-firebug
1.2. Basic examples
    <?xml version = "1.0"?>
    <rooms>
      <room>
        <!-- This is a list of student -->
        <student rollno = "393">
          <firstname>Dinkar</firstname>
          <lastname>Kad</lastname>
          <marks>85</marks>
        </student>
        <student rollno = "493">
          <firstname>Vaneet</firstname>
          <lastname>Gupta</lastname>
          <marks>95</marks>
        </student>
      </room>
      <room>
        <!-- This is another list of student -->
        <student rollno = "593">
          <firstname>Jasvir</firstname>
          <lastname>Singh</lastname>
          <marks>90</marks>
        </student>
        <student rollno = "693">
          <firstname>William</firstname>
          <lastname>Shakespeare</lastname>
          <marks>70</marks>
        </student>
      </room>
    </rooms>
Below are some XPath expressions and their results against the document above:
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| XPath (1st line)                                                                              | XPath results                      | Note                 |
| Unabbreviated XPath syntax (2nd line)                                                         |                                    |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student                                                                           | <student rollno="393">             |                      |
| /child::rooms/child::room/child::student                                                      |   <firstname>Dinkar</firstname>    |                      |
|                                                                                               |   <lastname>Kad</lastname>         |                      |
|                                                                                               |   <marks>85</marks>                |                      |
|                                                                                               | </student>                         |                      |
|                                                                                               | <student rollno="493">             |                      |
|                                                                                               |   <firstname>Vaneet</firstname>    |                      |
|                                                                                               |   <lastname>Gupta</lastname>       |                      |
|                                                                                               |   <marks>95</marks>                |                      |
|                                                                                               | </student>                         |                      |
|                                                                                               | <student rollno="593">             |                      |
|                                                                                               |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
|                                                                                               | <student rollno="693">             |                      |
|                                                                                               |   <firstname>William</firstname>   |                      |
|                                                                                               |   <lastname>Shakespeare</lastname> |                      |
|                                                                                               |   <marks>70</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room[1]/student[1]                                                                     | <student rollno="393">             | Index starts with 1  |
| /child::rooms/child::room[position() = 1]/child::student[position() = 1]                      |   <firstname>Dinkar</firstname>    |                      |
|                                                                                               |   <lastname>Kad</lastname>         |                      |
|                                                                                               |   <marks>85</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room[2]/student[1]                                                                     | <student rollno="593">             |                      |
| /child::rooms/child::room[position() = 2]/child::student[position() = 1]                      |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[1]                                                                        | <student rollno="393">             |                      |
| /child::rooms/child::room/child::student[position() = 1]                                      |   <firstname>Dinkar</firstname>    |                      |
|                                                                                               |   <lastname>Kad</lastname>         |                      |
|                                                                                               |   <marks>85</marks>                |                      |
|                                                                                               | </student>                         |                      |
|                                                                                               | <student rollno="593">             |                      |
|                                                                                               |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| (/rooms/room/student[1])[2]                                                                   | <student rollno="593">             | Note the parentheses |
| (/child::rooms/child::room/child::student[position() = 1])[position() = 2]                    |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| (/rooms/room/student)[1]                                                                      | <student rollno="393">             |                      |
| (/child::rooms/child::room/child::student)[position() = 1]                                    |   <firstname>Dinkar</firstname>    |                      |
|                                                                                               |   <lastname>Kad</lastname>         |                      |
|                                                                                               |   <marks>85</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[firstname="William"]                                                      | <student rollno="693">             |                      |
| /child::rooms/child::room/child::student[child::firstname="William"]                          |   <firstname>William</firstname>   |                      |
|                                                                                               |   <lastname>Shakespeare</lastname> |                      |
|                                                                                               |   <marks>70</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[@rollno = 593]                                                            | <student rollno="593">             | Note the @           |
| /child::rooms/child::room/child::student[attribute::rollno=593]                               |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[@rollno > 500 and @rollno < 600]                                          | <student rollno="593">             |                      |
| /child::rooms/child::room/child::student[attribute::rollno > 500 and attribute::rollno < 600] |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| /rooms/room/student[@rollno > 500][@rollno < 600]                                             | <student rollno="593">             |                      |
| /child::rooms/child::room/child::student[attribute::rollno > 500][attribute::rollno < 600]    |   <firstname>Jasvir</firstname>    |                      |
|                                                                                               |   <lastname>Singh</lastname>       |                      |
|                                                                                               |   <marks>90</marks>                |                      |
|                                                                                               | </student>                         |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| //marks                                                                                       | <marks>85</marks>                  |                      |
| /descendant-or-self::node()/marks                                                             | <marks>95</marks>                  |                      |
|                                                                                               | <marks>90</marks>                  |                      |
|                                                                                               | <marks>70</marks>                  |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| (//marks)[1]                                                                                  | <marks>85</marks>                  |                      |
| (/descendant-or-self::node()/marks)[1]                                                        |                                    |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| (//marks)[2]                                                                                  | <marks>95</marks>                  |                      |
| (/descendant-or-self::node()/marks)[2]                                                        |                                    |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
| //student/@rollno                                                                             | rollno="393"                       |                      |
| /descendant-or-self::node()/student/attribute::rollno                                         | rollno="493"                       |                      |
|                                                                                               | rollno="593"                       |                      |
|                                                                                               | rollno="693"                       |                      |
+-----------------------------------------------------------------------------------------------+------------------------------------+----------------------+
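A few of the rows above can be reproduced in Python. This is a rough sketch using xml.etree.ElementTree, which supports only a subset of XPath: it has no grouping parentheses and no numeric comparisons, so those parts are emulated in plain Python. The helper name rollnos and the trimmed copy of the document (lastname elements left out) are assumptions made for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the rooms document (lastname elements omitted).
XML = """
<rooms>
  <room>
    <student rollno="393"><firstname>Dinkar</firstname><marks>85</marks></student>
    <student rollno="493"><firstname>Vaneet</firstname><marks>95</marks></student>
  </room>
  <room>
    <student rollno="593"><firstname>Jasvir</firstname><marks>90</marks></student>
    <student rollno="693"><firstname>William</firstname><marks>70</marks></student>
  </room>
</rooms>
"""

root = ET.fromstring(XML)  # the <rooms> element


def rollnos(students):
    """Hypothetical helper: reduce student elements to their rollno values."""
    return [s.get("rollno") for s in students]


# /rooms/room/student[1]: the first student of *each* room.
first_per_room = rollnos(root.findall("./room/student[1]"))

# (/rooms/room/student)[1]: the first student overall. The outer [1]
# is emulated with a Python slice because grouping is not supported.
first_overall = rollnos(root.findall("./room/student")[:1])

# (/rooms/room/student[1])[2]: the second node of the per-room result.
second_of_firsts = rollnos(root.findall("./room/student[1]")[1:2])

# /rooms/room/student[firstname="William"]
william = rollnos(root.findall("./room/student[firstname='William']"))

# /rooms/room/student[@rollno > 500 and @rollno < 600]: the numeric
# comparison is done in Python instead of in the path expression.
in_range = rollnos(
    s for s in root.findall("./room/student") if 500 < int(s.get("rollno")) < 600
)

print(first_per_room, first_overall, second_of_firsts, william, in_range)
```

The difference between first_per_room and first_overall mirrors the difference between /rooms/room/student[1] and (/rooms/room/student)[1] in the table.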
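2. XPath syntax
XPath syntax resembles locating a file in a file system. Location steps are separated by /, and each location step consists of three parts:
1. an axis;
2. a node test;
3. zero or more predicates.
The axis and the node test are separated by ::, and each predicate is enclosed in square brackets. The basic syntax of a location step is therefore:
    axis::node-test[predicate 1][predicate 2]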
The following XPath contains three location steps; the parts of the last step are annotated:
    /child::rooms/child::room/child::student[attribute::rollno = 593]
                                ^       ^             ^
                                |       |             |
                              axis  node-test     predicate
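The three parts of a location step can be pulled apart mechanically. The sketch below is a toy illustration only, not a real XPath parser: it assumes unabbreviated syntax and predicates without nested brackets, and the names STEP and split_step are made up for this example.

```python
import re

# axis "::" node-test, followed by zero or more [predicate] groups.
# Toy pattern: unabbreviated syntax only, no nested square brackets.
STEP = re.compile(
    r"^(?P<axis>[\w-]+)::(?P<test>[^\[\]]+)(?P<preds>(?:\[[^\[\]]*\])*)$"
)


def split_step(step):
    """Split one unabbreviated location step into (axis, node test, predicates)."""
    match = STEP.match(step)
    if match is None:
        raise ValueError(f"not a simple unabbreviated location step: {step!r}")
    predicates = re.findall(r"\[([^\[\]]*)\]", match.group("preds"))
    return match.group("axis"), match.group("test"), predicates


path = "/child::rooms/child::room/child::student[attribute::rollno = 593]"
steps = [split_step(s) for s in path.strip("/").split("/")]
for axis, test, predicates in steps:
    print(axis, test, predicates)
```

Splitting the path on / is itself a simplification that works here only because no predicate contains a slash.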
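2.1. Abbreviated syntax
If the axis is child, the axis name and the :: that follows it can be omitted; if the axis is attribute, the axis name and the :: that follows it can be abbreviated to @. The following two XPath expressions are therefore equivalent:
    /child::rooms/child::room/child::student[attribute::rollno = 593]
    /rooms/room/student[@rollno = 593]
Table 1 lists the abbreviations available in XPath.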
Unabbreviated form | Abbreviated form |
---|---|
child:: | none (simply omitted) |
attribute:: | @ |
/descendant-or-self::node()/ | // |
self::node() | . |
parent::node() | .. |
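2.2. Axes
Table 2 lists the axes that XPath supports, with a description of each.
Axis Name | Result |
---|---|
ancestor | Selects all ancestors (parent, grandparent, etc.) of the current node. |
ancestor-or-self | Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself. |
attribute | Selects all attributes of the current node. |
child | Selects all children of the current node. |
descendant | Selects all descendants (children, grandchildren, etc.) of the current node. |
descendant-or-self | Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself. |
following | Selects all nodes in the document that come after the closing tag of the current node. |
following-sibling | Selects all siblings that come after the current node. |
namespace | Selects all namespace nodes of the current node. |
parent | Selects the parent of the current node. |
preceding | Selects all nodes in the document that come before the opening tag of the current node, excluding its ancestors. |
preceding-sibling | Selects all siblings that come before the current node. |
self | Selects the current node. |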
2.2.1. ancestor
The ancestor axis selects all ancestors (parent, grandparent, etc.) of the current node.
Assume the following XML file:
    <A>
      <B>
        <C>
          <D>
          </D>
        </C>
      </B>
    </A>
Here are some test examples for ancestor:
+----------------------+------------+
| XPath                | Result     |
+----------------------+------------+
| /A/B/C/D/ancestor::* | <A>        |
|                      |   <B>      |
|                      |     <C>    |
|                      |       <D>  |
|                      |       </D> |
|                      |     </C>   |
|                      |   </B>     |
|                      | </A>       |
|                      | <B>        |
|                      |   <C>      |
|                      |     <D>    |
|                      |     </D>   |
|                      |   </C>     |
|                      | </B>       |
|                      | <C>        |
|                      |   <D>      |
|                      |   </D>     |
|                      | </C>       |
+----------------------+------------+
| /A/B/C/D/ancestor::B | <B>        |
|                      |   <C>      |
|                      |     <D>    |
|                      |     </D>   |
|                      |   </C>     |
|                      | </B>       |
+----------------------+------------+
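Python's xml.etree.ElementTree has no ancestor axis, but the idea can be sketched by walking a child-to-parent lookup. The names parent_of and ancestors are illustrative, and the document is a compact copy of the one above. Unlike the table, which lists results in document order, this sketch returns the nearest ancestor first.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<A><B><C><D/></C></B></A>")

# ElementTree elements do not keep a reference to their parent,
# so build a child -> parent lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Return the ancestors of node, nearest first (parent, grandparent, ...)."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result


d = root.find("./B/C/D")
all_ancestors = [e.tag for e in ancestors(d)]           # like ancestor::*
only_b = [e.tag for e in ancestors(d) if e.tag == "B"]  # like ancestor::B
print(all_ancestors, only_b)
```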
2.2.2. attribute
The attribute axis selects all attributes of the current node. attribute:: can be abbreviated to @.
Assume the following XML file:
    <AAA>
      <BBB id = "b1"/>
      <BBB id = "b2"/>
      <BBB name = "bbb"/>
      <BBB/>
    </AAA>
Here are some test examples for attribute:
+----------------+---------------------+-----------------------------------------------+
| XPath          | Results             | Note                                          |
+----------------+---------------------+-----------------------------------------------+
| //@id          | id="b1"             | Select all id attributes                      |
|                | id="b2"             |                                               |
+----------------+---------------------+-----------------------------------------------+
| //BBB[@id]     | <BBB id="b1"/>      | Select BBB elements which have attribute id   |
|                | <BBB id="b2"/>      |                                               |
+----------------+---------------------+-----------------------------------------------+
| //BBB[@name]   | <BBB name = "bbb"/> | Select BBB elements which have attribute name |
+----------------+---------------------+-----------------------------------------------+
| //BBB[@*]      | <BBB id="b1"/>      | Select BBB elements which have any attribute  |
|                | <BBB id="b2"/>      |                                               |
|                | <BBB name = "bbb"/> |                                               |
+----------------+---------------------+-----------------------------------------------+
| //BBB[not(@*)] | <BBB/>              | Select BBB elements without an attribute      |
+----------------+---------------------+-----------------------------------------------+
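The attribute tests above map onto Python roughly as follows. This sketch uses xml.etree.ElementTree, which understands [@id] and [@name] but, as far as its documented subset goes, not @* or not(); those two cases are therefore written as ordinary Python filters.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<AAA><BBB id="b1"/><BBB id="b2"/><BBB name="bbb"/><BBB/></AAA>'
)

# //@id: every id attribute in the document.
ids = [e.get("id") for e in root.iter() if "id" in e.attrib]

# //BBB[@id] and //BBB[@name]: supported directly by ElementTree.
with_id = root.findall(".//BBB[@id]")
with_name = root.findall(".//BBB[@name]")

# //BBB[@*] and //BBB[not(@*)]: emulated with the attrib dictionary.
with_any = [e for e in root.iter("BBB") if e.attrib]
without_any = [e for e in root.iter("BBB") if not e.attrib]

print(ids, len(with_id), len(with_name), len(with_any), len(without_any))
```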
2.2.3. parent
The parent axis selects the parent of the current node.
Assume the following XML file:
    <A>
      <B>
        <C>xxx</C>
      </B>
      <B>
      </B>
      <D>
        <C>yyy</C>
      </D>
    </A>
Here are some test examples for parent:
+---------------+--------------+
| XPath         | Result       |
+---------------+--------------+
| //C/parent::* | <B>          |
|               |   <C>xxx</C> |
|               | </B>         |
|               | <D>          |
|               |   <C>yyy</C> |
|               | </D>         |
+---------------+--------------+
| //C/parent::B | <B>          |
|               |   <C>xxx</C> |
|               | </B>         |
+---------------+--------------+
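In Python, xml.etree.ElementTree supports the abbreviated parent step (..), so both rows can be approximated as below. This is a sketch on a compact copy of the document above; the tag filter used for parent::B is plain Python, since named parent steps are outside ElementTree's subset.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<A><B><C>xxx</C></B><B></B><D><C>yyy</C></D></A>")

# //C/parent::* written with the abbreviated parent step.
parents = [e.tag for e in root.findall(".//C/..")]

# //C/parent::B: keep only the parents named B.
b_parents = [e for e in root.findall(".//C/..") if e.tag == "B"]

print(parents, [e.find("C").text for e in b_parents])
```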
2.3. Node tests
The most common node test is simply the name of an XML tag, for example:
    /child::rooms/child::room/child::student[attribute::rollno = 593]
              ^           ^             ^
              |           |             |
          node-test   node-test     node-test
Note: * is a special node test, which is described below.
2.3.1. node() VS. * VS. @*
XPath Syntax | Abbreviated Syntax | Meaning |
---|---|---|
child::node() | node() | Selects all children of the current node |
child::* | * | Selects all element children of the current node |
attribute::* | @* | Selects all attributes of the context node |
Assume the following XML file:
    <A>foo
      <!-- this is comments -->
      bar
      <B>xxx</B>
      baz
      <B>xxx</B>
      quz
    </A>
Here are two tests that compare node() and *:
+-----------+-------------------------------------+
| XPath     | Result                              |
+-----------+-------------------------------------+
| /A/node() | Text='foo                           |
|           |   '                                 |
|           | Comment='<!-- this is comments -->' |
|           | Text='                              |
|           |   bar                               |
|           |   '                                 |
|           | Element='<B>xxx</B>'                |
|           | Text='                              |
|           |   baz                               |
|           |   '                                 |
|           | Element='<B>xxx</B>'                |
|           | Text='                              |
|           |   quz                               |
|           | '                                   |
+-----------+-------------------------------------+
| /A/*      | Element='<B>xxx</B>'                |
|           | Element='<B>xxx</B>'                |
+-----------+-------------------------------------+
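The difference between node() and * can also be observed with Python's xml.dom.minidom, whose childNodes list contains text and comment nodes as well as elements. This is a sketch: the KIND mapping is an illustrative helper, and the exact whitespace in the inlined document is an assumption.

```python
from xml.dom import minidom, Node

XML = """<A>foo
  <!-- this is comments -->
  bar
  <B>xxx</B>
  baz
  <B>xxx</B>
  quz
</A>"""

a = minidom.parseString(XML).documentElement

# Illustrative mapping from DOM node types to the labels used in the table.
KIND = {
    Node.TEXT_NODE: "Text",
    Node.COMMENT_NODE: "Comment",
    Node.ELEMENT_NODE: "Element",
}

# Like /A/node(): every child node, whatever its type.
all_children = [KIND[n.nodeType] for n in a.childNodes]

# Like /A/*: element children only.
element_children = [
    n.tagName for n in a.childNodes if n.nodeType == Node.ELEMENT_NODE
]

print(all_children)
print(element_children)
```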
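2.4. Predicates
A predicate is an expression in square brackets (expressions are described below) that further filters the nodes that have been found.
Assume the following XML file:
    <AAA>
      <BBB>foo</BBB>
      <BBB>bar</BBB>
      <BBB>baz</BBB>
      <BBB>quz</BBB>
    </AAA>
Here are some examples that use predicates: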
+----------------------------------------+------------------+---------------+
| XPath                                  | Result           | Note          |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[1]                            | <BBB>foo</BBB>   |               |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[position()=1]                 | <BBB>foo</BBB>   | Same as above |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[last()]                       | <BBB>quz</BBB>   |               |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[position()=1 or position()=2] | <BBB>foo</BBB>   |               |
|                                        | <BBB>bar</BBB>   |               |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[text()='foo']                 | <BBB>foo</BBB>   |               |
+----------------------------------------+------------------+---------------+
| /AAA/BBB[.='foo']                      | <BBB>foo</BBB>   | Same as above |
+----------------------------------------+------------------+---------------+
| /AAA[BBB='bar']                        | <AAA>            |               |
|                                        |   <BBB>foo</BBB> |               |
|                                        |   <BBB>bar</BBB> |               |
|                                        |   <BBB>baz</BBB> |               |
|                                        |   <BBB>quz</BBB> |               |
|                                        | </AAA>           |               |
+----------------------------------------+------------------+---------------+
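Several of these predicates are available in Python's xml.etree.ElementTree as well. The sketch below covers the positional and text predicates; it assumes Python 3.7 or later for [.='foo'], and it emulates the position()=1 or position()=2 row with a slice because boolean operators are not part of ElementTree's subset.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<AAA><BBB>foo</BBB><BBB>bar</BBB><BBB>baz</BBB><BBB>quz</BBB></AAA>"
)


def texts(nodes):
    """Illustrative helper: reduce elements to their text content."""
    return [n.text for n in nodes]


first = texts(root.findall("./BBB[1]"))        # /AAA/BBB[1]
last = texts(root.findall("./BBB[last()]"))    # /AAA/BBB[last()]
foo = texts(root.findall("./BBB[.='foo']"))    # /AAA/BBB[.='foo']

# /AAA/BBB[position()=1 or position()=2], emulated with a slice.
first_two = texts(root.findall("./BBB")[:2])

print(first, last, foo, first_two)
```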
2.5. Expressions
An XPath expression returns either a node-set, a string, a Boolean, or a number.
2.5.1. Operators
Table 4 lists the operators that can be used in expressions.
Operator | Description | Example | Return Value |
---|---|---|---|
() | Grouping | | |
\| | Union of two node-sets | //book \| //cd | Returns a node-set with all book and cd elements |
+ | Addition | 6 + 4 | 10 |
- | Subtraction | 6 - 4 | 2 |
* | Multiplication | 6 * 4 | 24 |
div | Division | 8 div 4 | 2 |
= | Equal | price=9.80 | true if price is 9.80 |
!= | Not equal | price!=9.80 | true if price is not 9.80 |
< | Less than | price<9.80 | true if price is less than 9.80 |
<= | Less than or equal to | price<=9.80 | true if price is less than or equal to 9.80 |
> | Greater than | price>9.80 | true if price is greater than 9.80 |
>= | Greater than or equal to | price>=9.80 | true if price is greater than or equal to 9.80 |
or | or | price=9.80 or price=9.70 | true if price is 9.80 or 9.70 |
and | and | price>9.00 and price<9.90 | true if price is greater than 9.00 and less than 9.90 |
mod | Modulus (division remainder) | 5 mod 2 | 1 |
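One detail of the mod row is easy to get wrong when porting expressions to another language: XPath 1.0 defines mod as the remainder of a truncating division, so the result takes the sign of the left operand. The sketch below contrasts that with Python's % operator; xpath_mod is an illustrative name, and NaN and infinity handling are left out.

```python
import math


def xpath_mod(a, b):
    """Remainder of a truncating division: the sign follows the dividend a."""
    return math.fmod(a, b)


# (a, b) pairs with operands of mixed sign.
cases = [(5, 2), (5, -2), (-5, 2), (-5, -2)]

xpath_results = [xpath_mod(a, b) for a, b in cases]
python_results = [a % b for a, b in cases]

print(xpath_results)   # truncating remainder
print(python_results)  # Python's floor-based remainder, for comparison
```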
For grouping, that is, parentheses (), see section 1.2: an expression without grouping (such as /rooms/room/student[1]) and one with grouping (such as (/rooms/room/student)[1]) give different results.
2.6. Functions
2.6.1. Node Set Functions
Table 5 lists the node-set functions.
Node Set Function | Description |
---|---|
count | Returns the number of nodes in the node-set argument. |
id | Selects elements by their unique ID. |
last | Returns a number equal to context size of the expression evaluation context. |
local-name | Returns the local part of the expanded name of the node in the node-set argument that is first in document order. |
name | Returns a string containing a QName representing the expanded name of the node in the node-set argument that is first in document order. |
namespace-uri | Returns the namespace Uniform Resource Identifier (URI) of the expanded name of the node in the node-set argument that is first in document order. |
position | Returns a number equal to the context position of the expression evaluation context. |
2.6.2. String Functions
Table 6 lists the string functions.
String Function | Description |
---|---|
concat | Returns the concatenation of the arguments. |
contains | Returns true if the first argument string contains the second argument string; otherwise returns false. |
normalize-space | Returns the argument string with leading and trailing white space stripped and each internal run of white space replaced by a single space. |
starts-with | Returns true if the first argument string starts with the second argument string; otherwise returns false. |
string | Converts an object to a string. |
string-length | Returns the number of characters in the string. |
substring | Returns the substring of the first argument starting at the position specified in the second argument and the length specified in the third argument. |
substring-after | Returns the substring of the first argument string that follows the first occurrence of the second argument string in the first argument string. |
substring-before | Returns the substring of the first argument string that precedes the first occurrence of the second argument string in the first argument string. |
translate | Returns the first argument string with occurrences of characters in the second argument string replaced by the character at the corresponding position in the third argument string. |
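Three of these functions behave differently from their closest Python counterparts: substring counts positions from 1, translate deletes characters that have no counterpart in the third argument, and normalize-space also collapses internal white space. The following is a rough Python sketch of those behaviours; the function names are made up, and substring handles only integer arguments.

```python
import re


def normalize_space(s):
    """Strip leading/trailing white space and collapse internal runs."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def substring(s, start, length=None):
    """1-based substring for integer arguments only."""
    begin = max(start, 1) - 1
    if length is None:
        return s[begin:]
    end = max(start + length - 1, 0)
    return s[begin:end]


def translate(s, src, dst):
    """Map characters of src to dst; delete those without a counterpart."""
    table = {}
    for i, ch in enumerate(src):
        if ord(ch) not in table:  # the first occurrence in src wins
            table[ord(ch)] = dst[i] if i < len(dst) else None
    return s.translate(table)


print(normalize_space("  a \n b\t"))
print(substring("12345", 2, 3))
print(translate("--aaa--", "abc-", "ABC"))
```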
2.6.3. Boolean Functions
Table 7 lists the Boolean functions.
Boolean Function | Description |
---|---|
boolean | Converts the argument to a Boolean. |
false | Returns false. |
lang | Returns true if the language of the context node, as given by its xml:lang attribute, is the same as or a sublanguage of the argument string. |
not | Returns true if the argument is false, otherwise, false. |
true | Returns true. |
2.6.4. Number Functions
Table 8 lists the number functions.
Number Function | Description |
---|---|
ceiling | Returns the smallest integer that is not less than the argument. |
floor | Returns the largest integer that is not greater than the argument. |
number | Converts the argument to a number. |
round | Returns the integer closest in value to the argument; if two integers are equally close, returns the one closer to positive infinity. |
sum | Returns the sum of all nodes in the node-set. Each node is first converted to a number value before summing. |
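The rounding rule differs from Python's built-in round, which rounds ties to the nearest even integer. The sketch below shows one way to approximate XPath's round and sum in Python; xpath_round is an illustrative name, the small marks document is made up for the example, and NaN, infinities, negative zero and floating-point edge cases are ignored.

```python
import math
import xml.etree.ElementTree as ET


def xpath_round(x):
    """Closest integer; ties go toward positive infinity."""
    return math.floor(x + 0.5)


xpath_values = [xpath_round(v) for v in (2.5, -2.5, 0.5, 2.4)]
python_values = [round(v) for v in (2.5, -2.5, 0.5, 2.4)]

# sum(//marks): each node's text is converted to a number first.
root = ET.fromstring(
    "<r><marks>85</marks><marks>95</marks><marks>90</marks><marks>70</marks></r>"
)
total = sum(float(m.text) for m in root.iter("marks"))

print(xpath_values, python_values, total)
```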
3. Tips
3.1. Getting all hyperlinks in a file
To get all hyperlinks in a file:
    $ xmllint --html --xpath '//@href' index.html
    $ xmllint --html --xpath '//a/@href' index.html    # only the hyperlinks in a elements, e.g. <a href="xx"></a>
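Where xmllint is not available, roughly the same result can be obtained with Python's standard html.parser module. This is a sketch rather than an XPath evaluation: the class name HrefCollector and the sample HTML string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values, roughly like //@href and //a/@href."""

    def __init__(self):
        super().__init__()
        self.all_hrefs = []     # href on any element, like //@href
        self.anchor_hrefs = []  # href on <a> elements only, like //a/@href

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value is not None:
                self.all_hrefs.append(value)
                if tag == "a":
                    self.anchor_hrefs.append(value)


# Hypothetical sample page standing in for index.html.
HTML = (
    '<html><head><link rel="stylesheet" href="style.css"></head>'
    '<body><a href="https://example.com/">x</a><a name="top">y</a></body></html>'
)

collector = HrefCollector()
collector.feed(HTML)
collector.close()

print(collector.all_hrefs)
print(collector.anchor_hrefs)
```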