Regular Expressions

1. Introduction to Regular Expressions

Regular expressions originated in 1956, when mathematician Stephen Cole Kleene described regular languages using his mathematical notation called regular sets.

Regular expressions came into popular use from 1968 for two purposes: pattern matching in a text editor and lexical analysis in a compiler.

References:
Regular Expressions Quick Reference: http://www.regular-expressions.info/refquick.html
Mastering Regular Expressions, by Jeffrey E.F. Friedl
Comparison of regular expression engines: https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
Online Regex Tester 1: http://www.regextester.com/
Online Regex Tester 2: http://www.regexr.com/

1.1. The POSIX Regular Expression Standards

The POSIX standard defines two regular expression flavors: Basic Regular Expressions (BREs) and Extended Regular Expressions (EREs).

Reference: POSIX standard, Chapter 9: Regular Expressions: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

1.1.1. Differences Between BRE and ERE

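In a BRE, the characters {, (, ) stand for themselves; they act as metacharacters only when escaped with a backslash (\{, \(, \)). POSIX BRE has no ?, + or | operators (GNU tools accept \?, \+ and \| as extensions).

In an ERE, ?, +, {, |, (, ) are all metacharacters, which makes regular expressions more convenient to write.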

1.1.2. Tools and POSIX Regular Expressions

grep uses BRE by default (grep -E switches to ERE).
egrep uses ERE.
sed uses BRE by default (ERE is selected with -r in GNU sed and with -E in FreeBSD sed).
awk uses a superset of ERE.

2. Regular Expressions Features

2.1. Any Character (.)

The dot (.) matches any character (except line break) when used outside a bracket expression.

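The dot does not match a line break by default for historical reasons. The earliest regex tools read text one line at a time and matched each line separately, so the subject string never contained a line break and the dot could never encounter one.

2.1.1. Single-line Mode (Making the Dot Match Line Breaks)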

In Perl, the mode where the dot also matches line breaks is called "single-line mode". You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s.

An example of single-line mode in Perl:

$_="first line\nsecond line";
if (m/first.*second/s) {
  print "Match!\n"
} else {
  print "Not match!\n"
}

The program above prints "Match!". Without the s modifier the match fails, because first and second are on different lines, and the program prints "Not match!".
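For comparison, the same check can be sketched in Python, where the re.DOTALL flag plays the role of Perl's s modifier. The helper name matches_across_lines is made up for this illustration.

```python
import re

TEXT = "first line\nsecond line"

def matches_across_lines(text, dot_matches_newline):
    """Return True if 'first.*second' is found in text (hypothetical helper)."""
    # re.DOTALL makes the dot match every character, including "\n".
    flags = re.DOTALL if dot_matches_newline else 0
    return re.search(r"first.*second", text, flags) is not None

print(matches_across_lines(TEXT, dot_matches_newline=True))   # True
print(matches_across_lines(TEXT, dot_matches_newline=False))  # False
```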

2.2. Anchors and Word Boundaries (^, $, etc.)

Table 1: Anchors and Word Boundaries in regex
Feature                                      Java  Perl  PCRE  ECMA  Python  Ruby  POSIX BRE  POSIX ERE
^ (start of string/line)                     Yes   Yes   Yes   Yes   Yes     Yes   Yes        Yes
$ (end of string/line)                       Yes   Yes   Yes   Yes   Yes     Yes   Yes        Yes
\A (start of string)                         Yes   Yes   Yes   No    Yes     Yes   No         No
\Z (end of string, before final line break)  Yes   Yes   Yes   No    No      Yes   No         No
\z (end of string)                           Yes   Yes   Yes   No    \Z      Yes   No         No
\b (at the beginning or end of a word)       Yes   Yes   ascii ascii option  ascii No         No
\B (NOT at the beginning or end of a word)   Yes   Yes   ascii ascii option  ascii No         No

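2.2.1. Multi-line Mode (Changing the Behavior of ^ and $)

Outside multi-line mode (the default):
^ matches only at the start of the string;
$ matches only at the end of the string.

In multi-line mode:
^ matches at the start of the string and also at the start of each line (the position right after a newline \n);
$ matches at the end of the string and also at the end of each line (the position right before a newline \n).

In Perl, you can activate multi-line mode by adding an m after the regex code, like this: m/^regex$/m.

Note: multi-line mode and single-line mode are independent of each other. Either one can be enabled without the other.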
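As a rough illustration in Python (where multi-line mode is the re.MULTILINE flag and single-line mode is re.DOTALL), the sketch below shows how ^ behaves with and without the flag, and that enabling re.DOTALL alone does not change ^. The variable names are chosen only for this example.

```python
import re

TEXT = "first line\nsecond line"

# Default mode: ^ matches only at the very start of the string.
default_hits = re.findall(r"^\w+", TEXT)

# Multi-line mode: ^ also matches right after each newline.
multiline_hits = re.findall(r"^\w+", TEXT, re.MULTILINE)

# Single-line mode (DOTALL) affects only the dot, not ^.
dotall_only = re.search(r"^second", TEXT, re.DOTALL)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second']
print(dotall_only)     # None
```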

2.3. Repetition (?, *, +, {m,n})

Table 2: Repetition in regex
Feature                                                Java  Perl  PCRE  ECMA  Python  Ruby  POSIX BRE  POSIX ERE
? (0 or 1)                                             Yes   Yes   Yes   Yes   Yes     Yes   No         Yes
* (0 or more)                                          Yes   Yes   Yes   Yes   Yes     Yes   Yes        Yes
+ (1 or more)                                          Yes   Yes   Yes   Yes   Yes     Yes   No         Yes
{n} (exactly n)                                        Yes   Yes   Yes   Yes   Yes     Yes   \{n\}      Yes
{n,m} (between n and m)                                Yes   Yes   Yes   Yes   Yes     Yes   \{n,m\}    Yes
{n,} (n or more)                                       Yes   Yes   Yes   Yes   Yes     Yes   \{n,\}     Yes
? after any of the above quantifiers (make it "lazy")  Yes   Yes   Yes   Yes   Yes     Yes   No         No

2.3.1. Lazy Repetition (??, *?, +?, {m,n}?)

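By default, quantifiers are greedy: each one matches as much text as it can (POSIX engines return the leftmost-longest match).
Appending a question mark ? to a quantifier (?, *, +, {n}, {m,n}, {n,}) makes it non-greedy, also called lazy.

For example, suppose file1 contains:

This 'test' isn't successful?

We want to match the text between the single quotes, that is, test. A first attempt (the -o option prints only the matched part):

grep -oP "'.*'" file1  # actually matches 'test' isn', because the dot also matches a single quote and * is greedy

How can we match only 'test'?
Method 1: use a negated character class that matches anything except a single quote.

grep -oP "'[^']*'" file1

Method 2: use a lazy quantifier.

grep -oP "'.*?'" file1   # the question mark after the star makes the quantifier lazy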
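The three patterns from the single-quote example can also be compared side by side. The following is a small Python sketch; the variable names are illustrative only.

```python
import re

LINE = "This 'test' isn't successful?"

# Greedy: .* runs to the last single quote on the line.
greedy = re.search(r"'.*'", LINE).group()

# Lazy: .*? stops at the first closing quote it can reach.
lazy = re.search(r"'.*?'", LINE).group()

# Negated class: [^']* cannot cross a quote at all.
negated = re.search(r"'[^']*'", LINE).group()

print(greedy)   # 'test' isn'
print(lazy)     # 'test'
print(negated)  # 'test'
```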

Reference:
An application of lazy quantifiers, matching comments in source code: http://blog.ostermiller.org/find-comment

2.3.2. Possessive (?+, *+, ++, {m,n}+)

Besides greedy and lazy quantifiers, Java introduced possessive quantifiers. Other languages (for example Ruby since 1.9 and Perl since 5.10) support them as well.

See Section 2.6.2.

2.4. Character Classes

Table 3: Character Classes in regex
Feature                                                      Java  Perl  PCRE  ECMA  Python  Ruby  POSIX BRE  POSIX ERE
[abc] (character class)                                      Yes   Yes   Yes   Yes   Yes     Yes   Yes        Yes
[^abc] (negated character class)                             Yes   Yes   Yes   Yes   Yes     Yes   Yes        Yes
[a-z] (character class range)                                Yes   Yes   Yes   Yes   Yes     Yes   Yes        Yes
\d (shorthand for digits)                                    ascii Yes   ascii ascii option  ascii No         No
\w (shorthand for word characters)                           ascii Yes   ascii ascii option  ascii No         No
\s (shorthand for whitespace)                                ascii Yes   ascii Yes   option  ascii No         No
\D, \W and \S (negated character classes for \d, \w and \s)  Yes   Yes   Yes   Yes   Yes     Yes   No         No

2.4.1. POSIX Bracket Expressions

The main purpose of bracket expressions is that they adapt to the user's or application's locale. A locale is a collection of rules and settings that describe language and cultural conventions, like sort order, date format, etc. The POSIX standard defines these locales.

Table 4: POSIX Bracket Expressions in regex
Feature                                 Java  Perl  ECMA  Python  Ruby  POSIX BRE  POSIX ERE
[:alpha:] (POSIX character class)       No    Yes   No    No      Yes   Yes        Yes
\p{Alpha} (POSIX character class)       ascii No    No    No      No    No         No
[.span-ll.] (POSIX collation sequence)  No    No    No    No      No    Yes        Yes
[=x=] (POSIX character equivalence)     No    No    No    No      No    Yes        Yes

Reference: http://www.regular-expressions.info/posixbrackets.html

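2.4.1.1. [:alpha:] Must Be Placed Inside Another Pair of Brackets

[:alpha:] takes on its special meaning (alphabetic characters) only when it appears inside a bracket expression, that is, within another pair of square brackets.

Test:

$ echo "Test" | grep '[:alpha:]'     # a bare [:alpha:] is an ordinary bracket expression matching the characters : a l p h, so nothing is printed
$ echo ":" | grep '[:alpha:]'
:
$ echo "Test" | grep '[[:alpha:]]'   # [[:alpha:]] matches alphabetic characters
Test

Note: recent versions of GNU grep reject the bare [:alpha:] pattern with a syntax error instead of treating it as an ordinary bracket expression.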

2.5. Group and Backreferences ((regex), (?:regex))

Table 5: Group and Backreferences in regex
Feature                                             Java  Perl  PCRE  ECMA  Python  Ruby  POSIX BRE  POSIX ERE
(regex) (numbered capturing group)                  Yes   Yes   Yes   Yes   Yes     Yes   \( \)      Yes
(?:regex) (non-capturing group)                     Yes   Yes   Yes   Yes   Yes     Yes   No         No
\1 \2 ... (backreferences)                          Yes   Yes   Yes   Yes   Yes     Yes   Yes        No
Backreferences to non-existent groups are an error  Yes   Yes   Yes   No    Yes     No    Yes        n/a
Backreferences to failed groups also fail           Yes   Yes   Yes   No    Yes     Yes   Yes        n/a

Note: although the POSIX standard does not define backreferences for EREs, tools such as egrep support them.

Backreference example:

$ echo hellohello | egrep '(hello)\1'
hellohello
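The same backreference can be tried in Python. This is only a sketch; REPEATED_WORD is an additional pattern made up for this example to show a more general use of \1.

```python
import re

# \1 must repeat exactly the text that group 1 captured.
DOUBLED = re.compile(r"(hello)\1")

print(bool(DOUBLED.search("hellohello")))   # True
print(bool(DOUBLED.search("hello world")))  # False

# Hypothetical extra pattern: a word immediately followed by itself.
REPEATED_WORD = re.compile(r"\b(\w+) \1\b")
match = REPEATED_WORD.search("this is is a test")
print(match.group())   # is is
print(match.group(1))  # is
```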

Reference:
http://www.regular-expressions.info/backref2.html

2.5.1. Named Capturing Group and Backreferences

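The .NET and Python regex engines support named capturing groups, that is, capturing groups that are given a name. Table 6 reflects older engine versions; newer releases of Java (since 7), Perl (since 5.10), PCRE (since 7.0) and Ruby (since 1.9) also accept the .NET-style (?<name>regex) syntax.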

Table 6: Named Capturing Group and Backreferences
Feature                                               .NET  Java  Perl  PCRE  ECMA  Python  Ruby  POSIX BRE  POSIX ERE
(?<name>regex) (.NET-style named capturing group)     Yes   No    No    No    No    No      No    No         No
(?'name'regex) (.NET-style named capturing group)     Yes   No    No    No    No    No      No    No         No
\k<name> (.NET-style named backreference)             Yes   No    No    No    No    No      No    No         No
\k'name' (.NET-style named backreference)             Yes   No    No    No    No    No      No    No         No
(?P<name>regex) (Python-style named capturing group)  No    No    No    Yes   No    Yes     No    No         No
(?P=name) (Python-style named backreference)          No    No    No    Yes   No    Yes     No    No         No
Multiple capturing groups can have the same name      Yes   n/a   n/a   No    n/a   No      n/a   n/a        n/a
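A short Python sketch of the Python-style syntax from Table 6 follows. The patterns and the sample strings are invented for illustration.

```python
import re

# (?P<name>...) defines a named group; the match exposes it by name.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
date_match = DATE.search("Created: 2011-12-17")
print(date_match.group("year"))  # 2011
print(date_match.groupdict())    # {'year': '2011', 'month': '12', 'day': '17'}

# (?P=name) is a backreference to the named group.
REPEATED = re.compile(r"\b(?P<word>\w+) (?P=word)\b")
word_match = REPEATED.search("this is is a test")
print(word_match.group("word"))  # is
```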

2.6. Atomic Group ((?>regex)) and Possessive Quantifiers (?+, *+, ++, {m,n}+)

Table 7: Atomic Group and Possessive Quantifiers in regex
Feature                                         Java  Perl  PCRE  ECMA  Python  Ruby  POSIX BRE  POSIX ERE
(?>regex) (atomic group)                        Yes   Yes   Yes   No    No      Yes   No         No
?+, *+, ++ and {m,n}+ (possessive quantifiers)  Yes   5.10+ Yes   No    No      1.9+  No         No

2.6.1. Atomic Group ((?>regex))

An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing. The syntax is (?>group).

An example will make the behavior of atomic groups clear. The regular expression a(bc|b)c (capturing group) matches abcc and abc. The regex a(?>bc|b)c (atomic group) matches abcc but not abc.
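When a(bc|b)c is matched against abc, the alternative bc inside the group first matches the text bc, and the final c of the regex then fails to match. The engine backtracks and tries the group's other alternative, b, after which the final c matches and the overall match succeeds.
When a(?>bc|b)c is matched against abc, the alternative bc inside the group again matches bc and the final c fails. Because the group is atomic, the engine cannot backtrack into it to try b, so the overall match fails.

Atomic groups can make matching faster, and they give precise control over what can and cannot match.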
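The difference can be observed in code. Python's re module accepts (?>...) only from version 3.11 on, so the sketch below relies on a well-known emulation instead, which is an assumption of this example rather than something taken from the text above: a lookahead containing a capturing group, followed by a backreference, (?=(bc|b))\1. Once a lookahead has succeeded the engine does not backtrack into it, which gives the same "no second chance" behavior as an atomic group.

```python
import re

ORDINARY = re.compile(r"a(bc|b)c")

# Emulated atomic group: the lookahead picks one alternative, \1 consumes it,
# and the engine cannot re-enter the lookahead to try the other alternative.
ATOMIC_LIKE = re.compile(r"a(?=(bc|b))\1c")

for text in ("abcc", "abc"):
    print(text,
          bool(ORDINARY.fullmatch(text)),
          bool(ATOMIC_LIKE.fullmatch(text)))
# abcc True True
# abc True False
```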

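References:
http://www.regular-expressions.info/atomic.html
Mastering Regular Expressions, 3rd Edition, Chapter 6, "Techniques for Faster Expressions": "Use Atomic Grouping and Possessive Quantifiers"
Mastering Regular Expressions, 3rd Edition, Chapter 6, "Unrolling the Loop": "Using Atomic Grouping and Possessive Quantifiers"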

2.6.2. Possessive Quantifiers (?+, *+, ++, {m,n}+)

Like a greedy quantifier, a possessive quantifier repeats the token as many times as possible. Unlike a greedy quantifier, it does not give up matches as the engine backtracks. With a possessive quantifier, the deal is all or nothing. You can make a quantifier possessive by placing an extra + after it. * is greedy, *? is lazy, and *+ is possessive. ++, ?+ and {n,m}+ are all possessive as well.

Once a possessive quantifier has matched some text, it never gives that text back, just like an atomic group. Every possessive quantifier can be rewritten as an atomic group.
Technically, possessive quantifiers are a notational convenience to place an atomic group around a single quantifier.

For example, X*+ can be rewritten as (?>X*),
and (?:a|b)*+ can be rewritten as (?>(?:a|b)*), but not as (?>a|b)*.

Summary: a possessive quantifier behaves like a greedy one in that it matches as many characters as possible, but it never gives up what it has matched (the engine does not backtrack into it).

2.6.2.1. Java Example: Comparing Greedy, Lazy and Possessive

The following Java program demonstrates the difference between greedy, lazy (reluctant) and possessive quantifiers. Reference: http://docs.oracle.com/javase/tutorial/essential/regex/quant.html

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTestHarness {

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";

        String regex1 = ".*foo";       // greedy quantifier
        String regex2 = ".*?foo";      // reluctant quantifier
        String regex3 = ".*+foo";      // possessive quantifier

        System.out.println("Begin test greedy quantifier");
        test(regex1, text);
        System.out.println("Begin test reluctant quantifier");
        test(regex2, text);
        System.out.println("Begin test possessive quantifier");
        test(regex3, text);
    }

    public static void test(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);

        boolean found = false;
        while (matcher.find()) {
            System.out.format("I found the text" + " \"%s\" starting at " + "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
            found = true;
        }
        if (!found) {
            System.out.println("No match found.");
        }
    }
}

The Java program above prints:

Begin test greedy quantifier
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Begin test reluctant quantifier
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
Begin test possessive quantifier
No match found.

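The output shows that the possessive regex ".*+foo" cannot match "xfooxxxxxxfoo". Why?
The leading ".*+" has already consumed the entire string "xfooxxxxxxfoo". Being possessive, it gives nothing back, so the trailing "foo" has nothing left to match and the overall match fails.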

2.7. Lookaround ((?=regex), (?!regex), (?<=text), (?<!text))

First, a summary of lookaround support:

Table 8: Lookaround in regex
Feature                          Java           Perl          PCRE          ECMA  Python        Ruby  POSIX BRE  POSIX ERE
(?=regex) (positive lookahead)   Yes            Yes           Yes           Yes   Yes           Yes   No         No
(?!regex) (negative lookahead)   Yes            Yes           Yes           Yes   Yes           Yes   No         No
(?<=text) (positive lookbehind)  finite length  fixed length  fixed length  No    fixed length  No    No         No
(?<!text) (negative lookbehind)  finite length  fixed length  fixed length  No    fixed length  No    No         No

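Lookaround constructs do not consume any characters; they only match a position in the text.

Let X be a regular expression. Then:

(?=X)   the position must be followed by X
(?!X)   the position must not be followed by X
(?<=X)  the position must be preceded by X
(?<!X)  the position must not be preceded by X

References:
http://www.regular-expressions.info/lookaround.html
Mastering Regular Expressions, 3rd Edition, Section 2.3.5.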

2.7.1. Positive Lookahead ((?=regex))

A positive lookahead matches a position; it does not consume any text itself.

A simple example illustrates this.
The regex (?=Jeffrey) matches the position just before "Jeffrey" in the following string (it does not match any text):

...by Jeffrey Friedl.

As another example, the regex (?=Jeffrey)Jeff (which is equivalent to Jeff(?=rey)) matches the "Jeff" in:

by Jeffrey Friedl.

(?=Jeffrey)Jeff does not match the "Jeff" in:

by Thomas Jefferson.

This can be verified with grep:

$ echo "by Jeffrey Friedl"   | grep --perl-regexp --only-matching '(?=Jeffrey)Jeff'     # matches Jeff
Jeff
$ echo "by Thomas Jefferson" | grep --perl-regexp --only-matching '(?=Jeffrey)Jeff'     # no match
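The zero-width nature of the lookahead is easy to see from match spans. A minimal Python sketch of the Jeffrey example (variable names are illustrative):

```python
import re

PATTERN = re.compile(r"(?=Jeffrey)Jeff")

hit = PATTERN.search("by Jeffrey Friedl")
miss = PATTERN.search("by Thomas Jefferson")
print(hit.group(), hit.span())  # Jeff (3, 7)
print(miss)                     # None

# The lookahead on its own matches an empty string: a position, not text.
position = re.search(r"(?=Jeffrey)", "by Jeffrey Friedl")
print(position.span())          # (3, 3)
```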

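2.7.2. Application of Positive Lookahead and Positive Lookbehind: Adding Commas to a Number

Given the following string:

The population of 298444215 is growing

we want to insert comma separators into the number to make it easier to read:

The population of 298,444,215 is growing

To do this, replace every position matched by (?<=\d)(?=(\d\d\d)+\b) with a comma. The lookbehind requires a digit to the left, and the lookahead requires the digits to the right, up to the word boundary, to form one or more complete groups of three: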

$ echo "The population of 298444215 is growing" | perl -pe 's/(?<=\d)(?=(\d\d\d)+\b)/,/g'
The population of 298,444,215 is growing
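The same substitution can be sketched in Python with re.sub. The function name add_commas is hypothetical.

```python
import re

# Zero-width position: a digit on the left, and on the right one or more
# complete groups of three digits ending at a word boundary.
COMMA_POSITION = re.compile(r"(?<=\d)(?=(\d\d\d)+\b)")

def add_commas(text):
    """Insert thousands separators into the digit runs of text."""
    return COMMA_POSITION.sub(",", text)

print(add_commas("The population of 298444215 is growing"))
# The population of 298,444,215 is growing
```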

2.8. Conditionals ((?cond re1), (?cond re1 | re2))

A special construct (?cond re1 | re2) allows you to create conditional regular expressions. If the cond part evaluates to true, then the regex engine will attempt to match the re1 part. Otherwise, the re2 part is attempted instead. You can omit the | re2 part.

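The kinds of cond that are allowed vary by flavor, but most implementations allow references to captured subexpressions and lookaround constructs.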

Table 9: Conditionals in regex
Feature                                       Java  Perl  PCRE  ECMA  Python  Ruby  POSIX BRE  POSIX ERE
(?(?=regex)re1 | re2) (using any lookaround)  No    Yes   Yes   No    No      No    No         No
(?(1)re1 | re2)                               No    Yes   Yes   No    Yes     No    No         No
(?(group)re1 | re2)                           No    No    Yes   No    Yes     No    No         No

The regex (a)?b(?(1)c|d) consists of the optional capturing group (a)?, the literal b, and the conditional (?(1)c|d) that tests the capturing group. This regex matches bd and abc. It does not match bc, but does match bd in text abd.
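Python's re module supports the (?(1)re1|re2) form, so the behavior described above can be checked with a short sketch. The helper first_match is made up for this example.

```python
import re

# If group 1 (the optional "a") took part in the match, require "c";
# otherwise require "d".
COND = re.compile(r"(a)?b(?(1)c|d)")

def first_match(text):
    """Return the first matched substring, or None (hypothetical helper)."""
    m = COND.search(text)
    return m.group() if m else None

for text in ("bd", "abc", "bc", "abd"):
    print(text, first_match(text))
# bd bd
# abc abc
# bc None
# abd bd
```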

Reference: http://www.regular-expressions.info/conditional.html

2.9. Modifiers

Table 10: Modifiers in regex
Feature                                        Java  Perl  PCRE  ECMA     Python  Ruby       POSIX BRE  POSIX ERE
(?i) (case insensitive)                        Yes   Yes   Yes   /i only  Yes     Yes        No         No
(?s) (dot matches newlines)                    Yes   Yes   Yes   No       Yes     (?m)       No         No
(?m) (^ and $ match at line breaks)            Yes   Yes   Yes   /m only  Yes     always on  No         No
(?x) (free-spacing mode)                       Yes   Yes   Yes   No       Yes     Yes        No         No
(?-ismx) (turn off mode modifiers)             Yes   Yes   Yes   No       No      Yes        No         No
(?ismx:group) (mode modifiers local to group)  Yes   Yes   Yes   No       No      Yes        No         No
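As a quick illustration of the first four modifiers in Table 10, here is a Python sketch that places each inline modifier at the very start of the pattern (recent Python versions require global inline flags to appear there). The sample strings are invented.

```python
import re

# (?i): case-insensitive matching.
case_insensitive = re.search(r"(?i)regex", "REGEX") is not None

# (?s): the dot also matches a newline.
dot_all = re.search(r"(?s)a.b", "a\nb") is not None

# (?m): ^ and $ also match at line breaks.
lines = re.findall(r"(?m)^\w+$", "one\ntwo")

# (?x): free-spacing mode; whitespace and # comments in the pattern are ignored.
free_spacing = re.fullmatch(r"""(?x)
    \d{4}   # year
    -
    \d{2}   # month
""", "2017-12") is not None

print(case_insensitive, dot_all, lines, free_spacing)
# True True ['one', 'two'] True
```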

References:
Mastering Regular Expressions, 3rd Edition, Section 3.4.4
http://www.regular-expressions.info/modifiers.html

2.10. Comments ((?#comment))

Table 11: Comments in regex
Feature      Java  Perl  PCRE  ECMA  Python  Ruby  POSIX BRE  POSIX ERE
(?#comment)  No    Yes   Yes   No    Yes     Yes   No         No

2.11. Comparison of Regex Flavors

3. Implementing Regular Expressions

Reference: Implementing Regular Expressions, by Russ Cox: https://swtch.com/~rsc/regexp/

4. Tips

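4.1. Matching Chinese Characters

If the regex engine supports \u escapes, Chinese characters can be matched with [\u4e00-\u9fa5].

In Perl, \p{Han} matches Chinese characters. For example: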

$ perl -le'use utf8; if ( "这是中文" =~ /\p{Han}/ ) { print "OK!" }'
OK!

For example, to find the lines of file1.txt that contain Chinese characters:

$ cat file1.txt | perl -C -ne 'print if /\p{Han}/'
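The \u range mentioned at the start of this section can be tried in Python, whose re module understands \uXXXX escapes. This is only a sketch; the function name lines_with_chinese and the sample data are made up, and the sample text is written with escapes so the snippet does not depend on the source file encoding.

```python
import re

# Character range taken from the text above.
HAN = re.compile(r"[\u4e00-\u9fa5]")

def lines_with_chinese(lines):
    """Return the lines that contain at least one character in the range."""
    return [line for line in lines if HAN.search(line)]

SAMPLE = [
    "plain ASCII line",
    "\u8fd9\u662f\u4e2d\u6587",    # four Chinese characters
    "mixed \u4e2d\u6587 and ASCII",
]
print(len(lines_with_chinese(SAMPLE)))  # 2
```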

4.2. Validating a User Password

How can a single regular expression check that a user password meets the following requirements?

  • Contains between 8 and 15 characters
  • Contains an uppercase letter
  • Contains a lowercase letter
  • Contains a digit
  • Contains one of the special symbols

Lookaround can do this. For example:

^(?=.{8,15}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$%^&*]).*$
 \__________/\_________/\_________/\_________/\______________/
    length      upper      lower      digit        symbol

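Note: if you are not restricted to a single regular expression, the same requirements can also be checked one by one with several regular expressions.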
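The single-regex approach can be sketched in Python as follows. The names PASSWORD_RE and is_valid_password are hypothetical, and the set of special symbols is the one used in the regex above.

```python
import re

PASSWORD_RE = re.compile(
    r"^(?=.{8,15}$)"       # length: between 8 and 15 characters
    r"(?=.*[A-Z])"         # at least one uppercase letter
    r"(?=.*[a-z])"         # at least one lowercase letter
    r"(?=.*[0-9])"         # at least one digit
    r"(?=.*[!@#$%^&*])"    # at least one special symbol
    r".*$"
)

def is_valid_password(candidate):
    """Return True if candidate satisfies all five requirements."""
    return PASSWORD_RE.match(candidate) is not None

print(is_valid_password("Passw0rd!"))  # True
print(is_valid_password("password"))   # False
```

Each lookahead starts from the beginning of the string and checks one requirement without consuming text, which is why the requirements can be satisfied in any order.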

Reference: http://stackoverflow.com/questions/3533408/regex-i-want-this-and-that-and-that-in-any-order/3533526

Author: cig01

Created: <2011-12-17 Sat>

Last updated: <2017-12-25 Mon>

Creator: Emacs 27.1 (Org mode 9.4)