Regular Expressions

Ashley J.S Mills

Copyright © 2005 The University Of Birmingham

Table of Contents

1. Introduction
2. Basics
    2.1. Single Character
    2.2. Any Character: .
    2.3. The Escape Character: \
    2.4. The Caret: ^
    2.5. The Dollar Symbol: $
    2.6. The Kleene star: *
    2.7. The Kleene plus: +
    2.8. Ranges: [ ], [cn-cm] and [^cn-cm]
    2.9. Grouping: \( \)
    2.10. Alternatives: |
    2.11. Repetition: \{n\}, \{,n\}, \{n,\}, \{n,m\}
3. Grep Examples
4. java.util.regex, Java 1.4
5. Emacs Regular Expressions
6. References

1. Introduction

Pattern matching is an important topic in Computer Science: it is the process of matching defined patterns against information. Humans use pattern matching every day to recognise objects and faces, and computers use it every day to perform the most basic of operations. When you execute a command at the command line, some kind of pattern matching is employed to determine what your command is asking the computer to do. Pattern matching is also used in compilers and programming languages.

Regular expressions are a particular kind of pattern matching language, the kind that describes the regular languages. They are considered the least complex of the pattern matching languages but are very useful. You have probably used something like them before: if you have ever asked to delete *.* at the command line, referring to any basename followed by a dot followed by any extension, then you have already used the concepts behind regular expressions. Most of you will be aware that in that setting the * character, known as a Kleene star or asterisk, means "match anything"; it is used in a similar, though not identical, manner in the regular expressions we are about to discuss, where it applies to the expression that precedes it (see Section 2.6).

Many programs have some kind of built-in regular expression handling. They all seem to have slight syntactical variations, but fortunately the concepts are identical in each case and the differences are often marginal. This text describes the most common components of a regular expression and presents program-specific examples where appropriate.

2. Basics

Regular expressions consist of literal characters and meta characters. Literal characters are the actual characters you want to find; meta characters are special characters, like the Kleene star, and are the core concept behind regular expressions. Hence we begin this section with a brief introduction to the most common meta characters.

2.1. Single Character

A single character such as Q is a regular expression: it matches every string that contains the character Q. So it would match Quick, Quiet and Quantum but not quick, because matching is case-sensitive by default.

2.2. Any Character: .

The period, or full stop as we call it in Britain, matches any single character. For example, ".t.m" would match atom, item and stem, and probably some other words too. A fun example of using this character can be found at http://www.oneacross.com/ where it is used to help people find words for their crosswords; they also accept the character '?' as an alternative.
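The behaviour of the period can be tried out with a short sketch. Python's re module is used here purely for illustration (it is not one of the tools this text covers), and the word list is made up for the example.

```python
import re

# Hypothetical word list, chosen for illustration only.
words = ["atom", "item", "stem", "team", "tm"]

# '.' stands for exactly one character, so ".t.m" needs four characters:
# any character, then 't', then any character, then 'm'.
pattern = re.compile(r".t.m")

matches = [word for word in words if pattern.search(word)]
print(matches)
```

"team" and "tm" are rejected because each period has to consume a character: there is nothing before the 't' in "team", and "tm" is simply too short.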

2.3. The Escape Character: \

\ is used to signify that we want to use a meta character as a literal character. This is necessary because otherwise the character in question would be interpreted as a meta character. The character being escaped is the one immediately following the escape character. For example, "\*" would match the character that has been escaped, that is, it would match the string * (or any string containing *).

The converse can also be true: sometimes \ is used to signify that we want to use a literal character as a meta character, for example in an implementation that requires meta characters to be escaped within a double-quoted string. You should read the documentation of the particular regular expression implementation you are using to find out which approach it takes.
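As a rough illustration of escaping, the sketch below again assumes Python's re module, where \ turns a meta character into a literal one. The sample strings are invented; re.escape is a helper from that module which inserts the backslashes for you.

```python
import re

lines = ["2 * 3 = 6", "2 + 3 = 5"]

# "\*" matches a literal asterisk rather than acting as the Kleene star.
literal_star = re.compile(r"\*")
hits = [line for line in lines if literal_star.search(line)]
print(hits)

# re.escape adds the backslashes needed to match a whole string literally.
escaped = re.escape("*.txt")
found = re.search(escaped, "delete *.txt now") is not None
print(found)
```

Other tools may take the opposite approach described above, so the same pattern text should not be assumed to carry over unchanged.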

2.4. The Caret: ^

^, known as a caret, is used to match the beginning of a line, so "^CAPITAL" would match "CAPITAL's signify emphasised speech, anger or SHOUTING" but would not match "You're such a CAPITAL idiot!".

2.5. The Dollar Symbol: $

$ is used to match the end of a line, so "here$" would match "I like it here" but would not match "here is a potato".
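The two anchors can be checked side by side. This is a small sketch in Python's re module (an assumption of this example, not a tool discussed in the text), using sample lines adapted from the ones above.

```python
import re

lines = [
    "CAPITALS signify SHOUTING",
    "such a CAPITAL idea",
    "I like it here",
    "here is a potato",
]

# '^' anchors the pattern to the start of the line.
starts_with_capital = [l for l in lines if re.search(r"^CAPITAL", l)]

# '$' anchors the pattern to the end of the line.
ends_with_here = [l for l in lines if re.search(r"here$", l)]

print(starts_with_capital)
print(ends_with_here)
```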

2.6. The Kleene star: *

* is used to match zero or more occurrences of the regular expression immediately preceding the meta character. "10*" would match "1", "10", "100", "1000" and so on.

2.7. The Kleene plus: +

+ is used to match one or more occurrences of the regular expression immediately preceding the meta character. "10+" would match "10", "100", "1000" and so on but would not match "1".

Note: (regular expression)+ is the same as (regular expression)(regular expression)*.
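The difference between the star and the plus, and the equivalence stated in the note, can be sketched as follows. Python's re module is assumed for illustration, and re.fullmatch is used so that each candidate string has to match in its entirety.

```python
import re

candidates = ["1", "10", "100", "1000", "2"]

# "10*": a '1' followed by zero or more '0' characters.
star = [s for s in candidates if re.fullmatch(r"10*", s)]

# "10+": a '1' followed by one or more '0' characters.
plus = [s for s in candidates if re.fullmatch(r"10+", s)]

# x+ is the same as xx*, so "10+" can be rewritten as "100*".
expanded = [s for s in candidates if re.fullmatch(r"100*", s)]

print(star)
print(plus)
print(expanded == plus)
```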

2.8. Ranges: [ ], [cn-cm] and [^cn-cm]

[ ] is used to signify that any one of the characters or expressions enclosed within the brackets may be matched. "1[123]512" would match "11512", "12512" and "13512".

[cn-cm] is used to specify a range of characters (inclusive) that may be matched at this point in the regular expression. "[b-f]oo" would match "boo", "coo", "doo", "eoo" and "foo" but not "goo".

[^cn-cm] is used to exclude a range of characters from a match. Notice that the caret has been used again: when it appears immediately after an opening [ it has this special meaning. If you want to exclude the caret itself then you would escape it: "[^\^]" (in implementations that treat \ as an escape inside brackets). "[^1-8]00" would match "900" but not any of the other three-digit hundreds such as "500".
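The three bracket forms can be exercised with a short sketch, again assuming Python's re module and using re.fullmatch so that whole strings are compared. The candidate lists are invented for the example.

```python
import re

# A set: exactly one of the listed characters.
one_of = [s for s in ["11512", "12512", "13512", "14512"]
          if re.fullmatch(r"1[123]512", s)]

# A range: any character from 'b' to 'f' inclusive.
in_range = [s for s in ["boo", "coo", "doo", "eoo", "foo", "goo"]
            if re.fullmatch(r"[b-f]oo", s)]

# A negated range: any character that is NOT '1' to '8'.
negated = [s for s in ["900", "500", "000"]
           if re.fullmatch(r"[^1-8]00", s)]

print(one_of)
print(in_range)
print(negated)
```

The last result includes "000" as well as "900": a negated range accepts every character outside the range, not only the one you had in mind.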

2.9. Grouping: \( \)

\( \) is used to treat the regular expression contained within the (escaped, in this case) brackets as a group. The group can then be back-referenced later, using \1 to refer to the first group defined. How this is implemented varies between programs that use regular expressions: some tools do not require you to escape the brackets, and some use different conventions to back-reference defined groups. For instance, a program may use "$1" to refer to the first bracketed group instead of "\1". There may also be a limit on the number of groups that can be referenced in this way, sometimes a maximum of nine. In the program grep, "\(a\)b\1" would match "aba".
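As an example of one such variation, here is a sketch in Python's re module (assumed only for illustration), where the brackets are written unescaped and \1 refers back to the first group both in the pattern and in replacement text.

```python
import re

# "(a)b\1": group 1 captures "a", and \1 must match that same text again.
fixed = re.compile(r"(a)b\1")
print(fixed.search("aba") is not None)
print(fixed.search("abb") is not None)

# With "(\w)b\1" the back reference follows whatever the group captured.
flexible = re.compile(r"(\w)b\1")
matched = [s for s in ["aba", "cbc", "abc"] if flexible.search(s)]
print(matched)

# Back references are also available in replacement text.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
print(swapped)
```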

2.10. Alternatives: |

| is the OR operator. Its operands are the regular expressions on either side of it: if either the first expression OR the second expression matches, then the whole expression matches. For example, "^aba\|b$" will match the lines "aba" (which begins with "aba") and "abb" (which ends with "b") but not "abc". The | meta character may or may not need to be escaped, depending on the program.
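The example above relies on | splitting the whole pattern into "^aba" and "b$". A small sketch, assuming Python's re module (where | is written unescaped), shows this and contrasts it with a grouped alternative; the grouped pattern is an addition for illustration only.

```python
import re

lines = ["aba", "abb", "abc", "b"]

# Ungrouped: either "^aba" matches OR "b$" matches.
ungrouped = [l for l in lines if re.search(r"^aba|b$", l)]

# Grouped: the whole line must be exactly "aba" or exactly "b".
grouped = [l for l in lines if re.search(r"^(aba|b)$", l)]

print(ungrouped)
print(grouped)
```

Brackets are therefore needed whenever the anchors are meant to apply to both alternatives.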

2.11. Repetition: \{n\}, \{,n\}, \{n,\}, \{n,m\}

\{n\} is used to specify that the regular expression immediately preceding must be matched exactly n times. "^10\{3\}$" will match the line "1000" but not "100" or "10000".

\{,n\} is used to specify that the regular expression immediately preceding may be matched up to a maximum of n times. "^10\{,3\}$" will match the lines "1", "10", "100" and "1000" but will not match "10000".

\{n,\} is used to specify that the regular expression immediately preceding must be matched at least n times. "^10\{3,\}$" will match the lines "1000", "10000", "100000" and so on but will not match "100".

Note: This is an alternative to using the Kleene star and the Kleene plus, which may not be supported in your implementation. "a\{0,\}" is the same as "a*" and "a\{1,\}" is the same as "a+".

\{n,m\} is used to specify that the regular expression immediately preceding must be matched at least n times but no more than m times. "^10\{3,4\}$" will match the lines "1000" and "10000" but not "100" or "100000".

The necessity to escape the characters may vary. Not all programs support all the types of repetition described.
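The four forms of repetition can be compared on one list of candidates. The sketch assumes Python's re module, which happens to support all four forms and writes the braces unescaped.

```python
import re

candidates = ["1", "10", "100", "1000", "10000", "100000"]

def matching(pattern):
    """Return the candidates that the pattern matches."""
    return [s for s in candidates if re.search(pattern, s)]

exactly_three = matching(r"^10{3}$")    # {n}: exactly n times
at_most_three = matching(r"^10{,3}$")   # {,n}: up to n times
at_least_three = matching(r"^10{3,}$")  # {n,}: n times or more
three_or_four = matching(r"^10{3,4}$")  # {n,m}: between n and m times

print(exactly_three)
print(at_most_three)
print(at_least_three)
print(three_or_four)
```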

3. Grep Examples

Grep is a tool used to search text using regular expressions. Its origins highlight its function; according to http://www.faqs.org/faqs/usenet/faq/part1/section-21.html they are as follows:

    The original UNIX text editor "ed" has a construct g/re/p, where "re" stands for a regular expression, to Globally search for matches to the Regular Expression and Print the lines containing them. This was so often used that it was packaged up into its own command, thus named "grep". According to Dennis Ritchie, this is the true origin of the command.

I will present a few examples, of which the first two are based on the following text file, mb.txt:

    NAME       MAKE     HP YEAR PRICE
    NSR250R    Honda    60 1993 £1340
    NSR250R-SP Honda    65 1994 £2000
    KR1S       Kawasaki 60 1989 £1250
    GSX250     Suzuki   26 1981 £300
    GS250T     Suzuki   26 1982 £250
    RGV250     Suzuki   60 1993 £1400
    RGV250-SP  Suzuki   65 1994 £2400

The examples use the -e option, which tells grep that the argument that follows is the pattern to search for. The patterns themselves are written in grep's basic regular expression syntax, which is why the grouping and repetition operators appear escaped.

    grep -e Honda mb.txt

Lists all the lines that contain the text string "Honda":

    NSR250R    Honda    60 1993 £1340
    NSR250R-SP Honda    65 1994 £2000

    grep -e "6. 1" mb.txt | grep -v Honda

Output:

    KR1S       Kawasaki 60 1989 £1250
    RGV250     Suzuki   60 1993 £1400
    RGV250-SP  Suzuki   65 1994 £2400

First lists all bikes that are sixty-something BHP (the pattern "6. 1" matches a 6, any character, a space and the 1 that begins the year), and then pipes this to another instance of grep, which excludes all the lines containing "Honda" with the -v option.
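The same two-stage filter can be sketched outside grep. The Python below is only an illustration of the logic, not a substitute for the command: the contents of mb.txt are embedded as a string (with single spaces between columns), and the two list comprehensions stand in for the two grep processes.

```python
import re

# Contents of mb.txt, embedded so that the sketch is self-contained.
MB_TXT = """\
NAME MAKE HP YEAR PRICE
NSR250R Honda 60 1993 £1340
NSR250R-SP Honda 65 1994 £2000
KR1S Kawasaki 60 1989 £1250
GSX250 Suzuki 26 1981 £300
GS250T Suzuki 26 1982 £250
RGV250 Suzuki 60 1993 £1400
RGV250-SP Suzuki 65 1994 £2400
"""

lines = MB_TXT.splitlines()

# Stage 1: keep lines matching "6. 1" (sixty-something BHP).
sixty_something = [l for l in lines if re.search(r"6. 1", l)]

# Stage 2: like grep -v, drop the lines that match "Honda".
not_honda = [l for l in sixty_something if not re.search(r"Honda", l)]

for line in not_honda:
    print(line)
```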

Note: In these grep commands, quotes are used to preserve whitespace in a pattern, and are used whenever '\' appears, since it is also special within the shell and so needs to be hidden from the shell.

    ls -l | grep -e "Aristotle\.txt"

Output:

    Aristotle.txt

Pipes the output from a directory listing to grep, which keeps only the lines containing "Aristotle.txt". Note the use of the escape character '\' to match '.' literally.

    grep -e "\(101\)\1" in.file

Matches the string "101101". Notice that "101" is first grouped by enclosing it within an escaped opening parenthesis "\(" and an escaped closing parenthesis "\)". The first group is then referenced with "\1". Something like:

    grep -e "1\(0\)*" in.file

would match "1" followed by zero or more occurrences of "0", whereas:

    grep -e "1\(0\)\+" in.file

would match "1" followed by at least one occurrence of "0". This is the same as:

    grep -e "1\(0\)\(\1\)*" in.file

'[' followed by ']' can be used to match a range of characters, and some special ranges are already defined:

• [[:alnum:]] matches [0-9a-zA-Z]
• [[:alpha:]] matches [a-zA-Z]
• [[:cntrl:]] matches control characters
• [[:digit:]] matches [0-9]
• [[:lower:]] matches [a-z]
• [[:punct:]] matches punctuation characters
• [[:upper:]] matches [A-Z]
• [[:space:]] matches any white space

    grep -e "\([[:alpha:]]\)\+[[:digit:]][[:upper:]]"

Would match one or more characters in the range [a-zA-Z] followed by one character in the range [0-9] followed by one character in the range [A-Z]. So it would match the string "abc9Z". The number of times a pattern must be matched can be specified after the group:

    grep -e "\(abc\)\{3\}" in.file

Would match any lines containing 3 consecutive occurrences of the pattern "abc".

    grep -e "^\(abc\)\{2\}" in.file

Would match any lines containing 2 consecutive occurrences of the pattern "abc", with the restriction that the sequence must start at the beginning of a line, as specified by the use of '^'. There are similar commands to '^':

• $ matches the end of a line
• \< matches the beginning of a word
• \> matches the end of a word
• \b matches the empty string at the edge of a word
• \B matches the empty string provided it is not at the edge of a word.
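The word-edge anchors can be illustrated with a short sketch. It assumes Python's re module, which provides \b and \B but not \< or \>; the last two patterns below are one possible way of approximating "beginning of a word" and "end of a word" there, not grep syntax. The sample text is made up.

```python
import re

text = "cat concat catalog"

# \b on both sides: "cat" only as a complete word.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B before "cat": only where "cat" does NOT start at a word edge.
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

# Approximating \< : words that begin with "cat".
begins_with = re.findall(r"\bcat\w*", text)

# Approximating \> : words that end with "cat".
ends_with = re.findall(r"\w*cat\b", text)

print(whole_word)
print(inside_word)
print(begins_with)
print(ends_with)
```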

    grep -e "^\([[:alpha:]][[:alnum:]]*\)=\1"

Would match an alphabetic character at the start of a line, followed by zero or more alphanumeric characters, followed by '=', followed by the same sequence of characters that was matched before the '='. So "abc=abc" would be matched but "abc=abd" would not be matched. Suppose you wanted to match the h1, h2, h3... etc. elements in an HTML file. Assume the text file html.txt: