Wednesday, June 19, 2013

Regular Expression (Regexe)

yet another insignificant Programming Notes

Regular Expression 

Regular Expression, or regexe in short, is extremely and amazingly powerful in searching and manipulating text strings. One line of regexe can easily replace several dozen lines of programming codes. Regexe is supported in almost all the scripting languages (such as Perl, Python, PHP, and JavaScript) as well as general purpose programming languages such as Java, and even word processor such as Word for searching texts. 

>>> convert('CamelCase')
camel_case
def convert(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
Or if you're going to call it a zillion times, you can pre-compile the regexes:
first_cap_re = re.compile('(.)([A-Z][a-z]+)')
all_cap_re = re.compile('([a-z0-9])([A-Z])')
def convert(name):
    s1 = first_cap_re.sub(r'\1_\2', name)
    return all_cap_re.sub(r'\1_\2', s1).lower()
Example 1: Numeric String /^[0-9]+$/
  1. wrapper/.../.
  2. position anchors: The leading ^ and the trailing $, which match the beginning and ending of the input string, respectively. As a result, the entire input string shall be matched, instead of a portion of the input string. Without these position anchors, the regexe can match any part of the input string, i.e., with leading and trailing sub-string unmatched.
  3. digits: The [0-9] matches any character between 0 and 9 .
  4. occurrences: The + indicate 1 or more occurrences of the previous sub-expression. In this case, [0-9]+ matches one or more digits.
  5. This regexe matches any non-empty numeric string (of digits 0 to 9), e.g., "0", "12345". It does not match with "" (empty string), "abc", "a123", etc
Example 2: Integer Literal /[1-9][0-9]*|0/

Example 3: An Identifier (or Name) /[a-zA-Z_][a-zA-Z_]*/

Example 4: An Image Filename \^\S+\.(gif|png|jpg|jpeg)$\i.

  1.  non-whitespaces\S+ matches one or more non-whitespaces.
  2. escape code\. matches the . character. We need to use \. to represent . as . has special meaning in regexe. The \ is known as the escape code, which restore the original literal meaning of the following character.
  3. or: (gif|png|jpg|jpeg) matches either "gif", "png", "jpg" or "jpeg". The | means "or".
  4. case-insensitive: The modifier i after the regexe specifies case-insensitive matching. That is, it accepts "test.GIF" and "test.Gif".

Example 5: Email Address /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/
  1. character: \w+ matches 1 or more word characters (a-z, A-Z, 0-9 and underscore).
  2. []:[\.-] matches character . or -
  3. ?: [\.-]? matches 0 or 1 occurrence of [\.-]. [0, 1]
  4. *: ([\.-]?\w+)* matches 0 or more occurrences of [\.-]?\w+.
  5. sub-expression: The sub-expression \w+([\.-]?\w+)* is used to match the username in the email, before the @ sign. 
  6. @ itself: The @ matches itself.
  7. {2,3}: The sub-expression \.\w{2,3} matches a . followed by two or three word characters, e.g., ".com", ".edu", ".us", ".uk", ".co".
  8. +: (\.\w{2,3})+ specifies that the above sub-expression shall occur one or more times, e.g., ".com", ".co.uk", ".edu.sg" etc. [1, +oo)
Exercise: Interpret this regexe, which provide another representation of email address: /^[\w\-\.\+]+\@[a-zA-Z0-9\.\-]+\.[a-zA-z0-9]{2,4}$/.
Example 6: Swapping the First two words using Back-References /^(\S+)\s+(\S+)$/
  1. Whitespace, non-whitespace: \S+ (uppercase S) matches one or more non-whitespaces; \s+ (lowercase s) matches one or more whitespaces (blank, tab and newline). In regexe, the uppercase metacharacter denotes the inverse of the lowercase counterpart. For example, \s for whitespace and \S for non-whitespace.
  2. (...): The parentheses in (\S+), called parenthesized back-reference, is used to extract the matched sub-string from the input string. In this regexe, there are two (\S+), match the first two words, separated by one or more whitespaces \s+. The two matched words are extracted from the input string and kept in special variables $1 and $2 respectively.
  3. To reverse the first two words, you can access the special variables, and print "$2 $1" (via a programming language).
Example 7: HTTP Address /^http:\/\/\S+(\/\S+)*(\/)?$/
[TODO] more

Regexe Syntax

It consists of a sequence of characters, meta-characters (such as .\d\s, \S) and operators (such as +*?|^)
For examples,
RegexeMatches
/^[0-9]+$/A numeric string with at least one digits
/^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/A valid email address

Sub-Expressions 

A regexe is constructed by combining many smaller sub-expressions

OR ('|') Operator


 vertical bar '|'



Bracket [ ] and Range [ - ] Expressions


bracket expression is a list of characters enclosed by [ ], also called character class. It matches any single character in the list. However, if the first character of the list is the caret (^), then it matches any single character NOT in the list. For example, the regexes [02468] matches a single digit 0, 2, 4, 6, or 8; the regexes [^02468] matches any single character other than 0, 2, 4, 6, or 8.

Instead of listing all characters, you could use a range expression inside the bracket. A range expression consists of 2 characters separated by a hyphen (-). It matches any single character that sorts between the two characters, inclusive. For example, [a-d] is the same as [abcd]. You could include a caret (^) in front of the range to invert the matching. For example, [^a-d] is equivalent to [^abcd].

Special characters: To include a ], place it first in the list. To include a ^, place it anywhere but first. Finally, to include a - place it last. Most metacharacters (such as +*) lose their special meaning inside bracketed lists.
[]^+*-]

Occurrence Indicators (aka Repetition Operators): +*?{}

A regexe sub-expression may be followed by an occurrence indicator (aka repetition operator):
  • ?: [0, 1] boolean
  • *: [0, +oo) wildcard
  • +: [1, +oo) positive 
  • {m}: [m, m] number range 
  • {m,}: [m, +oo)
  • {m,n}: [m, n]

Metacharacters .\w\W\d\D\s\S\xnn\onn

metacharacter is a symbol with a special meaning inside a regexe.
  • dot (.): The metacharacter dot (.) matches any single character except newline \n (same as [^\n]). For example, /.../ matches any 3 characters (including alphabets, numbers, white-spaces, but except newline); /the../ matches "there", "these", "the " and etc.
  • \w: \w (word character) matches any single letter, number or underscore (same as [a-zA-Z0-9_]). 
  • \W: The uppercase counterpart \W (non-word-character) matches any single character that doesn't match by \w (same as [^a-zA-Z0-9_]). In regexe, the uppercase metacharacter is always the inverse of the lowercase counterpart.
  • \d: \d (digit) matches any single digit (same as [0-9]). 
  • \D: The uppercase counterpart \D (non-digit) matches any single character that is not a digit (same as [^0-9]).
  • \s (space) matches any single whitespace (same as [\t\n\r\f ]). The uppercase counterpart \S (non-space) matches any single character that doesn't match by \s (same as[^\t\n\r\f ]).
  • \xnn: \xnn matches hexadecimal number nn
  • \onn: and \onn matches octal number nn.
Examples:
/\s\s/      # Matches two whitespaces
/\S\S\s/    # Two non-whitespaces followed by a whitespace
/\s+/       # one or more whitespaces
/\S+\s\S+/  # two words (non-whitespaces) separated by a whitespace

Positional Metacharacters (aka Position Anchors) ^$\b\B\<\>\A\Z


Positional anchors do NOT match actual character but matches position in a string, such as begin-of-line, end-of-line, begin-of-word, and end-of-word.

  • The metacharacter caret (^) matches the beginning of the line (or input string); and the metacharacter dollar ($) matches the end of line, excluding newline (or end of input string). These two are the most commonly used position anchors.
  • The metacharacter \b matches the the edge of a word (i.e., word boundary after a whitespace); and \B matches a string provided it's not at the edge of a word. For example,/\bcat\b/ matches the word "cat" (excluding the leading and trailing whitespaces).
  • The metacharacters \< and \> match the empty string at the beginning and the end of a word respectively (compared with \b, which can match both the beginning and the end of a word).
  • \A matches the actual beginning of the line, just like ^\Z matches the actual end of the line, just like $. However, \A and \Z match only the actual beginning and ending, and ignore the embedded newlines within the string.

You can use positional anchors liberally to increase the speed of matching. For examples:

/ing$/           # ending with 'ing'
/^testing 123$/  # Matches only one pattern. Should use equality comparison instead.

Parenthesized Back-References & Matched Variables $1,... , $9


Parentheses ( ) serve two purposes in regexes.

  1. group sub-expressions: Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, /(a|e|i|o|u){3,5}/ is the same as /a{3,5}|e{3,5}|i{3,5}|o{3,5}|u{3,5}/.
  2. back-reference: Secondly, parentheses are used to provide the so called back-references. A back-reference contains the matched sub-string. For examples, the regexe /(\S+)/ creates one back-reference (\S+), which contains the first word (consecutive non-spaces) of the input string; the regexe /(\S+)\s+(\S+)/ creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.

The back-references are stored in special variables $1$2, …, $9, where $1 contains the substring matched the first pair of parentheses, and so on. For example, /(\S+)\s+(\S+)/creates two back-references which matched with the first two words. The matched words are stored in $1 and $2, respectively.

Back-references are important if you wish to manipulate the string.
For example, the following Perl expression swap the first and second words separate by a space:
s/(\S+) (\S+)/$2 $1/;   # Swap the first and second words separated by a single space

Modifier

You can attach modifiers after a regexe, in the form of /.../modifiers, to control the matching behavior. For example,
  • i: case insensitive matching. The default is case-sensitive.
  • g: global matching, i.e., search for all the matches. The default searches only the first match.

Regexe in Programming Languages

No comments:

Post a Comment