Regular-Expression Notes From Everywhere

January 5th, 2010.
Filed under Programming.

Knowing how to use regular expressions is important. Regular expressions help me solve problems both large and small.

There is nothing magic about regular expressions, just as there is nothing magic about a magician’s tricks. The magician understands something simple that does not appear simple or natural to the untrained audience.

The special characters (like the * wildcard in a filename pattern such as *.txt) are called metacharacters, while the rest are called literal, or normal text, characters.

Start and End of the line
The metacharacters ^ (caret) and $ (dollar) represent the start and end of the line of text as it is being checked. ^Millad matches if you have the beginning of a line, followed immediately by M, followed immediately by i, followed immediately by l, and so on.
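A minimal sketch of the two anchors, using Python's re module rather than any particular tool from these notes. The sample lines and the ^cat$ pattern are made up for illustration.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["Millad was here", "Say hi to Millad", "cat", "concatenate"]

# ^Millad: the line must begin with M, i, l, l, a, d.
starts_with_name = [s for s in lines if re.search(r"^Millad", s)]

# ^cat$: start of line, then c, a, t, then end of line.
exactly_cat = [s for s in lines if re.search(r"^cat$", s)]

print(starts_with_name)  # ['Millad was here']
print(exactly_cat)       # ['cat']
```

Note that "Say hi to Millad" is rejected by ^Millad even though it contains the name, because the name is not at the start of the line.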

Matching any one of many characters
Let’s say you want to search for “grey,” but also want to find it if it were spelled “gray.” You can use gr[ea]y, which means: find “g, followed by r, followed by either an e or an a, all followed by y.”

The contents of a class [] are a list of characters that can match at that point, so the implication is “or”.

As another example, you might want to allow capitalization of a word’s first letter, as with [Ss]mith. Remember that this still matches lines that contain smith (or Smith) embedded within another word, such as blacksmith.
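The two classes above can be tried out with a short Python sketch. The test words are assumptions picked to show one hit and one miss for each pattern.

```python
import re

grey = re.compile(r"gr[ea]y")
smith = re.compile(r"[Ss]mith")

# Only e or a is allowed between "gr" and "y".
grey_hits = [w for w in ["grey", "gray", "groy"] if grey.search(w)]

# search() looks anywhere in the text, so "blacksmith" is found too.
smith_hits = [w for w in ["Smith", "smith", "blacksmith", "SMITH"] if smith.search(w)]

print(grey_hits)   # ['grey', 'gray']
print(smith_hits)  # ['Smith', 'smith', 'blacksmith']
```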

You can list in the class as many characters as you like. For example, [123456] matches any of the listed digits. This particular class might be useful as part of <H[123456]>, which matches <H1>, <H2>, <H3>, etc. This can be useful when searching for HTML headers.

Within a character class, the character-class metacharacter ‘-’ (dash) indicates a range of characters: <H[1-6]> is identical to the previous example. [0-9] and [a-z] are common shorthands for classes to match digits and English lowercase letters, respectively.

Multiple ranges are fine, so [0123456789abcdefABCDEF] can be written as [0-9a-fA-F] (or, perhaps, [A-Fa-f0-9], since the order in which ranges are given doesn’t matter).

These last three examples can be useful when processing hexadecimal numbers. You can freely combine ranges with literal characters: [0-9A-Z_!.?] matches a digit, uppercase letter, underscore, exclamation point, period, or a question mark.
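As a rough check of the claims about ranges, here is a small Python sketch. The sample strings are invented for the purpose.

```python
import re

# A range is only shorthand: [1-6] and [123456] accept the same characters.
digits = "0123456789"
listed = re.findall(r"[123456]", digits)
ranged = re.findall(r"[1-6]", digits)

# The order in which ranges are given does not matter.
sample = "0x1F, 0xbeef, zz"
hex_a = re.findall(r"[0-9a-fA-F]", sample)
hex_b = re.findall(r"[A-Fa-f0-9]", sample)

# Ranges mixed with literals: inside a class, . and ? are ordinary characters.
mixed = re.findall(r"[0-9A-Z_!.?]", "Ok_7? no.")

print(listed == ranged)  # True
print(hex_a == hex_b)    # True
print(mixed)             # ['O', '_', '7', '?', '.']
```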

Ignoring Characters In Classes

[^1-6] matches a character that’s not 1 through 6. The leading ^ in the class “negates” the list, so rather than listing the characters you want to include in the class, you list the characters you don’t want to be included.

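You might have noticed that the ^ used here is the same character as the start-of-line caret. The character is the same, but the meaning is completely different: as the first character inside a class, ^ negates the class instead of anchoring the match to the start of the line.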

Here’s what happened when I searched a word list file with egrep, using the expression ‘q[^u]’:

Iraqi
Iraqian
miqra
qasida
qintar
qoph

Two notable words not listed are “Qantas”, the Australian airline, and “Iraq”, although both words are in the word list file.

Qantas didn’t match because the regular expression called for a lowercase q, whereas the Q in Qantas is uppercase. Had we used Q[^u] instead, we would have found it.

The Iraq example is somewhat of a trick question. The regular expression calls for q followed by a character that’s not u, which precludes matching q at the end of the line. Lines generally have newline characters at the very end, but a little fact I neglected to mention (sorry!) is that egrep strips those before checking with the regular expression, so after a line-ending q, there’s no non-u to be matched.
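The behaviour described above can be reproduced with a rough Python sketch. This is not egrep itself: the short word list is a hypothetical stand-in for the real file, and each word is checked without its trailing newline, as the text says egrep does.

```python
import re

# Hypothetical stand-in for the word list file; only a few entries.
words = ["Iraq", "Iraqi", "Qantas", "qasida", "quit", "miqra"]

# q[^u] needs a lowercase q AND one more character that is not u.
lower = [w for w in words if re.search(r"q[^u]", w)]

# Q[^u] asks for an uppercase Q instead.
upper = [w for w in words if re.search(r"Q[^u]", w)]

# If the newline were left on the line, [^u] could match it.
iraq_with_newline = bool(re.search(r"q[^u]", "Iraq\n"))

print(lower)              # ['Iraqi', 'qasida', 'miqra']
print(upper)              # ['Qantas']
print(iraq_with_newline)  # True
```

The last line shows why stripping the newline matters: a negated class matches any character that is not listed, and a newline is such a character.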

Matching Any Character with Dot
The metacharacter . (usually called dot or point) is shorthand for a character class that matches any character. It can be convenient when you want an “any character here” placeholder in your expression. For example, if you want to search for a date such as 03/19/76, 03-19-76, or even 03.19.76, you could go to the trouble of constructing a regular expression that uses character classes to explicitly allow ‘/’, ‘-’, or ‘.’ between each number, such as 03[-./]19[-./]76. However, you might also try simply using 03.19.76. The two are not equivalent: in 03.19.76 each dot is a metacharacter that matches any character at all, so the expression is shorter but less precise, whereas in 03[-./]19[-./]76 the dots are not metacharacters because they are within a character class [].
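To make the difference between the two date expressions concrete, here is a small Python sketch. The last two test strings are invented to show what the dot version lets through.

```python
import re

dates = ["03/19/76", "03-19-76", "03.19.76", "03x19y76", "0311917698"]

strict = re.compile(r"03[-./]19[-./]76")  # only -, . or / between the numbers
loose = re.compile(r"03.19.76")           # any character between the numbers

strict_hits = [d for d in dates if strict.search(d)]
loose_hits = [d for d in dates if loose.search(d)]

print(strict_hits)  # ['03/19/76', '03-19-76', '03.19.76']
print(loose_hits)   # all five strings
```

The string 0311917698 matches the loose pattern because the dots happily match the digit 1 in both positions.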

Matching any one of several subexpressions
A very convenient metacharacter is |, which means “or.” It allows you to combine multiple expressions into a single expression that matches any of the individual ones. For example, Bob and Robert are separate expressions, but Bob|Robert is one expression that matches either. When combined this way, the subexpressions are called alternatives.

Looking back to our gr[ea]y example, it can be written as grey|gray, and even gr(a|e)y. The latter case uses parentheses to constrain the alternation. (For the record, parentheses are metacharacters too.) Note that something like gr[a|e]y is not what we want: within a class, the ‘|’ character is just a normal character, like a and e.

With gr(a|e)y, the parentheses are required because without them, gra|ey means “gra or ey,” which is not what we want here. Alternation reaches far, but not beyond parentheses. Another example is (First|1st) [Ss]treet. Actually, since both First and 1st end with st, the combination can be shortened to (Fir|1)st [Ss]treet. That’s not necessarily quite as easy to read, but be sure to understand that (First|1st) and (Fir|1)st effectively mean the same thing.
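The effect of the parentheses is easiest to see when the whole string has to match. The sketch below uses Python's re.fullmatch for that reason, which is an assumption of this example (the text itself talks about searching within lines). The test words are invented, and the street patterns are written with a plain space between the words.

```python
import re

words = ["grey", "gray", "gra", "ey", "gr|y"]

def full_matches(pattern):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

grouped = full_matches(r"gr(a|e)y")   # alternation limited to a or e
ungrouped = full_matches(r"gra|ey")   # means "gra" or "ey"
in_class = full_matches(r"gr[a|e]y")  # | is a literal character in a class

streets = ["First Street", "1st street", "Firstreet"]
long_form = [s for s in streets if re.fullmatch(r"(First|1st) [Ss]treet", s)]
short_form = [s for s in streets if re.fullmatch(r"(Fir|1)st [Ss]treet", s)]

print(grouped)    # ['grey', 'gray']
print(ungrouped)  # ['gra', 'ey']
print(in_class)   # ['grey', 'gray', 'gr|y']
print(long_form == short_form)  # True
```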

Here’s an example involving an alternate spelling of my name. Compare and contrast the following three expressions, which are all effectively the same:

Jeffrey|Jeffery

Jeff(rey|ery)

Jeff(re|er)y

To have them match the British spellings as well, they could be:

(Geoff|Jeff)(rey|ery)

(Geo|Je)ff(rey|ery)

(Geo|Je)ff(re|er)y
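A quick way to convince yourself that the expressions in each group are equivalent is to run them over the same names. This is only a sketch in Python, and "Jeffry" is an extra, made-up spelling added as a negative case.

```python
import re

names = ["Jeffrey", "Jeffery", "Geoffrey", "Geoffery", "Jeffry"]

american = [r"Jeffrey|Jeffery", r"Jeff(rey|ery)", r"Jeff(re|er)y"]
british = [r"(Geoff|Jeff)(rey|ery)", r"(Geo|Je)ff(rey|ery)", r"(Geo|Je)ff(re|er)y"]

def accepted(pattern):
    """Return the names that the pattern matches in their entirety."""
    return [n for n in names if re.fullmatch(pattern, n)]

american_results = [accepted(p) for p in american]
british_results = [accepted(p) for p in british]

print(american_results[0])  # ['Jeffrey', 'Jeffery']
print(british_results[0])   # ['Jeffrey', 'Jeffery', 'Geoffrey', 'Geoffery']
```

Every pattern within a group returns the same list, which is what "effectively the same" means here.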

Optional Items
Consider matching color or colour. Since they are the same except that one has a u and the other doesn’t, we can use colou?r to match either. The metacharacter ? (question mark) means optional. It is placed after the character that is allowed to appear at that point in the expression, but whose existence isn’t actually required for the match to be considered successful.

Unlike other metacharacters we have seen so far, the question mark attaches only to the immediately preceding item. Thus, colou?r is interpreted as “c then o then l then o then u? then r.”

The u? part is always successful: sometimes it matches a u in the text, while other times it doesn’t. The whole point of the ?-optional part is that it’s successful either way. This isn’t to say that any regular expression that contains ? is always successful. For example, against ‘semicolon’, both colo and u? are successful (matching colo and nothing, respectively). However, the final r fails, and that’s what disallows semicolon, in the end, from being matched by colou?r.
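The semicolon case can be checked directly. A minimal Python sketch follows; "colouur" is an invented extra test word showing that ? allows at most one u.

```python
import re

pattern = re.compile(r"colou?r")

results = {w: bool(pattern.search(w))
           for w in ["color", "colour", "semicolon", "colouur"]}

print(results)
# {'color': True, 'colour': True, 'semicolon': False, 'colouur': False}
```

In "semicolon", colo matches and u? succeeds by matching nothing, but the next character is n rather than r, so the overall match fails.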

As another example, consider matching a date that represents July fourth, with the “July” part being either July or Jul, and the “fourth” part being fourth, 4th, or simply 4. Of course, we could just use (July|Jul) (fourth|4th|4), but let’s explore other ways to express the same thing.

First, we can shorten the (July|Jul) to (July?). Do you see how they are effectively the same? The removal of the | means that the parentheses are no longer really needed. Leaving the parentheses doesn’t hurt, but with them removed, July? is a bit less cluttered. This leaves us with July? (fourth|4th|4).

Moving now to the second half, we can simplify the 4th|4 to 4(th)?. As you can see, ? can attach to a parenthesized expression. Inside the parentheses can be as complex a subexpression as you like, but “from the outside” it is considered a single unit. Grouping for ? (and other similar metacharacters which I’ll introduce momentarily) is one of the main uses of parentheses.
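The successive simplifications can be compared side by side. The Python sketch below assumes a plain space between the month and the day, and "Ju 4" is an invented negative case.

```python
import re

dates = ["July fourth", "Jul 4th", "July 4", "Jul fourth", "Jul 4", "Ju 4"]

patterns = [
    r"(July|Jul) (fourth|4th|4)",  # the starting point
    r"July? (fourth|4th|4)",       # the y made optional
    r"July? (fourth|4(th)?)",      # ? applied to the group (th)
]

def accepted(pattern):
    """Return the dates that the pattern matches in their entirety."""
    return [d for d in dates if re.fullmatch(pattern, d)]

results = [accepted(p) for p in patterns]

print(results[0])
# ['July fourth', 'Jul 4th', 'July 4', 'Jul fourth', 'Jul 4']
print(results[0] == results[1] == results[2])  # True
```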

Other Quantifiers: Repetition
Similar to the question mark are + (plus) and * (an asterisk, but as a regular-expression metacharacter, I prefer the term star). The metacharacter + means “one or more of the immediately-preceding item,” and * means “any number, including none, of the item.” Phrased differently, ⋯* means “try to match it as many times as possible, but it’s OK to settle for nothing if need be.” The construct with plus, ⋯+, is similar in that it also tries to match as many times as possible, but different in that it fails if it can’t match at least once. These three metacharacters, question mark, plus, and star, are called quantifiers because they influence the quantity of what they govern.
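The three quantifiers can be compared on one tiny example. The pattern a, some number of b, then c is made up for this sketch and is written in Python.

```python
import re

samples = ["ac", "abc", "abbbc"]

def accepted(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

optional = accepted(r"ab?c")  # zero or one b
plus = accepted(r"ab+c")      # one or more b
star = accepted(r"ab*c")      # any number of b, including none

# Both + and * take as many repetitions as they can get.
longest = re.search(r"b+", "abbbc").group()

# * is satisfied with nothing: here it matches the empty string.
nothing = re.search(r"b*", "ac").group()

print(optional)  # ['ac', 'abc']
print(plus)      # ['abc', 'abbbc']
print(star)      # ['ac', 'abc', 'abbbc']
print(repr(longest), repr(nothing))  # 'bbb' ''
```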
