Regular Expressions: Advanced Constructs


Regular expressions owe much of their “bad” reputation to their obscure syntax: there are many rules and abbreviations, some symbols change meaning depending on where they appear, and some constructs look incomplete even though they work correctly.

In the previous lesson you saw the basic elements of this syntax. This article examines some more advanced uses of constructs already introduced and presents a few new ones. We will start with character classes.

Character classes: Part 2

In the previous article we talked about character classes: a set of characters enclosed in square brackets, any one of which the regex engine will accept at that position in the text. For example:

se[ae] //matches "see" and "sea"

We also said that the hyphen (-) can be used to indicate a range from one character to another, as in:

[A-Z] //all capital letters from A to Z
[0-9] //all numbers from 0 to 9
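
As a quick check of the two forms above, here is a minimal sketch using Python's built-in re module. The choice of Python and the sample strings are assumptions made for illustration only; they are not part of the original lesson.

```python
import re

# A character class matches exactly one character out of the listed set.
sentence = "I see the sea, not the set."
words = re.findall(r"se[ae]", sentence)
print(words)      # ['see', 'sea']

# A hyphen inside the brackets denotes a range of characters.
label = "Room B12, floor 3"
capitals = re.findall(r"[A-Z]", label)
digits = re.findall(r"[0-9]", label)
print(capitals)   # ['R', 'B']
print(digits)     # ['1', '2', '3']
```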

Because of the different locale settings found on some computers, however, these ranges are not portable. Under some settings the characters are ordered as “ABCD…Zabcd…z”, while under others they are ordered as “AaBbCc…Zz”. The regular expression [A-Z] might therefore behave as intended on some machines and fail on others.

To overcome this problem, the POSIX standard defines a number of character classes that are independent of the system they run on, allowing maximum portability. These classes have the form [:mnemonic_name:]: a bracket and a colon, followed by a name, followed in turn by a colon and a bracket. For example, one of the most used classes is [:alnum:], which matches letters (uppercase and lowercase) and digits. So:

[[:alnum:]] is equivalent to [A-Za-z0-9]

Note that to use the POSIX classes I needed two sets of brackets: the outer pair tells the regex engine that it is dealing with a character class, while the inner pair delimits the specific POSIX class. To use more than one POSIX class, we would write an expression like this:

[[:alpha:][:digit:]]asa //Aasa, Basa, ..., aasa, basa, ..., 1asa, 2asa, ...
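
Not every engine understands the POSIX bracket syntax. Python's re module, for instance, does not, so the sketch below spells out the ASCII equivalents by hand. The constants ALPHA, DIGIT and ALNUM are hypothetical helpers introduced only for this illustration, and the candidate strings are made up.

```python
import re

# Hypothetical helpers: ASCII equivalents of three POSIX classes,
# written out because Python's re module has no [[:name:]] syntax.
ALPHA = "A-Za-z"         # stands in for [:alpha:]
DIGIT = "0-9"            # stands in for [:digit:]
ALNUM = ALPHA + DIGIT    # stands in for [:alnum:]

# Equivalent of [[:alpha:][:digit:]]asa
pattern = re.compile(f"[{ALNUM}]asa")

candidates = ["Aasa", "basa", "1asa", "_asa", "asa"]
matches = [c for c in candidates if pattern.fullmatch(c)]
print(matches)  # ['Aasa', 'basa', '1asa']
```

In an engine that does support POSIX classes, the pattern can be written exactly as in the article.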

You can find a list of the main POSIX character classes in the summary table later in this article.

Negation in character classes

In the last article we saw how the caret (^) anchors a regular expression to the start of a line.

When the caret is used as the first character inside a character class, however, it takes on a different meaning: instead of matching the characters in the class, it negates the class and matches “all characters except those specified”. For example:

[abc]  //any of the letters 'a', 'b' or 'c'
[^abc] //all characters except 'a', 'b' or 'c'

In the first case the class narrows the search to the characters inside the brackets; in the second we decide which characters to exclude from the search. It is a convenient way to express in code the phrase “all characters except ...”.

Naturally, this construct also works with POSIX classes: to exclude all digits from our search we could write:

[^[:digit:]] //all characters except numbers
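
The effect of the caret inside the brackets can be seen in a small Python sketch. The sample string is an assumption, and [^0-9] is used as the ASCII stand-in for [^[:digit:]] because Python's re module does not accept POSIX class names.

```python
import re

text = "a1b2c3d4"

# [abc] keeps only the listed letters.
only_abc = re.findall(r"[abc]", text)
# [^abc] keeps every character except the listed letters.
not_abc = re.findall(r"[^abc]", text)
# [^0-9] plays the role of [^[:digit:]] here.
not_digits = re.findall(r"[^0-9]", text)

print(only_abc)    # ['a', 'b', 'c']
print(not_abc)     # ['1', '2', '3', 'd', '4']
print(not_digits)  # ['a', 'b', 'c', 'd']
```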

This construct is widely used because in about 70% of cases it can replace the dot (., which means “any character”), which is very expensive for the regex engine.
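
One common case where a negated class can stand in for the dot (., “any character”) is matching text up to a delimiter. The Python sketch below uses a made-up sample string and compares the two forms; it illustrates the difference in what gets matched and does not measure speed.

```python
import re

line = 'name="logo" alt="Site logo"'

# The greedy dot runs to the end of the line and then backtracks
# to the last quote, so both values end up in a single match.
with_dot = re.findall(r'".*"', line)

# The negated class cannot cross a quote, so each value is matched
# on its own without overshooting.
with_negation = re.findall(r'"[^"]*"', line)

print(with_dot)       # ['"logo" alt="Site logo"']
print(with_negation)  # ['"logo"', '"Site logo"']
```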

Shorthand classes

Some character classes are used so frequently that shorthand forms have been created for them:

  • \w: equivalent to class [:alnum:] plus the underscore (_)
  • \d: equivalent to class [:digit:]
  • \s: equivalent to class [:space:]

For convenience, shorthands have also been created for their negations, using capital letters:

  • \W: equivalent to class [^[:alnum:]_]
  • \D: equivalent to class [^[:digit:]]
  • \S: equivalent to class [^[:space:]]

Unlike the POSIX classes, however, these shorthands do not need the surrounding brackets. So we can write a regular expression like this:

A\w //an "A" followed by a word character (letter, digit or underscore)
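
The shorthands can be tried out with a short Python sketch. The sample string is made up, and the re.ASCII flag is an assumption added so that \w, \d and \s keep their ASCII meaning; without it, Python 3 also accepts Unicode letters and digits.

```python
import re

text = "Area 51, Apt_7 A-1"

# A\w : an "A" followed by one word character.
a_then_word_char = re.findall(r"A\w", text, flags=re.ASCII)
# \d : a single digit.
digits = re.findall(r"\d", text, flags=re.ASCII)
# \w+ : runs of word characters; note that the underscore is included.
words = re.findall(r"\w+", text, flags=re.ASCII)
# \S+ : runs of anything that is not whitespace.
chunks = re.findall(r"\S+", text, flags=re.ASCII)

print(a_then_word_char)  # ['Ar', 'Ap']
print(digits)            # ['5', '1', '7', '1']
print(words)             # ['Area', '51', 'Apt_7', 'A', '1']
print(chunks)            # ['Area', '51,', 'Apt_7', 'A-1']
```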

Below is a table summarizing the main character classes, with their POSIX names and shorthand abbreviations:

POSIX      Abbreviation  Character Class                         Description
[:alnum:]                [A-Za-z0-9]                             Alphanumeric characters
[:word:]   \w            [A-Za-z0-9_]                            Alphanumeric characters plus underscore
           \W            [^\w]                                   Non-word characters
[:alpha:]                [A-Za-z]                                Alphabetic characters
[:blank:]                [ \t]                                   Spaces and tabs
[:digit:]  \d            [0-9]                                   Numbers
           \D            [^\d]                                   Non-numeric characters
[:lower:]                [a-z]                                   Lowercase letters
[:punct:]                [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]    Punctuation characters
[:space:]  \s            [ \t\r\n\v\f]                           All whitespace characters
           \S            [^\s]                                   All characters except whitespace
[:upper:]                [A-Z]                                   Uppercase letters

Word boundaries (\b)

In the previous article we saw that a plain sequence of characters is a regular expression in its own right. If we search for ‘had’, the results will include the word ‘had’ itself, but also words like ‘hades’. Very often this behavior is what we want; at other times we need to search for the exact word, or for a word that starts with a given regular expression.

To do this we use the sequence ‘\b’. It can be placed on the left and/or on the right of a regex, where it means, respectively, “the word that starts with ...” and “the word that ends with ...”. Let's take an example. To find only the word ‘had’ in our text we could write:

\bhad //a word that starts with "had"

Trying this regex on the test text, it seems to work: all the right words are highlighted. If you add the word ‘hades’ to the text, though, you'll notice that it is matched too. We then add the right-hand boundary to our regular expression:

\bhad\b //the whole word 'had', with a boundary on both sides

Now the sequence of characters found is the word ‘had’ and no other, so our task is finished.
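
The three stages of this search can be reproduced with a minimal Python sketch. The sample sentence is an assumption, and \w* is appended to the first two patterns only so that the whole matched word is printed instead of just its first three letters.

```python
import re

text = "He had a map of hades and a shadow."

# No boundary: "had" is found anywhere, even inside "shadow".
anywhere = re.findall(r"had\w*", text)
# Left boundary only: words that start with "had".
starts_with = re.findall(r"\bhad\w*", text)
# Boundary on both sides: the exact word "had".
whole_word = re.findall(r"\bhad\b", text)

print(anywhere)     # ['had', 'hades', 'hadow']
print(starts_with)  # ['had', 'hades']
print(whole_word)   # ['had']
```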

We should be careful not to confuse the anchor ‘\b’ with the anchors ‘^’ and ‘$’: the first matches at the beginning and end of a word, while the other two match at the start and end of a line.

Specify the number of occurrences

In the last lesson we saw how braces placed to the right of an element of a regular expression let you specify how many times that element must occur. For example:

\d{2}-\d{2}-\d{4}

Let's analyze the regular expression just written. It uses the shorthand class ‘\d’, which stands for “any digit”; each class is followed by a number in curly brackets, which says how many times the element just before it must occur; and the three groups of digits are separated by two hyphens. In words: “Find all occurrences of two digits, followed by a hyphen, followed by two digits, followed by a hyphen, followed in turn by four digits”. This could therefore be a good starting point for validating a date entered in a form in the format dd-mm-yyyy, although it checks only the shape of the input, not whether the day and month are valid.
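
A form check built on this pattern might look like the following Python sketch. The function name looks_like_date and the sample inputs are hypothetical, and re.ASCII is an assumption added so that \d accepts only the digits 0 to 9.

```python
import re

# Two digits, a hyphen, two digits, a hyphen, four digits.
DATE_SHAPE = re.compile(r"\d{2}-\d{2}-\d{4}", flags=re.ASCII)


def looks_like_date(value: str) -> bool:
    """Return True when the whole string has the shape dd-mm-yyyy."""
    return DATE_SHAPE.fullmatch(value) is not None


print(looks_like_date("25-12-2010"))  # True
print(looks_like_date("5-12-2010"))   # False: the day has one digit
print(looks_like_date("99-99-9999"))  # True: right shape, impossible date
```

The last call shows why the pattern alone is not a full validator: a real form would still need to check the day and month ranges separately.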

Besides an exact count, the braces can be used in two other ways:

  • {min,max}: find at least ‘min’ occurrences, but no more than ‘max’.
  • {min,}: find at least ‘min’ occurrences, with no upper limit.

The first form sets both a lower and an upper limit on the number of occurrences:

a{2,4} //matches aa, aaa or aaaa

The second form sets only a lower limit:

a{3,} //matches aaa, aaaa, aaaaa, etc.
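
Both forms can be checked against a handful of sample strings. This is a small Python sketch with made-up inputs; fullmatch is used so that each string must match from start to end.

```python
import re

samples = ["a", "aa", "aaa", "aaaa", "aaaaa"]

# {2,4}: at least two and at most four occurrences.
between_2_and_4 = [s for s in samples if re.fullmatch(r"a{2,4}", s)]
# {3,}: at least three occurrences, no upper limit.
at_least_3 = [s for s in samples if re.fullmatch(r"a{3,}", s)]

print(between_2_and_4)  # ['aa', 'aaa', 'aaaa']
print(at_least_3)       # ['aaa', 'aaaa', 'aaaaa']

# When searching inside a longer run, {2,4} still finds a match,
# but the match stops after four characters.
longest = re.search(r"a{2,4}", "aaaaa").group()
print(longest)          # aaaa
```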

Conclusions

In this second part we took a closer look at some basic regular expression constructs. Knowing how these elements work and how they are written, however, is not enough to understand this tool and put it to active use. It takes plenty of practice, and the study of some existing patterns (perhaps written by others), to see how the various elements relate to each other. That is what we will do in the next article. Until next time!


The Author

Fond of web design, takes delight in creating (X)HTML+CSS layouts. A maniac of polished and tidy codes, the type of person you find in your house straightening the paintings hanging on the wall. He has made his mind of becoming a web designer with a capital “w”, and spends entire nights awake in order to make his dream come true.
