Regular expressions: What are they and how to begin?

Among all the powerful tools available to a developer, regular expressions surely occupy a prominent position. Their flexibility in both searching and validating strings and data makes them irreplaceable for web development (and not only there). Unfortunately, this flexibility comes at a high price: regular expressions are always surrounded by an aura of mystery. A layman who reads a line of a regex (as they are often called) will recognize only sets of characters and symbols separated by parentheses, apparently without meaning. This cryptic syntax is the main reason why studying these tools is so often postponed.

In this series of articles we will see how to dispel this myth, by examining step by step this fascinating language. We start with the basics.

First of all, let's start from a simple definition, with no claim to being “academic”: a regular expression is a pattern, written in a dedicated language, that lets you search for a word or phrase within a text, or check that a string complies with a certain format.

In everyday life we associate, without realizing it, a certain data format with a very specific kind of information: three two-digit numbers separated by hyphens make us think of a date (16-07-86); letters followed by an at sign (@), more letters, a dot and still more letters remind us of an email address (example1@example.exl); and so on.

Regular expressions do nothing but translate these patterns from natural language (three two-digit numbers separated by hyphens) into a language a machine can understand: \d{2}-\d{2}-\d{2}.
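As a minimal sketch of the idea, here is the date pattern (two digits, hyphen, two digits, hyphen, two digits) used from Python's standard re module. The choice of Python and the helper name looks_like_date are assumptions made for illustration; the article itself tests its expressions in an online tool.

```python
import re

# Two digits, a hyphen, two digits, a hyphen, two digits.
DATE_PATTERN = re.compile(r"\d{2}-\d{2}-\d{2}")


def looks_like_date(text: str) -> bool:
    """Return True when the whole string has the dd-dd-dd shape (hypothetical helper)."""
    return DATE_PATTERN.fullmatch(text) is not None


print(looks_like_date("16-07-86"))  # True
print(looks_like_date("16/07/86"))  # False: slashes instead of hyphens
```

Note that the pattern only checks the shape of the string: it does not verify that the numbers form a real calendar date.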

Before we examine the individual elements of regular expressions, we need a way to test them. In this respect the regex language is rather peculiar, because it can only be used inside another language (Perl, PHP, JavaScript, and others) or program (OpenOffice, to search text, for example).

To test the regular expressions proposed in this article, we will use a text from “Alice in Wonderland” and regexpal, an online tool developed by Steven Levithan whose purpose is precisely to test regular expressions: you type the expression at the top, and every matching piece of the text entered in the main area is highlighted.

The basis of regex: the single character

The simplest regular expression is a single character: the letter ‘g’, for example, is the regular expression to use to find all occurrences of ‘g’ within a text.

Obviously this base case can be extended to a sequence of characters: the regular expression ‘had’ highlights all occurrences of those three letters in the text.

If you’ve tried this last example in regexpal, you may have noticed that the highlighting is not limited to the word ‘had’ as the past tense of the verb to have: the same three letters are also highlighted wherever they appear inside a longer word (‘shadow’, for example). This is because the engine reports every place where the pattern occurs, without caring about word boundaries. We will see later how to match only a given whole word.
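A short sketch can make this behaviour concrete: a literal pattern is found wherever its letters occur, including inside longer words. The example assumes Python's re module, and the sample sentence is invented for illustration (it is not taken from the Alice text).

```python
import re

# Invented sample sentence, not taken from the article's test text.
sample = "she had a shadow and hadn't noticed"

# Keep every word in which the literal pattern 'had' occurs somewhere.
words_with_match = [word for word in sample.split() if re.search("had", word)]

print(words_with_match)  # ['had', 'shadow', "hadn't"]
```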

Wildcards

As mentioned, each character is a regular expression that matches itself. There are exceptions: the following characters have a special meaning (they are usually called metacharacters):

  1. The asterisk: *
  2. The plus sign: +
  3. The question mark: ?
  4. Parentheses: ( )
  5. The square brackets: [ ]
  6. The braces: { }
  7. The dot: .
  8. The pipe: |
  9. The caret: ^
  10. The dollar sign: $
  11. The backslash: \

The asterisk (*)

In the language of regular expressions the asterisk is placed after a character (or group of characters) and means “match zero or more occurrences of what precedes it, as many as possible”. Let’s look at an example. The regular expression:

pa*

which, translated into words, is “find all occurrences of ‘p’ followed by zero or more ‘a’”, will match all the following lines:

p //because 'a' is optional
pa //'a' is present only one time
paa //two occurrences
paaa //three occurrences
paaaa //four occurrences
........and so on
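The list above can be checked with a few lines of code. This is only a sketch, assuming Python's re module; fullmatch is used so that each candidate string must be matched in its entirety.

```python
import re

STAR_PATTERN = re.compile(r"pa*")

candidates = ["p", "pa", "paa", "paaa", "paaaa"]

# fullmatch succeeds only when the pattern covers the whole string.
star_results = {text: STAR_PATTERN.fullmatch(text) is not None for text in candidates}

for text, matched in star_results.items():
    print(text, matched)  # every candidate prints True
```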

The important thing to understand is that the asterisk (like most of the wildcards) only affects the character that immediately precedes it: for it to act on a group of characters you must use parentheses. The regular expression:

H(ome)*

or “all occurrences of a capital ‘H’ followed by zero or more repetitions of the group ‘ome’”, will match the first three of the following strings in full, but not the whole of the last one:

H
Home
Homeome
Homeomememe //only 'Homeome' is matched: 'meme' is not a repetition of 'ome'
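To see how far the grouped pattern reaches into each string, the following sketch (again assuming Python's re module, with a hypothetical helper called matched_part) prints the portion matched from the start of the text. It also contrasts the grouped form with Home*, where the asterisk acts on the final ‘e’ only.

```python
import re

GROUP_PATTERN = re.compile(r"H(ome)*")   # asterisk applies to the group 'ome'
NO_GROUP_PATTERN = re.compile(r"Home*")  # asterisk applies to the last 'e' only


def matched_part(text: str):
    """Return the part of text matched from its start, or None (hypothetical helper)."""
    found = GROUP_PATTERN.match(text)
    return found.group(0) if found else None


print(matched_part("Homeome"))      # Homeome
print(matched_part("Homeomememe"))  # Homeome (the trailing 'meme' is left out)

# Without parentheses only the 'e' may repeat.
print(NO_GROUP_PATTERN.fullmatch("Homeee") is not None)   # True
print(NO_GROUP_PATTERN.fullmatch("Homeome") is not None)  # False
```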

The plus sign (+)

Very similar in function to the asterisk, the plus sign searches for “one or more occurrences of the character that precedes it”. The difference from the asterisk is that the character is no longer optional: it must be present at least once. Changing the example shown above:

pa+

this regular expression reads as: “find all occurrences of the letter ‘p’ followed by at least one lowercase letter ‘a’”. In this case the string ‘p’ is no longer considered a match (as it was with the asterisk). The same considerations about using parentheses to match groups of characters apply to this wildcard.
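The difference between the two wildcards shows up on the shortest string. A small sketch, assuming Python's re module and an invented sample text:

```python
import re

PLUS_PATTERN = re.compile(r"pa+")
STAR_VARIANT = re.compile(r"pa*")

# A lone 'p' satisfies pa* but not pa+.
print(STAR_VARIANT.fullmatch("p") is not None)  # True
print(PLUS_PATTERN.fullmatch("p") is not None)  # False

# Invented sample text: the isolated 'p' is skipped by pa+.
plus_matches = PLUS_PATTERN.findall("p pa paaa")
print(plus_matches)  # ['pa', 'paaa']
```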

The question mark (?)

This wildcard makes the character that precedes it optional: the character may be present once or not at all, but never more than once. It is often used to find words that have more than one accepted spelling, for example the regex:

obb?iettivo

means “look for the characters ‘ob’, possibly followed by one more ‘b’, and then the characters ‘iettivo’”; in other words, it finds all occurrences of both ‘obbiettivo’ and ‘obiettivo’ (two accepted Italian spellings of the word for ‘objective’). As stated previously, we can use parentheses to group multiple characters and follow the group with the question mark:

(Pes)?care //matches "Pescare" and "care"
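Both examples can be verified with a short sketch, assuming Python's re module. The extra test string with three b's is an addition made here to show the “at most once” rule.

```python
import re

OPTIONAL_LETTER = re.compile(r"obb?iettivo")
OPTIONAL_GROUP = re.compile(r"(Pes)?care")

# The second 'b' may appear once or not at all.
print(OPTIONAL_LETTER.fullmatch("obiettivo") is not None)    # True
print(OPTIONAL_LETTER.fullmatch("obbiettivo") is not None)   # True
print(OPTIONAL_LETTER.fullmatch("obbbiettivo") is not None)  # False: one 'b' too many

# The whole group 'Pes' is optional.
print(OPTIONAL_GROUP.fullmatch("Pescare") is not None)  # True
print(OPTIONAL_GROUP.fullmatch("care") is not None)     # True
```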

The dot (.)

The dot wildcard is a regular expression that can be summarized as “any character”. Whether it is a space, a number, a symbol or a letter, the dot always matches it (the one usual exception is the line break, which most engines exclude by default). Its use is often discouraged, because it easily matches more than you intended and, combined with repetition, can make the regular expression engine do a lot of extra work; but it can still be useful. For example:

A....

This regular expression searches for all occurrences of the letter ‘A’ followed by any four characters (punctuation, spaces, numbers, letters), so possible matches are the following:

A dor //A-white space-letter-letter-letter
Alice //A-letter-letter-letter-letter
A1234 //A-number-number-number-number
A!"2e //A-symbol-symbol-number-letter
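The four sample strings can be checked as follows. This is a sketch assuming Python's re module; the last two test strings are additions made here, one that is too short and one containing a line break, which the dot does not match in Python unless the re.DOTALL flag is set.

```python
import re

DOT_PATTERN = re.compile(r"A....")

samples = ["A dor", "Alice", "A1234", 'A!"2e', "Ali", "A\nbcd"]

# Map each sample to whether the pattern covers it entirely.
dot_results = {text: DOT_PATTERN.fullmatch(text) is not None for text in samples}

for text, matched in dot_results.items():
    print(repr(text), matched)
```

The first four strings match; "Ali" is too short, and "A\nbcd" fails because of the line break.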

Very often, though, you want to search for a letter followed only by other letters, or for strings consisting only of numbers. For example: how do we find all occurrences of the letter ‘A’ followed by four letters, and nothing but letters? To do this we will use character classes.

Brackets: the character classes []

Square brackets let you list a set of characters: the class matches a single character, which can be any one of those listed between the brackets. For example, the regular expression:

[Cc]

will find every occurrence of the letter C, whether uppercase or lowercase. We can use it to make a search independent of the case of a given letter:

[Cc]asa //matches both Casa and casa
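A quick sketch of the class in action, assuming Python's re module and an invented sample phrase:

```python
import re

CASE_PATTERN = re.compile(r"[Cc]asa")

# Invented sample phrase containing both spellings.
class_matches = CASE_PATTERN.findall("Casa dolce casa")

print(class_matches)  # ['Casa', 'casa']
```

Only the first letter is affected: a string such as "CASA" would not match, because the class covers a single position.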

At this point, the solution to our example problem is fairly trivial: after the letter ‘A’, write a character class that lists all the possible letters, and repeat that class four times:

//Any letter (uppercase or lowercase)
A[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]... //the same class, four times

As you can see, this notation is verbose and prone to typing errors. To remedy this you can use an abbreviation: a hyphen (-) between two letters means “all the characters from the first letter to the second”. For example, [A-Z] means “from capital A to capital Z, including all the letters in between”. This way, however, we are considering only capital letters.

To add the lowercase letters, we use the same technique: [A-Za-z]. So our example, “find all occurrences of the letter ‘A’ followed by four letters”, becomes:

//A-any letter-any letter-any letter-any letter
A[A-Za-z][A-Za-z][A-Za-z][A-Za-z]
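To confirm that the range notation covers exactly the 52 unaccented letters and nothing else, the following sketch (assuming Python's re and string modules) tests every printable ASCII character against the class, then tries the long form of the example.

```python
import re
import string

LETTER_CLASS = re.compile(r"[A-Za-z]")

# Collect every printable ASCII character accepted by the class.
accepted = {ch for ch in string.printable if LETTER_CLASS.fullmatch(ch)}

print(accepted == set(string.ascii_letters))  # True

LONG_FORM = re.compile(r"A[A-Za-z][A-Za-z][A-Za-z][A-Za-z]")
print(LONG_FORM.fullmatch("Alice") is not None)  # True
print(LONG_FORM.fullmatch("A1234") is not None)  # False: digits are not letters
```

Accented letters are not part of these ranges, so a word containing them would need a wider class.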

This notation is still too repetitive: it would be useful to have a way to specify how many times a character (or group of characters) should be repeated.

Repeat a character: braces { }

To specify how many times a character must be repeated, put the number in braces right after the character (or group of characters). For example:

A{4} //finds exactly 4 consecutive uppercase A's

The only string that satisfies this regular expression in full is the one made of exactly four uppercase A's:

AAAA //yes
AAA //no
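The yes/no list can be reproduced with a brief sketch, assuming Python's re module. The five-letter case is an addition made here: it shows that “exactly four” applies to a single match, so a longer run of A's still contains a match when you search inside it.

```python
import re

FOUR_A = re.compile(r"A{4}")

print(FOUR_A.fullmatch("AAAA") is not None)  # True
print(FOUR_A.fullmatch("AAA") is not None)   # False: one 'A' short

# Searching inside a longer run still finds four consecutive A's.
inner = FOUR_A.search("AAAAA")
print(inner.group(0))  # AAAA
```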

At this point we can solve our example problem:

A[A-Za-z]{4}

which, translated into words, could be rendered as: “look for a capital A followed by a letter, either uppercase or lowercase, and repeat this class of characters four times”. We have obtained a compact notation that does exactly what we set out to do.
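As a final check, here is the compact expression applied to a short text. This is a sketch assuming Python's re module; the sample sentence is invented for illustration.

```python
import re

COMPACT = re.compile(r"A[A-Za-z]{4}")

# Invented sample sentence.
sentence = "Alice met Anna and Adams at 4 AM"

compact_matches = COMPACT.findall(sentence)
print(compact_matches)  # ['Alice', 'Adams']
```

‘Anna’ and ‘AM’ are skipped because fewer than four letters follow the capital A. Keep in mind that the pattern would also match the first five letters of a longer word, since nothing in it says where the word must end.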

Conclusions

In this article we introduced the basics needed to understand regular expressions: it is important to understand and assimilate these concepts, because learning regex is incremental. All constructs (from simple to complex) work in more or less the same way, so once you understand the mechanism of one, you can easily use the others.

Practice certainly plays a key role in this field, as does using the right method to build a regex. Before I start writing an expression, I find it handy to write down all the possible variations of the string I am looking for, perhaps with the help of a simple diagram. A clear picture of the data you are dealing with is essential for using regular expressions.

In the next article we will examine other key constructs and see some “real world” applications. See you next time!

Summary of lessons

  • The Basics
  • The advanced constructs and models
  • Actual use of regular expressions


The Author

Fond of web design, takes delight in creating (X)HTML+CSS layouts. A maniac of polished and tidy codes, the type of person you find in your house straightening the paintings hanging on the wall. He has made his mind of becoming a web designer with a capital “w”, and spends entire nights awake in order to make his dream come true.
