4.3. Using the RegExp Object

Regular expressions are the syntax you use to match and manipulate strings. If you’ve worked with a command prompt in Microsoft Windows, or the shell in Linux/Unix, you may have looked for files by trying to match all files using an asterisk, or star (*) character, as in:

dir *.*

or:

dir *.txt

If you’ve used a wildcard character such as the asterisk, you’ve used an element akin to a regular expression. In fact, the asterisk is also a character used in regular expressions.

Regular expressions, through use of the RegExp object and regular expression literals in JavaScript, provide a powerful way to work with strings of text or alphanumerics. The ECMA-262 implementation of regular expressions is largely borrowed from the Perl 5 regular expression parser. Here’s a regular expression to match the word JavaScript:

var myRegex = /JavaScript/;

The regular expression shown would match the string JavaScript anywhere that it appeared within another string. For example, the regular expression would match in the sentence “This is a book about JavaScript,” and it would match in the string “ThisIsAJavaScriptBook,” but it would not match “This is a book about javascript,” because regular expressions are case sensitive. (You can change this, as you’ll see later in this chapter.)
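Here is a minimal sketch of those three cases. It assumes the test method of the RegExp object, which is introduced later in this section, and it uses console.log in place of alert so that it also runs outside a browser. The variable names are only illustrative.

```javascript
var myRegex = /JavaScript/;

// test() returns true when the pattern is found anywhere in the string.
var inSentence = myRegex.test("This is a book about JavaScript");   // true
var inRunTogether = myRegex.test("ThisIsAJavaScriptBook");          // true
var wrongCase = myRegex.test("This is a book about javascript");    // false

console.log(inSentence, inRunTogether, wrongCase);
```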

4.3.1. The Syntax of Regular Expressions

Regular expressions have a terse (some would argue cryptic) syntax. But don’t let that scare you away from regular expressions, because in that syntax is power. For example, the following regular expression looks for digits and then uses grouping to perform a substitution that reformats an Internet Protocol (IP) address block (in a format such as 192.168.0/24). What this example does is not really relevant to our discussion beyond showing an example of a more complex regular expression. (It was part of a Perl script that parses an Asia Pacific Network Information Centre (APNIC) network list on a firewall, if you must know.)

s/([0-9]+)\.([0-9]+)(\/[0-9]+)/$1\.$2\.0$3/;

The same regular expression can be written in JavaScript using the replace method of the String object, like so:

var theIP = "192.168.0/28";
alert(theIP.replace(/([0-9]+)\.([0-9]+)(\/[0-9]+)/,"$1\.$2\.0$3"));
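To see what each group captures, one option is to run the same pattern through the exec method (covered later in this section) and inspect the resulting array. This is only a sketch: the variable names parts and rewritten are illustrative, and console.log stands in for alert so the code also runs outside a browser.

```javascript
var theIP = "192.168.0/28";
var pattern = /([0-9]+)\.([0-9]+)(\/[0-9]+)/;
var parts = pattern.exec(theIP);

// parts[0] is the whole match; parts[1] through parts[3] are the groups.
console.log(parts[0]); // 168.0/28
console.log(parts[1]); // 168
console.log(parts[2]); // 0
console.log(parts[3]); // /28

// Only the matched portion is replaced; the leading "192." is left alone.
var rewritten = theIP.replace(pattern, "$1.$2.0$3");
console.log(rewritten); // 192.168.0.0/28
```

Note that the match starts at 168 rather than 192, because the pattern requires the slash to follow the second run of digits directly.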

The syntax of regular expressions includes several characters that have special meaning, including characters that anchor the match to the beginning or end of a string, a wildcard, and grouping characters, among others. Table 4-6 shows several of the special characters.

Table 4-6. Common Special Characters in JavaScript Regular Expressions
Character   Description
^           Sets an anchor to the beginning of the input.
$           Sets an anchor to the end of the input.
.           Matches any single character (other than a line break).
*           Matches the previous character zero or more times. Think of this as a wildcard.
+           Matches the previous character one or more times.
?           Matches the previous character zero or one time.
( )         Places the match inside of the parentheses into a group, which can be used later.
{n,}        Matches the previous character at least n times.
{n,m}       Matches the previous character at least n but no more than m times.
[ ]         Defines a character class to match any one of the characters contained in the brackets. A class can use a range such as 0-9 to match any digit or a-z to match any lowercase letter.
[^ ]        A caret at the start of a character class negates that class, meaning that the characters in the class cannot appear in the match.
\           Typically used as an escape character, meaning that whatever follows the backslash is treated as a literal character instead of as having its special meaning. Can also be used to define special character sets, which are shown in Table 4-7.
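
A few of the characters from Table 4-6 in action, as a rough sketch. The sample strings and variable names are made up for illustration, and the test and match methods used here are described later in this section.

```javascript
// ^ and $ anchor the pattern to the beginning and end of the input.
var exactCat = /^cat$/.test("cat");               // true
var insideConcat = /^cat$/.test("concat");        // false

// ? makes the preceding character optional.
var american = /colou?r/.test("color");           // true
var british = /colou?r/.test("colour");           // true

// [ ] defines a character class; {n,m} bounds the repetitions.
var threeDigits = /^[0-9]{2,4}$/.test("123");     // true
var fiveDigits = /^[0-9]{2,4}$/.test("12345");    // false

// [^ ] negates the class: match a run of characters that are not digits.
var leadingLetters = "abc123".match(/[^0-9]+/)[0]; // "abc"

// \ escapes the dot so that it is matched literally.
var literalDot = /a\.b/.test("a.b");              // true
var notAnyChar = /a\.b/.test("axb");              // false

console.log(exactCat, insideConcat, american, british);
console.log(threeDigits, fiveDigits, leadingLetters, literalDot, notAnyChar);
```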


In addition to the special characters, several sequences exist to match groups of characters or nonalphanumeric characters. Some of these sequences are shown in Table 4-7.

Table 4-7. Common Character Sequences in JavaScript Regular Expressions
Character   Match
\b          Word boundary.
\B          Nonword boundary.
\cX         Control character, where X is a letter. For example, \cA is the escape sequence for Control-A.
\d          Digit.
\D          Nondigit.
\n          Newline.
\r          Carriage return.
\s          Single whitespace character such as a space or tab.
\S          Single nonwhitespace character.
\t          Tab.
\w          Any alphanumeric character, whether number or letter, or an underscore.
\W          Any character that is not alphanumeric or an underscore.
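
The following sketch tries a few of the sequences from Table 4-7 against a made-up sample string. The string and the variable names are assumptions for illustration only.

```javascript
var sample = "Order 66 shipped";

var firstDigits = sample.match(/\d+/)[0];   // "66"
var firstWord = sample.match(/\w+/)[0];     // "Order"
var hasWhitespace = /\s/.test(sample);      // true

// \b requires a word boundary on each side, so "ship" on its own
// does not match inside the longer word "shipped".
var wholeWordShip = /\bship\b/.test(sample);        // false
var wholeWordShipped = /\bshipped\b/.test(sample);  // true

console.log(firstDigits, firstWord, hasWhitespace, wholeWordShip, wholeWordShipped);
```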


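In addition to the characters in Table 4-7, you can use two modifiers, i and g. The i modifier specifies that the regular expression should be matched in a case-insensitive manner. The g modifier specifies that matching should continue after the first match, which is called global matching (thus the g). Global matching is sometimes confused with greedy matching, but greediness is a separate concept that describes how much text a quantifier such as * or + consumes.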
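
The effect of the two modifiers can be sketched with the replace and match methods of the String object, which are described later in this section. The sample text and variable names are illustrative.

```javascript
var text = "JavaScript, javascript, JAVASCRIPT";

// No modifiers: the match is case sensitive and stops after the first hit.
var firstOnly = text.replace(/javascript/, "JS");
// "JavaScript, JS, JAVASCRIPT"

// i ignores case; g continues past the first match.
var everyOne = text.replace(/javascript/gi, "JS");
// "JS, JS, JS"

// With g, match returns every matching substring.
var found = text.match(/javascript/gi);
// "JavaScript", "javascript", and "JAVASCRIPT"

console.log(firstOnly);
console.log(everyOne);
console.log(found.length); // 3
```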

The RegExp object has its own methods, including exec and test, the latter of which tests a regular expression against a string and returns true or false based on whether the regular expression matches that string. However, when working with regular expressions, using methods native to the String type, such as match, search, and replace, is just as common.
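
Because test returns a Boolean, it fits naturally inside a small helper or a conditional. The function below, containsDigit, is a hypothetical helper written for this sketch and is not part of JavaScript.

```javascript
// Hypothetical helper: reports whether a string contains at least one digit.
function containsDigit(value) {
    return /\d/.test(value);
}

var withDigit = containsDigit("host1");   // true
var withoutDigit = containsDigit("www");  // false

if (withDigit) {
    console.log("host1 contains a digit");
}
```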

The exec() method of the RegExp object is used to parse the regular expression against a string and return the result. For example, parsing a simple URL and extracting the domain might look like this:

var myString = "http://www.braingia.org";
var myRegex = /http:\/\/\w+\.(.*)/;
var results = myRegex.exec(myString);
alert(results[1]);

The output from this code is an alert showing the domain portion of the address, as shown in Figure 4-8.

Figure 4-8. Parsing a typical web URL using a regular expression.


A breakdown of this code is helpful. First you have the string declaration:

var myString = "http://www.braingia.org";

This is followed by the regular expression declaration and then a call to the exec() method, which parses the regular expression against the string found in myString and places the results into a variable called results.

var myRegex = /http:\/\/\w+\.(.*)/;
var results = myRegex.exec(myString);

The regular expression contains several important elements. It begins by looking for the literal string http:. The two forward slashes follow, but because the forward slash (/) marks the beginning and end of a regular expression literal, you must escape each one using a backslash (\), making the regular expression http:\/\/ so far.

The next part of the regular expression, \w, looks for any single alphanumeric character (or an underscore). Web host names are typically www, so don’t be confused into thinking that the expression is looking for three literal w’s; the host in this example could be called web, host1, myhost, or www, as shown in the code you’re examining. Because \w matches only a single character, and host names usually have more than one (www has three), the regular expression adds the special character + to indicate that it must find an alphanumeric character at least once and possibly more than once. So now the code has http:\/\/\w+, which matches the address http://www right up to the .braingia.org portion.

You need to account for the dot character between the host name (www) and the domain name (braingia.org). You accomplish this by adding a dot character (.), but because the dot is also a special character, you need to escape it with a backslash, writing it as \. instead. You now have http:\/\/\w+\., which matches all the elements of a typical address right up to the domain name.

Finally, you need to capture the domain and use it later, so place the domain inside parentheses. Because you don’t care what the domain is or what follows it, you can use two special characters: the dot, to match any character; and the asterisk, to match zero or more occurrences of whatever precedes it, which is any character in this example. Together, .* matches everything that remains in the string. You’re left with the final regular expression, which is used by the exec() method. The result is placed into the results variable.

If a match is found, the output from the exec() method is an array containing the entire matched text at index 0, followed by one element for each captured portion of the expression. If no match is found, exec() returns null. The second element (index 1) is sent to an alert, which produces the output shown in Figure 4-8.

alert(results[1]);
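
Because exec() returns null when nothing matches, indexing the result directly throws an error for input that doesn’t fit the pattern. A defensive version might look like the following sketch. The ftp address and the noMatch variable are made up for illustration, and console.log stands in for alert.

```javascript
var myRegex = /http:\/\/\w+\.(.*)/;

var results = myRegex.exec("http://www.braingia.org");
// results[0] is the entire match; results[1] is the first captured group.
console.log(results[0]); // http://www.braingia.org
console.log(results[1]); // braingia.org

// This address doesn't begin with http://, so exec() returns null.
var noMatch = myRegex.exec("ftp://www.braingia.org");
if (noMatch === null) {
    console.log("No match found");
} else {
    console.log(noMatch[1]);
}
```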

That’s a lot to digest, and I admit this regular expression could be vastly improved with the addition of other characters to anchor the match, and to account for characters after the domain as well as nonalphanumerics in the host name portion. However, in the interest of keeping the example somewhat simpler, the less-strict match is shown.
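
As one possible direction for those improvements, and not a complete URL parser, the sketch below anchors the match to the start of the string, allows hyphens in the host name, and stops capturing the domain at the first character that can’t be part of it. The stricterRegex pattern and the sample address are assumptions made for this example.

```javascript
var original = /http:\/\/\w+\.(.*)/;
// ^ anchors to the start, [\w-]+ allows hyphens in the host name, and
// [\w.-]+ captures the domain but stops at a slash or other character.
var stricterRegex = /^http:\/\/[\w-]+\.([\w.-]+)/;

var address = "http://my-host.braingia.org/index.html";

// The original pattern fails here because \w doesn't match the hyphen.
var looseResult = original.exec(address);        // null
var strictResult = stricterRegex.exec(address);

console.log(looseResult);      // null
console.log(strictResult[1]);  // braingia.org
```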

The String object type provides three methods that use regular expression pattern matching to match and work with strings: match, replace, and search. Now that you’ve learned about regular expressions, it’s time to introduce these methods.

The match method returns an array with the same information as the RegExp object’s exec method, provided that the regular expression doesn’t use the g modifier. Here’s an example:

var emailAddr = "suehring@braingia.com";
var myRegex = /\.com/;
var checkMatch = emailAddr.match(myRegex);
alert(checkMatch[0]); //Returns .com

Because match returns null when the pattern isn’t found, the result can be used in a conditional to determine whether a given email address contains the string .com:

var emailAddr = "suehring@braingia.com";
var myRegex = /\.com/;
var checkMatch = emailAddr.match(myRegex);
if (checkMatch !== null) {
    alert(checkMatch[0]); //Returns .com
}

The search method works in much the same way as the match method but only sends back the index (position) of the first match, as shown here:

var emailAddr = "suehring@braingia.com";
var myRegex = /\.com/;
var searchResult = emailAddr.search(myRegex);
alert(searchResult); //Returns 17

If no match is found, the search method returns -1.
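
When you check the result of search, compare it against -1 explicitly. A match at the very start of the string returns 0, which a simple truthiness check would treat as a failure. The patterns and variable names in this sketch are illustrative.

```javascript
var emailAddr = "suehring@braingia.com";

var found = emailAddr.search(/\.com/);      // 17
var missing = emailAddr.search(/\.org/);    // -1
var atStart = emailAddr.search(/suehring/); // 0, a valid match position

if (missing === -1) {
    console.log("No .org found");
}
if (atStart !== -1) {
    console.log("suehring found at position " + atStart);
}
```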

The replace method does just what its name implies: it replaces one string with another when a match is found. Assume in the email address example that you want to change any .com email address to a .net email address. You can accomplish this by using the replace method, like so:

var emailAddr = "suehring@braingia.com";
var myRegex = /\.com$/;
var replaceWith = ".net";
var result = emailAddr.replace(myRegex,replaceWith);
alert(result); //Returns suehring@braingia.net

If the pattern doesn’t match, the original string is placed into the result variable; if it does, the new value is returned.


Note:

You can use several special characters to help with substitutions. Please see the ECMA-262 specification for more information about these methods.
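
Two of those special replacement patterns are $1, $2, and so on, which stand for the captured groups, and $&, which stands for the entire matched text. The sketch below shows both; the patterns and variable names are chosen only for illustration.

```javascript
var emailAddr = "suehring@braingia.com";

// $1 and $2 refer to the first and second captured groups.
var swapped = emailAddr.replace(/(\w+)@(\w+)/, "$2@$1");
console.log(swapped);   // braingia@suehring.com

// $& refers to the entire matched text.
var bracketed = emailAddr.replace(/\.com$/, "[$&]");
console.log(bracketed); // suehring@braingia[.com]
```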


Later chapters show more examples of string methods related to regular expressions. Feel free to use this chapter as a reference for the special characters used in regular expressions.

4.3.2. References and Garbage Collection

Some types of variables or the values they contain are primitive, whereas others are reference types. The implications of this might not mean much to you at first glance—you might not even think you’ll ever care about this. But you’ll change your mind the first time you encounter odd behavior with a variable that you just copied.

First, some explanation: objects, arrays, and functions operate as reference types, whereas numbers, Booleans, null, and undefined are known as primitive types. According to the ECMA-262 specification, strings are also a primitive type, but strings aren’t relevant to this discussion.

When a number is copied, the behavior is what you’d expect: The original and the copy both get the same value. If you change the original, however, the copy is unaffected. Here’s an example:

// Set the value of myNum to 20.
var myNum = 20;
// Create a new variable, anotherNum, and copy the contents of myNum to it.
// Both anotherNum and myNum are now 20.
var anotherNum = myNum;
// Change the value of myNum to 1000.
myNum = 1000;
// Display the contents of both variables.
// Note that the contents of anotherNum haven't changed.
alert(myNum);
alert(anotherNum);

The alerts display 1000 and 20, respectively. Once the variable anotherNum gets a copy of myNum’s contents, it holds on to the contents no matter what happens to the variable myNum after that. The variable does this because numbers are primitive types in JavaScript.

Contrast that example with a variable type that’s a reference type, as in this example:

// Create an array of three numbers in a variable named myNumbers.
var myNumbers = [20, 21, 22];
// Make a copy of myNumbers in a newly created variable named copyNumbers.
var copyNumbers = myNumbers;
// Change the first index value of myNumbers to the integer 1000.
myNumbers[0] = 1000;
// Alert both.
alert(myNumbers);
alert(copyNumbers);

In this case, because arrays are reference types, both alerts display 1000,21,22, even though only myNumbers was directly changed in the code. Both variables refer to the same array, so a change made through one is visible through the other. The moral of this story is to be aware that objects, arrays, and functions are reference types: copying such a variable copies the reference, not the underlying data.
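
If you need an independent copy of an array, one common approach is to call the slice method with no arguments, which returns a new array holding the same elements. This is a sketch of that approach rather than the only way to do it, and the copy is shallow: any objects stored inside the array are still shared.

```javascript
var myNumbers = [20, 21, 22];

// slice() with no arguments returns a new array with the same elements.
var copyNumbers = myNumbers.slice();

// Changing the original no longer affects the copy.
myNumbers[0] = 1000;

console.log(myNumbers.join(","));   // 1000,21,22
console.log(copyNumbers.join(",")); // 20,21,22
```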

Loosely related to this discussion of differences between primitive types and reference types is the subject of garbage collection. Garbage collection refers to the JavaScript interpreter reclaiming the memory held by values that are no longer used. When a value can no longer be reached by the program, the interpreter frees up the memory for reuse. The Java Virtual Machine and the .NET Common Language Runtime do the same for programs that run on them.

This automatic freeing of memory in JavaScript is different from the way in which other languages such as C++ deal with unused variables. In those languages, the programmer must free the memory manually. This is all you really need to know about garbage collection.
