Hey, Scripting Guy! There's No Mistaking Regular Expressions
The Microsoft Scripting Guys
Download the code for this article: HeyScriptingGuy2008_01.exe (150KB)
Everyone makes mistakes. (What, you thought the Scripting Guys team was created on purpose?) However, if there's anything positive that can be said about mistakes, it's—what's that? You thought we were going to say that the one positive thing about mistakes is that you can always learn from those mistakes? Oh, heavens no: about the only thing you can learn from a mistake is that you would have been way better off having never made that mistake in the first place.
Note: Sorry; no jokes about Microsoft Bob, not today. We made the mistake of poking fun at Microsoft Bob once before, and, in that case, we did learn from our mistake. When it comes to Microsoft Bob, our lips are sealed.
To be honest, the Scripting Guys can't really think of anything good to say about mistakes; that's why we've devoted our lives to minimizing the mistakes that we make. (Admittedly, the only way we could minimize our mistakes was by minimizing the number of things that we actually do. But that's another story.)
Now, it's all well and good that we've tried to minimize our own mistakes; however, that might not do us much good if the people we work with continue to make mistakes. (No, we already told you: no jokes about Microsoft Bob!) If you've ever written a front-end to, say, a database program or to Active Directory® (and we know that many of you have), then you know exactly what we're talking about: your front-end data entry program is only as good as the people doing the data entry.
Suppose you want first names to be entered like this: Ken, with an uppercase first letter and all subsequent letters in lowercase. What happens when a user enters that name like this: KEN? Suppose dates are supposed to be entered like this: 01/23/2007. What happens when a user enters a date like this: January 23, 2007? Suppose—well, you get the idea. Like we said, your front-end data entry program is only as good as the people doing the data entry. And, like it or not, data entry people are going to make mistakes. Unless, of course, you take steps to help ensure that they don't.
Make Sure Your Data Is Valid
Let the Games Be Over
Have you ever wondered what the one event is that the Scripting Guys look forward to all year? If you've been to the Script Center in February either of the past two years, you might guess that event is "The Winter Scripting Games." Well, you'd be wrong. The Scripting Guys look forward to the end of the Scripting Games, when—after two weeks of non-stop fun and excitement—the Scripting Guys get to bask in the glow of another successful Scripting Games competition and sleep for the next month.
So, in preparation for our favorite event of the year—the end of the Scripting Games—it's time to start the 2008 Winter Scripting Games! Join us February 15-March 3, 2008, at the Script Center for more than two weeks of scripting fun and the best scripting competition in existence.
The Scripting Games are a great way to test—and improve—your scripting skills. And because this year the Scripting Guys want to sleep for an extra few days after the competition is over, we're making this year's competition even bigger and better than last year! Once again, we'll have two divisions: Beginner and Advanced. That means that whether you're brand new to scripting, an old (or young) pro, or somewhere in between, this competition is for you.
So what's new this year? We're adding yet another scripting language. You can enter the competition in VBScript, Windows PowerShell, or—here's the new one—Perl. (We know Perl isn't new, but it's new to this competition.) There will probably be even more fun, but since we had to put this sidebar together months ago we have no idea what it will be. You'll have to come to the Script Center to find out what we came up with.
In case you're wondering, yes, there will be prizes! What will they be? Hey, we're not spoiling the surprise here—come to the Script Center and find out! microsoft.com/technet/scriptcenter/funzone/games/default.mspx.
A lot of you are probably already doing rudimentary data entry validation. And, for many things, this simple data entry validation is perfectly fine. Need to make sure that the string value strName has 20 or fewer characters in it? This tiny block of code will do:
If Len(strName) > 20 Then
    Wscript.Echo "Invalid name; a " & _
        "name must have 20 or fewer characters."
End If
But suppose strName is a new part number, and part numbers must meet a specified pattern: four digits followed by two uppercase letters (something like 1234AB). Using plain vanilla VBScript, can you verify that strName follows the required pattern for part numbers? Yes, you can. But trust us, you don't want to. Instead, you want to use a regular expression.
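For the curious, here is roughly what the plain vanilla approach might look like. Treat it as a sketch only: the variable names blnValid, strChar, and i are made up for this illustration, and the sample value is an assumption.

```vbscript
' Check a part number (four digits, then two uppercase letters)
' without regular expressions. All names here are illustrative.
strName = "1234AB"

' The value must be exactly six characters long.
blnValid = (Len(strName) = 6)

If blnValid Then
    For i = 1 To 6
        strChar = Mid(strName, i, 1)
        If i <= 4 Then
            ' Positions 1 through 4 must be digits.
            If strChar < "0" Or strChar > "9" Then blnValid = False
        Else
            ' Positions 5 and 6 must be uppercase letters.
            If strChar < "A" Or strChar > "Z" Then blnValid = False
        End If
    Next
End If

If blnValid Then
    Wscript.Echo "This is a valid part number."
Else
    Wscript.Echo "Invalid part number."
End If
```

That's more than a dozen lines to check one simple pattern, and every change to the part number format means rewriting the loop.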
Make Sure that Only Numbers Appear in a Value
Regular expressions date back to the 1950s, when they were first described as a form of mathematical notation. In the 1960s, this mathematical model was incorporated into the QED text editor as a way to search for character patterns in a text file. Sometime after that—what's that? Oh. Apparently not everyone finds the history of regular expressions as fascinating as we do. Okey doke. In that case, we'll just show you what we're talking about.
Want to make sure a variable (in this case, the variable strSearchString) contains only numbers? Figure 1 shows one way to do that.
Figure 1 Numbers only, please
Set objRegEx = _
    CreateObject("VBScript.RegExp")
objRegEx.Global = True
objRegEx.Pattern = "\D"
strSearchString = "353627"
Set colMatches = _
    objRegEx.Execute(strSearchString)
If colMatches.Count > 0 Then
    Wscript.Echo "Invalid: a " & _
        "non-numeric character was found."
Else
    Wscript.Echo "This is a valid entry."
End If
As you can see, the script starts out by creating an instance of the VBScript.RegExp object, the object that—well, yes, you're right, the object that enables us to use regular expressions in our script. (Ah, we wanted to be the ones to explain that.) After that, we assign values to two key properties of the regular expressions object: Global and Pattern.
The Global property determines whether the script should match all occurrences of a pattern or should simply find the first such occurrence and then stop. In general, you're better off setting this value to True, meaning you want to search for all occurrences of a pattern. (The default is False.) More often than not, you will want to find all occurrences of a pattern. Even if you don't, the only advantage to finding just the first occurrence of a pattern is that the script will run faster if it stops partway through the string rather than searching all the way to the end. In theory, that sounds good. In practice, though, the search typically completes so quickly that you won't notice much of a difference.
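To see the difference the Global property makes, a small experiment along these lines can help. The sample string is an assumption chosen for the demonstration; it contains three non-digit characters.

```vbscript
' Compare Global = False and Global = True on the same input.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "\D"          ' any non-digit character
strSearchString = "12a34b56c"    ' sample value for this demo

' With Global = False, the search stops after the first match.
objRegEx.Global = False
Set colMatches = objRegEx.Execute(strSearchString)
Wscript.Echo "Global = False: " & colMatches.Count   ' 1

' With Global = True, every match is collected.
objRegEx.Global = True
Set colMatches = objRegEx.Execute(strSearchString)
Wscript.Echo "Global = True: " & colMatches.Count    ' 3
```

For a simple valid-or-invalid check either setting gives the same verdict, because one match is enough to reject the value.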
This is the good part: the Pattern is where we specify the characters (and the character blueprint) we're interested in. In our sample script, we're looking for any character that isn't one of the digits 0 through 9. If we find such a character (a letter, a punctuation mark, whatever), then we know the value assigned to strSearchString isn't valid. With regular expressions, the syntax \D is used to match any non-digit character; thus we assign the value \D to the Pattern property.
Note: How did we know that \D matches any non-digit character? Well, there's a long history behind that. You see, several years ago—oh, right. We looked this up in the VBScript language reference on MSDN® at msdn2.microsoft.com/1400241x.
After assigning values to the Global and Pattern properties, our next step is to assign a value to the variable strSearchString:
strSearchString = "353627"
So how do we know whether there are any non-digit characters in this string? That's easy. We just call the Execute method, which searches our variable for any non-digit characters:
Set colMatches = _
    objRegEx.Execute(strSearchString)
When you call the Execute method, any matches—that is, any instances of the Pattern—that are found are automatically stored in the Matches collection. (In our sample script, we gave this collection the name colMatches.) If we want to know whether or not our variable contains any non-digit characters (and we do), all we have to do now is check the value of the collection's Count property, which tells us how many items are in the collection:
If colMatches.Count > 0 Then
    Wscript.Echo "Invalid: a " & _
        "non-numeric character was found."
Else
    Wscript.Echo "This is a valid entry."
End If
If the Count is greater than 0, that can mean only one thing: a non-digit character was found somewhere in the variable value. (If all the characters in the value were digits, then the collection would be empty and thus have a Count of 0.) In our sample script, we then echo back the fact that a non-numeric character was discovered. In an actual script or database front end, you'd probably echo a similar message and loop back around to have the user try again. But we'll let you worry about that; as Scripting Guys, we have plenty of other things to worry about.
Like what? Like the thought that someday the people at TechNet Magazine might actually start reading the articles we send to them each month. We'd just as soon they not learn from their mistakes.
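If you need the digits-only check in more than one place, one option is to wrap it in a function. The sketch below is an illustration, not part of the article's download: the function name IsDigitsOnly is hypothetical, and it uses the RegExp object's Test method, which simply returns True or False instead of building a Matches collection. It also rejects an empty string, something the Count-based check in Figure 1 would let through because an empty string contains no non-digit characters.

```vbscript
' Hypothetical helper: True only if strValue is non-empty
' and contains nothing but the digits 0 through 9.
Function IsDigitsOnly(strValue)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Pattern = "\D"   ' any non-digit character

    ' Test returns True if the pattern is found anywhere in the string.
    IsDigitsOnly = (Len(strValue) > 0) And Not objRegEx.Test(strValue)
End Function

' Try the function on a few sample values (assumed for this demo).
For Each strValue In Array("353627", "3536a7", "")
    If IsDigitsOnly(strValue) Then
        Wscript.Echo """" & strValue & """ is a valid entry."
    Else
        Wscript.Echo """" & strValue & """ is invalid."
    End If
Next
```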
Ah, good question: couldn't we just use the IsNumeric function to determine whether or not strSearchString contains a number? Sure, provided that strSearchString can be any number. But what if strSearchString must be a positive integer? That could be a problem; IsNumeric identifies both of these as valid numbers:
-45
672.69
Why are these identified as valid numbers? Because they are valid numbers; they just aren't positive integers. Our regular expression will flag these as invalid entries, however, because the regular expression catches the non-digit characters minus sign (-) and period (.). If you've been reading this column and thinking, "Why am I reading this column?" well, there's a pretty good reason right there.
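A side-by-side comparison makes the difference easy to see. This is a minimal sketch using the two values above plus one all-digit value; note that how IsNumeric treats 672.69 depends on the decimal separator in the computer's regional settings, so the results assume a locale that uses a period.

```vbscript
' Compare IsNumeric with the \D regular expression check.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "\D"   ' any non-digit character

For Each strValue In Array("353627", "-45", "672.69")
    ' IsNumeric accepts any value that can be read as a number.
    blnNumeric = IsNumeric(strValue)

    ' The regular expression check accepts digits only.
    blnDigitsOnly = Not objRegEx.Test(strValue)

    Wscript.Echo strValue & ": IsNumeric = " & blnNumeric & _
        ", digits only = " & blnDigitsOnly
Next
```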
What's that you said? That's not good enough? Wow, tough crowd this month. OK, let's take a look at a few other types of data entry validation that can be accomplished using regular expressions.
Make Sure that No Numbers Appear in a Value
We just showed you how to ensure that only numbers appear in a value. What if we wanted to do the opposite; what if we wanted to make sure that no numbers appear in our value? Well, here's one search pattern we can use:
objRegEx.Pattern = "[0-9]"
What's going on here? Well, in the wild and wacky world of regular expressions, the bracket characters ([ and ]) enable you to specify a set of characters or, as we've done here, a range of characters. What does the Pattern [0-9] mean? That means we're searching for any character in the range 0 through 9; in other words, any digit. If we wanted only even digits (that is, if we wanted to ensure that our value doesn't include any 1s, 3s, 5s, 7s, or 9s), we would use this pattern:
objRegEx.Pattern = "[13579]"
Note that because we didn't use a hyphen, we aren't looking for a range of characters. In fact, we're looking for the actual characters 1, 3, 5, 7, and 9, specifically.
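Here is a quick sketch that runs both patterns against the same value. The sample string is an assumption picked so that it contains digits, but no odd ones.

```vbscript
' Try a range ([0-9]) and a set ([13579]) on the same value.
Set objRegEx = CreateObject("VBScript.RegExp")
strSearchString = "2468a0"   ' sample value for this demo

' A range: any digit from 0 through 9.
objRegEx.Pattern = "[0-9]"
If objRegEx.Test(strSearchString) Then
    Wscript.Echo "At least one digit was found."
End If

' A set: only the characters 1, 3, 5, 7, and 9.
objRegEx.Pattern = "[13579]"
If objRegEx.Test(strSearchString) Then
    Wscript.Echo "At least one odd digit was found."
Else
    Wscript.Echo "No odd digits were found."
End If
```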
Of course, the pattern [0-9] looks only for numbers; it won't find punctuation marks or other characters that aren't letters but aren't numbers, either. Do you think we can create a pattern that looks for any non-letter characters? Of course we can. You can do practically anything you want with regular expressions:
objRegEx.Pattern = "[^A-Za-z]"
In this case, we're actually combining two ranges inside one pair of brackets: A-Z and a-z. As you probably figured out, A-Z means the range of characters from, well, uppercase A to uppercase Z, and a-z means the range from lowercase a to lowercase z. What does the ^ character mean? Well, when it's the first character inside a pair of square brackets, the ^ means, "Look for any character not in the character set." Thus, [^A-Za-z] means, "Look for any character that is neither an uppercase letter nor a lowercase letter." That includes numbers, punctuation marks, and any other weird characters you might find on your keyboard. (Be careful not to write this as two separate sets, [^A-Z][^a-z]. That pattern looks for two consecutive characters, a non-uppercase character followed by a non-lowercase character, so it would flag aB as a match and miss a lone 7.)
Alternatively, we could have set the IgnoreCase property to True:
objRegEx.IgnoreCase = True
By doing that, our search would ignore letter case. In turn, we could then use this Pattern, which—combined with the IgnoreCase property—would look for anything that isn't an uppercase or lowercase letter:
objRegEx.Pattern = "[^A-Z]"
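As a rough illustration, the IgnoreCase approach could be used to check names like this. The sample names are assumptions made up for the demo.

```vbscript
' Flag any value that contains something other than a letter.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.IgnoreCase = True    ' treat a-z the same as A-Z
objRegEx.Pattern = "[^A-Z]"   ' any character that isn't a letter

For Each strValue In Array("Ken", "KEN", "Ken2", "Ken Myer")
    If objRegEx.Test(strValue) Then
        Wscript.Echo strValue & ": contains a non-letter character"
    Else
        Wscript.Echo strValue & ": letters only"
    End If
Next
```

Keep in mind that a blank space counts as a non-letter character, so a value like Ken Myer gets flagged along with Ken2.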
Make Sure a Part Number Is Valid
Now let's get a little fancier. (And, trust us, you can get really fancy with regular expressions; take a look at the Web site regexlib.com for some examples.) Earlier in this article we said something like this: Suppose strName is a new part number, and part numbers must meet a specified pattern: four digits followed by two uppercase letters (for example, 1234AB). How can you verify that a variable meets this pattern? Why, like this, of course:
objRegEx.Pattern = "^\d{4}[A-Z]{2}"
What does this pattern mean? What a coincidence: we were just about to explain what this pattern means. In this case, we have three criteria we're searching for:
^\d{4}
The \d means we're looking for digits (a character in the range 0 through 9). What does the {4} mean? That means we want to match exactly four of the characters (that is, exactly four digits). Thus we're looking for four consecutive digits.
And what about the ^ character? Well, outside a pair of square brackets, the ^ indicates that the value must begin with the subsequent pattern; that means that AA1234AB would not be flagged as a match. (Why? Because the value actually begins with AA, not with four consecutive digits.) If we wanted to, we could use the $ character to indicate that the match must occur at the end of the string. But we just don't want to. (Well, at least not with this example.)
[A-Z]{2}
You already know what the [A-Z] component means; that's any uppercase letter. And the {2}? You got it: the four digits must be followed by exactly two uppercase characters.
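Pretty easy, huh? By the way, in this case, if the Count is 0 then we have an invalid value. Why? Well, this time we're not searching for a single character that would invalidate the string; instead, we're searching for a value that begins with the required pattern. If we don't get such a match, that means we have an invalid value. And it also means that the collection's Count property will be 0. (One caveat: because the pattern has no $ at the end, a value such as 1234ABC still counts as a match. If nothing is allowed to follow the two letters, use ^\d{4}[A-Z]{2}$ instead.)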
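Putting the part number pattern to work might look something like this. It's a sketch, and the sample part numbers are assumptions; the last one is there to show what happens to trailing characters when the pattern has no $ at the end.

```vbscript
' Validate part numbers: four digits followed by two uppercase letters.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "^\d{4}[A-Z]{2}"

For Each strName In Array("1234AB", "AA1234AB", "1234ab", "123AB", "1234ABC")
    Set colMatches = objRegEx.Execute(strName)
    If colMatches.Count = 0 Then
        ' No match this time means the value is invalid.
        Wscript.Echo strName & ": invalid part number"
    Else
        ' Value holds the portion of the string that was matched.
        Wscript.Echo strName & ": matched " & colMatches(0).Value
    End If
Next
```

Only 1234AB and 1234ABC produce a match, and for 1234ABC the matched text is just 1234AB. Changing the pattern to ^\d{4}[A-Z]{2}$ would reject that value as well.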
Note: Two other useful constructions are these: {3,} and {3,7}. The {3,} means that there must be at least 3 consecutive characters (or expressions); however, while there is a minimum of 3, there is no maximum. Thus \d{3,} matches 123, it matches 1234, and it also matches 123456789. The {3,7} means there must be at least 3 consecutive characters (or expressions) but no more than 7. Thus, with the anchored pattern ^\d{3,7}$, 123456 is a match, but 123456789 (which has more than 7 consecutive digits) is not a match. Without the anchors, \d{3,7} would still find a match in 123456789: the first seven digits.
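The effect of these quantifiers, with and without the ^ and $ anchors, can be checked with a short loop like the one below. The patterns and test values are chosen purely for illustration.

```vbscript
' Try several quantifier patterns against several values.
Set objRegEx = CreateObject("VBScript.RegExp")

For Each strPattern In Array("^\d{3,}$", "^\d{3,7}$", "\d{3,7}")
    objRegEx.Pattern = strPattern
    For Each strValue In Array("12", "123456", "123456789")
        If objRegEx.Test(strValue) Then
            Wscript.Echo strPattern & " matches " & strValue
        Else
            Wscript.Echo strPattern & " does not match " & strValue
        End If
    Next
Next
```

None of the patterns match 12. The anchored ^\d{3,7}$ rejects 123456789, while the unanchored \d{3,7} reports a match for it because the first seven digits satisfy the pattern.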
Here's another useful pattern. Suppose your part number starts off with 4 digits followed by the letters US followed by two final characters (which might be letters, or might be numbers, or might be anything). Here's one way to match that:
objRegEx.Pattern = "^\d{4}US.."
As you can see, this pattern starts off with four digits (\d), which must be followed by the characters US (and only the characters US; anything else will not be a match). After the characters US, we then need two more characters of some kind. How do we indicate any character in a regular expression? Hold on a second; we know this one ... wait, we've got it: any single character (other than a newline) is indicated by a period (.). Logically enough, that means two characters can be indicated by two periods:
..
OK, that one was easy. Let's try a harder one. Suppose those internal two characters indicate the country where the part was manufactured. If you have manufacturing plants in the U.S., the U.K., and Spain, that means that any of these two-character codes is valid: US, UK, ES. How the heck can we do multiple-choice in a regular expression?
Well, how about like this:
objRegEx.Pattern = "^\d{4}(US|UK|ES).."
The key here is this little construction: (US|UK|ES). We've placed the three acceptable values (US, UK, and ES) inside a set of parentheses, separating each value with the pipe (|) character. That's how you do multiple-choice with a regular expression.
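The parentheses do one more useful thing: they capture whichever option was matched. The sketch below assumes VBScript 5.5 or later, where each Match object has a SubMatches collection; the sample part numbers are made up for the demo.

```vbscript
' Validate the part number and report the country code it contains.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "^\d{4}(US|UK|ES).."

For Each strName In Array("1234USA1", "5678ES9Z", "1234FR00", "1234US")
    Set colMatches = objRegEx.Execute(strName)
    If colMatches.Count = 0 Then
        Wscript.Echo strName & ": invalid part number"
    Else
        ' SubMatches(0) holds the text captured by the first
        ' pair of parentheses: US, UK, or ES.
        Wscript.Echo strName & ": country code " & _
            colMatches(0).SubMatches(0)
    End If
Next
```

Here 1234FR00 fails because FR isn't one of the three choices, and 1234US fails because the two final characters are missing.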
But wait: didn't we say that you could place multiple options inside square brackets? Isn't that what the whole [13579] thing was for? Yes, it was. However, that works only for single characters; [13579] will always be interpreted as 1, 3, 5, 7, or 9. To use 135 and 79 as the choices, you need to use this syntax: (135|79). You can also use parentheses to delineate a single word to be searched for:
objRegEx.Pattern = "(scripting)"
Or, just leave the parentheses off:
objRegEx.Pattern = "scripting"
Note: OK, that's good to know but what if the parentheses need to be included as part of the search term; that is, what if I want to search for (scripting)? Well, because both the open and close parentheses are reserved characters in regular expressions, you need to preface each character with a \. In other words:
objRegEx.Pattern = "\(scripting\)"
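To see the difference the backslashes make, you could run both patterns against the same sentence. The sample sentence below is an assumption for this demo.

```vbscript
' Escaped versus unescaped parentheses in a pattern.
Set objRegEx = CreateObject("VBScript.RegExp")
strSearchString = "Learn (scripting) today"

' Escaped: the parentheses are part of the text being searched for.
objRegEx.Pattern = "\(scripting\)"
Set colMatches = objRegEx.Execute(strSearchString)
Wscript.Echo colMatches(0).Value   ' (scripting)

' Unescaped: the parentheses only group the word.
objRegEx.Pattern = "(scripting)"
Set colMatches = objRegEx.Execute(strSearchString)
Wscript.Echo colMatches(0).Value   ' scripting
```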
Freedom of Expression
That's all the time we have for this month. If you'd like to learn more about regular expressions, well, you might want to take a peek at the webcast String Theory for System Administrators: An Introduction to Regular Expressions (microsoft.com/technet/scriptcenter/webcasts/archive.mspx). And as you start to learn, and use, regular expressions, be sure to keep in mind the immortal words of Sophia Loren: "Mistakes are part of the dues one pays for a full life."
Wow. The Scripting Guys have lived much fuller lives than we realized!
Dr. Scripto's Scripting Perplexer
The monthly challenge that tests not only your puzzle-solving skills, but also your scripting skills.
January 2008: Winding Nouns
This month Dr. Scripto decided to work with Windows PowerShell. Of course, you can't work with Windows PowerShell—not if you're using it right anyway (and Dr. Scripto always does everything right)—without using cmdlets. For this puzzle, Dr. Scripto has wound some cmdlets around a grid, and you need to find them.
Now, we have to admit this puzzle would be a little more confusing if we were to use the whole cmdlet. Because of the verb-noun construct of cmdlets, there are dozens that start with Get-, Set-, and so on. So we've left off the verb part of the cmdlet (and the hyphen), leaving only the nouns. (But we'll tell you that all of the verbs were "Get-".)
Your task this month is to unwind the cmdlet nouns. We've provided you with the list of nouns hidden in the puzzle. (Are we generous or what?) Each word can wind around vertically and horizontally, forward and backward, but never diagonally. We've given you the first word (Location) as an example. Good luck.
Word List
Alias
ChildItem
Credential
Date
EventLog
ExecutionPolicy
Location
Member
PSSnapin
TraceSource
Unique
Variable
WMIObject
Answer: Winding Nouns, January 2008 (answer grid image not reproduced here)
The Microsoft Scripting Guys
The Scripting Guys work for (well, are employed by) Microsoft. When not playing/coaching/watching baseball (and various other activities), they run the TechNet Script Center. Check it out at www.scriptingguys.com.
© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.