Much ADO About Text Files
Greg Stemp
Microsoft Corporation
March 5, 2004
Summary: Demonstrates how script writers can use Microsoft ActiveX Database Objects (ADO) techniques to retrieve data from logs and other text files. (14 printed pages)
If we Microsoft® Scripting Guys were smart (well, let's face it: if we were smart we'd be the Microsoft Highly-Paid Executive Guys rather than the Microsoft Scripting Guys) you would think that at some point we would learn not to make offhand remarks, unless we were ready and able to back them up. For example, in the Readme file to the ADSI Scriptomatic, I said something like, "And if you'd be interested in a Microsoft® Active Directory® browser, just let us know and we'll see what we can do." That seemed like a pretty innocuous thing to say; after all, who ever reads Readme files? And who would want an Active Directory browser anyway?
Well, it turns out that a lot of people read Readme files, and that a lot of people want an Active Directory browser. Even now, a year later, we get 2 or 3 e-mail messages a day from people asking for the browser. That enthusiasm is great, but the truth is, we didn't really have an Active Directory browser. (Well, actually we did, and do, but it ran kind of slow on a large directory service like the one we have at Microsoft. We didn't release it right away because we had to find a way to speed it up a bit, and then we had to clean up some of the code. But if you really want a copy right now... No, wait, that's how we got ourselves into trouble in the first place.)
So did I learn my lesson from the Active Directory browser incident? Of course not. When I did a webcast in January on database scripting, I showed an example of how you could use database techniques to query a text file. In my usual manner, I said something like, "We're not going to spend any time talking about this today, but if you'd like to know more about accessing text files using database techniques, just let me know."
Well, as the old saying goes, once the toothpaste is out of the tube, it's awfully hard to get it back in. No sooner had I said that than I received scores of e-mail messages from people saying, yes, we would like to know more about accessing text files using database techniques. As the other old saying goes: Me and my big mouth. (I'm just glad I didn't say something like, "If you'd like to know how to read data from and write data to Microsoft® Excel using a script, just let me know." No telling how much e-mail a statement like that might generate.)
Oh, right: I shouldn't have said that either. Okay, I'll tell you what: This month we'll talk about accessing delimited and fixed-width text files using ADO (Microsoft® ActiveX® Database Objects), and next month we'll talk about Excel. In the meantime, I'll keep saying to myself, "Silence is golden, silence is golden, silence is ...."
So What Are Delimited Files?
A delimited file is nothing more than a text file in which individual values are separated by a standard character (typically a comma). For example, suppose we have a file consisting of last names, first names, and middle initials; the file might look like this:
LastName,FirstName,MiddleInitial
Myer,Ken,W
Poe, Deborah,L
In this example, the comma is our "delimiter," the character used to separate one field from another. (In fact, comma-separated-value files, or CSVs, are probably the most popular form of delimited file.) Not all text files use the comma as a delimiter; many log files, for example, are tab-delimited files instead. But don't worry; everything we'll show you in this column can be used regardless of the character serving as the delimiter.
**Scripting Guys' Tip **The sample file we showed you a second ago represents a typical CSV file, but it's not the recommended way of doing things. Instead, it's recommended that you surround individual fields with double-quotation marks, like this:
"Myer","Ken","W"
What's the difference? Well, in this simple case, there really isn't one. But suppose you had a text file like this, where there happens to be a comma in the value:
Address
555 Main Street, Apartment 5
There's only one field, but your script will see two values (555 Main Street and Apartment 5). That's because there's a comma in there, and the script will assume the comma is being used as a delimiter. To keep things clear, format your CSV files like this:
Address
"555 Main Street, Apartment 5"
When the comma is embedded in double-quotes, ADO treats it as just another character in a string. We'll have more to say about this topic in a moment.
Why Can't I Just Use the FileSystemObject to Read Text Files?
If you've ever worked with text files in your scripts, you've likely used the FileSystemObject, a COM object that ships with Microsoft® Windows® Script Host and enables you to read and write text files. For example, you've probably used code similar to this to read a text file line-by-line:
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Databases\PhoneList.csv", 1)
Do Until objFile.AtEndOfStream
strLine = objFile.ReadLine
Wscript.Echo strLine
Loop
objFile.Close
So what's wrong with this? To a certain degree, nothing; it will usually work just fine. Of course, it will work just fine provided you overcome the obstacles presented by the FileSystemObject:
- No filtering. The nice thing about databases is that you can issue a query like "Select * From Logfile Where Result = 'Error'" and you'll get back only those records where the Result field is equal to error. That can't be done with the FileSystemObject. You might want only the records where Result is equal to Error, but you'll still have to read through the entire file, from top to bottom, checking the value of the Result field each time. That's not necessarily slower (the FileSystemObject is actually pretty darn fast), but it does make your code a bit trickier to write.
- Difficulty in calculating statistics. Suppose you have a directory of some kind, and you'd like to count the number of people in each of your departments. With a database, you can issue a single query that will return that information in a flash. With the FileSystemObject, well, no such luck. Instead, you'll have to examine each record, and then use an array or a Dictionary object to manually tally up the number of people in each department. This will work, but it's tedious to code (and even more tedious if you have to change that code somewhere down the road).
- One and done. Another problem with the FileSystemObject is that it's a one-way street, and a dead-end street to boot. What does that mean? Well, suppose you use the FileSystemObject to read through a text file and calculate a statistic of some kind. Now you want to read through it a second time, and calculate a second statistic. Oops, with the FileSystemObject there's no turning back: Once you get to the end of the file, you're done. You can't loop back through your file. Instead, you'll have to close the file and re-open it. Are there ways to work around this? Sure, but that's even more code you'll have to write.
- Difficulty in getting at the individual fields. When you use the ReadLine method to read in a line from a text file, you get, well, a line from a text file. In other words, you'll get back something that looks like this:
Myer,Ken,W
- As you can see, that's at best only half the battle. After all, you want to parse out the user's first name, last name, and middle initial. Unfortunately, all that information, along with an assortment of commas, is glommed together into a single string. By contrast, using database techniques you can essentially say, "Just get me the last name," and ADO will determine which portion of the record represents the user's last name. Much, much easier.
Can't I Just Use Split to Get at Individual Values?
If you read the preceding bullet point, you might wonder why you can't just read each line in the text file and then use the Split function to separate the fields; after all, that's exactly what the Split function is for. For example, suppose we have the following line in a text file:
a,b,c,d,e,f,g
We write a script that reads this line from the file (using the FileSystemObject), and stores the text into a variable named strLine. Our script then runs this code:
arrItems = Split(strLine, ",", -1, 1)
For Each strItem in arrItems
Wscript.Echo strItem
Next
What happens when this block of code runs? We get the following output, with each field (in this case, each letter in the string) now stored as a separate item in the array named arrItems:
a
b
c
d
e
f
g
That's exactly the output we want. So why not use Split instead of all that crazy database stuff?
Well, two reasons. First, the database code is far more flexible. Suppose we wanted to only extract the values a, d, and f. Can you do that using Split? Sure, but it requires a considerable amount of coding to weed out the unwanted fields. By contrast, a database query can pull out selected values just as easily as it can pull out all the values. Likewise, you can create database queries that will automatically return statistical information from your text file (for example, the number of successful operations versus the number of failed operations). Again, you can read through the text file and calculate these statistics yourself, but it's nowhere near as easy.
More important, however, is the fact that the Split function will run into problems if your CSV file includes commas as part of the data. For example, consider this line of information extracted from Active Directory. Note the commas that are part of the user's distinguished name (this is actually too long to fit on a single line, so use your imagination and visualize it all on one line):
105,"cn=Ken Myer,ou=Accounting,ou=North America,dc=fabrikam,dc=com","Fiscal Specialist"
There are only three fields here, and we'd like to get output like this:
105
cn=Ken Myer,ou=Accounting,ou=North America,dc=fabrikam,dc=com
Fiscal Specialist
Instead, we get output like this:
105
"cn=Ken Myer
ou=Accounting
ou=North America
dc=fabrikam
dc=com"
"Fiscal Specialist"
Eep! Not only did Split get "fooled" by the commas in the distinguished name, but it left the double-quote marks in as well. But look what happens when we use ADO to query that text file:
105
cn=Ken Myer,ou=Accounting,ou=North America,dc=fabrikam,dc=com
Fiscal Specialist
Now that's more like it! By design, ADO can seamlessly handle double-quotes around individual fields and individual values; Split cannot. (Are there workarounds for this? Sure. But you're much better off relying on ADO in cases like this.)
So How Do I Use ADO to Query a Text File?
We're not going to spend any time talking about ADO in this column; if you need a background in using ADO, you might want to take a look at our webcast, Database Scripting for System Administrators. Instead, we'll just show you a sample script that uses ADO techniques to search a text file, pointing out that:
- You need to define the constant adCmdText. This is a special text-file-only constant you must use in addition to adOpenStatic and adLockOptimistic. (The latter two, by the way, are constants you use in most ADO scripts.)
- You need to specify the name of the folder where the next file is stored. Note that you must use a trailing \ in the folder name. In the sample script below, the path is C:\Databases\. If your file was in C:\Windows\Logs, the path would be C:\Windows\Logs\.
- Add the ExtendedProperties parameter as shown below. (Depending on the nature of your file, you might have to make minor modifications to this parameter. That's something we'll explain momentarily.) In particular, you need to indicate that this is a text file, you need to specify whether or not the file has a header row, and you need to tell ADO whether this is a delimited or fixed-width file.
- In your Microsoft® SQL query, specify the name of the text file you want to work with (in this example, PhoneList.csv). Don't specify the entire path name; remember, we've already indicated the folder where the file is stored. If you're used to working with SQL queries, you put the file name in the spot where you would typically put the table name.
After you've taken care of the above, your text file script is no different than any other ADO script. Now you just create a Do Until loop and start looping through the records, echoing the appropriate field values. Below is what our sample script looks like. All you really have to worry about are the items in boldface and, of course, using field names that match the field names in your text file. The rest is all boilerplate.
On Error Resume Next
Const adOpenStatic = 3
Const adLockOptimistic = 3
Const adCmdText = &H0001
Set objConnection = CreateObject("ADODB.Connection")
Set objRecordSet = CreateObject("ADODB.Recordset")
strPathtoTextFile = "C:\Databases\"
objConnection.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=" & strPathtoTextFile & ";" & _
"Extended Properties=""text;HDR=YES;FMT=Delimited"""
objRecordset.Open "SELECT * FROM PhoneList.csv", _
objConnection, adOpenStatic, adLockOptimistic, adCmdText
Do Until objRecordset.EOF
Wscript.Echo "Name: " & objRecordset.Fields.Item("Name")
Wscript.Echo "Department: " & _
objRecordset.Fields.Item("Department")
Wscript.Echo "Extension: " & objRecordset.Fields.Item("Extension")
objRecordset.MoveNext
Loop
Note You'll notice that we keeping talking about reading text files, and never say anything about writing to text files. Why? Well, unfortunately, the ODBC Text Driver (which is what we're using here) can only read from files; it can't write to them. To write to a text file, you'll have to use the FileSystemObject or something similar.
I Dozed Off; What Did You Say Header Rows Were?
A header row simply means that the first row in the text file is a list of fields, with all subsequent rows containing the actual data. For example:
LastName,FirstName,MiddleInitial
Myer,Ken,W
Poe, Deborah,L
Most CSV files have header rows, but not all of them; sometimes you'll see data files that look like this:
Myer,Ken,W
Poe, Deborah,L
That presents a problem: How can you specify the items to echo if none of the items have a name?
The easiest way to do this is to create a Schema.ini file, something we'll discuss in more detail in a minute or two (depends on how fast you read, I guess). In this file, you define the columns (fields) using syntax similar to this:
Col1=LastName Text
Col2=FirstName Text
Col3=MiddleInitial Text
Notice that you need to indicate the type of data contained in the field. In this case, all three fields contain text (string) data. Other data types you can specify in a Schema.ini file include:
- Short
- Long
- Currency
- Single
- Double
- DateTime
- Memo
After you create the Schema.ini file, make sure you indicate in your script that the file does not use a header row. To do that, just set the HDR parameter to No, like this:
objConnection.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=" & strPathtoTextFile & ";" & _
"Extended Properties=""text;HDR=No;FMT=Delimited"""
Wait a Second; I Have a Tab-Delimited File. What Do I Do Now?
Although the comma is probably the most widely-used delimiter, it's far from the only one. Many log files use a tab to separate fields; Microsoft® Internet Information Services (IIS) actually uses a space-delimited file, in which items are separated by blank spaces, as shown below:
2004-01-05 04:37:38 W3SVC802775961 192.168.195.28 GET /personal/index.html
2004-01-05 04:37:38 W3SVC802775961 192.168.195.28 GET /personal/navcond.js
Is that a problem? No, not at all. In fact, the ODBC Text Driver supports the following delimiter formats:
Format | Description | Schema.ini Syntax |
---|---|---|
Tab Delimited | Fields in the file are separated by tabs:
|
Format = TabDelimited |
CSV Delimited | Fields in the file are separated by commas (note that there should not be a space between the comma and the start of the next field name or value):
|
Format = CSVDelimited |
Custom Delimited | Fields in the file are separated by some character other than a tab or a comma (with one exception: you can't use the double-quote as a delimiter). For example, in this file the asterisk is used as the delimiter:
|
Format = Delimited (x) where x represents the character used as the delimiter. For example, if you are using the asterisk (*) as the delimiter, your Format statement would look like this:
Format = Delimited(*) |
Fixed-Length | Fields in a file take up a specific number of characters. If a value is too long, "extra" characters are chopped off the end. If a value is too short, blank spaces are appended to it to make it fill out the requisite number of characters.
|
Format=FixedLength |
So how do you access a text file that uses something other than a comma as a delimiter? You actually have two options here: You can either modify the registry prior to running your script, or you can create a Schema.ini file.
How Can I Modify the Registry?
If no Schema.ini file exists, then ADO uses the following registry value to determine how to parse a text file:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Text\Format
By default, the Format value is set to CSVDelimited; thus ADO assumes that every text file uses a comma as a delimiter. But what if you have a text file that uses a tab as the delimiter? No problem. If you check the preceding table, you'll see that the official syntax for a tab-delimited file is TabDelimited. If you change the value of Format from CSVDelimited to TabDelimited, then ADO will treat everything as a tab-delimited text file.
**Note **What happens if the registry is set to CSVDelimited, but you try to open a tab-delimited text file anyway? Well, because ADO can't determine the names of the fields (it's expecting to see names separated by commas), you'll typically get the error message, "Item cannot be found in the collection corresponding to the requested name or ordinal."
Admittedly, opening the Registry, changing the value of Format, running the script, re-opening the Registry, and then changing the value back again can be a bit tedious, to say the least. But remember, you can change the registry from within your script. For example, say you'd like to keep the default value as CSVDelimited, but you have a script that needs to read a tab-delimited file. That's okay; start the script by including a few lines of code to change the value of Format to tab-delimited:
Const HKEY_LOCAL_MACHINE = &H80000002
strComputer = "."
Set objReg=GetObject("winmgmts:\\" & strComputer & _
"\root\default:StdRegProv")
strKeyPath = "SOFTWARE\Microsoft\Jet\4.0\Engines\Text"
strValueName = "Format"
strValue = "TabDelimited"
objReg.SetStringValue HKEY_LOCAL_MACHINE,strKeyPath,strValueName,strValue
Now, have the script open the file and do whatever it needs to do. When you're done with the file, tack on a few lines of code to change Format back to CSVDelimited:
Const HKEY_LOCAL_MACHINE = &H80000002
strComputer = "."
Set objReg=GetObject("winmgmts:\\" & strComputer & _
"\root\default:StdRegProv")
strKeyPath = "SOFTWARE\Microsoft\Jet\4.0\Engines\Text"
strValueName = "Format"
strValue = "CSVDelimited"
objReg.SetStringValue _
HKEY_LOCAL_MACHINE,strKeyPath,strValueName,strValue
Pretty easy, huh?
Creating a Schema.ini File
If you'd rather not mess around with the Registry, you have the option of creating a Schema.ini file. (This, by the way, is optional only when using a delimited file; if you have a fixed-length text file, you must use a Schema.ini file.) Schema.ini is simply a text file that tells ADO how to deal with your data file. With a delimited file, this often involves nothing more than telling ADO what the delimiter is. For example, suppose you have a log file named MyLog.txt, and fields within that file are separated using tabs. In a case like that, your Schema.ini file doesn't have to be any more complicated than this:
[MyLog.txt]
Format=TabDelimited
That's it. Just the name of the file (enclosed in square brackets) and the format type. Notice that we only specified the file name and not the full path to the file. Why? Well, that's the one catch to Schema.ini: It must be in the same folder as your text file. (If your text file is in C:\Databases, then Schema.ini has to be in C:\Databases as well.) Of course, that leads to an immediate dilemma: What if you have two or more text files in the same folder? For that matter, can you have two or more text files in the same folder?
Fortunately the answer is yes. All you have to do is specify the information for each text file in Schema.ini. For example, suppose you have three text files in a folder, and all three files use different delimiters. In that case, you might have a Schema.ini file that looks like this:
[File_1.txt]
Format=CSVDelimited
[File_2.txt]
Format=TabDelimited
[File_3.txt]
Format=Delimited( )
**Note **What do you mean we forgot to put something inside the parentheses for that last entry? Believe it or not there really is something there: a blank space (hit the spacebar on the keyboard). This is how you specify a file, like the IIS logs, that uses a blank space as a delimiter.
Incidentally, the Schema.ini file overrides the registry settings. If the Format value in the registry is set to CSVDelimited but Schema.ini specifies TabDelimited, ADO will attempt to open the file in tab-delimited format. In other words, if you choose to use Schema.ini files, you don't have to worry about how the registry has been configured; after all, ADO is going to ignore that registry value anyway.
Working with Fixed-Length Files
Delimited files are so easy to work with that you might think this is the standard format used by all log files. Unfortunately, it's not. (Although we highly recommend it as the format you should use when creating log files.) A number of log files also use the fixed-length format, a format in which each field is allocated a certain number of characters.
Suppose you have a field called FirstName, and you allot it 7 characters. What happens if you have a first name of Reginald? Too bad; because the field is allocated only 7 characters, the final d in Reginald will be chopped off, and the first name will be stored as Reginal. And what if you have a first name of Tom? Tom only takes up three characters, but 7 characters have been set aside for the first name. Because of that, Tom will have four blank spaces appended to it, making the "official" first name Tom---- (with the four hyphens representing blank spaces).
**Note **Do those four blank spaces really matter? You bet. Tom with no blank spaces is not the same as Tom----. When working with fixed-length files, you might have to use the RTrim function to chop off any blank spaces tacked on to the end of a string.
Let's take a gander at what a fixed-length file looks like under the covers. Here we have a simple file with three fields:
- FirstName, which takes up spaces 1-7.
- LastName, which takes up spaces 8-17.
- ID, which takes up spaces 18-20.
In this illustration, the grey boxes represent spaces that are added to the end of a value in order to fill up the requisite field length. Because Ken only has three characters, we had to append 4 blank spaces (the four grey boxes) in order to make it 7 characters long.
Figure 1. Sample fixed-length text file
So how can you use ADO to parse a fixed-length text file? That's where our old pal Schema.ini comes in. To work with a fixed-length text file, you need to do two things. First, specify the fixed-length format:
[PhoneList.txt]
Format=FixedLength
Second, indicate the names and the length (in characters) of each field. For example, in our sample text file shown above, we have the fields:
- FirstName, which takes up 7 characters.
- LastName, which takes up 10 characters.
- ID, which takes up three characters.
Consequently, our complete Schema.ini entry would look like this:
[Test.txt]
Format=FixedLength
Col1=FirstName Text Width 7
Col2=LastName Text Width 10
Col3=ID Text Integer 3
And what would our script look like? Well, it wouldn't look much different than a script that uses ADO to parse a CSV file. In fact, as shown in boldface below, the only difference is that we use the parameter FMT=FixedLength rather than FMT=Delimited.
On Error Resume Next
Const adOpenStatic = 3
Const adLockOptimistic = 3
Const adCmdText = &H0001
Set objConnection = CreateObject("ADODB.Connection")
Set objRecordSet = CreateObject("ADODB.Recordset")
strPathtoTextFile = "C:\Databases\"
objConnection.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=" & strPathtoTextFile & ";" & _
"Extended Properties=""text;HDR=YES;FMT=FixedLength"""
objRecordset.Open "SELECT * FROM PhoneList.txt", _
objConnection, adOpenStatic, adLockOptimistic, adCmdText
Do Until objRecordset.EOF
Wscript.Echo "Name: " & objRecordset.Fields.Item("FirstName")
Wscript.Echo "Department: " & objRecordset.Fields.Item("LastName")
Wscript.Echo "Extension: " & objRecordset.Fields.Item("ID")
objRecordset.MoveNext
Loop
Could you do this with the FileSystemObject? Yes, but you would need to write code that chops each string into the requisite pieces. That's not an impossible task, but the preceding method is much easier, if for no other reason that so much of it is boilerplate code.
Working with Custom File Extensions
If you use ADO techniques to read a file with one of these file extensions, your script should work just fine:
- .csv
- .tab
- .asc.
- .tmp
- .htm.
- .html
That's great, but what happens if you try to open a file with a different file extension, say, a file named Directory_service.log? In that case, you're probably going to get an error message like this:
Microsoft JET Database Engine: Cannot update. Database or object is read-only.
To be honest, this is a very misleading error message. The problem is not that the database or object is read-only; the problem is that the file extension (.log) is not automatically recognized by the Jet database provider. Don't spend your time (like I, er, I mean, like one of the other Scripting Guys did) setting and re-setting the read-only attributes of the file in question. Instead, start up Regedit, and find this value:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Text\
DisabledExtensions
Most likely DisabledExtensions will be set to this:
!txt,csv,tab,asc,tmp,htm,html
Double-click DisabledExtensions, and add a comma and the word log to the end of the value, like so:
!txt,csv,tab,asc,tmp,htm,html**,log**
Click OK and re-run your script. The Jet database provider should now recognize .log as a valid file extension for text databases.
Why is the registry value named DisabledExtensions? And why does the list start with an exclamation mark (!)? Well, I don't know why they chose to go this route. However, in database coding the exclamation mark means Not. You actually read this as: "What are the Disabled Extensions? Not txt, csv, tab, asc, and so on." In other words, the extensions listed here are not disabled; everything else is. I suppose if you took the exclamation mark out, everything would be disabled except the ones listed. But I haven't tried that, and I probably never will. (My computer blows up on me enough without me monkeying around with stuff like that.)
If you want to read data from a file with a .xyz file extension, just repeat the process and add xyz to the list of file extensions stored in DisabledExtensions.
That Was Cool. What Comes Next?
Well, that's up to you I suppose. (Silence is golden, remember?) If you're well-versed in ADO, you might start poking around your hard drive, looking for log files and other text files you might want to read using these techniques. If you're a newcomer to ADO, you might want to watch the Database Scripting for System Administrators**webcast. And if you have questions or comments about this column, send them to scripter@microsoft.com.
Well, unless you're trying to get your hands on the Active Directory browser. Although what we could do is... No, never mind. Silence is golden, silence is golden, silence is ...
Scripting Clinic
Greg Stemp has long been acknowledged as one of the country's foremost authorities on scripting, and has been widely acclaimed as a world-class... huh? Well, how come they let football coaches make up stuff on their resumes? Really? He got fired? Oh, all right. Greg Stemp works at... Oh, come on now, can't I even say that? Fine. Greg Stemp gets paid by Microsoft, where he tenuously holds the title of lead writer for the System Administration Scripting Guide.