Binary Files
In a sense, all files are "binary" in that they are just a collection of bytes stored in an operating system construct called a file. However, when we talk about binary files, we are really referring to the way VB opens and processes the file.
The other file types (sequential and random) have a definite structure, and the language has built-in mechanisms to read and write these files based on that structure. For example, the Input # statement reads a sequential comma-delimited file field by field, and the Line Input statement reads a sequential file line by line.
On the other hand, it is necessary to process a file in binary mode when that file does not have a simple line-based or record-based structure. For example, an Excel "xls" file contains a series of complex data structures to manage worksheets, formulas, charts, etc. If you really wanted to process an "xls" file at a very low level, you could open the file in binary mode and move to certain byte locations within the file to access data contained in the various internal data structures.
Fortunately, in the case of Excel, Microsoft provides us with the Excel object model, which makes it a relatively simple matter to process xls files in VB applications. But the concept should be clear: to process a file that does not contain simple line-oriented or record-oriented data, the binary mode needs to be used and you must traverse or parse through the file to get at the data that you need.
The Open Statement
We have seen partial syntax for the Open statement in the first topic on sequential files. The full syntax for the Open statement, taken from MSDN, is:
Open pathname For mode [Access access] [lock] As [#]filenumber [Len=reclength]
The Open statement syntax has these parts:
Part | Description
pathname | Required. String expression that specifies a file name; may include directory or folder, and drive.
mode | Required. Keyword specifying the file mode: Append, Binary, Input, Output, or Random. If unspecified, the file is opened for Random access.
access | Optional. Keyword specifying the operations permitted on the open file: Read, Write, or Read Write.
lock | Optional. Keyword specifying the operations restricted on the open file by other processes: Shared, Lock Read, Lock Write, and Lock Read Write.
filenumber | Required. A valid file number in the range 1 to 511, inclusive. Use the FreeFile function to obtain the next available file number.
reclength | Optional. Number less than or equal to 32,767 (bytes). For files opened for random access, this value is the record length. For sequential files, this value is the number of characters buffered.
Remarks
You must open a file before any I/O operation can be performed on it. Open allocates a buffer for I/O to the file and determines the mode of access to use with the buffer.
If the file specified by pathname doesn't exist, it is created when a file is opened for Append, Binary, Output, or Random modes.
If the file is already opened by another process and the specified type of access is not allowed, the Open operation fails and an error occurs.
The Len clause is ignored if mode is Binary.
Important: In Binary, Input, and Random modes, you can open a file using a different file number without first closing the file. In Append and Output modes, you must close a file before opening it with a different file number.
(End of MSDN definition)
Given the information above, we would not use the optional Len clause when opening a file in binary mode, as it does not apply. In the sample programs to follow, the optional lock entry is not used either.
Thus, the sample programs use the following syntax to open a binary file for input:
Open filename For Binary Access Read As #filenumber
and to open a binary file for output:
Open filename For Binary Access Write As #filenumber
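As a minimal sketch of the two Open forms above, the procedure below opens one file for binary input and another for binary output, then closes both. The procedure name and both file paths are hypothetical, chosen only for illustration. Note that FreeFile is called a second time only after the first Open has executed, so that the two files receive different file numbers.

```vb
' Hypothetical sketch: open an input file and an output file in binary mode.
Private Sub OpenBinaryFilesSketch()
    Dim intInFile As Integer
    Dim intOutFile As Integer

    ' Ask VB for an unused file number, then open the input file for reading.
    intInFile = FreeFile
    Open "C:\Temp\Input.dat" For Binary Access Read As #intInFile    ' hypothetical path

    ' The first number is now in use, so this FreeFile call returns a different one.
    intOutFile = FreeFile
    Open "C:\Temp\Output.dat" For Binary Access Write As #intOutFile ' hypothetical path

    ' ... Get and Put statements would go here ...

    ' Always close both files when finished.
    Close #intInFile
    Close #intOutFile
End Sub
```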
The Get Statement
The Get statement is used to read data from a file opened in binary mode. The syntax, as it applies to binary files, is:
Get [#]filenumber, [byte position], varname
The filenumber is any valid filenumber as defined above.
Byte position is the byte position within the file at which the reading begins. The byte position is "one-based", meaning the first byte position in the file is 1, the second position is 2, and so on. You can omit this entry, in which case the next byte following the last Get or Put statement is read. If you omit the byte position entry, you must still include the delimiting commas in the Get statement, for example:
Get #intMyFile, , strData
Varname is the variable into which the data will be read; in this topic it is always a string variable, often referred to as a "buffer" when processing binary files. The length of this string variable determines how many bytes of data will be read from the file, so you must set its length before issuing the Get statement. This is commonly done by using the String$ function to fill the string variable with a number of blank spaces equal to the number of bytes you want to read at a given time.
For example, the following statement pads the string variable strData with 10,000 blank spaces:
strData = String$(10000, " ")
Now that VB "knows" how big "strData" is, the following Get statement will read the first (or next) 10,000 bytes from file number "intMyFile" and overlay strData with that file data:
Get #intMyFile, , strData
Depending on the application, it is sometimes necessary to process the file in "chunks". Recall that you can omit the "byte position" entry, in which case VB will "keep track" of where it is in the file. For example, the first time the above Get statement is executed, bytes 1 through 10000 will be read; the second time the above Get statement is executed, bytes 10001 through 20000 will be read; and so on.
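To make the position tracking concrete, here is a small sketch that assumes a hypothetical file of at least 8 bytes. It reads two consecutive 4-byte pieces without a byte position, then supplies an explicit byte position to jump back to the start of the file. The Seek function, which reports the position at which the next read or write will take place, is a standard VB function that is not otherwise used in this topic.

```vb
' Hypothetical sketch: implicit versus explicit byte positions with Get.
Private Sub BytePositionSketch()
    Dim intFile As Integer
    Dim strPiece As String

    intFile = FreeFile
    Open "C:\Temp\Sample.dat" For Binary Access Read As #intFile   ' hypothetical path

    ' A 4-character buffer means each Get reads 4 bytes.
    strPiece = String$(4, " ")

    Get #intFile, , strPiece        ' reads bytes 1 through 4
    Get #intFile, , strPiece        ' reads bytes 5 through 8
    Debug.Print Seek(intFile)       ' position of the next read: 9

    Get #intFile, 1, strPiece       ' explicit position: reads bytes 1 through 4 again
    Debug.Print Seek(intFile)       ' position of the next read: 5

    Close #intFile
End Sub
```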
Because a VB string variable can hold approximately 2 billion characters (subject to available memory), it is reasonable in most cases to read the whole file in "one shot" rather than in "chunks" as described above. To do this, set the length of the "buffer" string variable to the size of the file by using the LOF (length of file) function as the first argument of the String$ function. The LOF function takes the file number of an open file as its argument and returns the length of that file in bytes. Thus, the following statement fills the variable "strData" with a number of blank spaces equal to the size of the file:
strData = String$(LOF(intMyFile), " ")
Then, when the subsequent Get statement is executed, the entire contents of the file will be stored in strData:
Get #intMyFile, , strData
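The "one shot" technique can be wrapped in a small reusable function. The sketch below is one possible arrangement, not part of the sample programs; the function name ReadWholeFile is hypothetical, and it assumes the caller passes the path of an existing file.

```vb
' Hypothetical helper: returns the entire contents of a file as a string.
Private Function ReadWholeFile(ByVal strFileName As String) As String
    Dim intFile As Integer
    Dim strData As String

    intFile = FreeFile
    Open strFileName For Binary Access Read As #intFile

    ' Only read if the file has content; an empty file yields an empty string.
    If LOF(intFile) > 0 Then
        ' Size the buffer to the file length, then fill it with a single Get.
        strData = String$(LOF(intFile), " ")
        Get #intFile, , strData
    End If

    Close #intFile
    ReadWholeFile = strData
End Function
```

A caller could then write, for example, strHTML = ReadWholeFile(App.Path & "\Files_Lesson1.htm").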
The Input Function
The Input function (not to be confused with the Input # or Line Input statements) can be used as an alternative to the Get statement. The syntax is:
varname = Input(number, [#] filenumber)
where varname is the string variable into which the file data will be stored, number is the number of characters to be read, and filenumber is a valid filenumber identifying the file from which you want to read.
The following table contains examples that contrast the Get statement and Input function as ways of reading data from a binary file:
String Setup and Get Statement | Input Function
strData = String$(10000, " ") : Get #intMyFile, , strData | strData = Input(10000, #intMyFile)
strData = String$(LOF(intMyFile), " ") : Get #intMyFile, , strData | strData = Input(LOF(intMyFile), #intMyFile)
The Put Statement
The Put statement is used to write data to a file opened in binary mode. The syntax, as it applies to binary files, is:
Put [#]filenumber, [byte position], varname
The filenumber is any valid filenumber as defined above.
Byte position is the byte position within the file at which the writing begins. The byte position is "one-based", meaning the first byte position in the file is 1, the second position is 2, and so on. You can omit this entry, in which case the next byte following the last Get or Put statement is written. If you omit the byte position entry, you must still include the delimiting commas in the Put statement, for example:
Put #intMyFile, , strData
Varname is a string variable from which the data will be written. This string variable is often referred to as a "buffer" when processing binary files. It is important to note that the length, or size, of this string variable determines how many bytes of data will be written to the file.
For example, the following statements cause 1 byte of data to be written to file number "intMyFile":
strCharacter = Mid$(strData, lngCurrentPos, 1)
Put #intMyFile, , strCharacter
Recall that you can omit the "byte position" entry, in which case VB will "keep track" of where it is in the file. For example, the first time the above Put statement is executed, byte 1 will be written; the second time the above Put statement is executed, byte 2 will be written; and so on.
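The sketch below illustrates both forms of Put: two writes that rely on VB's position tracking, followed by one write at an explicit byte position. The procedure name WriteGreeting is hypothetical. It assumes the target file does not already exist, because opening a file For Binary does not clear the contents of an existing file.

```vb
' Hypothetical sketch: writes a short greeting to a new file using Put.
' Assumes strFileName does not already exist.
Private Sub WriteGreeting(ByVal strFileName As String)
    Dim intFile As Integer
    Dim strData As String

    intFile = FreeFile
    Open strFileName For Binary Access Write As #intFile

    strData = "Hello, "
    Put #intFile, , strData     ' 7 characters: writes bytes 1 through 7

    strData = "world"
    Put #intFile, , strData     ' 5 characters: writes bytes 8 through 12

    strData = "J"
    Put #intFile, 1, strData    ' explicit position: overwrites byte 1 only

    Close #intFile              ' the file now contains "Jello, world"
End Sub
```

Because the length of the string determines the number of bytes written, the last Put replaces a single byte and leaves the other eleven untouched.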
Sample Programs
Three sample "Try It" programs will now be presented, using the statements and functions described above. All three read in the same input file and write out the same output file; the difference is in how the input file is read. The first sample program uses the Get statement to process the file in "chunks", the second uses the Get statement to process the file all at once, and the third uses the Input function to process the file all at once.
The job of the sample programs is to read in an HTML file, strip out all tags (i.e., everything between the "less than" and "greater than" angle brackets as well as the brackets themselves), and write out the remaining text.
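The tag-stripping rule just described can be sketched as a standalone function that works on a string already in memory. This is only an illustration of the logic: the function name StripTags is hypothetical, and unlike the sample programs, which write each character to the output file with Put, it accumulates the result in a string.

```vb
' Hypothetical helper: returns strHTML with everything between "<" and ">"
' (and the brackets themselves) removed.
Private Function StripTags(ByVal strHTML As String) As String
    Dim strResult As String
    Dim strChar As String
    Dim blnInTag As Boolean
    Dim lngX As Long

    For lngX = 1 To Len(strHTML)
        strChar = Mid$(strHTML, lngX, 1)
        Select Case strChar
            Case "<"
                blnInTag = True         ' a tag has started: stop copying
            Case ">"
                blnInTag = False        ' the tag has ended: resume copying
            Case Else
                If Not blnInTag Then
                    ' The character is outside of any tag, so keep it.
                    strResult = strResult & strChar
                End If
        End Select
    Next lngX

    StripTags = strResult
End Function
```

For example, StripTags("<p>Working with <b>Files</b></p>") returns "Working with Files". Repeated string concatenation can become slow on large inputs, which is one reason a program might prefer to write characters out as it goes.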
The excerpts below show both the HTML input file and the plain text output file. The text that is extracted (i.e., the "non-tag" data) is everything in the HTML excerpt that lies outside the angle brackets.
HTML Input File (excerpt):
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 10 (filtered)">
<title>Working with Files</title>
<style>
. . .
<p class=MsoNormal align=center style='text-align:center'><b><span style='font-size:12.0pt;font-family:Arial'>Working with Files – Part 1</span></b></p>
<p class=MsoNormal align=center style='text-align:center'><b><span style='font-size:12.0pt;font-family:Arial'>Sequential File Processing Statements and Functions</span></b></p>
<p class=MsoNormal align=center style='text-align:center'><b><span style='font-size:12.0pt;font-family:Arial'>Processing a Comma-Delimited File</span></b></p>
<p class=MsoNormal align=center style='text-align:center'><span style='font-size:12.0pt;font-family:Arial'> </span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:Arial'>Visual Basic provides the capability of processing three types of files:</span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:Arial'> </span></p>
<p class=MsoNormal style='margin-left:2.0in;text-indent:-1.5in'><b><span style='font-size:12.0pt;font-family:Arial'>sequential files </span></b><span style='font-size:12.0pt;font-family:Arial'>Files that must be read in the same order in which they were written – one after the other with no skipping around</span></p>
<p class=MsoNormal style='margin-left:2.0in;text-indent:-1.5in'><b><span style='font-size:12.0pt;font-family:Arial'> </span></b></p>
<p class=MsoNormal style='margin-left:2.0in;text-indent:-1.5in'><b><span style='font-size:12.0pt;font-family:Arial'>binary files </span></b><span style='font-size:12.0pt;font-family:Arial'>"unstructured" files which are read from or written to as series of bytes, where it is up to the programmer to specify the format of the file</span></p>
<p class=MsoNormal style='margin-left:.5in'><span style='font-size:12.0pt; font-family:Arial'> </span></p>
<p class=MsoNormal style='margin-left:1.0in;text-indent:-.5in'><b><span style='font-size:12.0pt;font-family:Arial'>random files </span></b><span style='font-size:12.0pt;font-family:Arial'>files which support "direct access" by record number</span></p>
. . .
Plain Text Output File (excerpt):
Working with Files
Working with Files – Part 1
Sequential File Processing Statements and Functions
Processing a Comma-Delimited File
Visual Basic provides the capability of processing three types of files:
sequential files Files that must be read in the same order in which they were written – one after the other with no skipping around
binary files "unstructured" files which are read from or written to as series of bytes, where it is up to the programmer to specify the format of the file
random files files which support "direct access" by record number
These three file types are "native" to Visual Basic and its predecessors (QBasic, GW-BASIC, etc.). The next several topics address VB's sequential file processing capabilities. Binary and Random files will be covered in later topics.
The following sequential file-related statements and functions will be discussed:
Open Prepares a file to be processed by the VB program.
App.Path Supplies the path of your application
FreeFile Supplies a file number that is not already in use
Input # Reads fields from a comma-delimited sequential file
. . .
Note: The sample programs use the Dir$ function and the Kill statement for the purpose of deleting the output file if it exists, prior to creating it anew. Dir$ and Kill are covered in the later topic of "File System Commands and Functions".
Sample Program 1 – Using the Get Statement to Read a Binary File In "Chunks"
The first sample program reads and processes a binary file one "chunk" at a time (in this case 10,000 bytes at a time) using the Get statement. Since the input file is a little over 60,000 bytes, it takes seven passes to read through the file. The code listed below is heavily commented to aid in the understanding of how the program works.
"Try It" Code:
Private Sub cmdTryIt_Click()
    Dim strHTMFileName As String
    Dim strTextFileName As String
    Dim strBackSlash As String
    Dim intHTMFileNbr As Integer
    Dim intTextFileNbr As Integer
    Dim strBuffer As String
    Dim strCurrentChar As String * 1
    Dim blnTagPending As Boolean
    Dim lngX As Long
    Dim lngBytesRemaining As Long
    Dim lngCurrentBufferSize As Long
    Const lngMAX_BUFFER_SIZE As Long = 10000
    ' Prepare the file names ...
    strBackSlash = IIf(Right$(App.Path, 1) = "\", "", "\")
    strHTMFileName = App.Path & strBackSlash & "Files_Lesson1.htm"
    strTextFileName = App.Path & strBackSlash & "TestOut.txt"
    Print "Opening files ..."
    ' Open the input file ...
    intHTMFileNbr = FreeFile
    Open strHTMFileName For Binary Access Read As #intHTMFileNbr
    ' If the file we want to open for output already exists, delete it ...
    If Dir$(strTextFileName) <> "" Then
        Kill strTextFileName
    End If
    ' Open the output file ...
    intTextFileNbr = FreeFile
    Open strTextFileName For Binary Access Write As #intTextFileNbr
    ' Initialize the "bytes remaining" variable to the length of the input file ...
    lngBytesRemaining = LOF(intHTMFileNbr)
    ' Set up a loop which will process the file in "chunks" of 10,000 bytes at a time.
    ' We will keep track of how many bytes we have remaining to process, and
    ' the loop will continue as long as there are bytes remaining.
    Do While lngBytesRemaining > 0
        Print "Processing 'chunk' ..."
        ' Note: The "buffer" is simply a string variable into which the "current
        ' chunk" of the file will be read.
        ' Set the current buffer size to the maximum size (10,000) as long as
        ' there are at least 10,000 bytes remaining. If there are fewer (as
        ' there would be the last time through the loop), set the buffer size
        ' equal to the number of bytes remaining.
        If lngBytesRemaining >= lngMAX_BUFFER_SIZE Then
            lngCurrentBufferSize = lngMAX_BUFFER_SIZE
        Else
            lngCurrentBufferSize = lngBytesRemaining
        End If
        ' Because the Get statement relies on the size of the string variable (the
        ' "buffer") into which the data will be read to know how many bytes to read
        ' from the file, we fill the buffer string variable with a number of blank
        ' spaces - where the number of blank spaces was determined in the statement
        ' above.
        strBuffer = String$(lngCurrentBufferSize, " ")
        ' The Get statement now reads the next chunk of data from the input file
        ' and stores it in the strBuffer variable.
        Get #intHTMFileNbr, , strBuffer
        ' The For loop below now processes the current chunk of data character by
        ' character, writing out only the characters that are NOT enclosed in the
        ' HTML tags (i.e., it is skipping every character between a pair of angle
        ' brackets "<" and ">"). Because blnTagPending keeps its value between
        ' chunks, a tag that spans two chunks is still skipped correctly ...
        For lngX = 1 To lngCurrentBufferSize
            strCurrentChar = Mid$(strBuffer, lngX, 1)
            Select Case strCurrentChar
                Case "<"
                    blnTagPending = True
                Case ">"
                    blnTagPending = False
                Case Else
                    If Not blnTagPending Then
                        ' The current character is outside of the tag brackets, so
                        ' write it out ...
                        Put #intTextFileNbr, , strCurrentChar
                    End If
            End Select
        Next
        ' Adjust the "bytes remaining" variable by subtracting the current buffer size
        ' from it ...
        lngBytesRemaining = lngBytesRemaining - lngCurrentBufferSize
    Loop
    Print "Closing files ..."
    ' Close the input and output files ...
    Close #intHTMFileNbr
    Close #intTextFileNbr
    Print "Done."
End Sub
After the cmdTryIt_Click event procedure has run, the form should look like the screen shot below, and the output plain-text file should be present in the project directory.
Sample Program 2 – Using the Get Statement to Read a Binary File All At Once
The second sample program uses the technique of reading and processing a binary file all at once, using the Get statement in conjunction with the LOF function. The code listed below is heavily commented to aid in the understanding of how the program works.
"Try It" Code:
Private Sub cmdTryIt_Click()
    Dim strHTMFileName As String
    Dim strTextFileName As String
    Dim strBackSlash As String
    Dim intHTMFileNbr As Integer
    Dim intTextFileNbr As Integer
    Dim strBuffer As String
    Dim strCurrentChar As String * 1
    Dim lngX As Long
    Dim blnTagPending As Boolean
    ' Prepare the file names ...
    strBackSlash = IIf(Right$(App.Path, 1) = "\", "", "\")
    strHTMFileName = App.Path & strBackSlash & "Files_Lesson1.htm"
    strTextFileName = App.Path & strBackSlash & "TestOut.txt"
    Print "Opening files ..."
    ' Open the input file ...
    intHTMFileNbr = FreeFile
    Open strHTMFileName For Binary Access Read As #intHTMFileNbr
    ' If the file we want to open for output already exists, delete it ...
    If Dir$(strTextFileName) <> "" Then
        Kill strTextFileName
    End If
    ' Open the output file ...
    intTextFileNbr = FreeFile
    Open strTextFileName For Binary Access Write As #intTextFileNbr
    Print "Reading input file ..."
    ' Note: The "buffer" is simply a string variable into which the contents
    ' of the file will be read.
    ' Because the Get statement relies on the size of the string variable (the
    ' "buffer") into which the data will be read to know how many bytes to read
    ' from the file, we fill the buffer string variable with a number of blank
    ' spaces - where the number of blank spaces is equal to the size of the
    ' entire file (as determined by the LOF function) ...
    strBuffer = String$(LOF(intHTMFileNbr), " ")
    ' The Get statement now reads the entire contents of the input file
    ' and stores it in the strBuffer variable.
    Get #intHTMFileNbr, , strBuffer
    Print "Generating output file ..."
    ' The For loop below now processes the contents of the file character by
    ' character, writing out only the characters that are NOT enclosed in the
    ' HTML tags (i.e., it is skipping every character between a pair of angle
    ' brackets "<" and ">") ...
    For lngX = 1 To Len(strBuffer)
        strCurrentChar = Mid$(strBuffer, lngX, 1)
        Select Case strCurrentChar
            Case "<"
                blnTagPending = True
            Case ">"
                blnTagPending = False
            Case Else
                If Not blnTagPending Then
                    ' The current character is outside of the tags, so write it out ...
                    Put #intTextFileNbr, , strCurrentChar
                End If
        End Select
    Next
    Print "Closing files ..."
    ' Close the input and output files ...
    Close #intHTMFileNbr
    Close #intTextFileNbr
    Print "Done."
End Sub
After the cmdTryIt_Click event procedure has run, the form should look like the screen shot below, and the output plain-text file should be present in the project directory.
Sample Program 3 – Using the Input Function to Read a Binary File All At Once
The third sample program uses the technique of reading and processing a binary file all at once, using the Input function in conjunction with the LOF function. The code listed below is heavily commented to aid in the understanding of how the program works.
"Try It" Code:
Private Sub cmdTryIt_Click()
    Dim strHTMFileName As String
    Dim strTextFileName As String
    Dim strBackSlash As String
    Dim intHTMFileNbr As Integer
    Dim intTextFileNbr As Integer
    Dim strBuffer As String
    Dim strCurrentChar As String * 1
    Dim lngX As Long
    Dim blnTagPending As Boolean
    ' Prepare the file names ...
    strBackSlash = IIf(Right$(App.Path, 1) = "\", "", "\")
    strHTMFileName = App.Path & strBackSlash & "Files_Lesson1.htm"
    strTextFileName = App.Path & strBackSlash & "TestOut.txt"
    Print "Opening files ..."
    ' Open the input file ...
    intHTMFileNbr = FreeFile
    Open strHTMFileName For Binary Access Read As #intHTMFileNbr
    ' If the file we want to open for output already exists, delete it ...
    If Dir$(strTextFileName) <> "" Then
        Kill strTextFileName
    End If
    ' Open the output file ...
    intTextFileNbr = FreeFile
    Open strTextFileName For Binary Access Write As #intTextFileNbr
    Print "Reading input file ..."
    ' Note: The "buffer" is simply a string variable into which the contents
    ' of the file will be read.
    ' The Input function reads a number of bytes from a file. The first argument
    ' of the function specifies how many bytes to read, which in this case is
    ' the size of the entire file (as determined by the LOF function). The second
    ' argument specifies the file number of the file from which the data is to be
    ' read. The resulting data is stored in the "strBuffer" variable.
    strBuffer = Input(LOF(intHTMFileNbr), #intHTMFileNbr)
    Print "Generating output file ..."
    ' The For loop below now processes the contents of the file character by
    ' character, writing out only the characters that are NOT enclosed in the
    ' HTML tags (i.e., it is skipping every character between a pair of angle
    ' brackets "<" and ">") ...
    For lngX = 1 To Len(strBuffer)
        strCurrentChar = Mid$(strBuffer, lngX, 1)
        Select Case strCurrentChar
            Case "<"
                blnTagPending = True
            Case ">"
                blnTagPending = False
            Case Else
                If Not blnTagPending Then
                    ' The current character is outside of the tags, so write it out ...
                    Put #intTextFileNbr, , strCurrentChar
                End If
        End Select
    Next
    Print "Closing files ..."
    ' Close the input and output files ...
    Close #intHTMFileNbr
    Close #intTextFileNbr
    Print "Done."
End Sub
After the cmdTryIt_Click event procedure has run, the form should look like the screen shot below, and the output plain-text file should be present in the project directory.