Binary Files

 

In a sense, all files are "binary" in that they are just a collection of bytes stored in an operating system construct called a file. However, when we talk about binary files, we are really referring to the way VB opens and processes the file.

 

The other file types (sequential and random) have a definite structure, and the language has built-in mechanisms to read and write these files based on that structure. For example, the Input # statement reads a sequential comma-delimited file field by field, and the Line Input statement reads a sequential file line by line.

 

On the other hand, it is necessary to process a file in binary mode when that file does not have a simple line-based or record-based structure. For example, an Excel "xls" file contains a series of complex data structures to manage worksheets, formulas, charts, etc. If you really wanted to process an "xls" file at a very low level, you could open the file in binary mode and move to certain byte locations within the file to access data contained in the various internal data structures.

 

Fortunately, in the case of Excel, Microsoft provides us with the Excel object model, which makes it a relatively simple matter to process xls files in VB applications. But the concept should be clear: to process a file that does not contain simple line-oriented or record-oriented data, the binary mode needs to be used and you must traverse or parse through the file to get at the data that you need.

 

The Open Statement

 

We have seen partial syntax for the Open statement in the first topic on sequential files. The full syntax for the Open statement, taken from MSDN, is:

Open pathname For mode [Access access] [lock] As [#]filenumber [Len=reclength]

The Open statement syntax has these parts:

Part            Description

pathname        Required. String expression that specifies a file name; may include directory or folder, and drive.

mode            Required. Keyword specifying the file mode: Append, Binary, Input, Output, or Random. If unspecified, the file is opened for Random access.

access          Optional. Keyword specifying the operations permitted on the open file: Read, Write, or Read Write.

lock            Optional. Keyword specifying the operations restricted on the open file by other processes: Shared, Lock Read, Lock Write, and Lock Read Write.

filenumber      Required. A valid file number in the range 1 to 511, inclusive. Use the FreeFile function to obtain the next available file number.

reclength       Optional. Number less than or equal to 32,767 (bytes). For files opened for random access, this value is the record length. For sequential files, this value is the number of characters buffered.

Remarks

You must open a file before any I/O operation can be performed on it. Open allocates a buffer for I/O to the file and determines the mode of access to use with the buffer.

If the file specified by pathname doesn't exist, it is created when a file is opened for Append, Binary, Output, or Random modes.

If the file is already opened by another process and the specified type of access is not allowed, the Open operation fails and an error occurs.

The Len clause is ignored if mode is Binary.

Important: In Binary, Input, and Random modes, you can open a file using a different file number without first closing the file. In Append and Output modes, you must close a file before opening it with a different file number.

(End of MSDN definition)

Given the information above, we would not use the optional Len clause when opening a file in binary mode, as it does not apply. In the sample programs to follow, the optional lock entry is not used either.

 

Thus, in the sample programs to follow, the following syntax will be used to open a binary file for input:

 

            Open filename  For Binary Access Read  As #filenumber

 

and to open a binary file for output:

 

            Open filename  For Binary Access Write As #filenumber
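As a minimal sketch of the input form in context, the procedure below obtains a file number with FreeFile, opens a file in binary mode for reading, reports its size, and closes it. The procedure name ShowFileSize and its strFileName parameter are hypothetical, chosen only for illustration.

```vb
' Hypothetical helper: opens a file in binary mode and prints its size.
Private Sub ShowFileSize(ByVal strFileName As String)

    Dim intFileNbr As Integer

    ' Ask VB for a file number that is not already in use ...
    intFileNbr = FreeFile

    ' Open the file in binary mode, for reading only ...
    Open strFileName For Binary Access Read As #intFileNbr

    ' LOF returns the length of the open file in bytes ...
    Debug.Print "File size in bytes: " & LOF(intFileNbr)

    ' Always close the file when finished with it ...
    Close #intFileNbr

End Sub
```

It could be called from any event procedure, for example with ShowFileSize App.Path & "\Files_Lesson1.htm".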

 

 

The Get Statement

 

The Get statement is used to read data from a file opened in binary mode. The syntax, as it applies to binary files, is:

 

Get [#]filenumber, [byte position], varname

 

The filenumber is any valid filenumber as defined above.

 

Byte position is the byte position within the file at which the reading begins. The byte position is "one-based", meaning the first byte position in the file is 1, the second position is 2, and so on. You can omit this entry, in which case the next byte following the last Get or Put statement is read. If you omit the byte position entry, you must still include the delimiting commas in the Get statement, for example:

 

      Get #intMyFile, , strData

 

Varname is the variable into which the data will be read; in the examples here it is always a string variable. This string variable is often referred to as a "buffer" when processing binary files. It is important to note that the length of this string variable determines how many bytes of data will be read from the file. Thus, it is necessary to set the length of the string variable before issuing the Get statement. This is commonly done by using the String$ function to fill the string variable with a number of blank spaces equal to the number of bytes you want to read at a given time.

 

For example, the following statement pads the string variable strData with 10,000 blank spaces:

 

      strData = String$(10000, " ")

 

Now that VB "knows" how big "strData" is, the following Get statement will read the first (or next) 10,000 bytes from file number "intMyFile" and overlay strData with that file data:

 

      Get #intMyFile, , strData

 

Depending on the application, it is sometimes necessary to process the file in "chunks". Recall that you can omit the "byte position" entry, in which case VB will "keep track" of where it is in the file. For example, the first time the above Get statement is executed, bytes 1 through 10000 will be read; the second time the above Get statement is executed, bytes 10001 through 20000 will be read; and so on.
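The following sketch contrasts an explicit byte position with an omitted one. The procedure name and the positions used are hypothetical, and the sketch assumes the file is at least 108 bytes long. The Seek function is used only to show where VB has positioned itself for the next read.

```vb
' Hypothetical helper: reads two 4-byte pieces from a file.
Private Sub ReadAtPosition(ByVal strFileName As String)

    Dim intMyFile As Integer
    Dim strData   As String

    intMyFile = FreeFile
    Open strFileName For Binary Access Read As #intMyFile

    ' A 4-character buffer means each Get reads 4 bytes ...
    strData = String$(4, " ")

    ' Explicit position: read bytes 101 through 104 ...
    Get #intMyFile, 101, strData
    Debug.Print strData

    ' Omitted position: continue with bytes 105 through 108 ...
    Get #intMyFile, , strData
    Debug.Print strData

    ' Seek reports the position the next Get or Put would use ...
    Debug.Print "Next position: " & Seek(intMyFile)

    Close #intMyFile

End Sub
```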

 

Because a VB variable-length string can hold roughly 2 billion characters, it is reasonable in most cases to read the whole file in "one shot" rather than in "chunks" as described above, provided the file fits comfortably in available memory. To do this, set the length of the "buffer" string variable to the size of the file by using the LOF (length of file) function as the first argument of the String$ function. The LOF function takes the filenumber of the file to be processed as its argument, and returns the length of the file in bytes. Thus, the following statement will fill the variable "strData" with a number of blank spaces equal to the size of the file:

 

      strData = String$(LOF(intMyFile), " ")

 

Then, when the subsequent Get statement is executed, the entire contents of the file will be stored in strData:

 

      Get #intMyFile, , strData
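Put together, the "one shot" technique might be wrapped in a small function such as the sketch below. The function name ReadWholeFile is hypothetical; it simply combines the Open, LOF, String$, and Get steps described above.

```vb
' Hypothetical helper: returns the entire contents of a file as a string.
Private Function ReadWholeFile(ByVal strFileName As String) As String

    Dim intFileNbr As Integer
    Dim strBuffer  As String

    intFileNbr = FreeFile
    Open strFileName For Binary Access Read As #intFileNbr

    ' Only read if the file actually contains data ...
    If LOF(intFileNbr) > 0 Then
        ' Size the buffer to the file length so Get reads every byte ...
        strBuffer = String$(LOF(intFileNbr), " ")
        Get #intFileNbr, , strBuffer
    End If

    Close #intFileNbr

    ReadWholeFile = strBuffer

End Function
```

A caller would then work with the returned string in memory, for example strData = ReadWholeFile(App.Path & "\Files_Lesson1.htm").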

 

 

The Input Function

 

The Input function (not to be confused with the Input # or Line Input statements) can be used as an alternative to the Get statement. The syntax is:

 

varname = Input(number, [#] filenumber)

 

where varname is the string variable into which the file data will be stored, number is the number of characters to be read, and filenumber is a valid filenumber identifying the file from which you want to read.

 

The following table contains examples that contrast the Get statement and Input function as ways of reading data from a binary file:

 

String Setup and Get Statement                Input Function

strData = String$(10000, " ")                 strData = Input(10000, #intMyFile)
Get #intMyFile, , strData

strData = String$(LOF(intMyFile), " ")        strData = Input(LOF(intMyFile), #intMyFile)
Get #intMyFile, , strData
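The Input function can also be used to read in "chunks". The sketch below is one way to do that; the procedure name and the 10,000-byte chunk size are illustrative assumptions. The number of bytes requested is capped at the number remaining, because asking the Input function for more data than the file holds raises an "Input past end of file" error.

```vb
' Hypothetical helper: reads a file in chunks with the Input function.
Private Sub ReadInChunksWithInput(ByVal strFileName As String)

    Const lngCHUNK_SIZE As Long = 10000

    Dim intMyFile    As Integer
    Dim strChunk     As String
    Dim lngRemaining As Long
    Dim lngThisRead  As Long

    intMyFile = FreeFile
    Open strFileName For Binary Access Read As #intMyFile

    lngRemaining = LOF(intMyFile)

    Do While lngRemaining > 0

        ' Never request more than what is left in the file ...
        If lngRemaining >= lngCHUNK_SIZE Then
            lngThisRead = lngCHUNK_SIZE
        Else
            lngThisRead = lngRemaining
        End If

        ' Input sizes the returned string itself; no String$ setup needed ...
        strChunk = Input(lngThisRead, #intMyFile)

        Debug.Print "Read " & Len(strChunk) & " characters"

        lngRemaining = lngRemaining - lngThisRead

    Loop

    Close #intMyFile

End Sub
```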

 

The Put Statement

 

The Put statement is used to write data to a file opened in binary mode. The syntax, as it applies to binary files, is:

 

Put [#]filenumber, [byte position], varname

 

The filenumber is any valid filenumber as defined above.

 

Byte position is the byte position within the file at which the writing begins. The byte position is "one-based", meaning the first byte position in the file is 1, the second position is 2, and so on. You can omit this entry, in which case the next byte following the last Get or Put statement is written. If you omit the byte position entry, you must still include the delimiting commas in the Put statement, for example:

 

      Put #intMyFile, , strData

 

Varname is the variable from which the data will be written; in the examples here it is always a string variable. This string variable is often referred to as a "buffer" when processing binary files. It is important to note that the length of this string variable determines how many bytes of data will be written to the file.

 

For example, the following statements cause 1 byte of data to be written to file number "intMyFile":

 

      strCharacter = Mid$(strData, lngCurrentPos, 1)

      Put #intMyFile, , strCharacter

 

Recall that you can omit the "byte position" entry, in which case VB will "keep track" of where it is in the file. For example, the first time the above Put statement is executed, byte 1 will be written; the second time the above Put statement is executed, byte 2 will be written; and so on.
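Put is not limited to one byte at a time: a longer string is written in a single statement, and an explicit byte position can be used to overwrite data already in the file. The sketch below illustrates both; the procedure name and the text written are hypothetical.

```vb
' Hypothetical helper: writes a string, then patches its first byte.
Private Sub WriteAndPatch(ByVal strFileName As String)

    Dim intOutFile As Integer
    Dim strText    As String
    Dim strPatch   As String

    intOutFile = FreeFile
    Open strFileName For Binary Access Write As #intOutFile

    ' Omitted position: writes Len(strText) bytes starting at byte 1 ...
    strText = "Hello, world"
    Put #intOutFile, , strText

    ' Explicit position: overwrites byte 1 only, leaving the rest intact.
    ' The first 12 bytes of the file now read "Jello, world" ...
    strPatch = "J"
    Put #intOutFile, 1, strPatch

    Close #intOutFile

End Sub
```

Opening an existing file in binary mode does not empty it, so any bytes beyond those written here would remain in the file. That is one reason to delete an old output file before creating it anew.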

 

Sample Programs

 

Three sample "Try It" programs will now be presented, using the statements and functions described above. All three read in the same input file and write out the same output file; the difference is in how the input file is read. The first sample program uses the Get statement to process the file in "chunks", the second uses the Get statement to process the file all at once, and the third uses the Input function to process the file all at once.

 

The job of the sample programs is to read in an HTML file, strip out all tags (i.e., everything between the "less than" and "greater than" angle brackets as well as the brackets themselves), and write out the remaining text.
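The tag-stripping rule itself can be sketched independently of any file handling. The function below is a hypothetical illustration, not one of the sample programs: it applies the same "skip everything between < and >" logic to a string already in memory and returns the remaining text.

```vb
' Hypothetical helper: returns strHTML with all <...> tags removed.
Private Function StripTags(ByVal strHTML As String) As String

    Dim strResult As String
    Dim strChar   As String
    Dim blnInTag  As Boolean
    Dim lngX      As Long
    Dim lngOutPos As Long

    ' The output can never be longer than the input, so allocate it once ...
    strResult = String$(Len(strHTML), " ")

    For lngX = 1 To Len(strHTML)
        strChar = Mid$(strHTML, lngX, 1)
        Select Case strChar
            Case "<"
                blnInTag = True
            Case ">"
                blnInTag = False
            Case Else
                If Not blnInTag Then
                    ' Outside of a tag: copy the character to the output ...
                    lngOutPos = lngOutPos + 1
                    Mid$(strResult, lngOutPos, 1) = strChar
                End If
        End Select
    Next

    ' Trim the unused portion of the preallocated output ...
    StripTags = Left$(strResult, lngOutPos)

End Function
```

For example, StripTags("<p>Hello <b>world</b></p>") returns "Hello world". Like the sample programs, this simple rule does not decode character entities or handle angle brackets that appear inside attribute values.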

 

The excerpts below show the HTML input file first, followed by the plain text output file produced from it. Note that character entities such as &nbsp; and &quot; are not tags, so they pass through to the output file unchanged.

 

HTML Input File (excerpt)

Plain Text Output File (excerpt)

<html>

 

<head>

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

<meta name=Generator content="Microsoft Word 10 (filtered)">

<title>Working with Files</title>

 

<style>

. . .

<p class=MsoNormal align=center style='text-align:center'><b><span

style='font-size:12.0pt;font-family:Arial'>Working with Files – Part 1</span></b></p>

 

<p class=MsoNormal align=center style='text-align:center'><b><span

style='font-size:12.0pt;font-family:Arial'>Sequential File Processing

Statements and Functions</span></b></p>

 

<p class=MsoNormal align=center style='text-align:center'><b><span

style='font-size:12.0pt;font-family:Arial'>Processing a Comma-Delimited File</span></b></p>

 

<p class=MsoNormal align=center style='text-align:center'><span

style='font-size:12.0pt;font-family:Arial'>&nbsp;</span></p>

 

<p class=MsoNormal><span style='font-size:12.0pt;font-family:Arial'>Visual

Basic provides the capability of processing three types of files:</span></p>

 

<p class=MsoNormal><span style='font-size:12.0pt;font-family:Arial'>&nbsp;</span></p>

 

<p class=MsoNormal style='margin-left:2.0in;text-indent:-1.5in'><b><span

style='font-size:12.0pt;font-family:Arial'>sequential files        </span></b><span

style='font-size:12.0pt;font-family:Arial'>Files that must be read in the same

order in which they were written – one after the other with no skipping around</span></p>

 

<p class=MsoNormal style='margin-left:2.0in;text-indent:-1.5in'><b><span

style='font-size:12.0pt;font-family:Arial'>&nbsp;</span></b></p>

 

<p class=MsoNormal style='margin-left:2.0in;text-indent:-1.5in'><b><span

style='font-size:12.0pt;font-family:Arial'>binary files               </span></b><span

style='font-size:12.0pt;font-family:Arial'>&quot;unstructured&quot; files which

are read from or written to as series of bytes, where it is up to the

programmer to specify the format of the file</span></p>

 

<p class=MsoNormal style='margin-left:.5in'><span style='font-size:12.0pt;

font-family:Arial'>&nbsp;</span></p>

 

<p class=MsoNormal style='margin-left:1.0in;text-indent:-.5in'><b><span

style='font-size:12.0pt;font-family:Arial'>random files             </span></b><span

style='font-size:12.0pt;font-family:Arial'>files which support &quot;direct

access&quot; by record number</span></p>

. . .

Plain Text Output File (excerpt)


Working with Files

 

 

 

 

 

 

 

 

 

 

 

Working with Files – Part 1

 

Sequential File Processing

Statements and Functions

 

Processing a Comma-Delimited File

 

&nbsp;

 

Visual

Basic provides the capability of processing three types of files:

 

&nbsp;

 

sequential files        Files that must be read in the same

order in which they were written – one after the other with no skipping around

 

&nbsp;

 

binary files               &quot;unstructured&quot; files which

are read from or written to as series of bytes, where it is up to the

programmer to specify the format of the file

 

&nbsp;

 

random files             files which support &quot;direct

access&quot; by record number

 

&nbsp;

 

These three

file types are &quot;native&quot; to Visual Basic and its predecessors (QBasic,

GW-BASIC, etc.).  The next several topics address VB's sequential file

processing capabilities. Binary and Random files will be covered in later

topics.

 

 

 

The

following sequential file-related statements and functions will be discussed:

 

&nbsp;

 

Open                          Prepares a file to be processed by the VB

program.

 

App.Path                   Supplies the path of your application

 

FreeFile                     Supplies a file number that is not

already in use

 

Input #                       Reads fields from a comma-delimited sequential

file

 

. . .

 

Note: The sample programs use the Dir$ function and the Kill statement for the purpose of deleting the output file if it exists, prior to creating it anew. Dir$ and Kill are covered in the later topic of "File System Commands and Functions".

 

Sample Program 1 – Using the Get Statement to Read a Binary File In "Chunks"

 

The first sample program reads and processes a binary file one "chunk" at a time (in this case 10,000 bytes at a time) using the Get statement. Since the file size is a little over 60,000 bytes, it takes seven passes to read through the file: six full chunks of 10,000 bytes and one final, smaller chunk. The code listed below is heavily commented to aid in the understanding of how the program works.

 

"Try It" Code:

 

Private Sub cmdTryIt_Click()

 

    Dim strHTMFileName         As String

    Dim strTextFileName        As String

    Dim strBackSlash           As String

    Dim intHTMFileNbr          As Integer

    Dim intTextFileNbr         As Integer

   

    Dim strBuffer              As String

    Dim strCurrentChar         As String * 1

    Dim blnTagPending          As Boolean

   

    Dim lngX                   As Long

    Dim lngBytesRemaining      As Long

    Dim lngCurrentBufferSize   As Long

    Const lngMAX_BUFFER_SIZE   As Long = 10000

   

    ' Prepare the file names ...

    strBackSlash = IIf(Right$(App.Path, 1) = "\", "", "\")

    strHTMFileName = App.Path & strBackSlash & "Files_Lesson1.htm"

    strTextFileName = App.Path & strBackSlash & "TestOut.txt"

   

    Print "Opening files ..."

   

    ' Open the input file ...

    intHTMFileNbr = FreeFile

    Open strHTMFileName For Binary Access Read As #intHTMFileNbr

   

    ' If the file we want to open for output already exists, delete it ...

    If Dir$(strTextFileName) <> "" Then

        Kill strTextFileName

    End If

    ' Open the output file ...

    intTextFileNbr = FreeFile

    Open strTextFileName For Binary Access Write As #intTextFileNbr

   

    ' Initialize the "bytes remaining" variable to the length of the input file ...

    lngBytesRemaining = LOF(intHTMFileNbr)

   

    ' Set up a loop which will process the file in "chunks" of 10,000 bytes at a time.

    ' We will keep track of how many bytes we have remaining to process, and

    ' the loop will continue as long as there are bytes remaining.

   

    Do While lngBytesRemaining > 0

   

        Print "Processing 'chunk' ..."

       

        ' Note: The "buffer" is simply a string variable into which the "current

        ' chunk" of the file will be read.

       

        ' Set the current buffer size to the maximum size (10,000) as long as

        ' there are at least 10,000 bytes remaining. If there are fewer (as

        ' there would be the last time through the loop), set the buffer size

        ' equal to the number of bytes remaining.

       

        If lngBytesRemaining >= lngMAX_BUFFER_SIZE Then

            lngCurrentBufferSize = lngMAX_BUFFER_SIZE

        Else

            lngCurrentBufferSize = lngBytesRemaining

        End If

       

        ' Because the Get statement relies on the size of the string variable (the

        ' "buffer") into which the data will be read to know how many bytes to read

        ' from the file, we fill the buffer string variable with a number of blank

        ' spaces - where the number of blank spaces was determined in the statement

        ' above.

       

        strBuffer = String$(lngCurrentBufferSize, " ")

       

        ' The Get statement now reads the next chunk of data from the input file

        ' and stores it in the strBuffer variable.

       

        Get #intHTMFileNbr, , strBuffer

       

        ' The For loop below now processes the current chunk of data character by

        ' character, writing out only the characters that are NOT enclosed in the

        ' HTML tags (i.e., it is skipping every character between a pair of angle

        ' brackets "<" and ">") ...

 

        For lngX = 1 To lngCurrentBufferSize

            strCurrentChar = Mid$(strBuffer, lngX, 1)

            Select Case strCurrentChar

                Case "<"

                    blnTagPending = True

                Case ">"

                    blnTagPending = False

                Case Else

                    If Not blnTagPending Then

                        ' The current character is outside of the tag brackets, so

                        ' write it out ...

                        Put #intTextFileNbr, , strCurrentChar

                    End If

            End Select

        Next

       

        ' Adjust the "bytes remaining" variable by subtracting the current buffer size

        ' from it ...

        lngBytesRemaining = lngBytesRemaining - lngCurrentBufferSize

       

    Loop

           

    Print "Closing files ..."

    

    ' Close the input and output files ...

    Close #intHTMFileNbr

    Close #intTextFileNbr

 

    Print "Done."

   

End Sub

 

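After the cmdTryIt_Click event procedure has run, the form should show the progress messages printed by the procedure ("Opening files ...", one "Processing 'chunk' ..." line per pass, "Closing files ...", and "Done."), and the output plain-text file should be present in the project directory.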

 

 


 

 

Sample Program 2 – Using the Get Statement to Read a Binary File All At Once

 

The second sample program uses the technique of reading and processing a binary file all at once, using the Get statement in conjunction with the LOF function. The code listed below is heavily commented to aid in the understanding of how the program works.

 

"Try It" Code:

 

Private Sub cmdTryIt_Click()

 

    Dim strHTMFileName         As String

    Dim strTextFileName        As String

    Dim strBackSlash           As String

    Dim intHTMFileNbr          As Integer

    Dim intTextFileNbr         As Integer

    Dim strBuffer              As String

    Dim strCurrentChar         As String * 1

    Dim lngX                   As Long

    Dim blnTagPending          As Boolean

   

    ' Prepare the file names ...

    strBackSlash = IIf(Right$(App.Path, 1) = "\", "", "\")

    strHTMFileName = App.Path & strBackSlash & "Files_Lesson1.htm"

    strTextFileName = App.Path & strBackSlash & "TestOut.txt"

   

    Print "Opening files ..."

   

    ' Open the input file ...

    intHTMFileNbr = FreeFile

    Open strHTMFileName For Binary Access Read As #intHTMFileNbr

   

    ' If the file we want to open for output already exists, delete it ...

    If Dir$(strTextFileName) <> "" Then

        Kill strTextFileName

    End If

    ' Open the output file ...

    intTextFileNbr = FreeFile

    Open strTextFileName For Binary Access Write As #intTextFileNbr

 

    Print "Reading input file ..."

 

    ' Note: The "buffer" is simply a string variable into which the entire

    ' contents of the file will be read.

 

    ' Because the Get statement relies on the size of the string variable (the

    ' "buffer") into which the data will be read to know how many bytes to read

    ' from the file, we fill the buffer string variable with a number of blank

    ' spaces - where the number of blank spaces is equal to the size of the

    ' entire file (as determined by the LOF function) ...

 

    strBuffer = String$(LOF(intHTMFileNbr), " ")

   

    ' The Get statement now reads the entire contents of the input file

    ' and stores it in the strBuffer variable.

   

    Get #intHTMFileNbr, , strBuffer

       

    Print "Generating output file ..."

   

    ' The For loop below now processes the contents of the file character by

    ' character, writing out only the characters that are NOT enclosed in the

    ' HTML tags (i.e., it is skipping every character between a pair of angle

    ' brackets "<" and ">") ...

       

    For lngX = 1 To Len(strBuffer)

        strCurrentChar = Mid$(strBuffer, lngX, 1)

        Select Case strCurrentChar

            Case "<"

                blnTagPending = True

            Case ">"

                blnTagPending = False

            Case Else

                If Not blnTagPending Then

                    ' The current character is outside of the tags, so write it out ...

                    Put #intTextFileNbr, , strCurrentChar

                End If

        End Select

    Next

           

    Print "Closing files ..."

   

    ' Close the input and output files ...

    Close #intHTMFileNbr

    Close #intTextFileNbr

 

    Print "Done."

 

End Sub

 

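After the cmdTryIt_Click event procedure has run, the form should show the progress messages printed by the procedure ("Opening files ...", "Reading input file ...", "Generating output file ...", "Closing files ...", and "Done."), and the output plain-text file should be present in the project directory.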

 

 


 

Sample Program 3 – Using the Input Function to Read a Binary File All At Once

 

The third sample program uses the technique of reading and processing a binary file all at once, using the Input function in conjunction with the LOF function. The code listed below is heavily commented to aid in the understanding of how the program works.

 

"Try It" Code:

 

Private Sub cmdTryIt_Click()

 

    Dim strHTMFileName         As String

    Dim strTextFileName        As String

    Dim strBackSlash           As String

    Dim intHTMFileNbr          As Integer

    Dim intTextFileNbr         As Integer

    Dim strBuffer              As String

    Dim strCurrentChar         As String * 1

    Dim lngX                   As Long

    Dim blnTagPending          As Boolean

   

    ' Prepare the file names ...

    strBackSlash = IIf(Right$(App.Path, 1) = "\", "", "\")

    strHTMFileName = App.Path & strBackSlash & "Files_Lesson1.htm"

    strTextFileName = App.Path & strBackSlash & "TestOut.txt"

   

    Print "Opening files ..."

    

    ' Open the input file ...

    intHTMFileNbr = FreeFile

    Open strHTMFileName For Binary Access Read As #intHTMFileNbr

   

    ' If the file we want to open for output already exists, delete it ...

    If Dir$(strTextFileName) <> "" Then

        Kill strTextFileName

    End If

    ' Open the output file ...

    intTextFileNbr = FreeFile

    Open strTextFileName For Binary Access Write As #intTextFileNbr

 

    Print "Reading input file ..."

 

    ' Note: The "buffer" is simply a string variable into which the entire

    ' contents of the file will be read.

 

    ' The Input function reads a number of bytes from a file. The first argument

    ' of the function specifies how many bytes to read, which in this case is

    ' the size of the entire file (as determined by the LOF function). The second

    ' argument specifies the file number of the file from which the data is to be

    ' read. The resulting data is stored in the "strBuffer" variable.

 

    strBuffer = Input(LOF(intHTMFileNbr), #intHTMFileNbr)

        

    Print "Generating output file ..."

   

    ' The For loop below now processes the contents of the file character by

    ' character, writing out only the characters that are NOT enclosed in the

    ' HTML tags (i.e., it is skipping every character between a pair of angle

    ' brackets "<" and ">") ...

       

    For lngX = 1 To Len(strBuffer)

        strCurrentChar = Mid$(strBuffer, lngX, 1)

        Select Case strCurrentChar

            Case "<"

                blnTagPending = True

            Case ">"

                blnTagPending = False

            Case Else

                If Not blnTagPending Then

                    ' The current character is outside of the tags, so write it out ...

                    Put #intTextFileNbr, , strCurrentChar

                End If

        End Select

    Next

           

    Print "Closing files ..."

   

    ' Close the input and output files ...

    Close #intHTMFileNbr

    Close #intTextFileNbr

 

    Print "Done."

 

End Sub

 

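After the cmdTryIt_Click event procedure has run, the form should show the progress messages printed by the procedure ("Opening files ...", "Reading input file ...", "Generating output file ...", "Closing files ...", and "Done."), and the output plain-text file should be present in the project directory.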

 

 
