VB.NET: Reading large files made simple

Introduction

When a developer's task involves reading massive text files, questions inevitably arise. A first attempt may leave the application unresponsive, while another creates low-memory issues. This usually happens because of a lack of experience working with massive text files, a novice coder, or poor business requirements for which the developer offers no alternatives. Through experience and due diligence there are ways to succeed, including, where possible, challenging the business requirements and offering alternatives. The following material provides ideas for alternate methods of working with larger files using Windows Forms; by changing the user interface, the same ideas can work for WPF too.

Unresponsive applications solutions

When processing a larger file blocks user interaction, consider these alternatives.

When the work is a small portion of the application's responsibility, consider breaking the initial file manipulation out into a separate utility. The code does not change, and the user can keep working in the main application.

When breaking out initial file manipulation into a utility, the next consideration is to split the file up without performing any other operations such as parsing into a database or list. Another process monitors a specific folder and parses each smaller file as it arrives. Finding the right size for the smaller files will at first be trial and error in order to keep the processes flowing. For instance, with a million lines in a file, both the right chunk size and the time the split takes have to be found by experiment.
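The folder-monitoring idea above could be sketched with FileSystemWatcher. This is a minimal console sketch under stated assumptions, not code from the article's repository: the "WorkFiles" folder name simply mirrors the configuration used later, and ProcessChunk is a hypothetical placeholder that only counts lines.

```vbnet
Imports System
Imports System.IO
Imports System.Linq

Module FolderMonitorSketch

    ' Hypothetical placeholder: count the lines of a file that arrived.
    ' A real implementation would parse the lines into a database or list.
    Public Function ProcessChunk(fileName As String) As Integer
        Return File.ReadLines(fileName).Count()
    End Function

    Sub Main()
        ' Assumption: the splitter drops its smaller files into this folder
        Dim watchFolder = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "WorkFiles")
        Directory.CreateDirectory(watchFolder)

        Using watcher As New FileSystemWatcher(watchFolder, "*.txt")

            AddHandler watcher.Created,
                Sub(sender, e)
                    ' The file may still be locked by the writer, retry logic is omitted
                    Try
                        Console.WriteLine($"{e.Name}: {ProcessChunk(e.FullPath)} lines")
                    Catch ex As IOException
                        Console.WriteLine($"{e.Name} not ready: {ex.Message}")
                    End Try
                End Sub

            watcher.EnableRaisingEvents = True

            Console.WriteLine($"Watching {watchFolder}, press ENTER to stop")
            Console.ReadLine()

        End Using
    End Sub
End Module
```

In a Windows Service the same watcher would be created when the service starts and disposed when it stops, rather than waiting on Console.ReadLine.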

Splitting can be delegated to a Windows Service that uses a scheduler to look for a file, performed during a nightly Windows scheduled task, or done by a utility that a user runs manually.

Other considerations when splitting and parsing files are other processes competing for resources on the local machine or network server, and anti-virus services interfering with the split and parse operations.

Split/chunking larger files

First rule: use asynchronous methods sparingly, as they will slow down processing. Second rule: when not working with a Windows Service, don't attempt to continually update the user interface, as this will also slow down processing; do the math so that the user interface is updated only at every ten percent done. Third rule: when working with a user interface, always provide a way to cancel the operation so the user does not have to click the X button on a form, which may leave data in an unstable state.
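The second and third rules can be illustrated with a small sketch. ProcessLines is a hypothetical routine, not part of the article's source: it reports progress only when another ten percent is complete and checks a CancellationToken between lines. In a form the report callback would update a ProgressBar and a Cancel button would call Cancel on the token source.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Threading

Module ProgressSketch

    ' Works through the lines, calling report only at each ten percent step.
    ' Returns how many lines were processed before finishing or cancelling.
    Public Function ProcessLines(lines As IList(Of String),
                                 token As CancellationToken,
                                 report As Action(Of Integer)) As Integer
        Dim lastReported = 0
        Dim processed = 0

        For index = 0 To lines.Count - 1

            ' Stop cleanly between lines when a cancel was requested
            If token.IsCancellationRequested Then
                Exit For
            End If

            ' ... the real work on lines(index) would go here ...
            processed += 1

            Dim percent = CInt(Math.Floor(processed * 100.0 / lines.Count))

            If percent >= lastReported + 10 Then
                lastReported = percent - (percent Mod 10)
                report(lastReported)
            End If
        Next

        Return processed
    End Function

    Sub Main()
        Dim lines = New List(Of String)
        For i = 1 To 1000
            lines.Add($"Line {i}")
        Next

        Using source As New CancellationTokenSource()
            ' In a form, a Cancel button click handler would call source.Cancel()
            Dim count = ProcessLines(lines, source.Token,
                                     Sub(percent) Console.WriteLine($"{percent}% done"))
            Console.WriteLine($"Processed {count} lines")
        End Using
    End Sub
End Module
```

For 1,000 lines the callback fires ten times instead of 1,000, which is the point of doing the math before touching the user interface.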

Basic splitting/chunking

These are steps to consider.

  • Decide the location where the large file will reside.
  • Decide the location where the smaller files will reside.
  • Decide how splitting will be verified against the original file.

Locations may be on a local box, split between a local box and a server, or all on the server. When splitting is done in a nightly job, using a network server is usually best. Check with network support for maintenance windows when the server may be restarted or a backup performed, either of which can halt or delay processing.

Code walk through

In this walk through a single button runs the split process. Folder locations and the lines-per-split setting are stored in the application's configuration file. An optional window could be offered to allow users to change settings, although it's better to leave configuration to engineers, help desk and developers.

Configuration settings

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.7.2" />
  </startup>
  <appSettings>
    <add key="LinesToSplit" value="1000" />
    <add key="ChunkFolderLocation" value="ChunkFiles" />
    <add key="ChunkFileBaseName" value="INCOMING_" />
    <add key="WorkFilesLocation" value="WorkFiles" />
  </appSettings>
</configuration>

The following class provides access to these settings.

Imports System.Configuration
Imports System.IO
Imports System.Reflection
 
Namespace Classes
 
    Public Class  ApplicationSettings
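        Public Shared Function GetLineSplit() As String
            Try
                ' AppSettings returns Nothing (it does not throw) when the key is missing
                Dim value = ConfigurationManager.AppSettings("LinesToSplit")
                Return If(String.IsNullOrWhiteSpace(value), "2000", value)
            Catch e1 As Exception
                ' Reached when the configuration file itself cannot be read
                Return "2000"
            End Try
        End Function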
        Public Shared  Function GetChunkFolderLocation() As String
            Try
                Return ConfigurationManager.AppSettings("ChunkFolderLocation")
            Catch e1 As Exception
                Throw New  Exception("Failed to read chunk folder location")
            End Try
        End Function
        Public Shared  Function GetWorkFolderLocation() As String
            Try
                Return ConfigurationManager.AppSettings("WorkFilesLocation")
            Catch e1 As Exception
                Throw New  Exception("Failed to read work folder location")
            End Try
        End Function
        Public Shared  Function GetChunkFileBaseName() As String
            Try
                Return ConfigurationManager.AppSettings("ChunkFileBaseName")
            Catch e1 As Exception
                Throw New  Exception("Failed to read chunk base file name")
            End Try
        End Function
        Public Shared  Sub LinesToSplit(ByVal value As String)
 
            Try
                Dim applicationDirectoryName = Path.GetDirectoryName(Assembly.
                    GetExecutingAssembly().Location)
                Dim configFile = Path.Combine(applicationDirectoryName,
                    $"{Assembly.GetExecutingAssembly().GetName().Name}.exe.config")
                Dim configFileMap = New ExeConfigurationFileMap With {.ExeConfigFilename = configFile}
                Dim config = ConfigurationManager.
                        OpenMappedExeConfiguration(configFileMap, ConfigurationUserLevel.None)
 
                config.AppSettings.Settings("LinesToSplit").Value = value
                config.Save()
 
            Catch e1 As Exception
                ' ignored
            End Try
        End Sub
    End Class
End Namespace
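GetLineSplit returns a String while SplitLargeFile expects an Integer, so the value has to be converted somewhere. One way to do that defensively is sketched below; ParseLineSplit is a hypothetical helper, not part of the article's source, and the fallback of 2000 only mirrors the default used in GetLineSplit.

```vbnet
Imports System

Module LineSplitSketch

    ' Hypothetical helper: convert the configured text to a usable line count,
    ' falling back when the value is missing, non-numeric or not positive.
    Public Function ParseLineSplit(configuredValue As String, fallback As Integer) As Integer
        Dim parsed As Integer

        If Integer.TryParse(configuredValue, parsed) AndAlso parsed > 0 Then
            Return parsed
        End If

        Return fallback
    End Function

    Sub Main()
        Console.WriteLine(ParseLineSplit("1000", 2000))  ' 1000
        Console.WriteLine(ParseLineSplit(Nothing, 2000)) ' 2000
        Console.WriteLine(ParseLineSplit("abc", 2000))   ' 2000
    End Sub
End Module
```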

File operations class
In the code sample this resides in the forms project, but it could also reside in a class project if there is any consideration of using it in another project.

Code flow:
Starting with the form, the file name is hard coded in this case. It could stay that way or be stored in the application configuration file. If the location may be dynamic, add an open dialog and save the last location back to the configuration file using the same logic that the ApplicationSettings.LinesToSplit method uses to remember the line split amount.

A call is made to SplitLargeFile, passing a number indicating how many lines each smaller file should hold. SplitLargeFile reads the file synchronously because, as stated earlier, asynchronous code keeps the application responsive but takes longer to process. If asynchronous reading is desired, there is StreamReader.ReadToEndAsync, after which the contents can be split; see the following method in another project provided with this article.
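For readers who want to see the asynchronous route, here is a minimal sketch, assuming the whole file fits in memory. ReadFileLinesAsync is a hypothetical name and this is not the method from the accompanying project; it reads with StreamReader.ReadToEndAsync and then splits the text into lines with a StringReader.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Threading.Tasks

Module AsyncReadSketch

    ' Reads the whole file without blocking the caller, then splits it into lines.
    ' Unlike File.ReadLines, the entire file is held in memory at once.
    Public Async Function ReadFileLinesAsync(fileName As String) As Task(Of List(Of String))
        Dim contents As String

        Using reader As New StreamReader(fileName)
            contents = Await reader.ReadToEndAsync()
        End Using

        Dim lines = New List(Of String)

        Using textReader As New StringReader(contents)
            Dim line = textReader.ReadLine()
            Do While line IsNot Nothing
                lines.Add(line)
                line = textReader.ReadLine()
            Loop
        End Using

        Return lines
    End Function

    Sub Main()
        Dim fileName = Path.GetTempFileName()
        File.WriteAllLines(fileName, {"one", "two", "three"})

        ' A console application has no UI thread so blocking here is acceptable.
        ' In a form, Await the call inside an Async event handler instead.
        Dim lines = ReadFileLinesAsync(fileName).GetAwaiter().GetResult()
        Console.WriteLine($"{lines.Count} lines read")

        File.Delete(fileName)
    End Sub
End Module
```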

SplitLargeFile does not load the whole file at once; File.ReadLines streams the lines through an enumerator. For each chunk a call is made to WriteChunk, which accepts as its first argument an IEnumerator(Of String) positioned over the lines in the file. The second argument indicates how many lines will be in each file, while the last argument is the chunk index, which is appended to the base name of each smaller file, e.g. INCOMING_1.txt, INCOMING_2.txt and so forth.

WriteChunk uses a StreamWriter to create a file with the specified number of lines; when fewer lines remain, only those lines are written. It is possible to end up with a zero-byte file, which is taken care of later in the same class shown below.

Imports System.IO
 
Namespace Classes
 
    Public Class  FileOperations
 
        Private Shared  _chunkFolderLocation As String
        Private Shared  _workFolderLocation As String
        Private Shared  _chunkFileBaseName As String
 
        ''' <summary>
        ''' Location where smaller chunk files are created
        ''' </summary>
        ''' <returns></returns>
        Public Shared  ReadOnly Property  ChunkFolderLocation() As String
            Get
                If String.IsNullOrWhiteSpace(_chunkFolderLocation) Then
                    _chunkFolderLocation = Path.Combine(AppDomain.CurrentDomain.BaseDirectory,
                                                        ApplicationSettings.GetChunkFolderLocation())
                End If
 
                Return _chunkFolderLocation
 
            End Get
        End Property
        Public Shared  ReadOnly Property  WorkFolderLocation() As String
            Get
                If String.IsNullOrWhiteSpace(_workFolderLocation) Then
                    _workFolderLocation = Path.Combine(AppDomain.CurrentDomain.BaseDirectory,
                                                       ApplicationSettings.GetWorkFolderLocation())
                End If
 
                Return _workFolderLocation
 
            End Get
        End Property
        ''' <summary>
        ''' Base chunk file name which has a _Number appended
        ''' </summary>
        ''' <returns></returns>
        Public Shared  ReadOnly Property  ChunkFileBaseName() As String
            Get
                If String.IsNullOrWhiteSpace(_chunkFileBaseName) Then
                    _chunkFileBaseName = ApplicationSettings.GetChunkFileBaseName()
                End If
 
                Return _chunkFileBaseName
 
            End Get
        End Property
        ''' <summary>
        ''' Split a larger file into smaller files where the smaller files
        ''' line count equal the line count of the larger file
        ''' </summary>
        ''' <param name="fileName">Valid existing file with path</param>
        ''' <param name="splitSize">How many lines to split on</param> 
        Public Shared  Sub SplitLargeFile(fileName As String, ByVal  splitSize As  Integer)
 
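            If Not File.Exists(fileName) Then
                Throw New FileNotFoundException("File to split was not found", fileName)
            End If

            ' A split size below 1 would loop forever creating empty files
            If splitSize < 1 Then
                Throw New ArgumentOutOfRangeException(NameOf(splitSize))
            End If

            ' Make sure the chunk folder exists, then clear out files from a prior run
            Directory.CreateDirectory(ChunkFolderLocation)

            For Each existingFile In Directory.GetFiles(ChunkFolderLocation)
                File.Delete(existingFile)
            Next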
 
            Using lineIterator As  IEnumerator(Of String) = File.ReadLines(fileName).GetEnumerator()
 
                Dim stillGoing = True
 
                Dim chunkIndex As Integer  = 0
 
                Do While  stillGoing
                    stillGoing = WriteChunk(lineIterator, splitSize, chunkIndex)
                    chunkIndex += 1
                Loop
 
            End Using
 
            RemoveZeroLengthFiles()
 
        End Sub
        ''' <summary>
        ''' Create smaller chunk file
        ''' </summary>
        ''' <param name="lineIterator"></param>
        ''' <param name="splitSize"></param>
        ''' <param name="chunk"></param>
        ''' <returns></returns>
        Private Shared  Function WriteChunk(lineIterator As IEnumerator(Of String),
                                           splitSize As  Integer, chunk As Integer) As  Boolean
 
            Dim fileName = Path.Combine(ChunkFolderLocation, $"{ChunkFileBaseName}{chunk + 1}.txt")
 
            Using writer As  StreamWriter = File.CreateText(fileName)
 
                For index As Integer  = 0 To  splitSize - 1
 
                    If Not  lineIterator.MoveNext() Then
                        Return False
                    End If
 
                    writer.WriteLine(lineIterator.Current)
 
                Next index
            End Using
 
            Return True
 
        End Function
 
        ''' <summary>
        ''' Verify the chunked files equal line count in the original file
        ''' </summary>
        ''' <param name="incomingFileName">The larger file</param>
        ''' <returns></returns>
        Public Shared  Function VerifyLineCounts(incomingFileName  As  String) As List(Of Verify)
            Dim verifyList = New List(Of Verify)()
 
            Dim directory = New DirectoryInfo(ChunkFolderLocation)
            Dim files = directory.GetFiles("*.*", SearchOption.AllDirectories)
 
            Dim lineCount = 0
            Dim totalLines = 0
 
            For Each  fileInfo As  FileInfo In  files
                Using reader = File.OpenText(Path.Combine(ChunkFolderLocation, fileInfo.Name))
                    Do While  reader.ReadLine() IsNot Nothing
                        lineCount += 1
                    Loop
                End Using
 
                verifyList.Add(New Verify() With {.FileName = fileInfo.Name, .Count = lineCount})
 
                totalLines += lineCount
                lineCount = 0
 
            Next fileInfo
 
            '
            ' IMPORTANT: The chunk method appends _N to each of the smaller files
            ' which this statement expects so if the above method changes this must too.
            '
            verifyList = verifyList.Select(
                Function(verify) New  With {
                      Key .Name = verify,
                      Key .Index = Convert.ToInt32(verify.FileName.Split("_"c)(1).
                                                      Replace(".txt", ""))}).
                OrderBy(Function(item) item.Index).
                Select(Function(anonymousItem) anonymousItem.Name).ToList()
 
            verifyList.Add(New Verify() With {.FileName = "Total", .Count = totalLines})
 
            lineCount = 0
 
            ' Count lines in the original file using the path that was passed in
            Using reader = File.OpenText(incomingFileName)
                Do While  reader.ReadLine() IsNot Nothing
                    lineCount += 1
                Loop
            End Using
 
            verifyList.Add(New Verify() With {
                              .FileName = Path.GetFileName(incomingFileName),
                              .Count = lineCount})
 
            Return verifyList
 
        End Function
        ''' <summary>
        ''' Used by <see cref="RemoveZeroLengthFiles"/>
        ''' </summary>
        ''' <param name="path"></param>
        ''' <returns></returns>
        Private Shared  Function GetFilesWithZeroLengthFiles(path  As  String) As String()
            Dim directory = New DirectoryInfo(path)
            Dim files = directory.GetFiles("*.*", SearchOption.AllDirectories)
 
            Return (
                From file In  files
                Where file.Length = 0
                Select file.Name).ToArray()
 
        End Function
        ''' <summary>
        ''' We don't want any empty files, this removes zero-length
        ''' chunk files left behind by the split
        ''' </summary>
        Public Shared  Sub RemoveZeroLengthFiles()
 
 
            Dim files = GetFilesWithZeroLengthFiles(ChunkFolderLocation)
 
            For Each  currentFile In  files
                File.Delete(Path.Combine(ChunkFolderLocation, currentFile))
            Next
 
        End Sub
        ''' <summary>
        ''' The idea here is to take files in the chunk folder and one
        ''' by one process them where in this case there is no processing.
        '''
        ''' Processing can take many forms e.g. move to a work folder and parse
        ''' or perhaps move to a monitor folder which a windows service watches
        ''' and processes any file in that location. In this case a fictitious
        ''' Windows service looks for a known file which means any file here
        ''' is moved it always overwrites the current file the service is watching for.
        ''' </summary>
        ''' <returns></returns>
        Public Shared  Function CheckIfNewIncomingFileIsNeeded()  As  String
 
            Dim availableFiles = Directory.GetFiles(ChunkFolderLocation)
 
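            ' Parse the index from the file name only, an underscore elsewhere
            ' in the folder path would otherwise break the split
            Dim fileNamesOrdered = availableFiles.Select(
                Function(fName) New With {
                        Key .Name = fName,
                        Key .Index = Convert.ToInt32(Path.GetFileNameWithoutExtension(fName).
                                                        Split("_"c).Last())}).
                    OrderBy(Function(item) item.Index).ToArray()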
 
            If fileNamesOrdered.Length > 0 Then
                Dim chunkFileName = Path.GetFileName(fileNamesOrdered(0).Name)
 
                '
                ' Process the file or move the file to another folder for processing
                ' then do the delete or not
                '
 
                Dim destinationFileName = Path.Combine(WorkFolderLocation, "Current.txt")
                If File.Exists(destinationFileName) Then
                    File.Delete(destinationFileName)
                End If
 
                File.Move(fileNamesOrdered(0).Name, destinationFileName)
 
                Return chunkFileName
            Else
                ' Signifies all files have been processed
                Return Nothing
            End If
        End Function
    End Class
End Namespace
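The way SplitLargeFile and WriteChunk share a single enumerator can be hard to see among the file handling. The sketch below repeats the same loop in memory. SplitIntoChunks and TakeChunk are hypothetical names used only for illustration, and dropping the empty chunk stands in for what RemoveZeroLengthFiles does on disk.

```vbnet
Imports System
Imports System.Collections.Generic

Module ChunkSketch

    ' Takes up to chunkSize items from the shared enumerator.
    ' Returns False when the enumerator ran out before the chunk was full.
    Private Function TakeChunk(iterator As IEnumerator(Of String),
                               chunkSize As Integer,
                               chunk As List(Of String)) As Boolean
        For index = 0 To chunkSize - 1
            If Not iterator.MoveNext() Then
                Return False
            End If
            chunk.Add(iterator.Current)
        Next

        Return True
    End Function

    Public Function SplitIntoChunks(lines As IEnumerable(Of String),
                                    chunkSize As Integer) As List(Of List(Of String))
        If chunkSize < 1 Then
            Throw New ArgumentOutOfRangeException(NameOf(chunkSize))
        End If

        Dim chunks = New List(Of List(Of String))

        Using iterator = lines.GetEnumerator()
            Dim stillGoing = True

            Do While stillGoing
                Dim chunk = New List(Of String)
                stillGoing = TakeChunk(iterator, chunkSize, chunk)

                ' An exact multiple of chunkSize produces one empty trailing chunk,
                ' the in-memory equivalent of a zero-byte file
                If chunk.Count > 0 Then
                    chunks.Add(chunk)
                End If
            Loop
        End Using

        Return chunks
    End Function

    Sub Main()
        Dim lines = {"a", "b", "c", "d", "e"}

        ' Prints a,b then c,d then e
        For Each chunk In SplitIntoChunks(lines, 2)
            Console.WriteLine(String.Join(",", chunk))
        Next
    End Sub
End Module
```

Because the enumerator keeps its position between calls, each chunk continues where the previous one stopped and no line is read twice.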

Simulation for after splitting file

Once the split operation has completed, working with the resulting files can take many forms. What stays constant is obtaining the files one by one and moving each to another folder for another process to work on, such as a Windows service running a job scheduler or another project watching a folder for new files.

Here a method is called from a button click event to simulate getting a file to work on. In CheckIfNewIncomingFileIsNeeded the chunk folder is queried for all files, which are ordered by the _N suffix (e.g. File_1.txt, File_2.txt etc.). The first file is then taken and moved to a folder where a Windows service will process it or another application watches for newer files. There are many ways to perform this operation and each developer needs to decide how to do it. One option is to delete a file once it has been processed and have the application that moves files use a Timer to check whether a new file is needed by calling the method used below. When there are no more files the method returns Nothing, which the caller can test for.

Private Sub  GetNextButton_Click(sender As Object, e As  EventArgs) Handles  GetNextButton.Click
    Dim fileName = ""
 
    Try
        fileName = FileOperations.CheckIfNewIncomingFileIsNeeded()
        If String.IsNullOrWhiteSpace(fileName) Then
            MessageBox.Show("Finished")
        Else
            MessageBox.Show(fileName)
        End If
    Catch ex As Exception
        MessageBox.Show($"Failed: {ex.Message}")
    End Try
End Sub
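Ordering by the _N suffix described above matters because a plain alphabetical sort places INCOMING_10.txt before INCOMING_2.txt. The following sketch isolates that ordering; ChunkIndex and OrderByChunkIndex are hypothetical helper names, and the file names assume the base-name-plus-number pattern used by the split.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq

Module ChunkOrderSketch

    ' Extracts N from a name such as INCOMING_12.txt, ignoring folder and extension
    Public Function ChunkIndex(fileName As String) As Integer
        Dim baseName = Path.GetFileNameWithoutExtension(fileName)
        Return Convert.ToInt32(baseName.Substring(baseName.LastIndexOf("_"c) + 1))
    End Function

    ' Sorts numerically on the suffix rather than alphabetically on the name
    Public Function OrderByChunkIndex(fileNames As IEnumerable(Of String)) As List(Of String)
        Return fileNames.OrderBy(Function(name) ChunkIndex(name)).ToList()
    End Function

    Sub Main()
        Dim names = {"INCOMING_10.txt", "INCOMING_2.txt", "INCOMING_1.txt"}

        ' Prints INCOMING_1.txt, INCOMING_2.txt, INCOMING_10.txt
        Console.WriteLine(String.Join(", ", OrderByChunkIndex(names)))
    End Sub
End Module
```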

Verifying

To ensure the chunk process was successful, the following method counts the lines in each chunk file and presents their total alongside the line count of the original file.

''' <summary>
''' Verify the chunked files equal line count in the original file
''' </summary>
''' <param name="incomingFileName">The larger file</param>
''' <returns></returns>
Public Shared  Function VerifyLineCounts(incomingFileName  As  String) As List(Of Verify)
    Dim verifyList = New List(Of Verify)()
 
    Dim directory = New DirectoryInfo(ChunkFolderLocation)
    Dim files = directory.GetFiles("*.*", SearchOption.AllDirectories)
 
    Dim lineCount = 0
    Dim totalLines = 0
 
    For Each  fileInfo As  FileInfo In  files
        Using reader = File.OpenText(Path.Combine(ChunkFolderLocation, fileInfo.Name))
            Do While  reader.ReadLine() IsNot Nothing
                lineCount += 1
            Loop
        End Using
 
        verifyList.Add(New Verify() With {.FileName = fileInfo.Name, .Count = lineCount})
 
        totalLines += lineCount
        lineCount = 0
 
    Next
 
    '
    ' IMPORTANT: The chunk method appends _N to each of the smaller files
    ' which this statement expects so if the above method changes this must too.
    '
    verifyList = verifyList.Select(
        Function(verify) New  With {
              Key .Name = verify,
              Key .Index = Convert.ToInt32(verify.FileName.Split("_"c)(1).
                                              Replace(".txt", ""))}).
        OrderBy(Function(item) item.Index).
        Select(Function(anonymousItem) anonymousItem.Name).ToList()
 
    verifyList.Add(New Verify() With {.FileName = "Total", .Count = totalLines})
 
    lineCount = 0
 
    ' Count lines in the original file using the path that was passed in
    Using reader = File.OpenText(incomingFileName)
        Do While  reader.ReadLine() IsNot Nothing
            lineCount += 1
        Loop
    End Using
 
    verifyList.Add(New Verify() With {
                      .FileName = Path.GetFileName(incomingFileName),
                      .Count = lineCount})
 
    Return verifyList
 
End Function
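When the verification runs unattended, for example in a nightly job, a simple pass or fail may be more useful than a list for visual inspection. A minimal sketch follows, assuming all chunk files sit directly in one folder with a .txt extension. ChunkTotalsMatch and CountLines are hypothetical names, not part of the article's source.

```vbnet
Imports System
Imports System.IO
Imports System.Linq

Module VerifySketch

    Public Function CountLines(fileName As String) As Integer
        Return File.ReadLines(fileName).Count()
    End Function

    ' True when the chunk files together hold as many lines as the original
    Public Function ChunkTotalsMatch(originalFile As String, chunkFolder As String) As Boolean
        Dim chunkTotal = Directory.GetFiles(chunkFolder, "*.txt").
                Sum(Function(chunkFile) CountLines(chunkFile))

        Return chunkTotal = CountLines(originalFile)
    End Function

    Sub Main()
        ' Build a throwaway original file and two chunk files for the demonstration
        Dim chunkFolder = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
        Directory.CreateDirectory(chunkFolder)

        Dim original = Path.GetTempFileName()
        File.WriteAllLines(original, {"1", "2", "3"})
        File.WriteAllLines(Path.Combine(chunkFolder, "INCOMING_1.txt"), {"1", "2"})
        File.WriteAllLines(Path.Combine(chunkFolder, "INCOMING_2.txt"), {"3"})

        ' Prints True
        Console.WriteLine(ChunkTotalsMatch(original, chunkFolder))

        Directory.Delete(chunkFolder, True)
        File.Delete(original)
    End Sub
End Module
```

A line count match does not prove the content is identical, so a stricter check could compare the lines themselves at the cost of a second full read.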

The following is presented to the user for visual verification


 

Showing progress

As stated before, if the application user interface is unresponsive but the work completes in a timely manner, that should be fine provided the developer can reassure the business. If not, consider a splash screen with an animated gif or a continuous ProgressBar. If neither of these works for them, the next option is to create a child form and use delegates to communicate with the back-end class performing the work, or consider the Windows API Code Pack TaskDialog.

In the screenshot below a TaskDialog has been configured to present a progress bar showing the current position reading a file along with providing a button that when clicked will send a cancellation request which stops all processing. If the process completes the dialog auto-closes followed by populating a ListView. 

Caution/warning: All of the Windows API Code Pack code samples in the author's GitHub repository are written in C# except for one small code sample, which is nothing like what is shown below. With the source code provided and by studying the VB.NET code, it should still be easy to adapt to other projects. Refer to the following code block to see how the screenshot dialog was coded.

 


Bonus
With this library dialogs like the following are possible.

Imports Microsoft.WindowsAPICodePack.Dialogs
 
Namespace Modules
 
    Public Module DialogHelpers
        Public Function ExitApplication(Form As Form) As TaskDialogResult
            Dim taskDialogResult As TaskDialogResult
 
            Dim stayButton = New TaskDialogButton("StayButton", "I want to stay") With {.Default = True}
            Dim closeButton = New TaskDialogButton("CancelButton", "Leave now!!!")
 
            Using dialog = New TaskDialog With {
                .Caption = "Question",
                .InstructionText = "Close me",
                .Icon = TaskDialogStandardIcon.Warning,
                .Cancelable = True, .OwnerWindowHandle = Form.Handle,
                .StartupLocation = TaskDialogStartupLocation.CenterOwner}
 
                dialog.Controls.Add(stayButton)
                dialog.Controls.Add(closeButton)
 
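                ' Re-assigning the icon once the dialog has opened is a commonly used
                ' workaround for the icon otherwise not being displayed
                AddHandler dialog.Opened, Sub(sender As Object, ea As EventArgs)
                                              Dim taskDialog As TaskDialog = DirectCast(sender, TaskDialog)
                                              taskDialog.Icon = taskDialog.Icon

                                          End Sub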
 
                AddHandler stayButton.Click,
                    Sub(o, ev)
                        dialog.Close(TaskDialogResult.Cancel)
                    End Sub
 
                AddHandler closeButton.Click,
                    Sub(o, ev)
                        dialog.Close(TaskDialogResult.Ok)
                    End Sub
 
                taskDialogResult = dialog.Show()
 
                If taskDialogResult = TaskDialogResult.Ok Then
                    Form.Close()
                End If
 
            End Using
 
            Return taskDialogResult
 
        End Function
 
 
    End Module
End Namespace

Summary

This article has presented alternate methods for working with larger files, which can help with sluggish applications and low-memory conditions when processing such files.

See also

VB.NET Writing better code part 1
How to Handle a Huge Collection of Strings in VB.Net
VB.NET: Invoke Method to update UI from secondary threads
VB.NET Windows Forms delegates and events
C# Processing CSV Files

Source code

Clone the repository or download from the following repository.
Once the solution is opened in Visual Studio, right-click the solution in Solution Explorer and select "Restore NuGet Packages".