As cheong00 mentioned, the source files can contain either text or images of text. One solution that handles both types is the LEADTOOLS Document Converter SDK libraries. (Disclaimer: I am a LEADTOOLS employee.)
The following code converts many types of input files and images to text, and automatically invokes OCR when needed:
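    // Namespaces used below (naming consistent with Leadtools.Document.Writer in the code).
    using System.Windows.Forms;
    using Leadtools.Document;
    using Leadtools.Document.Converter;

    // ocrEnginePath is not defined in this snippet; it is assumed to be a string field
    // holding the folder that contains the LEADTOOLS OCR engine runtime files.
    private void ConvertToText(string inputFileName)
    {
        var options = new LoadDocumentOptions();
        using (var document = DocumentFactory.LoadFromFile(inputFileName, options))
        using (DocumentConverter documentConverter = new DocumentConverter())
        using (Leadtools.Ocr.IOcrEngine ocrEngine = Leadtools.Ocr.OcrEngineManager.CreateEngine(Leadtools.Ocr.OcrEngineType.LEAD))
        {
            // Start the OCR engine so the converter can invoke OCR when it is needed.
            ocrEngine.Startup(null, null, null, ocrEnginePath);
            // false: the converter does not take ownership of the engine, so it is released here.
            documentConverter.SetOcrEngineInstance(ocrEngine, false);

            var outFile = inputFileName + "_converted.txt";
            var format = Leadtools.Document.Writer.DocumentFormat.Text;
            var jobData = DocumentConverterJobs.CreateJobData(document, outFile, format);
            jobData.JobName = "conversion job";

            // Create and run the conversion job, then release the OCR engine.
            var job = documentConverter.Jobs.CreateJob(jobData);
            documentConverter.Jobs.RunJob(job);
            ocrEngine.Shutdown();

            // Report the number of errors and the output file name.
            MessageBox.Show(job.Errors.Count.ToString() + " " + outFile);
        }
    }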
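One small note on the output name: appending "_converted.txt" to the full input name gives files like report.pdf_converted.txt. If you would rather drop the original extension, a minimal sketch using only System.IO could look like the following. OutputNames and BuildOutputPath are hypothetical names of my own for illustration, not part of LEADTOOLS.

```csharp
using System;
using System.IO;

static class OutputNames
{
    // Hypothetical helper (not a LEADTOOLS API): builds "<name>_converted.txt"
    // in the same folder as the input file, dropping the original extension.
    public static string BuildOutputPath(string inputFileName)
    {
        // GetDirectoryName returns null for a root path; fall back to an empty string.
        string directory = Path.GetDirectoryName(inputFileName) ?? string.Empty;
        string baseName = Path.GetFileNameWithoutExtension(inputFileName);
        return Path.Combine(directory, baseName + "_converted.txt");
    }
}

static class Program
{
    static void Main()
    {
        // prints "report_converted.txt"
        Console.WriteLine(OutputNames.BuildOutputPath("report.pdf"));
    }
}
```

The result could then be assigned to outFile in place of the string concatenation.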
If you would like to try LEADTOOLS, you can download the free evaluation from this page.