Tesseract OCR with RegEx
Kmcnet
Hello everyone, and thanks in advance for the help. I am developing a C# application that reads PNG files with Tesseract 5, and I am having problems extracting a date from the recognized text. When the recognized text is displayed in a textbox, it is correct and looks like this:
DOB: 04/02/2016
My code to extract the text:
public string PatientDateOfBirth { get; set; }

public ExtractPatientDateOfBirth(string TextToParse)
{
    string MatchPattern = "DOB:" + @"(\ +)([0-9\.\<\/\s+]+)";
    Regex r = new Regex(MatchPattern, RegexOptions.IgnoreCase);
    Match m = r.Match(TextToParse);
    string PatientDOB = "";
    while (m.Success)
    {
        PatientDOB = m.ToString();
        m = m.NextMatch();
    }
    PatientDOB = PatientDOB.Replace("DOB:", "");
    PatientDOB = PatientDOB.Trim();
    //var PatientDateTime = DateTime.ParseExact(PatientDOB, "M/d/yyyy", CultureInfo.InvariantCulture);
    //PatientDateOfBirth = PatientDateTime.ToString("MM/dd/yyyy");
    PatientDateOfBirth = PatientDOB;
}
However, the code returns the value 04702/2016, where one of the forward slashes has become a 7. I am not sure what is causing this problem or how to correct it.
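One thing that may be worth checking, offered as a sketch and not a confirmed diagnosis: the while loop in the code above overwrites PatientDOB on every match, so it ends up holding the last DOB: occurrence in the text, not the first. If the OCR output happens to contain more than one DOB: entry and a later one was misread, that later value is what gets returned, even though the first one looks fine in the textbox. The example below lists every candidate and picks the first one that parses as a date. The names DobDiagnostics, FindDobCandidates, and FirstValidDob are hypothetical, the simplified pattern is an assumption, and the M/d/yyyy format is taken from the commented-out line in the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DobDiagnostics
{
    // Hypothetical helper: returns the value after every "DOB:" label, in document order.
    public static List<string> FindDobCandidates(string text)
    {
        var candidates = new List<string>();
        // Assumed, simplified pattern: optional whitespace, then digits, dots, or slashes.
        var regex = new Regex(@"DOB:\s*([0-9./]+)", RegexOptions.IgnoreCase);
        foreach (Match m in regex.Matches(text))
        {
            candidates.Add(m.Groups[1].Value);
        }
        return candidates;
    }

    // Hypothetical helper: returns the first candidate that is a valid M/d/yyyy date, or null.
    public static DateTime? FirstValidDob(string text)
    {
        foreach (string candidate in FindDobCandidates(text))
        {
            if (DateTime.TryParseExact(candidate, "M/d/yyyy", CultureInfo.InvariantCulture,
                                       DateTimeStyles.None, out DateTime parsed))
            {
                return parsed;
            }
        }
        return null;
    }
}
```

Printing the result of FindDobCandidates for the full recognized text should show whether 04702/2016 is actually present somewhere in the OCR output. If it is, the regular expression is behaving as written and the misread character comes from the recognition step, so validating each candidate (as FirstValidDob does) or improving the image quality may be the more useful direction.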