I have some data that is emailed to me in a 50-ish page PDF that has four (4) columns of structured tabular data in several pages. The data takes up about 4 pages of the PDF so parsing the entire PDF as a single-set of data isn't practical. My current
approach is to parse the data after it's copy/pasted from the PDF. When I copy/paste I can structure the data into a single column into worksheet "RawData". Hence, with four (4) columns of 50-100 rows in each column I can get 200-400 rows of Excel content
that is pasted into column A and only column A of my "RawData" worksheet.
Uniquely, the data has several spaces as separators. That is, from what I've worked with each "part" of the data that I need to parse has a space at the END of THAT data except at the very end of the row of data which is denoted by an end-of-line, CRLF,
carriage return/line-feed. The data is consistent within each row but not fully consistent of ALL rows. See below for examples!
Here's an example of the data inColumn A of my worksheet "RawData" and my parsing which works fine for the first four (4) columns that I've done.
41 48 32 B B 35.8 FRes 4.8 BEN 22.70 +1.11 –9 4.1m 8o
My parsing command is as follows with the results immediately below it:
=IF(RawData!$A19="","",TRIM(MID(RawData!$A19,FIND(" ",RawData!$A19,LEN($A19)+1+LEN($B19)+1+LEN($C19)),LEN($A19)+LEN($B19)+LEN($C19)-1)))
Columns:
A B C D
41 48 32 B B
What I want is to put each subset of the data into a separate column so that I can sort, calculate, process etc....!! That is, the structure that I want is to put each subset into its own unique column. A 'subset' of the data is separated by blanks or
spaces.
Second example:
.. .. 38 .. B– 9.9 Frnk 11FTF 8.54 +.07 –33 85 ..k
Hence, ideally, I need this output structure:
Columns:
A B C D E F G H I J K L M N
41 48 32 B B 35.8 Fres 4.8 BEN 22.70 +1.11 -9 4.1m 8o
.. .. 38 .. B- 0.0 Frnk 11 FTF 8.54 +0.07 -33 85 ..k
You'll notice that Column I generally has alphabetic data and Column H has numbers. The original raw data for my first example has a space between the 4.8 and BEN but the
second example does not. That is, the raw data for the second example has 11FTF together - as far as I can tell from the copy/paste of the data. I could probably parse from the left for the Columns A
through G and parse from the right for Columns N through I and look for the numeric vs alphabetic differences in the data. Column G can be 20 characters +/- at times as it describes what the ROW is about.
Thirdly, sometimes the data comes in withoutanything FOR Column H. Here's an example:
87 34 90 BA– NH FreHC FXHE 18.75 +1.06 +176 286 45k
Hence, ideally, I need this output structure:
Columns:
A B C D E F G H I J K L M N
87 34 90 B A- NH FreHC FXHE 18.75 +1.06 +176 286 45k
Is there a better way to parse other than the MID function? Is the parsing even do-able?
These are just three (3) examples and it appears that these three (3) are the main if not all the various types of rows and examples.
Note: I'm not afraid of parsing by steps. That is do the main parsing into a ParsedWkSht1 and then parse the exceptions into a ParsedWkSht2 etc.
Thanks!!