Extracting and comparing unstructured data (AI/ML)
Challenge 1:
Extracting data from unstructured data files. For example, I have attached a file here: https://docs.google.com/document/d/1DJb1p0o5---xs1Hfo0PCzG2mt6FvooRD/edit?usp=sharing&ouid=107579116880049687042&rtpof=true&sd=true
In the above file, the data looks like a table, but it is really plain text typed out in a tabular layout.
By contrast, this is a proper table format, from which I have no problem extracting the data: https://drive.google.com/file/d/1xS_GlLskINtFjDXXEMB6J0f_sXQRJJUE/view?usp=sharing
I have worked with the unstructured data, loaded it into a DataFrame, rearranged it into a clean table, and made it ready for extraction.
But the problem is that future files will not arrive in this same unstructured format. The layout will change, either with blank lines between rows or with entirely different column names, so I can't keep updating my code for each and every file.
Below are the columns from the unstructured data file linked above:
S Qty
item
Place of supply
HSN/SAC
Quantity
Unit Price
Net Price
TAX TYPE
Tax Rate
Tax Amount
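To make the changing-column-names problem concrete, here is a minimal sketch of mapping whatever header text shows up in a file onto the column names listed above. It assumes Python's standard-library `difflib`; the `canonical_header` helper, the `CANONICAL` list, and the 0.6 cutoff are hypothetical choices for illustration and are not part of any existing code. It only handles headers that are close in spelling to a known name; completely different column names would still need a synonym list or some other approach.

```python
import difflib
import re

# Column names taken from the sample file above (assumed to be the target schema).
CANONICAL = [
    "S Qty", "item", "Place of supply", "HSN/SAC", "Quantity",
    "Unit Price", "Net Price", "TAX TYPE", "Tax Rate", "Tax Amount",
]


def normalize(text):
    """Collapse runs of whitespace, trim, and lowercase a header string."""
    return re.sub(r"\s+", " ", text).strip().lower()


# Lookup from normalized header text to the canonical column name.
_LOOKUP = {normalize(name): name for name in CANONICAL}


def canonical_header(raw, cutoff=0.6):
    """Return the closest canonical column name for `raw`, or None if nothing is close."""
    key = normalize(raw)
    if key in _LOOKUP:
        return _LOOKUP[key]
    # Fall back to fuzzy matching on the normalized strings.
    close = difflib.get_close_matches(key, list(_LOOKUP), n=1, cutoff=cutoff)
    return _LOOKUP[close[0]] if close else None


print(canonical_header("  unit   price "))  # exact match after normalization
print(canonical_header("Unit Prices"))      # fuzzy match on a small spelling change
print(canonical_header("xyz"))              # no close match, so None
```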
Challenge 2:
Comparing data after successful extraction (structured data):
Let's say there is a buyer and there are multiple vendors. The buyer gives requirements to the vendor:
Buyer requirements:
i5 processor,ram 8gb, SSD 120GB, HDD 1TB
The vendor gives the available specs to the buyer:
i7 processor, ram 16gb, SSD 220gb, HDD 2TB
In the above scenario, it's easy to compare the buyer's requirements against the vendor's specs,
so I can write my output to an Excel sheet like this:
Buyer           Vendor
i5 processor    i7 processor
ram 8GB         ram 16gb
SSD 120GB       SSD 220gb
HDD 1TB         HDD 2TB
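The one-to-one case above can be sketched roughly as follows. This is a minimal illustration, assuming each side is a single comma-separated line and that both sides list their specs in the same order; `split_specs` is a hypothetical helper name.

```python
def split_specs(line):
    """Split a comma-separated spec line into a list of trimmed spec strings."""
    return [part.strip() for part in line.split(",") if part.strip()]


buyer = split_specs("i5 processor,ram 8gb, SSD 120GB, HDD 1TB")
vendor = split_specs("i7 processor, ram 16gb, SSD 220gb, HDD 2TB")

# Pair specs by position: first buyer spec with first vendor spec, and so on.
rows = list(zip(buyer, vendor))

print("Buyer".ljust(16) + "Vendor")
for buyer_spec, vendor_spec in rows:
    print(buyer_spec.ljust(16) + vendor_spec)
```

Pairing by position only works while there is exactly one line item on each side, which is the limitation described further down.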
I'm using Levenshtein distance for this scenario: it counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another, which gives a measure of how similar two strings are.
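For reference, a minimal pure-Python sketch of Levenshtein distance is shown below. It is the textbook dynamic-programming version, not necessarily the library or implementation used for the results in this post; the `similarity` helper, which scales the distance to a 0..1 score, is a hypothetical addition.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("i5 processor", "i7 processor"))  # 1
print(levenshtein("i5 processor", "i9 processor"))  # 1
print(similarity("SSD 120GB", "ssd 120gb"))         # 1.0
```

Note that "i5 processor" is at distance 1 from both "i7 processor" and "i9 processor", so the distance on a single spec cannot tell similar-looking vendor specs apart.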
But below is the challenge I'm facing:
Buyer requirements:
1) i5 processor,ram 8gb, SSD 120GB, HDD 1TB
2) i7 processor,ram 16gb, SSD 420GB, HDD 2TB
The vendor gives the available specs to the buyer:
1) i7 processor, ram 16gb, SSD 220gb, HDD 2TB
2) i9 processor, ram 32gb, SSD 500gb, HDD 4TB
3) i5 processor, ram 8gb, SSD 120gb, HDD 2TB
Below is the output I get when comparing these:
Buyer           Vendor
i5 processor    i7 processor, i9 processor, i5 processor
ram 8GB         ram 16gb, ram 32gb, ram 8gb
SSD 120GB       SSD 220gb, SSD 500gb, SSD 120gb
HDD 1TB         HDD 2TB, HDD 4TB, HDD 2TB
and so on for buyer line item 2) i7 data...
As you can see, this gives me jumbled data, which is not usable for someone viewing the Excel sheet.
NOTE: I understand that I could store each line item in a separate array and use a for loop to compare the corresponding vendor data with the buyer data.
But the problem is that the buyer and vendor data each come from a single unstructured file (one file from the buyer and another from the vendor, each with more than one line item).
How do I know which buyer line item should be compared with which of the vendor's multiple line items?
For example: the i5 processor in buyer line item 1) could be matched against i7, i9, and i5 because the strings are very similar, but it needs to be compared only with the i5 of the corresponding line item, not with line item 2 or the others in the vendor file. I can partly achieve this with structured data, but only if the data format in the file never changes.
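To illustrate the pairing being asked about, here is a minimal sketch that scores whole line items instead of individual specs: for each buyer line item it sums the per-spec Levenshtein distances against every vendor line item and picks the vendor item with the smallest total. It assumes the line items have already been split out of the files and that specs appear in the same order on both sides, which is exactly the part that is hard with unstructured input. The helper names `split_specs`, `item_distance`, and `best_vendor_match` are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def split_specs(line):
    """Split a comma-separated line item into trimmed spec strings."""
    return [part.strip() for part in line.split(",") if part.strip()]


def item_distance(buyer_item, vendor_item):
    """Sum of case-insensitive edit distances between specs paired by position."""
    return sum(levenshtein(b.lower(), v.lower()) for b, v in zip(buyer_item, vendor_item))


def best_vendor_match(buyer_item, vendor_items):
    """Index of the vendor line item with the smallest total distance to buyer_item."""
    return min(range(len(vendor_items)), key=lambda i: item_distance(buyer_item, vendor_items[i]))


buyer_items = [split_specs(s) for s in (
    "i5 processor,ram 8gb, SSD 120GB, HDD 1TB",
    "i7 processor,ram 16gb, SSD 420GB, HDD 2TB",
)]
vendor_items = [split_specs(s) for s in (
    "i7 processor, ram 16gb, SSD 220gb, HDD 2TB",
    "i9 processor, ram 32gb, SSD 500gb, HDD 4TB",
    "i5 processor, ram 8gb, SSD 120gb, HDD 2TB",
)]

# Zero-based index of the chosen vendor line item for each buyer line item.
matches = [best_vendor_match(item, vendor_items) for item in buyer_items]
print(matches)  # [2, 0]: buyer 1 pairs with vendor 3, buyer 2 with vendor 1
```

With the sample data, buyer line item 1 pairs with vendor line item 3 (total distance 1) and buyer line item 2 pairs with vendor line item 1 (total distance 1). Edit distance compares characters rather than numeric values, and nothing here stops two buyer items from choosing the same vendor item, so this is only a starting point.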
I hope my explanation is clear.