Extracting unstructured data and comparison (AI/ML)

Vibu 1 Reputation point
2021-11-26T06:21:04.057+00:00

Challenge 1:

Extracting data from unstructured files. For example, I have attached a file here: https://docs.google.com/document/d/1DJb1p0o5---xs1Hfo0PCzG2mt6FvooRD/edit?usp=sharing&ouid=107579116880049687042&rtpof=true&sd=true

As you can see in the file above, the data looks like a table, but it is really plain text typed out in a table-like layout.

This, by contrast, is a proper table format, from which I have no problem extracting data: https://drive.google.com/file/d/1xS_GlLskINtFjDXXEMB6J0f_sXQRJJUE/view?usp=sharing

I have worked with the unstructured data, loaded it into a data frame, rearranged it into a proper table format, and made it ready for extraction.
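To make the "text typed in table format" step concrete, here is a minimal sketch of how such text could be split into rows and cells. It assumes that cells are separated by a tab or by two or more spaces and that single spaces belong to a cell; the sample text and the function name `split_text_table` are made up for illustration and are not taken from the attached file.

```python
import re

# Assumption: a tab or a run of 2+ spaces separates cells;
# a single space is part of the cell text (e.g. "Unit Price").
CELL_GAP = re.compile(r"\t+| {2,}")

def split_text_table(text):
    """Split whitespace-aligned text into a list of rows (lists of cells)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # ignore blank lines between rows
            continue
        rows.append([cell.strip() for cell in CELL_GAP.split(line)])
    return rows

# Hypothetical sample, only to show the shape of the input.
sample = """
Item            Quantity   Unit Price

Laptop i5       2          45000
Mouse           10         350
"""

rows = split_text_table(sample)
header, body = rows[0], rows[1:]
print(header)
print(body)
```

The resulting `header` and `body` lists can then be passed to a data frame constructor. Blank lines between rows are skipped, so extra spacing between rows would not change the result under these assumptions.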

But the problem is that in the future I will not receive this same unstructured format. It will change, either with spaces between rows or with entirely different column names. With such continuous changes, I can't update my code for each and every file!

Below are the columns from the unstructured data file linked above:

S Qty
item
Place of supply
HSN/SAC
Quantity
Unit Price
Net Price
TAX TYPE
Tax Rate
Tax Amount
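For the changing column names, one possible direction is to map whatever header text arrives onto a fixed list of expected names by fuzzy matching, so the code does not need to change when spacing, case, or small wording differences appear. The sketch below uses `difflib.get_close_matches` from the Python standard library; the `CANONICAL` list is copied from the columns above, while the `cutoff` of 0.6 and the helper names are assumptions that would need tuning on real files.

```python
import difflib
import re

# Expected column names, taken from the list in this question.
CANONICAL = [
    "S Qty", "item", "Place of supply", "HSN/SAC", "Quantity",
    "Unit Price", "Net Price", "TAX TYPE", "Tax Rate", "Tax Amount",
]

def normalize(name):
    """Lowercase and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

# Normalized form -> canonical spelling.
LOOKUP = {normalize(c): c for c in CANONICAL}

def map_header(raw, cutoff=0.6):
    """Return the closest canonical column name, or None if nothing is close."""
    match = difflib.get_close_matches(normalize(raw), list(LOOKUP), n=1, cutoff=cutoff)
    return LOOKUP[match[0]] if match else None

print(map_header("TAX  TYPE"))
print(map_header("hsn / sac"))
print(map_header("Unit price (INR)"))
```

Headers that return `None` can be logged for manual review instead of being silently dropped. Entirely different names (for example an abbreviation with few shared characters) will not match by string similarity alone and would need an alias table.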

Challenge 2:

Comparing data after successful extraction (structured data):

Let's say there is a buyer and there are multiple vendors, and the buyer gives requirements to a vendor:

Buyer requirements:

i5 processor, ram 8gb, SSD 120GB, HDD 1TB

The vendor gives the available specs to the buyer:

i7 processor, ram 16gb, SSD 220gb, HDD 2TB

In the above scenario, it's easy to compare the buyer's requirements against the vendor's specs.

So I can produce output like this in an Excel sheet:

Buyer | Vendor
i5 processor | i7 processor
ram 8GB | ram 16gb
SSD 120GB | SSD 220gb
HDD 1TB | HDD 2TB

I'm using Levenshtein distance for this scenario. It measures the edit distance between two strings, which gives a relative measure of how similar they are.
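For reference, this is a minimal sketch of Levenshtein distance in plain Python (the standard dynamic-programming version, not necessarily the library I'm using), plus a normalized similarity score. The `similarity` helper and its lowercasing are assumptions added for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Similarity in [0, 1]; case and surrounding whitespace are ignored."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(levenshtein("i5 processor", "i7 processor"))  # 1
print(levenshtein("i5 processor", "i9 processor"))  # 1
print(similarity("SSD 120GB", "SSD 120gb"))         # 1.0
```

Note that "i5 processor" is only one edit away from both "i7 processor" and "i9 processor", so string distance alone cannot tell which vendor line item a spec belongs to.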

But below is the challenge I'm facing:

Buyer requirements:

1) i5 processor,ram 8gb, SSD 120GB, HDD 1TB

2) i7 processor,ram 16gb, SSD 420GB, HDD 2TB

The vendor gives the available specs to the buyer:

1) i7 processor, ram 16gb, SSD 220gb, HDD 2TB

2) i9 processor, ram 32gb, SSD 500gb, HDD 4TB

3) i5 processor, ram 8gb, SSD 120gb, HDD 2TB

Below is the output I get from this comparison:

Buyer | Vendor
i5 processor | i7 processor, i9 processor, i5 processor
ram 8GB | ram 16gb, ram 32gb, ram 8gb
SSD 120GB | SSD 220gb, SSD 500gb, SSD 120gb
HDD 1TB | HDD 2TB, HDD 4TB, HDD 2TB

...and so on for buyer line item 2) (the i7 data).

As you can see, this gives me jumbled data, which is not usable for people viewing the Excel sheet.

NOTE: I understand that I can store each item in a separate array and use a for loop to compare the corresponding vendor data with the buyer data.

But the problem is that I get the vendor and buyer data from unstructured files (a single file from the buyer and another single file from a vendor, each with more than one line item).

How do I know which buyer line item should be compared with which of the vendor's multiple line items?

For example, the buyer's line item 1) i5 processor could be compared with i7, i9, or i5 because the strings are very similar, but it needs to be compared only with the i5 of line item 1, not with line item 2 or the others in the vendor file. I can somewhat achieve this with structured data, as long as the data format in a file never changes.
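To illustrate what I am trying to achieve, here is a rough sketch of one possible approach, under the assumption that each line item has already been extracted as one string: keep every line item together as a unit, score each buyer line item against each whole vendor line item, and only then pair the individual specs inside the best-matching vendor item. All function names here are hypothetical, and the averaging rule in `item_score` is just one choice.

```python
def levenshtein(a, b):
    """Edit distance via the usual dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def parse_line_item(text):
    """'1) i5 processor,ram 8gb' -> ['i5 processor', 'ram 8gb'] (lowercased)."""
    head, sep, rest = text.partition(")")
    body = rest if sep and head.strip().isdigit() else text
    return [spec.strip().lower() for spec in body.split(",") if spec.strip()]

def pair_specs(buyer_specs, vendor_specs):
    """Pair each buyer spec with its closest spec inside ONE vendor line item."""
    return [(b, max(vendor_specs, key=lambda v: similarity(b, v)))
            for b in buyer_specs]

def item_score(buyer_specs, vendor_specs):
    """Average similarity of the paired specs; 0.0 if either side is empty."""
    if not buyer_specs or not vendor_specs:
        return 0.0
    pairs = pair_specs(buyer_specs, vendor_specs)
    return sum(similarity(b, v) for b, v in pairs) / len(pairs)

def best_vendor_item(buyer_specs, vendor_items):
    """Index of the vendor line item that matches the buyer line item best."""
    return max(range(len(vendor_items)),
               key=lambda i: item_score(buyer_specs, vendor_items[i]))

# Line items copied from the example in this question.
buyer_items = [parse_line_item(t) for t in [
    "1) i5 processor,ram 8gb, SSD 120GB, HDD 1TB",
    "2) i7 processor,ram 16gb, SSD 420GB, HDD 2TB",
]]
vendor_items = [parse_line_item(t) for t in [
    "1) i7 processor, ram 16gb, SSD 220gb, HDD 2TB",
    "2) i9 processor, ram 32gb, SSD 500gb, HDD 4TB",
    "3) i5 processor, ram 8gb, SSD 120gb, HDD 2TB",
]]

for n, buyer in enumerate(buyer_items, start=1):
    idx = best_vendor_item(buyer, vendor_items)
    print(f"Buyer item {n} -> vendor item {idx + 1}")
    for b, v in pair_specs(buyer, vendor_items[idx]):
        print(f"  {b} | {v}")
```

With the example data, buyer line item 1 pairs with vendor line item 3 and buyer line item 2 pairs with vendor line item 1, so each Excel row would hold one buyer spec and one vendor spec instead of a mix from all vendor items. This still depends on being able to split the file into line items first, which is the part I cannot do reliably when the format changes.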

I hope my explanation is clear.

Office Development