Power Query lookup with "contain" condition on large dataset

Question

Hi there,

Just wondering if anyone could assist me on the below exercise:

Table A has only Clean ID (123456789) or (414141)
Table B has column1, complex ID (123456789abc5489 or qwe414141) and column 2 which is result I want to return (e.g. Rain, Sunny etc)
Would like to perform "vlookup" for Table A and search for items in table B column 1 only rows that contain EXACT string in table A and, if in such case, return value in column 2
So effective would see
ID..........................Results
123456789.........Rain
414141.................Sunny
1. So far have tried:
  a. standard merge - doesn't work as not exact match
  b. merge with fuzzy match - doesnt work even threshold is 0.1, just doesnt return good results
  c. List contain - doesnt work as it will list final table A each row so many times and return wrong values for some very weird reason..
  d. interestingly, xlookup with ""&column A&"" does work .. but the data set 1 is around 1.5million rows and data set 2 is over 10millions rows so cant perform such xlookup

Highly appreciate for any suggestions!!

Accepted Answer

Technically the method is still try to cleanse table B to "exact" match with table A, just wondering if there is any function that would perform "contain" search (i.e. sub text string of B matches A). Not sure if that would drag the performance by too much?
Any existing or custom function involving something like "contains" can not work. As an example, if TableA has [Clean ID] 12345 and somewhere in TableB a [Complex ID] is something like "123456789*abc5489", **12345 would be found in that [Complex ID] and what you want is an EXACT match

Just wondering I can see a fx-rephrase being added. Therefore I assume this is to cleanse the ID in table B before can be looked up?
Not sure where you found fx-rephrase in what I uploaded. The function I created was named fxParseComplexID. If that's what you talk about, then Yes that's what the function does, inside query UnMatchedComplexClean

I think it works well so far but just also thinking there are probably 10,20 different ways how column B ID can be presented therefore shall I just keep adding cleanse methods?

I told you to upload at least 20 different and representative [Complex ID]. You did not, so I did with the few example you provided
Sorry but I don't really understand, especially "shall I just keep adding cleanse methods". Please clarify

Answer

Hi @Xin ,

As long as CleanID in Table A is always numbers, then I'd suggest trying the following (always in PQ):

1) in Table B add a costum column with the follwing formula: Text.Select([ComplexID], {"0".."9"}). This formula will extract only the numbers, ignoring everything else, into the new column.
2) You should then have [NewColumn1], the the CleanID, and [Column2], the info you want, in the same table.

Hope that works.

Answer

Hi @Xin

Based on the very few examples of [Complex ID] you provided the query in this workbook does here

Test
I created a CSV of 1 Million of rows to test the performances of the Parsing function and in average the latter did the job in 12s (Intel Core I7 + SSD), including the preceeding Table.ReplaceValue ("*" with "-")

Settings
In the Query Options/CURRENT WORKBOOK

Unchecked "Allow data previews to download in the background"
Seelected option "Ignore the Privacy Levels and potentially improve performance"
If by the greatest of chance the approach does +/- what you expect, before applying it in real conditions, ensure you use the above settings

Suggestion
You intially said "data set 1 is around 1.5million rows", later you said "Given Table A is ALWAYS below 1 million rows"... In the event where some [Clean ID] would match more than one record (as in my example with [Clean ID] 123456789) in your TableB you'll likely exceed the max. number of rows of an Excel worksheet ==> Better you load to the Data Model (you can export tables from the Data Model as CSV with DAX Studio if necessary) from where you can setup a flattened Pivot Table that will be the same as the Table returned by the query

Share via

Power Query lookup with "contain" condition on large dataset

2 additional answers