Power Query lookup with "contain" condition on large dataset

Xin 66 Reputation points
2021-06-27T11:39:32.093+00:00

Hi there,

Just wondering if anyone could assist me on the below exercise:

  1. Table A has only Clean ID (123456789) or (414141)
  2. Table B has column1, complex ID (123456789abc5489 or qwe414141) and column 2 which is result I want to return (e.g. Rain, Sunny etc)
  3. Would like to perform "vlookup" for Table A and search for items in table B column 1 only rows that contain EXACT string in table A and, if in such case, return value in column 2
  4. So effective would see
    ID..........................Results
    123456789.........Rain
    414141.................Sunny
    1. So far have tried:
      a. standard merge - doesn't work as not exact match
      b. merge with fuzzy match - doesnt work even threshold is 0.1, just doesnt return good results
      c. List contain - doesnt work as it will list final table A each row so many times and return wrong values for some very weird reason..
      d. interestingly, xlookup with ""&column A&"" does work .. but the data set 1 is around 1.5million rows and data set 2 is over 10millions rows so cant perform such xlookup

Highly appreciate for any suggestions!!

Not Monitored
Not Monitored
Tag not monitored by Microsoft.
37,800 questions
{count} vote

Accepted answer
  1. Lz._ 8,991 Reputation points
    2021-07-03T13:42:21.437+00:00

    Hi @Xin

    Technically the method is still try to cleanse table B to "exact" match with table A, just wondering if there is any function that would perform "contain" search (i.e. sub text string of B matches A). Not sure if that would drag the performance by too much?
    Any existing or custom function involving something like "contains" can not work. As an example, if TableA has [Clean ID] 12345 and somewhere in TableB a [Complex ID] is something like "123456789*abc5489", **12345 would be found in that [Complex ID] and what you want is an EXACT match

    Just wondering I can see a fx-rephrase being added. Therefore I assume this is to cleanse the ID in table B before can be looked up?
    Not sure where you found fx-rephrase in what I uploaded. The function I created was named fxParseComplexID. If that's what you talk about, then Yes that's what the function does, inside query UnMatchedComplexClean

    I think it works well so far but just also thinking there are probably 10,20 different ways how column B ID can be presented therefore shall I just keep adding cleanse methods?

    • I told you to upload at least 20 different and representative [Complex ID]. You did not, so I did with the few example you provided
    • Sorry but I don't really understand, especially "shall I just keep adding cleanse methods". Please clarify

2 additional answers

Sort by: Most helpful
  1. Shinka 131 Reputation points
    2021-07-05T11:39:21.3+00:00

    Hi @Xin ,

    As long as CleanID in Table A is always numbers, then I'd suggest trying the following (always in PQ):

    1) in Table B add a costum column with the follwing formula: Text.Select([ComplexID], {"0".."9"}). This formula will extract only the numbers, ignoring everything else, into the new column.
    2) You should then have [NewColumn1], the the CleanID, and [Column2], the info you want, in the same table.

    Hope that works.

    1 person found this answer helpful.

  2. Lz._ 8,991 Reputation points
    2021-06-28T11:16:21.043+00:00

    Hi @Xin

    Based on the very few examples of [Complex ID] you provided the query in this workbook does here

    Test
    I created a CSV of 1 Million of rows to test the performances of the Parsing function and in average the latter did the job in 12s (Intel Core I7 + SSD), including the preceeding Table.ReplaceValue ("*" with "-")

    Settings
    In the Query Options/CURRENT WORKBOOK

    • Unchecked "Allow data previews to download in the background"
    • Seelected option "Ignore the Privacy Levels and potentially improve performance"
      If by the greatest of chance the approach does +/- what you expect, before applying it in real conditions, ensure you use the above settings

    Suggestion
    You intially said "data set 1 is around 1.5million rows", later you said "Given Table A is ALWAYS below 1 million rows"... In the event where some [Clean ID] would match more than one record (as in my example with [Clean ID] 123456789) in your TableB you'll likely exceed the max. number of rows of an Excel worksheet ==> Better you load to the Data Model (you can export tables from the Data Model as CSV with DAX Studio if necessary) from where you can setup a flattened Pivot Table that will be the same as the Table returned by the query