AD LDS - LDAP query with a non-ASCII character gets reinterpreted and returns incorrect results

Stephan Steiner 1 Reputation point
2021-10-25T19:24:23.297+00:00

I'm running this query against an AD LDS:

(sn=bäch*)

The results contain entries where sn starts with 'bäch' as well as entries where sn starts with 'bach'.

Is there a way to tell AD LDS to stop doing this, so that it only returns entries where sn starts with 'bach' when the LDAP query is (sn=bach*)?

Active Directory

7 answers

  1. Gary Nebbett 5,761 Reputation points
    2021-10-30T09:05:24.337+00:00

    Hello @GaryReynolds-8098,

    Thank you for some very interesting research; it is unfortunate that the time difference (CET (Switzerland) for me and presumably one of the Australian zones for you) means that we can only exchange one update per day :-(

    I found that article too, but I did not cite it since it is rather lacking in context; the "contextual" information that I found useful was that this functionality is possibly just intended to support the Microsoft Exchange Global Address List.

    I think that the event entries referenced in the article are logged when the NTDS service starts and finds a language identifier in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NTDS\Language for a language pack that is not installed - it is not logged when an ordering rule is used.

    I took a different approach and wrote a program that calls LCMapStringEx with flags of NORM_IGNORECASE | NORM_IGNORENONSPACE | LCMAP_SORTKEY | SORT_STRINGSORT | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH and configurable locale and string and then dumps the result. Here are some examples:

    LCMapStringEx de-DE Gäry
    0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00

    LCMapStringEx de-DE Gary
    0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00

    LCMapStringEx se-SE Gary
    0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00

    LCMapStringEx se-SE Gäry
    0E-25-0E-AF-0E-8A-0E-A7-01-01-01-01-00
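
The difference is easier to see when the dumps are compared weight by weight. The following is a minimal Python sketch that works only on the four hex dumps quoted above; it does not call LCMapStringEx itself. It assumes (this is not stated in the thread) that the bytes before the first 0x01 separator are two-byte primary weights, one pair per character. The helper names parse_key, primary_weights and differing_positions are made up for illustration.

```python
# Sort key dumps copied from the LCMapStringEx output above.
DUMPS = {
    ("de-DE", "Gäry"): "0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00",
    ("de-DE", "Gary"): "0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00",
    ("se-SE", "Gary"): "0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00",
    ("se-SE", "Gäry"): "0E-25-0E-AF-0E-8A-0E-A7-01-01-01-01-00",
}


def parse_key(dump):
    """Convert a dash-separated hex dump into a bytes object."""
    return bytes(int(part, 16) for part in dump.split("-"))


def primary_weights(key):
    """Return the byte pairs that precede the first 0x01 separator.

    Assumption: each character contributes one two-byte primary weight.
    """
    primary = key[:key.index(0x01)]
    return [(primary[i], primary[i + 1]) for i in range(0, len(primary), 2)]


def differing_positions(a, b):
    """Return the character positions at which two weight lists differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


for locale in ("de-DE", "se-SE"):
    plain = primary_weights(parse_key(DUMPS[(locale, "Gary")]))
    accented = primary_weights(parse_key(DUMPS[(locale, "Gäry")]))
    print(locale, differing_positions(plain, accented))
```

For de-DE the two keys are identical, so no position differs. For se-SE only the second character differs: (14, 2) for "a" against (14, 175) for "ä", which is why a comparison based on these keys tells the two names apart only under the second locale.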

    As you say, the reference that you found probably explains the behaviour:

    In Swedish, for example, some vowels with an accent sort after "Z," whereas in other European countries the same accented vowel comes right after the non-diacritic vowel.

    Gary


  2. Gary Reynolds 9,396 Reputation points
    2021-10-30T23:37:47.753+00:00

    Hi Gary

    Yeah I'm Melbourne based, so the time difference can be a pain for Europe.

    I did find this post, which led me to Sorting Weight Table Order, to section 3.1.5.2.3.1 of MS-UCODEREF, and to the download of the sorting table. The table contains the sorting details for {0000001A-57EE-1E5C-00B4-D0000BB1E11E}, the sort GUID used for Swedish, and this is the data used to build the sort key returned by the LCMapStringEx function:

     SORTGUID 0000001A-57EE-1E5C-00B4-D0000BB1E11E  
     LOCALENAME fi-FI        ;Finnish - fi-FI  
     LOCALENAME sv-SE        ;Swedish - sv-SE  
     LOCALENAME sv-FI        ;Swedish - Finland - sv-FI  
     LOCALENAME fi           ;Finnish - fi  
     LOCALENAME sv           ;Swedish - sv  
      
     TWO 14  
      
    0x0077 0x0302 14 162 18 2 ;w Circumflex  
    0x0057 0x0302 14 162 18 18 ;W Circumflex  
    0x0075 0x030b 14 167 27 2 ;u Double Acute  
    0x0055 0x030b 14 167 27 18 ;U Double Acute  
    0x0075 0x0308 14 167 123 2 ;u Diaeresis  
    0x0055 0x0308 14 167 123 18 ;U Diaeresis  
    0x0061 0x030a 14 173 2 2 ;a Ring  
    0x0041 0x030a 14 173 2 18 ;A Ring  
    0x0061 0x0308 14 175 2 2 ;a Diaeresis  
    0x0041 0x0308 14 175 2 18 ;A Diaeresis  
    0x006f 0x0308 14 176 2 2 ;o Diaeresis  
    0x004f 0x0308 14 176 2 18 ;O Diaeresis  
    0x006f 0x030b 14 176 27 2 ;o Double Acute  
    0x004f 0x030b 14 176 27 18 ;O Double Acute  
    

    I think this also answers the question of which languages have the LINGUISTIC_IGNOREDIACRITIC option set: it appears to be based on the CW flag attribute and could be set on a per-character basis.

    ; CW Values:  
    ;   0-1 - reserved for delimiter/terminator  
    ;   0x01 bit - Full Width (if set). (1 == Full Width, 0 == Half/Normal Width)  
    ;   0x02 bit - Set by default, can be cleared in some case (if another higher bit is set)  
    ;   0x04 bit - Super/Subscript?  
    ;   0x08 bit -  
    ;   0x10 bit - Upper Case.  (16 == upper case, 0 == lower case)  
    ;   0x20 bit -  
    ;   0x40/0x80 bits - Reserved for nlstrans, which uses these as flags for characters that may compress  
    ;  
    ; Flags to NLSTrans  
    ; After the GUID  
    ;     HAS_3_BYTE_WEIGHTS - This ID has 3 byte weights.  Must be tagged on the EXCEPTION AND on the COMPRESSION  
    ;     LINGUISTIC_CASING  - Tagged (only) on the EXCEPTION table that applies to the linguistic case.  
    ;                          Should also have another untagged EXCEPTION table for non-linguistic casing.  
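
To make the CW bits concrete, here is a small Python sketch that splits one line of the table excerpt above into its fields and decodes the last column using the bit meanings listed in the comments. The field names script, primary and diacritic are my own labels for the three numbers before CW and are assumptions, as are the names WeightEntry, parse_entry and describe_cw; only the bits documented in the comments (0x01, 0x02, 0x10) are decoded.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Bit meanings taken from the "CW Values" comments in the sorting table.
CW_BITS = [(0x01, "full width"), (0x02, "default"), (0x10, "upper case")]


@dataclass
class WeightEntry:
    code_points: Tuple[int, ...]  # the 0x.... fields at the start of the line
    script: int                   # first decimal field (label is an assumption)
    primary: int                  # second decimal field (label is an assumption)
    diacritic: int                # third decimal field (label is an assumption)
    cw: int                       # last decimal field, the CW value
    comment: str                  # text after the semicolon


def parse_entry(line: str) -> WeightEntry:
    """Parse a line such as '0x0061 0x0308 14 175 2 2 ;a Diaeresis'."""
    data, _, comment = line.partition(";")
    fields = data.split()
    code_points = tuple(int(f, 16) for f in fields if f.startswith("0x"))
    script, primary, diacritic, cw = (
        int(f) for f in fields if not f.startswith("0x")
    )
    return WeightEntry(code_points, script, primary, diacritic, cw, comment.strip())


def describe_cw(cw: int) -> List[str]:
    """Return the names of the documented CW bits that are set."""
    return [name for bit, name in CW_BITS if cw & bit]


for line in (
    "0x0061 0x0308 14 175 2 2 ;a Diaeresis",
    "0x0041 0x0308 14 175 2 18 ;A Diaeresis",
):
    entry = parse_entry(line)
    print(entry.comment, entry.primary, describe_cw(entry.cw))
```

Read this way, the lower-case and upper-case entries for "a Diaeresis" share the same first three weights and differ only in CW: 2 has just the default bit set, while 18 (0x12) adds the upper-case bit.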
    

    I think that's probably far enough for this one. I don't think there is an option to change the sorting table, but at least we have identified an option to change the LDAP sorting order if required.

    Gary.
