AD LDS - LDAP query with non-ASCII character gets re-interpreted and returns incorrect results

Stephan Steiner 1

I'm running this query against an AD LDS:

(sn=bäch*)

The results include entries where sn starts with 'bäch' as well as entries where sn starts with 'bach'.

Is there a way to tell LDS to stop doing this, so that entries where sn starts with 'bach' are only returned if the LDAP query is (sn=bach*)?
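
To make the symptom concrete, here is a minimal Python sketch that contrasts a diacritic-insensitive prefix match with a strict one. It is only an approximation: stripping combining marks after NFD decomposition stands in for whatever comparison the server really performs, and the function names and sample surnames are made up for illustration. The strict variant could be applied client-side to the returned entries as a workaround; it is not a server setting.

```python
import unicodedata


def strip_marks(text):
    """Drop nonspacing combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def loose_prefix_match(value, prefix):
    """Approximation of a case- and diacritic-insensitive prefix match."""
    return strip_marks(value).casefold().startswith(strip_marks(prefix).casefold())


def strict_prefix_match(value, prefix):
    """Case-insensitive prefix match that keeps diacritics significant."""
    def norm(s):
        # NFC so that composed and decomposed spellings of 'ä' compare equal.
        return unicodedata.normalize("NFC", s).casefold()
    return norm(value).startswith(norm(prefix))


if __name__ == "__main__":
    # Hypothetical sn values, as they might come back from (sn=bäch*).
    entries = ["Bächler", "Bachmann", "Bäch", "Becker"]
    print([e for e in entries if loose_prefix_match(e, "bäch")])
    print([e for e in entries if strict_prefix_match(e, "bäch")])
```

The first list keeps "Bachmann", which mirrors the unwanted hits; the second list drops it.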

7 answers

Gary Nebbett 5,761 Reputation points

2021-10-30T09:05:24.337+00:00

Hello @GaryReynolds-8098,

Thank you for some very interesting research; it is unfortunate that the time difference (CET (Switzerland) for me and presumably one of the Australian zones for you) means that we can only exchange one update per day :-(

I found that article too, but I did not cite it since it is rather lacking in context; the "contextual" information that I found useful was that this functionality is possibly just intended to support the Microsoft Exchange Global Address List.

I think that the event entries referenced in the article are logged when the NTDS service starts and finds a language identifier in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NTDS\Language for a language pack that is not installed - it is not logged when an ordering rule is used.

I took a different approach and wrote a program that calls LCMapStringEx with the flags NORM_IGNORECASE | NORM_IGNORENONSPACE | LCMAP_SORTKEY | SORT_STRINGSORT | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH and a configurable locale and string, and then dumps the resulting sort key. Here are some examples:

LCMapStringEx de-DE Gäry
0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00

LCMapStringEx de-DE Gary
0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00

LCMapStringEx se-SE Gary
0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00

LCMapStringEx se-SE Gäry
0E-25-0E-AF-0E-8A-0E-A7-01-01-01-01-00
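
For anyone who wants to reproduce dumps like these, the following is a rough Python sketch using ctypes. It is not the program used above, just one way such a call might look. The flag values are the ones defined in winnls.h; the helper names format_key and sort_key are made up here, and the call itself only works on Windows.

```python
import ctypes
import sys

# Flag values as defined in the Windows SDK header winnls.h.
NORM_IGNORECASE = 0x00000001
NORM_IGNORENONSPACE = 0x00000002
NORM_IGNOREKANATYPE = 0x00010000
NORM_IGNOREWIDTH = 0x00020000
LCMAP_SORTKEY = 0x00000400
SORT_STRINGSORT = 0x00001000

FLAGS = (NORM_IGNORECASE | NORM_IGNORENONSPACE | LCMAP_SORTKEY
         | SORT_STRINGSORT | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH)


def format_key(key):
    """Render a sort key as dash-separated hex bytes, like the dumps above."""
    return "-".join(f"{b:02X}" for b in key)


def sort_key(locale, text, flags=FLAGS):
    """Return the sort key bytes produced by LCMapStringEx (Windows only)."""
    if sys.platform != "win32":
        raise OSError("LCMapStringEx is only available on Windows")
    fn = ctypes.WinDLL("kernel32", use_last_error=True).LCMapStringEx
    fn.restype = ctypes.c_int
    fn.argtypes = [ctypes.c_wchar_p, ctypes.c_uint32, ctypes.c_wchar_p,
                   ctypes.c_int, ctypes.c_void_p, ctypes.c_int,
                   ctypes.c_void_p, ctypes.c_void_p, ctypes.c_ssize_t]
    # cchSrc counts UTF-16 code units, not Python code points.
    src_len = len(text.encode("utf-16-le")) // 2
    # A first call without a buffer asks for the required size in bytes.
    size = fn(locale, flags, text, src_len, None, 0, None, None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    buf = ctypes.create_string_buffer(size)
    written = fn(locale, flags, text, src_len, buf, size, None, None, 0)
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf.raw[:written]


if __name__ == "__main__" and sys.platform == "win32":
    for locale in ("de-DE", "sv-SE"):
        for word in ("Gary", "Gäry"):
            print(locale, word, format_key(sort_key(locale, word)))
```

Two strings compare as equal under a given locale and flag set exactly when their sort keys are byte-for-byte identical, which is why the dumps are a convenient way to see what a diacritic-insensitive comparison does.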

As you say, the reference that you found probably explains the behaviour:

In Swedish, for example, some vowels with an accent sort after "Z," whereas in other European countries the same accented vowel comes right after the non-diacritic vowel.

Gary

Gary Reynolds 9,396

Hi Gary

Yeah I'm Melbourne based, so the time difference can be a pain for Europe.

I did find this post, which led me to Sorting Weight Table Order, to section 3.1.5.2.3.1 of MS-UCODEREF, and to the download of the sorting table. That table contains the sorting details for the sort GUID {0000001A-57EE-1E5C-00B4-D0000BB1E11E}, which is used for Swedish sorting and which feeds the sort key returned by the LCMapStringEx function:

 SORTGUID 0000001A-57EE-1E5C-00B4-D0000BB1E11E  
 LOCALENAME fi-FI        ;Finnish - fi-FI  
 LOCALENAME sv-SE        ;Swedish - sv-SE  
 LOCALENAME sv-FI        ;Swedish - Finland - sv-FI  
 LOCALENAME fi           ;Finnish - fi  
 LOCALENAME sv           ;Swedish - sv  
  
 TWO 14  
  
0x0077 0x0302 14 162 18 2 ;w Circumflex  
0x0057 0x0302 14 162 18 18 ;W Circumflex  
0x0075 0x030b 14 167 27 2 ;u Double Acute  
0x0055 0x030b 14 167 27 18 ;U Double Acute  
0x0075 0x0308 14 167 123 2 ;u Diaeresis  
0x0055 0x0308 14 167 123 18 ;U Diaeresis  
0x0061 0x030a 14 173 2 2 ;a Ring  
0x0041 0x030a 14 173 2 18 ;A Ring  
0x0061 0x0308 14 175 2 2 ;a Diaeresis  
0x0041 0x0308 14 175 2 18 ;A Diaeresis  
0x006f 0x0308 14 176 2 2 ;o Diaeresis  
0x004f 0x0308 14 176 2 18 ;O Diaeresis  
0x006f 0x030b 14 176 27 2 ;o Double Acute  
0x004f 0x030b 14 176 27 18 ;O Double Acute
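
As a quick cross-check, the 'a Diaeresis' row above can be lined up against the two Swedish sort keys dumped earlier in the thread. The sketch below is Python and rests on one assumption that is mine, not something stated in the table: that the first two decimal columns of a row (14 175) are the bytes that show up in the sort key. The helper names are made up.

```python
def parse_weight_row(row):
    """Split a row such as '0x0061 0x0308 14 175 2 2 ;a Diaeresis'."""
    data, _, comment = row.partition(";")
    fields = data.split()
    code_points = [int(f, 16) for f in fields if f.lower().startswith("0x")]
    weights = [int(f) for f in fields if not f.lower().startswith("0x")]
    chars = "".join(chr(cp) for cp in code_points)
    return chars, weights, comment.strip()


def parse_key(dump):
    """Turn a dump like '0E-25-...' back into bytes."""
    return bytes(int(part, 16) for part in dump.split("-"))


row = "0x0061 0x0308 14 175 2 2 ;a Diaeresis"
chars, weights, name = parse_weight_row(row)

key_plain = parse_key("0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00")   # Gary
key_umlaut = parse_key("0E-25-0E-AF-0E-8A-0E-A7-01-01-01-01-00")  # Gäry

# Positions at which the two keys differ.
diff = [i for i, (a, b) in enumerate(zip(key_plain, key_umlaut)) if a != b]

# Assumption: the first two weight columns appear verbatim in the key.
pair = bytes(weights[:2])

print(name, weights)
print("keys differ at byte offsets:", diff)
print("pair in Gäry key:", pair in key_umlaut, "| in Gary key:", pair in key_plain)
```

The only differing byte is the one where 0x02 becomes 0xAF, and 0xAF is 175 in decimal, which matches the 14 175 pair in the row.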

I think this also answers the question of which languages have the LINGUISTIC_IGNOREDIACRITIC option set: it appears to be based on the CW flag attribute and could apply on a per-character basis:

; CW Values:  
;   0-1 - reserved for delimiter/terminator  
;   0x01 bit - Full Width (if set). (1 == Full Width, 0 == Half/Normal Width)  
;   0x02 bit - Set by default, can be cleared in some case (if another higher bit is set)  
;   0x04 bit - Super/Subscript?  
;   0x08 bit -  
;   0x10 bit - Upper Case.  (16 == upper case, 0 == lower case)  
;   0x20 bit -  
;   0x40/0x80 bits - Reserved for nlstrans, which uses these as flags for characters that may compress  
;  
; Flags to NLSTrans  
; After the GUID  
;     HAS_3_BYTE_WEIGHTS - This ID has 3 byte weights.  Must be tagged on the EXCEPTION AND on the COMPRESSION  
;     LINGUISTIC_CASING  - Tagged (only) on the EXCEPTION table that applies to the linguistic case.  
;                          Should also have another untagged EXCEPTION table for non-linguistic casing.
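
To read the CW values in the table rows, a small decoder along these lines may help. It is a Python sketch that only covers the bits the comment block above actually names, and it assumes (my reading, not stated in the file excerpt) that the last numeric column of each row is the CW value. The labels and the function name are made up.

```python
# Bits named in the "CW Values" comment; unnamed bits (0x08, 0x20) and the
# reserved 0x40/0x80 bits are deliberately left out.
CW_BITS = {
    0x01: "full width",
    0x02: "default bit",
    0x04: "super/subscript?",
    0x10: "upper case",
}


def decode_cw(cw):
    """Return the labels of the known bits that are set in a CW value."""
    return [label for bit, label in CW_BITS.items() if cw & bit]


if __name__ == "__main__":
    # The table rows end in 2 for lower-case and 18 for upper-case letters.
    for cw in (2, 18):
        print(cw, hex(cw), decode_cw(cw))
```

With that reading, 2 is just the default bit and 18 (0x12) is the default bit plus the upper-case bit, which fits the lower-case/upper-case row pairs in the table.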

I think that's probably far enough for this one. I don't think there is an option to change the sorting table, but at least we have identified an option to change the LDAP sorting order if required.

Gary.
