AD LDS - LDAP query with non-ASCII character gets re-interpreted and returns incorrect results.

Stephan Steiner 1 Reputation point
2021-10-25T19:24:23.297+00:00

I'm running this query against an AD LDS:

(sn=bäch*)

The results contain both entries where sn starts with 'bäch' and entries where sn starts with 'bach'.

Is there a way to tell LDS to cut this out and only return items where the sn starts with 'bach' if the LDAP query is (sn=bach*)?

Active Directory

7 answers

  1. Gary Reynolds 9,391 Reputation points
    2021-10-30T23:37:47.753+00:00

    Hi Gary

    Yeah I'm Melbourne based, so the time difference can be a pain for Europe.

    I did find this post, which led me to Sorting Weight Table Order, to section 3.1.5.2.3.1 of [MS-UCODEREF], and to the download of the Sorting Table. That table contains the sorting details for {0000001A-57EE-1E5C-00B4-D0000BB1E11E}, the sort ID used for the Swedish sorting table, which is used to create the character map returned by the LCMapStringEx function:

     SORTGUID 0000001A-57EE-1E5C-00B4-D0000BB1E11E  
     LOCALENAME fi-FI        ;Finnish - fi-FI  
     LOCALENAME sv-SE        ;Swedish - sv-SE  
     LOCALENAME sv-FI        ;Swedish - Finland - sv-FI  
     LOCALENAME fi           ;Finnish - fi  
     LOCALENAME sv           ;Swedish - sv  
      
     TWO 14  
      
    0x0077 0x0302 14 162 18 2 ;w Circumflex  
    0x0057 0x0302 14 162 18 18 ;W Circumflex  
    0x0075 0x030b 14 167 27 2 ;u Double Acute  
    0x0055 0x030b 14 167 27 18 ;U Double Acute  
    0x0075 0x0308 14 167 123 2 ;u Diaeresis  
    0x0055 0x0308 14 167 123 18 ;U Diaeresis  
    0x0061 0x030a 14 173 2 2 ;a Ring  
    0x0041 0x030a 14 173 2 18 ;A Ring  
    0x0061 0x0308 14 175 2 2 ;a Diaeresis  
    0x0041 0x0308 14 175 2 18 ;A Diaeresis  
    0x006f 0x0308 14 176 2 2 ;o Diaeresis  
    0x004f 0x0308 14 176 2 18 ;O Diaeresis  
    0x006f 0x030b 14 176 27 2 ;o Double Acute  
    0x004f 0x030b 14 176 27 18 ;O Double Acute  
    

    I think this also answers the question of which languages have the LINGUISTIC_IGNOREDIACRITIC option set: it appears to be based on the CW flag attribute and could be set on a per-character basis:

    ; CW Values:  
    ;   0-1 - reserved for delimiter/terminator  
    ;   0x01 bit - Full Width (if set). (1 == Full Width, 0 == Half/Normal Width)  
    ;   0x02 bit - Set by default, can be cleared in some case (if another higher bit is set)  
    ;   0x04 bit - Super/Subscript?  
    ;   0x08 bit -  
    ;   0x10 bit - Upper Case.  (16 == upper case, 0 == lower case)  
    ;   0x20 bit -  
    ;   0x40/0x80 bits - Reserved for nlstrans, which uses these as flags for characters that may compress  
    ;  
    ; Flags to NLSTrans  
    ; After the GUID  
    ;     HAS_3_BYTE_WEIGHTS - This ID has 3 byte weights.  Must be tagged on the EXCEPTION AND on the COMPRESSION  
    ;     LINGUISTIC_CASING  - Tagged (only) on the EXCEPTION table that applies to the linguistic case.  
    ;                          Should also have another untagged EXCEPTION table for non-linguistic casing.  
    

    I think that's probably far enough for this one. I don't think there is an option to change the sorting table, but at least we have identified an option to change the LDAP sorting order if required.

    Gary.


  2. Gary Reynolds 9,391 Reputation points
    2021-10-29T08:41:54.597+00:00

    Hi Gary.

    Did you manage to get the LDAP_SERVER_SORT_OID control working with a different Ordering rule OID? On my DC, which has the Australian English locale, only 1.2.840.113556.1.4.1499 (English: United States) and 1.2.840.113556.1.4.1665 (English: Australia) work; none of the other 158 Ordering rule OIDs do. I've tried installing an additional language pack and changing the regional settings, but this doesn't make any difference. However, that could be expected, since [MS-ADTS] 3.1.1.2.2.4.13 states that the search order is independent of the server's locale, so that all DCs return the same result and all alphabets are included; by the same reasoning you would expect the OID to work.

    With the LDAP_SERVER_SORT_OID control defined with a 1.2.840.113556.1.4.1499 Ordering rule OID I get this result:

    Server: w2k19.w2k12.local
    Domain: 
    
       Control: 1.2.840.113556.1.4.473  - Len: 43
          30 84 00 00 00 25 30 84 00 00 00 1F 04 04 6E   
          61 6D 65 80 17 31 2E 32 2E 38 34 30 2E 31 31   
          33 35 35 36 2E 31 2E 34 2E 31 34 39 39         
    
          ASN.1 Structure Decode
              30 84 00 00 00 25           : Sequence (len: 37)
              |  30 84 00 00 00 1F        : Sequence (len: 31)
              |  |  04 04                 : Octet String (len: 4)
              |  |     6E 61 6D 65                   : name
              |  |  80 17                 : Context Specific[0] (len: 23 Tag: 0 Class: 2 P/C: 0)
              |  |     31 2E 32 2E 38 34 30 2E       : 1.2.840.
              |  |     31 31 33 35 35 36 2E 31       : 113556.1
              |  |     2E 34 2E 31 34 39 39          : .4.1499
    
    BaseDN: DC=w2k12,DC=local
    Filter: (objectclass=*)
    
    <results>
    
    Controls:
        1.2.840.113556.1.4.474 - RESP_SORT
    
       Control: 1.2.840.113556.1.4.474  - Len: 5
          30 03 0A 01 00                                 
    
          ASN.1 Structure Decode
              30 03                       : Sequence (len: 3)
              |  0A 01                    : Enumerated (len: 1)
              |     00                            : .
    

    This is successful: the returned LDAP_SERVER_RESP_SORT_OID enum value is 0 (success).
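
    For anyone who wants to reproduce the request control without a dedicated tool, here is a minimal Python sketch that rebuilds the 43-byte control value shown in the dump above. It assumes the SortKeyList layout from RFC 2891 (a SEQUENCE OF SEQUENCE with the attribute type and an optional [0] ordering rule) and copies the 4-byte long-form lengths seen in the dump; the function names are my own and not part of any library.

```python
def ber_len_short(n: int) -> bytes:
    """BER short-form length (only valid below 128 bytes)."""
    if n >= 0x80:
        raise ValueError("value too long for short-form length")
    return bytes([n])


def ber_len_long4(n: int) -> bytes:
    """BER long-form length in 4 bytes (0x84 prefix), as seen in the dump."""
    return b"\x84" + n.to_bytes(4, "big")


def sort_control_value(attribute: str, ordering_rule_oid: str) -> bytes:
    """Encode a one-key SortKeyList: attributeType plus [0] orderingRule."""
    attr = attribute.encode("ascii")
    rule = ordering_rule_oid.encode("ascii")
    key = (
        b"\x04" + ber_len_short(len(attr)) + attr    # OCTET STRING attributeType
        + b"\x80" + ber_len_short(len(rule)) + rule  # [0] orderingRule
    )
    key_seq = b"\x30" + ber_len_long4(len(key)) + key          # inner SEQUENCE
    return b"\x30" + ber_len_long4(len(key_seq)) + key_seq     # outer SEQUENCE


value = sort_control_value("name", "1.2.840.113556.1.4.1499")
print(len(value))
print(value.hex(" ").upper())
```

    The resulting bytes would be passed as the value of control 1.2.840.113556.1.4.473; how that is done depends on the LDAP client library in use.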

    The same query with an Ordering rule OID of 1.2.840.113556.1.4.1594 fails:

    Server: w2k19.w2k12.local
    Domain: 
    
       Control: 1.2.840.113556.1.4.473  - Len: 43
          30 84 00 00 00 25 30 84 00 00 00 1F 04 04 6E   
          61 6D 65 80 17 31 2E 32 2E 38 34 30 2E 31 31   
          33 35 35 36 2E 31 2E 34 2E 31 35 39 34         
    
          ASN.1 Structure Decode
              30 84 00 00 00 25           : Sequence (len: 37)
              |  30 84 00 00 00 1F        : Sequence (len: 31)
              |  |  04 04                 : Octet String (len: 4)
              |  |     6E 61 6D 65                   : name
              |  |  80 17                 : Context Specific[0] (len: 23 Tag: 0 Class: 2 P/C: 0)
              |  |     31 2E 32 2E 38 34 30 2E       : 1.2.840.
              |  |     31 31 33 35 35 36 2E 31       : 113556.1
              |  |     2E 34 2E 31 35 39 34          : .4.1594
    
    BaseDN: DC=w2k12,DC=local
    Filter: (objectclass=*)
    
    Error: (0x5D) did not find the specified control, Server Error: 00000057: LdapErr: DSID-0C090B16, comment: Error processing control, data 0, v4563, Ext Error: (87) The parameter is incorrect.
    
    Controls:
        1.2.840.113556.1.4.474 - RESP_SORT
    
       Control: 1.2.840.113556.1.4.474  - Len: 5
          30 03 0A 01 12                                 
    
          ASN.1 Structure Decode
              30 03                       : Sequence (len: 3)
              |  0A 01                    : Enumerated (len: 1)
              |     12                            : .
    

    The returned LDAP_SERVER_RESP_SORT_OID enum value is 0x12 (inappropriateMatching): unrecognized or inappropriate matching rule in sort key.
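
    The two response dumps above can be decoded with a few lines of Python. This is only a sketch: it handles the 5-byte form seen here (a short-form SEQUENCE holding a one-byte ENUMERATED), the result names are taken from the SortResult definition in RFC 2891, and the function name is hypothetical.

```python
# sortResult codes as listed in RFC 2891
SORT_RESULT_NAMES = {
    0: "success",
    1: "operationsError",
    3: "timeLimitExceeded",
    8: "strongAuthRequired",
    11: "adminLimitExceeded",
    16: "noSuchAttribute",
    18: "inappropriateMatching",
    50: "insufficientAccessRights",
    51: "busy",
    53: "unwillingToPerform",
    80: "other",
}


def decode_sort_result(value: bytes) -> str:
    """Return the sortResult name from a minimal sort response control value."""
    # Expect: 30 <len> 0A 01 <code> [optional attributeType ignored]
    if len(value) < 5 or value[0] != 0x30 or value[2] != 0x0A or value[3] != 0x01:
        raise ValueError("unexpected sort response encoding")
    code = value[4]
    return SORT_RESULT_NAMES.get(code, f"unknown ({code})")


print(decode_sort_result(bytes.fromhex("30 03 0A 01 00")))
print(decode_sort_result(bytes.fromhex("30 03 0A 01 12")))
```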

    Do you know how to enable additional Ordering rule OIDs?

    Gary.


  3. Gary Nebbett 5,721 Reputation points
    2021-10-30T09:05:24.337+00:00

    Hello @GaryReynolds-8098,

    Thank you for some very interesting research; it is unfortunate that the time difference (CET (Switzerland) for me and presumably one of the Australian zones for you) means that we can only exchange one update per day :-(

    I found that article too, but I did not cite it since it is rather lacking in context; the "contextual" information that I found useful was that this functionality is possibly just intended to support the Microsoft Exchange Global Address List.

    I think that the event entries referenced in the article are logged when the NTDS service starts and finds a language identifier in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NTDS\Language for a language pack that is not installed - it is not logged when an ordering rule is used.

    I took a different approach and wrote a program that calls LCMapStringEx with flags of NORM_IGNORECASE | NORM_IGNORENONSPACE | LCMAP_SORTKEY | SORT_STRINGSORT | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH and configurable locale and string and then dumps the result. Here are some examples:

    LCMapStringEx de-DE Gäry
    0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00

    LCMapStringEx de-DE Gary
    0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00

    LCMapStringEx se-SE Gary
    0E-25-0E-02-0E-8A-0E-A7-01-01-01-01-00

    LCMapStringEx se-SE Gäry
    0E-25-0E-AF-0E-8A-0E-A7-01-01-01-01-00

    As you say, the reference that you found probably explains the behaviour:

    In Swedish, for example, some vowels with an accent sort after "Z," whereas in other European countries the same accented vowel comes right after the non-diacritic vowel.

    Gary


  4. Gary Reynolds 9,391 Reputation points
    2021-10-29T23:30:28.027+00:00

    Thanks @Gary Nebbett

    I did a bit more searching and found this more up-to-date article, which provides the same information. However, contrary to what the article suggests, I never received the event log entry when using the sort control without the additional locale set. It did, as described, create additional indexes when the new locales were added.
    https://learn.microsoft.com/pt-pt/troubleshoot/windows-server/deployment/use-language-id-identify-language-pack

    I completed some additional testing with the new locales configured and got some surprising results. @Stephan Steiner, yes, you can limit the returned items to only the objects that exactly match the accents/diacritics in the filter.

    I added the following LCID to the server:
    145029-lcid.png

    Here are the LCID, Language, Ordering OID

    41D - Swedish - 1.2.840.113556.1.4.1594  
    81D - Swedish: Finland - 1.2.840.113556.1.4.1595  
    C09 - German: Germany - 1.2.840.113556.1.4.1523  
    

    I've set up a test OU with two objects:
    Gäry Test1
    Gary Test2

    I had to create them with different names, as ADUC considered Gary Test and Gäry Test to be the same based on the default Unicode matching rules.

    If I use a search filter of (displayname=gary*) or (displayname=gäry*) both objects are returned.

    If I include the LDAP_SERVER_SORT_OID (1.2.840.113556.1.4.473) control with the sorting OID for Germany 1.2.840.113556.1.4.1523 I get the same result:

    BaseDN: OU=test1,DC=w2k12,DC=local  
    Filter: (&(objectclass=user)(displayname=g\C3\A4ry*))  
      
    DN> CN=Gäry Test1,OU=test1,DC=w2k12,DC=local  
    DN> CN=Gary Test2,OU=test1,DC=w2k12,DC=local  
    2 records returned  
    

    However, if I use the sorting OID for Swedish, 1.2.840.113556.1.4.1594 or 1.2.840.113556.1.4.1595, it only returns the object that exactly matches the filter:

    BaseDN: OU=test1,DC=w2k12,DC=local  
    Filter: (&(objectclass=user)(displayname=gary*))  
      
    DN> CN=Gary Test2,OU=test1,DC=w2k12,DC=local  
    1 records returned  
    

    With the accented character in the filter:

    BaseDN: OU=test1,DC=w2k12,DC=local  
    Filter: (&(objectclass=user)(displayname=g\C3\A4ry*))  
      
    DN> CN=Gäry Test1,OU=test1,DC=w2k12,DC=local  
    1 records returned  
    

    This is the dump of the LDAP_SERVER_SORT_OID control I used:
    145134-asn1-control.png

    In the MS Unicode reference [MS-UCODEREF] I'm struggling to find anything that suggests that Swedish has a different matching pattern. The only thing I did find was a reference to a different sort order for vowels in Swedish, which might explain it.

    Also, the definition of the NORM_IGNORENONSPACE flag in LCMapStringEx notes that for many scripts (notably Latin scripts), NORM_IGNORENONSPACE coincides with LINGUISTIC_IGNOREDIACRITIC, but I have been unable to find any lookup table that shows which languages do and don't have it defined.

    I did find that the Swedish language uses the Sorting ID {0000001A-57EE-1E5C-00B4-D0000BB1E11E}, but I can't find any reference to the rules associated with this Sorting ID.

    145099-screenshot-2021-10-30-102934.png

    144989-screenshot-2021-10-30-102038.png

    Going deep into the rabbit hole of Unicode and locale complexities here!

    Gary.


  5. Gary Nebbett 5,721 Reputation points
    2021-10-28T18:31:15.663+00:00

    Hello @Stephan Steiner ,

    It was difficult to find dependable sources of information that could help to answer this question.

    The suggestion from @GaryReynolds-8098 does not have any impact on the problem. The filter string is encoded as an ASN.1 OCTET STRING, implicitly containing a UTF-8 encoded string. The interpretation of escaped binary values is performed on the client when constructing the ASN.1 AttributeValue. Correctly escaping “ä” in UTF-8 gives “\C3\A4” (\00\E4 is a UTF-16 encoding of “ä”).
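
    As an illustration of the escaping rule, here is a small Python sketch that escapes a literal filter value byte by byte from its UTF-8 encoding. The helper name is made up for this example, and the set of escaped ASCII characters (NUL, "*", "(", ")" and "\") follows my reading of RFC 4515; since "*" is escaped too, any wildcard has to be appended after escaping.

```python
def escape_filter_value(value: str) -> str:
    """Escape a literal LDAP filter value using \\XX on its UTF-8 bytes."""
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if byte >= 0x80 or byte == 0 or ch in "*()\\":
            out.append(f"\\{byte:02X}")  # non-ASCII or special: hex escape
        else:
            out.append(ch)               # plain ASCII passes through
    return "".join(out)


print(escape_filter_value("gäry") + "*")           # literal prefix plus wildcard
print("ä".encode("utf-8").hex(" ").upper())        # the UTF-8 bytes of ä
print("ä".encode("utf-16-be").hex(" ").upper())    # the UTF-16 (big-endian) bytes of ä
```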

    In the following ETW trace, the first highlighted element is the filter as provided to the client API (this ETW string is encoded as UTF-16). The “g?ry” strings in the trace are just an artefact of the tracing (these ETW strings are plain ASCII strings). The later highlighted elements are the encoding of the search request.

    144687-x.png

    This is a formatted dump of the ASN.1 content of the LDAP request (see RFC 4511 for complete ASN.1 definitions):

    [APPLICATION 3] {
        OCTET STRING 'CN=Gary,CN=Users,DC=Home,DC=Org'
        ENUMERATED 2
        ENUMERATED 0
        INTEGER 1000
        INTEGER 60
        BOOLEAN FALSE
        [4] {
            OCTET STRING 'mail'
            SEQUENCE {
                [0]
                    67 C3 A4 72 79
            }
        }
        SEQUENCE {
            OCTET STRING
                2A
        }
    }

    Correctly escaping “ä” results in an identical search request:

    144550-x.png
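
    The point that both spellings of the filter produce the same request can be checked with a rough Python imitation of what a client does when it turns the filter text into the octets of the assertion value. This is an assumption-laden sketch (the function name is hypothetical and real clients do more validation), but it reproduces the 67 C3 A4 72 79 octets from the ASN.1 dump.

```python
import re

# A backslash followed by two hex digits is one escaped octet.
_ESCAPE = re.compile(rb"\\([0-9A-Fa-f]{2})")


def assertion_value_octets(filter_value: str) -> bytes:
    """UTF-8 encode the filter text, then replace \\XX escapes by raw octets."""
    raw = filter_value.encode("utf-8")
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


plain = assertion_value_octets("gäry")
escaped = assertion_value_octets(r"g\C3\A4ry")
utf16_style = assertion_value_octets(r"g\00\E4ry")

print(plain.hex(" ").upper())
print(escaped.hex(" ").upper())
print(utf16_style.hex(" ").upper())
```

    The first two values are identical, while the \00\E4 form yields different octets and therefore a different search.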

    The statement by @Limitless Technology is also misleading: RFC 4517 “Lightweight Directory Access Protocol (LDAP): Syntaxes and Matching Rules” does define a mechanism for expressing something close to what you probably want: matching rules, and more specifically the caseExactMatch rule. The syntax would be (mail:caseExactMatch:=gäry) or (mail:2.5.13.5:=gäry). However, as [MS-ADTS] confirms in section 3.1.1.3.4.4, Active Directory LDAP does not implement this matching rule.

    The schema definition of an LDAP attribute determines its default comparison operation; a different matching rule can only be used when it is compatible with the schema syntax of the attribute (and when the matching rule is implemented).

    Most of the string attributes defined in the standard Active Directory schema (including “mail”, “sn”, etc.) have the attribute schema type of 2.5.5.12 (String(Unicode)).

    [MS-ADTS] describes (in section 6.5) how Unicode strings are compared. In practice, the Win32 API routine LCMapStringEx is called with a map flags argument of NORM_IGNORECASE | NORM_IGNORENONSPACE | LCMAP_SORTKEY | SORT_STRINGSORT | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH. The presence of NORM_IGNORENONSPACE effectively means that diacritical marks such as the umlaut are ignored when creating a sort key for the filter string.
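
    To get a feel for the effect of ignoring case and nonspacing marks, the comparison can be imitated in portable Python. To be clear, this is not what LCMapStringEx does internally and it ignores locale-specific tables (such as the Swedish one discussed elsewhere in this thread); it only strips combining marks after Unicode decomposition, under a function name I made up.

```python
import unicodedata


def rough_compare_key(text: str) -> str:
    """Approximate a case- and diacritic-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFD", text)   # ä -> a + combining diaeresis
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.casefold()


print(rough_compare_key("Bächler"))
print(rough_compare_key("bachler"))
print(rough_compare_key("Bächler") == rough_compare_key("bachler"))
```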

    The first argument to LCMapStringEx is a locale name and, by default, the “en-US” locale is used. The locale can be controlled by adding an LDAP_SERVER_SORT_OID extended control to the search request (perhaps specifying de-CH as the locale), but this has no effect on the treatment of umlauts when NORM_IGNORENONSPACE is used.

    In summary, when searching Active Directory, the results returned by an LDAP search need to be filtered again by client-side application code to remove unwanted matches resulting from the NORM_IGNORENONSPACE behaviour (that is, ignoring diacritical marks).
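
    A minimal sketch of such a client-side pass in Python might look like the following. It assumes the search results are already available as dictionaries mapping attribute names to lists of string values (the exact shape depends on the LDAP library), keeps the comparison case-insensitive but diacritic-sensitive, and uses hypothetical function names and sample data.

```python
import unicodedata


def _norm(text: str) -> str:
    """Normalise composed/decomposed forms and fold case, keeping diacritics."""
    return unicodedata.normalize("NFC", text).casefold()


def exact_prefix_match(value: str, prefix: str) -> bool:
    """Case-insensitive, diacritic-sensitive 'starts with' test."""
    return _norm(value).startswith(_norm(prefix))


def refilter(entries, attribute, prefix):
    """Keep only entries where some value of `attribute` really has the prefix."""
    return [
        entry for entry in entries
        if any(exact_prefix_match(v, prefix) for v in entry.get(attribute, []))
    ]


# Made-up sample of what a server-side (sn=bäch*) search might return
entries = [{"sn": ["Bächler"]}, {"sn": ["Bachmann"]}, {"sn": ["BÄCHTOLD"]}]
for entry in refilter(entries, "sn", "bäch"):
    print(entry["sn"][0])
```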

    Gary
