Azure OpenAI Service: Characters are converted to Unicode when indexing with Japanese files in Studio.

Question

大宮僚馬 70

Previously, when I indexed Japanese files, the indexed text remained in Japanese. When I tried again recently, the characters were converted to Unicode.

Upon investigation, we found that the API used in the Azure Cognitive Search skillset has changed, and we believe this may be the cause.

Before the change

https://XXX.openai.azure.com/openai/chunks?api-version=2023-03-31-preview

After the change

https://XXX.openai.azure.com/openai/preprocessing-jobs?api-version=2023-03-31-preview

How can I create the index in Japanese as it was before the change?

AshokPeddakotla-MSFT 35,971 Reputation points Moderator

2023-12-08T09:55:31.5366667+00:00

大宮僚馬 Greetings!

Previously, when I indexed Japanese files, the indexed text remained in Japanese. When I tried again recently, the characters were converted to Unicode.

Could you confirm what changes you have made to the index files?

You can find more information on creating an index for multiple languages in the documentation article "Create an index for multiple languages in Azure AI Search".

Can you share the complete API call you are using, along with its input parameters?

大宮僚馬 70 Reputation points

2023-12-11T00:27:38.8+00:00

The index was created by Azure OpenAI Studio, and I did not make any changes to it.

When the skillset runs, the file is split into JSON-formatted pieces, and at that point the Japanese text is converted to Unicode.

The following is the skillset that is actually executed:

{
  "@odata.context": "https://XXX.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8DBF6FC3E7A5DE7\"",
  "name": "openai-chat-skillset",
  "description": null,
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "openai-chat-skillset",
      "description": null,
      "context": "/document/content",
      "uri": "https://XXX.openai.azure.com/openai/preprocessing-jobs?api-version=2023-03-31-preview",
      "httpMethod": "POST",
      "timeout": "PT1M",
      "batchSize": 10,
      "degreeOfParallelism": 10,
      "inputs": [
        {
          "name": "document_id",
          "source": "/document/document_id"
        },
        {
          "name": "filename",
          "source": "/document/filename"
        },
        {
          "name": "fieldname",
          "source": "='content'"
        },
        {
          "name": "text",
          "source": "/document/content"
        },
        {
          "name": "url",
          "source": "/document/url"
        }
      ],
      "outputs": [
        {
          "name": "recordId",
          "targetName": "recordId"
        }
      ],
      "httpHeaders": {
        "original_request_id": "openai-chat",
        "original_internal_id": "original_internal_id",
        "num_tokens": "1024",
        "api-key": "API_KEY",
        "connection_string": "BlobEndpoint=https://XXX.blob.core.windows.net/;SharedAccessSignature=?...",
        "container_name": "openai-chat-chunks"
      }
    }
  ],
  "cognitiveServices": null,
  "knowledgeStore": null,
  "indexProjections": null,
  "encryptionKey": null
}
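
To check which endpoint a Studio-generated skillset calls, a definition like the one above can be inspected programmatically. The following is a minimal sketch, assuming the skillset JSON has already been retrieved and is available as a string. The helper name `custom_skill_endpoints` is hypothetical and not part of any Azure SDK, and the embedded sample is a trimmed copy of the skillset above.

```python
import json
from urllib.parse import urlsplit


def custom_skill_endpoints(skillset_json: str) -> list[tuple[str, str]]:
    """Return (skill name, URI path) for each custom Web API skill.

    Hypothetical helper for illustration; it only reads the JSON text
    and makes no network calls.
    """
    skillset = json.loads(skillset_json)
    endpoints = []
    for skill in skillset.get("skills", []):
        if skill.get("@odata.type") == "#Microsoft.Skills.Custom.WebApiSkill":
            # Keep only the path so the query string (api-version) is ignored.
            path = urlsplit(skill["uri"]).path
            endpoints.append((skill["name"], path))
    return endpoints


# Trimmed copy of the skillset shown above (only the fields used here).
SKILLSET = """
{
  "name": "openai-chat-skillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "openai-chat-skillset",
      "uri": "https://XXX.openai.azure.com/openai/preprocessing-jobs?api-version=2023-03-31-preview"
    }
  ]
}
"""

for name, path in custom_skill_endpoints(SKILLSET):
    print(f"{name}: {path}")  # openai-chat-skillset: /openai/preprocessing-jobs
```

Comparing the printed path against `/openai/chunks` shows at a glance whether an index was built with the earlier endpoint or the newer one.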

大宮僚馬 70 Reputation points

2023-12-13T06:52:19.8866667+00:00

The issue still occurs.

To add more detail, the steps to reproduce it are:

1. In Azure OpenAI Studio, select "Add a data source" from the "Add your data" tab.
2. Select "Upload files" as the data source.
3. Enter the required information and upload a file written in Japanese.
4. Click "Save and Close" to create the index.

Vector search is off, and the search type is semantic.

I last checked on 11/22/2023, and at that time the index was created in Japanese using the above procedure.

Recently, when I performed the same procedure with the same file, the index was created with the Japanese text converted to Unicode.

The picture shows the index data in AI Search's Search explorer.

The "content" and "title" fields are in Unicode.

(The text after the blacked-out area is also in Unicode.)
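
The symptom described above, where Japanese text appears "in Unicode", resembles what happens when JSON is serialized with non-ASCII characters escaped as `\uXXXX` sequences and the escaped text is then stored without being decoded. Whether the `preprocessing-jobs` endpoint actually behaves this way is not confirmed anywhere in this thread. The sketch below only reproduces the effect with Python's standard `json` module, using a made-up sample string.

```python
import json

text = "日本語"  # hypothetical sample content, not taken from the thread

# Default serialization escapes every non-ASCII character.
escaped = json.dumps({"content": text})
print(escaped)    # {"content": "\u65e5\u672c\u8a9e"}

# ensure_ascii=False keeps the characters readable in the JSON text.
readable = json.dumps({"content": text}, ensure_ascii=False)
print(readable)   # {"content": "日本語"}

# Both forms decode to the same string, so escaping alone loses nothing.
assert json.loads(escaped) == json.loads(readable) == {"content": text}

# The symptom appears when the escaped text is stored as a literal value
# instead of being decoded first (simulated here by slicing it out).
stored_literally = escaped[len('{"content": "'):-len('"}')]
print(stored_literally)                       # \u65e5\u672c\u8a9e
print(json.loads(f'"{stored_literally}"'))    # 日本語
```

If index fields do contain literal escape sequences like this, the original text is still recoverable by decoding them, as the last line shows.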
大宮僚馬 70 Reputation points

2023-12-15T00:04:53.7933333+00:00

I ran the indexing again last night, and the problem was resolved.


Answer 1

大宮僚馬 I'm glad that your issue is resolved, and thank you for posting your solution so that others experiencing the same thing can easily reference it!

Because the Microsoft Q&A community has a policy that question authors cannot accept their own answers and can only accept answers from others, I'll repost your solution in case you'd like to accept the answer.

Issue:

Previously, when I indexed Japanese files, the indexed text remained in Japanese. When I tried again recently, the characters were converted to Unicode.

Upon investigation, we found that the API used in the Azure Cognitive Search skillset has changed, and we believe this may be the cause.

Before the change

https://XXX.openai.azure.com/openai/chunks?api-version=2023-03-31-preview

After the change

https://XXX.openai.azure.com/openai/preprocessing-jobs?api-version=2023-03-31-preview

How can I create the index in Japanese as it was before the change?

Solution:

I ran the indexing again last night, and the problem was resolved.

If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.

Answer 2

大宮僚馬 70 Reputation points

I ran the indexing again last night, and the problem was resolved.

Answer 3

Mads Olsgaard 0

I have also noticed this API being called by Azure Cognitive Search skillsets created via Azure OpenAI Studio or Azure AI Studio. As far as I can tell, this API is completely undocumented:

https://XXX.openai.azure.com/openai/preprocessing-jobs?api-version=2023-03-31-preview

Where is the documentation for it published? One would assume it would be at https://learn.microsoft.com/en-us/azure/ai-services/openai/reference, but it is not.
