Create embeddings and search in Azure OpenAI

sai 5 Reputation points
2023-03-21T18:48:27.0933333+00:00

@YutongTie-MSFT

How do I create embeddings and perform document search on a PDF document in Azure OpenAI?

I have a PDF input (just one document saved locally, as I'm only testing for now).

How can I create embeddings for that and perform search so I can do Q&A on that data file?

I'm using an Azure API key and OpenAI endpoints.

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb

https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/pinecone/Gen_QA.ipynb

I have seen the two examples above, but I'm not sure how to apply them to a PDF. Please help; I'm still new and learning.

Thanks

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

1 answer

  1. YutongTie-MSFT 53,966 Reputation points Moderator
    2023-03-21T23:16:55.7033333+00:00

    Hello @sai

    Thanks for reaching out to us. Regarding how to create embeddings and search a PDF document: unfortunately, there is no straightforward solution. Azure OpenAI does not provide built-in support for reading PDF documents.

    As a workaround, you can use an OCR (Optical Character Recognition) tool to extract the text from the PDF, and then feed it to the Azure OpenAI API to generate embeddings.

    Here's a high-level overview of the steps you can follow:

    1. Use an OCR tool to extract the text from the PDF document. Various OCR tools are available, such as the Azure Cognitive Services Computer Vision Read API, or Azure Form Recognizer if your PDF contains form-style data.
    2. Once you have the text, you can use the OpenAI API to generate embeddings for each sentence or paragraph in the document, similar to the code samples you shared.
    3. Store the embeddings in a vector database like Pinecone, where you can search for similar documents based on their embeddings.
    4. To answer a question about the document, embed the question, retrieve the most similar passages from the vector database, and pass the question together with the text of those passages (not the embedding vectors themselves) to a model such as OpenAI's GPT-3 or GPT-3.5 to generate an answer.
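
Steps 2 to 4 can be sketched in plain Python as below. This is only a rough illustration under stated assumptions, not an official sample: the helper names (`chunk_text`, `top_k_chunks`, `build_prompt`, `azure_embed`) are hypothetical, an in-memory list stands in for a vector database like Pinecone, and `azure_embed` assumes the `openai` Python package (version 1.x), an embeddings deployment whose name is read from a made-up `AZURE_EMBEDDING_DEPLOYMENT` environment variable, and an API version that may differ for your resource. The text is assumed to have already been extracted from the PDF in step 1.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Any function that maps a batch of strings to one vector per string.
EmbedFn = Callable[[Sequence[str]], List[Vector]]


def chunk_text(text: str, max_chars: int = 1000) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(question: str, chunks: Sequence[str],
                 chunk_vectors: Sequence[Vector], embed: EmbedFn,
                 k: int = 3) -> List[Tuple[str, float]]:
    """Embed the question and return the k most similar (chunk, score) pairs."""
    q_vec = embed([question])[0]
    scored = [(c, cosine_similarity(q_vec, v))
              for c, v in zip(chunks, chunk_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(question: str, context_chunks: Sequence[str]) -> str:
    """Combine retrieved passages and the question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def azure_embed(texts: Sequence[str]) -> List[Vector]:
    """Assumed Azure OpenAI embedding call (openai>=1.0); adjust to your setup."""
    import os
    from openai import AzureOpenAI  # imported lazily; needs `pip install openai`

    client = AzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version="2023-05-15",  # assumption: use a version your resource supports
    )
    response = client.embeddings.create(
        model=os.environ["AZURE_EMBEDDING_DEPLOYMENT"],  # hypothetical variable name
        input=list(texts),
    )
    return [item.embedding for item in response.data]
```

In use, you would call `chunks = chunk_text(extracted_text)`, then `vectors = azure_embed(chunks)` once, and for each question call `top_k_chunks(question, chunks, vectors, azure_embed)` and send the string returned by `build_prompt` to your GPT deployment. For a single test document an in-memory list like this is probably enough; a vector database becomes more useful as the number of documents grows.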

    I hope this helps. Let me know if you have any questions about the above. We are also looking forward to OCR support in Azure OpenAI, but it will take some time.

    Regards,

    Yutong

    Please kindly accept the answer and vote 'Yes' if you found it helpful, to support the community. Thanks a lot.

