Dataset format when finetuning llama2

Chinmay Shete 20 Reputation points
2024-02-27T22:38:16.09+00:00

Hi,

I am exploring whether it is feasible to fine-tune Llama 2 on Azure using Azure AI Studio. I want to fine-tune it for a specific question-answering use case: when a question is asked about a certain communications protocol (e.g. TCP), instead of the LLM answering from its pre-trained, built-in knowledge of the protocol, I want it to answer using my company's knowledge (for example, if my company has a wrapper API for TCP-related functions, I want the LLM to reference that API in its answers).

  1. First of all, can such a use case be met via fine-tuning?
  2. If yes, what dataset format is needed to do this in Azure AI Studio for Llama 2-7B?
  3. Will 20 examples (i.e. entries in the dataset) be enough? If not, how many are needed?

Thank You,

Regards,

CS

Azure AI services
Accepted answer
  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2024-02-28T13:39:10.29+00:00

    Fine-tuning Llama 2 (or any large language model, LLM) for a specific use case like the one you described is indeed possible, and is a common approach to tailoring a model's responses to specific knowledge domains or company-specific information.

    Q1: Yes, your use case can be met via fine-tuning. Fine-tuning allows the model to learn from a dataset that contains the specific knowledge or information you want it to utilize when answering questions. This is particularly useful for scenarios where the model's pre-trained knowledge is not up to date, not detailed enough, or not aligned with company-specific practices or information.

    Q2: When fine-tuning Llama 2-7B or any similar LLM in Azure AI Studio, the dataset format typically follows a pattern of input-output pairs representing the questions and the corresponding answers. Here's a simplified example of what the format might look like:

    [
      {
        "question": "How do I implement a TCP connection in our system?",
        "answer": "To implement a TCP connection in our system, you should use our custom wrapper API, which simplifies the process. Here's a code snippet to get you started: [Code Snippet]"
      },
      {
        "question": "What are the best practices for error handling in our TCP API?",
        "answer": "Our TCP API's best practices for error handling include [List of Practices]. For more detailed examples, refer to the internal documentation at [Link/Reference]."
      }
    ]

    (Standard JSON does not allow comments, so any "more examples" note belongs outside the file; simply append further question/answer objects to the array.)
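    Note that many fine-tuning pipelines (including managed fine-tuning flows such as the one in Azure AI Studio) expect JSON Lines (JSONL) with prompt/completion-style fields rather than a single JSON array, so a small conversion script is often useful. Below is a minimal sketch in Python, assuming a question/answer JSON file like the one above; the file names and the `prompt`/`completion` field names are illustrative, so check the exact schema your fine-tuning endpoint requires:

    ```python
    import json

    def qa_to_jsonl(in_path: str, out_path: str) -> int:
        """Convert a JSON array of {"question": ..., "answer": ...} objects
        into prompt/completion JSONL, one record per line.

        Returns the number of records written."""
        with open(in_path, encoding="utf-8") as f:
            pairs = json.load(f)

        count = 0
        with open(out_path, "w", encoding="utf-8") as out:
            for pair in pairs:
                record = {
                    "prompt": pair["question"].strip(),
                    "completion": pair["answer"].strip(),
                }
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
        return count
    ```

    Running `qa_to_jsonl("qa_pairs.json", "training_data.jsonl")` would produce one JSON object per line, which is also a convenient point to validate the data (e.g. reject empty answers) before uploading it.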
    

    Q3: Regarding the number of examples: while 20 might be enough to see some degree of customization in the model's responses, effectiveness improves significantly with more data. For fine-tuning large language models, datasets typically range from hundreds to thousands of examples, depending on the complexity of the use case and the diversity of questions and answers you want to cover. A practical approach is to start with a smaller dataset and expand it iteratively based on initial results; that way you can assess the fine-tuning's effectiveness early and decide whether further investment in dataset expansion is warranted.

