Dataset format when finetuning llama2

Chinmay Shete 20 Reputation points
2024-02-27T22:38:16.09+00:00

Hi,

I am exploring whether it is feasible to fine-tune Llama 2 on Azure using Azure AI Studio. I want to fine-tune it for a specific question-answering use case: when a question is asked about a certain communications protocol (e.g. TCP), instead of the LLM answering from its pre-trained, built-in knowledge of the protocol, I want it to answer using my company's knowledge (for example, if my company has a wrapper API for TCP-related functions, I want the LLM to reference that API in its answers).

  1. First of all, can such a use case be met via fine-tuning?
  2. If yes, what dataset format is needed to do this in Azure AI Studio for Llama 2-7B?
  3. Will 20 examples (i.e. entries in the dataset) be enough? If not, how many are needed?

Thank You,

Regards,

CS

Azure AI services
Accepted answer
  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2024-02-28T13:39:10.29+00:00

    Fine-tuning Llama 2 (or any large language model, LLM) for a specific use case like the one you described is indeed possible, and is a common approach to tailoring a model's responses to specific knowledge domains or company-specific information.

    Q1: Yes, your use case can be met via fine-tuning. Fine-tuning allows the model to learn from a dataset that contains the specific knowledge or information you want it to utilize when answering questions. This is particularly useful for scenarios where the model's pre-trained knowledge is not up to date, not detailed enough, or not aligned with company-specific practices or information.

    Q2: When fine-tuning Llama 2-7B or any similar LLM in Azure AI Studio, the dataset format typically follows a pattern of input-output pairs representing the questions and the corresponding answers. Here's a simplified example of what the format might look like:

    [
      {
        "question": "How do I implement a TCP connection in our system?",
        "answer": "To implement a TCP connection in our system, you should use our custom wrapper API, which simplifies the process. Here's a code snippet to get you started: [Code Snippet]"
      },
      {
        "question": "What are the best practices for error handling in our TCP API?",
        "answer": "Our TCP API's best practices for error handling include [List of Practices]. For more detailed examples, refer to the internal documentation at [Link/Reference]."
      }
    ]

    (Standard JSON does not allow comments, so any "more examples" note belongs outside the file; simply append further question/answer objects to the array.)
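    Note that many fine-tuning pipelines (including managed fine-tuning flows such as the one in Azure AI Studio) expect JSON Lines (JSONL) with prompt/completion-style fields rather than a single JSON array, so a small conversion script is often useful. Below is a minimal sketch in Python, assuming a question/answer JSON file like the one above; the file names and the `prompt`/`completion` field names are illustrative, so check the exact schema your fine-tuning endpoint requires:

    ```python
    import json

    def qa_to_jsonl(in_path: str, out_path: str) -> int:
        """Convert a JSON array of {"question": ..., "answer": ...} objects
        into prompt/completion JSONL, one record per line.

        Returns the number of records written."""
        with open(in_path, encoding="utf-8") as f:
            pairs = json.load(f)

        count = 0
        with open(out_path, "w", encoding="utf-8") as out:
            for pair in pairs:
                record = {
                    "prompt": pair["question"].strip(),
                    "completion": pair["answer"].strip(),
                }
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
        return count
    ```

    Running `qa_to_jsonl("qa_pairs.json", "training_data.jsonl")` would produce one JSON object per line, which is also a convenient point to validate the data (e.g. reject empty answers) before uploading it.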
    

    Q3: Regarding the number of examples: while 20 might be enough to see some degree of customization in the model's responses, effectiveness improves significantly with more data. For fine-tuning large language models, datasets typically range from hundreds to thousands of examples, depending on the complexity of the use case and the diversity of questions and answers you want to cover. A practical approach is to start with a smaller dataset and expand it iteratively based on initial results; that way you can assess the fine-tuning's effectiveness early and decide whether further investment in dataset expansion is warranted.

