Fine-tuning Llama 2 (or any large language model, LLM) for a specific use case like the one you described is indeed possible, and it is a common way to tailor the model's responses to a particular knowledge domain or to company-specific information.

Q1 : Yes, your use case can be met via fine-tuning. Fine-tuning lets the model learn from a dataset containing the specific knowledge or information you want it to draw on when answering questions. This is particularly useful when the model's pre-trained knowledge is out of date, not detailed enough, or not aligned with company-specific practices or information.
Q2 : When fine-tuning Llama 2 7B or any similar LLM in Azure AI Studio, the dataset is typically a set of input-output pairs representing the questions and their corresponding answers. Here's a simplified example of what the format might look like:
[
  {
    "question": "How do I implement a TCP connection in our system?",
    "answer": "To implement a TCP connection in our system, you should use our custom wrapper API, which simplifies the process. Here's a code snippet to get you started: [Code Snippet]"
  },
  {
    "question": "What are the best practices for error handling in our TCP API?",
    "answer": "Our TCP API's best practices for error handling include [List of Practices]. For more detailed examples, refer to the internal documentation at [Link/Reference]."
  }
  // more examples...
]
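Note that many fine-tuning pipelines, including the Azure AI Studio / Azure ML data asset flow, expect JSON Lines (one JSON object per line) rather than a single JSON array, and the column names are mapped when you configure the fine-tuning job. Below is a minimal conversion sketch, assuming a source file named qa_pairs.json containing the array above and prompt/completion as the target field names; both the file names and the field names are assumptions you should check against what the fine-tuning wizard asks for.

```python
import json

# Hypothetical file names; adjust to your environment.
SOURCE_FILE = "qa_pairs.json"      # the JSON array of question/answer pairs shown above
OUTPUT_FILE = "train_data.jsonl"   # JSON Lines output: one record per line

with open(SOURCE_FILE, "r", encoding="utf-8") as f:
    qa_pairs = json.load(f)

with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    for pair in qa_pairs:
        # Map question/answer onto the prompt/completion field names.
        # Verify the expected names in the Azure AI Studio fine-tuning
        # wizard before submitting the job.
        record = {
            "prompt": pair["question"],
            "completion": pair["answer"],
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping the raw question/answer pairs in a separate file and generating the training file with a small script like this makes it easy to regenerate the dataset as you add more examples.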
Q3 : Regarding the number of examples, 20 might be enough to start seeing some degree of customization in the model's responses, but effectiveness improves significantly with more data. Fine-tuning datasets for large language models typically range from hundreds to thousands of examples, depending on the complexity of the use case and the diversity of questions and answers you want to cover. However, starting with a smaller dataset and iteratively expanding it based on initial results is a practical approach: you can assess fine-tuning effectiveness early on and then decide how much further to invest in building out the dataset.
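If you do start small, one practical way to judge early fine-tuning effectiveness is to hold out a validation split and compare the fine-tuned model's answers on it against your reference answers. A minimal sketch, reusing the hypothetical train_data.jsonl file from the previous example:

```python
import json
import random

random.seed(42)  # reproducible shuffle

# Load the JSON Lines training file produced earlier (hypothetical name).
with open("train_data.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

random.shuffle(records)
split = int(len(records) * 0.8)  # 80/20 split; adjust for very small datasets

for name, subset in [("train_split.jsonl", records[:split]),
                     ("validation_split.jsonl", records[split:])]:
    with open(name, "w", encoding="utf-8") as f:
        for record in subset:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

With only a few dozen examples, even a handful of held-out pairs can show whether the model is picking up your internal terminology and APIs before you invest in expanding the dataset.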