Recommendations for LLM fine-tuning

LLMs may not perform well on specific domains, tasks, or datasets, or may produce inaccurate or misleading outputs. In such cases, fine-tuning the model can adapt it to the desired goal and improve its quality and reliability.

When to consider fine-tuning

Below are some scenarios where fine-tuning can be considered.

  • Hallucinations: Hallucinations are untrue statements output by the model. They can harm the credibility and trustworthiness of your application. One possible mitigation is fine-tuning the model with data that contains accurate and consistent information.
  • Accuracy and quality problems: Pre-trained models may not achieve the desired level of accuracy or quality for a specific task or domain. This shortfall can be due to a mismatch between the pre-training data and the target data, the diversity and complexity of the target data, or incorrect evaluation metrics and criteria.

How fine-tuning can help

Fine-tuning the model with data that is representative and relevant to the target task or domain may help improve the model's performance. Examples include:

  • Adding domain-specific knowledge: Teaching the model new or uncommon tasks, or constraining it to a smaller output space, may require it to learn skills, concepts, or vocabulary that are not well represented in its original training data. This is especially true of complex, specialized tasks involving legal, medical, or technical texts. These tasks or domains may also have specific constraints or requirements, such as length, format, or style, that limit the model's generative space. Fine-tuning the model with domain-specific data may help it acquire the necessary knowledge and skills and generate more appropriate and coherent texts.
  • Adding data that doesn't fit in a prompt: The LLM prompt is the input text that is given to the model to generate an output. It usually contains some keywords, instructions, or examples that guide the model's behavior. However, the prompt has a limited size, and the data needed to complete a task may exceed the prompt's capacity. This happens in applications that require the LLM to process long documents or large tables. In such cases, fine-tuning can help the model handle more data and use smaller prompts at inference time to generate more relevant and complete outputs.
  • Simplifying prompts: Long or complex prompts can affect the model's efficiency and scalability. Fine-tuning the model with data that is tailored to the target task or domain can help the model provide quality responses from simpler prompts, and potentially use fewer tokens and improve latency.

Best practices for fine-tuning

Here are some best practices that can help improve the efficiency and effectiveness of fine-tuning LLMs for various applications:

  • Try different data formats: Depending on the task, different data formats can have different impacts on the model’s performance. For example, for a classification task, you can use a format that marks the end of the prompt and the end of the completion with fixed separator strings, such as {"prompt": "Paris##\n", "completion": " city\n###\n"}. Be sure to use formats suitable for your application.
  • Collect a large, high-quality dataset: LLMs are data-hungry and can benefit from more diverse and representative fine-tuning data. Because collecting and annotating large datasets can be costly and time-consuming, you can also use synthetic data generation techniques to increase the size and variety of your dataset. Ensure that the synthetic data is relevant to and consistent with your task and domain, and that it does not introduce noise or bias to the model.
  • Try fine-tuning subsets first: To assess the value of getting more data, you can fine-tune models on subsets of your current dataset to see how performance scales with dataset size. This fine-tuning can help you estimate the learning curve of your model and decide whether adding more data is worth the effort and cost. You can also compare the performance of your model with the pre-trained model or a baseline. This comparison shows how much improvement you can achieve with fine-tuning.
  • Experiment with hyperparameters: Hyperparameters, such as the learning rate, the batch size, and the number of epochs, can have a significant effect on the model’s performance. Iteratively experiment with different values and combinations of hyperparameters to find the best ones for your task and dataset.
  • Start with a smaller model: A common mistake is assuming that your application needs the newest, biggest, most expensive model. Especially for simpler tasks, start with smaller models and only try larger models if needed.
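
As an illustration of the data-format practice, the following is a minimal sketch of turning labeled classification examples into prompt/completion records in the style of the "Paris" example. The separator strings are taken from that example; the helper names (to_record, to_jsonl) and the second sample row are hypothetical, and the exact format your fine-tuning service expects may differ.

```python
import json

# Separator strings copied from the example format in this article.
PROMPT_SEPARATOR = "##\n"    # marks the end of the prompt
COMPLETION_STOP = "\n###\n"  # marks the end of the completion


def to_record(text: str, label: str) -> dict:
    """Build one prompt/completion pair for a classification example."""
    if PROMPT_SEPARATOR in text:
        # The separator must stay unambiguous, so reject inputs that contain it.
        raise ValueError("prompt text must not contain the separator")
    return {
        "prompt": f"{text}{PROMPT_SEPARATOR}",
        # Leading space before the label, as in the example format.
        "completion": f" {label}{COMPLETION_STOP}",
    }


def to_jsonl(examples) -> str:
    """Serialize (text, label) pairs as one JSON object per line."""
    return "\n".join(json.dumps(to_record(text, label)) for text, label in examples)


# Hypothetical sample data for illustration only.
examples = [("Paris", "city"), ("France", "country")]
jsonl_text = to_jsonl(examples)
print(jsonl_text)
```

Keeping the separators in named constants makes it easy to try an alternative format and compare results, which is the point of this practice.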
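
The subset practice can be sketched as follows. This is one possible way to build nested subsets and record a learning curve, not a prescribed procedure. The train_and_evaluate callback is an assumption: it stands for whatever fine-tuning and evaluation pipeline you use, and placeholder_score below is a stand-in that only makes the sketch runnable.

```python
import random
from typing import Callable, List, Sequence, Tuple


def nested_subsets(data: Sequence, fractions: Sequence[float], seed: int = 0) -> List[list]:
    """Return subsets of increasing size, each a prefix of one fixed shuffle.

    Because every subset is a prefix of the same shuffled list, a smaller
    subset is always contained in a larger one, so score differences reflect
    added data rather than a different sample.
    """
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    sizes = [max(1, round(len(shuffled) * f)) for f in sorted(fractions)]
    return [shuffled[:size] for size in sizes]


def learning_curve(
    data: Sequence,
    fractions: Sequence[float],
    train_and_evaluate: Callable[[list], float],
) -> List[Tuple[int, float]]:
    """Fine-tune on each subset and collect (subset size, score) pairs."""
    return [(len(subset), train_and_evaluate(subset)) for subset in nested_subsets(data, fractions)]


def placeholder_score(subset: list) -> float:
    """Stand-in for a real fine-tune-and-evaluate run (hypothetical numbers)."""
    return 1.0 - 1.0 / (1.0 + len(subset)) ** 0.5


dataset = list(range(1000))  # hypothetical dataset of 1,000 examples
curve = learning_curve(dataset, [0.125, 0.25, 0.5, 1.0], placeholder_score)
for size, score in curve:
    print(size, round(score, 3))
```

If the score is still rising noticeably between the two largest subsets, collecting more data is more likely to pay off; if it has flattened, the effort may be better spent elsewhere.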
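
For hyperparameter experiments, a simple grid search is one common starting point; the article does not prescribe a particular search strategy. The sketch below enumerates every combination of a small search space. The values in search_space are illustrative assumptions rather than recommendations, and placeholder_validation_score stands in for a real fine-tuning run followed by evaluation on held-out data.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical search space; suitable ranges depend on the model and dataset.
search_space = {
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "batch_size": [8, 16],
    "num_epochs": [2, 4],
}


def grid(space: Dict[str, list]) -> List[dict]:
    """Enumerate every combination of hyperparameter values."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]


def best_config(configs: List[dict], evaluate: Callable[[dict], float]) -> dict:
    """Return the configuration with the highest evaluation score."""
    return max(configs, key=evaluate)


def placeholder_validation_score(config: dict) -> float:
    """Stand-in for fine-tuning with `config` and scoring on validation data."""
    return -abs(config["learning_rate"] - 5e-5)


configs = grid(search_space)
best = best_config(configs, placeholder_validation_score)
print(len(configs), "configurations; best:", best)
```

Because each configuration requires a full fine-tuning run, the grid grows expensive quickly. Combining this with the subset practice (searching on a smaller subset first) is one way to keep costs down.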