Recommendations for Experimentation

Experimenting with large language models (LLMs) can be challenging, and may require careful design choices. Here are some recommendations for experimenting with LLMs effectively and efficiently.

Collect as much high-quality data as possible, then augment with synthetic data as needed

Large language models are trained on massive amounts of text from diverse sources, but that training data may not cover your specific domain or task well. It is therefore important to collect as much relevant, high-quality data as possible to provide the model with sufficient context and examples. Because data collection can be costly and time-consuming, you may also consider augmenting your data with synthetic data generated by one of the following methods:

  • LLM
  • Other generative or deterministic methods (e.g., grammar-based)

Synthetic data can help increase the diversity and robustness of your data, and fill in the gaps or imbalances in your data distribution.
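
As an illustration of the grammar-based option, the following is a minimal sketch of generating synthetic utterances from a small hand-written grammar. The grammar, its symbols, and the flight-booking domain are hypothetical and chosen only for the example; a real grammar would be derived from your own domain.

```python
import random

# Hypothetical toy grammar for a flight-booking request.
# Every symbol and phrase here is illustrative, not taken from a real dataset.
GRAMMAR = {
    "<request>": [["<verb>", "a flight from", "<city>", "to", "<city>"]],
    "<verb>": [["book"], ["find"], ["reserve"]],
    "<city>": [["Seattle"], ["Cairo"], ["Paris"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol; strings that are not grammar keys are terminals."""
    if symbol not in GRAMMAR:
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(part, rng) for part in production)


def generate(n, seed=0):
    """Generate n synthetic utterances; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [expand("<request>", rng) for _ in range(n)]


samples = generate(5)
for sample in samples:
    print(sample)
```

Because the output is fully determined by the grammar and the seed, this kind of generator is easy to audit, but it only covers the patterns you write down. LLM-generated data can add variety at the cost of needing a quality review.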

Define/use different evaluation metrics that fit your application

When using LLMs, you need to define or choose evaluation metrics that capture the quality and performance of your model outputs. Depending on your application, you may use automatic metrics, human evaluation, or both:

  • Levenshtein distance
  • BLEU
  • ROUGE
  • Human evaluation, such as ratings, rankings, or feedback

You may also use multiple metrics to get a comprehensive and holistic assessment of your model. For more information, see evaluation metrics.
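
To make the idea of combining metrics concrete, here is a minimal sketch that scores one model output against a reference with three simple automatic metrics: exact match, a similarity derived from Levenshtein distance, and a unigram-overlap F1. The unigram F1 is a simplified stand-in for overlap metrics such as ROUGE, not a faithful implementation of them, and the function names are assumptions made for this example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over whitespace tokens; a simplified overlap metric, not ROUGE itself."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, reference: str) -> dict:
    """Report several metrics side by side for one output."""
    return {
        "exact_match": float(prediction == reference),
        "edit_similarity": edit_similarity(prediction, reference),
        "unigram_f1": unigram_f1(prediction, reference),
    }


print(score("the cat sat on the mat", "the cat sat on a mat"))
```

Reporting the metrics together makes disagreements visible: an output can fail exact match while still being close to the reference by edit distance or token overlap.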

Start with in-context learning

  • Start simple to establish a baseline. Begin with simple prompt designs and use the results as a baseline. A baseline is a quick and easy way to gauge the model's capabilities and limitations.
  • Gradually increase complexity. Once you have a baseline, you can experiment with increasing the complexity of your task or domain, for example by providing more context or examples, or by introducing constraints.
  • Use different prompt designs to optimize performance. Different prompt designs can elicit different responses from the model, and some may be more suitable or effective for your task or domain than others. Therefore, try different prompt designs and compare their results.
  • Benchmark different configurations and evaluate different models. You can vary prompt designs, model parameters, datasets, and metrics to see how the model performs on different aspects of your task or domain. You can also evaluate and compare different versions or variants of GPT-3, or other large language models.
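
The workflow in the list above can be sketched as a small harness that runs several prompt designs over the same labeled examples and compares their scores against the baseline. Everything named here is an assumption for illustration: the prompt templates, the sentiment task, and in particular `fake_model`, which stands in for a real LLM call so the example runs without any API access.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt designs for a toy sentiment task.
PROMPTS: Dict[str, str] = {
    "baseline_zero_shot": (
        "Classify the sentiment as positive or negative.\n"
        "Text: {text}\nSentiment:"
    ),
    "few_shot": (
        "Text: I loved it\nSentiment: positive\n"
        "Text: Terrible service\nSentiment: negative\n"
        "Text: {text}\nSentiment:"
    ),
}

# Tiny illustrative evaluation set of (input text, expected label) pairs.
EVAL_SET: List[Tuple[str, str]] = [
    ("Great battery life", "positive"),
    ("It broke in a week", "negative"),
]


def evaluate_prompt(template: str,
                    examples: List[Tuple[str, str]],
                    call_model: Callable[[str], str]) -> float:
    """Accuracy of one prompt design over the evaluation set."""
    correct = 0
    for text, label in examples:
        output = call_model(template.format(text=text))
        correct += int(output.strip().lower() == label)
    return correct / len(examples)


def compare(prompts: Dict[str, str],
            examples: List[Tuple[str, str]],
            call_model: Callable[[str], str]) -> Dict[str, float]:
    """Score every prompt design with the same data and the same metric."""
    return {name: evaluate_prompt(t, examples, call_model)
            for name, t in prompts.items()}


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call: keyword rule applied to the final 'Text:' entry."""
    last_text = prompt.rsplit("Text:", 1)[1].lower()
    return "positive" if ("great" in last_text or "love" in last_text) else "negative"


results = compare(PROMPTS, EVAL_SET, fake_model)
print(results)
```

Passing the model as a callable keeps the harness independent of any particular provider, so the same comparison can be repeated across models or parameter settings by swapping in a different function.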

Perform fine-tuning if needed

Fine-tuning can improve the model's performance and adaptability in some use cases, but it has limitations: it is costly and requires more data, computational resources, and hyperparameter tuning. Fine-tuning may also cause overfitting or catastrophic forgetting. Therefore, we advise fine-tuning only if needed, and only after you have exhausted in-context learning methods. Below are a few recommendations for fine-tuning. For more information, see fine-tuning recommendations.

  • Start with smaller models, especially for simple tasks. Smaller models can be faster, cheaper, and easier to use and fine-tune, and they can also be more interpretable and controllable.
  • Try fine-tuning using different data formats. Different data formats can affect the model's input and output representations, and some may be more suitable or effective for your task or domain than others. For example, you can use plain text, structured text, or semi-structured text as your data format. You can also use different separators, delimiters, or tokens to indicate the boundaries or labels of your input and output.
  • Optimize the hyperparameters of your model and your fine-tuning process, such as the learning rate, the batch size, the number of epochs, the weight decay, or the dropout rate.
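
As one illustration of the data-format recommendation, the sketch below serializes labeled examples as JSON Lines with an explicit separator at the end of each input and a stop sequence at the end of each output. The `prompt` and `completion` field names, the separator, and the stop sequence are illustrative assumptions; the format your fine-tuning service or library expects may differ, so check its documentation before reusing this layout.

```python
import json

# Illustrative choices, not requirements of any particular service.
SEPARATOR = "\n\n###\n\n"  # marks where the input ends
STOP = " END"              # marks where the output ends


def to_record(text: str, label: str) -> dict:
    """Wrap one example so input and output boundaries are unambiguous."""
    return {"prompt": text + SEPARATOR, "completion": " " + label + STOP}


def to_jsonl(pairs) -> str:
    """Serialize (text, label) pairs as one JSON object per line."""
    return "\n".join(json.dumps(to_record(text, label)) for text, label in pairs)


jsonl = to_jsonl([
    ("The screen cracked after two days.", "negative"),
    ("Setup took less than a minute.", "positive"),
])
print(jsonl)
```

Keeping the separator and stop sequence in named constants makes it straightforward to try alternative delimiters and compare the resulting fine-tuned models.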

Stages for Experimenting with LLMs

Large language model (LLM) experimentation is a multi-stage process. While the number of stages may vary from one application to another, we can define at least four main stages.

Preliminary ideation

In this stage, the goal is to explore different prompt ideas and qualitatively assess the output of the LLM. More specifically, a small toy dataset can be used to test different prompts and observe the output diversity, coherence, and relevance. This test dataset can help in defining data requirements and planning for experimentation. Simple Jupyter notebooks or the Azure OpenAI playground can be used to interact with the LLM.

Establishing a baseline

In this stage, the goal is to establish baseline performance using a simple solution (e.g., static prompt with zero-shot learning). To measure the performance, an evaluation set and evaluation metrics are needed. The evaluation set should contain representative examples of the task, with ground-truth or reference completions (outputs). The evaluation metrics should capture the quality aspects of the output, such as accuracy or informativeness. Tools or environments that can efficiently call the LLM API are needed to generate and score the outputs.

Hypothesis-driven experimentation

In this stage, multiple experiments can be implemented, executed, and evaluated to improve the performance of the solution. This stage is iterative and data-driven. For each iteration, different experiments are evaluated and compared. It may also involve defining different experimental configurations and performing hyperparameter sweeping. After that, exploratory results analysis can be performed for selected experiments to better understand performance issues. This analysis may reveal patterns of errors, biases, or gaps in the results. Finally, insights can be used to define new hypotheses for improvement and/or the need for more data. An experimentation framework is needed at this stage to enable large-scale experimentation.
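
A configuration sweep of the kind described above can be sketched as a grid over a few settings, with each configuration run through the same experiment function and the results ranked. The search space, the parameter names, and `fake_experiment` (a deterministic stand-in for generating outputs and scoring them on an evaluation set) are all assumptions made for this example.

```python
import itertools

# Hypothetical search space; the names and values are illustrative only.
SWEEP = {
    "temperature": [0.0, 0.7],
    "num_examples": [0, 3],
    "prompt": ["terse", "detailed"],
}


def configurations(space):
    """Yield one dict per combination of values in the search space."""
    keys = list(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def run_sweep(space, run_experiment):
    """Run every configuration and return (config, score) pairs, best first."""
    results = [(cfg, run_experiment(cfg)) for cfg in configurations(space)]
    return sorted(results, key=lambda item: item[1], reverse=True)


def fake_experiment(cfg):
    """Stand-in for a real run that would call the LLM and score its outputs."""
    return round(0.6 + 0.05 * cfg["num_examples"] - 0.1 * cfg["temperature"], 3)


ranked = run_sweep(SWEEP, fake_experiment)
for cfg, result in ranked:
    print(result, cfg)
```

A full grid grows multiplicatively with each added setting, so larger sweeps usually need sampling or early stopping, which is one reason a dedicated experimentation framework becomes useful at this stage.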

Real-world testing

In this stage, the goal is to test and evaluate the solution that has been deployed in production. Observability tools can be used to track and monitor the solution’s behavior and performance (e.g., detect drifts). Further, user data and feedback can be exported for exploratory data analysis (EDA) to quantitatively assess the performance of the solution. The EDA may also help us identify new data to be used in experimentation (e.g., added to the evaluation sets), new evaluation criteria/metrics, or improvement opportunities.
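
As a very simple illustration of drift detection, the sketch below compares the average of a quality signal (for example, user ratings rescaled to the range 0 to 1) in a recent window against a reference window and flags a drop larger than a threshold. The function name, the choice of signal, and the threshold are assumptions for this example; production observability tools typically use more robust statistical tests.

```python
from statistics import mean


def detect_drift(reference, recent, max_drop=0.1):
    """Return True when the recent average falls more than max_drop below the reference."""
    drop = mean(reference) - mean(recent)
    return drop > max_drop


# Illustrative values only, not real measurements.
reference_scores = [0.90, 0.80, 0.85]
recent_scores = [0.60, 0.65, 0.70]

if detect_drift(reference_scores, recent_scores):
    print("Possible drift: export recent interactions for exploratory analysis")
```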