Understanding LLMs

Large Language Models (LLMs), such as GPT-3, are powerful tools that can generate natural language across many domains and tasks. However, they are not perfect: they have limitations and risks that should be weighed before applying them to real-world use cases. This article offers recommendations for choosing appropriate use cases for large language models.

LLMs are best suited for generative applications

LLMs are trained on massive amounts of text. The training objective is to learn the statistical patterns of language and predict the most likely next word given the previous words. They are therefore best suited to scenarios that require generating coherent, fluent text, such as the following (see the sketch after this list):

  • Writing stories
  • Writing essays
  • Writing captions
  • Writing headlines
  • Generating natural language from structured data
  • Writing code from natural language specifications
  • Summarizing long documents
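
As a concrete illustration, here is a minimal generation sketch using the Hugging Face transformers library with the small GPT-2 model; the model choice, prompt, and sampling parameters are illustrative assumptions, not recommendations:

    # Minimal text-generation sketch using Hugging Face transformers.
    # The model choice and sampling parameters are illustrative only.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    prompt = "Write a headline for an article about renewable energy:"
    result = generator(
        prompt,
        max_new_tokens=30,   # cap how much text the model adds
        do_sample=True,      # sample from the predicted next-word distribution
        temperature=0.8,     # soften the distribution for more varied output
    )
    print(result[0]["generated_text"])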

However, they may not perform well on tasks that require logical reasoning, factual knowledge, or domain-specific expertise. For such tasks, the prompt should be augmented with sufficient relevant information to ground the model, as in the sketch below.
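
One common shape for this is retrieval-augmented prompting: fetch relevant passages and prepend them to the question. In the minimal sketch below, the retrieve function is a hypothetical stand-in for whatever search layer your application uses, not a real library call:

    # Minimal grounding sketch: augment the prompt with retrieved context.
    def retrieve(question: str, k: int = 3) -> list[str]:
        # Hypothetical placeholder: plug in your own search here
        # (keyword search, vector database, etc.).
        return ["(retrieved passage 1)", "(retrieved passage 2)"][:k]

    def build_grounded_prompt(question: str) -> str:
        context = "\n\n".join(retrieve(question))
        return (
            "Answer the question using ONLY the context below. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )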

Bad answers, factual errors, and other problematic output will happen

Large language models are not infallible, and they may produce output that is incorrect, misleading, biased, offensive, or harmful. Such failures can occur for several reasons:

  • Data quality issues, such as biased, noisy, or outdated training data
  • Model limitations, such as hallucinated facts or shallow reasoning
  • Adversarial inputs crafted to elicit problematic responses
  • Unintended consequences of otherwise reasonable prompts

Therefore, the use case should be designed to minimize the impact and frequency of such failures, and it should provide mechanisms for detecting, correcting, and reporting them. For example, the use case could include quality checks, feedback loops, human oversight, or ethical guidelines.
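
As one concrete shape such a safeguard might take, the sketch below gates model output behind simple automated checks before it reaches anyone. The checks and the blocklist are illustrative placeholders, not a production-ready filter:

    # Minimal quality-gate sketch. A real system would use proper content
    # moderation and fact-checking, plus human review for flagged output.
    BANNED_TERMS = {"placeholder-term"}  # hypothetical blocklist

    def flag_for_human_review(text: str) -> str:
        # Hypothetical escalation hook: queue for a reviewer and
        # return a safe fallback message in the meantime.
        return "This response is pending review."

    def passes_quality_checks(text: str) -> bool:
        if not text.strip():
            return False  # reject empty output
        if any(term in text.lower() for term in BANNED_TERMS):
            return False  # crude keyword filter
        return True

    def handle_output(text: str) -> str:
        return text if passes_quality_checks(text) else flag_for_human_review(text)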

Smaller models might work better than LLMs

LLMs are general-purpose models that can handle a wide range of tasks, but they may not be optimal for specific tasks that require more specialized knowledge or skills. In many cases, a smaller, purpose-built NLP model will outperform GPT-3 on a narrow, non-generation task.

For example, consider a task that involves classifying text into predefined categories, such as sentiment analysis, spam detection, or topic classification. Such a task may benefit from a model that is trained and fine-tuned on a relevant dataset and objective, rather than a generic model that tries to fit all possible scenarios. A purpose-built NLP model may also be more efficient, interpretable, and explainable than a large language model.
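
For instance, a compact sentiment classifier can be trained with scikit-learn in a few lines. The tiny inline dataset below is a toy placeholder for a real labeled corpus:

    # Minimal purpose-built sentiment classifier using scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["great product, loved it", "terrible, waste of money",
             "works as expected", "awful customer service"]
    labels = ["positive", "negative", "positive", "negative"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)

    print(clf.predict(["really happy with this purchase"]))

A model like this is cheap to train and run, and its learned feature weights can be inspected directly, unlike a large generative model.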

Use caution when sharing LLM output

LLMs can generate plausible and convincing text, but they cannot guarantee its accuracy, reliability, or suitability for a given purpose. For this reason, we do not recommend use cases where model outputs are presented directly to an end user, especially in high-risk or high-stakes contexts.

Particular care should be taken when the end user lacks the knowledge or expertise necessary to verify the validity of an LLM response. Consider these examples:

  • Medical advice
  • Legal guidance
  • Financial information
  • Educational content

In these cases, a human expert should be involved in the process: reviewing, editing, or approving the model outputs, or providing additional context, clarification, or disclaimers.
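
The sketch below shows one way to structure such a human-in-the-loop step: draft outputs are held in a queue and released only after an expert approves them. The data shape and function names are illustrative assumptions, not a prescribed design:

    # Minimal human-in-the-loop approval sketch.
    from dataclasses import dataclass

    @dataclass
    class Draft:
        text: str
        approved: bool = False
        reviewer_notes: str = ""

    review_queue: list[Draft] = []

    def submit_draft(text: str) -> Draft:
        draft = Draft(text=text)
        review_queue.append(draft)  # hold the output for expert review
        return draft

    def approve(draft: Draft, notes: str = "") -> str:
        draft.approved = True
        draft.reviewer_notes = notes
        return draft.text  # only approved text reaches end users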

Conclusion

LLMs are powerful and versatile tools that can enable many novel and useful applications, but they also have limitations and risks that must be carefully considered and addressed. We hope these recommendations help developers and users of large language models make informed and responsible decisions about their use cases.