Use cases for Azure OpenAI Service
What is a Transparency Note?
An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, what its capabilities and limitations are, and how to achieve the best performance. Microsoft’s Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system, or share them with the people who will use or be affected by your system.
Microsoft’s Transparency Notes are part of a broader effort at Microsoft to put our AI Principles into practice. To find out more, see Microsoft's AI principles.
The basics of Azure OpenAI
Introduction
Azure OpenAI provides customers with a fully managed AI service that lets developers and data scientists apply OpenAI's powerful language models, including the GPT-3 series (which includes the ChatGPT model) and the Codex series. GPT-3 models analyze and generate natural language, while Codex models analyze and generate code and plain-text code commentary. These models use an autoregressive architecture, meaning they use data from prior observations to predict the most probable next word. This process is then repeated by appending the newly generated content to the original text to produce the complete generated response. Because the response is conditioned on the input text, these models can be applied to a variety of tasks simply by changing the input text.
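As a rough illustration of this autoregressive loop, the following sketch shows the generate-and-append process with a toy stand-in predictor. The toy continuation table and the `predict_next_token` helper are hypothetical and exist only for illustration; they are not part of any Azure OpenAI API, which performs this loop inside the model.

```python
# Conceptual sketch of autoregressive generation with a toy "model".
# The toy predictor below is purely illustrative; a real model scores every
# token in its vocabulary conditioned on the text generated so far.

TOY_CONTINUATIONS = {
    "The capital of France is": [" Paris", "."],
}

def predict_next_token(text: str, state: list) -> str:
    # Toy stand-in: pops pre-canned tokens; a real model predicts the most
    # probable next token conditioned on `text`.
    return state.pop(0) if state else "<stop>"

def generate(prompt: str, max_tokens: int = 16) -> str:
    state = list(TOY_CONTINUATIONS.get(prompt, []))
    text = prompt
    for _ in range(max_tokens):
        next_token = predict_next_token(text, state)
        if next_token == "<stop>":
            break
        text += next_token          # append, then condition on it next turn
    return text[len(prompt):]       # the completion beyond the original prompt

print(generate("The capital of France is"))  # " Paris."
```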
The GPT-3 series of models are pretrained on a wide body of publicly available free text data. This data is sourced from a combination of web crawling (specifically, a filtered version of Common Crawl, which includes a broad range of text from the internet and comprises sixty percent of the weighted pre-training dataset) and higher-quality datasets, including an expanded version of the WebText dataset, two internet-based books corpora and English-language Wikipedia. The GPT-4 base model was trained using publicly available data (such as internet data) as well as data that was licensed by OpenAI. The model was fine-tuned using reinforcement learning with human feedback (RLHF).
Learn more about the training and modeling techniques in OpenAI's GPT-3, GPT-4, and Codex research papers. The guidance below is also drawn from OpenAI's safety best practices.
Key terms
| Term | Definition |
|---|---|
| Prompt | The text you send to the service in the API call. This text is then input into the model. For example, one might input the following prompt:<br>Convert the questions to a command:<br>Q: Ask Constance if we need some bread<br>A: send-msg `find constance` Do we need some bread?<br>Q: Send a message to Greg to figure out if things are ready for Wednesday.<br>A: |
| Completion or Generation | The text Azure OpenAI outputs in response. For example, the service may respond with the following answer to the above prompt: send-msg `find greg` figure out if things are ready for Wednesday. |
| Token | Azure OpenAI processes text by breaking it down into tokens. Tokens can be words or just chunks of characters. For example, the word “hamburger” gets broken up into the tokens “ham”, “bur” and “ger”, while a short and common word like “pear” is a single token. Many tokens start with a whitespace, for example “ hello” and “ bye”. |
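To see how text maps to tokens in practice, you can inspect splits with a tokenizer library such as OpenAI's open-source tiktoken package (used here as an assumption; other tokenizer tools exist). Note that exact splits depend on the encoding: the “ham”/“bur”/“ger” example above reflects the GPT-3 tokenizer, and newer encodings may split the same word differently.

```python
# Minimal sketch: inspecting token counts and splits with tiktoken.
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by several newer OpenAI models; older
# GPT-3 models use different encodings, so splits may differ from the
# "ham"/"bur"/"ger" example in the table above.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["hamburger", "pear", " hello", " bye"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r}: {len(token_ids)} token(s) -> {pieces}")
```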
Capabilities
System behavior
The Azure OpenAI Service models use natural language instructions and examples in the prompt to identify the task. The model then completes the task by predicting the most probable next text. This technique is known as "in-context" learning. These models are not re-trained during this step but instead give predictions based on the context you include in the prompt.
There are three main approaches for in-context learning. These approaches vary based on the amount of task-specific data that is given to the model:
Few-shot: In this case, a user includes several examples in the prompt that demonstrate the expected answer format and content. The following example shows a few-shot prompt providing multiple examples:
Convert the questions to a command:
Q: Ask Constance if we need some bread
A: send-msg `find constance` Do we need some bread?
Q: Send a message to Greg to figure out if things are ready for Wednesday.
A: send-msg `find greg` Is everything ready for Wednesday?
Q: Ask Ilya if we're still having our meeting this evening
A: send-msg `find ilya` Are we still having a meeting this evening?
Q: Contact the ski store and figure out if I can get my skis fixed before I leave on Thursday
A: send-msg `find ski store` Would it be possible to get my skis fixed before I leave on Thursday?
Q: Thank Nicolas for lunch
A: send-msg `find nicolas` Thank you for lunch!
Q: Tell Constance that I won't be home before 19:30 tonight — unmovable meeting.
A: send-msg `find constance` I won't be home before 19:30 tonight. I have a meeting I can't move.
Q: Tell John that I need to book an appointment at 10:30
A:
The number of examples typically ranges from 0 to 100 depending on how many can fit in the maximum input length for a single prompt. Few-shot learning enables a major reduction in the amount of task-specific data required for accurate predictions.
One-shot: This case is the same as the few-shot approach except only one example is provided. The following example shows a one-shot prompt:
Convert the questions to a command:
Q: Ask Constance if we need some bread
A: send-msg `find constance` Do we need some bread?
Q: Send a message to Greg to figure out if things are ready for Wednesday.
A:
Zero-shot: In this case, no examples are provided to the model and only the task request is provided. The following example shows a zero-shot prompt:
Convert the question to a command:
Q: Ask Constance if we need some bread
A:
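The examples above show only the prompt text. The following sketch shows one way such a prompt might be sent to an Azure OpenAI deployment using the OpenAI Python SDK; the endpoint environment variables, API version, and deployment name are illustrative placeholders, not values from this document.

```python
# Minimal sketch: sending the zero-shot prompt above to an Azure OpenAI
# completions deployment. Endpoint, API version, and deployment name are
# placeholders you would replace with your own resource's values.
import os
from openai import AzureOpenAI  # pip install openai>=1.0

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # illustrative API version
)

prompt = (
    "Convert the question to a command:\n"
    "Q: Ask Constance if we need some bread\n"
    "A:"
)

response = client.completions.create(
    model="my-completions-deployment",  # your deployment name (placeholder)
    prompt=prompt,
    max_tokens=32,
    temperature=0,
    stop=["Q:"],  # stop before the model invents another example
)
print(response.choices[0].text.strip())
```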
Use cases
Intended uses
Azure OpenAI can be used in multiple scenarios. The system’s intended uses include:
- Chat and conversation interaction: Users can interact with a conversational agent that responds with answers drawn from trusted documents such as internal company documentation or tech support documentation. Conversations must be limited to answering scoped questions.
- Chat and conversation creation: Users can create a conversational agent that responds with answers drawn from trusted documents such as internal company documentation or tech support documentation. Conversations must be limited to answering scoped questions.
- Code generation or transformation scenarios: For example, converting one programming language to another, generating docstrings for functions, converting natural language to SQL.
- Journalistic content: Used to create new journalistic content or to rewrite journalistic content submitted by the user as a writing aid for pre-defined topics. Users cannot use the application as a general content creation tool for all topics, and it may not be used to generate content for political campaigns.
- Question-answering: Users can ask questions and receive answers from trusted source documents such as internal company documentation. The application does not generate answers ungrounded in trusted source documentation.
- Reason over structured and unstructured data: Users can analyze inputs using classification, sentiment analysis of text, or entity extraction. Examples include analyzing product feedback sentiment, analyzing support calls and transcripts, and refining text-based search with embeddings (illustrated in the sketch after this list).
- Search: Users can search trusted source documents such as internal company documentation. The application does not generate results ungrounded in trusted source documentation.
- Summarization: Users can submit content to be summarized for pre-defined topics built into the application and cannot use the application as an open-ended summarizer. Examples include summarization of internal company documentation, call center transcripts, technical reports, and product reviews.
- Writing assistance on specific topics: Users can create new content or rewrite content submitted by the user as a writing aid for business content or pre-defined topics. Users can only rewrite or create content for specific business purposes or pre-defined topics and cannot use the application as a general content creation tool for all topics. Examples of business content include proposals and reports. For journalistic use, see above Journalistic content use case.
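As one concrete illustration of the "reason over structured and unstructured data" and "search" use cases above, the following sketch uses the embeddings endpoint to rank a small set of trusted documents against a user query. The deployment name and environment variables are placeholders, and the ranking shown is a simplified stand-in for a production retrieval pipeline.

```python
# Minimal sketch: ranking trusted documents against a query with embeddings.
# Deployment name and endpoint values are placeholders.
import math
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # illustrative API version
)

def embed(texts):
    # One embedding vector per input string.
    resp = client.embeddings.create(
        model="my-embeddings-deployment",  # placeholder deployment name
        input=texts,
    )
    return [item.embedding for item in resp.data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

documents = [
    "How to reset your VPN password",
    "Expense report submission guidelines",
    "Troubleshooting printer connectivity",
]
doc_vectors = embed(documents)
query_vector = embed(["I can't sign in to the VPN"])[0]

# Most relevant trusted document first.
ranked = sorted(zip(documents, doc_vectors),
                key=lambda pair: cosine(query_vector, pair[1]),
                reverse=True)
for doc, _ in ranked:
    print(doc)
```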
Considerations when choosing a use case
We encourage customers to leverage Azure OpenAI in their innovative solutions or applications. However, here are some considerations when choosing a use case:
- Not suitable for open-ended, unconstrained content generation. Scenarios where users can generate content on any topic are more likely to produce offensive or harmful text. The same is true of longer generations.
- Not suitable for scenarios where up-to-date, factually accurate information is crucial unless you have human reviewers or are using the models to search your own documents and have verified suitability for your scenario. The service does not have information about events that occur after its training date, likely has missing knowledge about some topics, and may not always produce factually accurate information.
- Avoid scenarios where use or misuse of the system could result in significant physical or psychological injury to an individual. For example, scenarios that diagnose patients or prescribe medications have the potential to cause significant harm.
- Avoid scenarios where use or misuse of the system could have a consequential impact on life opportunities or legal status. Examples include scenarios where the AI system could affect an individual's legal status, legal rights, or their access to credit, education, employment, healthcare, housing, insurance, social welfare benefits, services, opportunities, or the terms on which they are provided.
- Avoid high-stakes scenarios that could lead to harm. The models hosted by the Azure OpenAI service reflect certain societal views, biases, and other undesirable content present in the training data or the examples provided in the prompt. As a result, we caution against using the models in high-stakes scenarios where unfair, unreliable, or offensive behavior might be extremely costly or lead to harm.
- Carefully consider use cases in high-stakes domains or industries: Examples include but are not limited to healthcare, medicine, finance, and the legal field.
- Carefully consider well-scoped chatbot scenarios. Limiting the use of the service in chatbots to a narrow domain reduces the risk of generating unintended or undesirable responses.
- Carefully consider all generative use cases. Content generation scenarios may be more likely to produce unintended outputs and these scenarios require careful consideration and mitigations.
Limitations
When it comes to large-scale natural language models, there are particular fairness and responsible AI issues to consider. People use language to describe the world and to express their beliefs, assumptions, attitudes, and values. As a result, publicly available text data typically used to train large-scale natural language processing models contains societal biases relating to race, gender, religion, age, and other groups of people, as well as other undesirable content. These societal biases are reflected in the distributions of words, phrases, and syntactic structures.
Technical limitations, operational factors and ranges
Large-scale natural language models trained with such data can potentially behave in ways that are unfair, unreliable, or offensive, in turn causing harms. There are several different ways in which a large-scale natural language processing model can cause harms. Some of the ways are listed here. We emphasize that these types of harms aren't mutually exclusive. A single model can exhibit more than one type of harm, potentially relating to multiple different groups of people. For example:
- Allocation: Language models can be used in ways that lead to unfair allocation of resources or opportunities. For example, automated resume screening systems can withhold employment opportunities from one gender if they're trained on resume data that reflects the existing gender imbalance in a particular industry.
- Quality of service: Language models can fail to provide the same quality of service to some people as they do to others. For example, sentence completion systems may not work as well for some dialects or language varieties because of their lack of representation in the training data. The Azure OpenAI models are trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data may experience worse performance.
- Stereotyping: Language models can reinforce stereotypes. For example, when translating "He is a nurse" and "She is a doctor" into a genderless language such as Turkish and then back into English, many machine translation systems yield the stereotypical (and incorrect) results of "She is a nurse" and "He is a doctor."
- Demeaning: Language models can demean people. For example, an open-ended content generation system with inappropriate or insufficient mitigations might produce offensive text targeted at a particular group of people.
- Over- and underrepresentation: Language models can over- or underrepresent groups of people, or even erase them entirely. For example, toxicity detection systems that rate text containing the word "gay" as toxic might lead to the underrepresentation or even erasure of legitimate text written by or about the LGBTQIA+ community.
- Inappropriate or offensive content: Language models can produce other types of inappropriate or offensive content. Examples include hate speech; text that contains profane words or phrases; text that relates to illicit activities; text that relates to contested, controversial, or ideologically polarizing topics; misinformation; text that's manipulative; and text that relates to sensitive or emotionally charged topics. For example, suggested-reply systems that are restricted to positive replies can suggest inappropriate or insensitive replies for messages about negative events.
- Information reliability: Language model responses can fabricate content that may sound reasonable but is nonsensical or inaccurate with respect to external validation sources. Even when drawing responses from trusted source information, responses may misrepresent that content.
System performance
In many AI systems, performance is often defined in relation to accuracy—that is, how often the AI system offers a correct prediction or output. With large-scale natural language models, two different users might look at the same output and have different opinions of how useful or relevant it is, which means that performance for these systems must be defined more flexibly. Here, we broadly consider performance to mean that the application performs as you and your users expect, including not generating harmful outputs.
Azure OpenAI service can support a wide range of applications like search, classification, and code generation, each with different performance metrics and mitigation strategies. There are several steps you can take to mitigate some of the concerns listed under “Limitations” and to improve performance. Additional important mitigation techniques are outlined in the section Evaluating and integrating Azure OpenAI for your use below.
- Show and tell when designing prompts. Make it clear to the model what kind of outputs you expect through instructions, examples, or a combination of the two. If you want the model to rank a list of items in alphabetical order or to classify a paragraph by sentiment, show it that's what you want.
- Keep your application on-topic. Carefully structure prompts to reduce the chance of producing undesired content, even if a user deliberately tries to elicit it. For instance, you might indicate in your prompt that a chatbot only engages in conversations about mathematics and otherwise responds “I’m sorry. I’m afraid I can’t answer that.” Adding adjectives like "polite" and examples in your desired tone to your prompt can also help steer outputs. Consider nudging users toward acceptable queries, either by listing such examples upfront or by offering them as suggestions upon receiving an off-topic request. Consider training a classifier to determine whether an input is on- or off-topic (a minimal sketch of a scoped prompt follows this list).
- Provide quality data. If you're trying to build a classifier or get the model to follow a pattern, make sure that there are enough examples. Be sure to proofread your examples — the model is usually smart enough to see through basic spelling mistakes and give you a response, but it also might assume this is intentional and it could affect the response. Providing quality data also includes giving your model reliable data to draw responses from in chat and question answering systems.
- Measure model quality. As part of general model quality, consider measuring and improving fairness-related metrics and other metrics related to responsible AI in addition to traditional accuracy measures for your scenario. Consider resources like this checklist when you measure the fairness of the system. These measurements come with limitations, which you should acknowledge and communicate to stakeholders along with evaluation results.
- Limit the length, structure, rate, and source of inputs and outputs. Restricting the length or structure of inputs and outputs can increase the likelihood that the application will stay on task and mitigate, at least in part, potential unfair, unreliable, or offensive behavior. Restricting the source of inputs (for example, limiting inputs to a fixed list of items, a particular domain, or to authenticated users rather than anyone on the internet) or restricting the source of outputs (for example, only surfacing answers from approved, vetted documents rather than the open web) can further mitigate the risk of the harms. Putting usage rate limits in place can also reduce misuse.
- Implement additional scenario-specific mitigations. Refer to the mitigations outlined in Evaluating and integrating Azure OpenAI for your use below, including content moderation strategies. They do not represent every mitigation that may be required for your application but point to the general minimum baseline we check for when approving use cases for the Azure OpenAI service.
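The following is a minimal sketch of the "keep your application on-topic" and "limit the length of outputs" practices above, using a system message that scopes a chat deployment to a narrow domain and a fixed refusal for off-topic requests. The deployment name, environment variables, and refusal wording are illustrative choices, not requirements.

```python
# Minimal sketch: a narrowly scoped chat call with a fixed refusal,
# an output length cap, and low temperature. Placeholder deployment name.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # illustrative API version
)

SYSTEM_MESSAGE = (
    "You are a polite assistant that only answers questions about mathematics. "
    "If the user asks about anything else, reply exactly: "
    "\"I'm sorry. I'm afraid I can't answer that.\""
)

def ask(user_input: str) -> str:
    response = client.chat.completions.create(
        model="my-chat-deployment",          # placeholder deployment name
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_input},
        ],
        max_tokens=150,                      # limit output length
        temperature=0.2,                     # reduce off-topic drift
    )
    return response.choices[0].message.content

print(ask("How do I factor x^2 - 5x + 6?"))
print(ask("Write me a political speech."))   # expected: the scoped refusal
```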
Evaluation of Azure OpenAI
GPT-3 models were trained on a total of 300 billion tokens and evaluated in zero-shot, one-shot, and few-shot settings on a range of NLP benchmarks and tasks, such as language modeling and completion tasks, and on datasets that involve commonsense reasoning, reading comprehension, and question answering. GPT-3 shows strong performance on many NLP benchmarks and tasks; in some cases it nearly matches the performance of state-of-the-art fine-tuned systems, and it demonstrates strong qualitative performance on tasks defined on the fly. More information and benchmarking statistics can be found in the OpenAI GPT-3 research paper. On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models and most state-of-the-art systems. For detailed evaluation results, see the GPT-4 Technical Report.
Evaluating and integrating Azure OpenAI for your use
Practices for responsible use
The practices below are baseline requirements for production applications.
Ensure human oversight. Ensure appropriate human oversight for your system, including ensuring the people responsible for oversight understand the system’s intended uses; how to effectively interact with the system; how to effectively interpret system behavior; when and how to override, intervene, or interrupt the system; and ways to remain aware of the possible tendency of over-relying on outputs produced by the system (“automation bias”). Especially in high-stakes scenarios, make sure people have a role in decision-making and can intervene in real time to prevent harm. For people to make good decisions about system outputs, they need to (a) understand how the system works, (b) have awareness of the system status and how well it’s working in their scenario, and (c) have the opportunity and the tools to bring the system into alignment with their expectations and goals. When using the Azure OpenAI service, human oversight might include some or all of the following:
1a. Let people edit generated outputs. (For detailed guidance and examples, see HAX G9-B: Rich and detailed edits.) For example, give users the option to edit each generated output before using the text. Our research showed that users expected to be able to edit content and even combine parts of different generations.
1b. Provide system performance information. (For detailed guidance and examples, see HAX G2-C: Report system performance information.) Provide grounded information about how well the system can do what it can do, which may cover overall system performance as well as system performance under certain conditions or for certain user groups. One way to highlight potential inaccuracies in generated output is to underline key pieces of information provided by the system (numbers, days, dates, names, URLs, titles, quotations, addresses, phone numbers) and provide a tooltip flagging that the user should fact-check the content.
1c. Remind users that they are accountable for final decisions and/or final content. For example, before a user is allowed to insert generated content into their document, you might require them to go through a pop-up dialog and edit the text or acknowledge that they’ve reviewed it. Our research suggests that users appreciate having these reminders and believe them to be especially beneficial for beginners.
1d. Limit how people can automate your product or service. For example, don’t allow automated posting of generated content to external sites (including social media), or automated execution of generated code.
1e. Disclose AI’s role in generated content. In some cases, letting content consumers know when published content is partly or fully generated by Azure OpenAI can help them use their own judgment about how to read it. If generated content does not include meaningful human oversight before being shared or published—including opportunities for an expert to understand, review, and edit the content—disclosure may be critical to preventing misinformation.
Implement technical limits on inputs and outputs.
2a. Limit the length of inputs and outputs. Restricting input and output length can reduce the likelihood of producing undesirable content, misuse of the system beyond its intended use cases, or other harmful or unintended scenarios.
2b. Structure inputs to limit open-ended responses and to give users more-refined control. The better the instructions that users give each time they interact with the system, the better the results they’ll get. Restrict users from creating custom prompts that let them operate as if interacting directly with the API. You can also limit outputs to be structured in certain formats or patterns.
For example, a prompt can be structured to require that users provide limiting details such as audience and tone. When setting prompt fields, consider what information will be easy for users to provide, and run experiments to learn what information changes output quality. Smart defaults can help people get started quickly and can also be used to demonstrate best practices for prompt format, length, and style. A sketch of such a structured prompt, combined with simple input and output checks, follows item 2e below.
2c. Return outputs from validated, reliable source materials, such as existing support articles, rather than returning non-vetted content from the internet. This can help your application stay on task and mitigate unfair, unreliable, or offensive behavior.
2d. Implement blocklists and content moderation. Keep your application on topic by checking both inputs and outputs for undesired content. The definition of undesired content depends on your scenario and changes over time. It might include hate speech, text that contains profane words or phrases, misinformation, and text that relates to sensitive or emotionally charged topics. Checking inputs can help keep your application on topic, even if a malicious user tries to produce undesired content. Checking API outputs can allow you to detect undesired content produced by the system and take action. You can replace it, report it, ask the user to enter different input, or provide input examples.
2e. Put rate limits in place (that is, limit the frequency and quantity of API calls) to further reduce misuse.
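The following sketch combines several of the practices above: a structured prompt built from constrained fields (2b), length caps on inputs (2a), and simple checks on both input and output (2d). The blocklist check is deliberately simplistic and the field names, limits, and blocked phrases are illustrative; a production system would typically use a dedicated content moderation service rather than keyword matching.

```python
# Minimal sketch of practices 2a, 2b, and 2d: structured inputs, length
# limits, and simple input/output checks. The blocklist is a deliberately
# simplistic placeholder for a real content moderation service.
MAX_INPUT_CHARS = 500
ALLOWED_TONES = {"formal", "friendly", "concise"}
BLOCKLIST = {"example banned phrase", "another banned phrase"}  # placeholders

def contains_blocked_term(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def build_prompt(topic: str, audience: str, tone: str) -> str:
    # Structured fields instead of a free-form prompt (practice 2b).
    if tone not in ALLOWED_TONES:
        raise ValueError(f"tone must be one of {sorted(ALLOWED_TONES)}")
    if len(topic) > MAX_INPUT_CHARS or contains_blocked_term(topic):
        raise ValueError("topic rejected by input checks")  # practice 2d
    return (
        f"Write a short {tone} product update for {audience} about:\n"
        f"{topic}\n"
        "Keep it under 120 words."
    )

def check_output(completion: str) -> str:
    # Post-generation check (practice 2d): withhold undesired content.
    if contains_blocked_term(completion):
        return "The generated text was withheld. Please try a different request."
    return completion

prompt = build_prompt(
    topic="New single sign-on support in the billing portal",
    audience="enterprise administrators",
    tone="concise",
)
print(prompt)  # send this to your deployment, then pass the result to check_output
```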
Authenticate users. To make misuse more difficult, consider requiring that customers sign in and, if appropriate, link a valid payment method. Consider working only with known, trusted customers in the early stages of deployment. Applications that do not authenticate users may require other, stricter mitigations to ensure the application cannot be used beyond its intended purpose.
Test your application thoroughly to ensure it responds in a way that is fit for the application's purpose. This includes conducting adversarial testing where trusted testers attempt to find system failures, poor performance, or undesirable behaviors. This information helps you to understand risks and consider appropriate mitigations. Communicate the capabilities and limitations to stakeholders.
Establish feedback channels for users and impacted groups. AI-powered products and features require ongoing monitoring and improvement. Establish channels to collect questions and concerns from users as well as people affected by the system. For example:
- 5a. Build feedback features into the user experience. Invite feedback on the usefulness and accuracy of outputs, and give users a separate and clear path to report outputs that are problematic, offensive, biased, or otherwise inappropriate. For detailed guidance and examples, see HAX Guideline 15: Encourage granular feedback.
- 5b. Publish an easy-to-remember email address for public feedback.
Prompt-based practices
Robust prompt engineering can keep the model on topic, reduce the effectiveness of adversarial interactions, and help the model provide reliable responses for your scenario. Practices include providing additional grounding data for the model to base its responses on, breaking the prompt into steps, requiring links to the data responses were drawn from, and carefully choosing few-shot examples. For more information, see Microsoft’s Prompt Engineering guide.
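The following is a minimal sketch of the grounding and citation practices described above: retrieved passages from trusted documents are placed in the prompt, and the model is instructed to answer only from them and to cite the passage it used. The retrieval step, passages, and document IDs are illustrative placeholders; Microsoft’s Prompt Engineering guide covers these techniques in more depth.

```python
# Minimal sketch: building a grounded prompt that asks the model to answer
# only from supplied passages and to cite them. The passages and IDs here
# are placeholders for output from your own retrieval step.
retrieved_passages = [
    ("DOC-101", "VPN passwords can be reset from the self-service portal."),
    ("DOC-204", "Password resets require multi-factor authentication."),
]

def build_grounded_prompt(question: str) -> str:
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieved_passages)
    return (
        "Answer the question using only the sources below. "
        "Cite the source ID in square brackets after each claim. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_grounded_prompt("How do I reset my VPN password?"))
```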
Scenario-specific practices
- If your application powers chatbots or other conversational AI systems, follow the Microsoft guidelines for responsible development of conversational AI systems.
- If you are developing an application in a high-stakes domain or industry, such as healthcare, human resources, education, or the legal field, thoroughly assess how well the application works in your scenario, implement strong human oversight, thoroughly evaluate how well users understand the limitations of the application, and comply with all relevant laws. Consider additional mitigations based on your scenario.
Additional practices
- Use Microsoft's Inclusive Design Guidelines to build inclusive solutions.
- Conduct research to test the product and solicit feedback. Include a diverse group of stakeholders (for example, direct users, consumers of generated results, admins, and so on) in your research structure to seek their feedback at different stages of deployment. Depending on the research goal, you can use various methodologies, such as Community Jury, online experiments, beta testing, and testing with real users after deployment. Consider including stakeholders from different demographic groups to gather a wider range of feedback.
- Conduct a legal review. Obtain appropriate legal advice to review your solution, particularly if you plan to use it in sensitive or high-risk applications.
Learn more about responsible AI
- Microsoft AI principles
- Microsoft responsible AI resources
- Microsoft Azure Learning courses on responsible AI