Transparency Note for Azure OpenAI Service
What is a Transparency Note?
An AI system includes not only the technology, but also the people who use it, the people who are affected by it, and the environment in which it's deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, what its capabilities and limitations are, and how to achieve the best performance. Microsoft's Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system, or share them with the people who will use or be affected by your system.
Microsoft’s Transparency Notes are part of a broader effort at Microsoft to put our AI Principles into practice. To find out more, see the Microsoft's AI principles.
The basics of the Azure OpenAI Models
Azure OpenAI provides customers with a fully managed AI service that lets developers and data scientists apply OpenAI's powerful models including models that can generate natural language, code, and images. Within the Azure OpenAI Service, the OpenAI models are integrated with Microsoft-developed content filtering and abuse detection models. Learn more about content filtering here and abuse detection here.
Select the tabs to see content for the relevant model type.
As part of the fully managed Azure OpenAI Service, the GPT-3 models analyze and generate natural language, Codex models analyze and generate code and plain text code commentary, and the GPT-4 models can understand and generate natural language and code. These models use an autoregressive architecture, meaning they use data from prior observations to predict the most probable next word. This process is then repeated by appending the newly generated content to the original text to produce the complete generated response. Because the response is conditioned on the input text, these models can be applied to various tasks simply by changing the input text.
The GPT-3 series of models are pretrained on a wide body of publicly available free text data. This data is sourced from a combination of web crawling (specifically, a filtered version of Common Crawl, which includes a broad range of text from the internet and comprises 60 percent of the weighted pretraining dataset) and higher-quality datasets, including an expanded version of the WebText dataset, two internet-based books corpora and English-language Wikipedia. The GPT-4 base model was trained using publicly available data (such as internet data) and data that was licensed by OpenAI. The model was fine-tuned using reinforcement learning with human feedback (RLHF).
|Prompt||The text you send to the service in the API call. This text is then input into the model. For example, one might input the following prompt:
|Completion or Generation||The text Azure OpenAI outputs in response. For example, the service may respond with the following answer to the above prompt:
|Token||Azure OpenAI processes text by breaking it down into tokens. Tokens can be words or just chunks of characters. For example, the word
The GPT-4, GPT-3, and Codex Azure OpenAI Service models use natural language instructions and examples in the prompt to identify the task. The model then completes the task by predicting the most probable next text. This technique is known as "in-context" learning. These models are not retrained during this step but instead give predictions based on the context you include in the prompt.
There are three main approaches for in-context learning. These approaches vary based on the amount of task-specific data that is given to the model:
Few-shot : In this case, a user includes several examples in the prompt that demonstrate the expected answer format and content. The following example shows a few-shot prompt providing multiple examples:
Convert the questions to a command: Q: Ask Constance if we need some bread A: send-msg `find constance` Do we need some bread? Q: Send a message to Greg to figure out if things areready for Wednesday. A: send-msg `find greg` Is everything ready forWednesday? Q: Ask Ilya if we're still having our meeting thisevening A: send-msg `find ilya` Are we still having a meetingthis evening? Q: Contact the ski store and figure out if I can getmy skis fixed before I leave on Thursday A: send-msg `find ski store` Would it be possible toget my skis fixed before I leave on Thursday? Q: Thank Nicolas for lunch A: send-msg `find nicolas` Thank you for lunch! Q: Tell Constance that I won't be home before 19:30tonight — unmovable meeting. A: send-msg `find constance` I won't be home before19:30 tonight. I have a meeting I can't move. Q: Tell John that I need to book an appointment at10:30 A:
The number of examples typically ranges from 0 to 100 depending on how many can fit in the maximum input length for a single prompt. Few-shot learning enables a major reduction in the amount of task-specific data required for accurate predictions.
One-shot : This case is the same as the few-shot approach except only one example is provided. The following example shows a one-shot prompt:
Convert the questions to a command: Q: Ask Constance if we need some bread A: send-msg `find constance` Do we need some bread? Q: Send a message to Greg to figure out if things are ready for Wednesday. A:
Zero-shot: In this case, no examples are provided to the model and only the task request is provided. The following example shows a zero-shot prompt:
Convert the question to a command: Q: Ask Constance if we need some bread A:
The GPT-4, GPT-3, and Codex models in the Azure OpenAI service can be used in multiple scenarios. The following list isn't comprehensive, but it illustrates the diversity of tasks that can be supported with appropriate mitigations:
- Chat and conversation interaction : Users can interact with a conversational agent that responds with responses drawn from trusted documents such as internal company documentation or tech support documentation. Conversations must be limited to answering scoped questions.
- Chat and conversation creation : Users can create a conversational agent that responds with responses drawn from trusted documents such as internal company documentation or tech support documentation. Conversations must be limited to answering scoped questions.
- Code generation or transformation scenarios : For example, converting one programming language to another, generating docstrings for functions, converting natural language to SQL.
- Journalistic content : For use to create new journalistic content or to rewrite journalistic content submitted by the user as a writing aid for predefined topics. Users cannot use the application as a general content creation tool for all topics. May not be used to generate content for political campaigns.
- Question-answering : Users can ask questions and receive answers from trusted source documents such as internal company documentation. The application does not generate answers ungrounded in trusted source documentation.
- Reason over structured and unstructured data : Users can analyze inputs using classification, sentiment analysis of text, or entity extraction. Examples include analyzing product feedback sentiment, analyzing support calls and transcripts, and refining text-based search with embeddings.
- Search : Users can search trusted source documents such as internal company documentation. The application does not generate results ungrounded in trusted source documentation.
- Summarization : Users can submit content to be summarized for predefined topics built into the application and cannot use the application as an open-ended summarizer. Examples include summarization of internal company documentation, call center transcripts, technical reports, and product reviews.
- Writing assistance on specific topics : Users can create new content or rewrite content submitted by the user as a writing aid for business content or pre-defined topics. Users can only rewrite or create content for specific business purposes or predefined topics and cannot use the application as a general content creation tool for all topics. Examples of business content include proposals and reports. For journalistic use, see above Journalistic content use case.
- Data generation for fine-tuning: Users can use a model in Azure OpenAI to generate data which is used solely to fine-tune (i) another Azure OpenAI model, using the fine-tuning capabilities of Azure OpenAI, and/or (ii) another Azure AI custom model, using the fine-tuning capabilities of the Azure AI service. Generating data and fine-tuning models is limited to internal users only; the fine-tuned model may only be used for inferencing in the applicable Azure AI service and, for Azure OpenAI service, only for customer’s permitted use case(s) under this form.
Considerations when choosing a use case
We encourage customers to use the Azure OpenAI GPT-4, GPT-3, and Codex models in their innovative solutions or applications as approved in their Limited Access registration form. However, here are some considerations when choosing a use case:
- Not suitable for open-ended, unconstrained content generation. Scenarios where users can generate content on any topic are more likely to produce offensive or harmful text. The same is true of longer generations.
- Not suitable for scenarios where up-to-date, factually accurate information is crucial unless you have human reviewers or are using the models to search your own documents and have verified suitability for your scenario. The service does not have information about events that occur after its training date, likely has missing knowledge about some topics, and may not always produce factually accurate information.
- Avoid scenarios where use or misuse of the system could result in significant physical or psychological injury to an individual. For example, scenarios that diagnose patients or prescribe medications have the potential to cause significant harm.
- Avoid scenarios where use or misuse of the system could have a consequential impact on life opportunities or legal status. Examples include scenarios where the AI system could affect an individual's legal status, legal rights, or their access to credit, education, employment, healthcare, housing, insurance, social welfare benefits, services, opportunities, or the terms on which they're provided.
- Avoid high stakes scenarios that could lead to harm. The models hosted by Azure OpenAI service reflect certain societal views, biases, and other undesirable content present in the training data or the examples provided in the prompt. As a result, we caution against using the models in high-stakes scenarios where unfair, unreliable, or offensive behavior might be extremely costly or lead to harm.
- Carefully consider use cases in high stakes domains or industry: Examples include but are not limited to healthcare, medicine, finance, or legal.
- Carefully consider well-scoped chatbot scenarios. Limiting the use of the service in chatbots to a narrow domain reduces the risk of generating unintended or undesirable responses.
- Carefully consider all generative use cases. Content generation scenarios may be more likely to produce unintended outputs and these scenarios require careful consideration and mitigations.
When it comes to large-scale natural language models, vision models, and speech models, there are fairness and responsible AI issues to consider. People use language and images to describe the world and to express their beliefs, assumptions, attitudes, and values. As a result, publicly available text and image data typically used to train large-scale natural language processing and image generation models contains societal biases relating to race, gender, religion, age, and other groups of people, as well as other undesirable content. Similarly, speech models can exhibit different levels of accuracy across different demographic groups and languages. These societal biases are reflected in the distributions of words, phrases, and syntactic structures.
Technical limitations, operational factors, and ranges
Be advised that this section contains illustrative examples which include terms and language that some individuals might find offensive.
Large-scale natural language, image, and speech models trained with such data can potentially behave in ways that are unfair, unreliable, or offensive, in turn causing harms. Some of the ways are listed here. We emphasize that these types of harms are not mutually exclusive. A single model can exhibit more than one type of harm, potentially relating to multiple different groups of people. For example:
- Allocation: These models can be used in ways that lead to unfair allocation of resources or opportunities. For example, automated résumé screening systems can withhold employment opportunities from one gender if they are trained on résumé data that reflects the existing gender imbalance in a particular industry. Or the DALL·E models could be used to create imagery in the style of a known artist, which could affect the value of the artist's work or the artist's life opportunities.
- Quality of service: The Azure OpenAI models are trained primarily on English text and images with English text descriptions. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. The publicly available images used to train the DALL·E models might reinforce public bias and other undesirable content. The DALL-E models are also unable to consistently generate comprehensible text at this time. Speech models might introduce other limitations, for example, translations using the Whisper model in Azure OpenAI are limited to English output only. Broadly speaking, with Speech-to-Text models, be sure to properly specify a language (or locale) for each audio input to improve accuracy in transcription. Additionally, acoustic quality of the audio input, non-speech noise, overlapped speech, vocabulary, accents, and insertion errors might also affect the quality of your transcription or translation.
- Stereotyping: These models can reinforce stereotypes. For example, when translating "He is a nurse" and "She is a doctor" into a genderless language such as Turkish and then back into English, many machine translation systems yield the stereotypical (and incorrect) results of "She is a nurse" and "He is a doctor." With DALL·E, when generating an image based on the prompt "Fatherless children," the model could generate images of Black children only, reinforcing harmful stereotypes that might exist in publicly available images.
- Demeaning: The natural language and vision models in the Azure OpenAI service can demean people. For example, an open-ended content generation system with inappropriate or insufficient mitigations might produce content that is offensive or demeaning to a particular group of people.
- Overrepresentation and underrepresentation: The natural language and vision models in the Azure OpenAI service can over- or under-represent groups of people, or even erase their representation entirely. For example, if text prompts that contain the word "gay" are detected as potentially harmful or offensive, this identification could lead to the underrepresentation or even erasure of legitimate image generations by or about the LGBTQIA+ community.
- Inappropriate or offensive content: The natural language and vision models in the Azure OpenAI service can produce other types of inappropriate or offensive content. Examples include the ability to generate text that is inappropriate in the context of the text or image prompt;, the ability to create images that potentially contain harmful artifacts such as hate symbols; images that illicit harmful connotations; images that relate to contested, controversial, or ideologically polarizing topics; images that are manipulative; images that contain sexually charged content that is not caught by sexual-related content filters; and images that relate to sensitive or emotionally charged topics. For example, a well-intentioned text prompt aimed to create an image of the New York skyline with clouds and airplanes flying over it might unintentionally generate images that illicit sentiments related to the events surrounding 9/11.
- Disinformation and misinformation about sensitive topics: Because DALL·E 2 and DALL·E 3 are powerful image generation models, they can be used to produce disinformation and misinformation that can be harmful. For example, a user could prompt the model to generate an image of a political leader engaging in activity of a violent or sexual (or simply inaccurate) nature that might lead to consequential harms, including but not limited to public protests, political change, or fake news.
- Information reliability: Language and vision model responses can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate with respect to external validation sources. Even when drawing responses from trusted source information, responses might misrepresent that content. Transcriptions or translations might result in inaccurate text.
- False information: Azure OpenAI does not fact-check or verify content that is provided by customers or users. Depending on how you have developed your application, it might produce false information unless you have built in mitigations (see Best practices for improving system performance).
In many AI systems, performance is often defined in relation to accuracy—that is, how often the AI system offers a correct prediction or output. With large-scale natural language models and vision models, two different users might look at the same output and have different opinions of how useful or relevant it is, which means that performance for these systems must be defined more flexibly. Here, we broadly consider performance to mean that the application performs as you and your users expect, including not generating harmful outputs.
Azure OpenAI service can support a wide range of applications like search, classification, code generation, image generation, and image understanding, each with different performance metrics and mitigation strategies. There are several steps you can take to mitigate some of the concerns listed under "Limitations" and to improve performance. Other important mitigation techniques are outlined in the section Evaluating and integrating Azure OpenAI for your use.
Best practices for improving system performance
- Show and tell when designing prompts. With natural language models and speech models, make it clear to the model what kind of outputs you expect through instructions, examples, or a combination of the two. If you want the model to rank a list of items in alphabetical order or to classify a paragraph by sentiment, show the model that is what you want.
- Prompts for the Whisper model in Azure OpenAI service can help improve model outputs. The following best practices will help you create prompts that best fit your scenario and needs.
- Consider including a prompt to instruct the model to correct specific words or acronyms that the model often misrecognizes in the audio.
- To preserve the context of a file that was split into segments, you might prompt the model with the transcript of the preceding segment. This prompt will make the transcript more accurate, because the model will use the relevant information from the previous audio. The model will only consider the final 224 tokens of the prompt and ignore anything earlier.
- The model might skip punctuation in the transcript. Consider using a simple prompt that instructs the model to include punctuation.
- The model might also leave out common filler words, for example, hmmm, umm, etc. in the audio. If you want to keep the filler words in your transcript, you might include a prompt that contains them.
- Some languages can be written in different ways, such as simplified or traditional Chinese. The model might not always use the writing style that a user wants for their transcript by default. Consider using a prompt to describe your preferred writing style.
- Prompts for the Whisper model in Azure OpenAI service can help improve model outputs. The following best practices will help you create prompts that best fit your scenario and needs.
- Keep your application on topic. Carefully structure prompts and image inputs to reduce the chance of producing undesired content, even if a user tries to use it for this purpose. For instance, you might indicate in your prompt that a chatbot only engages in conversations about mathematics and otherwise responds "I'm sorry. I'm afraid I can't answer that." Adding adjectives like "polite" and examples in your desired tone to your prompt can also help steer outputs. With DALL·E models, you might indicate in your prompt or image input that your application generates only conceptual images. It might otherwise generate a pop-up notification that explains that the application is not for photorealistic use or to portray reality. Consider nudging users toward acceptable queries and image inputs, either by listing such examples up front or by offering them as suggestions upon receiving an off-topic request. Consider training a classifier to determine whether an input (prompt or image) is on topic or off topic.
- Provide quality data. With text and code models, if you are trying to build a classifier or get the model to follow a pattern, make sure that there are enough examples. Be sure to proofread your examples—the model is usually capable of processing basic spelling mistakes and giving you a response, but it also might assume errors are intentional which could affect the response. Providing quality data also includes giving your model reliable data to draw responses from in chat and question answering systems.
- Measure model quality. As part of general model quality, consider measuring and improving fairness-related metrics and other metrics related to responsible AI in addition to traditional accuracy measures for your scenario. Consider resources like this checklist when you measure the fairness of the system. These measurements come with limitations, which you should acknowledge and communicate to stakeholders along with evaluation results.
- Limit the length, structure, and rate of inputs and outputs. Restricting the length or structure of inputs and outputs can increase the likelihood that the application will stay on task and mitigate, at least in part, any potentially unfair, unreliable, or offensive behaviour. Other options to reduce the risk of misuse include (i) restricting the source of inputs (for example, limiting inputs to a particular domain or to authenticated users rather than being open to anyone on the internet) and (ii) implementing usage rate limits.
- Encourage human review of outputs prior to publication or dissemination. With generative AI, there is potential for generating content that might be offensive or not related to the task at hand, even with mitigations in place. To ensure that the generated output meets the task of the user, consider building ways to remind users to review their outputs for quality prior to sharing widely. This practice can reduce many different harms, including offensive material, disinformation, and more.
- Implement additional scenario-specific mitigations. Refer to the mitigations outlined in Evaluating and integrating Azure OpenAI for your use including content moderation strategies. These recommendations do not represent every mitigation that might be required for your application, but they point to the general minimum baseline we check for when approving use cases for Azure OpenAI Service.
Evaluating and integrating Azure OpenAI natural language and vision models for your use
For additional information on how to evaluate and integrate these models responsibly, please see the RAI Overview document.
Learn more about responsible AI
- Microsoft AI principles
- Microsoft responsible AI resources
- Microsoft Azure Learning courses on responsible AI