FAQ for analytics

This article answers frequently asked questions about the AI capabilities used in analytics features in Copilot Studio.

How is generative AI used for analytics?

Copilot Studio uses AI to evaluate the quality of generative answers and to identify patterns in user queries through clustering. These clusters provide insights into agent performance.

Generative answers use knowledge sources you choose to generate a response. The feature also collects any feedback you provide. Analytics use large language models (LLMs) to classify the chat messages between users and agents into levels that indicate the quality of generative answer responses. These classifications are aggregated to provide a summary of agent performance.

Clustering uses LLMs to sort users' messages into groups based on shared subjects and provide each group with a descriptive name. Copilot Studio uses the names of these clusters to provide different types of insights you can use to improve your agent.

Quality of responses for generative answers

What is the intended use of quality of response?

Use quality of response analytics to understand agent performance and identify improvements. Currently, you can use analytics to understand if the quality of an agent's generative answers meets your expectations.

In addition to overall quality, quality of response analytics identifies areas where an agent performs poorly or fails to meet your intended goals. Identify where generative answers perform poorly and take steps to improve their quality.

When identifying poor performance, follow best practices that can help improve quality. For example, after identifying knowledge sources with poor performance, you can edit the knowledge source or split the knowledge source into multiple, more focused sources for increased quality.

What data is used to create analytics for quality of response?

Quality of response analytics are based on a sample of generative answer interactions. The analysis requires the user query, the agent response, and the relevant knowledge sources that the generative model uses for the generative answer. It uses that information to evaluate whether the generative answer quality is good and, if not, why the quality is poor. For example, quality of response analytics can identify incomplete, irrelevant, or not fully grounded responses.
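As a rough illustration of how per-response classifications can roll up into a summary, the following Python sketch aggregates a list of labels into percentages. The label names and the `summarize_quality` function are hypothetical and chosen only for this example; the article doesn't describe the actual implementation.

```python
from collections import Counter

def summarize_quality(labels):
    """Aggregate per-response quality labels into percentages.

    `labels` is a list of strings such as "good" or a reason for poor
    quality. The label names are assumptions made for this sketch.
    """
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Hypothetical sample of classified generative answers.
sample = [
    "good", "good", "incomplete", "good", "irrelevant",
    "good", "not_fully_grounded", "good",
]
summary = summarize_quality(sample)
print(summary)
```

Only the aggregated percentages are reported in this sketch, which mirrors the idea that results are meant to be read at the summary level rather than per response.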

What are the limitations of quality of response analytics, and how can users minimize the effects of these limitations?

  • Quality of response analytics don't use all generative responses. Instead, they measure a sample of user-agent sessions. Agents with fewer than the minimum number of successful generative answers can't receive a quality of response analytics summary.

  • In some cases, analytics don't evaluate an individual response accurately. At an aggregated level, results should be accurate in most cases.

  • Quality of response analytics don't provide a breakdown of the specific queries that led to low-quality performance. They also don't provide a breakdown of the common knowledge sources or topics that were used when low-quality responses occurred.

  • Analytics aren't calculated for answers that use generative knowledge.

  • Answer completeness is one of the metrics used to assess response quality. This metric measures how fully the response addresses the content in the retrieved document.

    If the system doesn't retrieve a relevant document with additional information for the question, it doesn't evaluate the completeness metric for that document.

What protections are in place for quality of response analytics within Copilot Studio for responsible AI?

Users of agents don't see analytics results. Results are available to agent makers and admins only.

Makers and admins can only use quality of response analytics to see the percentage of good quality responses and any predefined reasons for poor performance. Results are aggregated and presented as percentages and predefined categories.

We tested analytics for quality of responses thoroughly during development to ensure good performance. However, on rare occasions, quality of response assessments might be inaccurate.

Sentiment analysis for conversational sessions

What is the intended use of sentiment analysis?

Use sentiment analysis to understand the level of user satisfaction in conversation sessions based on an AI analysis of user messages to the agent. You can understand the overall sentiment of the session (positive, negative, or neutral), investigate the reasons, and take measures to address it.

What data is used for sentiment analysis?

Sentiment analysis uses user messages to the agent for a sample set of conversational sessions.

Sentiment analysis uses that information to evaluate whether user satisfaction during the session is positive, negative, or neutral. For example, a user might use words and a tone that indicate frustration or dissatisfaction with the interaction with the agent. In this case, the session is classified as negative sentiment.

What are the limitations of sentiment analysis, and how can users mitigate these limitations?

Sentiment analysis isn't calculated using all conversational sessions. Instead, it measures a sample of user-agent sessions.

Sentiment analysis currently depends on generative answers. Agents below a minimum number of daily successful generative answers can't receive a sentiment score.

To calculate sentiment for a session, there must be at least two user messages. Additionally, due to current technical constraints, sentiment analysis isn't performed on sessions that exceed a total of 26 messages (including both user and agent messages).
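The session-size rules above can be sketched as a small check. This is a minimal illustration only: the function name, the constants, and the `(sender, text)` message shape are assumptions for the example, not part of Copilot Studio.

```python
MIN_USER_MESSAGES = 2    # from the text: at least two user messages
MAX_TOTAL_MESSAGES = 26  # from the text: sessions exceeding 26 messages are skipped

def is_eligible_for_sentiment(messages):
    """Return True if a session meets the size rules described above.

    `messages` is a list of (sender, text) tuples, where sender is
    "user" or "agent". This data shape is hypothetical.
    """
    user_count = sum(1 for sender, _ in messages if sender == "user")
    return user_count >= MIN_USER_MESSAGES and len(messages) <= MAX_TOTAL_MESSAGES

# Hypothetical sessions.
short_session = [("user", "Hi"), ("agent", "Hello! How can I help?")]
normal_session = [
    ("user", "My order is late"),
    ("agent", "Sorry to hear that. What's the order number?"),
    ("user", "It's 12345"),
]
long_session = [("user", "ping"), ("agent", "pong")] * 14  # 28 messages

print(is_eligible_for_sentiment(short_session))
print(is_eligible_for_sentiment(normal_session))
print(is_eligible_for_sentiment(long_session))
```

The sketch covers only the per-session message counts. It doesn't model sampling or the minimum number of daily successful generative answers, because the article doesn't state those values.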

Sentiment analysis doesn't provide a breakdown of the specific user messages that led to the sentiment score.

What protections are in place for sentiment analysis within Copilot Studio for responsible AI?

Users of agents don't see analytics results. Results are available to agent makers and admins only.

You can only use sentiment analysis to see the breakdown of sentiment across all sessions.

We tested sentiment analysis thoroughly during development to ensure good performance. However, on rare occasions, sentiment assessments might be inaccurate.

Themes of user questions

What is the intended use of themes?

Clustering by themes and theme-level analysis help you quickly understand what users are asking about at scale. This feature analyzes large volumes of user queries and surfaces high-level topics ("themes") that represent the main subjects users care about. This analysis helps you move from inspecting individual conversations to identifying broader patterns, emerging needs, and areas of interest.

By providing a structured, data-driven overview of user activity, theme-level analysis helps you:

  • Identify the most common topics users engage with.

  • Detect gaps in coverage or unclear experiences.

  • Monitor how user interests evolve over time.

  • Prioritize improvements based on real user demand.

How does theme analysis work at a high level?

This feature operates as a multistage process that continuously organizes user queries into meaningful groups. At a high level, this process includes two key phases:

Theme candidate generation

The system analyzes a recent set of user queries and identifies candidate themes that represent distinct high-level topics. The system detects patterns, similarities, and recurring subjects across queries to derive these candidates.

Query attribution to themes

After the system generates candidate themes, it associates each query with the most relevant theme. Each theme represents a collection of related user questions and evolves as the system processes new queries. The system refines these themes over time by using signals such as semantic similarity and user feedback. This refinement process allows the representation to adapt as user behavior changes.
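The two phases can be pictured with a deliberately simplified Python sketch. It substitutes keyword frequency for the LLM-based analysis and semantic similarity that the feature actually relies on, so treat it as a toy stand-in rather than the real method. All function names, the stop-word list, and the sample queries are hypothetical.

```python
import re
from collections import Counter

# Small, hypothetical stop-word list for this toy example.
STOPWORDS = {"how", "do", "i", "my", "a", "the", "to", "is", "what", "can", "for", "of", "in"}

def tokens(query):
    """Lowercase the query and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"[a-z]+", query.lower()) if w not in STOPWORDS}

def generate_candidates(queries, max_themes=2):
    """Phase 1 (toy version): use the most frequent keywords as candidate themes."""
    counts = Counter(w for q in queries for w in tokens(q))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_themes]]

def attribute(queries, themes):
    """Phase 2 (toy version): assign each query to the first theme it mentions."""
    groups = {theme: [] for theme in themes}
    groups["unassigned"] = []
    for q in queries:
        match = next((t for t in themes if t in tokens(q)), "unassigned")
        groups[match].append(q)
    return groups

queries = [
    "How do I reset my password?",
    "I forgot my password",
    "Password is not accepted",
    "Where is my invoice?",
    "Download invoice as PDF",
    "Opening hours today",
]
themes = generate_candidates(queries)
groups = attribute(queries, themes)
print(themes)
print(groups)
```

Rerunning both functions on a newer window of queries is a crude analogue of how themes can shift as user behavior changes. The refinement signals mentioned above, such as user feedback, aren't modeled here.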

What data is used to create themes?

The system generates themes from user queries that result in generative answers. The process focuses on a recent window of activity to ensure that themes reflect current user interests and evolving trends. As new data becomes available, the system refreshes themes to keep them relevant.

Because Themes relies on patterns in user queries, the feature depends on having a meaningful amount of activity to analyze. In situations where there's limited data or highly fragmented queries, the system might not generate themes or might provide limited insight.

What are the limitations of theme analysis, and how can I mitigate them?

Theme analysis is a data-driven clustering system, and its effectiveness depends on the nature and volume of user queries. Some potential limitations include:

  • Insufficient or highly diverse data might lead to themes that are too broad or too narrow.

  • Closely related topics might sometimes be split into separate themes.

  • Unrelated queries might occasionally be grouped together.

  • Changes in user language over time might affect consistency of themes.

To get the most value from themes:

  • Regularly review generated themes.

  • Provide feedback (for example, thumbs up or down) to improve quality.

  • Interpret themes as directional insights rather than exact categorizations.

What responsible AI protections are in place?

Theme clustering and analysis are designed with responsible AI principles in mind.

  • Authorized makers and admins are the only ones who can see themes.

  • Only those authorized to see the user queries can see their breakdown into themes.

  • The themes reflect the content of the user queries, so they provide an honest summary for the makers and admins to see.

These safeguards help ensure that Themes provides useful insights while maintaining a safe and controlled experience.

Custom metrics analytics

What is the intended use of custom metrics?

Use custom metrics analytics to understand how much your conversational agents affect business outcomes. These metrics complement savings analytics. Examples of custom metrics include resolution rate, customer intent classification, and other domain‑specific outcomes.

Custom metrics can show where agents miss intended goals. Define what to measure, test metrics against real session data, and refine definitions based on the results.

What data is used to calculate custom metrics?

Custom metrics are calculated using a sample of past agent sessions. The calculation uses the conversational messages exchanged during a session.

The AI model classifies session data based on your metric definition. The system aggregates results across the sample to show overall metric performance for the selected time period.

What are the limitations of custom metrics and how can users minimize the effects of limitations?

Custom metrics don't use all agent sessions. Instead, they measure a sample of sessions from the selected time period. Because results are based on a sample, treat them as directional indicators rather than exact figures.
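To illustrate why sample-based results are directional, the following Python sketch estimates a rough 95% margin of error for a proportion measured on a sample. It assumes a simple random sample and uses the normal approximation. The article doesn't state the sampling method or any formula, and `margin_of_error` is a hypothetical helper.

```python
import math

def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation with z = 1.96 (about 95% confidence).
    Assumes a simple random sample, which is an assumption of this sketch.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical result: 60% of 50 sampled sessions fall into one category.
moe = margin_of_error(0.60, 50)
print(f"60% +/- {moe * 100:.1f} percentage points")
```

With a sample of 50 sessions, the uncertainty in this example is roughly 14 percentage points, so small differences between time periods might not be meaningful.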

When interpreting metrics, keep in mind that the calculation is based on the message transcript. Avoid drawing conclusions about behaviors that occur primarily outside messages, such as topics and tools.

The AI model might misclassify sessions. Aggregate results are generally accurate. Sessions that don't match a defined category are placed in the fallback (Other) category. If test results don't match expected outcomes, you can update the metric description and category definitions.
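The fallback behavior can be sketched as follows. This is a minimal illustration, assuming each sampled session has already been given a label by the model. The category names and the `aggregate_metric` function are hypothetical.

```python
from collections import Counter

FALLBACK = "Other"  # fallback category named in the text

def aggregate_metric(session_labels, categories):
    """Return the percentage of sessions per category.

    Labels that don't match a defined category are counted under the
    fallback category.
    """
    counts = Counter({category: 0 for category in categories})
    counts[FALLBACK] = 0
    for label in session_labels:
        counts[label if label in categories else FALLBACK] += 1
    total = len(session_labels)
    return {c: (round(100 * n / total, 1) if total else 0.0) for c, n in counts.items()}

# Hypothetical metric with two defined categories.
categories = ["Resolved", "Escalated"]
labels = ["Resolved", "Resolved", "Escalated", "Abandoned", "Resolved"]
result = aggregate_metric(labels, categories)
print(result)
```

A large share in the fallback category in a sketch like this would suggest that the category definitions don't cover what the sessions contain, which is one signal for revising the metric description.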

If you significantly change an agent's instructions or configuration after defining a metric, the metric might no longer accurately reflect the agent's updated behavior. Review your custom metrics after making substantive changes to the agent.

What protections are in place for custom metrics within Copilot Studio for responsible AI?

Agent makers and admins are the only ones who can access custom metrics results. Users of the agent don't have access to analytics results.

You review and approve all custom metrics before saving. During metric definition, you test metrics against sample session data and review individual results and model reasoning. If results don't meet expectations, you can update or discard the metric. Metrics aren't applied without your explicit confirmation.

The AI-generated prompt used to classify sessions is visible to you in the UI, so you can understand how the model interprets your metric definition. You can edit or remove custom metrics at any time.

On rare occasions, individual session classifications might be inaccurate. Results should be interpreted in aggregate rather than at the individual session level.