The error "Something went wrong" during evaluation typically indicates a generic failure in the underlying agent skills or evaluation pipeline rather than a problem with the test cases themselves. In the current tooling, these failures often surface as high-level messages instead of detailed diagnostics.
For Copilot Studio–based agents and evaluations, there are two main categories of issues relevant to this scenario:
- Agent-side or skill-side failures
Security and Copilot agents can fail when their underlying skills or identities encounter issues. In those cases, the platform surfaces generic errors like:
- ProcessSkillEvaluationFailed – the agent’s underlying skill had an issue during evaluation.
- SecurityCopilotEvaluationError – an error occurred in a child evaluation.
- Identity-related errors such as ForbiddenIdentityConfiguration, UnauthorizedIdentityConfiguration, InvalidIdentityConfiguration, or MissingIdentityConfiguration when the identity used by the agent cannot obtain or use tokens correctly.
These are summarized to the user as a generic failure (for example, “Something went wrong”) instead of exposing detailed internal diagnostics. The documented guidance is:
- For skill/evaluation errors (ProcessSkillEvaluationFailed, SecurityCopilotEvaluationError): contact the agent owner.
- For identity configuration issues: verify the identity configuration in the agent, ensure the identity has all permissions listed in the manifest, and coordinate with the sign-in logs administrator to see if access is being blocked.
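As a rough illustration, the guidance above can be written as a small lookup from error code to next step. The error code names come from the text above. The function name, the groupings, and the fallback message are hypothetical, and none of this is a platform API.

```python
# Illustrative only: the codes are those named in the answer; the mapping
# and the function are assumptions made for this sketch, not a platform API.
SKILL_ERRORS = {
    "ProcessSkillEvaluationFailed",
    "SecurityCopilotEvaluationError",
}
IDENTITY_ERRORS = {
    "ForbiddenIdentityConfiguration",
    "UnauthorizedIdentityConfiguration",
    "InvalidIdentityConfiguration",
    "MissingIdentityConfiguration",
}


def recommended_action(error_code: str) -> str:
    """Return the guidance given above for a known error code."""
    if error_code in SKILL_ERRORS:
        return "Contact the agent owner."
    if error_code in IDENTITY_ERRORS:
        return (
            "Verify the identity configuration in the agent, ensure the identity "
            "has all permissions listed in the manifest, and check the sign-in "
            "logs with the administrator."
        )
    # Fallback wording is an assumption; the answer lists no guidance for other codes.
    return "No specific guidance listed; follow the evaluation triage path."


print(recommended_action("MissingIdentityConfiguration"))
```

If no error code is surfaced at all, only the generic message, the fallback branch applies and the evaluation triage path is the place to start.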
- Evaluation setup or environment issues
Even when the agent is created from a Copilot prompt, evaluation can fail if:
- The evaluation setup is misconfigured (for example, grader or expected values are incorrect, or the evaluation method is not appropriate).
- The agent itself is not fully functional (for example, unpublished or misconfigured), so every evaluation test case fails.
The recommended triage path for agent evaluation failures is:
- Verify the evaluation setup first:
- Manually compare the agent’s actual response with the expected value and the grading method.
- Confirm that the expected answers are current and accurate, the test cases are realistic, and the evaluation method (for example, compare meaning vs. text similarity) is appropriate.
- If the agent’s responses are acceptable but still fail, the grader or expected values likely need adjustment.
- If the evaluation setup is valid, diagnose the agent:
- Ensure the agent is published and functioning. Unpublished agents can behave differently (for example, limited data processing or incomplete behavior) and cause evaluation failures.
- Test the agent directly in Copilot Studio’s test pane to confirm it can respond correctly outside of evaluation.
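The triage order above (evaluation setup first, then the agent) can be sketched as a simple decision function. Every name here is hypothetical, and the boolean inputs stand for checks you perform manually in Copilot Studio; nothing in this sketch calls the product.

```python
from dataclasses import dataclass
from enum import Enum


class NextStep(Enum):
    FIX_EVALUATION_SETUP = "Correct the expected values or the evaluation method"
    ADJUST_GRADER = "Adjust the grader or expected values"
    PUBLISH_AGENT = "Publish the agent and rerun the evaluation"
    DEBUG_AGENT = "Fix the agent until it responds correctly in the test pane"
    INSPECT_ERROR_CODES = "Inspect error codes and identity configuration"


@dataclass
class TriageChecks:
    # Each field records the result of a manual check; the names are assumptions.
    expected_values_current: bool   # expected answers are current and accurate
    method_appropriate: bool        # e.g. compare meaning vs. text similarity
    response_acceptable: bool       # actual response looks right to a human
    agent_published: bool
    works_in_test_pane: bool


def triage(c: TriageChecks) -> NextStep:
    # Step 1: verify the evaluation setup before touching the agent.
    if not (c.expected_values_current and c.method_appropriate):
        return NextStep.FIX_EVALUATION_SETUP
    if c.response_acceptable:
        # Acceptable responses that still fail point at the grader or expected values.
        return NextStep.ADJUST_GRADER
    # Step 2: the setup is valid, so diagnose the agent itself.
    if not c.agent_published:
        return NextStep.PUBLISH_AGENT
    if not c.works_in_test_pane:
        return NextStep.DEBUG_AGENT
    return NextStep.INSPECT_ERROR_CODES


checks = TriageChecks(True, True, False, False, False)
print(triage(checks).value)
```

The ordering is the point of the sketch: an unpublished agent is only considered once the evaluation setup has been ruled out.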
Regarding why the message is so generic:
The platform intentionally abstracts many internal error details into a small set of user-facing messages (for example, “Something went wrong”, “Try again later”) and, where applicable, structured error codes (like the identity and skill errors above). The expectation is that administrators or agent makers use:
- Error codes surfaced in the test pane or logs.
- Sign-in logs and service logs for identity-related issues.
- Evaluation triage guidance (verify evaluation setup, then agent behavior) to narrow down the root cause.
To move forward in this scenario:
- Confirm the agent is published and functioning by testing it directly in Copilot Studio.
- If evaluation still shows generic failures, inspect any error codes or identity-related messages associated with the agent (for example, Forbidden/Unauthorized/Invalid/Missing identity configuration) and correct the identity configuration as documented.
- If the agent’s responses look correct but evaluations still fail, adjust the evaluation setup following the evaluation triage guidance.