Learn about AI analysis in Data Security Investigations (preview)

2025-04-26

Important

Data Security Investigations uses generative artificial intelligence (AI), large language models, and orchestration in the analysis of data in your organization. Results generated by AI might not always be accurate or complete. While we strive to provide reliable and helpful information, AI systems can produce incorrect or false results. It is important to verify the information and use it with caution. Microsoft makes no warranties, express, implied, or statutory, as to the information provided by AI systems.

Data Security Investigations (preview) uses AI services and tools to help you quickly review and take action on items associated with security incidents. AI-related services include the following tools:

Vector search
Categorization
Examination

Vector search

Vector search in Data Security Investigations (preview) gives you a way to contextually search through data that you add to the investigation scope using advanced orchestration and embeddings. Vector search is a search engine technology that focuses on understanding the meaning and context behind words and phrases in a query, rather than just matching keywords.

Some key aspects of vector search are:

Contextual understanding: Vector search interprets the context of your search terms, considering factors like your organization, search history, and the overall meaning of the query.
Intent recognition: Vector search works to understand your intent, whether you're looking for information, trying to take an action, or seeking a specific type of content associated with a search.
Relevance and accuracy: By focusing on the semantics (the meaning and intent of words in your query), vector search provides more accurate and relevant results and improves the overall search experience.

When investigators in your organization investigate compromised data sets, vector search in Data Security Investigations (preview) can significantly enhance your investigation by addressing several key challenges:

Identifying relevant information: Vector search understands the context and intent behind your queries. This focus helps you quickly locate relevant documents, emails, or records, even if they don’t contain the exact keywords you used.
Handling ambiguity: Vector search disambiguates terms that have multiple meanings, ensuring that you get results that are contextually appropriate to your investigation.
Reducing noise: Vector search filters out irrelevant information, allowing you to focus on the most pertinent data and reducing the time spent sifting through unrelated results.
Improving efficiency: Vector search streamlines the search process, making your investigation more efficient and effective by quickly surfacing the most relevant information.

How it works

After you create an investigation, define the scope, and prepare data for AI, you can run vector searches over the data set. Previous steps of the process allow for simple keyword, metadata, and date range searches, but vector search uses AI embeddings to contextually search through data. This process allows investigators to find items without knowing their exact content.

Vector search works by first running all scoped data in an investigation through an AI embeddings model. This model breaks every item in your data set into smaller parts and extracts the semantic meaning of each part as a set of dimension values. This process is called embedding, and it allows Data Security Investigations (preview) to understand your data contextually. A queryable semantic search index is then built from these values.
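The indexing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: the function names (`chunk_words`, `toy_embed`, `build_index`), the chunk size, and the 16-dimension vectors are all assumptions made for the example, and `toy_embed` is a hashing stand-in that captures no semantic meaning. A real embeddings model would take its place.

```python
import hashlib
import math


def chunk_words(text, size=8, overlap=2):
    """Split text into overlapping word windows (the 'smaller parts')."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dims=16):
    """Hypothetical stand-in for an embeddings model.

    Hashes each word into one of `dims` buckets and unit-normalizes the
    result. It only shows the shape of the data; it has no semantics.
    """
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(items):
    """Embed every chunk of every scoped item into a flat, queryable list."""
    index = []
    for item_id, text in items.items():
        for chunk in chunk_words(text):
            index.append(
                {"item_id": item_id, "chunk": chunk, "vector": toy_embed(chunk)}
            )
    return index
```

Each index entry keeps the originating item ID next to its vector, so a match on any chunk can be traced back to the full item in the investigation scope.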

When you create a vector search query in an investigation, AI automatically expands and broadens your query and runs it through the semantic search index. Data Security Investigations (preview) then matches the semantic meaning of your query with the semantic meaning of your content and returns all contextually relevant items.

For example, if you search for "Confidential data included in the Contoso Security project", the vector search engine understands that you're looking for confidential data in this specific project rather than simply matching keywords (confidential, data, Contoso, etc.) contained in the search query. Using vector search, you can query impacted data to find all items related to a particular subject, even if keywords are missing.
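To make the matching step concrete, the following Python sketch ranks items by cosine similarity between a query vector and item vectors. Cosine similarity is a common choice for comparing embeddings, but this article doesn't state which measure the service uses, so treat it as an assumption. The three-dimensional vectors, the item names, and the `rank_items` helper are hypothetical and hand-picked for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_items(query_vector, index, top_k=2):
    """Return the IDs of the top_k items most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), item_id)
        for item_id, vector in index.items()
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical embeddings. The axes loosely stand for
# (confidentiality, Contoso Security project, social chatter).
INDEX = {
    "restricted-design-notes": [0.9, 0.8, 0.0],  # never uses the word "confidential"
    "contoso-lunch-menu": [0.0, 0.3, 0.9],       # mentions Contoso, but is off-topic
    "confidential-hr-memo": [0.9, 0.1, 0.1],     # confidential, but a different project
}
QUERY = [0.8, 0.9, 0.0]  # "Confidential data included in the Contoso Security project"

top = rank_items(QUERY, INDEX)
```

In this toy data, `restricted-design-notes` ranks first even though it shares no keyword with the query, while `contoso-lunch-menu` ranks last despite containing the word "Contoso".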

For more information on vector search concepts, see the concepts section in the Vectors in Azure AI Search article.

Categorization

When your organization is breached and the impacted data is identified, investigators need to prioritize data so they can begin identifying security risks. Categories in Data Security Investigations (preview) remove the need to manually assign categories to items in large and complex investigation scopes.

You can use AI-powered categorization in Data Security Investigations (preview) to more quickly reason through and prioritize potentially impacted data. To categorize data, you can select all or some default category options, use AI-suggested categories based on your investigation, or create your own custom categories.

The AI-generated categories are enriched with additional information for subject-level content in scope:

Name: The name of the category/area based on the content
Summary: A short description of the underlying content

Within each category, you can use vector search and examination tools on any content.

Default categories

Data Security Investigations (preview) includes default categories to categorize items in your investigation scope. When running categorization, you can select all default categories or only the default categories applicable to the scope of your review. Unselected default categories are ignored in the analysis and results for these categories aren't available when reviewing items.

The initial default categories determined by AI processing for content items are:

Business information: General business information. This category typically contains a large number of items. Some example areas in this category might include digital engagement and analysis, user and human resources, routine administrative communication, customer engagement/experience, and more.
Communication records: General communication information. This category also typically contains a large number of items. Users can use this category to see their investigations based on areas of communications. Some example areas in this category might include client complaints, holiday greetings, internal memos, project updates, and more.
Credentials and access information: Focuses on information related to accessing assets in investigations. This information helps identify potentially risky data and communications in your organization. Some example areas in this category might include user credentials, unauthorized database access, data exposure, and more.
Customer information: Focuses on information shared with customers. This category can be used to understand what customer data might be at risk. Some example areas in this category might include payment confirmations, customer experience improvement, delivery information, and more.
User information: Focuses on information related to users in your organization. This category also typically contains a large number of items. Some example areas in this category might include user employment information, user retention strategies, specialized group memberships, and more.
Financial information: Focuses on financial information in an investigation. Some example areas in this category might include financial planning, grant opportunities, budgets, financial statements, and more.
Health information: Focuses on health and medical-related items in an investigation. Some example areas in this category might include wellness and health records, COVID-19 safety protocol updates, health claims and incident reports, and more.
Incident and investigation information: Focuses on items about incidents and investigations in an investigation. This category includes security incidents and investigations within your organization. Some example areas in this category might include data breach, health records incidents, high-risk client account monitoring, and more.
Intellectual property: Focuses on intellectual property (IP) data in an investigation. Some example areas in this category might include future patent applications, research and development work, experiment result metrics, and more.
Marketing information: Focuses on marketing data in an investigation. Some example areas in this category might include press releases, advertising campaigns, marketing and sales plans or strategies, and more.
Operational information: Focuses on your organization’s operational data. Some example areas in this category might include logistics, shipping, inventory, compliance, tax records, and more.
Personally identifiable information: Focuses on group personal data and related items in an investigation. Some example areas in this category might include event guest lists, staff and training sessions, employee personal information, and more.
Regulated data: Focuses on regulated data in an investigation. Some example areas in this category might include regulatory, data protection, regulatory records, and more.

Suggested categories

Data Security Investigations (preview) also provides AI-generated suggested categories based on the content analyzed in your investigation scope. These suggested categories are automatically created to help investigators review items grouped in unexpected or unknown areas. Depending on the type of content included, the suggested categories vary.

If the content analyzed is primarily focused on a specific subject area outside of the default category areas, the suggested categories are customized to that specific content area. For example, if the analyzed content is focused on a highly confidential subject with terms and concepts specific solely to your organization, the suggested categories are automatically created for these areas. These categories are unique to your organization and the content analyzed.

Custom categories

Data Security Investigations (preview) allows you to manually create custom categories for the generative AI process to use when analyzing the content. By defining categories most applicable to your investigation needs, you can save time and let the AI process automatically categorize items based on these custom categories.

Custom categories can be words or phrases that capture the specific nature of the content of interest in the investigation. For example, custom categories might include Security vulnerability, Bug fix, specific project code names, or custom intellectual property like R&D related to a specific medicine or drug candidate.

Examination

As you identify items that require a deeper analysis, Data Security Investigations (preview) provides AI-based examination capabilities to help you focus on key security and sensitive data risks.

Credentials: Use this examination focus area to scan and extract credentials from all selected items in an investigation scope. This information gives investigators a quick way to understand which accounts and credentials are associated with a security incident and might have been exfiltrated.
Risk: Use this examination focus area to score all the risk areas in selected files to help investigators focus and prioritize investigations. This tool provides the overall risk for each item, whether the item is privileged content, and other specific risks for the item.

Types of risk areas include:
- Asset identifiers
- Credentials and secrets
- Evidence of threat actor discussions or breach discussions
- Urgent security incidents
- Vulnerability and security hygiene
- Personal and sensitive content
- Network and access information
- Policy compliance and data protection
- Infrastructure information
- Customer information
- Government information
- Privileged information
- Trade secrets

Mitigate: Use this examination focus area to score the risk for selected files and enable Data Security Investigations (preview) to provide you with mitigation instructions for what to do next. Selected files get a risk score, risk summary, and detailed mitigation recommendations to prevent more harm from a content breach.
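As a rough illustration of how investigators might triage examination output, the following Python sketch models a per-item result and builds a review queue. Everything here is hypothetical: the `ExaminationResult` fields, the 0-100 score scale, the threshold of 50, and the file names are assumptions for the example, and this article doesn't describe the actual result format or any export API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExaminationResult:
    """Hypothetical shape of one item's Risk examination result."""
    item_name: str
    risk_score: int          # assumed 0-100 scale; the real scale isn't documented here
    privileged: bool         # whether the item was flagged as privileged content
    risk_areas: List[str] = field(default_factory=list)


def prioritize(results, min_score=50):
    """Keep privileged or high-scoring items, highest risk score first."""
    flagged = [r for r in results if r.privileged or r.risk_score >= min_score]
    return sorted(flagged, key=lambda r: r.risk_score, reverse=True)


results = [
    ExaminationResult("vpn-config.txt", 85, False, ["Network and access information"]),
    ExaminationResult("team-offsite.docx", 10, False),
    ExaminationResult("legal-advice.msg", 40, True, ["Privileged information"]),
    ExaminationResult("service-passwords.xlsx", 95, False, ["Credentials and secrets"]),
]

review_queue = [r.item_name for r in prioritize(results)]
```

Keeping privileged items in the queue regardless of score is a design choice for this sketch; it reflects that the Risk focus area reports privilege separately from the overall risk.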

AI analysis recommendations

The following table outlines recommendations, example scenarios, and best practices when using the AI analysis tools in Data Security Investigations (preview).

Recommendations	Vector search	Categorization	Examination
When to use	Look for examples of specific items within a vectorized data set (invoices, bug fixes, etc.) to confirm and further investigation hypotheses. Use vector search for quick, interactive analysis because results populate quickly.	Quickly sort large amounts of data into default, custom, or AI-generated categories to prioritize investigation focus by sensitivity and severity. Depending on the size of the data set, categorization might take some time to complete.	Use examination for targeted analysis at the item level in a scoped data set. It helps extract insights from a confirmed data asset for next steps and identify items for mitigation.
Example scenario	Evaluation of potentially fraudulent activity.	Prioritization of items for analysis after a large breach.	Extraction of credentials from a validated data set and recommended mitigation steps.
Best practices	Search across all vectorized content for items of interest to generate more meaningful AI-suggested categories.	Select one or multiple categories and use vector search to search within the category. Review AI-generated areas within each category to understand specific content within the data set.	Use examination to drill into specific items with high sensitivity to get individual scores and results.