Trainable classifier does not identify source code files.

Luiz Otavio | HSBS 20 Reputation points
2023-06-19T17:25:48.81+00:00

Microsoft Purview has a data classifier called Source Code that searches for source code in files.

LAB SETUP
I created an auto-labeling rule (AIP) so that when content matches the Source Code classifier, the file is labeled as confidential. Initially I left the rule in simulation mode to evaluate it. I created several files in a wide variety of programming languages and stored them in OneDrive. The files have source code extensions (e.g., .py, .java, .cs, .c, .cpp).

PROBLEM
None of the files were identified by the classifier, even though all of them are code files. As a second test, I copied the code into an MS Word file to check whether it would be identified, but that also failed. Finally, I uploaded a file to the classifier's own test page, and it still was not detected. The path used was compliance.microsoft.com > dataclassificationclassifiers > classifiers > Source Code.

QUESTIONS
Is there a minimum number of files required for the classifier to detect anything?
Is there a minimum number of lines of code a file must contain to be identified?
Does the classifier only work on MS Office file types, or can it also identify code in files with other extensions?
How can I make this classifier work?

Azure Information Protection

Accepted answer
  1. Shweta Mathur 29,341 Reputation points Microsoft Employee
    2023-06-21T13:27:16.4966667+00:00

    Hi @Luiz Otavio | HSBS ,

    Thanks for reaching out

    I understand you are trying to apply the ready-to-use trainable classifier "Source code" to files that contain source code, but Microsoft Purview is not identifying those files or labeling them.

    I reproduced the scenario and tested it in my lab.

    • I used a label that already existed; any new label takes about 24 hours to be reflected.
    • I created a custom policy that uses the ready-to-use trainable classifier "Source code".

    User's image

    I tested the source code file under matched items, and the result showed a match for "Source code".

    User's image

    I then uploaded the source code to OneDrive (with the code inside a Word file), and the label was applied automatically.

    User's image

    Code.docx

    User's image

    Keep this limitation in mind: "Source code is trained to detect when the bulk of the text is source code. It does not detect source code text that is interspersed with plain text."

    It is possible that your auto-labeling policy is not configured properly, or that the code snippet you are using is not being detected as source code.
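
As a rough sanity check before uploading test files, you could estimate how much of a file actually looks like code. The sketch below is only an illustration of the "bulk of the text" idea: the patterns, the `code_ratio` and `is_mostly_code` helpers, and the 0.8 threshold are all assumptions made up for this example, and they do not reflect how the Purview classifier works internally.

```python
import re

# Hypothetical hints that a line "looks like code". These are assumptions for
# illustration only and are unrelated to Purview's actual trained model.
CODE_HINTS = [
    re.compile(r"[;{}]\s*$"),                # line ends with ; { or }
    re.compile(r"^\s*(def|class|import|from|return|public|private|static|void|using)\b"),
    re.compile(r"^\s*#include\b"),           # C/C++ include directive
    re.compile(r"^\s*(//|/\*|\*/)"),         # C-style comment markers
    re.compile(r"\w+\(.*\)\s*:?\s*$"),       # something like name(...) at end of line
]


def looks_like_code(line: str) -> bool:
    """Return True if any hint pattern matches the line."""
    return any(pattern.search(line) for pattern in CODE_HINTS)


def code_ratio(text: str) -> float:
    """Fraction of non-blank lines that look like code (0.0 for empty text)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(looks_like_code(ln) for ln in lines) / len(lines)


def is_mostly_code(text: str, threshold: float = 0.8) -> bool:
    """Arbitrary cutoff (assumed) for 'the bulk of the text is source code'."""
    return code_ratio(text) >= threshold


if __name__ == "__main__":
    mostly_code = "import os\n\ndef main():\n    print(os.getcwd())\n\nmain()\n"
    mixed = (
        "Here is a short note about our project.\n"
        "Please read it carefully before the meeting.\n"
        "print('hi')\n"
        "Thanks for your time.\n"
    )
    print(code_ratio(mostly_code), is_mostly_code(mostly_code))
    print(code_ratio(mixed), is_mostly_code(mixed))
```

A file that scores low on a check like this (for example, a short snippet surrounded by explanatory text) is a plausible candidate for not being matched, so it may be worth retesting with a file that contains almost nothing but code.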

    Source code is identified as shown below:

    User's image

    Hope this will help.

    Thanks,

    Shweta


    Please remember to "Accept Answer" if the answer helped you.

