Detect file

Peter Volz 1,295 Reputation points
2023-06-10T16:52:44.6066667+00:00

Hello,

I just removed my code that called FindMimeFromData (imported from urlmon.dll) and replaced it with simple code that reads the first 32 bytes of a file and matches them against known magic numbers.

FindMimeFromData has a very limited list of types and is based on ancient IE behavior. I also had to stop using System.Runtime.InteropServices.

Still, FindMimeFromData was good at detecting .txt, .html and .css files.

These 3 types don't have magic numbers, am I right? If so, how to detect them? :(

Thanks for helping out, appreciated...

Developer technologies VB
Developer technologies C#

2 answers

  1. P a u l 10,761 Reputation points
    2023-06-11T14:06:21.0766667+00:00

    Correct: .txt, .html and .css are extensions for plain-text files, not distinct file formats. These extensions exist to give the program interacting with the file a hint about how to handle it. For example, if you open a css file in VS Code, the editor uses that extension to help establish the language for syntax highlighting. I'd imagine you'll need to do the same thing.

    File formats like binary image formats (PNG/JPEG) typically have a header that begins with a few signature bytes. The program interacting with them likely doesn't even consider the extension when processing the actual file data. For example, you could binary concatenate two files together (text file at the top & PNG at the bottom) and depending on the image viewer you're using it's likely that the viewer will ignore the text file and just begin reading when it identifies an image signature it understands, or if by chance the text file at the top contains those signature bytes you'll end up with a corrupted image.
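
    As a rough illustration of the signature check described above, here is a minimal sketch. The class name `MagicNumberSniffer`, its method names and the short signature table are made up for this example (they are not from the question's code), and a real table would need many more entries.

    ```csharp
    using System;
    using System.IO;

    // Hypothetical helper for illustration only.
    public static class MagicNumberSniffer
    {
        // A few well-known signatures; extend the table as needed.
        private static readonly (byte[] Signature, string Mime)[] Signatures =
        {
            (new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }, "image/png"),
            (new byte[] { 0xFF, 0xD8, 0xFF }, "image/jpeg"),
            (new byte[] { 0x47, 0x49, 0x46, 0x38 }, "image/gif"),             // "GIF8"
            (new byte[] { 0x25, 0x50, 0x44, 0x46, 0x2D }, "application/pdf"), // "%PDF-"
        };

        // Reads up to 'count' bytes from the start of the file.
        public static byte[] ReadHeader(string path, int count)
        {
            byte[] buffer = new byte[count];
            using (FileStream stream = File.OpenRead(path))
            {
                int total = 0;
                int read;
                // Read may return fewer bytes than requested, so loop until done.
                while (total < count && (read = stream.Read(buffer, total, count - total)) > 0)
                {
                    total += read;
                }
                Array.Resize(ref buffer, total);
            }
            return buffer;
        }

        // Returns true and sets 'mime' when the header starts with a known signature.
        // Otherwise returns false and sets 'mime' to a generic binary type.
        public static bool TryMatch(byte[] header, out string mime)
        {
            foreach (var (signature, candidate) in Signatures)
            {
                if (header.Length < signature.Length) continue;

                bool matches = true;
                for (int i = 0; i < signature.Length; i++)
                {
                    if (header[i] != signature[i]) { matches = false; break; }
                }
                if (matches)
                {
                    mime = candidate;
                    return true;
                }
            }
            mime = "application/octet-stream";
            return false;
        }
    }
    ```

    Usage would be along the lines of `MagicNumberSniffer.TryMatch(MagicNumberSniffer.ReadHeader(path, 32), out string mime)`. A txt, html or css file simply falls through to the `false` branch, which is where the extension or a content heuristic has to take over.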

    Text files can carry a character-encoding signature, however. For example, if you save a txt file in Notepad as UTF-8 with BOM (byte order mark) or as UTF-16, the resulting file will include some signature bytes at the very start of the file (here is a table of these bytes for the Unicode encodings: https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding).
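
    To make the BOM idea concrete, here is a small sketch that checks the first bytes of a file for the common Unicode byte order marks. The class `BomSniffer` and the labels it returns are assumptions made for this example. Note that a BOM only tells you the file is text in a given encoding; it says nothing about whether that text is HTML, CSS or something else, and many text files have no BOM at all.

    ```csharp
    using System;

    // Hypothetical helper for illustration only.
    public static class BomSniffer
    {
        // Returns a label for the BOM found at the start of 'header', or "none".
        public static string DetectBom(byte[] header)
        {
            int n = header.Length;

            // UTF-32 LE must be tested before UTF-16 LE, because FF FE 00 00
            // also starts with the UTF-16 LE mark FF FE. (A UTF-16 LE file whose
            // first character is U+0000 would be misread here; that is rare.)
            if (n >= 4 && header[0] == 0xFF && header[1] == 0xFE && header[2] == 0x00 && header[3] == 0x00)
                return "UTF-32LE";
            if (n >= 4 && header[0] == 0x00 && header[1] == 0x00 && header[2] == 0xFE && header[3] == 0xFF)
                return "UTF-32BE";
            if (n >= 3 && header[0] == 0xEF && header[1] == 0xBB && header[2] == 0xBF)
                return "UTF-8";
            if (n >= 2 && header[0] == 0xFF && header[1] == 0xFE)
                return "UTF-16LE";
            if (n >= 2 && header[0] == 0xFE && header[1] == 0xFF)
                return "UTF-16BE";

            return "none";
        }
    }
    ```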

    TLDR: Unless you're prepared to analyse the syntax in the files (using txt as a fallback) then you only really have the file extension to go off.

  2. Hui Liu-MSFT 48,676 Reputation points Microsoft External Staff
    2023-06-12T02:52:20.2366667+00:00

    Hi, @Peter Volz. Welcome to Microsoft Q&A.

    You are correct that plain text files (.txt), HTML files (.html), and CSS files (.css) do not have specific magic numbers or headers that can be used to identify them. These file types primarily consist of plain text content without any specific binary structure.

    To detect these file types, you could use alternative methods based on their file extensions or content analysis.

    File Extension: Check the file extension of the file to determine its type. For example, if the file has a ".txt" extension, you could assume it is a plain text file. Similarly, if it has a ".html" extension, you can assume it is an HTML file, and if it has a ".css" extension, you can assume it is a CSS file. This method is simple but relies on the file extensions being accurate.

    Content Analysis: Read the content of the file and analyze it to make an educated guess about its type. For plain text, you could check that the content contains no non-printable (binary) characters and no HTML or CSS markup. For HTML files, you can look for specific HTML tags like "<html>", "<head>", or "<body>". For CSS files, you can check for CSS-specific syntax or common CSS properties. Content analysis can help in cases where the file extension is missing or incorrect, but it is not foolproof.

    Third-Party Libraries: Another option is to use third-party libraries or frameworks that provide more advanced file type detection capabilities. These libraries often have extensive databases or algorithms to identify file types based on content analysis.

    // Requires: using System.IO; using System.Text;
    public static string GetFileType(string filePath)
    {
        // 1. Trust the file extension when it is one we recognize.
        //    A null or empty extension simply matches no case below.
        string extension = Path.GetExtension(filePath)?.ToLowerInvariant();
        switch (extension)
        {
            case ".txt":
                return "text/plain";
            case ".html":
            case ".htm":
                return "text/html";
            case ".css":
                return "text/css";
        }

        // 2. Otherwise, inspect the start of the file.
        //    32 bytes is very little to recognize HTML or CSS; a larger buffer may work better.
        byte[] buffer = new byte[32];
        using (FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
        {
            int bytesRead = fileStream.Read(buffer, 0, buffer.Length);
            if (bytesRead > 0)
            {
                string content = Encoding.ASCII.GetString(buffer, 0, bytesRead);

                // Test the most specific types first. HTML and CSS are themselves
                // plain text, so the plain-text check has to come last.
                if (IsHtml(content))
                {
                    return "text/html";
                }
                else if (IsCss(content))
                {
                    return "text/css";
                }
                else if (IsPlainText(content))
                {
                    return "text/plain";
                }
            }
        }

        // Unable to determine the file type
        return "application/octet-stream";
    }
    
    // Example content analysis checks
    private static bool IsPlainText(string content)
    {
        // You can define your own logic to check for plain text file content
        // For example, check if the content contains non-printable characters or specific patterns
        // Return true if the content is determined to be plain text, otherwise false
        return true;
    }
    
    private static bool IsHtml(string content)
    {
        // You can define your own logic to check for HTML file content
        // For example, check if the content contains HTML tags or specific HTML elements
        // Return true if the content is determined to be HTML, otherwise false
        return false;
    }
    
    private static bool IsCss(string content)
    {
        // You can define your own logic to check for CSS file content
        // For example, check if the content contains CSS selectors or specific CSS properties
        // Return true if the content is determined to be CSS, otherwise false
        return false;
    }
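
    The three checks above are left as placeholders. One possible way to fill them in is sketched below. The class name `TextContentHeuristics`, the `Classify` method and both regular expressions are assumptions chosen for this example; they are rough heuristics and will misclassify some inputs (for instance, other brace-based languages can look like CSS).

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical heuristics for illustration only.
    public static class TextContentHeuristics
    {
        // A doctype or one of a few very common opening tags.
        private static readonly Regex HtmlMarker = new Regex(
            @"<!doctype\s+html|<(html|head|body|title|meta|div|script)\b",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        // Either a leading at-rule, or something shaped like "selector { property: value".
        private static readonly Regex CssMarker = new Regex(
            @"^\s*@(charset|import|media|font-face)\b|[^{}<>]+\{\s*[a-zA-Z-]+\s*:[^{}]+",
            RegexOptions.CultureInvariant);

        public static bool IsHtml(string content) => HtmlMarker.IsMatch(content);

        public static bool IsCss(string content) => !IsHtml(content) && CssMarker.IsMatch(content);

        // Treats content as text when it is non-empty and has no control
        // characters other than tab, carriage return and line feed.
        public static bool IsPlainText(string content)
        {
            foreach (char c in content)
            {
                if (char.IsControl(c) && c != '\t' && c != '\r' && c != '\n')
                {
                    return false;
                }
            }
            return content.Length > 0;
        }

        // Most specific type first, plain text as the fallback.
        public static string Classify(string content)
        {
            if (!IsPlainText(content)) return "application/octet-stream";
            if (IsHtml(content)) return "text/html";
            if (IsCss(content)) return "text/css";
            return "text/plain";
        }
    }
    ```

    The order in `Classify` matters: any HTML or CSS sample also passes the plain-text check, so that check is only useful as a gate against binary data and as the final fallback.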
    
    

    If the response is helpful, please click "Accept Answer" and upvote it.

    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.
