What is the first step in the statistical analysis of terms in a text in the context of natural language processing (NLP)?

ab321 0 Reputation points
2023-12-04T19:16:06.8666667+00:00

In the practice assessment for the AI-900 exam, the correct answer to the following question is given as "removing stop words".

What is the first step in the statistical analysis of terms in a text in the context of natural language processing (NLP)?

But in the Microsoft Learn Module: Understand Text Analytics, it states, "The first step in analyzing a corpus is to break it down into tokens."

Nowhere in the self-paced learning module for the AI-900 exam does it state that the first step in the statistical analysis of terms in a text, in the context of NLP, is "removing stop words".

What is the correct answer? And where can I find supporting documentation?


1 answer

  1. 86576635 15 Reputation points
    2023-12-04T20:12:19.0833333+00:00

    AI-900 Question

    The confusion seems to arise from a discrepancy between the AI-900 exam practice assessment and the Microsoft Learn Module: Understand Text Analytics.

    In the practice assessment, the correct answer is given as “removing stop words”. However, the Microsoft Learn Module states that the first step in analyzing a corpus is tokenization, which involves breaking down the text into individual words or “tokens”.

    Both tokenization and stop word removal are important steps in NLP, and their order can vary with the task and the focus of the analysis. In many NLP pipelines, tokenization comes first, followed by other preprocessing steps such as stop word removal and stemming. The practice assessment question, however, asks specifically about “statistical analysis of terms,” and in that context removing stop words is treated as the first step. Stop words (common words such as “is”, “the”, and “and”) occur so frequently that they can skew term statistics; removing them focuses the analysis on the more meaningful words in the text.
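
To make the two steps concrete, here is a minimal Python sketch of tokenization, stop word removal, and term counting. The stop word list, the sample sentence, and the function names are all illustrative assumptions, not anything taken from the AI-900 material. The sketch tokenizes first only because its stop word filter operates on tokens; it shows one possible ordering and is not a ruling on the exam question.

```python
import re
from collections import Counter

# Illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "the", "and", "a", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def term_frequencies(text):
    """Count how often each non-stop-word token occurs."""
    return Counter(remove_stop_words(tokenize(text)))


sample = "The cat and the dog: the cat is in the garden."

# Counts with stop words left in, then with them removed.
raw_counts = Counter(tokenize(sample))
filtered_counts = term_frequencies(sample)

print(raw_counts.most_common(1))       # [('the', 4)]
print(filtered_counts.most_common(1))  # [('cat', 2)]
```

With stop words left in, the most frequent term is "the"; once they are removed, "cat" becomes the top term, which is the skew described above.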

    I believe it would be beneficial for a moderator to provide a final decision on this matter to clear up any remaining confusion.

    Microsoft Learn Module: Understand Text Analytics - https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-text-analytics-use-mmlspark
