The confusion seems to arise from a discrepancy between the AI-900 exam practice assessment and the Microsoft Learn Module: Understand Text Analytics.
In the practice assessment, the correct answer is given as “removing stop words”. However, the Microsoft Learn Module states that the first step in analyzing a corpus is tokenization, which involves breaking down the text into individual words or “tokens”.
Both tokenization and stop word removal are important steps in NLP, but their order can vary depending on the specific task or the focus of the analysis. In most NLP pipelines, tokenization is the very first step, followed by preprocessing steps such as stop word removal, stemming, and so on. In the context of the practice-assessment question, however, which focuses specifically on “statistical analysis of terms,” removing stop words is treated as the first step: stop words (common words like “is”, “the”, “and”, etc.) occur so frequently that they skew term statistics, and removing them keeps the analysis focused on the more meaningful words in the text.
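To illustrate the ordering discussed above, here is a minimal sketch in plain Python (no NLP library; the stop word list is an illustrative subset, not an official one) showing tokenization followed by stop word removal before counting term frequencies:

```python
import re
from collections import Counter

# Illustrative subset of English stop words (real lists are much longer).
STOP_WORDS = {"is", "the", "and", "a", "an", "of", "to", "in"}

def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop high-frequency function words that would skew term statistics."""
    return [t for t in tokens if t not in STOP_WORDS]

text = "The cat sat on the mat and the cat is happy."
tokens = tokenize(text)          # tokenization first
terms = remove_stop_words(tokens)  # then stop word removal
print(Counter(terms).most_common(2))  # → [('cat', 2), ('sat', 1)]
```

Note that without the stop word removal step, “the” (three occurrences) would dominate the counts, which is exactly the skew the practice assessment's answer is getting at.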
I believe it would be beneficial for a moderator to provide a final decision on this matter to clear up any remaining confusion.
Microsoft Learn Module: Understand Text Analytics - https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-text-analytics-use-mmlspark