Speed up keyphrase extraction on huge dataset

DivyaK-4075 11 Reputation points


I am using the Cognitive Services transformers from the mmlspark package to extract key phrases. My dataset has ~500k records (5 lakh), and the job takes too long: it runs for more than 24 hours. Is there a faster or more efficient way to extract key phrases from a dataset this large?

keyphrase = KeyPhraseExtractor()  # configuration calls omitted

results = keyphrase.transform(df_cleaned)

I am running the job in a Synapse notebook on a Spark cluster.


Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,673 questions

1 answer

  1. HimanshuSinha-msft 19,386 Reputation points Microsoft Employee

    Hello @DivyaK-4075,
    Thanks for the question and for using the MS Q&A platform.
    As we understand it, the ask here is how to speed up the execution; please let us know if that is not accurate.
    I suggest you start by increasing the node size and see if that helps.
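
    To gauge how much extra capacity might help, here is a rough back-of-the-envelope sketch in Python. It uses only the figures from the question (~500k records, more than 24 hours). The function names are hypothetical, and the assumption that throughput scales linearly with the number of parallel tasks is an idealization: the Cognitive Services rate limit for your pricing tier may cap the real gain.

```python
def observed_rate(records: int, hours: float) -> float:
    """Average records processed per second over the whole job."""
    return records / (hours * 3600)


def estimated_hours(records: int, rate_per_task: float, parallel_tasks: int) -> float:
    """Estimated runtime in hours, ASSUMING throughput scales linearly
    with the number of parallel tasks (an idealization, not a guarantee)."""
    return records / (rate_per_task * parallel_tasks) / 3600


# Figures taken from the question: ~500k records in roughly 24 hours.
rate = observed_rate(500_000, 24)
print(f"Observed throughput: {rate:.2f} records/s")

# Hypothetical parallelism levels, for illustration only.
for tasks in (1, 4, 16):
    hours = estimated_hours(500_000, rate, tasks)
    print(f"{tasks:>2} parallel task(s): about {hours:.1f} h")
```

    An average of only a few records per second suggests the job may be spending most of its time waiting on service calls, so it is worth checking how many tasks actually run in parallel before and after resizing the pool.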


    Please do let me know if you have any queries.
