Unable to read a CSV file from Cosmos Gen1 through Spark; pandas read works.

Ajay Prasadh Viswanathan 1 Reputation point Microsoft Employee
2020-10-08T15:35:40.867+00:00

This works,

df = pd.read_csv('/dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv')
sdf = spark.createDataFrame(df)
sdf.head()

But the equivalent Spark read does not work.

df = spark.read.csv('/dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv')
df

It returns this error:

Py4JJavaError: An error occurred while calling o3781.csv.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
Tags: Azure Databricks, Azure Cosmos DB

4 answers

  1. PRADEEPCHEEKATLA-MSFT 88,791 Reputation points Microsoft Employee
    2020-10-09T04:56:26.49+00:00

    Hello @Ajay Prasadh Viswanathan ,

    Welcome to Microsoft Q&A platform.

    To read from mount points via Spark methods, you shouldn't use the /dbfs/mnt prefix (as in /dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv). Use the dbfs:/mnt prefix instead, i.e. dbfs:/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv. The /dbfs/... form is the local FUSE path used by pandas and shell commands, while Spark readers expect the dbfs:/... URI.
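    The prefix difference can be sketched as a tiny helper. This is purely illustrative of the /dbfs/... vs dbfs:/... convention; the function name to_spark_path is hypothetical, not a Databricks API:

    ```python
    def to_spark_path(fuse_path: str) -> str:
        # pandas and %sh go through the local FUSE mount at /dbfs/...,
        # while Spark readers expect the dbfs:/... URI scheme.
        if fuse_path.startswith("/dbfs/"):
            return "dbfs:/" + fuse_path[len("/dbfs/"):]
        return fuse_path

    print(to_spark_path("/dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv"))
    # → dbfs:/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv
    ```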

    [image: 31163-image.png]

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Accept Answer" and Upvote on the post that helps you, this can be beneficial to other community members.


  2. Ajay Prasadh Viswanathan 1 Reputation point Microsoft Employee
    2020-10-12T13:16:59.393+00:00

    Hello @PRADEEPCHEEKATLA-MSFT ,
    Yes, I have mounted the dataset:

       configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
                  "fs.adl.oauth2.client.id": "xxxx",
                  "fs.adl.oauth2.credential": "xxxxx",
                  "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/xxxx/oauth2/token"}

       # Optionally, you can add <directory-name> to the source URI of your mount point.
       dbutils.fs.mount(
           source = "adl://office-adhoc-c14.azuredatalakestore.net/local/users/ajviswan",
           mount_point = "/mnt/ajviswan",
           extra_configs = configs)

       print('mounted')
    

    After mounting, I can read the data via python and the bash terminal.

       pd.read_csv("/dbfs/mnt/ajviswan/CPUPrediction/CPUPrediction_2020_09_17.csv")  
    

    works,

       %sh  
       head /dbfs/mnt/ajviswan/CPUPrediction/CPUPrediction_2020_09_17.csv  
    

    also works.

    But the Spark read does not work:

       df = spark.read.csv("dbfs:/mnt/ajviswan/CPUPrediction/CPUPrediction_2020_09_17.csv")  
    

    I get: Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.ReadSupport

    The cluster configuration is:

       7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)  
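    A side note on the error itself: org.apache.spark.sql.sources.v2.ReadSupport belongs to the Data Source V2 API of Spark 2.3/2.4 and was removed in Spark 3.0 (DSv2 moved to the org.apache.spark.sql.connector package), so a NoClassDefFoundError for it on a 7.3 LTS (Spark 3.0.1) cluster usually points to a library compiled against Spark 2.x being attached to the cluster. A hypothetical helper, just to illustrate the version boundary (not a Spark API):

    ```python
    def dsv2_readsupport_exists(spark_version: str) -> bool:
        # org.apache.spark.sql.sources.v2.ReadSupport shipped in Spark 2.3
        # and 2.4 only; Spark 3.x moved DSv2 to org.apache.spark.sql.connector.
        major, minor = (int(p) for p in spark_version.split(".")[:2])
        return (major, minor) in {(2, 3), (2, 4)}

    print(dsv2_readsupport_exists("3.0.1"))  # → False
    print(dsv2_readsupport_exists("2.4.5"))  # → True
    ```

    Checking the cluster's installed libraries for Spark 2.x-era connectors would be the natural next diagnostic step.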
    

  3. PRADEEPCHEEKATLA-MSFT 88,791 Reputation points Microsoft Employee
    2020-10-13T11:04:57.06+00:00

    Hello @Ajay Prasadh Viswanathan ,

    I have tested on Azure Databricks Runtime: 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12).

    To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following command:

    [image: 32014-image.png]

    And I was able to read from the mount point via Spark methods.

    [image: 32005-image.png]

    Reference: Azure Databricks - Azure Data Lake Storage Gen1

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Accept Answer" and Upvote on the post that helps you, this can be beneficial to other community members.


  4. Ajay Prasadh Viswanathan 1 Reputation point Microsoft Employee
    2020-10-13T11:42:49.803+00:00

    Hi @PRADEEPCHEEKATLA-MSFT , I have followed the instructions exactly and mounted correctly.

    [image: 32041-1.png]

    I am able to read the data through python.

    [image: 32025-2.png]

    I am unable to read it through spark.

    [image: 32026-3.png]

    My data sits behind the AAD wall in Cosmos ADLS Gen1 and is accessible from an AAD account (ajviswan_debug@prdtrs01.prod.outlook.com), while my Databricks workspace is under a corp account (ajviswan). My intuition is that if this were a mounting issue, it would have complained when I tried to mount, and I would not have been able to read the data through Python either. Since Python reads succeed and only Spark reads fail, it seems I am missing something bigger. It would be really great if you could help me and my team with this.

