Unable to read CSV file from Cosmos Gen1 through Spark; pandas read works.

Ajay Prasadh Viswanathan 1 Reputation point
2020-10-08T15:35:40.867+00:00

This works:

df = pd.read_csv('/dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv')
sdf = spark.createDataFrame(df)
sdf.head()

But the Spark read does not work:

df = spark.read.csv('/dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv')
df

It returns an error:

Py4JJavaError: An error occurred while calling o3781.csv.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport

4 answers

  1. PRADEEPCHEEKATLA-MSFT 78,331 Reputation points Microsoft Employee
    2020-10-09T04:56:26.49+00:00

    Hello @Ajay Prasadh Viswanathan ,

    Welcome to Microsoft Q&A platform.

    To read from mount points via Spark methods, you shouldn't use the /dbfs/mnt prefix (as in /dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv); use the dbfs:/mnt scheme instead (as in dbfs:/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv).

    31163-image.png
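    To illustrate the distinction (the helper function below is my own sketch, not a Databricks API): the FUSE mount under /dbfs/ is for local file APIs such as pandas, while Spark expects a dbfs:/ URI. The mapping between the two forms is a simple prefix swap:

    ```python
    def to_spark_uri(fuse_path: str) -> str:
        """Map a Databricks FUSE path (/dbfs/...) to the dbfs:/ URI Spark expects."""
        prefix = "/dbfs/"
        if not fuse_path.startswith(prefix):
            raise ValueError(f"not a FUSE path under /dbfs/: {fuse_path}")
        return "dbfs:/" + fuse_path[len(prefix):]

    # pandas (local file API) takes the FUSE path:
    #   pd.read_csv("/dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv")
    # Spark takes the dbfs: URI:
    #   spark.read.csv(to_spark_uri("/dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv"))
    print(to_spark_uri("/dbfs/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv"))
    # → dbfs:/mnt/ajviswan/forest_efficiency/2020-04-26_2020-05-26.csv
    ```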

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Accept Answer" and Upvote on the post that helps you, this can be beneficial to other community members.


  2. Ajay Prasadh Viswanathan 1 Reputation point
    2020-10-12T13:16:59.393+00:00

    Hello @PRADEEPCHEEKATLA-MSFT ,
    Yes, I have mounted the dataset:

       configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
                  "fs.adl.oauth2.client.id": "xxxx",
                  "fs.adl.oauth2.credential": "xxxxx",
                  "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/xxxx/oauth2/token"}
       # Optionally, you can add <directory-name> to the source URI of your mount point.
       dbutils.fs.mount(
         source = "adl://office-adhoc-c14.azuredatalakestore.net/local/users/ajviswan",
         mount_point = "/mnt/ajviswan",
         extra_configs = configs)

       print('mounted')
    

    After mounting, I can read the data via Python and the bash terminal.

       pd.read_csv("/dbfs/mnt/ajviswan/CPUPrediction/CPUPrediction_2020_09_17.csv")  
    

    works,

       %sh  
       head /dbfs/mnt/ajviswan/CPUPrediction/CPUPrediction_2020_09_17.csv  
    

    also works.

    But the Spark read does not work:

       df = spark.read.csv("dbfs:/mnt/ajviswan/CPUPrediction/CPUPrediction_2020_09_17.csv")  
    

    I get: Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.ReadSupport

    The cluster configuration is:

       7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)  
    

  3. PRADEEPCHEEKATLA-MSFT 78,331 Reputation points Microsoft Employee
    2020-10-13T11:04:57.06+00:00

    Hello @Ajay Prasadh Viswanathan ,

    I have tested on Azure Databricks Runtime: 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12).

    To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following command:

    32014-image.png

    And I am able to read from the mount point via Spark methods.

    32005-image.png

    Reference: Azure Databricks - Azure Data Lake Storage Gen1

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Accept Answer" and Upvote on the post that helps you, this can be beneficial to other community members.


  4. Ajay Prasadh Viswanathan 1 Reputation point
    2020-10-13T11:42:49.803+00:00

    Hi @PRADEEPCHEEKATLA-MSFT , I have followed the instructions exactly.
    I mounted correctly.

    32041-1.png

    I am able to read the data through python.

    32025-2.png

    I am unable to read it through spark.

    32026-3.png

    My data is behind an AAD wall in Cosmos ADLS Gen1, accessible from an AAD account (ajviswan_debug@prdtrs01.prod.outlook.com), while my Databricks workspace is under a corp account (ajviswan). My intuition is that if this were a mounting issue, it would have complained when I tried mounting, and I would not have been able to read the data through Python either. But I can read through Python and not through Spark, so it seems I am missing something big. It would be really great if you could help me and my team with this.