push data from dataframe to json doc

Raj D 616 Reputation points
2020-08-18T19:15:00.7+00:00

Hi, I am using the code below in Python to read data from a SQL table into a dataframe, then write the results out as a JSON document saved to Azure Data Lake Storage Gen2.

https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview

    # JDBC connection details for the source SQL Server
    jdbcHostname = "hostname"
    jdbcDatabase = "databasename"
    jdbcPort = 1413
    jdbcUsername = "username"
    jdbcPassword = "password"

    jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
    connectionProperties = {
        "user": jdbcUsername,
        "password": jdbcPassword,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    }

    # Wrap the query in a derived-table alias so it can be passed as the table argument
    pushdown_query = "(select * from hr.employee) emp"
    df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
    display(df)

    # Write the dataframe as JSON; the placeholders must be replaced with real values
    df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/hr/emp")

The above code displays the data in the dataframe, but it does not create the folder or the JSON document. Could you please point out where I am going wrong?
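Since the JDBC read succeeds, the cluster can reach SQL Server; a common cause of the write failing silently or aborting is that the Spark session has no credentials configured for the storage account. Below is a minimal configuration sketch, assuming account-key authentication, with all placeholder values (`<storage-account-name>`, `<file_system>`, the key) left to be filled in; for ADLS Gen2, the `abfss://` scheme against the `dfs.core.windows.net` endpoint is the native choice:

```python
# Configuration sketch -- placeholders must be replaced; assumes account-key auth
# and a live Spark session with df already defined.
# Register the storage account key so the Hadoop filesystem connector
# can authenticate the write.
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    "<storage-account-key>")

# ADLS Gen2 native endpoint and scheme
output_path = "abfss://<file_system>@<storage-account-name>.dfs.core.windows.net/hr/emp"
df.write.mode("overwrite").json(output_path)
```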

Thank you

Azure Data Lake Storage
Azure Databricks

3 answers

  1. Winterstein Marine 26 Reputation points
    2021-08-27T12:43:49.853+00:00

    Hi PRADEEPCHEEKATLA-MSFT, this is almost the same solution I encountered in Data Science from Scratch, but I still run into this error:

         org.apache.spark.SparkException: Job aborted.                
         ---------------------------------------------------------------------------
         Py4JJavaError                             Traceback (most recent call last)
         <command-2802701877950826> in <module>
               9 df=spark.createDataFrame([Row(**i) for i in data])
              10 df.show()
         ---> 11 df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/hr/emp")
    
         /databricks/spark/python/pyspark/sql/readwriter.py in json(self, path, mode, compression, dateFormat, timestampFormat, lineSep, encoding)
             815             compression=compression, dateFormat=dateFormat, timestampFormat=timestampFormat,
             816             lineSep=lineSep, encoding=encoding)
         --> 817         self._jwrite.json(path)
             818 
             819     @since(1.4)
    
         /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
            1255         answer = self.gateway_client.send_command(command)
            1256         return_value = get_return_value(
         -> 1257             answer, self.gateway_client, self.target_id, self.name)
            1258 
            1259         for temp_arg in temp_args:
    
    2 people found this answer helpful.

  2. PRADEEPCHEEKATLA 90,641 Reputation points Moderator
    2020-08-19T06:05:05.017+00:00

    Hello @Raj D ,

    I'm able to push data from the dataframe to JSON with the same code that you tried.

    [screenshot]

    Here is the JSON document written to the storage account:

    [screenshot]

    If the above method does not work for you, you can try the following:

    Step 1: Configure the storage account path

        # Register the storage account key so Spark can authenticate against the storage account
        spark.conf.set("fs.azure.account.key.chepra.blob.core.windows.net", "gvXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXdlOiA==")
        output_container_path = "wasbs://******@chepra.blob.core.windows.net/Json_data_folder"
    

    Step 2: Use the below code snippet to write the SQL table result to a JSON document

        # coalesce(1) merges all partitions so the output folder contains a single part file
        df.coalesce(1).write.format("org.apache.spark.sql.json").mode("overwrite").save(output_container_path)
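    For context, Spark's JSON writer emits JSON Lines rather than a single JSON array: the output folder contains part files in which each row is one JSON object per line, and `coalesce(1)` merges the dataframe into one partition so only a single part file is produced. A small plain-Python sketch of that output format, using hypothetical rows in place of the `hr.employee` result:

```python
import json

# Hypothetical rows standing in for the hr.employee query result
rows = [
    {"emp_id": 1, "name": "Alice", "dept": "HR"},
    {"emp_id": 2, "name": "Bob", "dept": "Finance"},
]

# JSON Lines: one object per line, matching what Spark's json writer produces
json_lines = "\n".join(json.dumps(r) for r in rows)
print(json_lines)
```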
    

    [screenshot]

    Here is the JSON document written to the storage account:

    [screenshot]

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Accept Answer" and upvote the post that helps you, as this can be beneficial to other community members.

    1 person found this answer helpful.

  3. Raj D 616 Reputation points
    2020-08-20T00:49:48.913+00:00

    Hi PRADEEPCHEEKATLA-MSFT, for some reason I run into this error every once in a while: org.apache.spark.SparkException: Job aborted.

        org.apache.spark.SparkException: Job aborted.               
        ---------------------------------------------------------------------------
        Py4JJavaError                             Traceback (most recent call last)
        <command-2802701877950826> in <module>
              9 df=spark.createDataFrame([Row(**i) for i in data])
             10 df.show()
        ---> 11 df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/hr/emp")
        
        /databricks/spark/python/pyspark/sql/readwriter.py in json(self, path, mode, compression, dateFormat, timestampFormat, lineSep, encoding)
            815             compression=compression, dateFormat=dateFormat, timestampFormat=timestampFormat,
            816             lineSep=lineSep, encoding=encoding)
        --> 817         self._jwrite.json(path)
            818 
            819     @since(1.4)
        
        /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
           1255         answer = self.gateway_client.send_command(command)
           1256         return_value = get_return_value(
        -> 1257             answer, self.gateway_client, self.target_id, self.name)
           1258 
           1259         for temp_arg in temp_args:
    
