Read multiline json string using Spark dataframe in azure databricks

Raj D 616 Reputation points
2021-03-10T04:45:29.32+00:00

I am reading the contents of an API into a dataframe using the PySpark code below in a Databricks notebook. I validated the JSON payload and the string is in valid JSON format. I guess the error is due to the multiline JSON string. The code below worked fine with other JSON API payloads.

Spark version < 2.2

import requests

user = "usr"
password = "aBc!23"
# Call the API and parse the JSON response
response = requests.get('https://myapi.com/allcolor', auth=(user, password))
jsondata = response.json()

from pyspark.sql import *
# Read the payload into a Spark DataFrame
df = spark.read.option("multiline", "true").json(sc.parallelize([jsondata]))
df.show()

JSON payload:

{
  "colors": [
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          255,
          1
        ],
        "hex": "#000"
      }
    },
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [
          0,
          0,
          0,
          1
        ],
        "hex": "#FFF"
      }
    },
    {
      "color": "red",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          0,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "blue",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          0,
          0,
          255,
          1
        ],
        "hex": "#00F"
      }
    },
    {
      "color": "yellow",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "green",
      "category": "hue",
      "type": "secondary",
      "code": {
        "rgba": [
          0,
          255,
          0,
          1
        ],
        "hex": "#0F0"
      }
    }
  ]
}

Error:

pyspark.sql.dataframe.DataFrame = [_corrupt_record: string]

2 answers

  1. PRADEEPCHEEKATLA 90,641 Reputation points Moderator
    2021-03-10T11:30:02.873+00:00

    Hello @Raj D ,

    Thanks for the ask and using the Microsoft Q&A platform.

    Note that the file that is offered as a JSON file is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
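
    For example, as a minimal sketch assuming jsondata is the dict returned by response.json() in your code above, you can re-serialize the whole payload into a single JSON string before parallelizing it, so that spark.read.json receives one self-contained JSON document:

    import json
    from pyspark.sql.functions import explode

    # Assumption: jsondata is the parsed API response from the question's code
    payload_str = json.dumps(jsondata)
    df = spark.read.json(sc.parallelize([payload_str]))
    df.printSchema()

    # Optionally flatten the nested "colors" array into one row per color
    colors_df = df.select(explode("colors").alias("c")).select("c.*")
    colors_df.show(truncate=False)

    Alternatively, if the payload is stored as a file, spark.read.option("multiLine", "true").json(path) reads the whole file as one JSON document.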

    You may check out the threads below, which address a similar issue:

    https://stackoverflow.com/questions/38895057/reading-json-with-apache-spark-corrupt-record

    https://www.mail-archive.com/user@spark.apache.org/msg59206.html

    Hope this helps. Do let us know if you have any further queries.


    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, as this can be beneficial to other community members.


  2. Raj D 616 Reputation points
    2021-03-11T17:54:41.637+00:00

    Hello @PRADEEPCHEEKATLA-MSFT,

    Modified code:

    spark.sql("set spart.databricks.delta.preview.enabled=true")
    spark.sql("set spart.databricks.delta.retentionDutationCheck.preview.enabled=false")
    import json
    import requests
    from requests.auth import HTTPDigestAuth
    import pandas as pd
    user = "username"
    password = "password"
    myResponse = requests.get('https://myapi.com/allcolor', auth=(user, password))
    if(myResponse.ok):
      jData = json.loads(myResponse.content)
      s1 = json.dumps(jData)
      #load data from api
      x = json.loads(s1)
      data = pd.read_json(json.dumps(x))
      #create dataframe
      spark_df = spark.createDataFrame(data)
      spark_df.show()          
      spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net","<your-storage-account-access-key>")
      spark_df.write.mode("overwrite").json("wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory>/")
    else:
      myResponse.raise_for_status()
    

    The output is not in the same format as the source:

    {
      "colors": 
        {
          "color": "black",
          "category": "hue",
          "type": "primary",
          "code": {
            "rgba": [
              255,
              255,
              255,
              1
            ],
            "hex": "#000"
          }
        }
        }
    {
      "colors":     
        {
          "color": "white",
          "category": "value",
          "code": {
            "rgba": [
              0,
              0,
              0,
              1
            ],
            "hex": "#FFF"
          }
        }
        }
    

    Could you please point out where I am going wrong, as the output file I am storing in ADLS Gen2 does not match the source API JSON payload?
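
    For reference, here is a minimal check of the intermediate pandas DataFrame (assuming jData is the parsed payload from the code above), which shows the top-level "colors" array already expanded into one row per element before the Spark DataFrame is created:

    import json
    import pandas as pd

    # Assumption: jData is the parsed payload from the modified code above
    data = pd.read_json(json.dumps(jData))
    print(data.shape)               # (6, 1): one row per color, one "colors" column of dicts
    print(data.iloc[0]["colors"])   # {'color': 'black', 'category': 'hue', ...}
    # spark_df.write.json() then writes JSON Lines (one object per row),
    # so the output file no longer looks like the single nested source document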

    Thank you

