Read multiline json string using Spark dataframe in azure databricks

Raj D 586 Reputation points
2021-03-10T04:45:29.32+00:00

I am reading the contents of an API into a dataframe using the PySpark code below in a Databricks notebook. I validated the JSON payload and the string is in valid JSON format. I suspect the error is due to the multi-line JSON string; the code below worked fine with other JSON API payloads.

Spark version: < 2.2 (note: the multiline option for the JSON reader was only added in Spark 2.2)

import requests

user = "usr"
password = "aBc!23"
response = requests.get('https://myapi.com/allcolor', auth=(user, password))
# spark.read.json over an RDD expects JSON *strings*; pass the raw
# response text rather than the parsed dict from response.json()
data = response.text
df = spark.read.option("multiline", "true").json(sc.parallelize([data]))
df.show()
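As an aside, the serialization step matters here: requests has already parsed the body into a Python dict by the time response.json() returns, while Spark's JSON reader consumes strings. A minimal stdlib sketch of that distinction (the payload dict is a hypothetical stand-in for response.json()):

```python
import json

# Stand-in for response.json(): requests has already decoded the body
# into a Python dict at this point.
payload = {"colors": [{"color": "black", "category": "hue"}]}

# spark.read.json(sc.parallelize([...])) expects JSON *strings*, so
# serialize the dict back to text (or simply use response.text).
data = json.dumps(payload)

# Round-tripping through a string loses nothing.
assert json.loads(data) == payload
```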

JSON payload:

{
  "colors": [
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          255,
          1
        ],
        "hex": "#000"
      }
    },
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [
          0,
          0,
          0,
          1
        ],
        "hex": "#FFF"
      }
    },
    {
      "color": "red",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          0,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "blue",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          0,
          0,
          255,
          1
        ],
        "hex": "#00F"
      }
    },
    {
      "color": "yellow",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "green",
      "category": "hue",
      "type": "secondary",
      "code": {
        "rgba": [
          0,
          255,
          0,
          1
        ],
        "hex": "#0F0"
      }
    }
  ]
}

Error:

pyspark.sql.dataframe.DataFrame = [_corrupt_record: string]

2 answers

  1. PRADEEPCHEEKATLA-MSFT 86,131 Reputation points Microsoft Employee
    2021-03-10T11:30:02.873+00:00

    Hello @Raj D ,

    Thanks for the ask and using the Microsoft Q&A platform.

    Note that a file offered to Spark's default JSON reader is not a typical JSON file: each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail unless it is read with the multiline option.
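    To illustrate the line-delimited requirement, a small stdlib sketch that converts a pretty-printed document (abbreviated from the payload in the question) into the one-object-per-line form the default reader expects:

```python
import json

# An abbreviated multi-line document, like the payload in the question.
multiline = """
{
  "colors": [
    {"color": "black", "category": "hue"},
    {"color": "white", "category": "value"}
  ]
}
"""

# Parse the whole document once, then emit one self-contained JSON
# object per line -- the line-delimited form Spark expects by default.
records = json.loads(multiline)["colors"]
json_lines = "\n".join(json.dumps(r) for r in records)
print(json_lines)
```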

    You may check out the threads below, which address a similar issue:

    https://stackoverflow.com/questions/38895057/reading-json-with-apache-spark-corrupt-record

    https://www.mail-archive.com/user@spark.apache.org/msg59206.html

    Hope this helps. Do let us know if you have any further queries.


    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, as this can be beneficial to other community members.


  2. Raj D 586 Reputation points
    2021-03-11T17:54:41.637+00:00

    Hello @PRADEEPCHEEKATLA-MSFT,

    Modified code:

    spark.sql("set spark.databricks.delta.preview.enabled=true")
    spark.sql("set spark.databricks.delta.retentionDurationCheck.enabled=false")
    import json
    import requests
    import pandas as pd

    user = "username"
    password = "password"
    myResponse = requests.get('https://myapi.com/allcolor', auth=(user, password))
    if myResponse.ok:
      # Parse the response body once; requests does the JSON decoding
      jData = myResponse.json()
      # Flatten the payload into a pandas DataFrame (one row per element)
      data = pd.read_json(json.dumps(jData))
      # Create a Spark DataFrame from the pandas DataFrame
      spark_df = spark.createDataFrame(data)
      spark_df.show()
      spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net", "<your-storage-account-access-key>")
      spark_df.write.mode("overwrite").json("wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory>/")
    else:
      myResponse.raise_for_status()
    

    The output is not in the same format as the source:

    {
      "colors": 
        {
          "color": "black",
          "category": "hue",
          "type": "primary",
          "code": {
            "rgba": [
              255,
              255,
              255,
              1
            ],
            "hex": "#000"
          }
        }
        }
    {
      "colors":     
        {
          "color": "white",
          "category": "value",
          "code": {
            "rgba": [
              0,
              0,
              0,
              1
            ],
            "hex": "#FFF"
          }
        }
        }
    

    Could you please point out where I am going wrong? The output file I am storing in ADLS Gen2 does not match the source API JSON payload.
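    Worth noting here: DataFrame.write.json always emits line-delimited JSON (one object per line, split across one file per partition), so it cannot reproduce the original nested document as-is. One way to restore the source shape before writing is to re-wrap the rows under the original top-level key; a sketch with stdlib stand-ins (in the notebook, rows would come from something like [r.asDict(recursive=True) for r in spark_df.collect()]):

```python
import json

# Hypothetical rows shaped like the records Spark wrote out, one JSON
# object each; stand-ins for collected DataFrame rows.
rows = [
    {"color": "black", "category": "hue",
     "code": {"rgba": [255, 255, 255, 1], "hex": "#000"}},
    {"color": "white", "category": "value",
     "code": {"rgba": [0, 0, 0, 1], "hex": "#FFF"}},
]

# Re-wrap the rows under the original top-level "colors" key so the
# stored document has the same shape as the source payload.
document = json.dumps({"colors": rows}, indent=2)
```

    The resulting string can then be written out as a single file, whereas write.json will always produce the per-line records shown above.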

    Thank you