Differences in row counts between the Spark and pandas CSV readers

Auricchio Valerio 21 Reputation points
2021-02-13T00:19:00.837+00:00

I'm reading the same CSV file twice: once in Scala with Spark and once in Python with pandas. This is the code I'm using:

val tabella = spark.read.option("header",true).option("mode", "DROPMALFORMED").csv("/FileStore/tables/IMMOBILI_MDRE_FACT_FENICE_INNER_DWH_CREDITI_2.csv")

import pandas as pd
tabella = pd.read_csv("/dbfs/FileStore/tables/IMMOBILI_MDRE_FACT_FENICE_INNER_DWH_CREDITI_2.csv")

When I count the rows, the two readers give me different numbers.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 86,131 Reputation points Microsoft Employee
    2021-02-15T05:50:33.52+00:00

    Hello @Auricchio Valerio ,

    Welcome to the Microsoft Q&A platform.

    I tested this on our end, and the row count is the same whether the file is read in Scala with Spark or in Python with pandas.

    Check out the results:

    Using dataframe.count:

    67926-image.png

    Using display(dataframe):

    68042-image.png
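
If the counts still differ on your side, one possible cause (an assumption, since your file is not available here) is a quoted field that contains a line break. A reader that splits on physical lines sees more rows than a quote-aware parser does. The sketch below uses only the Python standard library and made-up sample data; `SAMPLE`, `count_physical_lines`, and `count_csv_records` are hypothetical names used for illustration, not part of Spark or pandas.

```python
import csv
import io

# Hypothetical sample: a header plus three records. The second record has a
# quoted field with an embedded newline, so it occupies two physical lines.
SAMPLE = 'id,note\n1,"plain"\n2,"spans\ntwo lines"\n3,"plain"\n'


def count_physical_lines(text: str) -> int:
    """Count data lines as a line-by-line reader would (header excluded)."""
    return len(text.splitlines()) - 1


def count_csv_records(text: str) -> int:
    """Count data records with a quote-aware CSV parser (header excluded)."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return sum(1 for _ in reader)


if __name__ == "__main__":
    print("physical lines:", count_physical_lines(SAMPLE))
    print("parsed records:", count_csv_records(SAMPLE))
```

If the two numbers differ for your file, it may be worth reading it in Spark with `.option("multiLine", true)` and comparing again. Also note that your Spark code uses the `DROPMALFORMED` mode, which discards rows Spark cannot parse, so its count can be lower than the pandas count for that reason alone.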

    Hope this helps. Do let us know if you have any further queries.

