PySpark code shows different values when displaying the dataframe with limited columns

Manash 51 Reputation points
2023-12-29T16:33:12.1933333+00:00

I have a dataframe named customer_calculated_data that is the result of several computations on customer transactions. I then created a subset of this dataframe with only 3 columns: Customer_number, date, and one calculated column. When I display the subset dataframe, the values in the calculated column are as below:

[User's image: the subset output showing the calculated column values for the 3 customers]

But when I display all the columns from customer_calculated_data for the same 3 customers, the calculated column values shown above all appear as zeros. I expect the first 2 customers' values to be 1, but they are not. When I filter the dataset again and display only the 3 columns, I get the values above, but displaying the entire dataframe gives me only zeros in that calculated column. I cross-verified the dataframe name to check whether I was using two different versions, but there is only one dataframe named customer_calculated_data in my code. When I write the values to a parquet file and read the file back in Python, I also see the values as 0.
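
In outline, the steps look like this (simplified; the calculation logic is omitted, cust_ids stands in for the three customer numbers, and Calculated_column_9 is the calculated column):

```python
cust_ids = ["C1", "C2", "C3"]  # placeholder customer numbers

subset_df = customer_calculated_data.select(
    "Customer_number", "date", "Calculated_column_9"
)
subset_df.filter(subset_df.Customer_number.isin(cust_ids)).show()
# -> Calculated_column_9 shows the expected values (1 for the first two)

customer_calculated_data.filter(
    customer_calculated_data.Customer_number.isin(cust_ids)
).show()
# -> the same rows, but Calculated_column_9 is all zeros

customer_calculated_data.write.mode("overwrite").parquet("/path/to/output")
# reading this parquet back in Python also shows zeros
```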

I don't understand why this is so!

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Accepted answer
  1. Konstantinos Passadis 19,586 Reputation points MVP
    2023-12-30T19:31:22.6066667+00:00

    Hello @Manash !

    I understand that you are having some issues with your data calculations and visualization.

    I would kindly ask you to post your code if possible so we can provide better help. In the meantime, a few things to check:

    • If you are using pd.DataFrame.replace, pd.DataFrame.drop, etc., ensure you're either saving the result back to the dataframe or using the inplace=True argument if that is your intention (see the sketch after this list).
    • Use print statements or head() to look at the results after each operation that leads to the creation of the customer_calculated_data dataframe, to pinpoint where the values might be getting changed. It might offer you insight into the flow.
    • Double-check that you're not accidentally overwriting the calculated column anywhere in your code.
    • There is a chance your issue is caused by the order of operations, where the dataframe is written to a file before the calculations are applied.
    • Also, if you are using a Jupyter notebook, try restarting the kernel and running the whole notebook or script again; the state of the session can affect the output, and variables may not be cleared as expected when re-running cells. Unless these suggestions provide your solution, please post some details about the code and how you are running it!
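
    For the first point, a minimal pandas sketch of the pitfall (illustrative data, not your schema):

    ```python
    import pandas as pd

    df = pd.DataFrame({"Customer_number": ["A", "B"], "calc": [0, 0]})

    # Pitfall: replace() returns a new DataFrame; without an assignment
    # the original df is unchanged and keeps showing zeros.
    df.replace({"calc": {0: 1}})        # result is silently discarded
    print(df["calc"].tolist())          # [0, 0] -- unchanged

    # Fix: save the result back (or pass inplace=True).
    df = df.replace({"calc": {0: 1}})
    print(df["calc"].tolist())          # [1, 1]
    ```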

    I hope this helps!

    The answer or portions of it may have been assisted by AI. Source: ChatGPT Subscription

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards


2 additional answers

  1. Konstantinos Passadis 19,586 Reputation points MVP
    2024-01-04T10:45:34.5266667+00:00

    Hello @Manash !

    Thank you for the info!

    1. Try resetting the index of the DataFrame using the .reset_index() method before creating the subset. This can help ensure that the indexing and referencing of the DataFrame are consistent across all columns.
    2. One possibility could be a discrepancy in the data types or formats when filtering and displaying different columns, leading to inconsistent values being shown. Another potential issue could be related to the indexing or referencing of the DataFrame, causing unexpected behavior when displaying subsets of data versus the entire DataFrame.

    To diagnose this issue, you can try using the .loc method to create the subset of the DataFrame and ensure that the data types and formats are consistent with the original DataFrame.
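
    A minimal sketch of both suggestions, assuming a pandas DataFrame with placeholder columns:

    ```python
    import pandas as pd

    # Stand-in for customer_calculated_data; the shuffled index mimics
    # what earlier filtering/merging steps can leave behind.
    df = pd.DataFrame(
        {"Customer_number": ["A", "B", "C"],
         "date": pd.to_datetime(["2023-12-01"] * 3),
         "calc": [1.0, 1.0, 0.0]},
        index=[7, 2, 9],
    )

    df = df.reset_index(drop=True)  # 1. make the index contiguous before subsetting

    # 2. build the subset with .loc so labels and dtypes carry over unchanged
    subset = df.loc[:, ["Customer_number", "date", "calc"]]
    print(subset.dtypes)            # compare against df.dtypes for discrepancies
    ```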


    I hope this helps!

    The answer or portions of it may have been assisted by AI. Source: ChatGPT Subscription

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards


  2. Konstantinos Passadis 19,586 Reputation points MVP
    2024-01-04T18:37:47.62+00:00

    Hello @Manash !

    I did extensive research on this issue.

    From what I understand, it could be a number of factors.

    Once you applied caching and we got the result shown, it indicates a discrepancy in the data being displayed, which could be due to several reasons.

    Apache Spark's caching is lazy, meaning it does not compute the cache immediately upon calling .cache(); the data is actually computed and stored only when an action (like .show(), .write(), etc.) is triggered. Trigger an action immediately after caching to ensure the data is computed and stored as expected, and use customer_data.unpersist() in PySpark to remove the DataFrame from memory/cache. Also ensure that the data types are consistently defined in PySpark, since type mismatches can occur with operations like cast().
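
    A short sketch of that sequence (assuming customer_data is the DataFrame you cached):

    ```python
    # .cache() is lazy: it only marks the DataFrame for caching.
    customer_data = customer_data.cache()

    # An action forces the cache to actually be computed and stored.
    customer_data.count()

    customer_data.show(5)   # subsequent actions now read the cached data

    # If the computation logic changed after caching, drop the stale
    # copy and recompute from scratch.
    customer_data.unpersist()
    ```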

    1. Consider Clearing Cache: If there was a change in the computation logic after the data was cached, the cache might still be holding onto the old values. You can use customer_data.unpersist() to clear the cache and then rerun the computations.
    2. Ensure No Overwrites: Make sure that customer_data is not being overwritten after the cache operation, which could lead to displaying stale or incorrect data.
    3. Inspect Spark Jobs: Each time you perform an action on a DataFrame, Spark executes a job. Inspecting the Spark job logs can give you insights into what transformations are being applied and in what order.
    4. Test with Checkpointing: Instead of caching, try using checkpointing, which is a more robust way to truncate the logical plan of the DataFrame and save the intermediate state to disk.
    5. Examine Data Types: Ensure that the data type of Calculated_column_9 is correct throughout all operations. A change in data type could potentially lead to unexpected results.
    6. Use Explain Plan: Use the .explain(True) method on your DataFrame to see the logical and physical plans of your computations. This can help identify any issues with the execution plan that might be causing the discrepancy (see the sketch after this list).
    7. Write Intermediate Results: As a diagnostic step, write the results of customer_data to a file after caching and again after the final transformations. Then read these files back into DataFrames to verify the values of Calculated_column_9 at each stage.
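
    A hedged sketch of points 4, 6 and 7 (the paths are placeholders):

    ```python
    # 4. Checkpointing truncates the lineage and saves the state to disk.
    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder path
    customer_data = customer_data.checkpoint()

    # 6. Print the logical and physical plans of the computation.
    customer_data.explain(True)

    # 7. Write an intermediate result, read it back, and verify the column.
    customer_data.write.mode("overwrite").parquet("/tmp/stage1")  # placeholder path
    spark.read.parquet("/tmp/stage1").select("Calculated_column_9").show()
    ```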

    Also, use collect(): instead of displaying with show(), use collect() to bring the data into the driver node and print it, which can sometimes give a more accurate picture of what's in the DataFrame.
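
    For example (cust_ids stands in for your three customer numbers):

    ```python
    rows = customer_data.filter(
        customer_data.Customer_number.isin(cust_ids)
    ).collect()

    for row in rows:
        print(row["Customer_number"], row["Calculated_column_9"])
    ```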

    Please verify the above with a "clear mind". Sometimes we miss something that was there all the time!

    Kindly let us know the results and we will do our best to assist you. The more details you can provide, the better help you will get!


    I hope this helps!

    The answer or portions of it may have been assisted by AI. Source: ChatGPT Subscription

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards

