How to Handle Dynamic Date formats which are of string type inside a single column in pyspark

A K, Srinivaasan 1 Reputation point
2022-10-12T07:43:00.823+00:00

Hi experts,
I have a date column which is of string type comprising of two date formats

  1. Dec 15 2019 12:00AM
  2. 04/27/2020
    I need to parse them into a unified format such as '2019-12-06'
    Can you please help me with inputs on getting them parsed?
    @AnnuKumari-MSFT
    @SSingh-MSFT

Thank you

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,514 questions
0 comments No comments
{count} votes

2 answers

Sort by: Most helpful
  1. MartinJaffer-MSFT 26,236 Reputation points
    2022-10-12T18:15:17.717+00:00

    Hello @A K, Srinivaasan ,
    Thanks for the question and using MS Q&A platform.

    As I understand, the goal is to wrangle data containing multiple datetime formats into a single cohesive format. It is good you only have 2 formats without ambiguousness, such as mm/dd/yy vs dd/mm/yy.

    I recommend you take a look at the dateutil python library. It is an extension to datetime, and contains a parser which work on many formats. You will need to install this library (pip install python-dateutil), it isn't built into the standard python library last I checked. The parser outputs python datetime datatypes. Then we use the strftime with a desired format code.

    Then you can do like:

    from dateutil import parser  
    
    form1 = "Dec 15 2019 12:00AM"  
    form2 = "04/27/2020"  
    
    def reformat ( invalue ):  
        return parser.parse(invalue).strftime("%Y-%m-%d")  
    
    >>> reformat(form1)  
    
    '2019-12-15'  
    >>> reformat(form2)  
    
    '2020-04-27'  
    

    Please do let me if you have any queries.

    Thanks
    Martin


    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
      • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators
    0 comments No comments

  2. A K, Srinivaasan 1 Reputation point
    2022-10-14T06:31:01.763+00:00

    Hi @MartinJaffer-MSFT ,

    Your code works ! , But when I apply it as a UDF function to a column and cast it as a date the casting is done as date.
    But when I do a .display() and call the dataframe

    I get this error :
    PythonException: 'TypeError: raise TypeError('Parser must be a string or character stream, not '
    TypeError: Parser must be a string or character stream, not date

    Can you please help me on this?

    Thanks


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.