Azure Data Factory Copy Data is not parsing CSV correctly. It is splitting at ',' within double quotes.

Paul Simon 5 Reputation points
2023-02-13T16:57:22.01+00:00

The CSV parser is not working properly. In the below it is splitting "PRI/LTD BY GUAR/NSC (Private, limited by guarantee, no share capital)" into three columns even though double quote text qualifier is selected.

"""BE"" NEWHALL MANAGEMENT COMPANY LIMITED","07326397","","","VANTAGE POINT","23 MARK ROAD","HEMEL HEMPSTEAD","HERTFORDSHIRE","","HP2 7DN","PRI/LTD BY GUAR/NSC (Private, limited by guarantee, no share capital)","Active","United Kingdom","","27/07/2010","31","7","30/04/2024","31/07/2022","DORMANT","24/08/2016","27/07/2015","0","0","0","0","98000 - Residents property management","","","","0","0","http://business.data.gov.uk/id/company/07326397","","","","","","","","","","","","","","","","","","","","","10/08/2023","27/07/2022"

This is data from Companies House.

https://download.companieshouse.gov.uk/BasicCompanyDataAsOneFile-2023-02-01.zip

Does anyone know a way around this?

On the Dataset I have selected a Compression Type ZipDeflate

Comma Delimiter: Comma

Row delimiter: Default \r,\n or \r\n

Encoding UTF-8

Escape Character: Backslash \

Quote Character: Double quote (")

First row as header ticked

Null value: is left blank

I have opened the Part 1 of 7 version of the file in Notepad++ and the CSV Query PlugIn has parsed it correctly, and I didn't need to tell it anything.

Why is a simple thing like this not working in ADF?

Regards

Paul Simon

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,624 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Paul Simon 6 Reputation points
    2023-02-16T14:16:26.4766667+00:00

    Hi Kranthi

    Firstly, I would regard this as a bug rather than a feature request. If a free product like Notepad++ can open and correctly parse this CSV file then surely Microsoft's rolls-royce ETL product, ADF should be able to do this?

    This is a CSV file that has been produced by Companies House which holds records on every UK company. It could probably be better formatted and I have raised that with them. However, I am not holding out much hope for a speedy solution from either Companies House or Microsoft. We need to have Company Registration Numbers to comply with a government return in two weeks time.

    The key issue seems to be that the file has used "" when a company name contains ". The overall number of double quotes in a field is still balanced, ie there will always be an even number of double quotes.

    I have tried accessing the link to post a feature request.

    https://feedback.azure.com/d365community/forum/1219ec2d-6c26-ec11-b6e6-000d3a4f032c

    However, it will not allow me to login. If I click Sign In it immediately brings up a dialog box telling me that I need to register but my email address is already registered and therefore I cannot register with that email address, but if it is already registered then I should be able to sign in with that email address.

    If you can resolve that issue then I will be happy to post a feature request

    Regards

    Paul Simon


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.