
PalashAich-7056 asked · MartinJaffer-MSFT commented

Can we configure PolyBase to get data into on-premises SQL Server from an Azure Data Lake Storage Gen2 Common Data Model (CDM)?

Hello there, I need to get the data from an Azure Data Lake Storage Gen2 Common Data Model (CDM) folder into SQL Server on-premises. It seems we can configure PolyBase and avoid any ETL. Can we configure PolyBase to get data into on-premises SQL Server from a Common Data Model folder in Azure Data Lake Storage Gen2? If yes, what are the steps? Thanks, Palash

Tags: azure-data-factory, azure-data-lake-storage

@PalashAich-7056 did my reply solve your issue? If so, please mark it as the accepted answer; otherwise, let me know how I may better assist.


If you found your own solution, could you please share it here with the community?

MartinJaffer-MSFT answered · PalashAich-7056 commented

Hello @PalashAich-7056 and welcome to Microsoft Q&A.

The main difficulty with your ask is that CDM support runs in Data Flow, but Data Flow does not work on-premises. Copy Activity does work with on-premises targets.

This leads me to recommend you do this in two steps. First, using Data Flow, copy from the CDM folder to a staging location (such as delimited text in the Data Lake). Then use a Copy Activity to take the data from staging and use PolyBase to load it into your on-premises SQL Server.
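To make the second step concrete: on the SQL Server side, a PolyBase load typically ends with the external table over the staged files being queried into a local table. A minimal sketch, with hypothetical names (dbo.Account_Staged for the external table, dbo.Account for the local target, and both columns made up for illustration):

-- Sketch: dbo.Account_Staged is a hypothetical external table pointing at
-- the staged delimited files; dbo.Account is a hypothetical local table.
-- PolyBase reads the underlying files when the external table is queried.
INSERT INTO dbo.Account (AccountId, AccountName)
SELECT AccountId, AccountName
FROM dbo.Account_Staged;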

If you know exactly where the files are in the Data Lake, it may be possible to use a Copy Activity directly without going through CDM. I'm not very experienced with CDM, so this is just a 'maybe'.


Hi @MartinJaffer-MSFT,

Thanks for your response!

We are looking for a solution that avoids ETL. Basically, the data from MS Dynamics will be replicated and made available in ADLS Gen2 as CDM frequently. If we can point to this CDM location directly from SQL Server on-premises, we can eliminate the step of pulling the data into a staging location. Our data size is huge. We need to create a data warehouse on-premises, and the data will be populated into the DW incrementally using SSIS.

Any solution that can point directly to CDM will be helpful for getting the most recent data and reducing development effort.

Thanks,
Palash

MartinJaffer-MSFT answered · MartinJaffer-MSFT commented

Okay, so for going directly from on-premises SQL Server to Azure Storage, I found a relevant document, @PalashAich-7056.

This article talks about Blob Storage. Data Lake Storage Gen2 is built on top of Blob Storage, but it is not the same thing. I think there is enough compatibility to use the blob protocol against ADLS Gen2 for reading; on writing, I have less confidence.
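For the record, the blob-protocol setup being described would look roughly like the following. This is only a sketch: the data source name, container, and account are placeholders, it assumes a database scoped credential named AzureStorageCredential already exists, and (per the next paragraph) this path may not be supported.

-- Sketch: pointing PolyBase at the blob endpoint of an ADLS Gen2 account.
-- AzureStorageCredential is assumed to be an existing database scoped
-- credential; mycontainer and mystorageaccount are placeholder names.
CREATE EXTERNAL DATA SOURCE AzureStorageBlob
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);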

Upon further searching, I found documentation stating ADLS Gen2 is not supported.



Hi @MartinJaffer-MSFT,
Thanks for your response.

I am able to connect to ADLS Gen2 and query the data from SQL Server on-premises by configuring PolyBase. I created an ADF pipeline to pull the data from the ADLS Gen2 CDM folder, populate a single CSV file, and place it under a different folder in ADLS Gen2.

However, I am trying to eliminate the step of pulling the data from CDM into a CSV file and then pointing a SQL Server external table at that CSV file for querying. Do you mean that we cannot point directly to the CDM CSV files from SQL Server on-premises?
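For anyone following along, the PolyBase connection setup involved looks roughly like this. It is a sketch only: the master key password, credential identity, container, and storage account are placeholders, and the exact credential form may vary by SQL Server version.

-- Sketch of a PolyBase (Hadoop connectivity) setup against ADLS Gen2.
-- All names and secrets below are placeholders, not the actual setup.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = '<storage account name>',
     SECRET = '<storage account access key>';

-- AzureStorage is the data source name the external table later in this
-- thread refers to.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);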

Below is the structure of the CDM folder. The .json file has the metadata details, and inside the folder we have different partitions. The partition files do not have column headers.

[Screenshot: CDM folder structure, with a metadata .json file alongside entity folders containing partition files]

Hi @MartinJaffer-MSFT,

I am able to connect to CDM from On-Prem SQL server now.

The only issue I am facing is a column mismatch. A few values in one column contain a comma (,), and comma is also the field delimiter. Because of this, SQL Server throws the error below. Is there any way we can tweak this in the external table, please?

Error:
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Not enough columns in this line.
OLE DB provider "MSOLEDBSQL" for linked server "(null)" returned message "Unspecified error".


-- File format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- Create table
CREATE EXTERNAL TABLE dbo.Account_CDM (
    <Column list and data type>
)
WITH (
    LOCATION = '/account',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFile
);


I think I encountered a similar question before, @PalashAich-7056, where the data contained commas. The complex cases did not go well.

There is one thing to try which you are missing: specify a string delimiter.
The hope is that anything between string delimiters is treated as data, not as a delimiter.

CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);
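Note that this only helps if the partition files actually quote the values that contain commas; if they do not, the comma-in-data problem remains. Also, as a sketch of the full sequence: an external file format cannot be altered in place, and it cannot be dropped while an external table still references it, so the objects defined earlier in the thread would need to be dropped and recreated.

-- Sketch, assuming the TextFile format and dbo.Account_CDM table defined
-- earlier in the thread already exist. Drop the dependent table first,
-- then the format, then recreate both with the string delimiter added.
DROP EXTERNAL TABLE dbo.Account_CDM;
DROP EXTERNAL FILE FORMAT TextFile;

CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);

CREATE EXTERNAL TABLE dbo.Account_CDM (
    <Column list and data type>
)
WITH (
    LOCATION = '/account',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFile
);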

