Azure Data Factory - Self Hosted IR - Prevent Binary Import

Question

Azure Data Factory - Self Hosted IR - Prevent Binary Import

Wolff Michael 1

Requirement:
Internal Policy Compliance Regulation requires us to prevent copying binary data from the cloud to OnPremise servers through Azure Data Factory.

Problem:
I figured out how to successfully set up Azure Policy to prevent Binary Datasets.
But besides Binary Datasets there are still other ways to move binary data through Azure Data Factory.
(e.g. through binary column mapping into a SQL Server table's binary column)

One idea would be to let the Self-Hosted-Integration-Runtime detect transportion of binary data into OnPrem and prevent it by some sort of rule.
Another idea would be to let Azure Data Factory generally detect the transportion of binary data and prevent it.

Question:
Is there any way to accomplish this with or without Azure Policy ?

KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator

2020-06-15T21:01:52.087+00:00
Hi @Wolff Michael ,

Welcome to Microsoft Q&A and thanks for your query.

To better assist on your query, could you please clarify on below.

Binary Dataset - do you mean by Binary format in Azure Data Factory?

And also, curious to know how you were able to successfully set up Azure Policy to prevent Binary Datasets. Could you please provide few additional details on how you achieved this?

----------

Thank you
Wolff Michael 1 Reputation point

2020-06-16T18:02:49.59+00:00

Hi KranthiPakala-MSFT,

yes correct with Binary Datasets I meant a Dataset in Binary Format as defined here: Binary format in Azure Data Factory

I was able to prevent those datasets with a Custom Azure Policy. The definition for the policyRule part of the Custom Policy see attached sample code.10186-msdnblogentry.txt
But I am still curious if it would be possible to prevent transportation of binary content through the Self-Hosted-Integration-Runtime into the local network.

Attached you will find the Custom policyRule part(most important part of the whole policy) to prevent datasets in binary format
KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator

2020-06-17T19:03:38.153+00:00

Hi @Wolff Michael ,

Thanks for your response and sharing additional details. I have reached to internal sources to get additional help on this requirement. Will keep you posted as soon as I have a response from the team.

2 answers

Your answer

KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator

2020-06-15T21:01:52.087+00:00

Hi @Wolff Michael ,

Welcome to Microsoft Q&A and thanks for your query.

To better assist on your query, could you please clarify on below.

Binary Dataset - do you mean by Binary format in Azure Data Factory?

And also, curious to know how you were able to successfully set up Azure Policy to prevent Binary Datasets. Could you please provide few additional details on how you achieved this?

----------

Thank you
Wolff Michael 1 Reputation point

2020-06-16T18:02:49.59+00:00

Hi KranthiPakala-MSFT,

yes correct with Binary Datasets I meant a Dataset in Binary Format as defined here: Binary format in Azure Data Factory

I was able to prevent those datasets with a Custom Azure Policy. The definition for the policyRule part of the Custom Policy see attached sample code.10186-msdnblogentry.txt
But I am still curious if it would be possible to prevent transportation of binary content through the Self-Hosted-Integration-Runtime into the local network.

Attached you will find the Custom policyRule part(most important part of the whole policy) to prevent datasets in binary format
KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator

2020-06-17T19:03:38.153+00:00

Hi @Wolff Michael ,

Thanks for your response and sharing additional details. I have reached to internal sources to get additional help on this requirement. Will keep you posted as soon as I have a response from the team.

Answer 1

MartinJaffer-MSFT 26,236

The difference between binary dataset and the other datasets ( delimited text, parquet, sql, rest, json ) , is that binary datasets do not attempt to parse the data. They just copy as-is, no mapping, no schema, no datatype. All other datasets try to parse the data so you can then map it to the sink dataset columns.

If you tried to push a compiled executable (binary) through the other dataset types, Data Factory would throw an error, because it can't parse the data into records.

Binary dataset is used to transport anything which cannot be parsed into records. (It also works on those which can).

I expect what you are really trying to stop malware from entering through Data Factory. Please correct me if I am mistaken.

The only other 'binary' I can think of besides the dataset, is the data type, such as used in SQL. This type of 'binary' is safe unless your database has the ability to execute data as code, or write it to disk for execution.
@WolffMichael-0000,

Wolff Michael 1 Reputation point

2020-06-25T06:25:44.983+00:00

Thank you for your investigation.

Yes, you are absolutely right, what we are really trying to do is to stop malware from entering through Data Factory.

At the same time we are trying to figure out what different paths malware could take through Data Factory to enter our internal Network through the Self-Hosted IR. As you described we already identified binary datasets which we are able to prevent through Azure Policy.

What other ways besides binary columns in SQL Server are imaginable ? Base64 encoded Code through text fields ? What else ?

And is there any practiable guideline to prevent malware entering our Network through Data Factory ?
MartinJaffer-MSFT 26,236 Reputation points

2020-06-29T18:39:14.92+00:00

Hypothetically, source code could be brought in as text, and then an operative in your company could either compile it or use an interpreter. There is no technical prevention for that, as it is done by a person.

The SQL datasets when used with either Lookup or Copy activity, have an option to type in a query. Anything can be entered there. If you parameterize such that the query is brought in from somewhere else, like a website or blob, you will need to secure those sources.

The key for you is to identify and control what does or does not have the power to execute code, or data as code.

Depending upon what on-prem resources you are using, it may be possible to put it on a sandbox machine and scan the files brought in. Would work for saving on local file system, but not for connected resources like an on prem sql

Take a look at security-baseline : malware defense
MartinJaffer-MSFT 26,236 Reputation points

2020-07-08T18:15:36.377+00:00

@Wolff Michael , I have not heard back from you. Have I answered your question? Please let us know if you found other solutions. This would be helpful for the community.

Answer 2

DvorakMichal 1

Hello @MartinJaffer-MSFT ,

I'm a colleague of @Wolff Michael and we've open the topic in our company again.
You've written: "The difference between binary dataset and the other datasets ( delimited text, parquet, sql, rest, json ) , is that binary datasets do not attempt to parse the data. They just copy as-is, no mapping, no schema, no datatype. All other datasets try to parse the data so you can then map it to the sink dataset columns."

Unfortunately I've found out that you can use for example two csv datasets and copy binary data from one datalake to another datalake. There is no parsing of the data. Therefore I think it will work the same way also with the SHIR - you can then download binary files using csv dataset into an internal file storage.

Do you have any idea how to solve this challenge?

Best regards,
Michal Dvorak

MartinJaffer-MSFT 26,236 Reputation points

2021-09-22T20:39:36.387+00:00

I don't really have any particularly good ideas that can be applied inside Data Factory. This is because of the nature of electronic data. @DvorakMichal

All data, all code, everything on a digital computer in its most basic state, is binary. The difference is how you interpret the sequence, and what you try to do with it.

For example 100110 . This can be read as integer value 38. It can also be read as the ampersand & using ASCII. This & could be used as text like Simon & Sons. It could also be used in a script or source code to be compiled later like if(X&Y) . The sequence could also be part of compiled code, or microcode.
This is why it is possible to open in a text editor, a file not intended as text. The result most likely will be nonsensical, but it can be done. Likewise, an application intended to consume files may also be able to 'run' a file intended as text. If we take powershell scripts as an example, the lines get blurry.

If you are concerned about things that can be executed without a 'consumer' application, you can protect yourself at the filesystem level. In Unix-like file systems, there are 3 permissions for every entry, file and folder alike:
Read, Write, Execute.
Read is obvious, the ability to look at the file or folder.
Write is whether you can overwrite a file or create new file / folders inside.
Execute is the permission which allows a binary to be run. To protect yourself, disable the execute permission on all data files. (Leave it on for folders).
DvorakMichal 1 Reputation point

2021-09-23T13:08:29.637+00:00

Thank you very much for your answer, @MartinJaffer-MSFT
MartinJaffer-MSFT 26,236 Reputation points

2021-09-28T18:19:56.347+00:00

Good to know this was helpful. I suspected there was a fundamental misunderstanding, but worried you might have been offended by the education. I had no way of knowing your level of technical knowledge.

@DvorakMichal

Share via

Azure Data Factory - Self Hosted IR - Prevent Binary Import

2 answers

Your answer