datafactory vs databricks


Akthar Hussain 21

Is there an overlap between #azuredatafactory and #azuredatabricks? For example, Azure Data Factory data flows, which (as I understand it) use Azure Databricks under the hood.

In such circumstances, which technology is more efficient and cost-effective to use? Are there any blogs on this? (Newbie question.)
Since Databricks is used under the hood for Data Factory data flows, is it better to use Databricks directly (instead of data flows) while orchestrating the data in Data Factory? Can you start up and run a Databricks cluster from Data Factory and then have the pipeline orchestration continue?

I can see the answer to one of my questions here:
https://social.msdn.microsoft.com/Forums/en-US/beff78b4-7700-46e1-bb1c-3e705e3847e3/running-databricks-notebook-from-azure-data-factory-via-interactive-cluster?forum=AzureDatabricks

Answer 1

Hello @Akthar Hussain ,

Welcome to Microsoft Q&A platform.

Both ADF's Mapping Data Flows and Databricks use Spark clusters to transform and process big data and analytics workloads in the cloud.

Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using existing Azure Data Factory scheduling, control flow, and monitoring capabilities.

Mapping data flows provide an entirely visual experience with no coding required. Your data flows run on ADF-managed execution clusters for scaled-out data processing. Azure Data Factory handles all the code translation, path optimization, and execution of your data flow jobs.

Azure Databricks is based on Apache Spark and provides in-memory compute with language support for Scala, R, Python and SQL. Data transformation and engineering can be done in notebooks, with statements in different languages in the same notebook. That makes it a flexible technology for including advanced analytics and machine learning in the data transformation process. You can also run each step of the process separately in a notebook, so step-by-step debugging is easy. The same view is available during job execution, so it is easy to see where a job stops.

Azure Databricks clusters can be configured in a variety of ways, in both the number and the type of compute nodes. Sizing the cluster correctly is something of an art, but you can get quite close, because a cluster can be set to scale automatically with the workload within thresholds you define. It can also be set to terminate automatically after it has been inactive for a certain time. When used with ADF, the cluster starts up when the activities start, and parameters can be passed in and out from ADF. Azure Databricks is closely integrated with other Azure services, including Active Directory, Key Vault, and data storage options such as Blob Storage, Data Lake Storage, and SQL.
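To make the autoscaling and auto-termination settings concrete, here is a minimal sketch of a cluster definition built in Python. The field names (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`) follow the Databricks Clusters API as I understand it; the helper `build_cluster_spec`, the cluster name, and the placeholder values are hypothetical and only for illustration, so check the current API reference before relying on them.

```python
import json


def build_cluster_spec(min_workers: int, max_workers: int, idle_minutes: int) -> dict:
    """Return a cluster definition with autoscaling and auto-termination.

    Hypothetical helper: it only assembles a dictionary and calls no API.
    """
    if not 1 <= min_workers <= max_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    if idle_minutes <= 0:
        raise ValueError("idle_minutes must be positive")
    return {
        "cluster_name": "etl-autoscale-example",  # hypothetical name
        "spark_version": "<runtime-version>",     # placeholder, fill in for your workspace
        "node_type_id": "<node-type>",            # placeholder, fill in for your workspace
        # Scale between these bounds depending on the workload.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Shut the cluster down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec(min_workers=2, max_workers=8, idle_minutes=30)
print(json.dumps(spec, indent=2))
```

The worker bounds are the "defined threshold" mentioned above: a wider range gives the autoscaler more room, while a short idle timeout keeps cost down when the cluster is only used by scheduled ADF runs.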

The biggest drawback of Databricks, in my mind, is that you must write code. Most BI developers are used to more graphical ETL tools such as SSIS or Informatica, and switching to writing code involves a learning curve. Many will say that poorly written code is very hard to maintain, but I've seen plenty of examples where graphical ETL isn't easy to follow either.

Hope this helps. Do let us know if you have any further queries.

Akthar Hussain 21 Reputation points

2020-09-28T08:30:29.417+00:00

Hi @PRADEEPCHEEKATLA ,
Thanks for the reply. The key points of my question were the overlap and the cost-effectiveness of the two technologies; I am sorry that was not entirely obvious. Where the overlap exists, as in the case of mapping data flows, is there a significant benefit in cost and performance/efficiency in doing the ETL in Azure Databricks directly? Or is this not even worth discussing?
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2020-09-30T11:28:37.297+00:00

Hi @Akthar Hussain ,

ADF data flows internally use Azure Databricks. Data flows handle orchestration, activity, and resource management, while Azure Databricks provides the compute.

Data Flows are visually-designed components inside of Data Factory that enable data transformations at scale. You pay for the Data Flow cluster execution and debugging time per vCore-hour. The minimum cluster size to run a Data Flow is 8 vCores. Execution and debugging charges are prorated by the minute and rounded up.
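As a rough illustration of how this billing rule plays out, the sketch below estimates the charge for one data flow run. It assumes that "prorated by the minute and rounded up" means the runtime is rounded up to the next whole minute, and the rate used in the example is a made-up number, not a real price; take the actual vCore-hour rate for your region and compute type from the pricing page. The function name `data_flow_cost` is hypothetical.

```python
import math

MIN_VCORES = 8  # minimum Data Flow cluster size stated above


def data_flow_cost(vcores: int, runtime_seconds: float, rate_per_vcore_hour: float) -> float:
    """Estimate the charge for a single data flow execution.

    Assumption: runtime is billed per minute, rounded up to the next whole minute.
    """
    if vcores < MIN_VCORES:
        raise ValueError(f"a data flow needs at least {MIN_VCORES} vCores")
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must not be negative")
    billed_minutes = math.ceil(runtime_seconds / 60)
    return vcores * (billed_minutes / 60) * rate_per_vcore_hour


# Hypothetical rate of 0.25 per vCore-hour; 125 seconds is billed as 3 minutes.
cost = data_flow_cost(vcores=8, runtime_seconds=125, rate_per_vcore_hour=0.25)
print(f"estimated cost: {cost:.4f}")
```

Because of the 8 vCore minimum, even a very short run is billed for at least 8 vCores for one minute, which matters when comparing many small data flows against a single longer Databricks job.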

You can use the Azure pricing calculator to get the actual cost. Performance always depends on the compute type you have selected.

This article highlights various ways to tune and optimize your data flows so that they meet your performance benchmarks.

Hope this helps.
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2020-10-06T08:44:17.42+00:00

Hi @Akthar Hussain ,

Just checking in to see if the above answer helped. If it answers your query, do click "Accept Answer" and upvote it. And if you have any further queries, do let us know.
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2020-10-07T13:05:07.913+00:00

Hi @Akthar Hussain ,

Following up to see if the above suggestion was helpful. If you have any further queries, do let us know.
Bryan Beasley 16 Reputation points

2021-01-22T15:10:47.233+00:00

I think the point about cost comparison between ADF data flows and Databricks was missed. I only see the cost for ADF. Some of us who have developed before and are not afraid of code are trying to determine whether it is worth the effort of tackling code in Databricks, but it is very difficult to see how the costs compare. And trying both would be an expensive exercise for most companies. I have used code-based tools in the past: the ones that helped you manage the code were worth their weight in gold, whereas those that did not, and relied on developers to manage everything, were a nightmare and led to failure over time. So if there is a case study that highlights the cost and time elements, that would be great, so that people can make more informed decisions.

Answer 2

EJCorcoran 1 Microsoft Employee

Data flows run on Spark clusters that are spun up at run-time. The configuration for the cluster used is defined in the integration runtime (IR) of the activity.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#ir
Data flows do not run on Databricks; they run on the compute configured as part of the integration runtime.
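For illustration, here is a sketch of what such an integration runtime definition looks like, assembled as a Python dictionary. The property names (`computeProperties`, `dataFlowProperties`, `computeType`, `coreCount`, `timeToLive`) are the ones I recall from the ADF integration runtime JSON, so please verify them against the linked documentation; the helper `build_data_flow_ir` and the runtime name are hypothetical, and the 8-core minimum is taken from the pricing note earlier in this thread.

```python
import json


def build_data_flow_ir(name: str, core_count: int = 8,
                       compute_type: str = "General", ttl_minutes: int = 10) -> dict:
    """Assemble an Azure integration runtime definition for data flows.

    Hypothetical helper: it only builds the JSON structure and deploys nothing.
    """
    if core_count < 8:
        raise ValueError("data flows need at least 8 cores")
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    # This block sizes the Spark cluster that runs the data flow.
                    "dataFlowProperties": {
                        "computeType": compute_type,
                        "coreCount": core_count,
                        "timeToLive": ttl_minutes,  # minutes to keep the cluster warm
                    },
                }
            },
        },
    }


ir = build_data_flow_ir("DataFlowIR-example", core_count=16)  # hypothetical name
print(json.dumps(ir, indent=2))
```

The point is that cluster sizing for a data flow lives in the IR definition, not in a Databricks workspace, so tuning cost and performance means changing the core count, compute type, and time-to-live here.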

Michal Pawlikowski 11 Reputation points

2021-07-06T13:15:15.907+00:00

Actually, to clear up the confusion, a few words of explanation.

Just look at the ADF Mapping Data Flows documentation on the Wayback Machine :)

https://web.archive.org/web/20190407050742/https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview

It is clear that at one time (before mid-2020) Mapping Data Flows ran on Databricks clusters, and that Microsoft later moved to an architecture based on Spark clusters (probably after Spark came to Synapse; before that, Spark was available only in HDInsight or SQL Server Big Data Clusters).

Answer 3

Karthik Muthukrishnan 1 Microsoft Employee

Is there a table that compares what you can and cannot do with Azure Databricks (ADB) and Azure Data Factory (ADF)? For instance, I know we can do continuous ingest (a pipeline) in ADF; can we do the same in ADB? If so, and if the final destination is analytics (as opposed to warehousing of data), I would go with ADB. If not, perhaps we need to use both: ADF for the ingest pipeline, transforming and loading into an analytics store (say ADLS), and ADB for loading from ADLS and then learning, analyzing, and visualizing.

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2022-09-05T06:45:53.827+00:00

Hello @Karthik Muthukrishnan ,

Since this thread is quite old, I would recommend creating a new thread on the same forum with as many details about your issue as possible. That will give your issue better visibility in the community.
