Data Factory vs. Databricks

Akthar Hussain 21 Reputation points
2020-09-27T09:25:20.267+00:00

Is there an overlap between #azuredatafactory and #azuredatabricks? For example, Azure Data Factory data flows, which (as I understand it) use Azure Databricks under the hood.

In such circumstances, which technology is more efficient and cost-effective to use? Any blogs? (Newbie question.)

Since Databricks is used under the hood for Data Factory data flows, is it better to use Databricks directly (instead of data flows) while orchestrating data in Data Factory? Can you start up and run a Databricks cluster from Data Factory and then have the pipeline continue orchestrating the subsequent processes?

I can see the answer to one of my questions here:
https://social.msdn.microsoft.com/Forums/en-US/beff78b4-7700-46e1-bb1c-3e705e3847e3/running-databricks-notebook-from-azure-data-factory-via-interactive-cluster?forum=AzureDatabricks


3 answers

  1. PRADEEPCHEEKATLA 90,446 Reputation points
    2020-09-28T07:57:12.557+00:00

    Hello @Akthar Hussain ,

    Welcome to Microsoft Q&A platform.

    Both ADF’s Mapping Data Flows and Azure Databricks use Spark clusters to transform and process big data and analytics workloads in the cloud.

    Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using existing Azure Data Factory scheduling, control flow, and monitoring capabilities.

    Mapping data flows provide an entirely visual experience with no coding required. Your data flows run on ADF-managed execution clusters for scaled-out data processing. Azure Data Factory handles all the code translation, path optimization, and execution of your data flow jobs.
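
    For illustration, a pipeline containing such a data flow activity can also be triggered programmatically. Here is a minimal sketch using the Python azure-mgmt-datafactory SDK; the subscription, resource group, factory, and pipeline names are placeholders, not anything from this thread:

    ```python
    # Minimal sketch: trigger an ADF pipeline run (e.g. one containing a data
    # flow activity) from Python. Requires the azure-identity and
    # azure-mgmt-datafactory packages; all names below are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    credential = DefaultAzureCredential()
    adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

    # Start the run; ADF provisions the Spark cluster, translates the visual
    # data flow into Spark jobs, and executes them.
    run = adf_client.pipelines.create_run(
        resource_group_name="my-resource-group",
        factory_name="my-data-factory",
        pipeline_name="my-dataflow-pipeline",
        parameters={},
    )
    print(f"Pipeline run started: {run.run_id}")

    # Check the run status later.
    status = adf_client.pipeline_runs.get(
        "my-resource-group", "my-data-factory", run.run_id
    ).status
    print(f"Run status: {status}")
    ```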

    Azure Databricks is based on Apache Spark and provides in-memory compute with language support for Scala, R, Python, and SQL. Data transformation/engineering can be done in notebooks with statements in different languages, which makes it a flexible technology for including advanced analytics and machine learning as part of the data transformation process. You can also run each step of the process in a notebook, so step-by-step debugging is easy, and you can watch the process during job execution, which makes it easy to see where a job stops.
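
    As an illustration, a typical transformation cell in a Databricks notebook might look like the following minimal PySpark sketch (the paths and column names are placeholders):

    ```python
    # Minimal sketch of a transformation step in a Databricks notebook (PySpark).
    # In a notebook, `spark` is already provided; paths and columns are placeholders.
    from pyspark.sql import functions as F

    # Read raw data from a mounted data lake path.
    orders = spark.read.parquet("/mnt/datalake/raw/orders")

    # Transform: filter completed orders, derive a date column, and aggregate.
    daily_revenue = (
        orders
        .filter(F.col("status") == "completed")
        .withColumn("order_date", F.to_date("order_timestamp"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )

    # Write the result as a Delta table for downstream analytics.
    daily_revenue.write.format("delta").mode("overwrite").save(
        "/mnt/datalake/curated/daily_revenue"
    )
    ```

    Each such cell can be run and inspected on its own, which is what makes the step-by-step debugging convenient.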

    Azure Databricks clusters can be configured in a variety of ways, both in the number and the type of compute nodes. Getting the cluster configuration right is something of an art form, but you can get quite close, since you can set the cluster to autoscale within a defined threshold based on the workload. It can also be set to terminate automatically after being inactive for a certain time. When used with ADF, the cluster starts up when activities start, and parameters can be passed in and out from ADF. Azure Databricks is closely connected to other Azure services, including Azure Active Directory, Key Vault, and data storage options such as Blob Storage, Data Lake Storage, and SQL.
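
    To make the parameter passing concrete, here is a hedged sketch of a notebook invoked from ADF's Databricks Notebook activity; the widget name and return value are illustrative, not from any real pipeline:

    ```python
    # Sketch of passing parameters between ADF and a Databricks notebook.
    # The Databricks Notebook activity in ADF supplies "baseParameters", which
    # the notebook reads as widgets; the names here are illustrative.
    dbutils.widgets.text("input_path", "/mnt/datalake/raw/orders")  # default value
    input_path = dbutils.widgets.get("input_path")

    row_count = spark.read.parquet(input_path).count()

    # dbutils.notebook.exit() returns a string to ADF, where it is available as
    # @activity('NotebookActivity').output.runOutput in later pipeline activities.
    dbutils.notebook.exit(str(row_count))
    ```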

    The biggest drawback of Databricks, in my mind, is that you must write code. Most BI developers are used to more graphical ETL tools like SSIS, Informatica, or similar, and there is a learning curve in switching to writing code instead. Many will say that poorly written code is very hard to maintain, but I’ve seen plenty of examples where graphical ETL isn’t easy to follow either.

    Hope this helps. Do let us know if you have any further queries.


    3 people found this answer helpful.

  2. EJCorcoran 1 Reputation point Microsoft Employee
    2021-06-01T18:34:10.473+00:00

    Data flows run on Spark clusters that are spun up at run time. The cluster configuration is defined in the integration runtime (IR) of the activity.
    https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#ir
    Data flows do not run on Databricks; they run on the compute configured as part of the integration runtime.


  3. Karthik Muthukrishnan 1 Reputation point Microsoft Employee
    2022-09-03T17:39:55.5+00:00

    Is there a table that compares what can and cannot be done with Azure Databricks (ADB) and Azure Data Factory (ADF)? For instance, I know we can do continuous ingestion (pipelines) in ADF; can we do the same in ADB? If so, and if the final destination is analytics (vs. warehousing of data), I would go with ADB. If not, perhaps both are needed: ADF for the ingestion pipeline, transforming and loading into an analytics store (say, ADLS); ADB for loading from ADLS and learning/analyzing/visualizing.

