Optimizing Cluster Usage in DataFlow Pipelines and Exploring Alternatives for Running Python Code in a Dedicated Environment

Glasier 440 Reputation points
2024-01-30T02:05:58.5933333+00:00

I'm currently working with several pipelines that employ DataFlow, and I've noticed that each DataFlow generates its own cluster upon pipeline execution. I'm wondering if it's feasible to set up a single, dedicated cluster that can be utilized by all my DataFlows, rather than having each DataFlow create its own individual cluster. In case this approach isn't viable, could you suggest an alternative solution that allows running multiple Python scripts on a dedicated cluster? I'm looking for options beyond DataBricks and Azure Batch, as we don't have the authorization to use these services in our project.

Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.


Answer accepted by question author

Anonymous
2024-01-30T12:10:47.88+00:00

@Glasier

Welcome to the Microsoft Q&A platform, and thanks for posting your question.

You can use pathos, a Python package that handles multiprocessing not only across the cores of a single computer but also across a cluster of multiple machines. It can establish connections to remote servers (for example through an SSH tunnel) and then run a parallel map over them. Here’s an example of how you can use pathos to run a function in parallel across multiple machines:

https://stackoverflow.com/questions/26876898/python-multiprocessing-with-distributed-cluster

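from pathos.core import connect
from pathos.pp import ParallelPythonPool as Pool

# Establish an SSH tunnel to the remote machine
# (assumes a pp server is listening on port 1234 of that machine)
tunnel = connect('remote.computer.com', port=1234)

# Define some function to run in parallel
def sleepy_squared(x):
    from time import sleep  # imported inside the function so the worker has it
    sleep(1.0)
    return x**2

# Input values to map over
x = list(range(10))

# Build a pool of servers and execute the parallel map.
# 55774 stands for the local end of the tunnel; use the port your tunnel opened.
p = Pool(8, servers=('localhost:55774',))
y = p.map(sleepy_squared, x)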
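
Since the question is about running several Python scripts rather than a single function, the same map pattern can also drive whole scripts. Below is a minimal sketch, assuming the scripts can simply be launched as separate processes on one machine. It uses only the standard library (not pathos), and the helper names `run_script` and `run_scripts` are made up for illustration:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_script(path):
    """Run one Python script in its own interpreter.

    Returns a tuple of (path, exit code, captured stdout).
    """
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
    )
    return path, result.returncode, result.stdout


def run_scripts(paths, max_workers=4):
    """Run several scripts concurrently; results come back in input order."""
    # Threads are enough here: each one only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_script, paths))
```

A call such as `run_scripts(["job_a.py", "job_b.py"])` (hypothetical file names) would start both scripts side by side and return their exit codes and output. This only covers a single machine; spreading the scripts over several machines still needs something like the pathos pool above.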

Alternatively, you can use Ansible, a configuration management tool. You start from a controller node, create an inventory (a list of hosts/machines), and share the controller’s public key with all the nodes. With Ansible, you can then deploy your Python scripts to multiple machines and run them in parallel.

https://stackoverflow.com/questions/25792357/how-to-distribute-a-python-script-across-multiple-nodes

Hope this helps. Do let us know if you have any further queries.
