The short answer: it depends on your requirements.
In my opinion, these are the main points to consider:
Characteristics of the workloads
- Data volume: larger datasets require more compute resources to process efficiently.
- Complexity: complex analytical jobs or machine learning models may need more computing power, which influences the number of executors and nodes.
- Concurrency: the number of concurrent users or jobs determines how much capacity the cluster must provide at the same time.
Resource Allocation
- You'll need to decide on the number of executors per node, the amount of memory (RAM) for each executor, and the number of cores per executor. (A common practice is one executor per worker node to maximize resource utilization; see the sketch after this list.)
- The size (CPU, memory) of your worker nodes affects the number and size of executors you can run. Larger nodes can support more or larger executors, but they may also cost more.
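To make the knobs concrete, here is a minimal sketch of how executor sizing maps to Spark configuration. The memory and core values are hypothetical examples; on Databricks most of these are derived from the worker node type you choose, so treat this as an illustration of the settings rather than values to copy.

```python
from pyspark.sql import SparkSession

# Hypothetical executor sizing for illustration only.
spark = (
    SparkSession.builder
    .appName("sizing-example")
    # Memory (RAM) allocated to each executor process.
    .config("spark.executor.memory", "28g")
    # CPU cores per executor; with one executor per node this is roughly
    # the node's core count minus what the OS and daemons need.
    .config("spark.executor.cores", "8")
    .getOrCreate()
)
# Rough rule of thumb: concurrent tasks = number of executors * cores per executor.
```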
Performance / Cost
You need to determine whether you'll use autoscaling (which adjusts the number of nodes based on the workload) or manual scaling. Autoscaling can optimize costs but may require fine-tuning to prevent over-provisioning.
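As a rough illustration, an autoscaling cluster is defined by a minimum and maximum worker count. The sketch below shows such a definition as the kind of JSON body you could send to the Databricks Clusters API; the cluster name, node type, runtime version, and worker counts are all hypothetical examples.

```python
# Hypothetical cluster definition with autoscaling enabled.
# You would POST a body like this to the Databricks Clusters API
# (e.g. /api/2.0/clusters/create) or set the same fields in the UI.
cluster_spec = {
    "cluster_name": "etl-autoscaling-example",   # hypothetical name
    "spark_version": "13.3.x-scala2.12",         # pick a supported runtime
    "node_type_id": "Standard_DS4_v2",           # hypothetical Azure VM size
    "autoscale": {
        "min_workers": 2,   # floor: keeps capacity available for steady load
        "max_workers": 8,   # ceiling: caps cost during peak load
    },
}
```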
You also need to weigh the cost of the nodes and find the right balance between performance and expense. Sometimes fewer, more powerful nodes can be more cost-efficient than many smaller ones.
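A quick back-of-envelope comparison can help here. The prices below are purely hypothetical (real VM and DBU rates vary by region, tier, and instance family); the point is only to show how node size and node count trade off against each other.

```python
# Hypothetical hourly prices, for illustration only.
small_node = {"cores": 4, "price_per_hour": 0.40}
large_node = {"cores": 16, "price_per_hour": 1.50}

cores_needed = 32
for name, node in [("many small nodes", small_node), ("few large nodes", large_node)]:
    count = -(-cores_needed // node["cores"])        # ceiling division
    cost = count * node["price_per_hour"]
    print(f"{name}: {count} nodes, ${cost:.2f}/hour for {cores_needed} cores")
```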
For more guidance, see the Azure Databricks best practices index: https://learn.microsoft.com/en-us/azure/databricks/best-practices-index