Azure Table Storage writing extremely slow

jno 0 Reputation points
2023-11-08T16:31:07.17+00:00

I am trying to write a relatively light pandas DataFrame (pandas_df) with 100k rows to Azure Table Storage.

The snippet below takes more than 1 hour without multiprocessing and about 4 minutes with multiprocessing and an optimal n_multiproc.

4 minutes is still too much for my application, as I will need to scale this up and it is a recurring process in Databricks, costing expensive cluster compute time. As a benchmark, saving the same pandas_df as CSV or Parquet, or writing it to PostgreSQL, takes seconds, but I can only use Azure Table Storage in my infrastructure.

Is the way I am trying to store the data incorrect for my use case? Why is it so slow?


from azure.data.tables import TableClient, UpdateMode
from multiprocessing.pool import ThreadPool

tablename = 'mytable'
n_multiproc = 8
api_data_df_dic = pandas_df.to_dict(orient='records')

table_client = TableClient.from_connection_string(
    conn_str=AZURE_STORAGE_CONN_STR, table_name=tablename)


def insert_batch(rows):
    # One REST round trip per entity.
    for row in rows:
        table_client.upsert_entity(mode=UpdateMode.REPLACE, entity=row)


data = [
    {
        'PartitionKey': str(row["PART_KEY"]),
        'RowKey': str(row["ROW_KEY"]),
        "info1": row["info1"],
        "info2": row["info2"],
        "info3": row["info3"],
    } for row in api_data_df_dic]

batch_size = len(data) // n_multiproc
batch_chunks = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

pool = ThreadPool(n_multiproc)
pool.map(insert_batch, batch_chunks)
pool.close()
pool.join()

A few alternatives I have already tried without success:

  • Instead of upsert_entity for every row, accumulating 100 or so rows within insert_batch and storing the batch with submit_transaction (see the sketch after this list).
  • Checking whether the api_data_df_dic-to-data conversion was a bottleneck: it takes seconds, and all the latency comes from insert_batch.
  • Assessing the impact of the info1, info2, info3 object types: even storing just PartitionKey and RowKey takes about the same time.
  • Trying different n_multiproc splits.
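
For reference, a minimal sketch of that batch approach, reusing table_client, UpdateMode and data from the snippet above (submit_in_batches is just an illustrative name). Entities in a single transaction must share a PartitionKey, so rows are grouped before chunking to the 100-entity limit:

from collections import defaultdict

MAX_BATCH = 100  # Table Storage transaction limit

def submit_in_batches(rows):
    # Group by PartitionKey: a transaction cannot span partitions.
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[row["PartitionKey"]].append(row)

    # Chunk each partition's rows into transactions of at most 100 operations.
    for entities in by_partition.values():
        for i in range(0, len(entities), MAX_BATCH):
            operations = [
                ("upsert", entity, {"mode": UpdateMode.REPLACE})
                for entity in entities[i:i + MAX_BATCH]
            ]
            table_client.submit_transaction(operations)

submit_in_batches(data)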

If someone can point me to how I can make this run faster, it would be hugely helpful!


1 answer

  1. Amira Bedhiafi 26,656 Reputation points
    2023-11-12T17:36:20.99+00:00

    Azure Table Storage is typically not designed for bulk operations; it is a good choice for fast access to small amounts of data. So its performance profile differs, case by case, from file-based storage (like CSV or Parquet) and from relational databases (like PostgreSQL).

    You mentioned trying to accumulate 100 rows and then performing a batch operation. This is generally a good approach, as batch operations in Azure Table Storage can be more efficient. However, the maximum number of entities in a single transaction is 100, the total payload of a batch operation must not exceed 4 MB, and all entities in one transaction must share the same PartitionKey. So you may need to check what you have done so far against those constraints.
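
    Note that a transaction violating any of those constraints fails as a whole, and that failure is easy to miss when it happens inside worker threads. A minimal sketch of surfacing it, assuming an operations list built as in the question's sketch:

    from azure.data.tables import TableTransactionError

    try:
        table_client.submit_transaction(operations)
    except TableTransactionError:
        # The whole transaction is rejected if any single operation is
        # invalid, e.g. a mixed-PartitionKey batch or an oversized payload.
        raise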

    Another detail you mentioned is the different levels of parallelism (n_multiproc) you tried. Don't forget that overhead can sometimes kill parallel processing, especially if the number of processes exceeds the number of available CPU cores.
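
    Since this workload is network-bound rather than CPU-bound, the threads mostly wait on HTTP round trips, so the usual win is combining the thread pool with batch transactions rather than raising the thread count alone. A rough sketch under that assumption, reusing table_client and data from the question (submit_partition is an illustrative name, and the pool size of 16 is only a starting point):

    from collections import defaultdict
    from multiprocessing.pool import ThreadPool
    from azure.data.tables import UpdateMode

    def submit_partition(entities):
        # One partition per task; chunk to the 100-entity transaction limit.
        for i in range(0, len(entities), 100):
            ops = [("upsert", e, {"mode": UpdateMode.REPLACE})
                   for e in entities[i:i + 100]]
            table_client.submit_transaction(ops)

    by_partition = defaultdict(list)
    for row in data:
        by_partition[row["PartitionKey"]].append(row)

    # I/O-bound threads can usefully outnumber CPU cores; tune the size empirically.
    with ThreadPool(16) as pool:
        pool.map(submit_partition, list(by_partition.values()))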

    This is my analysis; you may need to consider other Azure services as alternatives.

