How to get excel sheet names dyanamically in azure data factory.

Rohit Sawane 46 Reputation points
2022-08-30T07:11:57.887+00:00

Currently I have an Excel file that has multiple worksheets (over 11). This Excel file currently lives in on premise. I am trying to use Azure Data FactoryV2 to get the multiple tab names dynamically.
I am not assuming that I will know the sheet names or how many sheets there will be so we cannot create configuration table in sqlserver.
We have tried with azure function using python but in future we are eliminating python, is there any other solution besides this.Thank you.

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
9,601 questions
{count} vote

Accepted answer
  1. MartinJaffer-MSFT 26,031 Reputation points
    2022-08-30T23:14:07.243+00:00

    Hello and welcome to Microsoft Q&A @Rohit Sawane

    The ask as you have stated is how to dynamically get the sheet names from an xlsx in Data Factory. I suspect the real goal is just to get the data from all the sheets, when you don't know the quantity or names of sheets beforehand.
    I can help you with the latter, more than the former.

    As far as I am aware, there isn't an out-of-box feature for getting the sheet names at this time in Data Factory. I tried with Get Metadata, that did not help. But then I wondered, how does the multiple-choice selector in the Dataset get the sheet names? So I dug into the browser to investigate. There is a microservice endpoint called enumerateItems. However it is not intended for end-user use, and looks more complicated than the alternative solution I discuss below.

    If I can assume, what you really care about is the ability to iterate through all your sheets, and not the sheet name specifically, there is a simpler solution.
    Iterating by sheet ordinal is simpler than by sheet name. I noticed when I did Lookup or Get Metadata on sheet number 4 of a book with only 3 sheets, the error message told me how many sheets the book had. This we can use.
    236312-image.png
    By deliberately failing an activity we can find out how many sheets and work that.
    236352-image.png

    So, first step. I by the sound of things, you have already parameterized the Excel dataset for dynamic file name and dynamic page name. Let's swap that over to dynamic page number.
    236293-image.png

    Now lets set up that activity to fail (and doesn't that sound wrong!). I think Lookup activity ran faster, but Get Metadata has simpler error message, so I am using Get Metadata here. In the field list, choose "structure". We choose structure because that actually needs to look at the sheet, as opposed to Exists which only check for the presence of the file. Let us pass a suitably large sheet number, say 99. Also there is the file name.
    Get Metadata
    Now, we want to get the error, and stop if there is no error (if someone has 100 sheet excel book!). We connect the Get Metadata to Set Variable by red on-fail dependency. To make things easier, I split the task of error-to-list-of-sheet-numbers into parts. This first part extracts the portion of the error message we care about, the 0..2) from the rest. I do this by setting string type variable I am calling "message". The expression is:

    @split(activity('Get pages by failing').error.message,'(')[2]  
    

    After this we have another Set Variable activity, this one to an array type variable I am naming "sheets". In this, several things will happen. I will extract the second number from the message, and then use it in a range function to make the array [0,1,2]. There is a lot to unpack in this expression

    @range(0,add(1,  
    int(substring(variables('message'),  
    3,  
    sub(length(variables('message')),4)))))  
    

    range(0,2) yields [0,1] . This is because the second value is number of elements, so we actually want to add 1 to the number.
    The number we want starts on the third character 0..N, so that is where 3 comes from.
    The number of sheets could be a single digit , or double digit, or theoretically, more. So to determine how many characters I want to grab, depends upon the length of string. There are 4 characters we do not care about. 0,..,) So we subtract 4 from the length of the string.

    Finally, to the ForEach, we feed the array variable "sheets" we made in the last step.

    ... actually on further thought, the range always starts with 0, so this could have been easier if I split on .. rather than (. Less substring arithmetic that way.

    Hope this will help. Please let us know if any further queries.

    ------------------------------

    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators
    2 people found this answer helpful.
    0 comments No comments

1 additional answer

Sort by: Most helpful
  1. Nandan Hegde 29,891 Reputation points MVP
    2022-08-31T16:00:43.16+00:00

    Hey,
    Thank you @MartinJaffer-MSFT for the explanation :)
    We do follow a similar approach in our current project wherein rather than leveraging getmetadata activity we leverage until activity and iterate over till the point of failure.
    This in turn helps to avoid an assumption w.r.t actual sheet number exceeding than our hardcoded value.

    The below blog explains the same in detail :
    https://datasharkx.wordpress.com/2022/08/31/iterating-across-all-excel-sheets-dynamically-and-converting-into-csv-files-via-azure-data-factory-synapse-pipeline/

    236606-image.png

    @Rohit Sawane hope either of the above 2 approaches help but as stated, we cannot get the sheet names but we can iterate through all the sheets.

    2 people found this answer helpful.