Hello @Nishanth S
The JSON you’re using is for the PdftextextractionSkill, which is a skill for extracting text from PDFs.
However, you might want to use the DocumentExtractionSkill instead, which is a more general-purpose skill that can extract content from a variety of file formats, including PDF.
Here’s an example of what the JSON for a DocumentExtractionSkill might look like:
{
  "name": "document-extraction-skillset",
  "description": "Skillset for extracting text from documents",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
      "context": "/document",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data"
        }
      ],
      "outputs": [
        {
          "name": "content",
          "targetName": "content"
        }
      ]
    }
  ]
}
In this JSON, the DocumentExtractionSkill is used to extract the content from the document. The file_data input is set to /document/file_data, which is an object representing the original file data downloaded from your blob data source.
Also, make sure to set the allowSkillsetToReadFileData parameter to true in your indexer definition (under parameters.configuration). This is what creates the /document/file_data path; without it, the skill has no file data to read.
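For reference, here is a minimal sketch of where that setting could sit in the indexer definition. The indexer, data source, and index names (my-indexer, my-blob-datasource, my-index) are placeholders I made up for illustration, so swap in your own; only the skillset name comes from the example above.

```json
{
  "name": "my-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-index",
  "skillsetName": "document-extraction-skillset",
  "parameters": {
    "configuration": {
      "allowSkillsetToReadFileData": true
    }
  }
}
```

You will most likely also need field mappings to get the extracted content into your index, but those depend on your index schema, so I have left them out here.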
Hope that helps.
-Grace