Indexadillo - Index your documents using Durable Functions and AI Search for RAG applications.
Indexadillo helps you push data to Azure AI Search in a scalable, observable way. Instead of using a pull-based DSL approach (which can be tricky to debug and customize), this solution uses Azure Durable Functions to handle everything from parsing your documents to uploading embeddings—without restarting at every little hiccup.
What’s Inside?
- Orchestrators and Activities: Each document gets its own sub-orchestrator, so failures don't bring everything down.
- Blob Storage Input: Drop your PDFs into blob storage, and we’ll automatically pick them up.
- Document Intelligence: Extract text from documents before sending them on.
- “Chonkie” for Chunking: Break down big files into smaller pieces for easier processing.
- OpenAI `text-embedding-3-large` Embeddings: Transform your text into vector embeddings for AI search.
- Azure AI Search Upload: All neatly sent to your search index.
- Scalability: Process documents in parallel without losing track, thanks to continuation tokens and Durable Functions’ built-in retries.
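Chunking is handled by Chonkie, but the underlying idea is simple: slide a fixed-size window over the text with some overlap so no passage loses its surrounding context. A minimal sketch of that idea (this is not Chonkie's actual API; `chunk_size` and `overlap` are illustrative character counts):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

In practice you would chunk on token counts and sentence boundaries rather than raw characters, which is exactly the kind of strategy Chonkie provides out of the box.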
Getting Started
Quick Start & Prerequisites
Environment Setup
Deploy Your Infrastructure
- Create a new environment:

  ```bash
  azd env new indexadillo-dev
  ```

- Authenticate with Azure:

  ```bash
  azd auth login
  az login
  ```

- Provision your resources:

  ```bash
  azd up
  ```

  This command sets up the necessary infrastructure (storage, function app, AI Search, etc.). Follow the prompts to select a subscription and region (Sweden Central is recommended).
Detailed Setup (Optional)
- Roles & Permissions: Run the roles script to assign necessary permissions:

  ```bash
  ./scripts/roles.sh
  ```
- Document Upload: Place your PDFs into the `source` container within your storage account.
- Monitoring:
  - Check the processing in Application Insights.
  - Or view the function app's invocations by selecting the `index_event_grid` function and switching to the "Invocations" tab.
- Accessing Search: Use the AI Search portal or the provided Bruno collection in the `/http` folder. (Don't forget to update the `host` variable in the collection settings with your function app name.)
- Reindexing: Trigger a full reindex via the `/index` endpoint. This creates a new index for the blobs (the name defaults to `other-index` and can be changed in the parameters). You can adjust the prefixes to index specific folders, and the endpoint returns an ID to track progress via `/status/:id`.
Local Debugging
Setup
local.settings.json

Place the following `local.settings.json` file in the `src` folder:

```json
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "SOURCE_STORAGE_ACCOUNT_NAME": "<source_storage_account_name>",
    "DI_ENDPOINT": "<di_endpoint>",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AzureWebJobsFeatureFlags": "EnableWorkerIndexing",
    "AZURE_OPENAI_ENDPOINT": "<azure_openai_endpoint>",
    "SEARCH_SERVICE_ENDPOINT": "<search_service_endpoint>"
  }
}
```
Configure Endpoints
Fill out `DI_ENDPOINT`, `AZURE_OPENAI_ENDPOINT`, `SOURCE_STORAGE_ACCOUNT_NAME`, and `SEARCH_SERVICE_ENDPOINT` with the correct endpoints from the `.env` file located at `.azure/indexadillo-dev/.env`.

Start Azurite Service

Launch the Azurite service by using the Azurite: Start command from the VS Code command palette.

Set Up a Virtual Environment (if not using Dev Containers)

If you're not using a VS Code dev container, you need to manually set up a Python virtual environment:

```sh
python3 -m venv src/.venv
```
Then, activate it:
- Linux/macOS:

  ```sh
  source src/.venv/bin/activate
  ```

- Windows (PowerShell):

  ```powershell
  src\.venv\Scripts\Activate
  ```
Also, make sure you have the Azure Functions Core Tools installed.
Run the Debugger

In the VS Code debug section, select and run the Attach to Python Function configuration.
Resource Architecture
Below is a diagram of the Azure resources that get deployed when you run `azd up`. Everything is organized under a single resource group:

- Storage Account (with `source` container): Where you drop PDFs to be processed.
- Event Grid: Sends notifications to the function app whenever a new file is uploaded.
- Function App: Hosts the Durable Functions that orchestrate parsing, embedding, and indexing, and serves the HTTP endpoints.
- App Service Plan (Flex Consumption Plan): Underlying hosting model for the Function App.
- Document Intelligence: Extracts text from PDFs or other document types (if implemented).
- Azure OpenAI Embeddings: Converts text to vector embeddings for advanced AI search.
- AI Search: Stores and queries your indexed documents, with a default index out of the box.
- Application Insights: Collects telemetry, logs, and performance metrics.
Feel free to customize or rename these resources to suit your workflow. For more details on each service:
- Azure AI Document Intelligence
- Azure OpenAI Service
- Azure AI Search
- Application Insights
Contributing
Feel free to open issues or submit pull requests if you’d like to help out. All improvements are welcome—whether that’s more chunking strategies, extra integration tests, or better docs.