Hi @Amaan Syed,
Your Streamlit app retrieving SAP documents but returning only partial content and ABAP code is most likely caused by two things: limitations in the document parsing step (Azure AI Document Intelligence, if that is the tool you are using) and the way LangChain splits text. ABAP code blocks, especially those embedded in tables or carrying special formatting, can be misinterpreted or truncated during parsing. LangChain's default text splitters can then break code across chunks, which produces incomplete or fragmented output. If the extracted content is not grouped into logical units before it is stored in Pinecone, retrieval accuracy also suffers.
In LangChain, apply custom chunking logic that avoids splitting code blocks, and store full logical sections (text + code) in Pinecone with appropriate metadata. This helps ensure accurate extraction and retrieval of complete content and ABAP code as they appear in the original documents.
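As a rough illustration of the chunking idea, here is a minimal plain-Python sketch. It assumes your extracted text marks ABAP code with triple-backtick fences (for example, if you request Markdown output from the parser); if your output marks code differently, the pattern would need to change. The names `split_blocks`, `chunk_preserving_code`, and the `has_code` metadata key are hypothetical and not part of LangChain or Pinecone.

```python
import re

# Triple backtick, built this way so it can appear inside this example.
FENCE = "`" * 3
# Assumption: code in the extracted text is wrapped in triple-backtick fences.
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def _paragraphs(text):
    """Split prose on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_blocks(text):
    """Return (kind, content) blocks; each fenced code block stays in one piece."""
    blocks, pos = [], 0
    for match in CODE_BLOCK.finditer(text):
        blocks.extend(("text", p) for p in _paragraphs(text[pos:match.start()]))
        blocks.append(("code", match.group(0)))
        pos = match.end()
    blocks.extend(("text", p) for p in _paragraphs(text[pos:]))
    return blocks


def chunk_preserving_code(text, max_chars=1000):
    """Greedily pack blocks into chunks without ever cutting a code block.

    A code block longer than max_chars becomes its own oversized chunk
    instead of being split.
    """
    chunks, current, has_code = [], [], False

    def flush():
        chunks.append({"text": "\n\n".join(current),
                       "metadata": {"has_code": has_code}})

    for kind, content in split_blocks(text):
        # Start a new chunk if adding this block would exceed the limit.
        if current and len("\n\n".join(current + [content])) > max_chars:
            flush()
            current, has_code = [], False
        current.append(content)
        has_code = has_code or kind == "code"
    if current:
        flush()
    return chunks
```

Each returned dictionary holds the chunk text plus a small metadata record, which you could map onto whatever document and metadata structure you use when writing to Pinecone. Choosing a larger `max_chars` keeps explanatory text and the code that follows it in the same chunk more often.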
I hope this helps. If you have any further questions, do let us know.
If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful?".