Within Azure AI Studio, in the Indexes section, I'm creating a new index. The payload is a set of web pages. The "Add vector search to this search resource" option is enabled.
After a few minutes I receive an error.
It seems that something broke between the older 0.0.38 version of llm_rag_crack_and_chunk_and_embed and the new 0.0.42 version.
I tried to create a new index from the exact same payload with the new 0.0.42 version and it fails. When I then clone the older 0.0.38 job and run it again, it works. These are the errors from the log:
[2024-09-16 17:11:21] INFO azureml.rag.crack_and_chunk - Processing file: www.surreyilc.org.uk.html (crack_and_chunk.py:127)
[2024-09-16 17:11:22] ERROR azureml.rag.crack_and_chunk_and_embed.create_embeddings - ActivityCompleted: Activity=create_embeddings, HowEnded=Failure, Duration=252844.63 [ms], Exception=AttributeError (activity.py:127)
[2024-09-16 17:11:22] ERROR azureml.rag.crack_and_chunk_and_embed.crack_and_chunk_and_embed - ServiceError: intepreted error = Rag system error, original error = 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing' (exceptions.py:124)
[2024-09-16 17:11:27] ERROR azureml.rag.crack_and_chunk_and_embed.crack_and_chunk_and_embed - crack_and_chunk failed with exception: Traceback (most recent call last):
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 506, in main_wrapper
map_exceptions(main, activity_logger, args, logger, activity_logger)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 126, in map_exceptions
raise e
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 118, in map_exceptions
return func(*func_args, **kwargs)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 475, in main
embeddings_container = crack_and_chunk_and_embed(
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 344, in crack_and_chunk_and_embed
num_embedded = create_embeddings(
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/embed.py", line 312, in create_embeddings
for chunk in chunks:
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 218, in documents_to_embed
for chunked_doc in chunked_docs:
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/chunking.py", line 169, in split_documents
for i, document in enumerate(documents):
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 376, in crack_documents
raise e
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 365, in crack_documents
yield loader.load_chunked_document()
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 71, in load_chunked_document
pages = self.load()
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 132, in load
docs = super().load()
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/langchain/vendor/document_loaders/unstructured.py", line 79, in load
elements = self._get_elements()
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 148, in _get_elements
return partition_html(file=self.file, **self.unstructured_kwargs)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/documents/elements.py", line 605, in wrapper
elements = func(*args, **kwargs)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/file_utils/filetype.py", line 706, in wrapper
elements = func(*args, **kwargs)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/file_utils/filetype.py", line 662, in wrapper
elements = func(*args, **kwargs)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/partition.py", line 103, in partition_html
elements = list(
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/lang.py", line 475, in apply_lang_metadata
elements = list(elements)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/partition.py", line 222, in iter_elements
yield from cls(opts)._iter_elements()
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/partition.py", line 229, in _iter_elements
for e in self._main.iter_elements():
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/parser.py", line 361, in iter_elements
yield from self._element_from_text_or_tail(block_item.tail or "", q)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/parser.py", line 377, in _element_from_text_or_tail
for node in self._iter_text_segments(text, q):
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/parser.py", line 421, in _iter_text_segments
while q and q[0].is_phrasing:
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
(crack_and_chunk_and_embed.py:508)
[2024-09-16 17:11:27] ERROR azureml.rag.crack_and_chunk_and_embed.crack_and_chunk_and_embed - ActivityCompleted: Activity=crack_and_chunk_and_embed, HowEnded=Failure, Duration=259423.54 [ms], Exception=AttributeError (activity.py:127)
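One possible reading of the traceback above: the HTML parser met a processing-instruction node (something like `<?xml ... ?>` or `<?php ... ?>`) in the page, and that node type lacks the `is_phrasing` attribute the partitioner expects. Whether www.surreyilc.org.uk.html really contains such markup has not been checked here. If it does, a possible workaround is to strip those spans from the HTML files before indexing. The sketch below uses only the standard library; the helper name `strip_processing_instructions` is hypothetical and is not part of azureml.rag or unstructured.

```python
import re

# Assumption: a processing instruction in HTML starts with "<?" and runs to the
# next ">". This is a rough approximation; embedded code that itself contains
# ">" would need a real parser instead of a regex.
_PI_PATTERN = re.compile(r"<\?[^>]*>")


def strip_processing_instructions(html: str) -> str:
    """Return html with every '<? ... >' span removed (hypothetical helper)."""
    return _PI_PATTERN.sub("", html)


# Made-up sample input, not taken from the failing page.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<html><body><p>Hello</p><?php echo 1; ?>tail</body></html>"
)
cleaned = strip_processing_instructions(sample)
print(cleaned)
```

Running this over each downloaded page before uploading the payload would show whether the 0.0.42 job then completes, which would help confirm or rule out processing instructions as the trigger.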
Traceback (most recent call last):
File "/azureml-envs/rag-embeddings/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/azureml-envs/rag-embeddings/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 559, in