Replies: 5 comments
-
My project has a similar need to process the DoclingDocument. I solved it like this:
-
I am new to this great library and face a similar issue. I want to get rid of some text by labeling it as a page header in the DoclingDocument object. The labeling works, but the headers still show up in the Markdown export. I modified the document like this:

    for item in conv_result.document.iterate_items():
        if isinstance(item[0], DocItem):
            docitem: DocItem = item[0]
            for prov_item in docitem.prov:
                if prov_item.bbox.b > 783:  # everything above 783 is header
                    docitem.label = DocItemLabel.PAGE_HEADER

@deruli79 Did you also remove the text from all parents? Like
-
@Greenheart Thanks for the reference, I will look into your approach.

@jherrmann If I also delete the item from conv_result.document.texts (which in my opinion should be done) and then try to iterate, I get a:

In general, I understand that there is no documented way (yet) to remove DocItems from a document. I am considering building a DoclingDocument from scratch, using the add_ methods, based on conv_result.document.
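One plausible reason iteration breaks after deleting from the flat `texts` list: items seem to be addressed by position (a `self_ref` such as `#/texts/3`), and parents store those references in `children`. The sketch below is a toy model in plain Python, not docling code. `ToyDoc` and its methods are hypothetical stand-ins that only mimic index-based references, to show how a deletion leaves shifted or dangling references behind.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document with index-based references."""
    texts: list[str] = field(default_factory=list)
    # The body stores references such as "#/texts/0", not the items themselves.
    body_children: list[str] = field(default_factory=list)

    def add_text(self, text: str) -> str:
        ref = f"#/texts/{len(self.texts)}"
        self.texts.append(text)
        self.body_children.append(ref)
        return ref

    def resolve(self, ref: str) -> str:
        # Look the item up by its position in the flat list.
        index = int(ref.rsplit("/", 1)[1])
        return self.texts[index]

    def iterate_items(self):
        for ref in self.body_children:
            yield self.resolve(ref)


doc = ToyDoc()
for text in ["Header", "First paragraph", "Second paragraph"]:
    doc.add_text(text)

# Deleting from the flat list shifts every later index by one.
del doc.texts[0]

# "#/texts/0" used to be the header; it now resolves to a different item.
shifted = doc.resolve("#/texts/0")

try:
    list(doc.iterate_items())
    iteration_failed = False
except IndexError:
    # The reference "#/texts/2" no longer points at anything.
    iteration_failed = True
```

In this toy model, the surviving references either point at the wrong item or at nothing, which would be consistent with the failure described above.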
-
Alright, for the moment I settled on the following approach, inspired by @jherrmann:

    from docling_core.types.doc import DocItem, DoclingDocument, NodeItem


    def remove_items_from_document(items: list[DocItem], document: DoclingDocument):
        """
        Remove items from a document by detaching them from their parents.
        """
        # Iterate over the items that should be removed
        for item_remove in items:
            # Items without a parent reference cannot be detached
            if item_remove.parent is None:
                continue
            # Resolve the parent of this item
            parent_remove = item_remove.parent.resolve(document)
            # Remove the item's reference from the parent's children
            if isinstance(parent_remove, NodeItem):
                print(f'Removing Item from Document with cref: {item_remove.self_ref}')
                parent_remove.children.remove(item_remove.get_ref())

Works well so far.
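When using a function like this, it is probably safest to collect the items first and detach them in a second pass, since removing entries from a Python list while looping over it can skip elements. The thread does not say whether `iterate_items()` tolerates changes during iteration, so the sketch below takes the cautious route. It uses a toy tree in plain Python; `ToyNode` and `iterate` are hypothetical stand-ins, not docling classes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node that lists its children by reference."""
    ref: str
    label: str = "text"
    children: list[str] = field(default_factory=list)


def iterate(nodes: dict[str, ToyNode], root: str):
    """Yield the nodes reachable from root, depth first, following child refs."""
    for child_ref in nodes[root].children:
        yield nodes[child_ref]
        yield from iterate(nodes, child_ref)


nodes = {
    "#/body": ToyNode("#/body", "body", ["#/texts/0", "#/texts/1", "#/texts/2"]),
    "#/texts/0": ToyNode("#/texts/0", "page_header"),
    "#/texts/1": ToyNode("#/texts/1", "page_header"),
    "#/texts/2": ToyNode("#/texts/2", "paragraph"),
}

# Pass 1: collect the matches without touching the tree.
to_remove = [n.ref for n in iterate(nodes, "#/body") if n.label == "page_header"]

# Pass 2: detach each match from its parent. The flat storage is left intact,
# so every remaining reference still resolves.
for ref in to_remove:
    nodes["#/body"].children.remove(ref)

remaining = [n.ref for n in iterate(nodes, "#/body")]
```

The detached nodes still exist in the flat storage; they are simply no longer reachable from the body, which mirrors what removing a reference from `parent.children` does above.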
-
This approach seems to work well.
-
Hello,
I want to add or remove items from a DoclingDocument, but I do not know whether this is supported or how to achieve it.
Steps:
-> I tried to remove elements from document.texts, but this breaks the document when I call iterate_items() again.
Is there a supported way to remove text, similar to the add_text method?
Thanks