[WIP] Docx support #515
base: master
I need to add some tests :/
cacc48b to 75a2209
I've added examples under
Given the poor results with Apache POI, we would need to try docx4j as an alternative.
Is there a quick way, or would it make sense to add ALTO XML as an input format for the service as well? That way, someone could run that pipeline separately (pdfalto included) and then just submit the XML. This is of course not as nice as getting inline converters working, but it does sort of shift the problem.
Adding web services that take ALTO XML instead of PDF as input is easy (we just shortcut the PDF conversion). The problem is that there are a number of ALTO flavors, and so far the only one that is well tested and supported is the pdfalto output. Testing and supporting ALTO variants comprehensively would probably take quite a lot of effort. The best option for docx would certainly be a docx to ALTO XML conversion (same ALTO as pdfalto), with ALTO also accepted as input, but that looks complicated, which is why I am first adding this docx support via Apache POI or docx4j.
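To make the "shortcut the PDF conversion" idea concrete, here is a minimal sketch of how a service could decide whether an upload still needs conversion. The class `InputRouter` and its methods are hypothetical and not part of GROBID. The sketch only assumes that PDF files start with the `%PDF-` header and that ALTO documents have an `alto` root element; it does not check which ALTO flavor was submitted, which is the hard part mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical helper (not a GROBID class): classifies an uploaded payload. */
public class InputRouter {

    public enum Kind { PDF, ALTO_XML, UNKNOWN }

    /** Returns the kind of payload, based on the PDF header or the XML root element. */
    public static Kind detect(byte[] data) {
        // PDF files begin with the ASCII header "%PDF-".
        byte[] header = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (startsWith(data, header)) {
            return Kind.PDF;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Reject DOCTYPE declarations so untrusted uploads cannot trigger XXE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(data));
            // ALTO documents use "alto" as the root element, whatever the schema version.
            String root = doc.getDocumentElement().getLocalName();
            return "alto".equals(root) ? Kind.ALTO_XML : Kind.UNKNOWN;
        } catch (Exception e) {
            // Not well-formed XML (or parser configuration failed): treat as unknown.
            return Kind.UNKNOWN;
        }
    }

    /** Only PDF input needs the conversion step; ALTO XML can skip it. */
    public static boolean needsPdfConversion(byte[] data) {
        return detect(data) == Kind.PDF;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A service built this way would call `needsPdfConversion` first and pass ALTO XML straight to the downstream parsing step. Validating the ALTO variant would still have to happen separately.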
Should I add ALTO XML input as a feature request / pull request? The embedded pdfalto support makes containerization a bit more of a challenge and affects how ingestion can be managed: as you can imagine, running the pdfalto process on a separate server gives the GROBID server more capacity for running Wapiti and such.
Noting that another contributor has posted a pdfalto service PR (WIP): #552
docx document support using Apache POI (via the opensagres converter)