Word Index Entry Render as the HTML to WORD #10171
Replies: 7 comments 11 replies
-
I do not understand the question.
-
Hi John, I think I understand this question, since my current task is similar. I'm trying to convert a very large Word document to a Markdown format (Quarto, JupyterBook, etc.) and would like to extract Word's index entries. For example, the text "Alan Kay" shows in the Word doc as
Ideally, in the case of md or latex output, I would love to be able to generate a document which has the text
Here's a snapshot of the XML, with a couple of annotations:
<w:r w:rsidRPr="00B7562C">
<w:rPr>
<w14:ligatures w14:val="standard"/>
</w:rPr>
<!-- regular text parsed fine, the index entry is immediately following this. -->
<w:t>e have been extremely lucky in our mentors. Jens cut his teeth in the company of the Smalltalk pioneers: Alan Kay</w:t>
</w:r>
<!-- first block seems to be the special { to denote the start of the entry. -->
<w:r w:rsidR="003C10BE" w:rsidRPr="00B7562C">
<w:rPr>
<w14:ligatures w14:val="standard"/>
</w:rPr>
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r w:rsidR="003C10BE" w:rsidRPr="00B7562C">
<w:rPr>
<w14:ligatures w14:val="standard"/>
</w:rPr>
<!-- As best I can tell, every index entry is defined by an instruction matching `XE "(.*)"`, whose capture group holds the entry text. -->
<w:instrText xml:space="preserve"> XE "Kay, Alan" </w:instrText>
</w:r>
<!-- final block seems to be the special } to denote the end of the entry. -->
<w:r w:rsidR="003C10BE" w:rsidRPr="00B7562C">
<w:rPr>
<w14:ligatures w14:val="standard"/>
</w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
Personally, I'd even be fine with outputting the text
Edit: I realize that the value of an index entry also lies in being able to generate the index with the appropriate links back to the source definition. I, personally, don't expect pandoc to handle things in a super generic way. Certainly in an HTML format, it would be relatively straightforward to generate named anchor tags and a list of links pointing back to those anchors. But I don't believe there is a universal index format. So my goal is primarily being able to extract and reformat the data for my own needs. I'd be happy to use a Lua filter as long as I can figure out exactly what I need to hook into.
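For anyone who just needs the raw entries in the meantime, here is a minimal sketch of the extraction described above. It assumes each entry sits in a single `w:instrText` run shaped like `XE "..."`, as in the snapshot; Word can split an instruction across several runs, which this does not handle. The names `extract_index_entries`, `XE_PATTERN`, and `SAMPLE` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URI bound to the w: prefix in WordprocessingML.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Matches the quoted entry text in an instruction such as: XE "Kay, Alan"
XE_PATTERN = re.compile(r'^\s*XE\s+"([^"]*)"')


def extract_index_entries(document_xml):
    """Return the quoted text of every XE field found in the given XML."""
    root = ET.fromstring(document_xml)
    entries = []
    for instr in root.iter(f"{{{W}}}instrText"):
        match = XE_PATTERN.match(instr.text or "")
        if match:
            entries.append(match.group(1))
    return entries


# A trimmed-down version of the snapshot above (assumption: one paragraph,
# run properties removed for brevity).
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>the Smalltalk pioneers: Alan Kay</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> XE "Kay, Alan" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""

print(extract_index_entries(SAMPLE))
```

For a real file, the same function can be fed the contents of `word/document.xml`, read out of the .docx with the standard `zipfile` module.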
-
I was looking at this too and came up with the following using a python-docx fork (I also tried to carry over docx markdown
html
-
OK, I have now added code to pandoc that parses these index entries into empty Span elements.
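One way to inspect what ends up in the document is to walk pandoc's JSON output and list every Span that has no inline content. The sketch below does only that; it does not assume any particular class or attribute names on the spans, and the class and key used in `SAMPLE` are hypothetical placeholders, not what pandoc actually emits. The function name `empty_spans` is also made up.

```python
def empty_spans(node):
    """Yield the attributes of every Span that has no inline content."""
    if isinstance(node, dict):
        if node.get("t") == "Span":
            # A Span's content is [[id, classes, key-value pairs], inlines].
            (ident, classes, keyvals), inlines = node["c"]
            if not inlines:
                yield {
                    "id": ident,
                    "classes": classes,
                    "attributes": dict(keyvals),
                }
        for value in node.values():
            yield from empty_spans(value)
    elif isinstance(node, list):
        for item in node:
            yield from empty_spans(item)


# Hand-written AST fragment; the class and key names are hypothetical.
SAMPLE = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {
            "t": "Para",
            "c": [
                {"t": "Str", "c": "Kay"},
                {
                    "t": "Span",
                    "c": [
                        ["", ["hypothetical-class"],
                         [["hypothetical-key", "Kay, Alan"]]],
                        [],
                    ],
                },
            ],
        }
    ],
}

for span in empty_spans(SAMPLE):
    print(span)
```

With a real document, the input would be the result of `json.load` on the output of `pandoc input.docx -t json`.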
-
What if there's "See XXX" text, as in the example below? I believe that there can in theory be arbitrary such text (it doesn't need to start "See"; that's just the default) and it can include formatting (by default "See" is in italics). Also, I wonder whether the XE text (here "Kay:Alan") is being parsed ("Kay" is the main entry and "Alan" is the subentry)? (Sorry, I know I could answer this question by looking at the code.)
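To make the question concrete, here is a rough sketch of how an XE instruction could be split into its parts. It is not pandoc's implementation. It relies on Word's XE field syntax, where a colon separates the main entry from subentries, `\t "..."` carries the cross-reference text, and `\b` and `\i` set the page number in bold or italic. Escaped colons and formatting inside the cross-reference text are not handled, and the names `IndexEntry` and `parse_xe` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexEntry:
    levels: List[str]            # ["Kay", "Alan"] for XE "Kay:Alan"
    see: Optional[str] = None    # text of the \t switch, if present
    bold: bool = False           # \b switch
    italic: bool = False         # \i switch


def parse_xe(instr):
    """Parse an XE field instruction; return None for other field types."""
    head = re.match(r'\s*XE\s+"([^"]*)"', instr)
    if head is None:
        return None
    levels = [part.strip() for part in head.group(1).split(":")]
    rest = instr[head.end():]
    see_match = re.search(r'\\t\s+"([^"]*)"', rest)
    see = see_match.group(1) if see_match else None
    if see_match:
        # Drop the quoted cross-reference so its text cannot be
        # mistaken for a switch.
        rest = rest[:see_match.start()] + rest[see_match.end():]
    return IndexEntry(
        levels=levels,
        see=see,
        bold=re.search(r"\\b(?!\w)", rest) is not None,
        italic=re.search(r"\\i(?!\w)", rest) is not None,
    )


entry = parse_xe(' XE "Kay:Alan" \\t "See Smalltalk" ')
print(entry.levels, entry.see)
```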
-
OK. I see. LGTM. Thanks!
(Might it be worth retaining the
-
This could be a very handy way to index a book. Will there be a method for transferring entries back to Docx? (In an ideal world of course it would be brilliant if the same markup could also be used to produce an index in LaTeX!)
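On the LaTeX side, the mapping is at least easy to sketch by hand. The helper below turns an entry's levels and optional cross-reference into an `\index{...}` command using makeindex conventions: `!` separates sublevels, `|see{...}` marks a cross-reference, and `"` quotes the special characters. This is only an illustration of the idea, not something pandoc does; the name `latex_index` is hypothetical, and general LaTeX escaping (`&`, `%`, and so on) is left out.

```python
def latex_index(levels, see=None):
    """Build a makeindex-style \\index command from entry levels."""

    def quote(text):
        # makeindex treats ! @ | " specially; a preceding " quotes them.
        return "".join('"' + ch if ch in '!@|"' else ch for ch in text)

    key = "!".join(quote(level) for level in levels)
    if see:
        key += "|see{" + see + "}"
    return "\\index{" + key + "}"


print(latex_index(["Kay", "Alan"]))
print(latex_index(["Kay", "Alan"], see="Smalltalk"))
```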
-
Render Word's default index entry when converting HTML to Docx. Refer to the attached file.
commet.docx