Word Index Entry Render as the HTML to WORD #10171

PerlDeveloperSurya · 2024-09-11T06:21:53Z

PerlDeveloperSurya
Sep 11, 2024

Render Word's default index entry when converting HTML to Docx. Please refer to the attached file.

commet.docx

jgm · 2024-09-11T15:57:38Z

jgm
Sep 11, 2024
Maintainer

I do not understand the question.

0 replies

cycomachead · 2024-11-29T23:33:29Z

cycomachead
Nov 29, 2024

Hi John,

I think I understand this question, since my current task is similar. I'm trying to convert a very large Word document to a Markdown format (Quarto, JupyterBook, etc.) and would like to extract Word's index entries.

For example, the text "Alan Kay" shows in the Word doc as Alan Kay { XE "Kay, Alan" } when showing all paragraph markers. This creates an index entry that is linked to the location where "Alan Kay" appears in the document.

Ideally, in the case of md or latex output, I would love to be able to generate a document which has the text Alan Kay \index{Kay, Alan}. I've been messing around with Lua filters and inspecting Pandoc's AST, and so far I can't find any reliable way of getting at the "raw" w:instrText elements from Word's XML, which contain the index entries.

Here's a snapshot of the XML, with a couple annotations:

      <w:r w:rsidRPr="00B7562C">
        <w:rPr>
          <w14:ligatures w14:val="standard"/>
        </w:rPr>
         <!-- regular text parsed fine, the index entry is immediately following this. -->
        <w:t>e have been extremely lucky in our mentors.  Jens cut his teeth in the company of the Smalltalk pioneers: Alan Kay</w:t>
      </w:r>
      <!-- first block seems to be the special { to denote the start of the entry. -->
      <w:r w:rsidR="003C10BE" w:rsidRPr="00B7562C">
        <w:rPr>
          <w14:ligatures w14:val="standard"/>
        </w:rPr>
        <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r w:rsidR="003C10BE" w:rsidRPr="00B7562C">
        <w:rPr>
          <w14:ligatures w14:val="standard"/>
        </w:rPr>
        <!-- As best I can tell, the index definitions are all defined by `XE "(.*)"` to extract the content. -->
        <w:instrText xml:space="preserve"> XE "Kay, Alan" </w:instrText>
      </w:r>
      <!-- final block seems to be the special } to denote the end of the entry. -->
      <w:r w:rsidR="003C10BE" w:rsidRPr="00B7562C">
        <w:rPr>
          <w14:ligatures w14:val="standard"/>
        </w:rPr>
        <w:fldChar w:fldCharType="end"/>
      </w:r>

Personally, I'd even be fine with outputting the text Alan Kay XE "Kay, Alan" .... which would need some more parsing and cleanup, but would be relatively functional. Unless I am mistaken, my current option seems to be to re-parse the Word XML, then try to figure out where in my current documents to place the index entries.
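A minimal sketch of that cleanup step, assuming the converter really did emit the field code inline as plain text (which is hypothetical here). The function name `xe_to_latex` and the sample sentence are made up for illustration.

```python
import re

# Matches an inline field code such as: XE "Kay, Alan"
# Assumption: the entry text contains no double quotes, and no switches
# such as \t or \b follow it.
XE_INLINE = re.compile(r'\s*\bXE\s+"([^"]*)"')


def xe_to_latex(text: str) -> str:
    """Replace each inline XE "..." code with a LaTeX \\index{...} command."""
    # A function is used as the replacement so that the backslash in
    # \index does not have to be escaped for the regex engine.
    return XE_INLINE.sub(lambda m: "\\index{" + m.group(1) + "}", text)


print(xe_to_latex('the Smalltalk pioneers: Alan Kay XE "Kay, Alan" and others'))
```

The whitespace before `XE` is swallowed so the `\index` command sits directly after the indexed words; whitespace after the closing quote is left alone.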

Edit:

I realize that the value of the index entry is also being able to generate the index with the appropriate links back to the source definition. I, personally, don't expect pandoc to handle things in a super generic way. Certainly in an HTML format, it would be relatively straightforward to generate named anchor tags and generate a list of anchors pointing to the sources. But I don't believe there is a universal index format. So my goal is primarily being able to extract and reformat the data for my own needs. I'd be happy to use a Lua filter as long as I can figure out exactly what I need to hook into.
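For the HTML case, a rough sketch of the anchor-plus-list idea. The anchor ids (`idx-1`, `idx-2`, ...) and both function names are invented for illustration; nothing here is pandoc API.

```python
from collections import defaultdict
from html import escape


def anchor(anchor_id: str) -> str:
    """Empty named anchor to drop into the text where the entry occurs."""
    return f'<a id="{escape(anchor_id)}"></a>'


def build_index(occurrences) -> str:
    """Build an HTML list from (entry_text, anchor_id) pairs.

    Each entry gets one list item with numbered links back to every
    place it occurs in the document.
    """
    by_entry = defaultdict(list)
    for entry, anchor_id in occurrences:
        by_entry[entry].append(anchor_id)

    lines = ["<ul>"]
    for entry in sorted(by_entry, key=str.casefold):
        links = ", ".join(
            f'<a href="#{escape(a)}">{n}</a>'
            for n, a in enumerate(by_entry[entry], start=1)
        )
        lines.append(f"  <li>{escape(entry)}: {links}</li>")
    lines.append("</ul>")
    return "\n".join(lines)


found = [("Kay, Alan", "idx-1"), ("MacFarlane, John", "idx-2"), ("Kay, Alan", "idx-3")]
print(anchor("idx-1"))
print(build_index(found))
```

Sorting is a plain case-insensitive string sort, which is only an approximation of real index collation.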

5 replies

cycomachead Nov 29, 2024

This is absolutely not a minimal test case, but a bunch of my experimentation is here, in case anyone is interested: https://github.com/snap-cloud/snap-manual/tree/quarto/tests

jgm Dec 5, 2024
Maintainer

I'm going to see if I can modify the docx reader to return this information in Spans.

jgm Dec 5, 2024
Maintainer

Interestingly, when I try inserting an index entry it is marked up like this (not a plain string):

<w:instrText xml:space="preserve"> XE "</w:instrText></w:r><w:r w:rsidRPr="001B31C8"><w:instrText>MacFarlane, John</w:instrText></w:r><w:r><w:instrText xml:space="preserve">" </w:instrText>

There are three different w:instrText elements: one containing just XE ", the next with the text, and the final one with just ". Microsoft!

wlupton Dec 5, 2024

Right. My script dumps it this way (FieldCode means the same as instrText; I've lost the space="preserve" attribute).

% ./docx2md.py test.docx -l 1
INFO:docx2md:document 
INFO:docx2md:  paragraph <Normal> John MacFarlane^[[Macfarlane:John]{.index-entry}]
INFO:docx2md:    run <---> John MacFarlane
INFO:docx2md:      text John MacFarlane
INFO:docx2md:    run <---> [<FieldChar begin>]
INFO:docx2md:      fieldChar begin
INFO:docx2md:    run <---> [<FieldCode ['XE', '"']>]
INFO:docx2md:      fieldCode ['XE', '"']
INFO:docx2md:    run <---> [<FieldCode ['Macfarlane:John']>]
INFO:docx2md:      fieldCode ['Macfarlane:John']
INFO:docx2md:    run <---> [<FieldCode ['"']>]
INFO:docx2md:      fieldCode ['"']
INFO:docx2md:    run <---> [<FieldChar end>]
INFO:docx2md:      fieldChar end
INFO:docx2md:resolving bookmarks

It seems like there's a separate stream, independent of the main text, that needs to be parsed. Here the text to be parsed (delineated by begin and end fieldChars) is (give or take some whitespace):

XE "Macfarlane:John"

If you select more of the options when adding the entry you get a more complex string. For example:

% ./docx2md.py test.docx -l 1
INFO:docx2md:document 
INFO:docx2md:  paragraph <Normal> John MacFarlane^[[MacFarlane:John]{.index-entry .index-bold .index-italic} [See Pandoc]{.index-text}]
INFO:docx2md:    run <---> John MacFarlane
INFO:docx2md:      text John MacFarlane
INFO:docx2md:    run <---> [<FieldChar begin>]
INFO:docx2md:      fieldChar begin
INFO:docx2md:    run <---> [<FieldCode ['XE', '"']>]
INFO:docx2md:      fieldCode ['XE', '"']
INFO:docx2md:    run <---> [<FieldCode ['MacFarlane:John']>]
INFO:docx2md:      fieldCode ['MacFarlane:John']
INFO:docx2md:    run <---> [<FieldCode ['"', '\\t', '"']>]
INFO:docx2md:      fieldCode ['"', '\\t', '"']
INFO:docx2md:    run <-i-> ['', <FieldCode ['See']>]
INFO:docx2md:      fieldCode ['See']
INFO:docx2md:    run <---> [<FieldCode ['Pandoc']>]
INFO:docx2md:      fieldCode ['Pandoc']
INFO:docx2md:    run <---> [<FieldCode ['"', '\\b', '\\']>]
INFO:docx2md:      fieldCode ['"', '\\b', '\\']
INFO:docx2md:    run <---> [<FieldCode ['i']>]
INFO:docx2md:      fieldCode ['i']
INFO:docx2md:    run <---> [<FieldCode []>]
INFO:docx2md:      fieldCode []
INFO:docx2md:    run <---> [<FieldChar end>]
INFO:docx2md:      fieldChar end

XE "MacFarlane:John" \t "See Pandoc" \b \i
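Once the pieces are joined, a string like the one above can be split into its parts. The following is only a sketch: `IndexEntry` and `parse_xe` are hypothetical names, only the `\t`, `\b` and `\i` switches seen in this thread are handled, and any other switch is silently ignored.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A token is either a double-quoted string or a run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


@dataclass
class IndexEntry:
    entry: str
    crossref: Optional[str] = None
    bold: bool = False
    italic: bool = False


def parse_xe(instr: str) -> Optional[IndexEntry]:
    """Parse a joined XE field code; return None if it is not an XE field."""
    tokens = [
        ("str", m.group(1)) if m.group(1) is not None else ("word", m.group(2))
        for m in TOKEN.finditer(instr)
    ]
    if len(tokens) < 2 or tokens[0] != ("word", "XE") or tokens[1][0] != "str":
        return None

    result = IndexEntry(entry=tokens[1][1])
    i = 2
    while i < len(tokens):
        kind, value = tokens[i]
        has_arg = i + 1 < len(tokens) and tokens[i + 1][0] == "str"
        if kind == "word" and value == "\\t" and has_arg:
            result.crossref = tokens[i + 1][1]  # text such as "See Pandoc"
            i += 1                              # skip the argument
        elif kind == "word" and value == "\\b":
            result.bold = True
        elif kind == "word" and value == "\\i":
            result.italic = True
        i += 1
    return result


print(parse_xe(' XE "MacFarlane:John" \\t "See Pandoc" \\b \\i '))
```

The entry text is returned as-is, so `MacFarlane:John` still contains the main entry and subentry joined by a colon.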

jgm Dec 5, 2024
Maintainer

The documentation says that there can be multiple instrText elements, and their contents need to be concatenated. Currently changing the code to handle this.
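To illustrate the concatenation outside of pandoc, here is a small standard-library sketch (not the docx reader's actual code). It assumes fields are not nested, and `field_instructions` and `SAMPLE` are names made up for this example; the sample mirrors the three-part markup shown earlier in the thread.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FLD_CHAR = f"{{{W}}}fldChar"
FLD_TYPE = f"{{{W}}}fldCharType"
INSTR_TEXT = f"{{{W}}}instrText"


def field_instructions(xml_text):
    """Return one joined instruction string per begin/end fldChar pair."""
    fields = []
    current = None  # list of text pieces while inside a field, else None
    for el in ET.fromstring(xml_text).iter():  # document order
        if el.tag == FLD_CHAR:
            kind = el.get(FLD_TYPE)
            if kind == "begin":
                current = []
            elif kind == "end" and current is not None:
                fields.append("".join(current).strip())
                current = None
        elif el.tag == INSTR_TEXT and current is not None:
            current.append(el.text or "")
    return fields


SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t>John MacFarlane</w:t></w:r>'
    '<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    '<w:r><w:instrText xml:space="preserve"> XE "</w:instrText></w:r>'
    '<w:r><w:instrText>MacFarlane, John</w:instrText></w:r>'
    '<w:r><w:instrText xml:space="preserve">" </w:instrText></w:r>'
    '<w:r><w:fldChar w:fldCharType="end"/></w:r>'
    '</w:p>'
)

print(field_instructions(SAMPLE))
```

For a real file, the XML can be read with `zipfile.ZipFile(path).read("word/document.xml")` and passed to the same function.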

wlupton · 2024-12-05T17:54:41Z

wlupton
Dec 5, 2024

I was looking at this too and came up with the following using a python-docx fork (I also tried to carry over \b, \i etc. as additional classes). Using footnotes seemed like a good idea at the time but I guess it adds nothing (apart from behaving better if the CSS classes aren't handled)? I was also going to write a sample lua filter to convert the footnotes into an index.

docx

markdown

This is Alan Kay^[[Kay:Alan]{.index-entry} [See Kay]{.index-text}].

html

<p>This is Alan Kay<a href="#fn1" class="footnote-ref" id="fnref1"
role="doc-noteref"><sup>1</sup></a>.</p>
<section id="footnotes" class="footnotes footnotes-end-of-document"
role="doc-endnotes">
<hr />
<ol>
<li id="fn1"><p><span class="index-entry">Kay:Alan</span> <span
class="index-text">See Kay</span><a href="#fnref1" class="footnote-back"
role="doc-backlink">↩︎</a></p></li>
</ol>
</section>

0 replies

jgm · 2024-12-05T19:38:46Z

jgm
Dec 5, 2024
Maintainer

OK, I have now added code to pandoc that parses these index entries into empty Span elements.
These can be processed by a filter but will otherwise generally be ignored.

1 reply

cycomachead Dec 9, 2024

This is awesome! Thank you!

I was able to build pandoc and test (and then saw you already released 3.6!) and after a few minutes got a filter working easily. :)

Saved me a good bit of time from writing some other scripts.

wlupton · 2024-12-06T19:45:25Z

wlupton
Dec 6, 2024

What if there's "See XXX" text, as in the example below? I believe that there can in theory be arbitrary such text (it doesn't need to start "See"; that's just the default) and it can include formatting (by default "See" is in italics).

Also, I wonder whether the XE text (here "Kay:Alan") is being parsed ("Kay" is the main entry and "Alan" is the subentry)? (Sorry, I know I could answer this question by looking at the code.)

1 reply

jgm Dec 6, 2024
Maintainer

That is handled by my changes. Try it.
You'll get a span with an entry and a crossref attribute.
The entry will just be literally Kay:Alan -- you'll have to deconstruct that if you need a different format for your desired output. E.g. in LaTeX s/:/!/.
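A small sketch of that `s/:/!/` step in Python, for anyone post-processing the attribute outside of a Lua filter. The function names are hypothetical. The sketch also quotes makeindex's own special characters with `"`; it does not handle LaTeX special characters such as `&` or `%`, nor any escaped colons in the Word entry.

```python
# Characters that makeindex treats specially inside \index{...};
# a literal one is written by putting a double quote in front of it.
MAKEINDEX_SPECIAL = '!@|"'


def quote_level(text: str) -> str:
    """Quote makeindex special characters in one entry level."""
    return "".join('"' + ch if ch in MAKEINDEX_SPECIAL else ch for ch in text)


def latex_index(entry: str) -> str:
    """Turn a Word-style entry such as 'Kay:Alan' into \\index{Kay!Alan}."""
    levels = [quote_level(part.strip()) for part in entry.split(":")]
    return "\\index{" + "!".join(levels) + "}"


print(latex_index("Kay:Alan"))
print(latex_index("MacFarlane:John"))
```

Each level is quoted before the levels are joined, so the `!` separators added by the join are never quoted themselves.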

wlupton · 2024-12-07T08:50:04Z

wlupton
Dec 7, 2024

OK. I see. LGTM. Thanks!

% pandoc-nightly test.docx -t markdown
John MacFarlane[]{.indexref entry="MacFarlane:John" crossref="See Pandoc"}

(Might it be worth retaining the \b and \i in some form, because these might carry additional information about the entry type? "Special features, such as pages with illustrations or with substantial bibliographical references, may be indicated by bold or italic numerals, but such devices should be used sparingly.")

2 replies

jgm Dec 7, 2024
Maintainer

I suppose we could do that, and maybe also the \y.

jgm Dec 9, 2024
Maintainer

I already implemented this by the way.

adunning · 2024-12-11T13:58:05Z

adunning
Dec 11, 2024

This could be a very handy way to index a book. Will there be a method for transferring entries back to Docx? (In an ideal world of course it would be brilliant if the same markup could also be used to produce an index in LaTeX!)

2 replies

cycomachead Dec 11, 2024

I haven't fully finished the work I'm doing, but it took only a few minutes to get a working Lua filter that wraps the content of a span in an \index{} tag, which then works with \makeindex and \printindex.

Right now, I'm not interested in doing a fully automated Word to LaTeX+md conversion, just a mostly automated one.

https://github.com/snap-cloud/snap-manual/blob/quarto/tests/index-extractor.lua#L24
But maybe that's a start at something.
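For anyone who would rather not write Lua, the same idea can be sketched as a pandoc JSON filter using only the Python standard library. This is not the linked filter. It assumes the docx reader output shown earlier in this thread (an empty Span with class `indexref` and an `entry` attribute), and the names `walk`, `main` and `demo` are made up for illustration.

```python
import json
import sys


def walk(node):
    """Recursively copy a pandoc JSON AST, replacing indexref Spans."""
    if isinstance(node, list):
        return [walk(item) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "Span":
            attr, _inlines = node["c"]
            _ident, classes, keyvals = attr
            if "indexref" in classes:
                entry = dict(keyvals).get("entry", "")
                # Word separates entry levels with ':', makeindex with '!'.
                latex = "\\index{" + entry.replace(":", "!") + "}"
                return {"t": "RawInline", "c": ["latex", latex]}
        return {key: walk(value) for key, value in node.items()}
    return node


def main():
    """Filter entry point: read the AST on stdin, write it to stdout."""
    json.dump(walk(json.load(sys.stdin)), sys.stdout)


# Small in-memory demonstration; call main() instead when running as a filter.
demo = {
    "t": "Para",
    "c": [
        {"t": "Str", "c": "Kay"},
        {"t": "Span", "c": [["", ["indexref"], [["entry", "Kay:Alan"]]], []]},
    ],
}
print(json.dumps(walk(demo)))
```

Saved as a script that calls `main()` (say a hypothetical `index_filter.py`), it could be passed to pandoc with `--filter`. The `crossref` attribute is ignored in this sketch.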

jgm Dec 11, 2024
Maintainer

Right now we just have docx reader support, but if you want, you can submit issues for docx writer and LaTeX writer/reader support.
