Extract images embedded in Word

XSweet should extract images embedded in Word docs and display them in the html.

When an image is added to a Word doc, Word puts a copy of the original image in the /word/media directory. Each image in the Word document has an id that connects it to the corresponding image file.

Images are inserted as <drawing> elements. Ids appear in many of the tags inside the <drawing> tag, but it looks like the one that matters is the r:embed attribute on the a:blip tag, inside a pic:blipFill:

<pic:blipFill>
    <a:blip r:embed="rId4">
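
To make the lookup concrete, here is a minimal Python sketch (not XSLT, and not how XSweet itself works) that collects the r:embed ids from a document.xml string. The function name embedded_image_ids is hypothetical; the namespace URIs are the standard DrawingML and relationships ones, and the sketch assumes every picture of interest sits in a pic:blipFill/a:blip.

```python
import xml.etree.ElementTree as ET

# Standard OOXML namespaces for DrawingML and pictures.
NS = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
}
# ElementTree spells namespaced attributes as {namespace-uri}localname.
R_EMBED = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed"


def embedded_image_ids(document_xml):
    """Return the r:embed ids of all a:blip elements inside a pic:blipFill,
    in document order. (Hypothetical helper, for illustration only.)"""
    root = ET.fromstring(document_xml)
    ids = []
    for blip in root.iterfind(".//pic:blipFill/a:blip", NS):
        rid = blip.get(R_EMBED)
        if rid is not None:  # skip blips without an r:embed attribute
            ids.append(rid)
    return ids
```
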

The doc's _rels/document.xml.rels file connects the r:embed ID on the <a:blip> in document.xml to the image file's location in the /word/media directory, through the Target attribute of the matching <Relationship>:

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
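
The trail from rId to image file can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not XSweet's implementation: image_targets is a hypothetical name, the input is the text of _rels/document.xml.rels, and the returned Target values are taken as relative to the word/ directory (so media/image1.png means /word/media/image1.png). Relationships marked TargetMode="External" are skipped here because they point outside the package.

```python
import xml.etree.ElementTree as ET

PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)


def image_targets(rels_xml):
    """Map relationship Id -> Target for image relationships only.
    (Hypothetical helper, for illustration only.)"""
    root = ET.fromstring(rels_xml)
    targets = {}
    # <Relationship> elements are direct children of <Relationships>.
    for rel in root.iterfind("{%s}Relationship" % PKG_REL_NS):
        if rel.get("Type") != IMAGE_TYPE:
            continue  # styles, settings, fonts, etc.
        if rel.get("TargetMode") == "External":
            continue  # linked, not embedded; no file inside the docx
        targets[rel.get("Id")] = rel.get("Target")
    return targets
```

With the excerpt above as input, looking up "rId4" in the returned dictionary gives "media/image1.png".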

In the document.xml file, there are lots of image properties, but as a starting point, we should ignore everything except the id, and use it to fetch and insert the original picture file where it appears in the xml.

A few considerations:

As a start, let's ignore the size of the image in the docx (and any other properties from the docx) and link to the original target image. How hard is it for XSweet to follow the trail from the rId through the _rels/document.xml.rels file to the image file?
Images can be either inline (<w:drawing><wp:inline...) or floating (<w:drawing><wp:anchor...). If they're floating, it's possible that they would appear in a different place in the docx than in the html, but let's worry about that later.
Are there any special considerations that jump out at you for handling different image formats differently?
Until now, XSweet and Typescript have produced single self-contained files as outputs, but for image extraction, we'll need copies of the embedded images to point at in the html. I think it would be good if XSweet copied the original image files once (not again at every step) and included them in a directory as an output. That way, the HTML image links could point to the files with relative paths. Is something like this possible using xslt, or would it need to be done some other way? How we handle this with INK is a whole separate conversation, but let's start with something as simple as we can.
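
On the copy-once idea: if it turns out the copying has to happen outside XSLT, a small wrapper step could do it. The following Python sketch is one possible shape for that step, offered as an assumption rather than a proposal for how XSweet must work. The name extract_media and the media/ output subdirectory are hypothetical; the sketch copies everything under word/media/ as-is, regardless of image format.

```python
import posixpath
import zipfile
from pathlib import Path


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ in the .docx into out_dir/media/.

    Returns a dict mapping the part name inside the docx to a relative
    path usable as an <img src>. (Hypothetical helper, illustration only.)
    """
    media_dir = Path(out_dir) / "media"
    media_dir.mkdir(parents=True, exist_ok=True)
    copied = {}
    with zipfile.ZipFile(docx_path) as docx:
        for name in docx.namelist():
            # Zip member names always use forward slashes.
            if name.startswith("word/media/") and not name.endswith("/"):
                filename = posixpath.basename(name)
                (media_dir / filename).write_bytes(docx.read(name))
                copied[name] = "media/" + filename
    return copied
```

Run once per document, this leaves the HTML free to reference media/image1.png relative to the output directory at every later step.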

Open to suggestions and interested to hear what you think.

Here is a very simple example docx: Image_test_docx.docx

And here is how images currently come out of the initial extraction step:

<p>
  <noProof>
    <div class="drawing" />
  </noProof>
</p>

Image_test_docx-1EXTRACTED.html

Edited Jul 27, 2018 by Alex Theg