Extract images embedded in Word
XSweet should extract images embedded in Word docs and display them in the html.
When an image is added to a Word doc, Word puts a copy of the original image in the /word/media
directory. Each image in the Word document has an id that connects it to the corresponding image file.
Images are inserted as <drawing> elements. Ids appear in many of the tags inside the <drawing> tag, but it looks like the one that matters, r:embed, comes from the a:blip tag inside a pic:blipFill:
<pic:blipFill>
<a:blip r:embed="rId4">
The doc's _rels/document.xml.rels file holds the connection between the <a:blip r:embed> ID in document.xml and the image file location in the /word/media directory, through the Target attribute of the matching <Relationship>:
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
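To make the lookup concrete, here is a rough sketch in Python (only to illustrate the idea, not how XSweet would do it in XSLT). It reads a rels file like the excerpt above and builds a map from relationship Id to Target, keeping only image relationships. The function name image_targets and the sample string are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace and relationship type copied from the rels excerpt above.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)


def image_targets(rels_xml):
    """Return {relationship Id: Target} for image relationships only."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall(f"{{{RELS_NS}}}Relationship")
        if rel.get("Type") == IMAGE_TYPE
    }


# Hypothetical minimal rels file with the single relationship shown above.
sample = (
    f'<Relationships xmlns="{RELS_NS}">'
    f'<Relationship Id="rId4" Type="{IMAGE_TYPE}" Target="media/image1.png"/>'
    "</Relationships>"
)
print(image_targets(sample))  # {'rId4': 'media/image1.png'}
```

Note that the Target is relative to the word/ directory of the package, so media/image1.png here means word/media/image1.png inside the docx.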
In the document.xml file, there are lots of image properties, but as a starting point, we should ignore everything except the id and use it to fetch and insert the original picture file where it appears in the xml.
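As a sketch of "ignore everything except the id", the Python below walks a document.xml string and collects the r:embed value of every a:blip in document order, skipping all other image properties. The function name embedded_image_ids is hypothetical, and the sample document is heavily simplified (a real <w:drawing> wraps the blip in several more layers), so treat this as an illustration rather than a proposal for the actual implementation.

```python
import xml.etree.ElementTree as ET

# Standard OOXML namespaces for the a: and r: prefixes used in document.xml.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def embedded_image_ids(document_xml):
    """Return the r:embed ids of all a:blip elements, in document order."""
    root = ET.fromstring(document_xml)
    ids = []
    for blip in root.iter(f"{{{A_NS}}}blip"):
        rid = blip.get(f"{{{R_NS}}}embed")
        if rid is not None:  # linked (not embedded) images have no r:embed
            ids.append(rid)
    return ids


# Simplified stand-in for document.xml with two images.
sample_doc = (
    f'<w:document xmlns:w="{W_NS}" xmlns:a="{A_NS}" xmlns:r="{R_NS}">'
    '<w:drawing><a:blip r:embed="rId4"/></w:drawing>'
    '<w:drawing><a:blip r:embed="rId7"/></w:drawing>'
    "</w:document>"
)
print(embedded_image_ids(sample_doc))  # ['rId4', 'rId7']
```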
A few considerations:
- As a start, let's ignore the size of the image in the docx (and any other properties from the docx) and link to the original target image. How hard is it for XSweet to follow the trail from the rId through the _rels/document.xml.rels file to the image file?
- Images can be either inline (<w:drawing><wp:inline...) or floating (<w:drawing><wp:anchor...). If they're floating, it's possible that they would appear in a different place in the docx than in the html, but let's worry about that later.
- Are there any special considerations that jump out at you for handling different image formats differently?
- Until now, XSweet and Typescript have produced single self-contained files as outputs, but for image extraction, we'll need copies of the embedded images to point at in the html. I think it would be good if XSweet copied the original image files and included them in a directory as an output (once, not over and over for each step). That way, the HTML image links could point to the files with relative paths. Is something like this possible using xslt, or would it need to be done some other way? How we handle this with INK is a whole separate conversation but let's start with something as simple as we can.
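To illustrate the last point, here is a rough Python sketch of the "copy once, link relatively" idea, done outside XSLT since a docx is just a zip archive. It assumes a map of relationship Id to Target has already been read from the rels file. The function name copy_media, the output layout (a media/ folder next to the HTML), and the demo file are all assumptions for this example, and it does not handle externally linked images or absolute Targets.

```python
import posixpath
import tempfile
import zipfile
from pathlib import Path


def copy_media(docx_path, out_dir, targets):
    """Copy each image out of the docx; return {Id: path relative to out_dir}."""
    out_dir = Path(out_dir)
    links = {}
    with zipfile.ZipFile(docx_path) as docx:
        for rid, target in targets.items():
            # Targets in document.xml.rels are relative to the word/ folder.
            member = posixpath.normpath(posixpath.join("word", target))
            dest = out_dir / "media" / posixpath.basename(member)
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(docx.read(member))
            links[rid] = dest.relative_to(out_dir).as_posix()
    return links


# Demo with a fake docx built on the fly (the "image" is placeholder bytes).
with tempfile.TemporaryDirectory() as tmp:
    docx_path = Path(tmp) / "demo.docx"
    with zipfile.ZipFile(docx_path, "w") as z:
        z.writestr("word/media/image1.png", b"not a real PNG")
    links = copy_media(docx_path, Path(tmp) / "out", {"rId4": "media/image1.png"})
    copied = (Path(tmp) / "out" / links["rId4"]).read_bytes()

print(links)  # {'rId4': 'media/image1.png'}
```

The returned paths could then be used as the src of the image links in the html, so the HTML and the media folder can be moved around together.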
Open to suggestions and interested to hear what you think.
Here is a very simple example docx: Image_test_docx.docx
And here is how images currently come out of the initial extraction:
<p>
<noProof>
<div class="drawing" />
</noProof>
</p>