Extract images embedded in Word
XSweet should extract images embedded in Word docs and display them in the HTML output.
When an image is added to a Word doc, Word puts a copy of the original image in the
/word/media directory. Each image in the Word document has an id that connects it to the corresponding image file.
Images are inserted as <drawing>s. IDs appear in many of the tags inside the <drawing> tag, but it looks like the one that matters is r:embed, which comes from the a:blip tag inside a <pic:blipFill>:
<pic:blipFill> <a:blip r:embed="rId4">
The _rels/document.xml.rels file holds the connection between the <a:blip r:embed> ID in document.xml and the image file location in the /word/media directory, through the Target property of the matching <Relationship>:
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"> <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
In the document.xml file, there are lots of image properties, but as a starting point we should ignore everything except the ID, and use it to fetch and insert the original picture file where it appears in the XML.
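To make the lookup concrete, here is a minimal sketch of following the trail from r:embed to the image file. It is written in Python purely for illustration and is not how XSweet does (or would) implement it. The function names image_targets, embedded_ids and resolve_images are made up for this example, and it assumes the transitional OOXML namespaces and that relative Target values are resolved against the /word directory.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by transitional OOXML packages (assumption: not "strict" OOXML).
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
DRAWING_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
IMAGE_TYPE = DOC_REL_NS + "/image"


def image_targets(rels_xml):
    """Map relationship Id -> path inside the package, for image relationships only."""
    root = ET.fromstring(rels_xml)
    targets = {}
    for rel in root.iter("{%s}Relationship" % PKG_REL_NS):
        if rel.get("Type") != IMAGE_TYPE or rel.get("TargetMode") == "External":
            continue  # skip styles, fonts, hyperlinks and externally linked images
        target = rel.get("Target")
        if target.startswith("/"):
            path = target.lstrip("/")  # already relative to the package root
        else:
            # Relative targets are resolved against the directory of document.xml.
            path = posixpath.normpath(posixpath.join("word", target))
        targets[rel.get("Id")] = path
    return targets


def embedded_ids(document_xml):
    """Return the r:embed value of every a:blip, in document order."""
    root = ET.fromstring(document_xml)
    attr = "{%s}embed" % DOC_REL_NS
    return [b.get(attr) for b in root.iter("{%s}blip" % DRAWING_NS) if b.get(attr)]


def resolve_images(docx_path):
    """Return (rId, package path or None) pairs for each embedded image."""
    with zipfile.ZipFile(docx_path) as z:
        targets = image_targets(z.read("word/_rels/document.xml.rels"))
        ids = embedded_ids(z.read("word/document.xml"))
    return [(rid, targets.get(rid)) for rid in ids]
```

Called on the example docx, resolve_images would return one pair per a:blip. An ID with no matching image relationship comes back paired with None rather than raising, so a caller can decide how to report it.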
A few considerations:
- As a start, let's ignore the size of the image in the docx (and any other properties from the docx) and link to the original target image. How hard is it for XSweet to follow the trail from the _rels/document.xml.rels file to the image file?
- Images can be either inline (<w:drawing><wp:inline...) or floating (<w:drawing><wp:anchor...). If they're floating, they might appear in a different place in the HTML than in the docx, but let's worry about that later.
- Are there any special considerations that jump out at you for handling different image formats differently?
- Until now, XSweet and Typescript have produced single self-contained files as outputs, but for image extraction we'll need copies of the embedded images for the HTML to point at. I think it would be good if XSweet copied the original image files and included them in a directory as an output (once, not over and over for each step). That way, the HTML image links could point to the files with relative paths. Is something like this possible using XSLT, or would it need to be done some other way? How we handle this with INK is a whole separate conversation, but let's start with something as simple as we can.
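As one possible shape for the "copy the images once" idea in the last point, here is a minimal sketch done outside XSLT, in Python. It is only an illustration of the output layout, not a proposal for how XSweet should be wired up. The function name copy_media and the default directory name "media" are assumptions made for this example.

```python
import os
import posixpath
import zipfile


def copy_media(docx_path, out_dir, media_dirname="media"):
    """Copy every file under word/media/ into out_dir/media_dirname, in one pass.

    Returns {path inside the docx: path relative to out_dir}, so an HTML file
    written into out_dir can use the values directly as relative image links.
    """
    dest = os.path.join(out_dir, media_dirname)
    os.makedirs(dest, exist_ok=True)
    links = {}
    with zipfile.ZipFile(docx_path) as z:
        for name in z.namelist():
            if not name.startswith("word/media/") or name.endswith("/"):
                continue  # not an embedded media file
            filename = posixpath.basename(name)
            with open(os.path.join(dest, filename), "wb") as fh:
                fh.write(z.read(name))  # byte-for-byte copy of the original image
            links[name] = posixpath.join(media_dirname, filename)
    return links
```

The returned mapping is keyed by the path inside the docx, which is the same path a lookup through _rels/document.xml.rels arrives at, so each later step can reuse the links without copying the files again. The sketch copies everything in word/media without looking at the format, and leaves open the question of whether some formats need converting before a browser can display them.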
Open to suggestions and interested to hear what you think.
Here is a very simple example docx: Image_test_docx.docx
And here is what the initial extraction currently produces for an image:
<p> <noProof> <div class="drawing" /> </noProof> </p>