WIP - Handle track changes
XSweet should handle track changes.
Current .docx track changes format
- ANY OTHERS? TEST A LARGER GROUP OF FORMATTING CHANGES
Information included in OOXML track changes
The following data exists in the .docx for each tracked change, as attributes on the same tags that mark track changes (<w:ins...>, <w:del...>, and <w:rPrChange...>):
- w:id: integers assigned mostly sequentially to track changes as they are made (a few things other than track changes, such as <w:bookmark...>s, also get assigned these)
- w:author
- w:date: timestamp in ISO 8601 format
Example: <w:ins w:id="1" w:author="Alex Theg" w:date="2020-02-03T07:44:00Z">
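A minimal sketch of reading those three attributes, assuming Python's standard xml.etree.ElementTree rather than the XSLT that XSweet itself uses. The helper name change_metadata is hypothetical and only illustrates where the data lives:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# WordprocessingML main namespace (the "w:" prefix in document.xml)
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def change_metadata(element):
    """Hypothetical helper: return kind, id, author and date of a tracked-change element."""
    raw_date = element.get(f"{{{W}}}date")
    return {
        "kind": element.tag.rsplit("}", 1)[-1],  # "ins", "del" or "rPrChange"
        "id": int(element.get(f"{{{W}}}id")),
        "author": element.get(f"{{{W}}}author"),
        # Older Pythons reject a trailing "Z" in fromisoformat(), so use an explicit offset
        "date": datetime.fromisoformat(raw_date.replace("Z", "+00:00")),
    }

xml = f'<w:ins xmlns:w="{W}" w:id="1" w:author="Alex Theg" w:date="2020-02-03T07:44:00Z"/>'
meta = change_metadata(ET.fromstring(xml))
print(meta["kind"], meta["id"], meta["author"], meta["date"].isoformat())
# ins 1 Alex Theg 2020-02-03T07:44:00+00:00
```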
Insertions
Insertions are denoted by a <w:ins...> tag.
Inline text: For the simplest insertions, inline within a text run, the preceding run ends (</w:r>), then there's an insertion start tag (<w:ins...>), then a normal <w:r><w:t> inside it, followed by the normal closing tags and a </w:ins>:
<w:p w14:paraId="5363EBC2" w14:textId="0075D436" w:rsidR="00105FD3" w:rsidRDefault="005D2EF4">
<w:r>
<w:t>Now let's do some inline additions starting after the colon but before the space:</w:t>
</w:r>
<w:ins w:id="12" w:author="Alex Theg" w:date="2020-02-05T19:07:00Z">
<w:r>
<w:t xml:space="preserve">
here's the inline insertion which ends on the w in now. now
</w:t>
</w:r>
</w:ins>
<w:r>
<w:t>! The exclamation mark is the first untracked character again.</w:t>
</w:r>
</w:p>
If an insertion covers multiple paragraphs, then each paragraph will have a <w:pPr><w:rPr><w:ins.../> with a timestamp that matches the timestamp of the last tracked change from that paragraph. I think a single whole-paragraph insertion doesn't necessarily get this <w:pPr><w:rPr><w:ins.../>. All inserted text within the paragraph is still marked as a normal text insertion (<w:ins...><w:r><w:t>), so it appears the <w:pPr> can usually be ignored. The only exception is the insertion of an entirely blank paragraph, i.e. when the author hits return twice, the paragraph created by the first keystroke. That empty para looks like this:
<w:p w14:paraId="2661557F" w14:textId="77777777" w:rsidR="00105FD3" w:rsidRDefault="00105FD3">
<w:pPr>
<w:rPr><w:ins w:id="7" w:author="Alex Theg" w:date="2020-02-05T19:02:00Z"/></w:rPr>
</w:pPr>
</w:p>
This is worth verifying as TC support is being built, but it may also be a moot point:
- IIRC, we may already disregard everything in <w:pPr>s (low degree of confidence here)
- Pretty sure we also remove <p>s with no content somewhere in the pipeline (higher confidence)
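To make the blank-paragraph case easier to verify, here is a rough Python sketch (standard library only, not XSweet code) that flags a paragraph whose only trace of the insertion is the paragraph-level mark. The function name is hypothetical, and "blank" is assumed to mean "contains no <w:r> at all":

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def is_blank_inserted_paragraph(p):
    """Hypothetical check: paragraph-level insertion mark present, but no runs."""
    has_mark = p.find("w:pPr/w:rPr/w:ins", NS) is not None
    has_runs = p.find(".//w:r", NS) is not None
    return has_mark and not has_runs

# Paragraph-level insertion mark, as in the empty para shown above
MARK = ('<w:pPr><w:rPr><w:ins w:id="7" w:author="Alex Theg" '
        'w:date="2020-02-05T19:02:00Z"/></w:rPr></w:pPr>')

blank = ET.fromstring(f'<w:p xmlns:w="{W}">{MARK}</w:p>')
filled = ET.fromstring(
    f'<w:p xmlns:w="{W}">{MARK}'
    '<w:ins w:id="9" w:author="Alex Theg" w:date="2020-02-05T19:02:00Z">'
    "<w:r><w:t>2 of 2</w:t></w:r></w:ins></w:p>"
)

print(is_blank_inserted_paragraph(blank), is_blank_inserted_paragraph(filled))
# True False
```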
Deletions
Deletions are denoted by a <w:del...> tag.
Inline text: simple inline text deletions look the same as inline text insertions, except that the deleted text is wrapped in a <w:delText> tag instead of the regular <w:t> tag. This means they'd need slightly different handling than the contents of a text insertion TC:
<w:p w14:paraId="507F6657" w14:textId="77777777" w:rsidR="00803599" w:rsidRDefault="00803599" w:rsidP="00803599">
<w:r>
<w:t xml:space="preserve">
The quick
</w:t>
</w:r>
<w:del w:id="0" w:author="Alex Theg" w:date="2020-02-08T17:18:00Z">
<w:r w:rsidDel="00803599">
<w:delText xml:space="preserve">
brown fox
</w:delText>
</w:r>
</w:del>
<w:r>
<w:t>jumped over the lazy dog.</w:t>
</w:r>
</w:p>
Multi-paragraph deletions are handled similarly to insertions:
- Paragraphs in a multi-para deletion have a <w:pPr><w:rPr><w:del.../> set on the para level.
- The deletions are still tracked as regular deletions: wrapped in <w:del...> tags.
- For deletions, the text is inside <w:delText> tags, instead of <w:t> tags.
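The different handling of <w:t> and <w:delText> can be sketched as follows. This is an illustrative Python snippet, not part of XSweet: resolve_text is a hypothetical name, the sample paragraph is made up for the example, and only direct children of <w:p> are considered.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def q(name):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"

def resolve_text(paragraph, accept=True):
    """Hypothetical helper: paragraph text with all changes accepted or all rejected."""
    parts = []
    for child in paragraph:
        if child.tag == q("r"):
            # Untracked runs are always kept
            parts += [t.text or "" for t in child.iter(q("t"))]
        elif child.tag == q("ins") and accept:
            # Inserted text lives in ordinary <w:t> elements
            parts += [t.text or "" for t in child.iter(q("t"))]
        elif child.tag == q("del") and not accept:
            # Deleted text lives in <w:delText>, not <w:t>
            parts += [t.text or "" for t in child.iter(q("delText"))]
    return "".join(parts)

sample = ET.fromstring(
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>The quick </w:t></w:r>"
    '<w:del w:id="0" w:author="Alex Theg" w:date="2020-02-08T17:18:00Z">'
    "<w:r><w:delText>brown </w:delText></w:r></w:del>"
    '<w:ins w:id="1" w:author="Alex Theg" w:date="2020-02-08T17:19:00Z">'
    "<w:r><w:t>red </w:t></w:r></w:ins>"
    "<w:r><w:t>fox</w:t></w:r></w:p>"
)

print(resolve_text(sample, accept=True))   # The quick red fox
print(resolve_text(sample, accept=False))  # The quick brown fox
```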
Inline formatting changes
see issue #164 (closed)
Inline formatting changes are denoted by a <w:rPrChange...> tag that includes the track change id, author, and date data.
Formatting changes happen on the <w:r> level. The position of the format tag depends on whether the formatting is being applied or removed:
Added formatting follows the pattern below. Note that the applied format tag comes between <w:rPr> and <w:rPrChange...>, and the <w:rPr/> tag inside the <w:rPrChange...> is self-closing:
<w:r...>
<w:rPr>
<w:[NAME OF FORMAT TAG APPLIED]/>
<w:rPrChange w:id="[ID]" w:author="[AUTHOR]" w:date="[ISO8601 TIMESTAMP]">
<w:rPr/>
</w:rPrChange>
</w:rPr>
<w:t>Hello there.</w:t>
</w:r>
Here is the generic pattern for formatting removal. Note that the format tag being removed is inside the <w:rPr> tags inside the <w:rPrChange...> tags:
<w:r...>
<w:rPr>
<w:rPrChange w:id="[ID]" w:author="[AUTHOR]" w:date="[ISO8601 TIMESTAMP]">
<w:rPr>
<w:[NAME OF FORMAT TAG REMOVED]/>
</w:rPr>
</w:rPrChange>
</w:rPr>
<w:t>Hello there.</w:t>
</w:r>
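Both patterns boil down to comparing the run's current properties with the old properties stored inside <w:rPrChange>. A rough Python sketch of that comparison follows; it is an assumption about how one might classify the change, not XSweet's implementation, and formatting_delta, describe and make_run are hypothetical names:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def describe(prop):
    """Render a run property as 'name' or 'name=val', e.g. 'b' or 'vertAlign=subscript'."""
    name = prop.tag.rsplit("}", 1)[-1]
    val = prop.get(f"{{{W}}}val")
    return name if val is None else f"{name}={val}"

def formatting_delta(run_el):
    """Hypothetical helper: return (added, removed) formatting for a run with a tracked change."""
    rpr = run_el.find("w:rPr", NS)
    change = rpr.find("w:rPrChange", NS) if rpr is not None else None
    if change is None:
        return [], []
    # Current formatting: siblings of <w:rPrChange> inside the run's <w:rPr>
    current = {describe(e) for e in rpr if e.tag != f"{{{W}}}rPrChange"}
    # Previous formatting: children of the <w:rPr> nested inside <w:rPrChange>
    old_rpr = change.find("w:rPr", NS)
    previous = {describe(e) for e in old_rpr} if old_rpr is not None else set()
    return sorted(current - previous), sorted(previous - current)

def make_run(inner):
    """Build a sample run whose <w:rPr> contains the given markup."""
    return ET.fromstring(f'<w:r xmlns:w="{W}"><w:rPr>{inner}</w:rPr><w:t>x</w:t></w:r>')

CHANGE_OPEN = '<w:rPrChange w:id="1" w:author="Alex Theg" w:date="2020-02-05T19:18:00Z">'
added_bold = make_run(f"<w:b/>{CHANGE_OPEN}<w:rPr/></w:rPrChange>")
removed_sub = make_run(
    f'{CHANGE_OPEN}<w:rPr><w:vertAlign w:val="subscript"/></w:rPr></w:rPrChange>'
)

print(formatting_delta(added_bold))   # (['b'], [])
print(formatting_delta(removed_sub))  # ([], ['vertAlign=subscript'])
```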
Here are examples of inline formatting being applied and removed for:
- bold
- italics
- small caps
- subscript
- superscript
- highlighting
<!-- add bold -->
<w:r w:rsidR="00DD440B" w:rsidRPr="00DD440B">
<w:rPr>
<w:b/>
<w:rPrChange w:id="1" w:author="Alex Theg" w:date="2020-02-05T19:18:00Z">
<w:rPr/>
</w:rPrChange>
</w:rPr>
<w:t>bold</w:t>
</w:r>
<!-- remove bold -->
<w:r w:rsidRPr="008D6A23">
<w:rPr>
<w:rPrChange w:id="7" w:author="Alex Theg" w:date="2020-02-05T19:20:00Z">
<w:rPr>
<w:b/>
</w:rPr>
</w:rPrChange>
</w:rPr>
<w:t>bold</w:t>
</w:r>
<!-- add italics -->
<w:r w:rsidR="00DD440B" w:rsidRPr="00DD440B">
<w:rPr>
<w:i/>
<w:rPrChange w:id="2" w:author="Alex Theg" w:date="2020-02-05T19:18:00Z">
<w:rPr/>
</w:rPrChange>
</w:rPr>
<w:t>italics</w:t>
</w:r>
<!-- remove italics -->
<w:r w:rsidRPr="008D6A23">
<w:rPr>
<w:rPrChange w:id="8" w:author="Alex Theg" w:date="2020-02-05T19:20:00Z">
<w:rPr>
<w:i/>
</w:rPr>
</w:rPrChange>
</w:rPr>
<w:t>italics</w:t>
</w:r>
<!-- add small caps -->
<w:r w:rsidR="00DD440B" w:rsidRPr="00DD440B">
<w:rPr>
<w:smallCaps/>
<w:rPrChange w:id="3" w:author="Alex Theg" w:date="2020-02-05T19:18:00Z">
<w:rPr/>
</w:rPrChange>
</w:rPr>
<w:t>small-caps</w:t>
</w:r>
<!-- remove small caps -->
<w:r w:rsidRPr="008D6A23">
<w:rPr>
<w:rPrChange w:id="9" w:author="Alex Theg" w:date="2020-02-05T19:20:00Z">
<w:rPr>
<w:smallCaps/>
</w:rPr>
</w:rPrChange>
</w:rPr>
<w:t>small-caps</w:t>
</w:r>
<!-- add subscript -->
<w:r w:rsidR="00DD440B" w:rsidRPr="00DD440B">
<w:rPr>
<w:vertAlign w:val="subscript"/>
<w:rPrChange w:id="4" w:author="Alex Theg" w:date="2020-02-05T19:18:00Z">
<w:rPr/>
</w:rPrChange>
</w:rPr>
<w:t>subscript</w:t>
</w:r>
<!-- remove subscript -->
<w:r w:rsidRPr="008D6A23">
<w:rPr>
<w:rPrChange w:id="10" w:author="Alex Theg" w:date="2020-02-05T19:20:00Z">
<w:rPr>
<w:vertAlign w:val="subscript"/>
</w:rPr>
</w:rPrChange>
</w:rPr>
<w:t>subscript</w:t>
</w:r>
<!-- add superscript -->
<w:r w:rsidR="00DD440B" w:rsidRPr="00DD440B">
<w:rPr>
<w:vertAlign w:val="superscript"/>
<w:rPrChange w:id="5" w:author="Alex Theg" w:date="2020-02-05T19:19:00Z">
<w:rPr/>
</w:rPrChange>
</w:rPr>
<w:t>superscript</w:t>
</w:r>
<!-- remove superscript -->
<w:r w:rsidRPr="008D6A23">
<w:rPr>
<w:rPrChange w:id="11" w:author="Alex Theg" w:date="2020-02-05T19:20:00Z">
<w:rPr>
<w:vertAlign w:val="superscript"/>
</w:rPr>
</w:rPrChange>
</w:rPr>
<w:t>superscript</w:t>
</w:r>
<!-- add highlight -->
<w:r w:rsidR="00DD440B" w:rsidRPr="00DD440B">
<w:rPr>
<w:highlight w:val="yellow"/>
<w:rPrChange w:id="6" w:author="Alex Theg" w:date="2020-02-05T19:19:00Z">
<w:rPr/>
</w:rPrChange>
</w:rPr>
<w:t>highlight</w:t>
</w:r>
<!-- remove highlight -->
<w:r w:rsidRPr="008D6A23">
<w:rPr>
<w:rPrChange w:id="12" w:author="Alex Theg" w:date="2020-02-05T19:20:00Z">
<w:rPr>
<w:highlight w:val="yellow"/>
</w:rPr>
</w:rPrChange>
</w:rPr>
<w:t>highlight</w:t>
</w:r>
Whole-paragraph formatting changes
see issue #165
Whole-para formatting changes are reflected on the paragraph level under <w:pPr>, in a format similar to the one used on the runs. Just as para-level insertion and deletion marks can probably be ignored in general (with the possible exception of blank paras), my guess is that formatting changes marked at the para level can be ignored entirely, since the same formatting change appears to be marked on all the <w:r>s in the paragraph as well. Here's an example of a whole-para formatting change:
<w:p w14:paraId="2D7ADC77" w14:textId="31895971" w:rsidR="00C61225" w:rsidRPr="000432AB" w:rsidRDefault="00C61225">
<w:pPr>
<w:rPr>
<w:b/>
<w:rPrChange w:id="14" w:author="Alex Theg" w:date="2020-02-05T19:27:00Z">
<w:rPr/>
</w:rPrChange>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="000432AB">
<w:rPr>
<w:b/>
<w:rPrChange w:id="15" w:author="Alex Theg" w:date="2020-02-05T19:27:00Z">
<w:rPr/>
</w:rPrChange>
</w:rPr>
<w:t>Here's a whole-para formatting change. Does it show up where it would if it was just a word, or change the whole paragraph's properties?</w:t>
</w:r>
<w:r w:rsidR="000432AB" w:rsidRPr="000432AB">
<w:rPr>
<w:b/>
<w:rPrChange w:id="16" w:author="Alex Theg" w:date="2020-02-05T19:27:00Z">
<w:rPr/>
</w:rPrChange>
</w:rPr>
<w:t xml:space="preserve">
Bit at end selected</w:t>
</w:r>
</w:p>
Adjacent track changes
Sometimes, adjacent TC insertions are marked as multiple TCs in the OOXML but treated as a single TC insertion in Word:
<w:p w14:paraId="2FB11FDC" w14:textId="7955A033" w:rsidR="00105FD3" w:rsidRDefault="00105FD3">
<w:pPr>
<w:rPr><w:ins w:id="2" w:author="Alex Theg" w:date="2020-02-05T19:02:00Z"/></w:rPr>
</w:pPr>
<w:ins w:id="3" w:author="Alex Theg" w:date="2020-02-05T19:02:00Z">
<w:r>
<w:t xml:space="preserve">1 of 2
</w:t>
</w:r>
</w:ins>
<w:ins w:id="4" w:author="Alex Theg" w:date="2020-02-05T19:01:00Z">
<w:r>
<w:t xml:space="preserve">I'm turning on TC again now. The return between this and the previous TC-insertion para above it is not tracked. Now I'll hit enter twice and start a new
</w:t>
</w:r>
</w:ins>
<w:ins w:id="5" w:author="Alex Theg" w:date="2020-02-05T19:02:00Z">
<w:r>
<w:t>para, making this one long multi-para insertion</w:t>
</w:r>
</w:ins>
<w:ins w:id="6" w:author="Alex Theg" w:date="2020-02-05T19:01:00Z">
<w:r>
<w:t>.</w:t>
</w:r>
</w:ins>
</w:p>
<w:p w14:paraId="2661557F" w14:textId="77777777" w:rsidR="00105FD3" w:rsidRDefault="00105FD3">
<w:pPr>
<w:rPr><w:ins w:id="7" w:author="Alex Theg" w:date="2020-02-05T19:02:00Z"/></w:rPr>
</w:pPr>
</w:p>
<w:p w14:paraId="7F19C05A" w14:textId="719C1229" w:rsidR="00105FD3" w:rsidRDefault="00105FD3">
<w:pPr>
<w:rPr><w:ins w:id="8" w:author="Alex Theg" w:date="2020-02-05T19:04:00Z"/></w:rPr>
</w:pPr>
<w:ins w:id="9" w:author="Alex Theg" w:date="2020-02-05T19:02:00Z">
<w:r>
<w:t>2 of 2 It's still the same insertion</w:t>
</w:r>
</w:ins>
<w:ins w:id="10" w:author="Alex Theg" w:date="2020-02-05T19:03:00Z">
<w:r>
<w:t>. TC left on for the next enter,</w:t>
</w:r>
</w:ins>
<w:ins w:id="11" w:author="Alex Theg" w:date="2020-02-05T19:04:00Z">
<w:r>
<w:t xml:space="preserve">
but not the one after that.</w:t>
</w:r>
</w:ins>
</w:p>
The same thing goes for deletions: adjacent deletions marked separately in the OOXML are treated as a single deletion in Word:
<w:p w14:paraId="47DD4588" w14:textId="20155193" w:rsidR="00803599" w:rsidRDefault="00803599" w:rsidP="004A18F2">
<w:pPr><w:outlineLvl w:val="0"/></w:pPr>
<w:r>
<w:t>1. Can you have adjacent deletions in the underlying .</w:t>
</w:r><w:proofErr w:type="spellStart"/>
<w:r>
<w:t>docx</w:t>
</w:r><w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve">
that are treated as one
</w:t>
</w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
<w:del w:id="1" w:author="Alex Theg" w:date="2020-02-08T17:27:00Z">
<w:r w:rsidDel="00BF16DE">
<w:delText>TC</w:delText>
</w:r>
<w:r w:rsidDel="00573FCF">
<w:delText>, sim</w:delText>
</w:r>
</w:del>
<w:del w:id="2" w:author="Alex Theg" w:date="2020-02-08T17:25:00Z">
<w:r w:rsidDel="00AC7454">
<w:delText xml:space="preserve">ilar
</w:delText>
</w:r>
<w:r w:rsidDel="004A18F2">
<w:delText>to insertions</w:delText>
</w:r>
</w:del>
<w:r>
<w:t>?</w:t>
</w:r>
</w:p>
Adjacent but separately marked text insertions or text deletions are still treated by Word as a single one, even if they are non-sequential. E.g. if an author makes:
- Tracked insertion A
- Tracked insertion B somewhere else in the document
- Tracked insertion C adjacent to tracked insertion A
then the touching insertions A and C would be accepted or rejected together in Word.
Inline formatting appears similar: equal adjacent TC'd formatting changes are treated in Word as one change to be accepted or rejected together.
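One way to mirror Word's behavior when processing the XML is to group consecutive sibling <w:ins> (or <w:del>) elements before handling them. The Python sketch below is only an illustration: adjacent_groups is a hypothetical name, it assumes that any other element between two changes breaks the group, it does not compare authors, and the sample paragraph is stitched together from the fragments above.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def q(name):
    return f"{{{W}}}{name}"

def local_name(element):
    return element.tag.rsplit("}", 1)[-1]

def adjacent_groups(paragraph):
    """Hypothetical helper: merge runs of consecutive <w:ins> or <w:del> siblings."""
    groups = []
    for kind, members in groupby(paragraph, key=local_name):
        if kind not in ("ins", "del"):
            continue  # untracked runs, bookmarks, etc. separate the groups
        members = list(members)
        text_tag = q("t") if kind == "ins" else q("delText")
        ids = [int(m.get(q("id"))) for m in members]
        text = "".join(t.text or "" for m in members for t in m.iter(text_tag))
        groups.append((kind, ids, text))
    return groups

sample = ET.fromstring(
    f'<w:p xmlns:w="{W}">'
    '<w:ins w:id="3" w:author="Alex Theg" w:date="2020-02-05T19:02:00Z">'
    "<w:r><w:t>1 of 2 </w:t></w:r></w:ins>"
    '<w:ins w:id="4" w:author="Alex Theg" w:date="2020-02-05T19:01:00Z">'
    "<w:r><w:t>more</w:t></w:r></w:ins>"
    "<w:r><w:t> untracked </w:t></w:r>"
    '<w:del w:id="5" w:author="Alex Theg" w:date="2020-02-08T17:27:00Z">'
    "<w:r><w:delText>TC</w:delText></w:r></w:del>"
    '<w:del w:id="6" w:author="Alex Theg" w:date="2020-02-08T17:25:00Z">'
    "<w:r><w:delText>, sim</w:delText></w:r></w:del>"
    "</w:p>"
)

for group in adjacent_groups(sample):
    print(group)
# ('ins', [3, 4], '1 of 2 more')
# ('del', [5, 6], 'TC, sim')
```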
Current XSweet behavior (no handling)
The initial docx-html-extract.xsl doesn't currently recognize any of the track changes markup from the OOXML as important or worth saving. As such, it drops the surrounding tags and change data, but passes through the content tags within, since those are important. The result of this is effectively to:
- Accept all text additions, leaving them in place
- Reject all text deletions, as the text just gets passed through.
- formatting
Target formats
UPDATE:
- The HTML-like format for text insertions and deletions is reflected in the subsequent comments on this ticket
- #164 (closed) reflects the HTML-like format for formatting changes
- #172 (closed) reflects a newer target format for TCs for Wax 2
ORIGINAL:
From @christos, as a starting point:
Addition
<track-change status="add" user-id="1" username="demo" color='{"addition":"#4990e2","deletion":"#c00"}'> some addition</track-change>
Deletion
<track-change status="delete" user-id="1" username="demo" color='{"addition":"#4990e2","deletion":"#c00"}'>test</track-change>
Add format change
<track-change-format status="add-formating" oldtype="[]" addedtype='[{"username":"demo","type":"strong"}]' user-id="1" username="demo"><strong>test</strong></track-change-format>
Remove and add format change
<track-change-format status="delete-formating" oldtype='[{"username":"demo","type":"strong"}]' addedtype="[]" user-id="1" username="demo">
<track-change-format status="add-formating" oldtype="[]" addedtype='[{"username":"demo","type":"emphasis"}]' user-id="1" username="demo"><em>test</em></track-change-format>
</track-change-format>
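As a rough illustration of the mapping for text insertions and deletions, the Python sketch below turns a <w:ins> or <w:del> into a <track-change> element. It is a guess at the shape of the transformation, not the agreed design: to_track_change is a hypothetical name, username is assumed to come from w:author, and user-id and color are left out because the OOXML carries no equivalent data.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def q(name):
    return f"{{{W}}}{name}"

def to_track_change(change):
    """Hypothetical mapping from <w:ins>/<w:del> to a <track-change> element."""
    kind = change.tag.rsplit("}", 1)[-1]
    status = {"ins": "add", "del": "delete"}[kind]
    text_tag = q("t") if kind == "ins" else q("delText")
    # Assumption: username is taken from w:author; user-id and color are omitted
    out = ET.Element("track-change", {"status": status, "username": change.get(q("author"), "")})
    out.text = "".join(t.text or "" for t in change.iter(text_tag))
    return out

ins = ET.fromstring(
    f'<w:ins xmlns:w="{W}" w:id="12" w:author="Alex Theg" w:date="2020-02-05T19:07:00Z">'
    "<w:r><w:t> some addition</w:t></w:r></w:ins>"
)
deletion = ET.fromstring(
    f'<w:del xmlns:w="{W}" w:id="0" w:author="Alex Theg" w:date="2020-02-08T17:18:00Z">'
    "<w:r><w:delText>test</w:delText></w:r></w:del>"
)

added = to_track_change(ins)
removed = to_track_change(deletion)
print(ET.tostring(added, encoding="unicode"))
print(ET.tostring(removed, encoding="unicode"))
```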
Architecture
UPDATE: #167 exists for considering different options/modes for handling TCs
ORIGINAL:
- Initial XSweet extraction XSL sheet (docx-html-extract.xsl) should know whether to preserve the OOXML track changes markup at all. It's important that this first sheet can take a yes or no, because there's plenty of handling down-pipeline that cleans up the HTML by joining adjacent tags of the same type and attributes together, pushes formatting from inline to para level, etc., and I can see TC markup preventing a lot of that from happening. Thus, if someone doesn't need TC preserved, they should be able to opt out and get the best result from all those cleanups.
- Subsequent XSL sheets will need to know how to pass through the relevant OOXML track changes tags.
- The final-rinse.xsl sheet in the html-polish group would need to know how to handle track changes, if they're passed to it. At a minimum, this step would need to know whether to:
  - Keep the OOXML track change markup as-is, to be transformed into a target format in a subsequent step
  - Accept (or reject) all changes, to preserve XSweet's ability to output valid HTML, if that is the target format
Temp testing docx: beautified_document.xml Better_tc_test.docx