Commit b70cf299 authored by Alex Theg

cleaned bash and ruby scripts for xsweet

# XSweet scripts
These Ruby and Bash scripts let you run the XSweet suite of XSLTs locally and keep a copy of every step's input and output.
## Requirements
These scripts assume you have a Unix-like environment and Ruby installed (no specific version requirements have been tested).
## 1. Clone the repo
Clone this repo and `cd` into it:
`git clone https://gitlab.coko.foundation/XSweet/XSweet_scripts`
`cd XSweet_scripts`
## 2. Download the latest XSweet `master` branches
Run:
`ruby xsweet_downloader.rb`
This downloads the `master` branches of the 3 XSweet repositories as .zip files:
* XSweet Core: https://gitlab.coko.foundation/XSweet/XSweet/
* HTMLevator: https://gitlab.coko.foundation/XSweet/HTMLevator/
* Editoria Typescript: https://gitlab.coko.foundation/XSweet/editoria_typescript/
These are then unzipped, and `HTMLevator` and `Editoria Typescript` are moved into the `applications` directory of the XSweet Core directory (e.g. `XSweet-master-b58cce5612ef70d5da4e5dbd6cabd1babd510eca`) as `applications/htmlevator` and `applications/typescript`. This script also copies the `execute_chain.sh` file into a `scripts` directory inside the XSweet Core directory.
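To confirm that the downloader left everything in place, a small check along the following lines may help. This is only a sketch: the constant `EXPECTED_PATHS` and the method `missing_pieces` are hypothetical names, not part of this repo, and the paths are the ones the downloader script creates.

```ruby
# Hypothetical check (not part of this repo): report which of the pieces the
# downloader is expected to create are missing from the XSweet Core directory.
EXPECTED_PATHS = [
  "applications/htmlevator",
  "applications/typescript",
  "scripts/execute_chain.sh"
].freeze

def missing_pieces(core_dir)
  # Keep only the relative paths that do not exist under core_dir.
  EXPECTED_PATHS.reject { |rel| File.exist?(File.join(core_dir, rel)) }
end
```

Calling `missing_pieces` with the path of the XSweet Core directory returns an empty array when all three pieces are present.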
## 3. Add folders containing `.docx` files to the `to_convert` folder
For these scripts to work properly, `.docx` files _must_ be placed inside an additional folder in the `to_convert` folder. These scripts were meant to convert many books at a time, so they look for _directories_ inside the `to_convert` directory that contain one or more `.docx` files.
For example, these scripts will find and convert this .docx:
`to_convert/alex/alexs_book_file.docx`
A `.docx` placed directly in the `to_convert` directory, without an enclosing subdirectory, will _not_ be converted:
`to_convert/alexs_book_file.docx`
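The discovery rule can be sketched in Ruby as follows. It mirrors the glob patterns the scripts use, but the method name `find_convertible_docx` is hypothetical and the grouping into a hash is an illustrative choice, not something the scripts do.

```ruby
# Hypothetical helper (not part of this repo): list the .docx files that sit
# exactly one directory below convert_dir, grouped by that directory's name.
def find_convertible_docx(convert_dir)
  books = Dir[File.join(convert_dir, "*")].select { |cand| File.directory?(cand) }
  books.sort.each_with_object({}) do |book, found|
    docx = Dir[File.join(book, "*.docx")].sort
    # Directories without any .docx file are left out of the result.
    found[File.basename(book)] = docx.map { |path| File.basename(path) } unless docx.empty?
  end
end
```

A `.docx` lying directly in `convert_dir` never shows up in the result, because only subdirectories are searched.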
## 4. Create an unzipped copy of the `.docx` files
MS Word `.docx` files are really compressed archives. By changing a `.docx` extension to `.zip`, you can unzip these archives. These scripts rely on the raw, unzipped XML files to work.
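Because a `.docx` is a ZIP archive, its first four bytes are normally the ZIP local file header signature (`PK` followed by the bytes 0x03 and 0x04). The sketch below uses that to sanity-check a file before unzipping it; `ZIP_SIGNATURE` and `zip_archive?` are hypothetical names, not part of this repo.

```ruby
# Hypothetical helper (not part of this repo): true when the file begins with
# the ZIP local file header signature, as a non-empty .docx package does.
ZIP_SIGNATURE = "PK\x03\x04".b

def zip_archive?(path)
  # Read only the first four bytes, in binary mode.
  File.binread(path, 4) == ZIP_SIGNATURE
end
```

For example, `zip_archive?("to_convert/alex/alexs_book_file.docx")` should return `true` for a genuine Word file.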
If this is the first time you are converting a given `.docx` file, you will need to run the `docx_unzipper.rb` script.
`ruby docx_unzipper.rb`
This script:
* indexes all `.docx` files in the `to_convert` directory
* removes spaces from the `.docx` file names
* copies each `.docx` and changes the copy's file extension from `.docx` to `.zip`
* unzips each archive, then deletes the `.zip` file
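The naming involved in these steps can be sketched as a small pure function. The method name `unzip_names` is hypothetical; it takes a bare file name rather than a full path and only computes names, without touching the file system.

```ruby
# Hypothetical helper (not part of this repo): derive the names used while
# unzipping one .docx file, given its file name (without any directory part).
def unzip_names(docx_name)
  no_space = docx_name.split(" ").join   # remove spaces from the file name
  base = no_space.slice(0...-5)          # drop the 5-character ".docx" suffix
  { docx: no_space, zip: base + ".zip", unzipped_dir: base }
end

p unzip_names("alexs book file.docx")
```

The `:unzipped_dir` value is the directory that ends up alongside the renamed `.docx` once the temporary `.zip` has been deleted.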
After the script has run, you'll have an unzipped copy of every `.docx` file alongside the original file. Leave both the original and unzipped files in place: the next script searches for `.docx` files to convert, but pulls the underlying data for conversion from the corresponding unzipped directory.
You only need to run this script once each time you add new `.docx` files to convert - once you've got a `.docx` and its unzipped counterpart, you can run the conversions any number of times.
## 5. Run the conversions
The command:
`ruby xsweet_runner.rb`
runs the conversion chain (as specified in the `execute_chain.sh` file) on every unzipped `.docx` file in the `to_convert` directory. The converted files are written to an `outputs` directory inside the XSweet Core directory, in a subdirectory named after each book folder.
* The first step specified by the `execute_chain.sh` script uses the `.docx` file's `word/document.xml` file (and other data files from the Word file) as its input.
* With a few exceptions, each step outputs a new file, and the next step uses the previous output as its input.
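The chaining described in these two points can be modeled as follows. This is an illustrative sketch only: `chain_plan` is a hypothetical method, the paths are made up, and the three step labels are simply the first three used in `execute_chain.sh`. It does not run Saxon; it just shows how each step's output becomes the next step's input.

```ruby
# Hypothetical model of the conversion chain (not part of this repo): build
# the input/output pair for each step, feeding every output into the next step.
def chain_plan(first_input, out_prefix, steps)
  input = first_input
  steps.each_with_index.map do |label, i|
    output = "#{out_prefix}-#{i + 1}#{label}.xhtml"
    step = { input: input, output: output }
    input = output   # the next step consumes this step's output
    step
  end
end

plan = chain_plan("book/ch1/word/document.xml", "outputs/book/ch1", %w[EXTRACTED NOTES SCRUBBED])
plan.each { |s| puts "#{s[:input]} -> #{s[:output]}" }
```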
## Important notes
These scripts were built for rapid development and testing of XSweet, and as such, there are some important gotchas to be aware of:
* All scripts _must_ be run from the root directory of this repo.
* You can reconfigure which XSL sheets are run and in what order by editing the `execute_chain.sh` script. But note that when the `xsweet_downloader.rb` script is run, in addition to downloading the latest XSweet repos, it also copies the `execute_chain.sh` script from this repo's root directory into a `scripts` directory in the XSweet Core directory. It is this copy that `xsweet_runner.rb` invokes. Thus, modifying the `execute_chain.sh` file in the `XSweet_scripts` root directory has no effect on the conversion until you either copy it into the XSweet Core directory's `scripts` folder or rerun the `xsweet_downloader.rb` script.
* `xsweet_runner.rb` determines the XSweet Core repository to use by searching this repo's root directory for a directory whose name starts with "XSweet" (e.g. "XSweet-master-b58cce5612ef70d5da4e5dbd6cabd1babd510eca"). If you want to download a new version of XSweet but keep the old copy you were working from, either move the old copy to a different directory or prepend something to its directory name so that it no longer starts with "XSweet".
* Finally, these scripts are not maintained, supported, etc. - they are merely handy development tools to use as-is.
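Since the runner uses the copy of `execute_chain.sh` inside the XSweet Core directory, it can be handy to check whether that copy still matches the one in the repo root. The sketch below is one possible way to do so; `chain_in_sync?` is a hypothetical method, and it assumes there is a single directory whose name starts with "XSweet" in the root (with several, it picks the last one in sorted order, which may differ from the runner's choice).

```ruby
require 'fileutils'

# Hypothetical helper (not part of this repo): true when the root copy of
# execute_chain.sh is byte-for-byte identical to the copy the runner invokes.
def chain_in_sync?(root_dir)
  core = Dir.glob(File.join(root_dir, "XSweet*")).select { |d| File.directory?(d) }.sort.last
  return false if core.nil?

  source = File.join(root_dir, "execute_chain.sh")
  deployed = File.join(core, "scripts", "execute_chain.sh")
  File.exist?(source) && File.exist?(deployed) && FileUtils.compare_file(source, deployed)
end
```

Running `chain_in_sync?(Dir.getwd)` from the repo root returns `false` after you edit the root `execute_chain.sh` and before you copy it across.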
# docx_unzipper.rb
# Makes an unzipped copy of every .docx found in to_convert/<book>/.
require 'fileutils' # lowercase: 'FileUtils' fails to load on case-sensitive file systems

convertDir = Dir.getwd + "/to_convert"
book_list = Dir["#{convertDir}/*"].select { |cand| File.directory? cand }

book_list.each do |book|
  docx_list = Dir["#{book}/*.docx"]
  docx_list.each do |docx|
    # Remove spaces from the file name only, leaving the directory path alone.
    no_space = File.join(File.dirname(docx), File.basename(docx).split(" ").join)
    FileUtils.mv(docx, no_space) if docx != no_space

    # Copy the .docx to a .zip, then unzip it into a directory named after the file.
    zip_name_no_ext = no_space.slice(0...-5)
    zip_name_ext = zip_name_no_ext + ".zip"
    FileUtils.cp(no_space, zip_name_ext)
    # system blocks until unzip exits, so no sleep is needed before cleanup.
    system("unzip", "-q", zip_name_ext, "-d", zip_name_no_ext)
    FileUtils.rm(zip_name_ext)
  end
end
#!/bin/bash
# For producing HTML5 outputs via XSweet XSLT from sources extracted from .docx (Office Open XML)
# Usage (run from the XSweet Core 'scripts' directory):
#   sh execute_chain.sh path/to/book_dir bookname docname
# $P is the directory that holds the unzipped .docx directories for one book
# $BOOKNAME names the subdirectory of ../outputs that receives the results
# $DOCNAME is the name of the unzipped .docx directory; it also prefixes every output file
P=$1
BOOKNAME=$2
DOCNAME=$3
# The input is $P/$DOCNAME/word/document.xml, the file extracted (unzipped) from a .docx file
# (Also, its neighbor files from the .docx package should be available.)
# Passing these values as arguments makes it possible to call this script
# from another one and even loop over file sets.
# Note Saxon is included with this distribution, qv for license.
saxonHE="java -jar ../lib/SaxonHE9-8-0-1J/saxon9he.jar" # SaxonHE (XSLT 2.0 processor)
# EXTRACTION
EXTRACT="../applications/docx-extract/docx-html-extract.xsl" # "Extraction" stylesheet
# NOTE: RUNS TABLE EXTRACTION FROM INSIDE EXTRACT
NOTES="../applications/docx-extract/handle-notes.xsl" # "Refinement" stylesheets
SCRUB="../applications/docx-extract/scrub.xsl"
JOIN="../applications/docx-extract/join-elements.xsl"
COLLAPSEPARA="../applications/docx-extract/collapse-paragraphs.xsl"
LINKS="../applications/htmlevator/applications/hyperlink-inferencer/hyperlink-inferencer.xsl"
PROMOTELISTS="../applications/list-promote/PROMOTE-lists.xsl"
# NOTE: RUNS mark-lists, then itemize-lists
# HEADER PROMOTION
HEADERCHOOSEANDPROMOTE="../applications/htmlevator/applications/header-promote/header-promotion-CHOOSE.xsl"
DIGESTPARA="../applications/htmlevator/applications/header-promote/digest-paragraphs.xsl"
MAKEHEADERXSLT="../applications/htmlevator/applications/header-promote/make-header-escalator-xslt.xsl"
FINALRINSE="../applications/html-polish/final-rinse.xsl"
# TYPESCRIPT
UCPTEXT="../applications/htmlevator/applications/ucp-cleanup/ucp-text-macros.xsl"
UCPMAP="../applications/htmlevator/applications/ucp-cleanup/ucp-mappings.xsl"
SPLITONBR="../applications/typescript/p-split-around-br.xsl"
EDITORIANOTES="../applications/typescript/editoria-notes.xsl"
EDITORIABASIC="../applications/typescript/editoria-basic.xsl"
EDITORIAREDUCE="../applications/typescript/editoria-reduce.xsl"
XMLTOHTML5="../applications/html-polish/html5-serialize.xsl"
# INDUCESECTIONS="../applications/htmlevator/applications/induce-sections/induce-sections.xsl"
# Intermediate and final outputs (serializations) are all left on the file system.
$saxonHE -xsl:$EXTRACT -s:$P/$DOCNAME/word/document.xml -o:../outputs/$BOOKNAME/$DOCNAME-1EXTRACTED.xhtml
echo Made $DOCNAME-1EXTRACTED.xhtml
$saxonHE -xsl:$NOTES -s:../outputs/$BOOKNAME/$DOCNAME-1EXTRACTED.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-2NOTES.xhtml
echo Made $DOCNAME-2NOTES.xhtml
$saxonHE -xsl:$SCRUB -s:../outputs/$BOOKNAME/$DOCNAME-2NOTES.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-3SCRUBBED.xhtml
echo Made $DOCNAME-3SCRUBBED.xhtml
$saxonHE -xsl:$JOIN -s:../outputs/$BOOKNAME/$DOCNAME-3SCRUBBED.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-4JOINED.xhtml
echo Made $DOCNAME-4JOINED.xhtml
$saxonHE -xsl:$COLLAPSEPARA -s:../outputs/$BOOKNAME/$DOCNAME-4JOINED.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-5COLLAPSED.xhtml
echo Made $DOCNAME-5COLLAPSED.xhtml
$saxonHE -xsl:$LINKS -s:../outputs/$BOOKNAME/$DOCNAME-5COLLAPSED.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-6LINKS.xhtml
echo Made $DOCNAME-6LINKS.xhtml
$saxonHE -xsl:$PROMOTELISTS -s:../outputs/$BOOKNAME/$DOCNAME-6LINKS.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-7PROMOTELISTS.xhtml
echo Made $DOCNAME-7PROMOTELISTS.xhtml
$saxonHE -xsl:$HEADERCHOOSEANDPROMOTE -s:../outputs/$BOOKNAME/$DOCNAME-7PROMOTELISTS.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-8HEADERSPROMOTED.xhtml
echo Made $DOCNAME-8HEADERSPROMOTED.xhtml
# CLASSIC HP
# $saxonHE -xsl:$DIGESTPARA -s:../outputs/$BOOKNAME/$DOCNAME-7PROMOTELISTS.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-8DIGESTEDPARA.xhtml
# echo Made $DOCNAME-8DIGESTEDPARA.xhtml
#
# $saxonHE -xsl:$MAKEHEADERXSLT -s:../outputs/$BOOKNAME/$DOCNAME-8DIGESTEDPARA.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-9BESPOKEHEADERXSLT.xsl
# echo Made $DOCNAME-9BESPOKEHEADERXSLT.xsl
# HEADERXSL="../outputs/$BOOKNAME/$DOCNAME-9BESPOKEHEADERXSLT.xsl"
#
# $saxonHE -xsl:$HEADERXSL -s:../outputs/$BOOKNAME/$DOCNAME-7PROMOTELISTS.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-10CLASSICHEADERSPROMOTED.xhtml
# echo Made $DOCNAME-10CLASSICHEADERSPROMOTED.xhtml
$saxonHE -xsl:$FINALRINSE -s:../outputs/$BOOKNAME/$DOCNAME-8HEADERSPROMOTED.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-9RINSED.xhtml
echo Made $DOCNAME-9RINSED.xhtml
$saxonHE -xsl:$UCPTEXT -s:../outputs/$BOOKNAME/$DOCNAME-9RINSED.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-10UCPTEXTED.xhtml
echo Made $DOCNAME-10UCPTEXTED.xhtml
$saxonHE -xsl:$UCPMAP -s:../outputs/$BOOKNAME/$DOCNAME-10UCPTEXTED.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-11UCPMAPPED.xhtml
echo Made $DOCNAME-11UCPMAPPED.xhtml
$saxonHE -xsl:$SPLITONBR -s:../outputs/$BOOKNAME/$DOCNAME-11UCPMAPPED.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-12SPLITONBR.xhtml
echo Made $DOCNAME-12SPLITONBR.xhtml
$saxonHE -xsl:$EDITORIANOTES -s:../outputs/$BOOKNAME/$DOCNAME-12SPLITONBR.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-13EDITORIANOTES.xhtml
echo Made $DOCNAME-13EDITORIANOTES.xhtml
$saxonHE -xsl:$EDITORIABASIC -s:../outputs/$BOOKNAME/$DOCNAME-13EDITORIANOTES.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-14EDITORIABASIC.xhtml
echo Made $DOCNAME-14EDITORIABASIC.xhtml
$saxonHE -xsl:$EDITORIAREDUCE -s:../outputs/$BOOKNAME/$DOCNAME-14EDITORIABASIC.xhtml -o:../outputs/$BOOKNAME/$DOCNAME-15EDITORIAREDUCE.html
echo Made $DOCNAME-15EDITORIAREDUCE.html
$saxonHE -xsl:$XMLTOHTML5 -s:../outputs/$BOOKNAME/$DOCNAME-15EDITORIAREDUCE.html -o:../outputs/$BOOKNAME/$DOCNAME-16HTML5.html
echo Made $DOCNAME-16HTML5.html
# xsweet_downloader.rb
# Downloads the master branches of the three XSweet repos and assembles them.
require 'httparty'
require 'fileutils' # lowercase: 'FileUtils' fails to load on case-sensitive file systems

archives = {
  "xsweet.zip" => "https://gitlab.coko.foundation/XSweet/XSweet/repository/archive.zip?ref=master",
  "typescript.zip" => "https://gitlab.coko.foundation/XSweet/editoria_typescript/repository/archive.zip?ref=master",
  "htmlevator.zip" => "https://gitlab.coko.foundation/XSweet/HTMLevator/repository/archive.zip?ref=master"
}

archives.each do |file_name, url|
  zip_path = File.join(Dir.pwd, file_name)
  # Write in binary mode so the archive bytes are stored unchanged.
  File.open(zip_path, "wb") { |file| file.write(HTTParty.get(url).body) }
  # unzip blocks until extraction finishes; it replaces the macOS-only `open` plus sleep.
  system("unzip", "-q", zip_path)
  FileUtils.rm(zip_path)
end

typescript_name = Dir.glob("editoria_typescript*").pop
xsweet_name = Dir.glob("XSweet-*").pop
htmlevator_name = Dir.glob("HTMLevator-*").pop
puts "typescript_name"
puts typescript_name
puts "xsweet_name"
puts xsweet_name
puts "htmlevator_name"
puts htmlevator_name

# Nest the two add-on repos and the chain script inside the XSweet Core directory.
FileUtils.mv(typescript_name, "#{xsweet_name}/applications/typescript")
FileUtils.mv(htmlevator_name, "#{xsweet_name}/applications/htmlevator")
FileUtils.mkdir_p("#{xsweet_name}/scripts")
FileUtils.cp("./execute_chain.sh", "#{xsweet_name}/scripts")
# xsweet_runner.rb
# Runs execute_chain.sh on every unzipped .docx directory under to_convert/<book>/.
xsweet_script_path = Dir.glob("XSweet*").pop
puts xsweet_script_path
rootDir = Dir.getwd
convertDir = rootDir + "/to_convert"
book_list = Dir["#{convertDir}/*"].select { |cand| File.directory? cand }
book_list.delete("#{convertDir}/temp")

puts "BOOK LIST"
number_label = 1
book_list.each do |book|
  puts "#{number_label}: #{book}"
  number_label += 1
end
puts "BOOK LIST END"

book_list.each do |book|
  book = book.split("/").last
  puts "BOOK: #{book}"
  # Each directory inside a book folder is an unzipped .docx.
  file_path_list = Dir["#{convertDir}/#{book}/*"].select { |cand| File.directory? cand }
  # puts file_path_list
  file_list = []
  file_path_list.each do |file_path|
    file_list << file_path.split("/").last
  end
  # The chain script uses relative paths, so run it from the XSweet Core scripts directory.
  newDir = rootDir + "/#{xsweet_script_path}/scripts"
  Dir.chdir newDir
  file_list.each do |chapter|
    puts "converting: #{chapter}"
    %x`sh execute_chain.sh #{convertDir}/#{book} #{book} #{chapter}`
  end
  puts "done with #{book}"
end