I’ve been writing document assembly software for the past few weeks and I’ve found some interesting challenges in the process. One is an issue where some placeholders in the document template break after I edit the template file manually.
Before I show you the details and the solution I found, let me give you some context by reviewing what document assembly is.
Document Assembly Systems
An automated document assembly system, or document automation system, merges data into a document template to generate electronic documents. The data can come from a repository or be collected on the fly through a user interface.
There can be various levels of complexity in this process, ranging from simple data insertion into the document template, to applying complex logic and conditional transformations to the source data during the merge process. Think about hiding or showing paragraphs, inserting or removing list items, and highlighting or underlining text blocks.
For the merge to work, someone must mark the document template’s content with placeholders that the assembly software will use to perform the required transformations.
There are multiple ways to create these placeholders. In my particular case, I’m using a Microsoft Word document with simple text placeholders delimited by square bracket characters: “[” and “]”. Here’s a screenshot showing a fragment of a template:
This requirement came from the users of the system. They have performed this assembly process manually for years and use this type of placeholder in the process.
Document Assembly and the Open XML Format
I use Open XML in my assembly software to replace each placeholder with the corresponding data. Microsoft developed Open XML as a file format for word processing documents, spreadsheets and presentations. It has been the default format since Microsoft Office 2007 (we’re using Office 2010 in our document assembly system), and it’s connected to the issue I was having with the document template’s placeholders.
A Word document in Open XML format is a zip file (although it has the docx extension) that contains several XML files. These files carry the contents of the document, formatting information and other types of metadata. If you change a Word document’s extension from docx to zip and open the zip file, you will see something like this:
If you open the word folder, you will see a set of XML files that describe the contents of the Word document:
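The package layout can also be inspected from code, without renaming the file. The following is a minimal Python sketch using only the standard library; it is not the software described in this article, and the function names `list_docx_parts` and `read_main_document` are hypothetical.

```python
import zipfile


def list_docx_parts(docx_file):
    """Return the names of the files stored inside a .docx package.

    docx_file may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        return package.namelist()


def read_main_document(docx_file):
    """Return the raw XML of word/document.xml as a string."""
    with zipfile.ZipFile(docx_file) as package:
        return package.read("word/document.xml").decode("utf-8")
```

Calling `list_docx_parts("template.docx")` on a real template (the file name is an assumption) should return entries such as `word/document.xml` alongside the package's other parts.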
I perform the data merge by opening the document.xml file depicted in the screenshot above, searching for each placeholder instance and replacing it with the corresponding field from a data file.
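As a rough illustration of this kind of merge, here is a minimal Python sketch. It is an assumption about the general approach, not my actual implementation: the names `merge_placeholders` and `assemble` are hypothetical, and it treats document.xml as plain text instead of going through an Open XML library.

```python
import re
import zipfile
from xml.sax.saxutils import escape

# A placeholder is any bracketed name such as [MeetingWeekDayMonthYear].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def merge_placeholders(document_xml, fields):
    """Replace each [Name] whose name is a key in fields with its value."""
    def substitute(match):
        name = match.group(1)
        if name in fields:
            # Escape &, < and > so the result is still valid XML.
            return escape(str(fields[name]))
        return match.group(0)  # unknown placeholder: leave untouched

    return PLACEHOLDER.sub(substitute, document_xml)


def assemble(template_file, output_file, fields):
    """Copy the template package, merging data into word/document.xml."""
    with zipfile.ZipFile(template_file) as src, \
            zipfile.ZipFile(output_file, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                merged = merge_placeholders(data.decode("utf-8"), fields)
                data = merged.encode("utf-8")
            dst.writestr(item, data)
```

A text-level search like this only finds a placeholder when its full text, brackets included, sits inside a single text element.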
The Placeholders Challenge
Back to the issue I was having, I found that sometimes my software would miss placeholders after I manually edited the document template in Microsoft Word. The placeholders would look normal when I opened the template in Word to check for problems. Here’s an example:
Then, when I opened the XML file and checked the [MeetingWeekDayMonthYear] placeholder, I would see this:
<w:r w:rsidR="00EF5483" w:rsidRPr="004F0A1A">
  <w:rPr>
    <w:b/>
    <w:spacing w:val="-3"/>
    <w:szCs w:val="24"/>
  </w:rPr>
  <w:t>[</w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidR="00EF5483" w:rsidRPr="004F0A1A">
  <w:rPr>
    <w:b/>
    <w:spacing w:val="-3"/>
    <w:szCs w:val="24"/>
  </w:rPr>
  <w:t>MeetingWeekDayMonthYear</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r w:rsidR="00EF5483" w:rsidRPr="004F0A1A">
  <w:rPr>
    <w:b/>
    <w:spacing w:val="-3"/>
    <w:szCs w:val="24"/>
  </w:rPr>
  <w:t>]</w:t>
</w:r>
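As you can see, the square bracket characters “[” and “]” are separated from the rest of the placeholder. Each one sits in its own run (the “w:r” element), so the text “[MeetingWeekDayMonthYear]” never appears as one contiguous string in the XML. That’s why the software can’t find it.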
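A template can be checked for this condition before it goes into use. The sketch below is a hypothetical diagnostic, not part of my assembly software: it joins the text of each paragraph's `w:t` elements and reports any placeholder that is visible in the joined text but missing from every individual element. The sample XML is a trimmed version of the fragment above, wrapped in a `w:p` element with the WordprocessingML namespace declared.

```python
import re
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PLACEHOLDER = re.compile(r"\[[^\[\]]+\]")


def find_split_placeholders(document_xml):
    """Return placeholders whose text is spread over several w:t elements.

    Simplification: a placeholder that occurs both intact and split in the
    same paragraph is not reported.
    """
    root = ET.fromstring(document_xml)
    split = []
    for paragraph in root.iter(W + "p"):
        texts = [t.text or "" for t in paragraph.iter(W + "t")]
        visible = "".join(texts)  # what the reader sees in Word
        for placeholder in PLACEHOLDER.findall(visible):
            if not any(placeholder in text for text in texts):
                split.append(placeholder)
    return split


SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t>[</w:t></w:r>'
    '<w:proofErr w:type="spellStart"/>'
    '<w:r><w:t>MeetingWeekDayMonthYear</w:t></w:r>'
    '<w:proofErr w:type="spellEnd"/>'
    '<w:r><w:t>]</w:t></w:r>'
    '</w:p>'
)

print("[MeetingWeekDayMonthYear]" in SAMPLE)  # False: a plain search misses it
print(find_split_placeholders(SAMPLE))        # ['[MeetingWeekDayMonthYear]']
```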
The first thought that came to mind to solve this was to “freeze” the template once the system went into production mode. I quickly discarded it because the system’s users frequently request changes to the template for legitimate reasons.
When Spell Checking Doesn’t Help
I had to think about what could “break” the placeholders. Going back to the XML, what separates the square bracket characters from the “MeetingWeekDayMonthYear” text is the pair of “w:proofErr” elements. They mark the start and end of a range flagged by Word’s spell checker, which means Word is checking the text “MeetingWeekDayMonthYear” and not the bracket characters. That’s OK with me, except that it’s breaking my placeholder.
The observation led me to believe that if I turned off spell checking for the placeholder, Word would keep the placeholder in one piece. That’s what I tried and it worked!
To turn off spell checking for a placeholder, I selected the placeholder and opened the Language/Set Proofing Language menu in the Review tab:
When the Language window popped up, I checked the “Do not check spelling or grammar” option:
I saved the file after making this change and checked the XML again. The placeholder was back in a single run, which now carries a “w:noProof” element:
<w:r w:rsidR="00CB7293" w:rsidRPr="004F0A1A">
  <w:rPr>
    <w:b/>
    <w:noProof/>
    <w:spacing w:val="-3"/>
    <w:szCs w:val="24"/>
  </w:rPr>
  <w:t>[MeetingWeekDayMonthYear]</w:t>
</w:r>
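Since users keep editing the template, it may help to check that every placeholder still has proofing turned off. The following Python sketch is one possible check, not something my software currently does; the function name `placeholders_still_proofed` is hypothetical. It lists intact placeholders whose run has no `w:noProof` element in its run properties.

```python
import re
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PLACEHOLDER = re.compile(r"\[[^\[\]]+\]")


def placeholders_still_proofed(document_xml):
    """Return intact placeholders whose run lacks a w:noProof element.

    Simplification: any w:noProof element counts as "proofing off",
    even if it carries an explicit w:val attribute that disables it.
    """
    root = ET.fromstring(document_xml)
    flagged = []
    for run in root.iter(W + "r"):
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        no_proof = run.find(f"{W}rPr/{W}noProof")
        if no_proof is None:
            flagged.extend(PLACEHOLDER.findall(text))
    return flagged
```

Placeholders returned by this function are candidates for the “Do not check spelling or grammar” setting the next time the template is edited.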
Summary and Next Steps
It turned out that disabling the spell checker for the text in the placeholders solved the issue I describe in this article. In future articles, I will cover other issues I’ve found while working on this document assembly software.
Quick questions for you – Have you written software for manipulating documents? What challenges have you found? Please leave a note in the comments section below.