In my previous post I wrote about the evolution of the Echelon Reference Series. So far there have been four stages:
- Raw copy and paste aggregation. Ultimately not useful to me because it threw away so much information without gaining me much.
- Aggregating by source document, marking up in Word using styles specific to the problem domain. Ultimately too specific, not abstract enough, and my workflow was prone to error… as illustrated very clearly when a moment’s inattention blew away almost everything. Oops… but it cleared the deck for the next version.
- As second iteration, but slightly more abstract, making it easier to handle in code. More importantly, started using source control (as I should have been from the start). Also started to parse and gain information from the text itself, allowing automated linking and cross-referencing of content. Ultimately insufficient because it lacked fine control over layout, and the automated data extraction similarly lacked fine control. Released: ERS: Barbarians, ERS: Clerics, and ERS: Sorcerers.
- Revised document construction mechanisms. Made workflow more efficient by building a common index file, then linking and cross-referencing information as new content was added. While it is getting closer, I realized there is opportunity for another level of abstraction in content (which I’ll talk about below) and combining the document files (as opposed to data files) so I need only maintain one set for all six versions (see below) of each. Released: ERS: Rogues, ERS: Fighters, ERS: Monks, and ERS: Rangers in RAF (except Monks, which is WIP — Work In Progress, the next stage), and will release the other ERS class books and the ERS spell books (sample ERS: Elemental Wizard Spells, PWYW).
Now to describe the next generation of the Echelon Reference Series.
Next Generation Data Geekery
Grab your pocket protectors, this is going to be a ride.
The biggest limitation I’ve had with the Echelon Reference Series lately is around data selection, and redundancy in document scaffolding. What does that mean? Well…
- Almost every title of the Echelon Reference Series has two main flavors: PRD-Only, and 3pp+PRD. The first has material only from the PRD (well, mostly… I don’t count my augmentations), the second includes select third-party material.
- Each book in the Echelon Reference Series has three versions, at different stages of development. I started releasing the early-stage versions (at a discount, see below) so they at least exist, and I can then improve on them.
- The RAF (‘Rough And Fast’, politely) version has the basic text content, but not much more. No diagrams, no additional useful redundancy (such as applying the archetypes to the associated class to build a ‘new class’ and see what it looks like when the archetype is used). While I do look for places where items are referenced (such as feats and skills) I might not have yet done it exhaustively. This version of the product sells at 50% off (75% off in bundles) because it’s not complete, and buyers will also get the WIP and Final versions at no extra charge when they become available.
- The WIP (‘Work in Progress’) is more developed. The text has been gone over more thoroughly, and I’ve added many diagrams (but possibly not all). I probably don’t have the archetype classes in place, but I might have started organizing things better. This version of the product sells at 25% off (50% in bundles) because while they’re getting closer, they’re not actually done yet. Again, buyers will also get the Final versions at no extra charge when they are available.
- The Final version has the diagrams and the archetype classes, and so on. I’m done with this, until I add more content. This version no longer has a discount on its own, but does have a 25% discount in bundles.
I have found while working on them, though, that even though the earlier versions aren’t as complete as I’m aiming for, they’re still pretty good. The RAF version is, after I finish cleaning the text up, a pretty close copy of the source material, but organized and consistently formatted. I can see many people preferring this version, in fact. Similarly, the WIP has that and the diagrams, without some of the redundant text that provides context: I can see people preferring this version because it doesn’t have excess material, but presents the rest of the content in a more approachable manner.
As a result I’ve pretty much decided to release each document with RAF, WIP, and Final included, when and as available, with discounts for ‘buying early’ (before Final version). This means I’ll be maintaining six versions of each title, and the current framework… does not do that well. What was more or less manageable with two versions (PRD-Only and 3pp+PRD) becomes difficult with six versions of each document.
A couple years ago I described workflows for extracting data from Word files, so I won’t describe it here again… except as ‘painfully complicated and prone to error’. I’ve largely worked around the problems, but even now I run into problems when a line ends (or starts? I forget) on an HTML element, in which case the space that should follow it gets removed. This leads to cases where I get text like
Bloodline Spells magic missile(3rd)
If you look closely, there is no space character between ‘magic missile’ and ‘(3rd)’. I have not found an acceptable workaround for this that is easy and consistently effective.
Moving ahead, I will instead convert the Word files to ‘WordprocessingML’, an XML representation of the file’s internal Word structure. This starts in a useful character encoding (UTF-8) rather than the less-than-useful windows-1252, and more importantly does not need HTML Tidy (which appears to have a lot of influence on the problem). This means that once the content leaves Word it will be in happy XML, where it is easy for me to get at.
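As a rough sketch of what reading that XML might look like (this is not the actual ERS tooling), the snippet below pulls the paragraph style and text out of a WordprocessingML fragment using Python's standard library. It assumes the Office Open XML namespace used inside .docx files; the older Word 2003 XML format uses a different namespace. The style ID `d20Abstract` and the sample fragment are made up for illustration. Concatenating the `w:t` runs as-is keeps the space before ‘(3rd)’.

```python
import xml.etree.ElementTree as ET

# Assumption: Office Open XML (docx) namespace; Word 2003 XML differs.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical fragment: one paragraph split into three formatted runs.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="d20Abstract"/></w:pPr>
      <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">Bloodline Spells </w:t></w:r>
      <w:r><w:rPr><w:i/></w:rPr><w:t>magic missile</w:t></w:r>
      <w:r><w:t xml:space="preserve"> (3rd)</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraphs(xml_text):
    """Return a list of (style_id, text) pairs, one per w:p element."""
    root = ET.fromstring(xml_text)
    result = []
    for p in root.iter(f"{{{W_NS}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # Join the text of every run without trimming, so spaces at run
        # boundaries survive.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        result.append((style_id, text))
    return result


if __name__ == "__main__":
    for style_id, text in paragraphs(SAMPLE):
        print(style_id, repr(text))
```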
Document Creation, Single Sourcing
I realized that with some changes to how I capture the information I can probably get each title down to a single source document (plus my data store, of course). Major sections simply copy content from my data store into the output document. For instance, a chapter containing all the rogue talents contains the chapter title and introductory text, then a long list of object IDs of content to copy from source into this document. Right now the PRD-Only and 3pp+PRD rogue talent chapters are different files, but I realized that if the IDs are properly unique I can get away with a single list and plug into different data stores depending on version. If I plug into the 3pp+PRD data file all the identified objects will be copied, if I plug into the PRD-Only data file then many of the identified objects (the 3pp ones) won’t be copied. I achieve my PRD-Only/3pp+PRD split with no further effort… at least as far as data objects are concerned.
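A minimal sketch of that idea, assuming each data store is simply a mapping from object ID to object. The function name `copy_objects`, the store layout, and the third-party talent are hypothetical, used only to show the mechanism: one ID list, two stores, and IDs missing from a store are skipped.

```python
def copy_objects(wanted_ids, data_store):
    """Return the objects whose IDs exist in this store, in chapter order."""
    return [data_store[obj_id] for obj_id in wanted_ids if obj_id in data_store]


# One list of IDs for the chapter, shared by every version of the book.
CHAPTER_IDS = [
    "rogue-talent.bleeding-attack",
    "rogue-talent.combat-trick",
    "rogue-talent.hypothetical-3pp-talent",  # made-up third-party entry
]

# Hypothetical data stores: the 3pp+PRD store is a superset of PRD-Only.
PRD_ONLY = {
    "rogue-talent.bleeding-attack": {"name": "Bleeding Attack"},
    "rogue-talent.combat-trick": {"name": "Combat Trick"},
}
THIRD_PARTY_PLUS_PRD = {
    **PRD_ONLY,
    "rogue-talent.hypothetical-3pp-talent": {"name": "Hypothetical 3pp Talent"},
}

if __name__ == "__main__":
    for label, store in (("PRD-Only", PRD_ONLY), ("3pp+PRD", THIRD_PARTY_PLUS_PRD)):
        names = [obj["name"] for obj in copy_objects(CHAPTER_IDS, store)]
        print(label, names)
```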
Text selection and formatting is a little more effort:
- All PRD content ends up in the 3pp+PRD version, but there are sometimes entire chapters that exist only for the 3pp+PRD version (exalted domains are in ERS: Clerics (3pp+PRD), but not in ERS: Clerics (PRD-Only), so an ‘Exalted Domains’ chapter in the PRD-Only version would be empty and out of place). There needs to be a way to turn off certain content based on PRD-Only/3pp+PRD distinctions.
- Final includes content not present in WIP and RAF (archetype classes, for example) and WIP includes content not present in RAF (diagrams, some expanded text). It’s easy to exclude the diagrams in the RAF version by simply ignoring the diagram instructions, but there needs to be a way to exclude text.
- Because the content differs from version to version (that is, I might need six different sets of tweaks), there needs to be a way to include or exclude tweaks based on (PRD-Only, 3pp+PRD) and (RAF, WIP, Final) distinctions.
This actually should be easier than it sounds.
Word is very flat (except for tables): a chapter heading, a section heading, and a list nested within two other lists are all at the same level according to Word. The first two steps when processing the XML files created from these Word files are:
- Remove stuff I don’t care about. Word has a lot of overhead in the file, defining styles and whatnot. I don’t care about it, I get rid of it. This step includes mapping the content elements to other elements with attributes to be used later. For instance, a paragraph with ‘doc 4 Chapter’ (document level 4, chapter: ‘doc 4’ indicates depth and keeps the styles ordered properly in the style manager, ‘chapter’ reminds me of the semantic intent) gets turned into <section outline-level="4" />. This happens with many elements.
- Build the document hierarchy, so each outline-level="1" element contains all following objects that sit lower in the outline (a larger outline-level number) or have no outline level, repeating until there are no more outline-levels… then do the same for list-level. (Incidentally the game objects live somewhere around outline-level=10… and For Reasons, stat blocks are considered lists.)
- Instructions to import files or copy game objects are also given outline-levels, which 1. keeps them from being nested incorrectly in other content, and 2. allows me to append content to them after importing.
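The hierarchy-building step can be sketched roughly as follows. This is an illustration under assumptions, not the real transformation: it works on a flat list of `(outline_level, label)` pairs rather than XML, it uses a single pass with a stack instead of repeating level by level, and it ignores list-level entirely. A larger outline-level number nests inside a smaller one, and items with no level attach to the innermost open section.

```python
def build_hierarchy(flat):
    """Nest a flat list of (outline_level or None, label) pairs into a tree."""
    root = {"level": 0, "label": "document", "children": []}
    stack = [root]  # open sections, outermost first
    for level, label in flat:
        node = {"level": level, "label": label, "children": []}
        if level is None:
            # Unlevelled content belongs to the innermost open section.
            stack[-1]["children"].append(node)
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def outline(node, depth=0):
    """Render the tree as indented labels, for a quick visual check."""
    lines = []
    for child in node["children"]:
        lines.append("  " * depth + child["label"])
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical flat document, as Word would present it.
FLAT = [
    (1, "Chapter"),
    (None, "intro"),
    (2, "Section"),
    (None, "para"),
    (1, "Chapter 2"),
]

if __name__ == "__main__":
    print("\n".join(outline(build_hierarchy(FLAT))))
```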
This gives me a very easy way to solve my problem. I can assign attributes (exact mechanism not yet determined, I have many options) to the various objects so they are relevant only for certain builds.
- prd means ‘include only in a PRD-Only build’.
- 3pp means ‘include only in a 3pp+PRD build’.
- raf means ‘include only in an RAF build’
- wip means ‘include only in a WIP build’
- fin means ‘include only in a final build’
- !prd means ‘do not include in a PRD-Only build’
- !3pp means ‘do not include in a 3pp+PRD build’
- !raf means ‘do not include in an RAF build’
- !wip means ‘do not include in a WIP build’
- !fin means ‘do not include in a final build’
When processing, it is very easy for me to know which version I’m working on. The “Exalted Domains” chapter I mentioned earlier would be marked (either on the chapter itself or in the include instruction) as “3pp”, meaning it is only to be included in the 3pp+PRD version, while the “Cleric Archetype Classes” would be marked “fin”. The layout tweaks can also be marked with these values, so “3pp wip” means “do this tweak only if it’s the 3pp+PRD WIP version” (because all the PRD-Only and the RAF and FIN versions don’t need this tweak).
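Since the exact mechanism is not yet decided, the following is only one possible reading of the flags, written as a small sketch. It assumes the marks arrive as a space-separated string, that a negated mark always excludes, and that positive marks are combined per axis (source and stage): an object is included only if the build matches at least one positive mark on every axis that has any. The function name `included` is hypothetical.

```python
SOURCES = {"prd", "3pp"}
STAGES = {"raf", "wip", "fin"}

# The six builds: every source paired with every stage.
BUILDS = [(source, stage) for source in ("prd", "3pp") for stage in ("raf", "wip", "fin")]


def included(marks, source, stage):
    """Decide whether an object with these marks belongs in the given build."""
    build = {source, stage}
    tokens = marks.split()
    # A negated mark such as "!raf" excludes the object from matching builds.
    if any(token[1:] in build for token in tokens if token.startswith("!")):
        return False
    positive = {token for token in tokens if not token.startswith("!")}
    # Assumption: on each axis, positive marks restrict the build to those values.
    for axis in (SOURCES, STAGES):
        wanted = positive & axis
        if wanted and not (wanted & build):
            return False
    return True


if __name__ == "__main__":
    for marks in ("3pp", "fin", "3pp wip", "!raf", ""):
        print(repr(marks), [b for b in BUILDS if included(marks, *b)])
```

Under these assumptions an unmarked object appears in all six builds, ‘3pp’ selects the three 3pp+PRD builds, and ‘3pp wip’ selects exactly one.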
This largely solves my ‘scaffolding’ problem. A single set of input documents should now be transformable into six output documents, depending on flags set. I’ll need to make some changes in the make-the-final-document scripts, but by and large this should greatly reduce the file handling I need to do.
The earliest versions of the Echelon Reference Series data store, at least after I started parsing the data, had very specific styles, from ‘class’ down to individual class subfeature types such as rage powers and bardic performances. The middling versions used some better abstractions and let me get away from being quite so specific, but when rendering the documents I had to examine the context of the object to see what it was. For instance, I would deduce that a particular class-subfeature is a rage power because its parent was a class-feature called ‘rage power’. This worked, as far as it went, but led to my ‘inserting’ parent data objects so the abstractions would work.
This had two unfortunate side effects, one minor and one more significant. The minor was that if I were to render the source document in PDF (as I often do as a data check, to verify the structure is correct) I would have extraneous objects in the document. This is ultimately not a big deal, but I found it jarring. The more significant effect was that it caused me to have many objects in the system with exactly the same ID. This was more troublesome.
It looks like the easiest solution consists of embracing abstraction. Much of the time I need only know that I have an object and what type it is (i.e. a label). It is still necessary to be able to nest the items, but the following seems to work well:
- Replace all data-specific stat-block styles (spells and monsters are the most-used, but there are others) with ‘d20 Abstract’, ‘d20 Abstract Group’, and ‘d20 Abstract Sub’. These can be applied to all object types. ‘d20 Abstract’ is the most commonly used, ‘d20 Abstract Group’ provides a heading in the stat block (often seen in monster stat blocks), and ‘d20 Abstract Sub’ is a child object of a ‘d20 Abstract’, most commonly used so there can be more than one paragraph in a stat block field (such as a monster special ability that needs more than one paragraph to describe). These are actually identified internally as list items so they can interact and include lists.
- Replace all game object styles with a combination of nine styles (three sets of three):
- d20-1-Decl, d20-1-Object, d20-1-Section (Heading levels 1-3)
- d20-2-Decl, d20-2-Object, d20-2-Section (Heading levels 4-6)
- d20-3-Decl, d20-3-Object, d20-3-Section (Heading levels 7-9)
- Add ‘d20 Attribute’, which adds or overrides meta information about the object, that isn’t game information. This is used mostly to provide processing hints, and doesn’t get used much.
The naming scheme seems odd, but is set up that way so they appear in my style manager in a useful order.
Functionally there is no difference between a d20-2-Object and a d20-1-Object, except that the d20-2-Object can be nested within a d20-1-Object… and because I declare the types explicitly now, this does not come up often.
When I encode a character class, I can do something like
[d20-1-Decl] Class
[d20-1-Object] Rogue
(description goes here)
[d20-1-Section] Class Features
Rogues have the following class features
[d20-2-Decl] Class Feature
[d20-2-Object] Weapon and Armor Proficiency
[d20-2-Object] Sneak Attack
[d20-2-Object] Trapfinding
[d20-2-Object] Evasion
[d20-2-Object] Rogue Talents
(description… not including actual rogue talents)
The headings — all the styled paragraphs in the block quote above — are actually increasingly indented in Word, to make the hierarchy easier to see. They also show up in the navigation pane in tree format, making it easy to navigate.
While processing, I end up with an object (of type ‘class’) called ‘rogue’, with the ID ‘class.rogue’. This has descriptive text and a section containing five objects (second-tier, but still ‘objects’) of type ‘class feature’. These objects each have a description and are called respectively ‘Weapon and Armor Proficiency’, ‘Sneak Attack’, ‘Trapfinding’, ‘Evasion’, and ‘Rogue Talents’. Because they are inside another object, though, their IDs are slightly different: class-feature.weapon-and-armor-proficiency.rogue, class-feature.sneak-attack.rogue, class-feature.trapfinding.rogue, class-feature.evasion.rogue, and class-feature.rogue-talents.rogue. Each also has a ‘group ID’ (gid) that has the ‘.rogue’ suffix removed.
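The ID scheme described above can be approximated in a few lines. This is a sketch, assuming IDs are built from lower-cased, hyphenated names in the order type, name, then containing object; the helper names `slug` and `make_ids` are hypothetical.

```python
import re


def slug(text):
    """Lower-case the text and join its words with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def make_ids(type_name, name, parent_name=None):
    """Return (id, gid); the gid is the id without the parent suffix."""
    gid = f"{slug(type_name)}.{slug(name)}"
    obj_id = f"{gid}.{slug(parent_name)}" if parent_name else gid
    return obj_id, gid


if __name__ == "__main__":
    print(make_ids("class", "Rogue"))
    for feature in ("Weapon and Armor Proficiency", "Sneak Attack", "Evasion"):
        print(make_ids("class feature", feature, "Rogue"))
    # The same feature on another class shares the gid but not the id.
    print(make_ids("class feature", "Evasion", "Ranger"))
```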
That each instance of the object now has a unique ID is incredibly valuable. It lets me identify and refer to (or copy) a specific data object. In many ways they are equivalent (they do the same thing, and satisfy the same prerequisites… usually), but in some ways they are different (class-feature.evasion.rogue is gained at rogue second level, but class-feature.evasion.ranger is gained at ninth level).
This also makes it feasible for me to define ‘universal class features’ (provide a single standard definition for a class feature such as evasion). I can then change the class-specific definitions to the class-specific application (‘Rogues gain evasion at 2nd level’, ‘Monks gain evasion at 2nd level’, ‘Rangers gain evasion at 9th level’). It will no longer be necessary to define the class feature each time a class gains it, and more importantly it becomes reasonable to get rid of ‘gains evasion, as a 2nd-level rogue’.
Regarding the rogue talents above, the class feature describes the rules for rogues taking rogue talents (gained at 2nd level and every even level after that). The actual rogue talent definitions happen outside the class, mostly because most rogue talents are defined in other supplements rather than alongside the class. Also, rogues aren’t the only class to gain rogue talents, so defining them outside the class makes it easier to ‘share’ them. I might put the following in another chapter (using d20-2-Decl):
[d20-2-Decl] Rogue Talent
[d20-2-Object] Bleeding Attack (Ex, Sneak Attack Exclusive)
(description goes here)
[d20-2-Object] Combat Trick
(description goes here)
[d20-2-Object] Fast Stealth (Ex)
(description goes here)
[d20-2-Object] Finesse Rogue
(description goes here)
This gives me four new objects, named ‘Bleeding Attack’, ‘Combat Trick’, ‘Fast Stealth’, and ‘Finesse Rogue’ (with IDs rogue-talent.bleeding-attack, rogue-talent.combat-trick, rogue-talent.fast-stealth, rogue-talent.finesse-rogue). I used d20-2 styles to show that it isn’t necessary to start at d20-1.
In case you’re curious, Bleeding Attack is given the ‘sneak attack exclusive’ type for clarity. “Talents marked with an asterisk add effects to a rogue’s sneak attack. Only one of these talents can be applied to an individual attack and the decision must be made before the attack roll is made.” is not terribly useful when asterisks are used in many places for different things; I prefer to be explicit.
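To make the declaration-then-objects pattern concrete, here is a small sketch that reads bracketed lines like the ones above and produces objects with a type, ID, and explicit type tags. It assumes the style name appears in square brackets at the start of each line (as in the examples here, not as Word actually stores it) and that type tags are a comma-separated list in trailing parentheses. The names `parse` and `slug` are hypothetical.

```python
import re

STYLED = re.compile(r"^\[d20-(\d)-(Decl|Object|Section)\]\s+(.*)$")


def slug(text):
    """Lower-case the text and join its words with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def parse(lines):
    """Turn styled lines into objects carrying type, id, tags, and text."""
    objects = []
    current_type = None
    for raw in lines:
        line = raw.strip()
        match = STYLED.match(line)
        if not match:
            # Unstyled lines are description text for the latest object.
            if objects:
                objects[-1]["text"].append(line)
            continue
        _tier, kind, rest = match.groups()
        if kind == "Decl":
            current_type = rest  # applies to the objects that follow
        elif kind == "Object":
            name, _, tag_text = rest.partition("(")
            tags = [t.strip() for t in tag_text.rstrip(")").split(",") if t.strip()]
            name = name.strip()
            objects.append({
                "type": current_type,
                "name": name,
                "id": f"{slug(current_type)}.{slug(name)}",
                "tags": tags,
                "text": [],
            })
    return objects


SAMPLE = """[d20-2-Decl] Rogue Talent
[d20-2-Object] Bleeding Attack (Ex, Sneak Attack Exclusive)
(description goes here)
[d20-2-Object] Combat Trick
(description goes here)""".splitlines()

if __name__ == "__main__":
    for obj in parse(SAMPLE):
        print(obj["id"], obj["tags"])
```

Keeping the tags as a plain list leaves room to look each one up later as an object in its own right.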
And with a little bit of forethought, I can even prepare for the type tags to be objects themselves, so if I want I can define them in data and provide textual descriptions for them that I can present when needed.
I had originally planned to write about object type taxonomy, but this article is already almost 2,900 words long! Next post!