Yet Another Grand Reorganization

One of the hazards of being a data geek, and good at it, is that over time you become better at it. You find better ways to do things.

Also, the Pathfinder Roleplaying Game has become progressively more complex, both from a player perspective (hence the Echelon Reference Series) and particularly from a data modeling perspective.

It’s time for me to fall back and regroup, reorganize how I’m capturing the data. The workflow changes somewhat, but more importantly the data and file management changes.

Existing Data Models

Today, each source is captured more or less exactly as published. Each Word file represents (usually) one source, complete with document structure (book, part, chapter, section, subsection, etc.) and game element structure (each game element is a major element, minor element, subelement, etc., and there are divisions within them such as for a bloodline’s bloodline powers).

Word produces relatively flat, unhierarchical files. Whether converted to HTML or converted to XML, the document structure basically lacks hierarchy. For instance, conceptually a book’s structure has chapters, and a chapter may have sections. That is, there is an implied hierarchy. In the document files, though, rather than

chapter
- paragraph
- paragraph
- section
  - paragraph
  - table
    - table rows
  - paragraph
- section
  - paragraph

you’ll see

chapter
paragraph
paragraph
section
paragraph
table
- table rows
paragraph
section
paragraph

There are tricks, tools, and techniques for dealing with these, and I’ve gotten good at them. However, the process ultimately generates some eighteen levels of grouping (7 for document structure and 11 for data). In many cases I need to infer from element ancestry what I’m looking at. That is, I might have a “class-feature” called “domain”. Each “class-subfeature” is an instance of a domain (“Air”, “War”, etc.). The “class-subsubfeature” in that is a domain power… but it’s up to me, in my code, to recognize that.

Do you have any idea how many class features exhibit complicated internal structure like this?
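As a rough illustration of the kind of trick involved, here is a minimal Python sketch that rebuilds the implied hierarchy from a flat sequence like the one above. The `LEVELS` table, the `Node` class, and the `nest` function are all hypothetical names invented for this example, not the actual tooling, and table rows are left out to keep it short.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical nesting depth per container kind. Anything not listed
# (paragraph, table, ...) is treated as content of the innermost open container.
LEVELS = {"chapter": 1, "section": 2}


@dataclass
class Node:
    kind: str
    children: list[Node] = field(default_factory=list)


def nest(flat_kinds):
    """Rebuild the implied hierarchy from a flat sequence of element kinds."""
    root = Node("document")
    stack = [(0, root)]  # (level, node) for each currently open container
    for kind in flat_kinds:
        node = Node(kind)
        level = LEVELS.get(kind)
        if level is None:
            # Plain content: attach to the innermost open container.
            stack[-1][1].children.append(node)
            continue
        # Close any open containers at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root


def outline(node, depth=0):
    """Return the tree as indented lines, one per node."""
    lines = ["  " * depth + node.kind]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


flat = ["chapter", "paragraph", "paragraph", "section", "paragraph",
        "table", "paragraph", "section", "paragraph"]
tree = nest(flat)
print("\n".join(outline(tree)))
```

The stack only knows about levels, not meaning. Recognizing that a third-level item under "domain" is a domain power is still a separate job.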

New Data Models

I’m splitting the files. Data will all go into one set of files, and ‘document content’ will go into another set of files. I have found in my preliminary experiments that the ‘documents’ change remarkably little after initial capture, but the data elements get tweaked and massaged quite a bit.

Sometimes the purely document content is not Open Game Content (OGC), or is outright Product Identity (PI). I capture it mostly so I can reproduce the original document formatted to my taste (easier for me to read; so many publishers use hard-to-read fonts, for example), partly so I can view the game elements in situ and have better context for examining them later, and frankly because, well, I’m a data geek and I like things to be complete. PI and other non-OGC content never gets republished.

Document Files

The document files are pretty straightforward. They follow normal document structure conventions (chapters, sections, etc.). They also can have “include commands” that identify game elements to be added to the document at that point when rendered.

When I captured the text of Pathfinder® Roleplaying Game: Ultimate Combat™ I originally reproduced the document structure, marking up the feats so they could be automatically extracted. Now I would create a data file for the feats, and the “Feats chapter” is reduced to introductory text and a series of “include these feats” instructions.

This gives me much more control than I had before. It also significantly reduces the size of the files I work with, which makes my job much easier.
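To make the include idea concrete, here is a minimal Python sketch under stated assumptions: the `("include", type, name)` tuple form, the `DATA` store, and the `render` function are hypothetical stand-ins, and the feat text is placeholder content rather than anything from a published source.

```python
# Hypothetical data store: game elements keyed by (type, name).
DATA = {
    ("feat", "Dodge"): "Dodge: benefit text goes here",
    ("feat", "Mobility"): "Mobility: benefit text goes here",
}

# A document file reduced to introductory text plus include instructions.
# The tuple layout is an assumption made for illustration only.
FEATS_CHAPTER = [
    ("text", "Feats"),
    ("text", "Introductory text for the chapter."),
    ("include", "feat", "Dodge"),
    ("include", "feat", "Mobility"),
]


def render(document, data):
    """Expand include instructions into the elements they name."""
    lines = []
    for item in document:
        if item[0] == "include":
            _, kind, name = item
            # A missing element raises KeyError rather than being skipped.
            lines.append(data[(kind, name)])
        else:
            lines.append(item[1])
    return lines


rendered = render(FEATS_CHAPTER, DATA)
print("\n".join(rendered))
```

The document file stays small and stable, and editing a feat touches only the data store.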

Data Files

The data files are even simpler. Most game elements have a very similar structure, and the major difference is how they are applied.

Element name
- Summary/statblock information (includes prereqs)
- Descriptive text
- Subelements (repeatable… and have the same structure as the parent)
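The outline above maps naturally onto a single recursive type. This is a minimal Python sketch, not the real schema: the `Element` class and its field names are assumptions, and "domain power" is a placeholder name.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    statblock: dict[str, str] = field(default_factory=dict)  # includes prereqs
    text: str = ""
    # Subelements repeat and have the same structure as the parent.
    subelements: list[Element] = field(default_factory=list)

    def walk(self, depth=0):
        """Yield (depth, element) for this element and everything below it."""
        yield depth, self
        for sub in self.subelements:
            yield from sub.walk(depth + 1)


dodge = Element("Dodge", statblock={"prerequisites": "Dex 13"}, text="lorem ipsum")

domain = Element(
    "domain",
    subelements=[Element("Air", subelements=[Element("domain power")])],
)

for depth, element in domain.walk():
    print("  " * depth + element.name)
```

Because every level has the same shape, one traversal routine covers a feat with no subelements and a class feature with several layers of them.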

Right now I define new Word styles for the various element types… but almost all of them boil down to the same basic structure. There are exceptions, but probably 80% or more of the game elements I deal with fit it.

Instead, I’m going to rely more on metadata to identify what a particular element is. The metadata type will have a definition that is shared when needed, but otherwise will be used just by name. That is,

“Feat” [data type marker]
- “Dodge” [data element]
  - prerequisites Dex 13
  - benefit lorem ipsum
- “Mobility” [data element]
  - prerequisites Dex 13, Dodge feat
  - benefit lorem ipsum

(I spelled out ‘Dodge feat’ here. Given just a name the parser can usually find what it’s after, but ‘Dodge feat’ is explicit and resolves ambiguous cases.)

Ultimately I end up with data objects something like

<d20:object class="feat" name="Dodge">
  <d20:prereqs>
    <d20:prereq refid="score.dexterity" refclass="score" value="13" />
  </d20:prereqs>
  <!-- content elided -->
</d20:object>

<d20:object class="feat" name="Mobility">
  <d20:prereqs>
    <d20:prereq refid="score.dexterity" refclass="score" value="13" />
    <d20:prereq refid="feat.dodge" refclass="feat" />
  </d20:prereqs>
  <!-- content elided -->
</d20:object>

Feats, most class features, and so on are generally presented in pretty much the same way. There are exceptions, of course, but I can now focus on handling them differently at need rather than having to lay out each data type explicitly.
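As a sketch of how a prerequisites line might be resolved into the `refid`/`refclass` attributes shown in the objects above, consider the following Python. The lookup tables and function names are hypothetical, the id normalization (lowercasing the name) is a guess based on `feat.dodge`, and the real parser surely handles far more cases.

```python
import re

# Hypothetical abbreviation table; only Dex appears in the example above.
ABILITY_SCORES = {"Dex": "dexterity"}

# Hypothetical registry of known element names by type, used to resolve
# a bare name such as "Dodge" when no type word follows it.
KNOWN = {"feat": {"Dodge", "Mobility"}}


def parse_prereq(term):
    """Turn one prerequisite term into refid/refclass(/value) attributes."""
    term = term.strip()
    # "Dex 13" -> ability score with a required value
    match = re.fullmatch(r"(\w+) (\d+)", term)
    if match and match.group(1) in ABILITY_SCORES:
        score = ABILITY_SCORES[match.group(1)]
        return {"refid": f"score.{score}", "refclass": "score",
                "value": match.group(2)}
    # "Dodge feat" -> explicit type word after the name
    words = term.rsplit(" ", 1)
    if len(words) == 2 and words[1] in KNOWN:
        name, kind = words
        return {"refid": f"{kind}.{name.lower()}", "refclass": kind}
    # "Dodge" -> bare name; resolve only if exactly one type knows it
    kinds = [kind for kind, names in KNOWN.items() if term in names]
    if len(kinds) == 1:
        return {"refid": f"{kinds[0]}.{term.lower()}", "refclass": kinds[0]}
    raise ValueError(f"cannot resolve prerequisite: {term!r}")


def parse_prereqs(text):
    """Parse a comma-separated prerequisites line."""
    return [parse_prereq(term) for term in text.split(",")]


mobility = parse_prereqs("Dex 13, Dodge feat")
print(mobility)
```

The bare-name branch refuses to guess when a name is unknown or belongs to more than one type, which is where the explicit ‘Dodge feat’ form earns its keep.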

It can be more complicated, but in many ways it isn’t. For Polyhedral Pantheons I might have something like

Deity [base data type]
Shu-shi [parent=Deity]
Jixiang Shen [parent=Shu-shi Deity]
Zhengchang Shen [parent=Shu-shi Deity]
Bukeishiyi Shen [parent=Shu-shi Deity]
Goblin [parent=Deity]
Vorubec [parent=Goblin Deity]
Jhesiri [parent=Goblin Deity]
Kouzelnik [parent=Goblin Deity]

Then

Jixiang Shen [data type]
- Huanghou
- Xingyun
- Xiao Ling
- Chengshi
- Zhongli
- Jingcai

Because I defined the data types as I did, I can traverse the relationships a couple ways. If I need to, I can determine that Huanghou (empress of heaven) is a Jixiang Shen (auspicious deity), a Shu-shi Deity, and a Deity. This gives me quite a bit of control over the formatting (default ‘game object’ formatting? more specific ‘deity’ formatting?), and even the indexing. The index might include

Deity
- Shu-shi
  - Jixiang Shen
    - Chengshi
    - Huanghou
    - Jingcai
    - Xiao Ling
    - Xingyun
    - Zhongli
Huanghou (shu-shi deity)

(because I decided I only wanted to go as far as the pantheon, not the subpantheon, here… custom indexing rule)
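The traversal described above can be sketched in a few lines of Python. The dictionaries and function names here are hypothetical, and treating "Shu-shi Deity" as the name of the pantheon type is an assumption about how the parent references resolve.

```python
# Parent links for some of the data types listed above (None marks the base type).
PARENT = {
    "Deity": None,
    "Shu-shi Deity": "Deity",
    "Goblin Deity": "Deity",
    "Jixiang Shen": "Shu-shi Deity",
}

# Which data type each element belongs to.
ELEMENT_TYPE = {
    "Huanghou": "Jixiang Shen",
    "Xingyun": "Jixiang Shen",
}


def ancestry(name):
    """List an element's data types from most to least specific."""
    chain = []
    kind = ELEMENT_TYPE[name]
    while kind is not None:
        chain.append(kind)
        kind = PARENT[kind]
    return chain


def pick_format(name, formatters, default):
    """Use the most specific formatter defined anywhere in the ancestry."""
    for kind in ancestry(name):
        if kind in formatters:
            return formatters[kind]
    return default


chain = ancestry("Huanghou")
layout = pick_format("Huanghou", {"Deity": "deity layout"}, "game object layout")
print(chain, layout)
```

A custom indexing rule like the one above amounts to choosing which entry of the chain to show next to the name.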

Closing Comments

My existing file and data structure has evolved over time to the point where it has become hard to use. Splitting the files into “document content” and “game data content” lets me offload a lot of the more static content (document) and focus on the more often edited content (game data). It makes it easier to exclude the bits I don’t care about most of the time (I don’t have to load the document content into my data store, where it gets repeatedly loaded and processed later) while keeping them available if I find I want them later. This should speed up capture, editing, and processing.

Abstracting the data lets me rely more on the common aspects of the data. Feats, spells, and deities can all be structurally quite similar: name, statblock, text, done. I can start from there and refine as needed, rather than following the current model, which requires that I get detailed early only to find that many of my objects are structurally the same.

This will let me do the RAF (Rough and Fast) versions of new data… well, rougher and faster. It will also let me focus my effort on the more complex cases where I want to know more about the game object. Spells can be structurally similar to other blockish game objects, but I can gain quite a bit by parsing further. Similarly, I know the “Domains” field of a deity definition will contain references to domains (objects of type ‘domain’), so with just a little more effort I can parse and extract that information. That lets me index the domain reference, and even update the domain object by adding a “Deities” line identifying the deities who have that domain.
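A minimal Python sketch of that last idea follows. The record layout, the `link_domains` function, and the deity names are placeholders invented for the example, and a comma-separated “Domains” field is an assumption.

```python
from collections import defaultdict

# Hypothetical deity records; "Deity A" and "Deity B" are placeholder names.
deities = {
    "Deity A": {"Domains": "Air, War"},
    "Deity B": {"Domains": "War"},
}

# Hypothetical domain objects, keyed by name.
domains = {"Air": {}, "War": {}}


def link_domains(deities, domains):
    """Index domain references and add a "Deities" line to each domain."""
    by_domain = defaultdict(list)
    for deity, fields in deities.items():
        for domain in fields["Domains"].split(","):
            domain = domain.strip()
            if domain not in domains:
                raise KeyError(f"{deity} references unknown domain {domain!r}")
            by_domain[domain].append(deity)
    # Write the back-reference onto each domain that some deity grants.
    for domain, names in by_domain.items():
        domains[domain]["Deities"] = ", ".join(sorted(names))
    return dict(by_domain)


index = link_domains(deities, domains)
print(domains)
```

The back-reference is derived rather than captured, so it can be regenerated whenever a deity's domains change.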

I… did say I’m a data geek, right?