As is sadly typical during my A-Z Blog Challenge runs, I’m struggling in the middle. Ah well, it’s the weekend, time to catch up!
In a recent post I said that now that I’ve got a sense of the column types and their configuration, it’s time to get specific about how I might model the various entities.
Primary Entity Types
As with many relational databases, even when I know the primary entity types I expect to have supporting tables (for lookups, for many:many joins, and so on). Still, understanding the primary entities has to come first: without those, the rest doesn’t matter..
For most of these entities I’ll describe the starting attribute set. Users and plugins can add more, but this is the basic configuration.
Creator
A ‘person’-type entity that is involved in the creation of a document. A single person can be connected to many documents in multiple roles (see ‘Document_Creator’ below).
I’ve got a ‘marker’ attribute here, for situations where two creators exist with the same name. For instance, I know of two ‘Joel Rosenberg’ authors who write (or wrote, one has died) entirely different books. I might want to mark one as ‘Joel Rosenberg (fantasy)’ and one as ‘Joel Rosenberg (modern political)’, and have them both show up under ‘Joel Rosenberg’.
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
name | NAME | Name and sort name. |
marker | STRING | A string — often will be empty — to differentiate between two authors with the same name. |
description | TEXT | Description of the person (or more likely, notes). |
Document
Even though the File is the primary thing this application is managing, the user likely thinks mostly in terms of the documents: the books, the movies, and so on.
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
title | NAME | Document name and sort name. |
description | TEXT | Description of the document (might be sales copy, back-cover blurb, etc.) |
parent_fk | INTEGER | References Document, pointing at parent Document (i.e. in a magazine, an article would point at the issue it is part of). May be null. |
child_order | INTEGER | Order of this document within its parent, if any. Null if parent_fk is null, may be null if subdocuments are not ordered (probably wouldn’t happen). |
File
When you get right down to it, this all starts with the File. If I didn’t have files to manage, I wouldn’t be doing this.
Ideally, I’ll have one and only one copy of each file. If I want to try to insert a second file that is bitwise identical to another, I want to keep only one copy and share it between the entities that need it. I imagine that if all references to a File are removed, I’ll want to remove that File (and the file on disk).
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
repository_fk | INTEGER | References Repository table. One and only one of repository_fk and parent_fk must be not null. |
parent_fk | INTEGER | References File table (used by subfiles). One and only one of repository_fk and parent_fk must be not null. |
filepath | STRING | Path to data file, within the repository or parent file. In a repository this is likely based on the checksum (in part, ‘##/##/####’, for CRC-32, longer for CRC-64 or SHA-256), as a subfile this is probably the path from the root of the archive file. |
filename | STRING | Name of data file, within the repository or parent file. In a repository this might be ‘data.extension‘ (i.e. ‘data.jpg’) because it could have several values in the original system and a standardized name will be helpful, but as a subfile it might be the file within the archive (since that’s fixed — if it was named differently, we would not have bitwise-identical archive files). |
file_size | INTEGER | Size of the file, in bytes. |
checksum | STRING | CRC-32/CRC-64 or SHA-256 (I haven’t decided which) checksum of data file. CRC-32/CRC-64 could be stored as an integer (convenient!), SHA-256 would need a small array of integers or a 64-character string. |
media_type | STRING | Formerly known as ‘mime type’, identifying the type of content in the file (image, video, audio, text, etc.). Might promote to an entity in its own right if it looks like it could be a convenient way to connect plugins. |
file_extension | STRING | Extension on filename (i.e. MP3, JPG, EPUB, PDF, etc.); may be empty. |
Repository
A location on disk that holds media library content files. Even as much as I expect to have — and insist on supporting — multiple repositories, I imagine many libraries would have only one.
Does not have many attributes yet. This entity is mostly to provide an anchor for File locations (if I move a repository I change one record), but if I reach the point of automating file assignment to repositories it would probably hang off this record.
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
name | NAME | Name of the repository (i.e. human-readable identifier). [could be just a string, but ‘name’ not being a NAME would be wrong… might use a different name for this attribute] |
path | STRING | Location of repository on disk. |
description | TEXT | Description of the repository (i.e. purpose or nature of the content, probably used to guide the user in choosing which repository to add a file to). |
Series
A set of related Documents. Often but not necessarily ordered.
It occurs to me that if a Document may exist in no more than one Series, ‘Series’ could be a class of Document. The ‘Midkemia’ document (series) has no files but does have subdocuments; the ‘Riftwar Saga’ is a subdocument with subsubdocuments of ‘Magician: Apprentice’, ‘Magician: Master’, ‘Silverthorn’, and ‘A Darkness at Sethanon’. I think I might want to be able to support an individual Document appearing in multiple Series (Discworld comes to mind — the body of Discworld books as whole is one series, and the various story focal groups (Watch, Witches, Wizards, Death) could be subseries that contain many of the same books. If I didn’t want to do that, though, there are possibilities.
Happily, this is one of those things that could go either way, successfully!
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
title | NAME | Series name and sort name. |
description | TEXT | Description of the series (might be sales copy, back-cover blurb, etc.) |
parent_fk | INTEGER | References Series, pointing at parent Series (i.e. a story arc such as ‘Riftwar Saga’ would point at the ‘Midkemia’ series). May be null. |
child_order | INTEGER | Order of this series within its parent, if any. Null if parent_fk is null, may be null if not actually ordered (such as an anthology of the books in a series — the individual books would be ordered, the anthology compiling them is a single book and might not count as being ‘in order’). |
Tag
The Tag table is likely to be more ‘representative’ of how such tables work, rather than breaking them out indivdiually.
It is possible that tags will be hierarchical. I often use tags this way in calibre. On the other hand, I do this because I don’t have an easy way to separate them by purpose, short of having many ‘tags-like’ columns. It might be enough to have a ‘Tag Role’ that I use to constrain.
For instance, in my publishing resource library I have ‘creature type.dragon’ and ‘style.line art’ hierarchical tags for navigation purposes in the tag browser, but if I could assign ‘tag roles’ I might see just ‘dragon, line art’ in the tag list (and presented hierarchically in the tag browser). Must ponder.
Different entity types might have separate tag tables, so they don’t get mixed up. That is, ‘File’ and ‘Document’ and ‘Creator’ might have discrete tables containing their tags, or there might be a single ‘Tag’ table that serves all types and other attributes control which ones are applied where. Which means business rules could be violated (a ‘File Tag’ applied to a ‘Document’)… but that sort of thing is not uncommon, where the data model need only support all planned use cases, not necessarily be limited to planned use cases.
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
entity_type | STRING | Name of entity type this tag applies to. |
label | NAME | Tag name and sort name. I might want to have them sort according to ‘title rules’ rather than purely alphabetically. |
description | TEXT | Description of the tag and what it means. This is likely to be blank for most people, but I can see someone (*cough* me) wanting to explain what a tag means. |
parent_fk | INTEGER | References Tag table, identifies parent tag in a hierarchical structure. May be null. |
Secondary Entity Types
My secondary entity types are structural and mostly implement the many:many relationships, with some attributes specific to each relationship.
Document_Creator
Associates creators and documents, by role. For instance, (‘Magician: Apprentice’, ‘Raymond E. Feist’, ‘Author’) identifies Raymond E. Feist as the author of Magician: Apprentice.
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
document_fk | INTEGER | References Document table/ |
creator_fk | INTEGER | References Creator table/ |
role_fk | INTEGER | References Document_Creator_Role table/ |
creator_order | INTEGER | Sort order of this creator in this role for this document (i.e. primary author is 1, secondary author is 2, etc.). (document_fk, creator_fk, role_fk, creator_order) should probably be unique… but creator_order could be null. |
Document_Creator_Role
List of values used when associating a Creator with a Document.
The sort_order attribute allows me to present the list by priority (i.e. ‘Author’ before ‘Editor’ or ‘Artist’)… I could just present the roles in alphabetical order, but I
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
name | NAME | Name of the role (i.e. human-readable identifier). [could be just a string, but ‘name’ not being a NAME would be wrong… might use a different name for this attribute] |
description | TEXT | Description of the document creator role, |
sort_order | INTEGER | Order to present document creator roles. (Without this |
Document_File
Connects a Document and a File. When multiple Documents load bitwise-identical files I want to keep only one copy, but I expect that at some point I’ll want to restore the original file (possibly to the original path, but perhaps not). Because each Document may know the file by a different name or in a different location, I’ll store that information in the XM table rather than in the File (because multiple Documents may be associated) or in the Document (because multiple Files may be associated).
I might have a ‘primary’ flag or sort order, but I think that unlike document creator roles, no particular File is likely to have priority within a Document, and no particular Document is likely to have priority with a File.
If there is an order to the files (such as songs in an album, or chapter-files in an archive) this is likely better handled via ordering the Documents. An ‘album’ could be a class of Document or Series, and the individual tracks could be subdocuments within the parent album Document or Documents with the album Series.
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
document_fk | INTEGER | References Document table. |
file_fk | INTEGER | References File table. |
filename | STRING | Name of the file loaded. |
filepath | STRING | Location the file was loaded from. |
Document_Tag
Associates a Tag with a Document. This is an unordered set, so as long as (document_fk, tag_fk) are unique, I don’t need to track order.
Field | Type | Notes |
rowid | INTEGER PRIMARY KEY | (created implicitly, auto increment) |
document_fk | INTEGER | References Document table. |
tag_fk | INTEGER | References File table. |
Closing Comments
Explaining things always takes so long… creating the database can be much faster than writing it all out in human-readable form.
Then again, I’ve been doing this for quite a long time and I know all sorts of tricks for making it easier to pound out DDL.