Media Library: Numbers and Titles and Flags, Oh My!

As is sadly typical during my A-Z Blog Challenge runs, I’m struggling in the middle. Ah well, it’s the weekend, time to catch up!

In a recent post I said that now that I’ve got a sense of the column types and their configuration, it’s time to get specific about how I might model the various entities.

Primary Entity Types

As with many relational databases, even when I know the primary entity types I expect to have supporting tables (for lookups, for many:many joins, and so on). Still, understanding the primary entities has to come first: without those, the rest doesn’t matter..

For most of these entities I’ll describe the starting attribute set. Users and plugins can add more, but this is the basic configuration.

Creator

A ‘person’-type entity that is involved in the creation of a document. A single person can be connected to many documents in multiple roles (see ‘Document_Creator’ below).

I’ve got a ‘marker’ attribute here, for situations where two creators exist with the same name. For instance, I know of two ‘Joel Rosenberg’ authors who write (or wrote, one has died) entirely different books. I might want to mark one as ‘Joel Rosenberg (fantasy)’ and one as ‘Joel Rosenberg (modern political)’, and have them both show up under ‘Joel Rosenberg’.

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
name	NAME	Name and sort name.
marker	STRING	A string — often will be empty — to differentiate between two authors with the same name.
description	TEXT	Description of the person (or more likely, notes).

Document

Even though the File is the primary thing this application is managing, the user likely thinks mostly in terms of the documents: the books, the movies, and so on.

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
title	NAME	Document name and sort name.
description	TEXT	Description of the document (might be sales copy, back-cover blurb, etc.)
parent_fk	INTEGER	References Document, pointing at parent Document (i.e. in a magazine, an article would point at the issue it is part of). May be null.
child_order	INTEGER	Order of this document within its parent, if any. Null if parent_fk is null, may be null if subdocuments are not ordered (probably wouldn’t happen).

File

When you get right down to it, this all starts with the File. If I didn’t have files to manage, I wouldn’t be doing this.

Ideally, I’ll have one and only one copy of each file. If I want to try to insert a second file that is bitwise identical to another, I want to keep only one copy and share it between the entities that need it. I imagine that if all references to a File are removed, I’ll want to remove that File (and the file on disk).

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
repository_fk	INTEGER	References Repository table. One and only one of repository_fk and parent_fk must be not null.
parent_fk	INTEGER	References File table (used by subfiles). One and only one of repository_fk and parent_fk must be not null.
filepath	STRING	Path to data file, within the repository or parent file. In a repository this is likely based on the checksum (in part, ‘##/##/####’, for CRC-32, longer for CRC-64 or SHA-256), as a subfile this is probably the path from the root of the archive file.
filename	STRING	Name of data file, within the repository or parent file. In a repository this might be ‘data.extension‘ (i.e. ‘data.jpg’) because it could have several values in the original system and a standardized name will be helpful, but as a subfile it might be the file within the archive (since that’s fixed — if it was named differently, we would not have bitwise-identical archive files).
file_size	INTEGER	Size of the file, in bytes.
checksum	STRING	CRC-32/CRC-64 or SHA-256 (I haven’t decided which) checksum of data file. CRC-32/CRC-64 could be stored as an integer (convenient!), SHA-256 would need a small array of integers or a 64-character string.
media_type	STRING	Formerly known as ‘mime type’, identifying the type of content in the file (image, video, audio, text, etc.). Might promote to an entity in its own right if it looks like it could be a convenient way to connect plugins.
file_extension	STRING	Extension on filename (i.e. MP3, JPG, EPUB, PDF, etc.); may be empty.

Repository

A location on disk that holds media library content files. Even as much as I expect to have — and insist on supporting — multiple repositories, I imagine many libraries would have only one.

Does not have many attributes yet. This entity is mostly to provide an anchor for File locations (if I move a repository I change one record), but if I reach the point of automating file assignment to repositories it would probably hang off this record.

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
name	NAME	Name of the repository (i.e. human-readable identifier). [could be just a string, but ‘name’ not being a NAME would be wrong… might use a different name for this attribute]
path	STRING	Location of repository on disk.
description	TEXT	Description of the repository (i.e. purpose or nature of the content, probably used to guide the user in choosing which repository to add a file to).

Series

A set of related Documents. Often but not necessarily ordered.

It occurs to me that if a Document may exist in no more than one Series, ‘Series’ could be a class of Document. The ‘Midkemia’ document (series) has no files but does have subdocuments; the ‘Riftwar Saga’ is a subdocument with subsubdocuments of ‘Magician: Apprentice’, ‘Magician: Master’, ‘Silverthorn’, and ‘A Darkness at Sethanon’. I think I might want to be able to support an individual Document appearing in multiple Series (Discworld comes to mind — the body of Discworld books as whole is one series, and the various story focal groups (Watch, Witches, Wizards, Death) could be subseries that contain many of the same books. If I didn’t want to do that, though, there are possibilities.

Happily, this is one of those things that could go either way, successfully!

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
title	NAME	Series name and sort name.
description	TEXT	Description of the series (might be sales copy, back-cover blurb, etc.)
parent_fk	INTEGER	References Series, pointing at parent Series (i.e. a story arc such as ‘Riftwar Saga’ would point at the ‘Midkemia’ series). May be null.
child_order	INTEGER	Order of this series within its parent, if any. Null if parent_fk is null, may be null if not actually ordered (such as an anthology of the books in a series — the individual books would be ordered, the anthology compiling them is a single book and might not count as being ‘in order’).

Tag

The Tag table is likely to be more ‘representative’ of how such tables work, rather than breaking them out indivdiually.

It is possible that tags will be hierarchical. I often use tags this way in calibre. On the other hand, I do this because I don’t have an easy way to separate them by purpose, short of having many ‘tags-like’ columns. It might be enough to have a ‘Tag Role’ that I use to constrain.

For instance, in my publishing resource library I have ‘creature type.dragon’ and ‘style.line art’ hierarchical tags for navigation purposes in the tag browser, but if I could assign ‘tag roles’ I might see just ‘dragon, line art’ in the tag list (and presented hierarchically in the tag browser). Must ponder.

Different entity types might have separate tag tables, so they don’t get mixed up. That is, ‘File’ and ‘Document’ and ‘Creator’ might have discrete tables containing their tags, or there might be a single ‘Tag’ table that serves all types and other attributes control which ones are applied where. Which means business rules could be violated (a ‘File Tag’ applied to a ‘Document’)… but that sort of thing is not uncommon, where the data model need only support all planned use cases, not necessarily be limited to planned use cases.

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
entity_type	STRING	Name of entity type this tag applies to.
label	NAME	Tag name and sort name. I might want to have them sort according to ‘title rules’ rather than purely alphabetically.
description	TEXT	Description of the tag and what it means. This is likely to be blank for most people, but I can see someone (cough me) wanting to explain what a tag means.
parent_fk	INTEGER	References Tag table, identifies parent tag in a hierarchical structure. May be null.

Secondary Entity Types

My secondary entity types are structural and mostly implement the many:many relationships, with some attributes specific to each relationship.

Document_Creator

Associates creators and documents, by role. For instance, (‘Magician: Apprentice’, ‘Raymond E. Feist’, ‘Author’) identifies Raymond E. Feist as the author of Magician: Apprentice.

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
document_fk	INTEGER	References Document table/
creator_fk	INTEGER	References Creator table/
role_fk	INTEGER	References Document_Creator_Role table/
creator_order	INTEGER	Sort order of this creator in this role for this document (i.e. primary author is 1, secondary author is 2, etc.). (document_fk, creator_fk, role_fk, creator_order) should probably be unique… but creator_order could be null.

Document_Creator_Role

List of values used when associating a Creator with a Document.

The sort_order attribute allows me to present the list by priority (i.e. ‘Author’ before ‘Editor’ or ‘Artist’)… I could just present the roles in alphabetical order, but I

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
name	NAME	Name of the role (i.e. human-readable identifier). [could be just a string, but ‘name’ not being a NAME would be wrong… might use a different name for this attribute]
description	TEXT	Description of the document creator role,
sort_order	INTEGER	Order to present document creator roles. (Without this

Document_File

Connects a Document and a File. When multiple Documents load bitwise-identical files I want to keep only one copy, but I expect that at some point I’ll want to restore the original file (possibly to the original path, but perhaps not). Because each Document may know the file by a different name or in a different location, I’ll store that information in the XM table rather than in the File (because multiple Documents may be associated) or in the Document (because multiple Files may be associated).

I might have a ‘primary’ flag or sort order, but I think that unlike document creator roles, no particular File is likely to have priority within a Document, and no particular Document is likely to have priority with a File.

If there is an order to the files (such as songs in an album, or chapter-files in an archive) this is likely better handled via ordering the Documents. An ‘album’ could be a class of Document or Series, and the individual tracks could be subdocuments within the parent album Document or Documents with the album Series.

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
document_fk	INTEGER	References Document table.
file_fk	INTEGER	References File table.
filename	STRING	Name of the file loaded.
filepath	STRING	Location the file was loaded from.

Document_Tag

Associates a Tag with a Document. This is an unordered set, so as long as (document_fk, tag_fk) are unique, I don’t need to track order.

Field	Type	Notes
rowid	INTEGER PRIMARY KEY	(created implicitly, auto increment)
document_fk	INTEGER	References Document table.
tag_fk	INTEGER	References File table.

Closing Comments

Explaining things always takes so long… creating the database can be much faster than writing it all out in human-readable form.

Then again, I’ve been doing this for quite a long time and I know all sorts of tricks for making it easier to pound out DDL.