Unlike at work, where ‘operations’ means ‘keeping Production from tipping over’, here I mean ‘what things do we want to be able to do with this application?’
In database terms, we want the full range of CRUD actions: Create, Read, Update, and Delete records. This includes any other artifacts (such as files on disk), but these actions are fundamental to almost all database applications.
What does it mean to do these things, though?
By Entity Type
I’ll start by discussing how CRUD operations might work for the various entity types I described yesterday. I expect ultimately I’ll want to be able to perform operations that touch on several entity types at once (such as adding a Document and associated Files in one step, as far as the interface is concerned), but those complex operations are still defined in terms of simpler operations.
When deleting records, I expect I’ll always cascade to remove first-generation related records. I might want the option of removing second-generation related records (i.e. if no Document uses a particular Tag anymore, get rid of that Tag) but I have not yet decided. Even if I don’t do this by accident, I expect I’ll want processes available that will make it easy to identify and cull orphaned entities.
The Creator table starts very simple: (name, marker, description).
Creating a Creator is very straightforward. If I’ve seen the name (or rather, name and marker) before, return that record’s rowid, else create a new record with the supplied name and return the rowid of that record. Things get a little trickier if other fields are provided (such as description), in that they could either be ignored or could lead to an unexpected update. I need to decide what to do here.
Reading a Creator is just that, returning the record. There is no file associated for me to export, and any data export I need to do is probably covered by the general export mechanisms.
Updating a Creator is straightforward, just changing the field values.
Deleting a Creator would necessarily include deleting associated records (Creator_Tag_XM, Creator_Document_XM) but not the second-generation connections. At least, not by default, but I can imagine culling those as well.
This is probably the second-most complicated entity involved. There aren’t as many fields as File and there is no physical artifact, but it does have more fields than most and many more connections (Creators, Files, Tags… already to be built in, never mind what else gets added).
Creating a Document might not have much duplicate checking. Book titles are far from unique, after all, and it’s entirely possible to have multiple books with the same title. I can imagine that if there is a true identifier involved (such as ISBN, nominally) it might be possible to determine two documents are in fact the same, but otherwise I don’t think it would be appropriate to automatically merge documents with the same title. I’m probably going to accept multiple instances of the same Document because I can’t know better… even if they have bitwise-identical files. So: create new record, return rowid of new record.
Reading a Document returns metadata about the document. No files to export.
Updating a Document updates metadata about the document.
Deleting a Document, as always, must cascade to remove first-generation connections (Document_Creator_XM, Document_File_XM, and Document_Tag_XM records). It could be reasonable that removing the last Document referring to a particular File removes that File, but I’ll want to think about this. It could be reasonable to have Documentless Files. Deleting a Document means also deleting the subdocuments.
As far as single-entity operations are concerned, File is probably the most complicated one because there are physical artifacts (i.e. files on disk) involved and because the semantics may differ depending on how exactly the physical files are to be managed. Also, because these physical files are managed there could be some contextual differences that need to be explored.
Creating a file record can mean the application captures a copy of the file and adds it to the file repository. In doing so, the application looks first to see if the same file (by size and checksum) exists already. If not, copy the file to the repository and return the new file record’s rowid. If so, return the rowid of the existing file record. There could be a ‘safe mode’ so that when a new file matches an existing file on (size, checksum) it also does a bitwise comparison, but I expect that will never find a collision between two files that are not actually the same… but it’s a harmless check, so why not? There could also be a ‘unique’ constraint so that trying to add a file that already exists raises an error.
Creating a file record could also mean the application captures details of where the file exists outside the application’s file repositories. I can imagine these will need to be periodically checked to ensure they are still where (and what! that they are bitwise-identical to the original) they were when originally captured. I’m not such a fan of this approach, but can see using it for situations such as the DriveThru Downloader. Why take a copy when the downloader already has them in a file hierarchy? Answer: so I don’t lose them when they get overwritten by the downloader or moved by the user… but I can imagine some users preferring to have only a single copy. This might also be used for something like indexing a Plex media library — my media library app only catalogues, it’s not a media server.
Reading a file could have several meaning. First, there’s just reading and presenting metadata (such as when the file was captured, where it is, and so on, and extended attributes such as image details). There could also be exporting a copy of the file to disk outside the repository, but this likely works best in context of a Document (i.e. via the Document_File_XM table).
Updating a file could update only information on the record (such as the description), but it could also mean a transformative change to the file. In that case I would think it should mean either creating a new File record with the transformed file (i.e. keep both copies — just because I change this file does not necessarily mean I want to change it for everything that uses this file), or could mean updating the File record and create a new physical file on disk (i.e. I do want to update this file for all things that reference it). I can see supporting either or both approaches.
Deleting a file means removing the File record, of course, and could be cascaded (removing Document_File_XM records, File_Tag_XM records, etc.). It also means removing the physical file from the repository, but if the File record captures a file outside the repository that external file probably should not be deleted. That could be an option, and I’ll probably implement it and make it available, but I don’t see using it myself. This might also include deleting significant entities (Documents and Tags) that no longer have other files associated… but I’ll want that to be controlled, or at least confirmed by the user, rather than default behavior. Deleting a File means also deleting any subfiles.
This is likely to be a low-activity entity. I suspect many libraries might have only one record, in fact.
Creating a Repository means given the repository a name and identifying a location (the root directory of the repository). In future it might mean adding rules to automate selection of the repository when adding new files, but for now I’m not going to bother.
Reading a Repository returns metadata about the repository. This might include some metrics about the repository location, such as how much space is consumed and how much is still available on that disk): there could be system dependencies here.
Updating a Repository could get scary. Changing the name is no big deal, but I’m not sure if I want the ability to move the content from within the application, or should move the directory using system tools and then update the record to point at the new location. Either way I’ll want to be careful. I will probably want/need the ability to bulk move Files from one Repository to another, such as when a disk is getting full, so they might use the same mechanisms.
Deleting a Repository is similarly scary. I can’t delete the record as long as there are File records associated. Dropping a Repository needs lots of warnings and opportunities to back out, and still it might not be expected to remove the physical files. Must consider.
Series are very much like Documents, to the point I have considered treating them as a special type of document. About the only thing stopping me is that it means I could not have a single Document in multiple Series, which is a thing that I want to allow.
Creating a Series is like creating a Document: multiple series can exist with the same name and I can’t tell if they should or should not be merged, so I won’t.
Reading a Series returns metadata.
Updating a Series updates the metadata.
Deleting a Series will remove first-generation links but probably not second-generation. This includes subseries.
Even simpler than Series, except that as implemented all tags are likely to be stored in a single table, for all entity types (i.e. Document tags and File tags are in the same table).
Creating a Tag will insert a new record if the (entity type, label) is unique and return the rowid of that record, or return the rowid of the record if it is not unique. I have the sense — I might change my mind — that the hierarchy should be there only to organize the tags and the labels should be unique for each entity type. This could be short-lived, already I can think of publisher resources where I might want a ‘color’ style tag for stock art (as opposed to line art or greyscale) but also a ‘color’ parent tag used so I can mark that a texture is predominately green (though that latter case I might use ‘hue’ instead).
Reading a Tag returns the tag metadata, probably including any hierarchical details.
Updating a Tag updates the tag metadata, but could mean some rejigging if renaming would cause a collision (replacing ‘scarlet’ with ‘red’ to simplify things could mean anything currently using ‘scarlet’ needs to be moved to the ‘red’ tag).
Deleting a Tag, as always, removes first-generation links. I would think it should never remove second-generation links: it’s possible that if I delete a Document I won’t care about the Files anymore if nothing else uses them, but it’s unlikely that dropping a categorization tool means I want to get rid of the thing I’m categorizing.
This is where things get more interesting. From a user perspective I don’t think I want to have to track all the individual steps myself. I’d rather do something like “add this document with these creators and this files and these tags” and let the tool sort out how that happens.
Likely to be added only on its own or in context of a Document. Probably won’t be ‘Creator made these Documents”, but “Document had these Creators”.
Likely the most common situation. For instance, all files in a single directory of the DriveThru downloader stores files in a ‘publisher / title’ directory structure, with all files for a single title in a single folder. I want to be able to say “that folder is this Document from this publisher (Creator… or close enough for this example) and has these Files”. One instruction for a single Document with an arbitrary number of Files. If I have metadata from DriveThru (authors, title ID, rating, etc.) I’ll probably want to augment the Document with that.
I can imagine wanting to add a File with an implied Document. I don’t necessarily know the full Document details, but I could use the file name as an implied Document title, or even use metadata from within the file. For instance, I downloaded a Raging Swan megabundle from Open Gaming Store and while I don’t have actual metadata from OGS, Raging Swan is one of the best I’ve found at embedding metadata in their documents. Other sources are pretty consistent in how the files are named, and calibre allows the use of regular expressions to extract metadata such as author names and book titles from the names of the files being loaded.
Still, I suspect I’d still turn this around: if I can determine enough to define a Document, no matter how contrived or kludgy, I’d likely add a Document that has the single File, rather than add a File with an associated Document. But there’s no reason I couldn’t do it that way.
Like Tags below, I expect Series would be created directly and without associated entities (except Tags, they could have Tags on creation)… or created implicitly as part of loading a Document.
Even more than File, I expect Tags created on their own are created just as Tags, with no associated entities… but any entity that has Tags would implicitly result in new Tag records if needed. That is, adding a Document might have a string (or a JSON array) listing the tags and if they don’t exist before they will exist after. Ditto for Files and Creators.
So many moving parts. This doesn’t even touch on complicated operations, just some of the critical ones.
And describing this all led me to a potentially excellent but potentially horrible abstraction I might want to pursue. More in the next post.