This post was almost titled ‘Business Requirements’, but I didn’t call it that for two reasons.
First, although my background is in software development, this isn’t about ‘business’; it’s a personal itch I’m trying to scratch.
Second, ‘requirements’ are, by definition, measurable: it must be possible to decide that they have been met.
I’m not that far along in my thoughts, not ready for that level of rigor and formality… so ‘basic needs’ it is.
For ease of discussion, and because this was originally prompted by ebook library management, I’ll refer to the primary entities of interest as ‘documents’… even if they are sound files, video files, archive files, or even non-files.
In no particular order, here are things I want to see:
- Support for a large number of documents, with good performance.
- This is critical; if I don’t get this I’m not switching.
- Arbitrary file types, including but not limited to the following.
- archives (ZIP, 7z, RAR, etc.)
- ebooks (which is what prompted this)
- Rich metadata.
- Probably want this to be extensible or customizable, so I can add new fields easily.
- Ideally on all entities: it would be good to be able to add tags to individual files or authors, not just to documents.
- TAGS FOR EVERYTHING. Documents, files, authors, let me tag it all.
- “Multiple metadata”? Metadata can be captured from the filename (if you define how), from the file metadata itself (though I’ve found this is often unreliable), from online sources… capture multiple sets and give an interface to reconcile them? When merging multiple documents, do the same? Not sure…
- Batch processing. I want to be able to easily (and ideally, quickly) upload a large number of files automatically.
- This could be via a command-line interface or an easy-to-use API.
- Modular implementation with plugin support.
- High cohesion and low coupling.
- Plugins could be used to handle the various file types; there’s really no reason to build in ‘ebooks’ or ‘images’ when these could be built as default plugins (and potentially swapped out!)
- Hierarchical entities.
- Documents and subdocuments: articles or essays in a magazine or book. It could be nice to search for an author and identify not only that author’s books, but also the magazine issues in which they have articles, and the titles of those articles.
- Files and subfiles: for a ZIP file containing image files, it could be nice to get a contact sheet for the images. Yes, it should be possible to have a document that points at a subfile.
- Series and subseries: the Midkemia books by Raymond E. Feist contain subseries (Riftwar, Empire, Krondor, etc.)… we can capture that.
- Author and subauthor? Not sure about this one, but it could be used for pen names and the like. Dave Duncan also published as ‘Ken Hood’; it might be good to have ‘all Dave Duncan books’ include the Ken Hood books.
- Powerful deduplication. It’s not uncommon to find exactly the same file from multiple sources, and this can chew up a lot of storage.
- I think my ideal would be to have multiple ‘documents’ pointing at the same file. That way I can retain the original packaging but reduce the storage required.
- Probably done by a combination of file size and checksum matching (two files with the same number of bytes and the same SHA256 checksum are almost certainly the same file, and I can add a byte-for-byte comparison if I want to be certain).
- Where documents have the same files but different metadata, consider creating a ‘master/canonical’ version of the metadata but allow the others as ‘aliases’. For instance, I might find that a publisher names the license PDF by package, but it’s bitwise the same file for all packages… create a ‘<Publisher> Stock Art’ canonical entry and have it associated with each relevant package.
- Separate database and storage locations. Let me put the metadata database on a fast drive (internal SSD) but store the files on external USB drives, or a NAS, or something. I don’t need them to be together. Possibly cache cover images on the faster drive.
- Support multiple repositories with the same database: put my graphics assets on this drive, sound assets on that drive, etc.
- Manage by rules (based on file type, file size, partial checksum, etc.).
- (Probably) store a copy of the metadata in the folder with the file, in case things go sideways and we need to rebuild the database.
- Metadata-agnostic repository. Don’t use any potentially-changing metadata (such as author, title, publisher, etc.) in the primary storage path. Doing so could mean that changing metadata forces files to be moved around (annoying, and slow), causes inconsistencies if we’re sharing the files between documents, and gives bad performance if a very large number of objects end up in a single directory, which Windows does not handle well.
- Absolutely support exporting content to metadata-driven folders. If I want to export everything from a publisher to a folder hierarchy based on their product lines, I should be able to export to “<publisher>/<productline>/<title>/<files*>”.
- Consider cataloguing in place: just scan directories and catalogue the files where they are, don’t move them at all. (Not my preference, but can probably support.)
- Multiple concurrent accesses. At the least I’d like to have two libraries open at once, if not have two processes manipulating the same library.
- (nice to have but not needed) Metadata capture/scraping: API calls to Amazon, IMDb, CDDB, and other sources.
- (nice to have but not needed) Device interaction (upload to my ebook reader or tablet)… I don’t do this often enough to care, I could be satisfied with doing it by hand, or indirectly via another tool such as calibre.
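
To make the deduplication and metadata-agnostic storage ideas above a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration only: the function names (`sha256_of`, `find_duplicates`, `storage_path`), the two-level directory sharding by checksum prefix, and the choice to hash only files that share a size with another file. It is one possible way to approach this, not a design decision.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _split_by_bytes(candidates):
    """Partition same-checksum files into byte-for-byte identical clusters."""
    clusters = []
    for path in candidates:
        for cluster in clusters:
            # shallow=False forces a real content comparison.
            if filecmp.cmp(cluster[0], path, shallow=False):
                cluster.append(path)
                break
        else:
            clusters.append([path])
    return [cluster for cluster in clusters if len(cluster) > 1]


def find_duplicates(paths, verify_bytes=False):
    """Return groups (lists) of paths that appear to hold the same content.

    Files are grouped by size first, so only files that share a size
    with at least one other file are ever hashed.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size, cannot be a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[sha256_of(path)].append(path)
        for candidates in by_hash.values():
            if len(candidates) < 2:
                continue
            if verify_bytes:
                groups.extend(_split_by_bytes(candidates))
            else:
                groups.append(candidates)
    return groups


def storage_path(root: Path, checksum: str) -> Path:
    """Hypothetical layout: derive the storage path from the checksum only.

    No author, title, or publisher appears in the path, and the two
    prefix levels keep any single directory from growing too large.
    """
    return root / checksum[:2] / checksum[2:4] / checksum
```

With a layout like this, several ‘documents’ could point at the same stored file by checksum, and editing metadata would never move anything on disk. The metadata-driven folder hierarchy would then exist only as an export.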