‘Test your assumptions’ is an important step in any kind of analysis and design exercise. It appears calibre’s database design is not entirely the cause of poor performance at higher record volumes.
Don’t get me wrong, the design and implementation of the database can certainly be tightened up… but I don’t think this alone will be enough.
Calibre’s Data Model
I took a copy of my ‘Dev-Graphics’ library (229,244 ‘books’ — images, actually), put it on an internal SSD, and opened it in SQLite’s DB Browser.
As should be expected, the Books table holds most of the built-in single-valued attributes (e.g. title, load date, the has_cover flag, and so on). Attributes with multiple values (authors, tags, and so on) are implemented as many:many relationships via link tables.
Custom columns, though, are stored outside the Books table and are heavily normalized. Each custom column results in one or two additional tables being created, depending on the nature and cardinality of the data.
- If the book attribute is a collection of values (custom tags, for example) there will be two tables. One has the tag strings and one is a link table between Books and the column-specific table. This structure is necessary to normalize a many:many relationship.
- If the book attribute is a single string value, there will also be two tables: one holding the string values and one linking Books to the column-specific table. Sharing strings this way can reduce storage where many records have the same value, but it buys nothing when values are more or less unique. For many of my libraries I have an attribute for the path I loaded the file from (each of which may be used many times) and an attribute for the fully-qualified path (which is unique; I use it to make sure I’m not loading the same file again). Both are implemented the same way.
- If the book attribute is a single value and is another primitive (number, date, etc.) there may be a single table that holds the book ID and the value assigned. For instance, ‘download date’ is a primitive and there would be little to gain by splitting this into two tables.
There are also tables containing control information regarding how to interpret these custom columns and their implementation.
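A minimal sketch of these three storage patterns in SQLite, via Python’s sqlite3. The table names follow calibre’s naming convention, but the column lists are simplified stand-ins — the real tables carry additional bookkeeping columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Patterns 1 and 2: a multi-valued or single-string custom column.
-- One table holds the distinct values, one links them to Books.
CREATE TABLE custom_column_2 (
    id    INTEGER PRIMARY KEY,
    value TEXT NOT NULL UNIQUE
);
CREATE TABLE books_custom_column_2_link (
    id    INTEGER PRIMARY KEY,
    book  INTEGER NOT NULL,   -- books.id
    value INTEGER NOT NULL    -- custom_column_2.id
);

-- Pattern 3: a single primitive value, stored directly per book.
CREATE TABLE custom_column_5 (
    id    INTEGER PRIMARY KEY,
    book  INTEGER NOT NULL,   -- books.id
    value TIMESTAMP           -- e.g. a download date
);
""")
```

The same pair-of-tables shape serves both the many:many case and the shared-string case; only the primitive case collapses to one table.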
This implementation is internally fairly consistent, making it not-horrible to code around. Adding and removing custom columns is relatively straightforward and painless because they don’t change the underlying table definitions (the Books table is the Books table, always). Depending on how it’s consumed, though, there can be a performance hit (explained later).
Media Library Data Model
The model I’m working on will manipulate the entity tables, for a few reasons.
The obvious one is performance. The fewer tables you need to join, the faster things tend to work. There are cases where this is not true, but I don’t expect them to happen here.
The less obvious one is mind space: if every custom column can add one or two tables, and each entity can have custom columns, the number of tables becomes unwieldy or even overwhelming. If I have four primary entity types and add five custom columns to each, I could be looking at 20 to 40 tables just for the custom columns; instead of four tables, I’ve got two to four dozen. Admittedly, some attributes will still require additional tables, either for many:many relationships (which are probably between entities, so one more table) or to minimize string duplication, but even some of those can be integrated more tightly.
A third reason is a courtesy to the programmer: implementing built-in and custom columns differently means implementing access to those columns differently. Do I look at the entity record, or at a custom-column record (potentially requiring a join to two other tables for each custom column)? Some of this is unavoidable because of differences in cardinality or string deduplication, but then the difference comes down to domain rules and optimization, not whether the column was built in.
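To make the two access paths concrete, here is a toy sketch (hypothetical tables and data, not calibre’s actual schema): a built-in attribute is read straight off the entity record, while a custom attribute goes through a link table and a value table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE custom_column_2 (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE books_custom_column_2_link (
    id INTEGER PRIMARY KEY, book INTEGER, value INTEGER);
INSERT INTO books VALUES (1, 'Example');
INSERT INTO custom_column_2 VALUES (10, 'captured');
INSERT INTO books_custom_column_2_link VALUES (1, 1, 10);
""")

# Built-in attribute: one table, one lookup.
title = con.execute("SELECT title FROM books WHERE id = ?", (1,)).fetchone()[0]

# Custom attribute: two extra tables joined per column.
state = con.execute("""
    SELECT cc.value
      FROM books_custom_column_2_link link
      JOIN custom_column_2 cc ON link.value = cc.id
     WHERE link.book = ?
""", (1,)).fetchone()[0]
```

Every custom column a caller touches repeats the second pattern, which is where the per-column join cost and the two code paths come from.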
I wanted to compare performance between the two implementations. To do so, I created a table that was a copy of the Books table, with ‘values’ from related tables appended. These ‘value’ fields are mostly from link tables (because those were the easiest to work with… and all populated).
create table new_books as
select b.*,
       bcc2.value  content_capture_enum_fk,
       bcc11.value filename_fk,
       bcc12.value filepath_fk,
       bcc13.value filequalified_fk,
       bcc16.value path_fk,
       bcc17.value path_bundles_fk
  from books b
  join books_custom_column_2_link  bcc2  on b.id = bcc2.book
  join books_custom_column_11_link bcc11 on b.id = bcc11.book
  join books_custom_column_12_link bcc12 on b.id = bcc12.book
  join books_custom_column_13_link bcc13 on b.id = bcc13.book
  join books_custom_column_16_link bcc16 on b.id = bcc16.book
  join books_custom_column_17_link bcc17 on b.id = bcc17.book;
This gave me 229,244 records, each with the same content as the Books records but with six foreign keys appended. Only ‘filequalified’ is expected to be unique; file names can be duplicated, and all paths can be expected to be duplicated, making them good candidates for string consolidation. In my model these will likely be stored on the Document_File_xm table, but even then I can see value in adding a ‘Folder’ or ‘Directory’ or ‘Path’ entity. Later! Not changing the model now.
I then wrote a query to get the same values out of both the new table and the original tables. In reality I almost certainly would name the linked tables more meaningfully (‘custom_column_2’ might be ‘content_capture_enum’), but I was looking at performance when I wrote this. I could have rolled the text values directly into the New_Books table but chose not to at this time.
select nb.*,
       cc2.value  content_capture_state,
       cc11.value filename,
       cc12.value filepath,
       cc13.value filequalified,
       cc16.value path,
       cc17.value path_bundles
  from new_books nb
  join custom_column_2  cc2  on nb.content_capture_enum_fk = cc2.id
  join custom_column_11 cc11 on nb.filename_fk      = cc11.id
  join custom_column_12 cc12 on nb.filepath_fk      = cc12.id
  join custom_column_13 cc13 on nb.filequalified_fk = cc13.id
  join custom_column_16 cc16 on nb.path_fk          = cc16.id
  join custom_column_17 cc17 on nb.path_bundles_fk  = cc17.id;
The same information from the original tables is a little more effort.
select b.*,
       cc2.value  content_capture_state,
       cc11.value filename,
       cc12.value filepath,
       cc13.value filequalified,
       cc16.value path,
       cc17.value path_bundles
  from books b
  join books_custom_column_2_link  bcc2  on b.id = bcc2.book
  join books_custom_column_11_link bcc11 on b.id = bcc11.book
  join books_custom_column_12_link bcc12 on b.id = bcc12.book
  join books_custom_column_13_link bcc13 on b.id = bcc13.book
  join books_custom_column_16_link bcc16 on b.id = bcc16.book
  join books_custom_column_17_link bcc17 on b.id = bcc17.book
  join custom_column_2  cc2  on bcc2.value  = cc2.id
  join custom_column_11 cc11 on bcc11.value = cc11.id
  join custom_column_12 cc12 on bcc12.value = cc12.id
  join custom_column_13 cc13 on bcc13.value = cc13.id
  join custom_column_16 cc16 on bcc16.value = cc16.id
  join custom_column_17 cc17 on bcc17.value = cc17.id;
About twice as many tables involved in the query. The tables are at least consistently named (books_custom_column_2_link joins books and custom_column_2), but it’s still twice as many tables… and I had to go to the custom_columns table to get the IDs that clued me in to the names of tables I needed to use.
Performance-wise? The denormalized table returns the data in about 230 milliseconds. Depending on the run, the original schema takes anywhere from a few times that (500-900 milliseconds) up to ten or more times (I’ve seen more than 6 seconds!). Given the variability in runtime, I suspect system load or other interference accounts for part of the spread.
… if the work is actually done in the database. I suspect calibre actually captures the information table by table and aggregates it internally. That is, rather than issuing the SQL shown above and letting SQLite handle it, I suspect calibre fetches the Books records, then the Custom_Column_2 data (attaching it to the Books data internally), then the Custom_Column_11 data (attaching it likewise), and so on.
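The two strategies can be sketched against a toy schema. The stitching loop below is my guess at the general shape of that pattern, not calibre’s actual code:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE custom_column_2 (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE books_custom_column_2_link (book INTEGER, value INTEGER);
INSERT INTO books VALUES (1, 'a'), (2, 'b');
INSERT INTO custom_column_2 VALUES (10, 'x'), (11, 'y');
INSERT INTO books_custom_column_2_link VALUES (1, 10), (2, 11);
""")

# Strategy 1: hand SQLite one statement and let it do the joins.
joined = con.execute("""
    SELECT b.id, b.title, cc.value
      FROM books b
      JOIN books_custom_column_2_link l ON b.id = l.book
      JOIN custom_column_2 cc ON l.value = cc.id
""").fetchall()

# Strategy 2: fetch each table separately and stitch in application
# code -- one pass per custom column.
rows = {bid: {"title": t} for bid, t in con.execute("SELECT id, title FROM books")}
values = dict(con.execute("SELECT id, value FROM custom_column_2"))
for book, value_id in con.execute("SELECT book, value FROM books_custom_column_2_link"):
    rows[book]["cc2"] = values[value_id]
```

Both produce the same answer; the difference is where the join happens and how many round trips and intermediate structures the application pays for.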
The data model I’m working on will better meet my needs in several ways. It will:
- Be easier to read and understand, by reducing the proliferation of tables needed to capture custom attributes.
- Be easier to maintain in code, by reducing the number of ways attributes are implemented. There will still be multiple, because cardinality and optimization can require them, but there will be fewer.
- Be more performant, by simplifying the SQL and reducing the number of tables that must be accessed to get the information about an entity. I expect some gain from this alone, but if my suspicions regarding calibre’s implementation are correct, the greater gain will come from letting the DBMS do the heavy lifting rather than from implementing the tables more efficiently.
This is not entirely positive.
- When configuring the custom columns, particularly when removing them, database maintenance should probably include rebuilding certain tables entirely. When you drop a custom column in calibre, it can simply drop the relevant tables; in my model it might be necessary to remove constraints, create the new target tables, copy the data, drop the original tables, rename the new tables to the old names, and recreate the constraints.
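A sketch of that rebuild dance, with invented table and column names, dropping a denormalized ‘path_fk’ column from a wide entity table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT, path_fk INTEGER);
INSERT INTO document VALUES (1, 'a', 7);

BEGIN;
-- 1. Create the new target table without the dropped column.
CREATE TABLE document_new (id INTEGER PRIMARY KEY, title TEXT);
-- 2. Copy the surviving columns across.
INSERT INTO document_new SELECT id, title FROM document;
-- 3. Drop the original and rename the new table into place.
DROP TABLE document;
ALTER TABLE document_new RENAME TO document;
COMMIT;
""")
```

SQLite 3.35 and later also support ALTER TABLE … DROP COLUMN for simple cases, but the full rebuild is still the route when constraints, indexes, or generated columns reference the column being removed.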
Even the potentially negative elements aren’t that bad. This is a common database refactoring/defragmenting activity anyway, and when I created the New_Books table it ran in a handful of seconds, even with 229,244 records and joining six tables. This should not be a very time-consuming activity. But then, my day job is as a data warehouse architect; we do this to literally thousands of tables a day.