PG IDs
@mcallister_alaskagrown convinced me that that human-readable IDs for Harry Potter editions and books would be a useful thing to have. The concept is somewhat inspired by Errington’s J. K. Rowling Bibliography in which he assigns an ID to each state of the books. It’s the kind of thing collectors love since it is conducive to list-making.
There are of course other systems for identifying books—the most relevant being ISBNs. However, ISBNs don’t quite cut it—they aren’t used universally, they aren’t used for all the variation that we want to capture as collectors… I am sorely reminded of this XKCD:
I decided to risk it anyway. And I will call them PG IDs. (PotterGlot IDs if that wasn’t obvious.)
It is not as simple as you might think to try and design such a system. Something like Errington’s won’t cut it—he was only dealing with English; TheList has a couple orders of magnitude more things to enumerate! But also Errington had an advantage that I don’t have. He was working with a known, fixed set—that is, he essentially has perfect knowledge of what he’s enumerating. At least, by the time the book was published he did. There may be additions over time, but they are easy to add in his scheme without disrupting it.
With TheList, I’m not just adding new editions that we find, but also discovering new facts about what’s already there! The authorized Bengali macroedition just got recategorized as a revision because we discovered a previously unknown unauthorized precursor. A few months ago, we discovered that Marathi was completely retranslated and published under the same ISBN—all of those editions were shifted over to a brand new translation. Where a book falls on TheList can change which makes it very difficult to meet two of the most important criteria for designing an ID system:
- IDs are immutable. From the moment it is assigned, the ID doesn’t change regardless of how what we know about it might change.
- IDs are at least somewhat meaningful. If an ID is completely arbitrary, it’s usefulness is somewhat limited. It would be a nice feature of such a system if allowed you to glean some basic information about the referent.
In an open-ended system like TheList, these are competing requirements. If I knew everything I had to capture and knew that the information I already had was perfectly correct, then it would be easy to design a system to meet both needs. However, for an ID system to be descriptive or non-arbitrary while uncertainty exists, it must admit the possibility that the descriptive elements might change.
Changing IDs are not great; ninety percent of the reason for creating an ID is as a shorthand to make sure that two individuals are talking about the same thing. Even if you keep track of changes and make sure that every ID is unique so that there is always only one referent for one ID, the possibility of one referent having more than one ID introduces the possibility for confusion. Such a system is possible, but it’s not ideal.
On the flip-side, arbitrary IDs are easy and because they don’t mean anything, they never need to change. In fact, everything in the TheList already has immutable UUID1s—they appear in the urls. We could totally use those! For example:
- The 1997 original hardcover: 4160db82-2cc3-47d3-abf0-49cf40a07dbf
- Bloomsbury (the Company): 1fb8507c-b1c0-4b79-8d32-7e8e658cf568
- Bloomsbury Publishing (the name): 5f968f77-3f02-4c16-b0dc-ee6fcc7a9e61
I think you probably see those are less than ideal—they are literally random and very long. Great for databases2 but not for humans.
After a little back and forth with @mcallister_alaskagrown, I settled on a kind of hybrid system—one that includes immutable components but also descriptive components that may change in relatively rare circumstances.
There are two different ID templates to cover two types of objects in the database: objects that encompass just one of the ‘capital-B’-BOOKS (the abstract Harry Potter book in the series that collectively includes all the editions of that story in all the languages) and objects that encompass multiple ‘capital-B’-BOOKS (like box-sets or compilations).
The Single-BOOK template is used for translations, macroeditions, (most) editions, volumes, and (most) books. It consists of up to six components separated by hyphens. I say ‘up to six’, because the template is ‘additive’ and reflects the relationship between these objects—what I mean will become clear looking at the template itself and some examples.
Component: | BOOK Code | Language Code | Translation Number | Macroedition Category+ Number | Edition Number | Book Number |
---|---|---|---|---|---|---|
Example: | HP1- | ENG- | i- | O1- | 00001- | 0000001 |
Description: | HP1~HP7, CC, TBB, FB, QTA | As defined by ISO-639-3 | # expressed as a roman numeral | O# Tr# A# R# Tl# N# | ##### 5-digit number | ####### 7-digit number |
BOOK Code, is hopefully self-explanatory.
Language Code: I’m a fan of standards. I use ISO-639-3 extensively myself and in the backend of the database. It assigns 3-letter codes to all the languages of the world, both dead and alive, which is used in all manner of organizational systems. In this case, it provides a nice, short, fixed-length code conducive to using in an ID.
Translation Number: This is a number expressed as a roman numeral that is unique to the language (including all the BOOKS). It includes the original books even though they are not technically ‘translations’.
This is the first stopping point in this template and it constitutes a complete Translation (or Original!) PG ID:
HP1-ENG-i
This PG ID refers to, of course, the original UK English Philosopher’s Stone. I say, ‘complete’ Translation PG ID because “HP1” is not strictly necessary. It is descriptive rather than distinctive in that the Language Code + the Translation number—ENG-i—is sufficient on its own and HP1 is just adding additional useful information which may not ordinarily be apparent from ENG-i by itself. The translation number is fundamentally arbitrarily assigned although in simple cases like this, there will be correlations.
For translations, both the ‘complete PG ID’ and the ‘sufficient PG ID’ are immutable—they will never change because I will make the commitment that if a Translation is added to TheList, I will have accurately identified the language it is in and the BOOK it is a Translation of. More on why the translation number is a roman numeral momentarily.
Macroedition Category + Number: There are a closed set of macroedition categories on the list: originals, translations, revisions, adaptations, transliterations, narrations and variants which are abbreviated O, Tr, R, A, Tl, N and V respectively. These abbreviations are paired with an assigned macroedition number that, like the translation number, is unique to the language. This is the second stopping point in the template which constitutes the complete PG ID for a macroedition:
HP1-ENG-i-O1
This is what I mean by the PG ID being ‘additive’. The complete macroedition PG ID includes the complete translation PG ID that it belongs to. The BOOK Code, the translation number and the macroedition category are descriptive for the macroedition PG ID. The language code + the macroedition number—ENG-1—is sufficient on its own to uniquely identify the macroedition. The language code + the macroedition number is immutable; the descriptive elements are not, although it is very unlikely that the BOOK Code or translation number would ever change. The macroedition category? That happened to a Bengali macroedition just the other day. It’s not common, but it does happen.
At this point, the HP1-ENG-i-O1 is starting to look really redundant. And it is both in a superficial sense—because this is the first macroedition of the first ‘translation/original’ of the first BOOK, yes 1 appears a lot—and also in a meaningful sense because if you can determine the rest just from ENG-1, why use the complete PG ID? First, remember that this is a special case—more complicated situations are going to arise and that superficial redundancy is going to go away. Second, there is a trade off in the availability of information: information included in the PG ID reduces the effort required to pull up that information so adding a little more text increases the overall usefulness of the ID for humans. (My database is perfectly happy to just use the UUID.)
But also the full PG ID is useful for grouping. This will become more apparent once we take a look at the next two components, but essentially the descriptive elements of the PG ID help keep related items together just through normal sorting.
Edition number: This is a five digit number that is immutable, unique across the entirety of TheList, and sufficient to identify the edition. Adding it to the macroedition PG ID is the complete form of an edition PG ID:
HP1-ENG-i-O1-00001
The prefixed macroedition PG ID is descriptive, and although changes are not frequent, they are more likely to happen than changes to the macroedition PG ID itself. This is most likely to happen when we find a new macroedition has been sitting under our noses the entire time, say a translation was significantly revised. All the editions after that point will be transitioned to the new macroedition and consequently the complete PG ID for those editions will have changed. However, the edition number itself will not have and so even if you’re looking at two different (complete) PG IDs, if the edition number is the same, it is still obvious that the two IDs refer to the same edition.
The volumes of multi-volume editions are treated the same as the editions themselves—they are assigned an edition number as is their parent edition and there is nothing in the PG ID that indicates the difference between edition and volume nor reflects their relationship. This is actually very similar to how editions and volumes are treated in the database itself—a volume is just a special type of edition.
Edition numbers will always be five digits (unlike the translation and macroedition numbers); numbers lower than 10,000 will be zero-padded to five digits. Five digits (starting at 1, not 0) allows us to capture 99,999 different editions—based on what is currently in the list and how I expect it to grow over time, I believe this is sufficient for several decades of editions at least, if we ever get that high ultimately.
It’s worth noting at this point that as we move down the template components, the more arbitrary the ‘sufficient’ identifier becomes and the more volatile the complete PG ID becomes. This is where the value of the descriptive components begins to be more obvious. Since edition numbers are unique across TheList and are added in an ad hoc manner through incidental research, scraping, or importing data sources, they are essentially random. A list of edition numbers couldn’t be sorted meaningfully, and no details can be gleaned from the number itself except for perhaps when one edition was added to TheList relative to another edition.
Book number: Books are the most refined object in TheList—’refined’ in the sense that it is the most specific, not the most sophisticated. These are usually distinguished by a print run, but not always; they could be different states or some other sub-edition distinguishing characteristic. Like the edition number, the book number is immutable, unique across the entirety of TheList, and sufficient to identify the book. Adding it to the edition PG ID is the complete form of the book PG ID.
HP1-ENG-i-O1-00001-0000001
As with the edition number, the PG ID prefixed to the book number is descriptive and may well change. The number will always be a zero-padded seven digit number—which allows us to capture nearly 10,000,000 books. I can’t imagine ever hitting that limit, but it does allow for 100 ‘books’ for every edition that we can capture with this scheme.
The number of digits is the characteristic that distinguishes edition numbers from book numbers and allows them to be sufficiently distinctive on their own. But even if I am completely wrong in my estimates for the number of editions or books required, the PG ID could bump up the editions by a digit and the book number by multiple digits without damaging the backwards compatibility of the existing PG IDs or start confusing edition and book numbers.
That finally brings us to the ID template for Multiple-BOOK objects which will be used for book sets and compilations. The best example of a compilation, if you’re not entirely sure what I mean by that, is the German Bible—the single volume edition that Carlsen produced that contains all seven of the novels. This template is somewhat simpler
Component: | Type Indicator | Language Code | BOOK Codes / Range | Edition Number | Book Number |
---|---|---|---|---|---|
Example: | CMP- | DEU- | HP1~7 | 00256- | 0000896 |
Description: | SET, CMP | As defined by ISO-639-3 | A list of BOOK codes included separated by period (‘.’); a range of the original seven books is abbreviated with a tilde (‘~’) | ##### 5-digit number | ####### 7-digit number (only applicable to CMP) |
It pained me to relinquish the descriptive power of the translation and macroedition components, but realistically, there’s just no way to incorporate it for so many books. Note that the book number component is only relevant to compilations—it is certainly true that book sets will go through multiple printings or might have other distinguishing states; however those differences will be captured by their child-editions.
Book sets, by their nature are a collection of editions, each of which will be identified with their own PG IDs having their own books. We’ve merely lost the insight into that relationship because it is too complex to succinctly express in the ID, much the same way as we lose that insight with multi-volume editions (you’ll recall that each volume receives its own edition number—only those PG IDs will prefix book numbers and the relationships to their parent edition is hidden).
I mentioned previously that ‘volumes’ in TheList’s database were just a special kind of ‘edition’—the same is true of book sets. And it may seem unintuitive to not better distinguish these different things; however, there’s good reason not to do so. They all share largely the same set of properties: they are hardcover or softcover, they have a publication date and place, they have a publisher and they have defined dimensions. The global ‘edition-ID-system’—ISBNs—largely treat them as being equivalent (albeit more inconsistently than I will). As far as describing Harry Potter books in the world, these three things are more similar than they are distinct.
I will consider the need for PG IDs for other objects in the database (translators, publishing companies, artists, cover art etc.) but for now I don’t think it’s really necessary. They all have names that already (mostly) uniquely identify them in a human-friendly way. But also they don’t enter the discourse in the same way; there’s less of a need to align our common understanding since we don’t barter and trade those objects the way we do our books. I hope PG IDs contribute something of value toward that goal.
- Universally unique identifier ↩︎
- UUIDs are not just unique in my database, but quite likely unique in every database in the world for the foreseeable future! Such is the nature of very large random numbers! ↩︎
Incredibly wonderful Idea!
It’s brilliant!
Cannot wait to see the final product.
I do have lots of questions about the users, though.