Let’s talk about entry IDs
So, previously I was just letting the database generate entry IDs on its own. In SQLite this means it would always generate the next entry ID as the highest one plus one. This is fine in a circumstance where entries always get assigned their IDs in the same place, but in Publ’s model that might not be the case.
So, previously I was just letting the database generate entry IDs on its own. In SQLite this means it would always generate the next entry ID as the highest one plus one. This is fine in a circumstance where entries always get assigned their IDs in the same place, but in Publ’s model that might not be the case.
One particularly fun circumstance: A site’s maximum ID is 406. You’ve written an entry on your own machine, and someone else has written an entry on their own machine (or you wrote two entries, each on a separate branch), and when testing the rendering they both end up getting entry ID 407. Which one’s right?
Well, obviously, only one of them can have that entry ID. But the way Publ was set up, the entries would fight over that ID; whichever file was touched most recently would inherit the ID, and sorting it out could get really tricky.
To that end I just implemented two strategies to prevent issues like this:
- Collision reduction
- Collision mitigation
Collision reduction
One of the core problems is the simple auto-increment mechanism for generating IDs, which makes it very easy for multiple entries to be assigned the same ID. This also has the problem of making it very easy for a user of the site to guess about future entry IDs; if the most recent entry has ID 204 and they are impatient and want to peek ahead at scheduled posts, it’s pretty likely that the next post will be 205. A scheduled post is still visible if you can guess the ID (by design), so we want to make it harder for people to guess.
So, simply using a random number generator for ID generation seems pretty simple. But one of the core design goals of Publ is to keep URLs humane — and simply generating a random integer means that IDs will generally be around 9 digits long! That’s not very humane at all!
So, what this strategy does instead is it takes the number of entries, multiplies it by a constant factor \(C\) and adds another constant \(K\), then uses that as a limit for the random number generation. Then it generates IDs until it finds one that isn’t already taken.
“But wait,” you might ask, “doesn’t that mean that you might have to generate an unbounded number of IDs before it finds one that’s not taken?” Well, yes, this is true, but because of the load factor we can predict how many times this might have to happen. It’s not worth going into the computation of this but suffice it to say that if you already have \(N\) entries, it will take on average \(\frac{N}{(N+K)(C-1)}\) attempts to find an unused ID. So for example, with \(C=5\) and \(K=10\) (which happen to be the values I’ve selected for now) and a site with 1000 entries, the average number of collisions per entry generated is \( \approx 0.2475 \) — that is, for every 4 entries created, only 1 will need a second ID generated.
Also, since \(K\) never changes and is small, as \(N\) tends towards infinity the math converges on about \(\frac{1}{C-1}\). (For that matter, \(K\) is only even there to pad out the random range for smaller sites.)
So, the nice thing about these constants is that the number of digits in an entry ID will be on the order of the number of digits of the number of entries, so if a site has 12,400 entries you can expect the entry ID to have around 5 digits. (It’s actually a bit larger; with \(N\) entries the average number of digits will be around \(\log{N} + \log{C}\), so with \(C=5\) that’s on average 0.7 digits longer than the number of entries — so for example, if you have a 3-digit entry count, 70% of your new entries will have a 4-digit ID).
Wow, that was a lot of math. And Publ doesn’t actually currently support math
display correctly, oops. Guess I have to fix that. Well, it’s on the list. Turns out that Misaka is designed to work with MathJax, so
that one was easy to cross off.
Collision mitigation
So, even with all this in place, there’s still a small chance that a colliding ID
might be generated. Or maybe someone copied an entry file to use as a template but
failed to remove the Entry-ID
header. (Which means they also probably forgot to
remove UUID
, but one issue at a time!) So, how do we deal with that?
Well, we could just see if an ID already exists when we ingest an entry, but that doesn’t help on updating an entry that already exists.
It’s also possible that an entry changed filenames but didn’t change IDs — and per Publ’s design this means we need to keep the ID persistently.
So, what we do now is when we scan an entry, if there’s an ID on it that already exists in the index, and that existing entry has a different file path, we treat the entry as if it doesn’t have an ID assigned in the file.
There are probably some sequences of events where this isn’t ideal, and this also doesn’t solve the issue of merging two sites and figuring out how to deal with any internal permalinks which were used before. Well, it’s a tough problem. (Fortunately the collision reduction strategy makes this less likely of a problem.)
Really the best approach for permanently linking to another article on the site
is to assign the article a Path-Alias
. That is why on this example site, the
major manual pages are given aliases like entry-format
so that
I don’t have to worry about the IDs getting reassigned.
What else did I do today?
Well, testing a bunch of collision reduction strategies took some time! I also tried to make some pretty graphs but didn’t really have anything compelling to show for it in the end. Maybe I’ll go back and add some when I have proper image renditions anyway.
I also found another bug in view pagination; if there’s a bunch of entries with the same date as each other, paginations were pretty much random because I didn’t have the tiebreaker sort in the associated queries. So I fixed that.
And I wrote this blog entry.