Publ: Development Blog

v0.3.9 Released

Posted (5 years ago)

This entry marks the release of Publ v0.3.9. It has the following changes:

  • Added more_text and related functionality to image sets (an example being visible over here)
  • Improved and simplified the caching behavior (fixing some fiddly cases around how ETags and last-modified worked, or rather didn’t)

I also made, and then soon reverted, a change around how entry IDs and publish dates were automatically assigned to non-published entries. I thought it was going to simplify some workflow things but it only complicated the code and added more corner cases to deal with, all for something that doesn’t actually address the use case I was worried about. So never mind on that.

(What happened to v0.3.8? I goofed and forgot to merge the completed more_text et al changes into my build system first. Oops.)

See below for more on the caching changes.

Previously, I was trying to compute the ETag and Last-Modified for every page based on a rather conservative metric; basically it would try to find the most recently-modified file that was likely to affect the rendering of a page, and generate an ETag based on its file metadata and a Last-Modified based on its file modification time.

But determining which files were likely to contribute to the output is hard to do without simply running the template system, so it tried building a bunch of heuristics. And in the end it still ended up making it so that pretty much everything on the site either got the same ETag as everything else (making it not very useful for things that work by spidering the site, such as Pushl) or had a non-useful Last-Modified and this still required considerable file I/O to do.

Also it turned out that some of the caching logic was possibly doing weird things with the way that Flask-Caching works.

Now what I do is I just generate the page text and compute a hash of said text, and cache those together at the template rendering level, rather than at the route level. This turns out to be nearly as fast on a cache miss, and significantly faster on a cache hit, and the cache hits are themselves way more cacheable and there’s fewer edge cases to worry about and so on.

I also got rid of the Last-Modified check because anything which cares about this will be using an ETag anyway, and Last-Modified logic is hard to get right.

There’s probably a few things that could now be more efficient but those feel like micro-optimizations; for example, path redirection logic is no longer cached (but it’s pretty fast anyway), and I feel like making the website itself more cacheable (and more correctly cacheable) is worth it.

At some point I do need to run some sort of profiling tool on Publ to see where any actual code bottlenecks are; right now I mostly do smoke-test-type benchmarking and occasionally check to make sure my main website isn’t taking up significant CPU. I’d love to see if there’s any low-hanging fruit to make it scale even better though.

Update: Now that my sites are deployed on the new version, I can share some interesting timing information.

Before, a Pushl update of beesbuzz.biz and publ.beesbuzz.biz took around 45 seconds, mostly spent downloading every single entry (because caching is hard).

After, a Pushl update takes around 4 seconds. Meaning, it’s 10x faster – all because it can cache smarter and can simply ignore the vast majority of the feed content!

(And most of the time it takes is actually spent spinning up the pipenv.)