Publ: Development Blog

So much for Dreamhost

2018-05-25T21:42:32-07:00

One of the overarching reasons I decided to build Publ the way I did was in order to take advantage of Dreamhost’s support for Passenger WSGI. I was expecting that to be the primary means of hosting my main site (which is way too big for a Heroku instance) and given how smoothly things were working with this site on Dreamhost I figured it wouldn’t be a big deal.

However, there was a huge monkey wrench thrown into things when I switched my site’s configuration over to Passenger; despite all of my configuration being exactly the same between publ.beesbuzz.biz and beesbuzz.biz, the rendition cache on beesbuzz.biz was getting its permissions set wrong, and there was some rather weird behavior with how it was making the temporary files to begin with.

In investigating this I attempted to upgrade my packages on publ.beesbuzz.biz, and all h*ck broke loose.

Basically, Dreamhost, being shared hosting, is in the business of overselling capacity. They used to do a very good job of managing their capacity. But then things like WordPress happened, and more sites got bigger and more complex and started taking way more memory, and for whatever reason Dreamhost decided that they would shift towards only supporting sites built in WordPress (or basic static hosting). Then they started getting increasingly aggressive about their “procwatch” process-killer, and somewhere along the line it reached a tipping point where now you can’t even run pipenv install without tripping their process monitoring.

I must have just been at the knife’s edge of that with publ.beesbuzz.biz, because spinning up a second Publ app was too much for it to handle.

So, for now, ~~I’ve rolled beesbuzz.biz back to my old MovableType-based site~~, I have made the Heroku instance of publ.beesbuzz.biz the official one (if you are reading this then great, DNS has propagated!), and I am going to look into deploying Publ on my LiNode VPS, which it turns out has way more capacity than I’m using (thanks to them having given me incremental upgrades over the 6.5 years I’ve been with them) and which should be just fine for this purpose.

UPDATE: I have now deployed the new beesbuzz.biz on my LiNode VPS and it went off without a hitch, although DNS is probably going to take a while to propagate. Configuration is a bit fiddly though, and I’d really like this to be easy for non-server-experts to do!

In the long run I’m going to move my stuff away from Dreamhost, because beesbuzz.biz was my last major site running there and at this point I’m basically paying $7/month for mediocre DNS service.

So, while setting things up on LiNode is going to be more difficult, that is what I’ll be going with for now (mostly because my LiNode plan just renewed like a month ago so I have two more years prepaid anyway).

In the longer term I’m going to look at other webhosts; WebFaction looks pretty good, for example, and they come highly-recommended in the Python developer community. And their pricing is quite competitive!

Anyway, getting Flask running on gunicorn with an Apache reverse proxy was fairly straightforward. It’s not the simplest thing to get going but at least I have a working site (modulo DNS caching, anyway). Hopefully I can get my SSL sorted out soon too.

Dates are hard

2018-05-18T12:00:00-07:00

There’s an old joke in programming, that the two hardest things to do are naming things, cache invalidation, and off-by-one errors. But this doesn’t pay sufficient respect to one of the other hardest things, namely handling date and time.

Many systems don’t bother handling dates in any sort of universal way; they just treat all entry times as being local time and call it a day. But this has a problem: whenever the time zone changes, it means that every date it refers to is now different than how it was when it first happened. Any traveling photographer who has tried reconciling EXIF times in their photo software after going across the world understands this pain. So does anyone who attempts to schedule recurring meetings between different time zones (or even different hemispheres), especially when Daylight Saving changes in one locale but not in the other (or in opposite directions).

This is also a problem for any given CMS, and it’s an intractable problem.

A naïve approach

An approach I’ve seen several times is to simply store entry dates in local time, and format them accordingly. But this messes up anything that’s timezone-aware; scheduled posts made for 2:30 AM will appear, then disappear, then reappear again when daylight saving ends, and Atom feeds will have items slip around whenever there’s a time change (or if the person running the site decides to move into a different timezone or whatever). So, this makes the data unstable; it’s only a minor hassle in the grand scheme of things but it still represents a data integrity error.

Publ’s attempt at being clever

At present, what Publ does is to store dates based on their local time of writing, but keeps them with a timezone, and indexes them based on UTC for the purpose of pagination and so on. It formats the date based on its original post time in its respective time zone, and this seems to work okay; dates always appear the same no matter when you look at them, and the relative time offset is stable with respect to when it’s being calculated.

But date-based pagination always has to be based on something, and I chose local time for that. And this can cause all sorts of weirdness to happen, especially for entries posted between 11 PM and 1 AM (depending on circumstances).

Say an entry is posted at 11:30 PM on January 31; for me (pacific time) this puts it at 23:30-08:00, which is the same as 07:30 UTC on February 1. Then later, Daylight Saving kicks in. The entry is still at 23:30-08:00 (i.e. 07:30 UTC).

Now say someone is looking at January entries. During standard time their pagination range is going to be 00:00-08:00 on January 1 thru 23:59-08:00 on January 31, which translates to UTC as 08:00 January 1 - 08:00 February 1. Okay, so 07:30 UTC on February 1 comes before 08:00 February 1. Great, my January 31 entry still appears in January.

But then they come back during DST, and the local timezone is now -07:00. So someone browsing the site for January entries now gets a pagination range of 07:00 January 1 - 07:00 February 1. Suddenly my last-minute-of-January entry is now part of February’s page instead.
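The drift in this example can be reproduced with fixed UTC offsets. This is only a minimal sketch, not Publ's actual pagination code: the names `PST`, `PDT`, `january_range_utc`, and `in_january` are made up for illustration, and the year 2018 is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for Pacific standard and daylight time (assumption:
# a real site would use a proper tz database rather than hard-coded offsets).
PST = timezone(timedelta(hours=-8))
PDT = timezone(timedelta(hours=-7))

# The entry from the example: 23:30-08:00 on January 31, i.e. 07:30 UTC on February 1.
entry = datetime(2018, 1, 31, 23, 30, tzinfo=PST)


def january_range_utc(viewer_tz):
    """Compute the UTC bounds of 'January' as seen from the viewer's current offset."""
    start = datetime(2018, 1, 1, tzinfo=viewer_tz)
    end = datetime(2018, 2, 1, tzinfo=viewer_tz)
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc)


def in_january(posted, viewer_tz):
    """Check whether an entry's UTC instant falls inside the viewer's January range."""
    start, end = january_range_utc(viewer_tz)
    return start <= posted.astimezone(timezone.utc) < end


print(in_january(entry, PST))  # True: 07:30 UTC is before 08:00 UTC on February 1
print(in_january(entry, PDT))  # False: 07:30 UTC is after 07:00 UTC on February 1
```

The entry itself never changes; only the viewer's offset does, and that alone is enough to move it from one page to another.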

Let’s say down the road I move to New York, which means my local timezone is now -05:00 standard or -04:00 daylight saving. Oops, all of the pagination for my site has changed again. And what’s worse, all the older entries no longer make any sort of sense, especially my posted-near-midnight comics.

Incidentally, this violates one of the core tenets of Publ — that pagination should be stable.

Splitting the difference?

So, how about this approach: always paginate and sort entries based on what their local time is (so an entry posted on 01/31/2017 always appears to be on the page for 01/31/2017 regardless of the indicated time zone), and only use the UTC normalization for determining a relative interval to the current time (i.e. whether it’s in the future for scheduled posts, and how many seconds ago it was posted for the “N seconds ago” display). This seems like an okay compromise, although it does mean that if a person is traveling between time zones things might get a little weird around the boundaries, and sorting might not always make perfect temporal sense (but it exposes fewer boundary conditions that will make pagination break, so while it’s not technically correct it’s at least predictable).
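As a rough sketch of that compromise (again, hypothetical helper names, not Publ's real code): paginate and sort on the wall-clock value the author saw, and only fall back to absolute time when comparing against "now".

```python
from datetime import datetime, timedelta, timezone


def page_key(posted):
    """Bucket an entry by the year and month on its author's wall clock."""
    return (posted.year, posted.month)


def local_sort_key(posted):
    """Sort by wall-clock time, ignoring the UTC offset entirely."""
    return posted.replace(tzinfo=None)


def seconds_since(posted, now):
    """Aware datetimes subtract as absolute instants, so this stays UTC-correct."""
    return (now - posted).total_seconds()


pacific = timezone(timedelta(hours=-8))  # assumed fixed offsets for illustration
eastern = timezone(timedelta(hours=-5))

late_january = datetime(2018, 1, 31, 23, 30, tzinfo=pacific)  # 07:30 UTC, Feb 1
early_february = datetime(2018, 2, 1, 0, 30, tzinfo=eastern)  # 05:30 UTC, Feb 1

print(page_key(late_january))  # (2018, 1)
# In absolute time the "February" entry was actually posted first...
print(early_february < late_january)  # True
# ...but by wall clock the January entry still sorts first.
print(local_sort_key(late_january) < local_sort_key(early_february))  # True
```

The last two lines are the "time will sometimes run backward" trade-off in miniature: the page an entry lands on never moves, at the cost of sort order occasionally disagreeing with absolute time.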

But, that seems less broken than other possibilities. It satisfies the principle of least surprise, it keeps pagination stable, and it keeps the presented date consistent with the authored date (even if it might cause some weird jumping-around in some cases).

So, I think that is what I will change Publ to do. It’s (slightly) more code and more annoyance but it seems like the best path forward.

Even if it means time will sometimes run backward.

The Trouble with PHP

2018-05-08T00:00:00-07:00

I’ve had people ask me why I’m not building Publ using PHP. While much has been written on this subject from a standpoint of what’s wrong with the language (and with which I agree quite a lot!), that isn’t, to me, the core of the problem with PHP on the web.

So, I want to talk a bit about some of the more fundamental issues with PHP, which actually go back well before PHP even existed and are inextricably linked with the way PHP applications themselves are installed and run.

(I will be glossing over a lot of details here.)

Some history

Back when the web was first created, it was all based around serving up static files. You’d have an HTML file (usually served up from a public_html directory inside your user account on some server you had access to, which was sometimes named or aliased www but more often was just some random machine living on your university’s network), and it acted much like a simplified version of FTP: someone would go to a URL like http://example.com/~username/ and see an ugly directory index of the files in there (if you didn’t override it with an index.html or, more often in those days, index.htm), and then they would click on the page they wanted to look at, like homepage3.html, and it would retrieve this file and whatever flaming skull .gif files it linked to in an <img> tag and the copy of canyon.mid you put an <embed> around, and that would be that. The web server was really just a file server that happened to speak HTTP.

Then one day, servers started supporting things called SSIs, short for “server-side includes.” This let you do some very simple templatization of your site; the server wouldn’t just serve up the HTML file directly, but it would scan it for simple SSI tags that told the server to replace this tag with another file, so that you could, for example, have a single navigation header that was shared between all your pages, and a common footer or whatever.

But this mechanism was still pretty limited, and so about two minutes later someone came up with the idea of the Common Gateway Interface, or CGI; this would make it so the server would see a special URL like /cgi-bin/formail.pl and instead of serving up the content of the file, it would run the file as a separate program and serve up its output.

At this time, HTTP generally used just a single verb, GET, which would get a resource. CGI needed a way of passing in parameters to the program. Instead of just running the program like a command line (which would be very insecure), they passed in parameters through environment variables; for example, if the user requested the “file” at /cgi-bin/formail.pl?email=fwiffo@example.com&text=Hi+I+like+your+site!, the web server would set the environment variable QUERY_STRING to the value of everything after the ?, which formail.pl would then parse out.

If the POST verb were used instead, then the server would also read some additional data from the user’s web browser and then send that to the script via its standard input.
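To make the mechanism concrete, here is a small sketch of what the script's side of that contract might look like in Python. The function name `read_cgi_params` is made up, and it assumes the POST body is ordinary form-encoded data; a real script would pass in `os.environ` and `sys.stdin.buffer` instead of the stand-ins used here.

```python
import io
from urllib.parse import parse_qs


def read_cgi_params(environ, stdin):
    """Collect request parameters the way a CGI script receives them.

    GET parameters arrive in the QUERY_STRING environment variable; for POST,
    the server additionally pipes the request body to standard input.
    """
    params = parse_qs(environ.get("QUERY_STRING", ""))
    if environ.get("REQUEST_METHOD") == "POST":
        # Assumption: the body is application/x-www-form-urlencoded.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = stdin.read(length).decode("utf-8")
        params.update(parse_qs(body))
    return params


# Stand-in for what the web server would set up before launching the script.
fake_env = {
    "REQUEST_METHOD": "GET",
    "QUERY_STRING": "email=fwiffo@example.com&text=Hi+I+like+your+site!",
}
params = read_cgi_params(fake_env, io.BytesIO(b""))
print(params["email"][0])  # fwiffo@example.com
print(params["text"][0])   # Hi I like your site!
```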

Basically, the web server was no longer just a file server, but a primitive command processor.

Early security

Back when this first started, system administrators knew better than to let just anyone run just any program from the web server. After all, people might do silly things like make it very easy to execute arbitrary commands on the server — and since the web server often ran as the root/administrator user, this would be very bad indeed. Even the admins who were savvy enough to set up a special sandbox user for the HTTP server would still need it to run everything from a common, trusted account that might have had access to common areas of the server.

So, the usual approach was to have just a single /cgi-bin/ directory with trusted programs that were vetted and installed by the administrator, for things that they felt were important or useful for everyone to have. Usually this would be things like standard guest books (the great-great-grandfather to comment sections) or email contact forms (since spam was starting to become a problem and it was already dangerous to put your email address on the public web).

Back in these days people generally didn’t have a database — after all, Oracle was expensive — and it didn’t really matter anyway; if you wanted to have a complex website you’d just run some sort of static site generator (which was often written in tcsh or Perl or something) and if you needed scheduled posts you’d do it by having a cron job periodically update things. So, it wasn’t really that much of an impediment to have this setup.

If you were really savvy and wanted to run, say, an interactive online multiplayer game of your own design, you’d simply run your own server (often under your desk in your dorm room) and you’d have root access and could install everything you wanted in /cgi-bin/.

Because everything in /cgi-bin/ was run as a program, you knew better than to let your scripts save other files into that same directory; if it was a thing where people could upload files or post comments, it’s not like it would do any good anyway (since then the server would try to run them as programs, and you can’t run a .jpg).

Shared hosting

Then as the web really started to take off, shared hosting providers started appearing, and CGI access became a pretty commonly-requested high-end feature. Generally the shared hosting providers didn’t want to let just anyone upload a script to be run by the server, but they also didn’t want to have to manually vet each and every script that users wanted to install. So, as a compromise, they set up special rules so that within your own server space you could have a /cgi-bin/ directory and that things run from that directory would run under your account, rather than as the web server (using a mechanism called suexec).

This provided a pretty good compromise; users still had to know what they were doing in order to install their scripts, but they still ran from a little sandboxed location, and because of the way suexec worked it was pretty unlikely for even a very badly-written script to cause problems, because if the script tried to save out an executable file into the cgi-bin directory, it wouldn’t be saved out with execute privileges, so it would just cause an error 500 to occur. After all, /cgi-bin/picture.jpg wasn’t a program, so why should it run?

Increased flexibility

But then things started to get a little more complicated. People wanted their main index page to be able to run as a script, without it forwarding the page to /cgi-bin/index.pl or whatever.

So, another compromise happened: the CGI mechanism, which previously was set up to only run the scripts from the /cgi-bin/ directory, got a few new rules, such as “if the filename ends in .cgi (or other common extensions like .pl or .py) run it as a script.” It still needed permissions to be set correctly, though, and by this point suexec was generally set up so that there were even more rigorous checks before it would run the script. And there were so many safety checks in place that this was still generally okay.

Around this time it also started becoming common to have access to a database such as MySQL or PostgreSQL, which allowed more flexibility and more two-way content. Forums became a thing. So did early blogs. Most of this software started out by having the database just for storage and the software would simply write out static files, but this started to have scaling problems and the webserver got busy with the software writing these files out all the time, so it became more common for the software to simply read from the database directly as it ran. This helped somewhat, but it also shifted a significant amount of load over to constantly establishing short-lived database connections, because every time the forum program ran it had to connect.

Hello PHP

At some point, PHP started to get popular.

PHP itself was originally intended as another way of adding server-side scripting into HTML files; it was in effect a templating system for HTML. In the earliest days it was often just treated as another scripting language; the server would be configured to consider .php as another name for .cgi or .pl or whatever, and the file would still be run as a script. In some cases it even needed to start with #!/usr/local/bin/php and it needed to be set executable with the correct permissions and so on (although this setup was uncommon).

However, most sites used mod_php, a server extension that allowed the web server to handle PHP files directly. In many respects it was very similar to mod_cgi, except it did a few interesting things. One of the undeniable benefits was that it was now able to maintain the database connection persistently, rather than having to re-establish a connection every time a script ran. It was also generally a bit nicer for speed because commonly-used PHP scripts could stay in memory and not have to be re-interpreted every time a page was loaded.

But there were a couple of other implications this led to. In particular:

It embedded the PHP interpreter into the web server itself (rather than running it as an external program)
Since it was no longer shelling out to an external program, it could always run a .php file regardless of its execution permissions — and so that’s what it did

There were a few different variations on this and it didn’t always just run PHP from the web server (for example, some of the better hosts figured out that they could have each user run their own separate per-user FastCGI server that would run the PHP programs as the separate users, or whatever) but regardless of the setup, you now had PHP always running and not having to care about the permissions of the file, meaning you now had some persistent process running what was essentially executable code without the usual safeguards that a shared server would have.
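The difference between the two models can be shown with a toy comparison. This is only an illustration of the two decision rules described above, not how Apache, suexec, or mod_php are actually implemented, and both function names are invented.

```python
import os
import stat


def cgi_style_would_run(path):
    """Classic CGI rule: the file must carry an execute bit before it is run."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def mod_php_style_would_run(path):
    """Embedded-interpreter rule: only the file extension is consulted."""
    return path.lower().endswith(".php")
```

Under the first rule, a freshly uploaded file with ordinary 0644 permissions is inert. Under the second, the same file is runnable the moment it lands somewhere the web server can see, whatever its permissions are.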

This actually seemed like a good thing at the time, but then many, many pieces of software started allowing arbitrary people to upload images, and often wouldn’t make sure that what was supposedly an image was actually an image…

And so that’s where we stand today.

This makes sites potentially vulnerable even if they aren’t written in PHP themselves; for example, if your HTML directory permissions are set to be slightly too permissive, and another site on the server gets hacked, that hacked site can potentially be used to place a .php file into your site, and since mod_php doesn’t check ownership permissions it now runs on your site with whatever permissions PHP would normally run in your account. (And this isn’t just theoretical; I’ve had sites hacked in this way! Now I run a nightly script that ensures that my directory permissions are correct and tells me about new .php files that appeared since the last check, just to be sure.)
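A nightly check along those lines could look something like the following. This is a hypothetical sketch rather than the actual script mentioned above; the function name `audit` and the choice to flag any group- or world-writable directory are assumptions.

```python
import os
import stat


def audit(root, last_check):
    """Walk a site directory and report two kinds of warning signs.

    Returns (loose_dirs, new_php): directories that are group- or
    world-writable, and .php files modified after `last_check`
    (a Unix timestamp recorded by the previous run).
    """
    loose_dirs, new_php = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        mode = os.stat(dirpath).st_mode
        if mode & (stat.S_IWGRP | stat.S_IWOTH):
            loose_dirs.append(dirpath)
        for name in filenames:
            if name.lower().endswith(".php"):
                path = os.path.join(dirpath, name)
                if os.stat(path).st_mtime > last_check:
                    new_php.append(path)
    return loose_dirs, new_php
```

Run from cron, the caller would store the time of each run and pass it back in as `last_check` the next night, then mail out anything either list contains.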

So, long story short, one of the biggest problems with PHP isn’t with the language itself, but with the way that PHP gets run; people (and their bots) can find ways to upload arbitrary files with a .php extension and, if that upload is visible to the webserver (which it often will be), then a request to view that file will execute that file, regardless of its origin, and from there it can do anything that your own site can.

Other PHP features of note

Granted, the erroneously-executable upload feature is only responsible for some of the security exploits I’ve seen in the wild. I wasn’t really intending to get into language-specific issues (after all, I linked to much better, more-comprehensive articles about it in the introduction), but it’s worth mentioning some of them anyway, as I have seen all of these be used to hack websites I’ve helped to clean up and secure.

The biggest one: For a very long time, the include() function would happily support any arbitrary URL and would download and run whatever URL it was given. And it was very easy for a PHP script to be accidentally written to allow an arbitrary user to provide such an arbitrary URL. (And by “a very long time” I mean that this was the default configuration until very recently, and many hosts still configure it that way for backwards compatibility.)

Some might be looking at the PHP docs I linked to there and thinking, “wait, but it’s not running the PHP code locally.” What the docs mean is that if you do something like include('http://example.com/foo.php'); it’s the output of foo.php that gets included. However, that output could in turn be more PHP code, which would then be executed locally, meaning on your server. And PHP doesn’t even care what the file extension is; doing an include() on asdf.txt or pony.jpg will happily execute whatever PHP blocks exist inside of it as well.

There are also a few other features of PHP that lend themselves to arbitrary code execution. One particularly fun one was the PCRE e flag, which indicated that the result of the regular expression should be executed as arbitrary code; and as PCRE flags are embedded into the regular expression itself, a carefully-crafted search term (on a less-carefully-crafted search page) could run arbitrary code. Fortunately, this has been removed in PHP 7; unfortunately, a lot of web hosts still run PHP 5 (or older!) and so this option (which never had a single legitimate usage) is still available on the vast majority of web servers out there.

How Flask (and therefore Publ) are different

So, I’m posting this on the Publ blog, which implies that I’m trying to build a favorable comparison for Publ. And that’s a perfectly fine inference to draw.

Publ is built on Flask, which uses the WSGI (Web Server Gateway Interface), rather than the CGI, model of execution. This is a bit more complicated than I want to get into but the short version is that rather than the web server running a program based on the URL, Publ stays running as a standalone program that the webserver sends commands to as requests come in. So, it’s never asking a file how it should be run, but instead it’s telling a single program to handle a request. So, there’s no danger of some random file being executed when it shouldn’t be.
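For a sense of what that looks like at the lowest level, here is a minimal WSGI application using nothing but the standard calling convention. It is a generic sketch, not Publ or Flask code; a server such as gunicorn would be pointed at the `application` callable and would call it once per request while the process stays running.

```python
def application(environ, start_response):
    """A WSGI app is one long-lived callable that the server invokes per request.

    The requested path arrives as data in `environ`; nothing on disk is
    looked up and executed based on the URL.
    """
    path = environ.get("PATH_INFO", "/")
    body = f"You asked for {path}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Whatever the URL says, the same function handles it. The URL is an argument to the program rather than the name of a program to run.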

“But wait,” you might ask, “isn’t that exactly what you were complaining about mod_php doing?” Well, that’s true, mod_php works by always having the PHP interpreter running and able to execute whatever arbitrary code it comes across. However, in the Python world, code is kept separate from data. Loading a URL in Flask isn’t mapping to a script file that gets loaded and run, it’s calling an established, fixed function that loads a content file and formats it through a template.

Another thing that Flask does is it separates out template content (which is executable) from static file content. Static files aren’t executable by default. Templates can embed arbitrarily-complex code, but they can only use functions that are provided to them — there’s no direct access to the entire Python standard library, for example, and so the most dangerous functions aren’t included by default. (And Publ does not provide any of those functions either, at least not purposefully.)

Important note: When I say static files aren’t executable by default, this simply refers to how Publ sees them. If your site is configured to serve up static files where PHP or CGI scripts are executable, then any such scripts that end up in your static files will indeed be executable. This is going to be the case on pretty much any shared hosting provider, for example.
Also, regardless of the server setup, Publ can’t magically protect your content or template directory from outright misconfigurations with permissions. Even classic static sites need to be secured from third-party/unauthorized access.

Publ itself also only knows how to handle a handful of content formats (Markdown, HTML, and images) and ignores everything else. So if a .php file somehow ends up in the content directory, it won’t matter at all; Publ just ignores it. It will never attempt to run code that’s embedded in a content file, nor does it even know how to. And Publ doesn’t handle arbitrary user uploads anyway (nor is there any plan to ever support this); anything that would be potentially hazardous would have been put there by some other means.
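The general idea of "only handle known formats, treat everything as data" can be sketched as follows. To be clear, this is not Publ's implementation: the function `find_content` and the exact extension list are assumptions made for illustration.

```python
import os

# Hypothetical allowlist; Publ's real set of handled extensions may differ.
HANDLED_EXTENSIONS = {".md", ".html", ".jpg", ".png", ".gif"}


def find_content(content_dir, url_path):
    """Return the bytes of a handled content file, or None. Never executes anything."""
    root = os.path.realpath(content_dir)
    candidate = os.path.realpath(os.path.join(root, url_path.lstrip("/")))
    if os.path.commonpath([root, candidate]) != root:
        return None  # the path escaped the content directory
    if os.path.splitext(candidate)[1].lower() not in HANDLED_EXTENSIONS:
        return None  # unknown type (a stray .php file, say) is simply ignored
    if not os.path.isfile(candidate):
        return None
    with open(candidate, "rb") as handle:
        return handle.read()  # data to be formatted by a template, not code to run
```

A .php file dropped into the content directory falls through the extension check and is never read, let alone interpreted.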

Publ’s design is basically just a fancy way of presenting static files, just like in the early days of the web. It just serves up the static files dynamically. Or, as I keep on saying, Publ is like a static publishing system, only dynamic.

(Of course, if your directory permissions are set wrong, someone can still use someone else’s exploited PHP-based site to attack your account and modify Publ’s code. But there’s nothing that Flask or Publ can do to prevent that, and this is just a general security problem that impacts everyone regardless of what they’re running.)

It would of course be foolish of me to claim that Publ itself is 100% secure and impossible to hack. And at least on Dreamhost there’s the very real possibility that somehow an arbitrary .php file gets injected into the static files (perhaps by an incorrect directory permission or whatever), which isn’t a flaw in Publ itself but the end result (a hacked site) is the same. So far as I can tell there’s no way to entirely disable PHP on a Dreamhost-based Publ instance, and it’s really the ability to run PHP that makes PHP so dangerous in this world.

So, I’m not going to claim that Publ is 100% secure or unhackable. But it sure has one heck of a head start.