When is an Error an Error?

Lately I’ve been thinking about how well our current computing systems encode meaning, and one place I think we fall short is errors.

Errors come in many forms, and we have developed many ways of encoding them and of deciding how to proceed (or not) when one occurs. Many systems (programs, protocols, people, etc.) label certain behaviors as errors that other systems do not. This distinction is pretty vague, I admit, but it is perhaps clearest when comparing a lookup function that throws a KeyNotFound exception with another lookup function that returns a Maybe a. Each of these has different implications for what the program is supposed to accomplish and what the programmer wants to account for. I’m not vouching for one or the other here, just noting that one person’s error is another person’s normal case.
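To make that contrast concrete, here is a minimal sketch in Python. It is only an illustration: I’m using Python’s built-in KeyError as a stand-in for KeyNotFound and Optional[int] as a rough analogue of Maybe a, and the AGES table and both function names are made up for this example.

```python
from typing import Optional

# Hypothetical data, only for illustration.
AGES = {"alice": 30}

def lookup_or_raise(name: str) -> int:
    """A missing key is treated as an error: this raises KeyError."""
    return AGES[name]

def lookup_maybe(name: str) -> Optional[int]:
    """A missing key is treated as a normal case: this returns None."""
    return AGES.get(name)

# Same situation, two different verdicts on whether it is an "error".
try:
    lookup_or_raise("bob")
except KeyError:
    print("lookup_or_raise: missing key is an error")

if lookup_maybe("bob") is None:
    print("lookup_maybe: missing key is just another result")
```

Neither version knows anything the other doesn’t; they only disagree about whether absence deserves the label “error”.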

The real impetus for writing this comes from thinking about (and dealing with) the boundaries of systems; the two examples I have in mind are HTTP and execv.

First, imagine you are writing a program and need to launch a child process and wait for its return code. How do we (typically) tell if everything went “okay”? A nonzero exit status means something went wrong; otherwise things should be okay. This is not always the most informative thing unless we dig through a manual, but at least the information was communicated to us; we get at least 1 bit of information (zero vs. nonzero) out of checking the exit status (more if we know that the child uses some specific set of codes for various things). However, if you now add another agent into the mix (an “interpreter”, or otherwise something which has its own set of errors on top of the errors that may be generated by its input), then we run into issues. If I launch python foo.py as a child process and get a nonzero exit status, how do I know whether python returned that status (e.g. foo.py has a syntax error) or whether foo.py gave that status?
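Here is a small sketch of that ambiguity, under a few assumptions of mine: it uses Python’s subprocess module rather than calling execv directly, it passes the program text with -c instead of writing a real foo.py, and the exit_status helper is a name I made up.

```python
import subprocess
import sys

def exit_status(source: str) -> int:
    """Run `source` in a child Python interpreter and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        stderr=subprocess.DEVNULL,  # hide the traceback; we only look at the status
    )
    return result.returncode

# Case 1: the interpreter rejects the program (syntax error, nothing ever runs).
interpreter_failed = exit_status("def (")

# Case 2: the program runs fine and deliberately reports failure.
script_failed = exit_status("import sys; sys.exit(1)")

# Both are nonzero, and the parent cannot tell from the status alone who spoke.
print(interpreter_failed, script_failed)
```

With CPython I would expect both statuses to be the same value, which is exactly the problem: the one channel we have carries two different speakers.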

Second, imagine you are talking to a “REST” API over HTTP and send something like GET api/v1/users/aconz2 and get back a 404. How do you know whether the implementer of this API is using 404 to mean “that user is not found” or whether you made a typo in the path and it should really be GET api/v1.0/users/aconz2? This is essentially the same thing as adding an interpreter: a “regular” (think static file server) HTTP server would only return 404 for a resource which does not exist, but now that something (a Python server, a Ruby server, etc.) is interpreting the meaning behind the path, we don’t know who is signaling that error.
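A toy version of such a server shows how the two meanings collapse into one status. This is a sketch, not any real framework’s routing: the USERS set, the handle function, and the choice of /api/v1.0/users/ as the “correct” prefix are all assumptions for the example.

```python
# Hypothetical user store, only for illustration.
USERS = {"aconz2"}

USERS_PREFIX = "/api/v1.0/users/"

def handle(method: str, path: str) -> int:
    """Return the HTTP status a toy API might send for a request."""
    if method == "GET" and path.startswith(USERS_PREFIX):
        username = path[len(USERS_PREFIX):]
        if username in USERS:
            return 200
        return 404  # the application says: no such user
    return 404      # the router says: no such route

print(handle("GET", "/api/v1/users/aconz2"))    # typo in the version segment
print(handle("GET", "/api/v1.0/users/nobody"))  # right route, unknown user
print(handle("GET", "/api/v1.0/users/aconz2"))  # right route, known user
```

The first two requests fail for completely different reasons, yet the client sees the same number for both.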

What I think both of these examples illustrate is that we lose information along the communication path. Some might say that this is okay because we should treat the thing-we-are-communicating-with as a black box, and so we can only interpret the outer layer’s responses in order to maintain encapsulation. But this can really suck in terms of tracking down issues in systems. The whole point of errors should be to give us the information necessary to a) act accordingly (i.e. we know there exists an error which is different from a successful case) and b) implement a fix somewhere in the system to make it a successful case, if it should be one (i.e. my program is wrong and I need to fix it, or your program is wrong and you need to fix it). When errors get overloaded in meaning, we lose the ability to reason about either of those goals. This is the #1 reason Stack Overflow exists: error handling, across the board, is lossy and we need humans to piece together all this crap.

I would love to say I have a great idea that would make this better, but I don’t. Maybe you do? Let me know!