Context Sensitivity and relational comparison operators

I'm elbows-deep in a language implementation, and since (for a new experience) it's not some kind of LISP, I'm dealing with infix operators and implicit operator precedence. I solved a little problem and thought you all might find my rambling thoughts about it interesting.

I noticed that treating relational comparison operators strictly as context-free binary operators matches the way they work in many programming languages, but does not match the way they're used in mathematics (and in some other programming languages).

When I write
A < B < C < D

assuming we have a context-free parser and left-associative relational operators, this is
(((A < B) < C) < D)
and we wind up comparing the result of a comparison. But this is not what a mathematician means when they write that expression. Standard infix math notation wants it interpreted as
(A < B) && (B < C) && (C < D).

Engineering considerations say we want that interpretation, with each subexpression evaluated only once and with short-circuit semantics.

If generating code for a C program, the relop doesn't check whether the left subtree is another relop; all it knows is that the left subtree returned #false (or #true), so it compares its right subtree to that value. C treats booleans as a subrange of the numbers [0..1], so the comparisons have humorous results that require "inside" knowledge of the binary representation of booleans to comprehend. We get single evaluation of each subexpression, but no short-circuit semantics.
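
For instance, this is well-formed C and prints 1, even though 5 < 3 is false:

    #include <stdio.h>

    int main(void)
    {
        int a = 5, b = 3, c = 4;
        /* Parsed as (a < b) < c: the 0-or-1 result of a < b
           is then compared against c. */
        printf("%d\n", a < b < c);   /* (5 < 3) is 0, and 0 < 4 is 1 */
        return 0;
    }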

In languages like Java, comparing booleans and numbers is a static type error, because static type errors are probably more useful than humorous results that require inside knowledge about the binary representation of booleans to interpret correctly. The chain simply doesn't compile: no subexpression evaluation, no short-circuit semantics.

If it's a logic-descended language, the comparison may return NIL or a non-NIL value, and the non-NIL value it returns is the value of the last subexpression evaluated, i.e. the value of its right subtree. This treats numbers as a subrange of the boolean values (every number counts as #true). And any relop whose left subtree is NIL returns NIL without evaluating its right subtree. An analogue of the mathematician's interpretation of chained relational operations falls out "naturally", and if you never do numeric operations on a value you got in a context that encouraged you to think of it as a boolean, you will never notice the difference. You get single subexpression evaluation, you get short-circuit semantics - all seems good!
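
Here's a minimal sketch of such a relop in C, using a tagged value to stand in for NIL-or-number (the names and representation are mine; a real implementation would also thunk the right operand so it isn't evaluated at all when the left is already NIL):

    #include <stdbool.h>
    #include <stdio.h>

    /* A value is either NIL or a number. */
    typedef struct { bool is_nil; double num; } Val;

    /* NIL propagates without comparing; a successful comparison
       returns the right operand's value for the next link to reuse. */
    static Val lt(Val left, double right)
    {
        if (left.is_nil) return left;
        if (left.num < right) return (Val){ false, right };
        return (Val){ true, 0 };
    }

    int main(void)
    {
        Val a = { false, 1.0 };
        Val r = lt(lt(a, 2.0), 3.0);   /* 1 < 2 < 3 */
        printf("%s\n", r.is_nil ? "NIL" : "non-NIL");   /* non-NIL */
        return 0;
    }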

But it breaks static type checking. NIL has to be its own type, relational operators have to check for it at runtime, and it can get returned to continuations that expect numbers. So now everything that uses booleans or numbers has to do runtime type checking. From a static-type-checking point of view, treating numbers as a subrange of booleans is even more expensive than treating booleans as a subrange of numbers.

When I'm traversing the abstract syntax tree after parsing, I can have different code templates for emitting relational operations, and I pick one depending on whether a subtree or the parent node is or is not another relop. So I get static type checking, I get the mathematician's interpretation of chained relational operators, and I get the efficiency of caching a single evaluation instead of expanding a macro and evaluating something a second time... all is good.

But now I broke referential transparency. Identical subexpressions encountered in different contexts mean different things. All the type checking can be done statically, but the relational ops other than the root of the chain are making a choice between jumping to their own continuation site in their parent relational op (to return a number) or to the boolean continuation at the call site of the whole chain (to return #false). Only the root of the chain is special; it has no numeric continuation, and can jump instead to the exit continuation to return #true. This is statically typed return polymorphism.
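
As a sketch of what those templates might emit for A < B < C when compiled to C (the names are hypothetical; eval_A, eval_B, and eval_C stand in for the operand subexpressions):

    #include <stdbool.h>

    double eval_A(void), eval_B(void), eval_C(void);

    bool chain(void)                        /* A < B < C */
    {
        double t1 = eval_A();
        double t2 = eval_B();               /* each operand evaluated once */
        if (!(t1 < t2)) goto chain_false;   /* interior link: short-circuit,
                                               eval_C() never runs */
        if (!(t2 < eval_C())) goto chain_false;
        return true;                        /* only the root returns #true */
    chain_false:
        return false;                       /* the shared boolean continuation */
    }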

I was considering whether to be upset about breaking referential transparency, but then I realized people are using languages with exceptions all the time without complaining about it, so referential transparency can take a walk, I guess, if it gets us static type checking.

Chained relational operations

I view this kind of thing as syntactic sugar: the chained relational operators build a sequence of expressions and comparisons, which the language then handles idiosyncratically.

Then, treating the chain as an aggregate expression rather than pretending it's a bunch of unrelated subexpressions shouldn't break referential transparency any more than, say, tuple syntax, should it?

If nothing else, you might want the language to check and make sure the chained comparisons are in the same direction:
A <= B == C < D
instead of
X < Y >= X
which would be fiddly at best to do if the chained comparisons were treated as syntactically independent, unrelated expressions.

kind-of syntactic sugar, but not quite

Generally, the term "syntactic sugar" refers to a syntactic structure that can be directly deconstructed into a series of simpler syntactic structures in the language. As such,

w < x < y < z

is kind-of syntactic sugar, because in theory it could be translated to:

w < x && x < y && y < z

However, the result is actually not a direct deconstruction -- at least not from my own experience with this particular syntax in the Ecstasy programming language. Instead, the translation is more like the inlined version of:

() -> {
    val tmp_x = (x);
    if (!(w < tmp_x)) return false;
    val tmp_y = (y);
    return tmp_x < tmp_y && tmp_y < z;
}()

In other words, there is a code emission aspect to it that ensures that each argument in the expression is only evaluated once. While it's a minor detail, it's still an important one.

If nothing else, you might want the language to check and make sure the chained comparisons are in the same direction

This is a very good point. We allow '<' and '<=' to be mixed, and we allow '>' and '>=' to be mixed, but we do not allow any other combinations; a sketch of the resulting check appears after the lists below. So these are legal:

  • a == b == c
  • a != b != c
  • a < b <= c
  • a > b >= c

But these are not:

  • a == b != c
  • a < b >= c
  • a < b == c
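
Expressed as a C sketch (the enum and function are mine, not Ecstasy's), the compatibility rule for two adjacent operators in a chain comes out to something like:

    #include <stdbool.h>

    typedef enum { OP_EQ, OP_NE, OP_LT, OP_LE, OP_GT, OP_GE } RelOp;

    /* == chains only with ==, != only with !=; < and <= may be
       mixed, as may > and >=; nothing else combines. */
    static bool chain_compatible(RelOp a, RelOp b)
    {
        if (a == OP_EQ || a == OP_NE || b == OP_EQ || b == OP_NE)
            return a == b;
        bool a_ascending = (a == OP_LT || a == OP_LE);
        bool b_ascending = (b == OP_LT || b == OP_LE);
        return a_ascending == b_ascending;
    }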

Why restrict to one direction?

Isn't it natural to say 'A > B < C' when you want to know if B is the least of the three and don't care about the relative ordering of A and C?

It's still subject to short-circuit evaluation - C doesn't get evaluated if A > B comes out false, for example. It's still subject to single evaluation, in that B only gets evaluated once. And it's still clear, in that each operation applies to the pair it sits between. So why not mix them up?

Simplicity is next to Godliness (BNW)

Isn't it natural to say 'A > B < C' when you want to know if B is the least of the three and don't care about the relative ordering of A and C?

It's an assault on the reader, and for such little gain. What if you wanted to check whether B is the least of 4? Of 5? Why is 3 so special?

It's simply much easier to read and understand

a < b < c < d < e

than it is to read and understand

a < b > c != d >= e

There is a wonderful benefit to not wanting to gouge one's own eyes out with a rusty spoon.

Hmm. There SHOULD be a way to say that.

Now that you mention it, there really ought to be a way to say "least of four" or "least of five", and it really ought to be possible with reasonably easy-to-read expressions.

Maybe the comma operator should be used between expressions that should be considered as a set without implying or requiring any particular order, so someone can say

(B < A,C,D,E,F < G)

when they want to know whether B is least and G is greatest but don't care about the relative ordering of the other variables, and the compiler (compiling to C) can emit

((B<A) && (A<G) && (B<C) && (C<G) && (B<D) && (D<G) && (B<E) && (E<G) && (B<F) && (F<G))

Plus whatever organization and caching are needed to make sure each subexpression is evaluated at most once and that no more of them are evaluated than needed to determine the result.
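
A sketch of what that might look like, with evalA() through evalG() as hypothetical stand-ins for the operand subexpressions (each runs at most once, and none runs after the answer is known):

    #include <stdbool.h>

    double evalA(void), evalB(void), evalC(void), evalD(void),
           evalE(void), evalF(void), evalG(void);

    bool b_least_g_greatest(void)
    {
        double b = evalB();
        double a = evalA();
        if (!(b < a)) return false;   /* short-circuit: nothing further runs */
        double g = evalG();
        if (!(a < g)) return false;
        double c = evalC();
        if (!(b < c && c < g)) return false;
        double d = evalD();
        if (!(b < d && d < g)) return false;
        double e = evalE();
        if (!(b < e && e < g)) return false;
        double f = evalF();
        return b < f && f < g;
    }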

That saves enough work and hair in just writing out relational operations that I'm pretty sure it would be worth it. Add single-evaluation and short-circuit semantics, and I think we've just spotted something I would have considered an absolute necessity if only I were already used to it.

As you say, there is some merit in not wanting to gouge out your eyes with a rusty spoon, and just looking at the first seems like a huge win in both readability and expressiveness compared to the second.

According to the same convention, the expression I had above,

(A > B < C)

would be expressible as

(B < A,C)

which has the same number of meaningful characters and a parse tree of the same size, but seems a bit clearer because it packs A,C together with a non-ordering operator that makes that part of the intent explicit.

Interesting

Closer to what I'd personally like. (Not that that is a good measure.)



I could also imagine a "set" syntax that could be used, e.g.

(B < {A, C})



But the most common use of such a condition looks something like:

0 <= index < size

Syntactic sugar for what, exactly?

It's important to realize that infix operators, precedence, and associativity rules are a form of syntactic sugar for carrying out operations on particular subexpressions in a particular order.

When I worried that the (A < B) subexpression in (A < B < C) didn't mean the same thing as it would mean if it appeared by itself, I was implicitly buying into the notion that A < B < C was syntactic sugar for (A < B) < C.

Which it actually isn't. That breakdown of the syntax tree was just a side effect of an excessively simple grammar. With a different grammar the parse tree of "(A < B) < C" (where the internal parens are literal in the source code) doesn't have the same shape as the parse tree of "A < B < C".

The chained relational expression is a fundamentally different operation, the same way unary + and - are different from two-argument addition and subtraction. So the parse needs to distinguish the chained operation without the explicit parens from the non-chained operation.
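
One way to realize that in a parser (a sketch; the type names are mine) is to give chains their own n-ary node instead of a left-leaning tree of binary relops:

    typedef struct Node Node;   /* any expression subtree */
    typedef enum { OP_LT, OP_LE, OP_GT, OP_GE, OP_EQ, OP_NE } RelOp;

    /* One node per chain: n operands with n-1 operators between them.
       "(A < B) < C" never builds this node; "A < B < C" always does. */
    typedef struct {
        int    n;          /* number of operands, n >= 2 */
        Node **operand;    /* each emitted so it evaluates at most once */
        RelOp *op;         /* op[i] relates operand[i] and operand[i+1] */
    } Chain;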

I don't have a problem with mixing relational ops in a chained relational expression. It's entirely reasonable to write "A > B < C" when you want to check whether B is the least of the three subexpressions, or "A > B ≤ C" if you want to check whether B is both less than A and no greater than C.

Absolutely

A number of early users advocated for mixfix in BitC. I was initially reluctant, but eventually decided to give it a try. I was surprised at just how much of the expression space could be handled by precedence parsers, especially when extended with auto-thunking and multi-part keywords (that's enough to process if-then-else and much of SQL, for example).

One thing that made me uncomfortable was the unscoped character of mixfix in most implementations. Load two DSLs and there's absolutely no telling how the relative precedences will come out. Given that, it seems essential to be able to say which DSLs should be in effect at any particular point, and perhaps to have some ability to rearrange the precedence relationships, correcting commingled precedences so that joint use of DSLs becomes possible. Then there's the question of how (or how not) to import a DSL and its keywords explicitly.

So far as I'm aware, these issues remain unexplored.