# ICFP / FSCD day 2 – rough notes

(Blog posts for Day 1, Day 2, Day 3, Day 4 (half day))

I decided to take electronic notes at ICFP and FSCD (colocated) this year, and following the example of various people who put their conference notes online (which I’ve found useful), I thought I would attempt the same. However, there is a big caveat: my notes are going to be partial and may be incorrect; my apologies to the speakers for any mistakes.

(ICFP Keynote #2) – Challenges in Assuring AI, John Launchbury

The moment we think we’ve got our (verification) hands round the world’s systems (e.g., compilers), a whole new set of systems appear: AI.

The dual of “have I tested enough?” is “how good is my model?” (in a verification-based approach). What is the model we have for AI systems?

What is intelligence? Information processing. But there are different kinds of information processing ability; asking if a system is ‘intelligent’ is too coarse/binary. John breaks this down into ‘perceiving’ P, ‘learning’ L, ‘abstracting’ A (create new meanings), and ‘reasoning’ R (plan and decide).

AI wave 1 : rule-based, hand-crafted knowledge: P 1, L 0, A 0, R 3 (poor handling of uncertainty, no learning). Type systems are a little like this (perceiving, didn’t have to tell it everything). Can do some pretty amazing stuff: see smart security systems that analysis code and patch bugs.

AI wave 2: statistical learning. Specific problem domains, train them on big data. Still requires a lot of engineering work (to get the right models and statistical techniques for the task). P 3, L 3, A 1, R 1 (less good at reasoning).

What does it mean to prove a vision system correct?

Manifold hypothesisdata coming in is high-dimensional but the features you are interested in form lower-dimensional structures. E.g., with cars, lots of data comes in but understanding the data means separating the manifolds (of low dimensionality).
Another example, a 28×28 image is a 784 dimensional space (as in 784 numbers, 784-length vector). Variation in handwritten digits form 10 distinct manifolds within the 784 dimensional space. High-degree of coherence between all data samples for a particular digit. (“Manifold” here may be slightly more informally used than the full mathematical definition).

Imagine two interlocking spiral arms which are really 1-dimensional manifolds in a 2-dimensional space. By continuous transforms on the 2-dimensional space these can morphed into linearly separable components. Sometimes stretching into a new dimension enables enclosed manifolds to be separated.

Neural nets separate manifolds; each layer of a neural network stretches and squashes the data space until the manifolds are cleanly separated. Non-linear functions at each step (otherwise multiple linear transformations would collapse into just one layer). Use calculus to adjust the weights (error propagation backwards through the layers).

Structured neural nets have different kinds of layers, e.g., feature maps (which perform a local analysis over the whole input); generated potentially via convolutions (i.e., turn a 28×28 image into 20 24×24 images), do some subsampling. Programming these requires some trial-and-error.

Models to explain decisions: “why did you classify this image as a cat?”; not just “I ran it through  my calculations and ‘cat’ came up as highest”. A deep neural net could produce a set of words (fairly abstract) and then a language-generating recurrent neural net (RNN) translates these into captions (sentences). Currently, statistically impressive but individual unreliable.

Adversarial perturbations: Panda + <1% computed distortion, becomes classified as a Gibbon. Whilst the two images are indistinguishable to a human, the noise interferes with the manifold separation. There are even universal adversarial perturbations that throw-off classification of all images.

Assured control code (e.g., in a helicopter): a physical machine is modelled (system dynamics, power, inertia), a control system design is created, from which controller code is generated. But this does not produce a very nice to fly system. Need some kind of refinement back to the control code. Given the correct structure for the control code framework then the fine tuning from feedback on the actual system is very effective. This can be overlayed with different goals (different fitness functions), e.g., safety over learning.

AI wave 3: combining wave 1 and 2 techniques. Perceive into a contextual model on which abstractions are computed and which can be tuned (learning) to produce reasoning. Aim to get a better intelligence profile: P 3, L 3, A 2, R 3. Hand assurance arguments on the contextual model part instead.
e.g., build models that are specialised based on our knowledge, e.g., we know the probably number of strokes in each digit 0-9 and its trajectory. The generative model generates explanation of how a test character was generated (go from the model of whats its looking for to what it sees; how likely is the thing seen generatable from the model of a particular digit).

Need to be able to specify intent in a variety of ways; all real world specifications are partial and humans can’t always communicate the full intent of what they want (cf., trying to explain how to drive a car).

What are we trying to assure? Mistakes in sensing/reasoning leading to undesirable actions, undesirable emergent behaviours, hackable systems being subverted, misalignment between human/machine values.

(AlphaGo is a wave3-like system as it combined the neural net-based classification of moves with traditional planning/reasoning/tree-based pruning/search).

(FSCD Keynote #2) – Uniform Resource Analysis by Rewriting: Strengths and Weaknesses, George Moser

Can we use complexity measures on term rewriting systems to make complexity arguments about higher-order functions?

The main scheme is to convert a program via a complexity reflecting transformation into a Term Rewriting System (TRS), then do automated resource analysis on the TRSS to get an asymptotic bound. The first half of the talk is some background on doing complexity analysis (which was a bit fast for me so my notes are a bit incoherent below). There are some really nice complexity analysis tools for programs coming out of this work (links near the end).

Some preliminaries: Example, set up list reverse as a finite set of rewrite rules (a Term Rewriting System, TRS):

rev(xs)            -> rev'(xs, nil)
rev'(nil, acc)     -> acc
rev'(x :: xs, acc) -> rev'(xs, x :: acc)

A computation of this set of rules is the application of the rules from left to right. The rules are non-overlapping.

Can see rewriting as an abstraction of functional programming, or can see it as part of equational reasoning (universal algebra); but this really generalises functional programming (function symbols can appear nested on the left-hand side).

Definition: A TRS is terminating if the rewriting relation is “well-founded”.
Definition: A function symbol f is “defined” if it is the root symbol on the left of a rule (other f is a constructor).
Definition: Runtime complexity wrt. a terminating TRS is defined:

$dh(t) = max \{ n \mid exists u . t \rightarrow^n u\}$
$rc(n) = max \{ dh(t) \mid \mathsf{size}(t) \leq n \wedge \textit{n and t are "basic"} \}$.

RC is the runtime complexity; Q: is this a “natural notion” for rewriting? (I don’t know how one defined natural here).
Derivational complexity has no restriction on the terms,
$dc(n) = max \{ dh(t) \mid \textsf{size}(t) \leq n\}$ This has been used mainly to make termination arguments.
See “Termination proofs and the length of derivations” (Hofbauer, Lautemann, 1989).

Definition: Multiset Path Order. A precedence order $>$ induces a multiset path order $>_{mpo}$, (whose definition I’m not going to transcribe, Wikipedia has a definition) It basically says that $s = f(s_1, ..., s_n) > t = g(t_1,...,t_m)$ if $s$ is $>$ all of $t_i$, or $s$ is $\geq$ some $t_i$.

The Hydra Battle- is this terminating? (Dershowitz and Jouannaud designed the TRS as a termination problem, later rectified by Dershowitz): The beast is a finite tree, each leaf corresponds to a head. Hercules chops off heads of the Hydra, but the Hydra regrows:

• If the cut head has a pre-predecessor (grandmother) then the remaining subtree is multiplied by the stage of the game (heads regrowing and multiplying).
• Otherwise, nothing happens (the head doesn’t regrow)

Can show it is terminating by an inductive argument over transfinite numbers. (but complexity is quite bad). See “The Hydra battle and Cicho’s principle” (Moser, 2009).

The RTA list of open problem was mentioned: http://www.win.tue.nl/rtaloop/

TRSes can be represented via a matrix interpretation (details were too fast for me).
These techniques were developed into a fully automated complexity analysis tool for TRSes called tct-trs (modular complexity analysis framework).

Going back to the original goal: can we use this for higher-order programs? Consider the textbook example of reverse defined by a left-fold. Need to do a complexity preserving conversion into a TRS (complexity preserving transformation ensures lower bounds, complexity reflexing ensures upper bounds). First, defunctionalise via a rewrite system into a first-order system. Built a tool tct-hoca which works well for various small higher-order functional programs (with complexity proofs taking up to 1 minute to compute).  See the tools at:

What about “real” programs or “imperative” programs? Use ‘integer transition systems’ for modelling imperative programs, which is integrated into the tools now.

Some related work: Multivariate amortized source analysis (Hoffman, Aehlig, Hofmann, 2012) and a type-based approach “Towards automatic resource bound analysis for OCaml” (Hofmann, Das, Weng, POPL 2017) (note to self: read this!).

A previous challenge from Tobias Nipkow: how do you do this with data structures like splay trees (cf Tarjan)? There is some new work in progress. Build on having sized types so that we can account for tree size. An annotated signature then decorates function types with the size polynomial, e.g. $\texttt{splay} : A \times T_n \xrightarrow{n^3 + 2} T_n$ (recursive function to create a splay tree). A type system for this was shown, with cost annotation judgment (no time to copy down any details).

The strength of the uniform resource analysis approach is its modularity and extensibility (different intermediate languages and complexity problems).
Weaknesses: the extensibility/modularity required some abstraction which weakens the proving power, so adding new kinds of analysis (e.g., constant amortised / logarithmic amortised) requires a lot of work.
Audience question: can I annotate my program with the expected complexity and get a counter-example if its false? (counter-examples, not yet).

(FSCD) Continuation Passing Style for Effect Handlers, Daniel Hillerström, Sam Lindley, Bob Atkey, KC Sivaramakrishnan

Consider two effects: nondeterminism and exceptions, in the drunk toss example (a drunk person tosses a coin and either catches it and gets a bool for head or tails, or drops the coin [failure/exception]).

drunkToss : Toss ! {Choose:Bool; Fail:Zero}
drunkToss = if do Choose
then if do Choose then Heads else Tails
else absurd do Fail

Induces a simple computation tree.

A modular handler, using row polymorphism (with variable r below)

allChoices : a ! {Choose : Bool; r} => List a ! {r}
allChoices = return x |-> [x]
Choose k |-> k true ++ k false

fail : a ! {Fail: Zero; r} => List a ! {r}
fail = return x |-> [x]
Fail k   |-> []

The ! is an annotation explaining the effect operations, k are continuations, r are row variables (universally quantified).  Two possible interpretations:

(1) handle the computation with fail first, then handle that with allChoices
returns the result [[Heads, Tails], []].

(2) handle with allChoices first then fail gives the result [].

So the order of the handling matters. The operational semantics for handlers is based on a substitution in the computation tree for the relevant operations.

How do handlers get compiled? Use CPS! (restricted subset of the lambda calculus, good for implementing control flow). Based on $\lambda^{\rho}_{\textit{eff}}$ (Hillerstrom, Lindley 2016) – lambda calculus with effect handlers (separate syntactic category but contains lambda terms), row polymorphism, and monadic metalanguage style constructs for effectful let binding and ‘return’. The CPS translation is a homomorphism on all the lambda calculus syntax, but continuation passing on the effectful fragment (effectful let and return).  One part of the interpretation for the effectful operation term: $[[\textbf{do} \, l \, V]] = \lambda k . \lambda h . \; h (l \; \langle[[V]], \lambda x . k \; x \; h\rangle)$ (see the paper for the full translation!)

A key part of the translation is to preserve the stack of handlers so that the order is preserved and (the interpretation) of operations can access up and down the stack.
A problem with the translation is that it yields administrative redexes (redundant beta-redexes) and is not tail-recursive. Use an uncurried CRPS where the stack of continuations is explicit (see Materzok and Biernacki 2012), this leads to a new translation which can manipulate the continuation stack directly (see paper). However, this still yields some administrative redexes. Solution: adopt a two-level calculus (Danvy, Nielsen 2033) which has two kinds of continuations: static (translation time) and dynamic (run time), where you can reify/reflect between the two. This let’s you squash out administrative redexes from the runtime to the static part, which can then be beta-reduced immediately at compile time (end up with no static redexes).
The paper proves that this translation preserves the operational semantics.

I was going to ask about doing this typed: this is shown in the paper.

(ICFP) How to Prove Your Calculus Is Decidable: Practical Applications of Second-Order Algebraic Theories and Computation, Makoto Hamana

SOL (Second-Order Laboratory) – tool for analysing confluence and termination of second-order rewrite rules within Haskell. Provides an embedded Haskell DSL (via Template Haskell) for describing rewrite rules. SOL can then automatically check termination and confluence.

Confluence + termination => decidability. Can use this to prove a calculi is decidable.

Based on a higher-order version of Knuth-Bendix critical pair checking using extended HO pattern unification (Libal,Miller 2016).

For example, the theory of monads is decidable: given two well-typed terms s, t consistent of return, bind, and variables, the question  is s = t derivable from the three laws is decidable. How? First, orient the monad laws as rewrite laws. Check confluence (Church-Rosser, CR) and Strong Normalisation (SN)– together these imply their exists unique normal forms, which are then used to prove equivalence (reduce terms to their normal forms). So, how to prove CR and SN?

One way to prove Strong Normalisation is to assign weights and show that reductions reduce weights. Unfortunately, this does not work in the higher-order context of the monad operations. Another standard technique is to, use the “reducibility method” (see Tait,Girard and Lindley,Stark’05). SOL uses the ‘general schema criterion’ which works for general rewrite rules (Blanqui’00,’16) using various syntactic conditions (positivity, accessibility, safe use of recursive calls, metavariables). [sound but obviously not complete due to halting problem]. (Uses also Newman’s Lemma– a terminating TRS is confluent when it is locally confluent).

This tool can be applied to various calculi (see paper, 8 different systems). Sounds really useful when building new calculi. The tool was applied to a database of well know termination/confluence problems and could check 93/189 termination and 82/96 confluence problems.

“Let’s prove your calculus is decidable using SOL” (Makoto is happy to assist!)

(ICFP) A Relational Logic for Higher-Order Programs
Alejandro Aguirre, Gilles Barthe, Marco Gaboardi, Deepak Garg, Pierre-Yves Strub

Relational properties are things like, if X and Y are related by R then F(X) and F(Y) are related by R, or over two properties, e.g., the results are related instead by S.

Relational refinement types: $\Gamma \vdash t_1 \sim t_2 : \{n \mid n_1 = n_2\}$ These can be used to express monotonicity properties, are syntax directed, and exploit structural similarities of code. However, say we want to prove the naturality of take, but formulating the right relational refinement type is difficult because the computations are structurally different.

Start with a basic logic with lambda-terms over simple inductive types, and a separate layer of predicates

RHOL has judgements $\Gamma \mid \Psi \vdash t_1 : \tau_1 \sim t_2 : \tau_2 \mid \phi(r_1, r_2)$ where $\phi$ is a binary predicate on properties of the two terms and $\Psi$ gives assertions about the free variables. A key idea here is that the types and assertions are separate.

An example rule two-sided rule (where we inductively construct terms with the same syntactic constructor on both sides of a relation):

$\dfrac{\Gamma, x_1 : \sigma_1, x_2 : \sigma_2 \mid \Psi, \phi' \vdash t_1 : \tau_1 \sim t_2 : \tau_2 \mid \phi} {\Gamma \mid \Psi \vdash \lambda x_1 : t_1 : \sigma_1 \rightarrow \tau_1 \sim \lambda x_2 . t_2 : \sigma_2 \rightarrow \tau_2 \mid \forall x_1 . \phi' \Rightarrow \phi[r_1 x_1 / r_1, \, r_2 x_2 / r_2]}$.
This seems quite natural to me: abstraction introduces a universal quantification over the constraints attached to the free variables.

Rules can also be one side (where you construct a term on just one side of a relation).
The paper shows that RHOL is as expressive as HOL (by a mutual encoding).

Other relational typing system can be embedded into RHOL: Relational Refinement Types, DCC (Dependency Core Calculus, and RelCost (relational cost).

Q: if HOL and RHOL are equivalent (rather than their just being an embedding of HOL into RHOL say) then why have RHOL? Relational reasoning was gained which was not built into HO. Update: I discussed this further later with one of the authors, and they made the point that people have done relational-style reasoning in HOL but it requires taking a pairing of programs and building all the relational machinery by hand which is very cumbersome. RHOL has all this built in and the one-side and two-side rules let you reason about differently-structured terms much more easily. I can see this now.

(ICFP)
Foundations of Strong Call by Need

Thibaut Balabonski, Pablo Barenbaum, Eduardo Bonelli, Delia Kesner

What does this lambda term reduce to:  $(\lambda x \, y. x \, x) \, (id \, id)$? It depends on call-by-value, call-by-name, call-by-need, or full beta then you get different answers. The call-by-value, call-by-name, and call-by-need strategies are called weak because they do not reduce underneath lambda abstraction. As a consequence, they do not compute normal forms on their own.

Weak call-by-need strategy (the usual strategy outlined in the literature): procrastinate and remember. Beta reduction into a let binding (don’t evaluate the right-hand side of an application). (There was some quite technical reduction work here, but I noted this interesting rule on “bubbling” reductions out by splitting a nested reduction into two: $t[v[u/y]/x] \rightarrow t[u/y][v/x]$).

Strong call-by-need provide the call-by-need flavour of only evaluating needed values once, but crucially computes normal forms (by reducing under a lambda). This strong approach has two properties (conservative) the strong strategy first does whatever the weak strategy will do; and (complete) if there is a beta-normal form it will reach (a representative of) it. This comes by reducing under the lambda at some points. Consequently, this approach only ever duplicates values, not computations.

Q: If you implemented this in Haskell, would it make it faster? I didn’t quite hear the answer, but this would completely change the semantics as Haskell is impure with respect to exceptions and non-termination. It sounded like it would be more useful for something pure like Coq/Agda (where it could provide some performance improvement).

(ICFP) – No-Brainer CPS Conversion, Milo Davis, William Meehan, Olin Shivers

The Fischer/Reynolds algorithm for CPS transformation introduces administrative redexes. Since then, other algorithms have improved on this considerably (Danvy/Filinsky). In this approach, some additional reductions are applied to the term to get a smaller CPS term out. The moral of the story (or the key idea) is to treat the CPS converter like a compiler (use abstract representations, symbol tables).

Rules of the game: make the terms strictly smaller (greedy reductions), should be strongly normalisation, and confluent, keep the original algorithmic complexity of the CPS transforms (linear time).

Approach:
– Variables are the same size, so do beta redexes on variables
– Beta reduce application of a lambda to a lambda $(\lambda x . e) (\lambda y . e')$ only when the number of reference to $x$ in $e$ is less than or equal to 1 (so things don’t explode).
– Eta reduce (??only when there are no references to the variable in the body??) (I think this is what was said, but I think I might have made a mistake here).

Relies on some machinery: doing reference counts (variable use count), abstract representation of continuations (halt, variables, function continuations, and application continuations) rather than going straight to syntax and syntax constructor functions (which can do reductions) to dovetail with this, and explicit environments to be used with function closures.

(ICFP) Compiling to Categories, Conal Elliott

Why overload? Get a common vocabulary, laws for modular reasoning.

But generally doesn’t apply to lambda, variables, and application. So what to do? Eliminate them! Rewrite lambda terms into combinators: const, id, apply, pairing of functions (fork), curry, and keep going…. and you can automate this via a compiler plugin (so you don’t have to write code like this yourself).

This is implemented by the interface (algebra) of a category (for function composition / identities) + Cartesian (product) structure + coproducts + exponents: giving you (bi)Cartesian-closed categories. Implement this as class/interface so you can give different realisations/compilations, e.g., a graph drawing implementation (Conal showed some pretty pictures generated from this).

Another implementation is to generate Verilog (hardware descriptions), or to graphics shader code.

What about compiling to the derivative of a program? Represent this via the type:
newtype D a b = D (a -> b x (a -o b)) i.e., a differentiable function is a function which produces its result and its derivative as a linear map (an implementation was then shown, see paper). You can then compose interpretations, e.g., graph interpretation with derivative interpretation.

Another interpretation: interval analysis (e.g., Interval Double = (Double, Double)). There are more examples in the paper, including constraint solving via SMT by writing functions to Bool which are then compiled directly into SMT constraints (rather than using existing embeddings, e.g., SBV).

Q: what about recursion? Can be done (compiling from a letrec into a fix-point combinator class).

(ICFP) Visitors Unchained, François Pottier

(Humorous tagline from Francois: Objects at ICFP?! Binders again?!)

Manipulating abstract syntax with binding can be a chore: nameplate (boilerplate on names, term coined by James Cheney). Large and easy to get wrong (especially in weakly typed languages, non-dependent languages). This paper is a way of getting rid of nameplate in OCaml via a library and an syntax extension. Supports multiple representations, complex binding, is modular, and relies on just a small amount of code generation.

Several Haskell libraries support this via its support for datatype generic programming. This will use OCaml’s object technology. Annotating a data type with [@@deriving visitors { variety = 'map' }] automatically generates a visitor pattern (a class) for doing a map-operations on the datatype. To use this, create a local object inheriting the visitor class and override one or more methods, e.g., for a syntax tree type, override the visit method for the “add” node in order to implement the rewriting of e+0 -> e.
Visitors are quite versatile: easy to customise behaviour via inheritance which is something you can’t do in FP without a lot of redesign to your algorithms.

Want to traverse syntax with binders, considering three things (from three perspectives or “users” in the talk here):
(1) “End user” describes the structure of an AST, parameterised by types of bound names and free names and using a parametric data type abs whenever a binding is needed (e.g., in a lambda). This should also deriving the map visitor(see above) with option ["BindingForms.map"].
(2) “Binding library” (provided as part of this work) defines the abs type which capture the binding construct, but they leave the scope extrusion function extend as a parameter (extend gets used when you traverse a binder to go inside a scope, e.g., going into the body of a lambda and extending the context by the bound variable).
(3) The last part defines how to represent environments (and lookups) via an overriding a visitor, nominal terms, and scope extrusion. (The work provides some implementations of these part in a library too).

As seen above, the structure of this library is fairly modular. The “end user” has to glue all these things together.
One limitation is that nonlinear patterns (e.g., as in Erlang) can’t be represented (I didn’t immediately see why though).

(ICFP) – Staged Generic Programming, Jeremy Yallop

Advert at the start: if you have an abstract library that is not very high-performance, then the techniques here could be useful to you.

Generic programming a la Scrap Your Boilerplate, for any data type get automatic traversals e.g., listify (matches c) for some constant c, gives you a list of all nodes in a data type matching c; as a programming you don’t need to write listify or matches: they are generated. Requires internally a way to represent types as data and to map data types into a generic form, a collection of traversals over the general form, and generic schemes which plug together the traversals with parameter generic queries. (In Haskell see the syb). However, the overhead of this roughly 20x writing the code manually! This overhead was quite portable: reimplementing Scrap Your Boilerplate into OCaml had about the same amount of slow down compared with manual traversals.

Why this slow? Type comparisons are slow, lots of indirection, and lots of unnecessary applications of applications. (Dominic: to me it seems like it ends up a bit like the slowness involved in dynamically typed languages).

Solution: use staging to optimise the datatype generic implementation. Staging lets you explain which parts of the program can be evaluated early, and which bits need to be evaluated later (due to waiting for inputs)– these are the quoted bits (inside  with the option to jump back into the early part by backquoting ~).

Naively staging SYB keeps the type representation unchanged, but the shallow traversals and recursion schemes can be quoted judiciously such that type comparison, indirect calls, and polymorphism are all eliminated. (Jeremy showed us the code output.. but it has lots of noise and opportunities for improvement).

Instead, use a special fixed-point operator that inlines non-recursive functions and perform (semantics preserving) code transformations as well: e.g., decomposing the generic map operation.

One technique is to allow equations on “partially-static data” e.g., if you have [] @ x @ [] (i.e., static empty list concat dynamic list x concat static empty list []) this still can be statically rewritten into x. Thus apply laws on the static parts as far as possible.
Another technique was to eta reduces matches that have identical branches.
(I think there were a few more thing here). Ends up with much tighter code which performs very close to or even better than the hand-written code.

Q: How does this compare to the Template-your-Boilerplate library for Haskell?
I completely missed the answer as I was looking up this library which I really should be using…!