### Building terms incrementally: expressivity

A few weeks ago, I showed a very simple model for constructing and storing typed combinators incrementally. By incremental, I meant that type-checking was a sequential and iterative process: instead of giving it a whole term, type-checked at once (the usual, batch approach), you could split it up into pieces and commit each of them individually. Each commit would give you a unique identifier for the piece of term you just provided, and if the user bothers to write down these names, he can then elaborate new terms on top of them, possibly in a non-linear fashion. That way, we can model bottom-up term construction, and incrementally modify a term.

We noted also some defects to this approach, two of which we are going to address today: one is the absence of top-down construction (a very common way of building terms incrementally), the other is the burden of remembering names for successive versions. The interesting part is that we are going to simulate them inside our model, taking advantage of the extensibility of the signature.

Let’s dive into it!

### Versioning in the system

The system we built so far interacts with a user in a very crude fashion: give it a term (mentioning identifiers), it will give you a unique identifier for this term (commit); give it an identifier, it will give you the associated term. If the user actually uses it for versioning, he will need to remember the identifiers standing for successive versions. Let’s see what we can do to alleviate this burden.

We are going to extend our base language by extending the signature, with new constants that will represent lists of versions.

#### A tree of versions

• A base type $\nu$ of versions
• An initial, empty version $O : \nu$
• An infinity of version constructors $S_A : A \to \nu \to \nu$ indexed by a type $A \neq \nu$

These are — just — the constructors for (heterogeneous) lists. They allow us quite directly to record the timeline of successive versions — not of the intermediate commits, which are there only to ensure future sharing, but real, toplevel versions of a term. The bottom of the list represents the empty repository, and a cons cell a version with its ancestor.

Example: Let’s take our example from last time: we want to transform the term $f\ (g\ a)\ (g\ b)$ into $f\ (g\ a)\ (g\ (g\ b))$ (we will omit the type informations now that we are convinced that these terms are ensured to be well-typed). We commit successively:

• $O \leadsto U$ (the 0th version, i.e. the empty repository)
• $f\ (g\ a) \leadsto X$
• $g\ b \leadsto Y$
• $X\ Y \leadsto Z$
• $S_\alpha\ Z\ U \leadsto V$ (first version)
• $X\ (g\ Y) \leadsto T$
• $S_\beta\ T\ V$ (second version, having $V$ as ancestor)

This revised notion of (toplevel) commit — commit intermediate terms bottom-up and when you are done, commit a “version marker” pointing to the term and the last version — allows to e.g.:

• remember and checkout the n-th last commit
• add commit messages (argument to add to the constructor $S_A$)

More than this actually: although we have here a simple list structure, the sharing inherent to our repositories makes it possible to represent trees of versions. A version node having many children is a branching node.

Example: We can commit $S_\alpha\ X\ (g'\ Y)\ U$ as an alternative to our second version. $U$ is then a branch node.

#### A graph of versions

But in this encoding of version history, different branches can never rejoin, or merge, back into a unique history line. For this we would need to revise our choice of constructors: a version needs to have several possible ancestors. Let’s give it another shot: instead of $O$ and $S_A$, we add the following constants to the signature:

• Two base types $\nu$ (versions) and $\mu$ (lists of ancestors)
• An empty list of ancestors $O : \mu$
• A cons cell for lists of ancestors $S : \nu \to \mu \to mu$
• An infinity of constants for versions $V_A : A\to\nu\to\mu$ indexed by a type $A$

With this set of constructors, the structure of versions is a DAG, as it is very common in version control systems (VCS).

Note that we are not talking about actually merging terms here, but just represent the history of versions branching and merging.

Example: Let’s represent the following diamond-like graph of scalar constants: $b\to a, c\to a, d\to b, d\to c$:

$X = V_\alpha\ a\ O; Y = V_\delta\ d\ (S\ (V_\beta\ b\ X)\ (S\ (V_\gamma\ c\ X) O), Y$

#### A popular storage model

What I just described can actually be used to encode the storage model of various version control systems. Mercurial is one of them, Git is also. Git‘s object model can be viewed as a typed database like the one I use here, in the following signature:

• Four base types: $tree, commit, flist, alist$ and a built-in $string$ type
• $blob : string \to obj$ (a file content)
• $nil : flist$ (lists of trees or blobs; hierarchy of files & directories)
• $cons\_blob : string \to blob \to flist \to flist$ (a file with its name)
• $cons\_tree : string \to tree \to flist \to flist$ (a directory with its name)
• $tree : flist \to tree$ (constructors for directory hierarchies)
• $commit : string \to tree \to alist \to commit$ (commit in Git’s sense, with its message, list of ancestors and associated tree)
• $anil : alist$ (no ancestor)
• $acons : commit \to alist \to alist$ (ancestor cons, with its commit object)

We recognize here the structure of our graph of versions ($\mu$ is $alist$, $\nu$ is $commit$).

Git adds and additional constraint, enforcing what can be shared or not: there must be exactly one commit (in our sense: one name) by  application of constants $blob$, $tree$ or $commit$. For example $X=blob\ "hello"$ or $Y=tree\ (cons\_blob\ "README"\ X\ (cons\_tree\ "doc"\ Z\ nil))$ are valid Git objects that can be assigned a name, but neither are $tree\ (cons\_blob\ "README"\ (blob\ "contents")\ nil)$ nor $nil$, respectively because constant $blob$ cannot appear “deep-inside” a commit, and because $nil$ is not an allowed toplevel constant.

In the case of Mercurial, the signature is the same, only the sharing constraint changes: the “mandatory toplevel constants” are $blob$ and $commit$. A whole tree structure with blobs names at its leaf is called a manifest.

One point is missing in this explanation to get to real-world VCS: how does the system choose names for commits? We’ll have the occasion to come back on this issue, probably in my next post. In our very simple setting here, one proposition that makes it close to hash consing is the following.

Up to now, we considered that the user would write the terms with metavariables; actually we would prefer to present him the whole term to modify. When he is done with the modification, the system infers the sharing of subterms. That’s what happens in Git: we checkout a whole working directory, which is then committed back with modifications, inferring sharing of sub-directories.

Names can facilitate the detection of sharing: we pick a hash function (e.g. SHA1), and name each commit by the SHA1 string of its contents. If we want to know whether a term $t$ (the working directory) is already present in the database, we compute the hash of $t$, and query the database for this hash. This way our database is content-adressable.

### Simulating top-down construction

I’ll now explain the uses of top-down construction, and see how we can simulate it in our simple, sharing-based model. I’ll use the same trick as before: add special-purpose constants to the signature and define new operations on top of the existing ones.

#### Why top-down?

In many modern programming languages with an expressive type systems, incremental construction of programs became a necessity. The additional expressiveness makes the task of building well-typed program at once difficult, and iterating the batch type-checker on a whole file on each change too costly, so that there was a need, for usability motivations, for some kind of incrementality.

Let’s take Agda for example. It is a dependently typed programming language, featuring an Emacs interface that allows the user to type some terms, leaving blanks — holes — in the middle. The system then tries to infer the contents of these holes, or, if it can’t, at least their types. The user can then see the type expected for these holes, and fill it up, in turn leaving holes if needed. If the inferred and filled types match, then the system knows that it doesn’t have to recheck the whole file, but just paste the contents of the hole in it. This interaction is top-down: define the head of the term before the leaves. Note that if a change in an Agda file is done outside a hole, then the whole file needs to be rechecked. Agda doesn’t support bottom-up construction!

Let’s turn to Coq, our proof assistant of choice. It features a tactic language to incrementally build terms of a given type (proofs of a given statement). In a first approximation, a tactic is a program taking a type (the expected type of a hole) and returning a term of this type, possibly with new holes that will need to be instantiated, filled up by new tactic applications. Iterating the application of tactic is constructing a term top-down.

#### How top-down?

Paradoxically, top-down construction is usually implemented in a way syntactically very similar to our bottom-up model (see this or that article): The state of the system is coded by a pair of a metavariable environment $R$ and a distinguished (head) metavariable $X$, defined in $R$:

$R ::= \Delta, X$

The environment maps metavariables to types and either their definition (a term), or, if they are (still) undefined, to a hole.

$\Delta : X \mapsto o : A \qquad o ::= []\ |\ t$

Once set, the “toplevel” metavariable cannot be changed, the only provided operation being:

• the instantiation of a metavariable $\Delta[X\mapsto t : A]$ provided $\Delta(X) = [] : A$ and
• the definition of a fresh, uninstantiated one $\Delta[X\mapsto [] : A]$ provided $\Delta(X)$ was undefined before.

To simulate this behaviour in our model is then very easy: just add an infinite set of constants $[]_A : A$ to your signature $\Sigma$, (where $A$ is any type).

• The operation $\Delta[X\mapsto [] : A]$ is simulated by the almost identical $\Delta[X\mapsto []_A : A]$. The side-condition is guaranteed, since in our model, names are always taken fresh by the system!
• The second operation — the instantiation of a metavariable — corresponds to the replacement of the subterm $[]_A$ by its contents, as described in the previous post: let us suppose that the initial state represent a term sliced enough that we can exploit sharing of common subterms (those not affected by the instantiation); we add metavariables so as to reconstruct the spine of changed subterms; finally we change the head (distinguished) metavariable.

An example will be more concrete: let’s instantiate $Z$ to $d\ c$ in the term $a\ (b\ Z)\ (b\ c)$. Before the change, the repository could look like this (we omit the type informations):

$[X=a\ Y\ T; T=b\ c; Y=b\ Z; Z=[]], X$

From this repository, we commit in turn the terms $W=d\ c$, $V=b\ W$ and $U = a\ V\ T$. We could have committed the term $a\ (b\ (d\ c))\ T$ in one operation, but it is always preferable to slice it for future sharing. The final repository looks like the initial one, just augmented with new metavariables and with a new head:

$[X=a\ Y\ T; T=b\ c; Y=b\ Z; Z=[]; W=d\ c; V=b\ W; U=a\ V\ T], U$

Another instance of non-batch mechanism commonly implemented in programming languages is the read-eval-print loop. There is one in OCaml, there is one in Coq (although hidden if you use one of the IDEs). The user types commands which are type-checked (and then evaluated, but that’s not the discussion here). Common commands are:

• bare expressions, in this case the state of the system is not changed (the expression is just checked in the current context),
• definitions or declarations, which add a binding in the current context
• … any state-changing operation. A state is usually defined as a (typing) context, plus some other meta-information.

Often, an undo command is provided to rollback the changes introduced by the last command. This whole mechanism is usually implemented outside the logic of the system, in an ad-hoc manner. Although it is not directly possible to simulate it in our model right now, let’s show that we are not too far. If we can encode the state of the system as a type, then the execution of a command boils down to the instantiation of a metavariable: let’s call the focus a(n other) distinguished metavariable, uninstantiated. Its type is the current state of the system. Then:

• a definition command Definition x := t. boils down to instantiating the current focus to the term let x = t in [] with [] being the new focus, the type of this new hole reflecting the introduction of x in the context.
• the same way, a declaration command Axiom x : A. instantiates the current focus to fun x : A -> []. The new focus [] has a type mentioning x : A.

The types of our languages have been up to now very simple. We see here that we are going to need to refine them so as to make more information from the terms available in types. The switch to seeing our toy type-checker as a metalanguage is closer.

### Conclusion

Here we are: we showed that the model is expressive enough to encode the kind of incrementality we expect from programming languages interface, and the kind of history model we expect from a version control system, all this with type-safety in mind. For this, we extended the signature of our language. In a sense, we considered our type system as a metalanguage on top of our language of choice (e.g. SKI combinators). This idea of exploiting a metalanguage will become crucial next time when we will start the real business: how to encode actual proofs to fit them to incremental type-checking.