Quantcast
Channel: Natepod
Viewing all articles
Browse latest Browse all 81

Dataspace 4: The Term-inator

$
0
0

Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

Before we can build such a system, we have to settle on a data model. The logic programming language Prolog gives us a potentially useful universal data model – term structure –  but we don’t get the most out of it unless we express it in Lisp-style S-expressions,  which reveal hidden semantics that even the formal logic and logic programming communities didn’t catch up with until 1989. But S-expressions also introduce ambiguity that wasn’t there in the original Prolog term structure.

There is a very simple way forward from here, and it’s one that I haven’t seen described before. I think it has potential as a fundamental data semantics for building very large or very small distributed systems.

This is an approach to representing term structure in S-expressions, so I’m calling it term expressions. If we wanted to be cheeky, we could call it T-expressions (because it’s one more than S…)

So S-expressions are a way of representing lists as pairs plus a final nil. But they don’t have to have a final nil. Instead, they can end in a dotted pair or dotted list – with a symbol after the dot indicating ‘the rest of the list’. This comes in really useful in variable matching. Eg:

(a b c . D)

is a dotted list. That D might perhaps be a variable, and if we were matching, it might stand for ‘everything after the c’. Which could be a list itself.

Here’s the core idea:

What if, instead of just a nil, or a dot and then a symbol, we allow a S-expression list to end with – or simply be – a third option: an arbitrary logical term?

So perhaps we had a list looking like this:

(a b c / d e f)

That ‘/’ is the magic character that replaces the dot. Anything after a /  is a logical term. It can have arbitrary structure.

Given this, we can represent a Prolog term and a Prolog list differently:

(/ likes john mary)  -> likes(john, mary)
(likes john mary) -> [likes, john, mary]

That’s it. The rest of this series is unpacking what this very simple, dumb syntax-level idea allows us to do. But it’s not just syntax –  like moving from Prolog terms to S-expressions, it opens a rather large new semantic level doorway. It lets us express things that we can’t quite fit into either S-expressions, or Prolog terms (or SQL relations, or JSON objects, or OOP objects, or filesystems – or sets, or Lisp reader macros, or arbitrary data types), and also simplifies dealing with and representing all of these.

At least that’s my hope. And I think there are reasons to expect hope.

In short:

  • The low-level data model is linked list pair structure, as in Lisp. It could be run in a more compressed form, perhaps – like straight sequences of machine words. But we’ll start with normal Lisp pairs as our basic layer. We can look at PicoLisp for inspiration for how to build a full language (now even an OS) on top of this.
  • The serialisation syntax is S-expressions but without the dot. We have parentheses () to open and close lists.
  • (We can also keep the dot if we want, for embedding within standard Lisps. But we  have the option to completely lose the dot for extra simplicity. Sometimes getting rid of a single reserved character can help a lot. Dots are not the best character to reserve, frankly, since they exist inside English text and numbers.)
  • Instead of the dot, we reserve one character to be a ‘term marker’. Initially I was favouring the vertical bar |, but because that’s a reserved character in many Lisps, I’m now going with the backslash / .
  • We probably also need a quoting strategy, and I’m more leaning toward C-style backslash \ for ‘quote the next character’ rather than double-quotes “” (because quotes get mangled, and more importantly, quotes don’t nest well at all, and we want recursive nestability)
  • Though, you know? Looking at the standard keyboard, if I were writing a character-level parser from scratch, and not caring about embedding… I’d probably go with [] for brackets, / for term, and ‘ for quote. Because they’re all on four keys right next to each other, next to Enter, visually distinct, and unshifted.
  • At a low-level data pointer level, we can probably get rid of the / as a symbol and put it as a tag bit in the pointer. But that’s an optimisation to think about later.

 


Viewing all articles
Browse latest Browse all 81

Trending Articles