This figure shows the representations commonly used when processing software languages, and the mappings between them.

Megamodel of SLE Artifacts

The left-hand side has flat, sequence-based representations (strings, token sequences) and the right-hand side has tree-based representations (concrete and abstract syntax trees).

The vertical axis indicates abstraction, with the more concrete representations at the bottom. ‘Layout’ in this sense includes whitespace and comments. We could also add extra layers for ‘layoutless with comments’ and ‘layoutless without comments’ (and vice versa).

Mappings with solid arrows have a straightforward universal definition, while mappings with dotted arrows depend on some kind of syntactic information.

The different artifacts are:

Str
Strings; sequences of characters.
Tkl
Sequences of tokens with layout.
Tok
Sequences of tokens without layout.
Ptr
Parse trees / concrete syntax trees with layout.
Cst
Parse trees / concrete syntax trees without layout.
Ast
Abstract syntax trees.
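
The flat artifacts above, and the two kinds of mapping, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Token`, `tokenize`, `strip_layout`, `concat` and `LEXICAL` are hypothetical, and we assume that dropping layout (Tkl to Tok) and concatenation (Tkl to Str) are among the universal mappings, while tokenization (Str to Tkl) is one that needs syntactic information, here a small made-up lexical syntax.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    kind: str  # e.g. "id", "num", "op", or "layout" (hypothetical kinds)
    text: str


Str = str          # sequences of characters
Tkl = List[Token]  # token sequences that may contain "layout" tokens
Tok = List[Token]  # token sequences without "layout" tokens

# A made-up lexical syntax: (token kind, regular expression) pairs.
# Comments are classified as layout here, following the text above.
LEXICAL: List[Tuple[str, str]] = [
    ("layout", r"\s+|#[^\n]*"),
    ("num", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("op", r"[-+*/()=]"),
]


def tokenize(text: Str, lexical_syntax: List[Tuple[str, str]]) -> Tkl:
    """Str -> Tkl. Depends on syntactic information (the lexical syntax)."""
    pattern = re.compile(
        "|".join(f"(?P<{kind}>{rx})" for kind, rx in lexical_syntax)
    )
    tokens: Tkl = []
    pos = 0
    while pos < len(text):
        match = pattern.match(text, pos)
        if match is None or match.end() == pos:
            raise ValueError(f"no token matches at offset {pos}")
        tokens.append(Token(match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def strip_layout(tokens: Tkl) -> Tok:
    """Tkl -> Tok. Universal: needs no knowledge of the language."""
    return [t for t in tokens if t.kind != "layout"]


def concat(tokens: Tkl) -> Str:
    """Tkl -> Str. Universal: concatenate the token texts in order."""
    return "".join(t.text for t in tokens)
```

In this sketch, `concat(tokenize(s, LEXICAL))` gives back `s` for any string the lexical syntax accepts, because layout is kept as ordinary tokens. Going from Tok back to Str would lose that property, since the layout has been discarded.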

The implode mapping from Cst to Ast has sensible universal defaults, but leaves significant room for customising the Ast representation.
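
As an illustration of what such a default might look like, here is a minimal Python sketch of an implode mapping. It is not a definitive definition: we assume that the default keeps each node's label, drops literal tokens such as keywords and punctuation, and turns the remaining tokens into plain strings. The names `CstNode`, `AstNode`, `LITERAL_KINDS` and the `keep` parameter are hypothetical; `keep` stands in for one possible customisation point.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Token:
    kind: str  # e.g. "num", "id", "keyword", "punct" (hypothetical kinds)
    text: str


@dataclass
class CstNode:
    label: str  # name of the production / constructor
    children: List[Union["CstNode", Token]]


@dataclass
class AstNode:
    label: str
    children: List[Union["AstNode", str]]


# Assumption: tokens of these kinds carry no information beyond the node label.
LITERAL_KINDS = {"keyword", "punct"}


def default_keep(tok: Token) -> bool:
    """Default policy: drop literal tokens, keep everything else."""
    return tok.kind not in LITERAL_KINDS


def implode(
    tree: Union[CstNode, Token],
    keep: Callable[[Token], bool] = default_keep,
) -> Union[AstNode, str]:
    """Cst -> Ast. Tokens become strings; nodes keep their label."""
    if isinstance(tree, Token):
        return tree.text
    children = [
        implode(child, keep)
        for child in tree.children
        if not isinstance(child, Token) or keep(child)
    ]
    return AstNode(tree.label, children)
```

For a Cst of `1 + 2` built as `CstNode("Add", [CstNode("Num", [Token("num", "1")]), Token("punct", "+"), CstNode("Num", [Token("num", "2")])])`, the default yields an `Add` node with two `Num` children and no `+`. Passing `keep=lambda tok: True` is one way to customise the result so that the operator token survives.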