In the proposal for Morla we motivated the need for morphable programming language. In this paper we present a core language for Morla. The aim is to lay the fundation for such a language. This means we provide basic concepts for dealing with form and content. Additionally, we show how a language processing stack can be established with this language using a module concept. This additionally enables us to create new languages on the basis of existing languages by extending or modifying them ("morphing" a language).
Language processing is about how to specify form for some input and how to interpret this form. The form of this input is specified by structures. Structures are interpreted by maps.
A structure is defined as follows (using a variant of BNF and regular expressions for patterns):
structure = "@@structure " extname "@" extname ":" spec+ eoi extname = "[a-zA-Z_][a-zA-Z_0-9]*" spec = (extname ["?" | "+" | "*"]) | ("\"" "(\\\"|[^\"])*" "\"") | "\s" eoi = "@@\n?\r?"
E.g. the definition for the main form of a structure would be:
@@structure structure@main: "@@structure " extname "@" extname ":" spec+ eoi@@
The first non-terminal extname designates the name of the structure. The second non-terminal extname is the state (or context), in which the structure is valid. The non-empty sequence of spec non-terminals is the body of the structure. Each spec element in the body can be either a reference to a syntactic class, a pattern or a whitespace character. The specification of a syntactic class is given by the name of the class and a specifier for the cardinality (no specifier - mandatory, ? - optional, + - one or more, * - zero or more). Patterns are regular expressions - as they are supported by Python (they are interpreted with the options DOTALL and MULTILINE).
BNF-like definitions can be easily transformed to such structures. Each alternative in a BNF becomes a structure:
[ name "=" sequence1 "|" ... "|" sequenceN ] = "@@structure" "alternative1" "@" name ":" sequence1 "@@" ... "@@structure" "alternativeN" "@" name ":" sequenceN "@@"
sequence1, ..., sequenceN are sequences of non-terminals or terminals; alternative1, ..., alternativeN are just names for the different alternatives in a BNF. The name of the BNF becomes the state of the structure.
How does the interpreter handle such definitions? It starts in state "main". It tries to match all structures in the given state in their specified order. For matching a structure, the interpreter evaluates each item in the body of a structure:
If an item is successfully matched, the interpreter continues with the next item. If all items were successfully parsed, the structure is considered as matching. Then, the interpreter will call the interpreter of the structure (cf. below). After that the interpreter returns to the previous state.
If an item is not matching, the interpreter will try the next structure. If there is no next structure, it will return to the previous state and try the next structure there or fail if this already was the top-level state.
How is a structure interpreted? The reason for using these structures instead of BNF is that it is easier to associate an interpreter with a structure. For each structure there has to be a function, which is called like the structure. The interpreter will call this function, after it has successfully parsed a structure. For each item in the body of the structure, it passes an argument to the function. The result of the function will be included in the resulting syntax tree. We use Python in the realization. E.g. the interpretation function of the above structure could be:
def structure(literal1, extname1, literal2, extname2, literal3, spec, literal4): return None
Regarding the cardinality of a class specification the arguments are passed as follows:
In the implementation we use the class SyntaxToken to hold the results of the parsing (and the interpretation). The attribute arg may be used to access character data for terminals or the interpretation values for non-terminals.
The interpreter lets you use include code fragments using the core structure:
core = "@@core" "(\\@|[^@])+" eoi
With this structure you can include any kind of (Python) code in the interpreter. E.g. for our example the function becomes included as follows:
@@core def structure(literal1, extname1, literal2, extname2, literal3, spec, literal4): return None @@
In language processing you typically use layers to break down complexity. E.g. in a compiler you may have a layer for lexical analysis, another layer for the syntactical analysis and a finally a layer for the code generation.
Morla supports layers by a module concept. A module is a collection of structures and maps. This means, if a structure or map is evaluated, it is saved in the current module. You can change the current module with the following statement (extname specifies the name of the module):
module = "@@module " extname eoi
Note, that, if a new structure is added to a module, it may replace a structure with the same name and state. This also holds for maps with the same name. The top-level state in which the interpetation is started is always "main".
In order to use these modules, you have to specify them in the call of the Morla interpreter:
python morla.py <Morla program> core <module 1> ... <module n>
The core module, which consists of the predefined structures and maps, should be always first. The sequence of modules indicates, how the program is evaluated. First, the Morla program is fed to the core. Then, its output (more on that below) is evaluated by module 1. This is repeated for all remaining modules.
You may structure your language processing hierarchy using the include instruction:
include = @@include " path eoi path = "[^@]*"
This instruction includes the specified Morla program in the current program, which is indicated by the path (a relative path relates to the directory of the current evaluated program).
Each module, which is processed, receives the output of the previous module (or for the core module the actual program text).
The input is a sequence of tokens. Currently, either unicode character or instance of the class TaggedToken are supported. The former can be matched with patterns (cf. above). The latter can be matched with unspecified (abstract) syntactic class. E.g. if you have a structure like that:
For Number no structure has to be specified. If the structure is evaluated, it is checked, whether a TaggedToken with the tag Number is at the current position of the stream. If the match is successful, its argument (cf. below) will be included in the call to the map of the structure.
The output of a module is a sequence of tokens. For the core module, this is sequence of all characters, which were not interpreted. For all other modules, currently only sequences of tagged tokens are supported. To place a tagged token in the output stream just return an instance of TaggedToken. E.g.:
@@core def someMap(lit): return TaggenToken("myTag", lit.arg) @@
The interpreter is called as follows:
python morla.py <Morla program> core <module 1> ... <module n>
What have we achieved so far? We have established a fundamental mechanism to handle form that is as potent as BNF but is easier to interpret. Additionally, the language also provided a means to handle interpretation. Last, but not least the integration of these features enabled us to have a language that is dynamically able to change its nature. It is a morphable language.
The module concept enables us to build up language processing stacks, which is essential for designing complex languages. Together with this it is also possible to modify or extend a language.
While these concepts enable us to do language processing in principle, at least the following practical issues remain:
In order to become more language independent one of the next steps would be to provide a language, which is sophisticated enough, so that the maps can be expressed in it.
© 2008-2009 by Gil Müller (www.gil-mueller.com)