The New Punched Cards

Over the past few years, XML has become the standard format for storing structured data in files, and for transferring it across networks.

As a mark-up language for documents (e.g. XHTML), I have no serious problems with XML. As a data format, however, it leaves a lot to be desired.

Firstly, when data is stored in a file as XML, typically most of the file is mark-up rather than the data itself. This means, of course, that more disk space than necessary is used up and the data takes longer to load into memory than it otherwise would. This is the almost inevitable result of using a language for data serialization which was originally designed for mark-up.
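
To get a rough feel for the proportion, here is a minimal Python sketch that estimates what fraction of an XML document is mark-up rather than data. The function name markup_fraction and the sample string are my own illustrative choices. Whitespace-only text is counted as mark-up, on the assumption that it is merely indentation, and entity references are ignored, so treat the figure as an approximation.

```python
import xml.etree.ElementTree as ET


def markup_fraction(xml_text: str) -> float:
    """Estimate the fraction of characters in xml_text that are not data."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text node in document order; stripping each
    # one discards indentation, which is assumed not to be data.
    data_chars = sum(len(text.strip()) for text in root.itertext())
    return 1 - data_chars / len(xml_text)


# Hypothetical sample, cut down from the country example in this article.
sample = "<country><name>Belgium</name><capital>Brussels</capital></country>"
print(f"{markup_fraction(sample):.0%} of the sample is mark-up")
```

Running the same function over your own files is a quick way to check whether the claim holds for your data.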

Secondly, a data serialization standard with the same capabilities as XML, but which is much less verbose, has already existed for decades. It is called S-expressions and is used for data input and output in Lisp. Instead of writing

  <countries>
    <country>
      <name>Belgium</name>
      <capital>Brussels</capital>
      <headOfState>
        <king>Albert</king>
      </headOfState>
      <currency>Euro</currency>
      <diallingCode>32</diallingCode>
      <languages>
        <language>Dutch</language>
        <language>French</language>
        <language>German</language>
      </languages>
    </country>
    <country>
      <name>United Kingdom</name>
      <headOfState>
        <queen>Elizabeth</queen>
      </headOfState>
      <capital>London</capital>
      <currency>Pound</currency>
      <diallingCode>44</diallingCode>
      <languages>
        <language>English</language>
        <language>Welsh</language>
        <language>Gaelic</language>
      </languages>
    </country>
  </countries>

with S-expressions, you can write

  (countries (country (name "Belgium")
                      (headOfState (king "Albert"))
                      (capital "Brussels")
                      (currency "Euro")
                      (diallingCode 32)
                      (languages "Dutch" "French" "German"))
             (country (name "United Kingdom")
                      (headOfState (queen "Elizabeth"))
                      (capital "London")
                      (currency "Pound")
                      (diallingCode 44)
                      (languages "English" "Welsh" "Gaelic")))

It should be pointed out that in the S-expressions, "Belgium" and so on are strings. Without the double quotes, they would be symbols. Where numbers are included in data (e.g. the dialling codes above), an S-expression parser would recognize them as such. Lists of items (e.g. the languages) can simply be enumerated. White space outside of strings is used as a separator, and is otherwise ignored. Where necessary, one space suffices, though more are usually added to improve human readability. XML, by contrast, has no type distinctions: everything is a string. Spaces are treated like any other character, so they are significant, which is hardly ever what is wanted. Lists of atomic items require an extra wrapping around each item, or have to be parsed by the application.
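
The reading rules just described are simple enough to sketch in a few lines. What follows is a minimal illustration in Python, not a full Lisp reader: it handles only lists, double-quoted strings, integers and symbols. The names Symbol, tokens and parse are my own assumptions, as is the choice to represent lists as Python lists.

```python
import re


class Symbol(str):
    """An unquoted token such as name, kept distinct from a quoted string."""


OPEN, CLOSE = object(), object()  # sentinels for "(" and ")"

# One token: a parenthesis, a double-quoted string, or a bare atom.
TOKEN = re.compile(r'\s*(?:([()])|"((?:[^"\\]|\\.)*)"|([^\s()"]+))')


def tokens(text):
    """Yield typed tokens; white space outside strings only separates them."""
    pos, end = 0, len(text.rstrip())
    while pos < end:
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        pos = match.end()
        if match.lastindex == 1:            # parenthesis
            yield OPEN if match.group(1) == "(" else CLOSE
        elif match.lastindex == 2:          # quoted string, with \ escapes
            yield re.sub(r"\\(.)", r"\1", match.group(2))
        else:                               # bare atom: integer or symbol
            atom = match.group(3)
            try:
                yield int(atom)
            except ValueError:
                yield Symbol(atom)


def parse(text):
    """Parse one S-expression into nested lists, strs, ints and Symbols."""
    stream = tokens(text)

    def read(token):
        if token is CLOSE:
            raise SyntaxError("unexpected ')'")
        if token is not OPEN:
            return token
        items = []
        for following in stream:
            if following is CLOSE:
                return items
            items.append(read(following))
        raise SyntaxError("missing ')'")

    return read(next(stream))


country = parse('(country (name "Belgium") (diallingCode 32))')
print(country)
```

The parsed result already carries the type distinctions: country[0] is a Symbol, country[1][1] is an ordinary string, and country[2][1] is an integer, with no extra work by the application.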

Elements are terminated by close tags in XML (there is a small concession to bloat avoidance in the form of empty tags), and by close parentheses in S-expressions: one character.

There is no problem in determining the corresponding open tag if a parenthesis-matching editor (such as Emacs) is used to enter the data. In any case, most data is serialized by programs, not by data entry clerks, so syntax-checking is usually not needed but has to be done anyway.

By the way, XML attribute-value pairs, which have no direct equivalent in S-expressions, are just syntactic sugar.

  <foo a="x" b="red">bar</foo>

can be represented as, for example

  (foo #(a "x" b "red") "bar")

if the convention is that if the first item after the tag is a vector, it represents attribute-value pairs.
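
As a sketch of how mechanical this translation is, the following Python function converts an XML element into an S-expression using the vector convention. It is only an illustration under stated assumptions: to_sexp and quote are names I have made up, text content is emitted as a quoted string (since unquoted it would read as a symbol), and whitespace-only text is dropped as indentation.

```python
import xml.etree.ElementTree as ET


def quote(text):
    """Render text as a double-quoted S-expression string."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_sexp(elem):
    """Convert an ElementTree element to S-expression text."""
    parts = [elem.tag]
    if elem.attrib:
        # Attributes become a vector placed directly after the tag.
        pairs = " ".join(f"{name} {quote(value)}"
                         for name, value in elem.attrib.items())
        parts.append(f"#({pairs})")
    if elem.text and elem.text.strip():
        parts.append(quote(elem.text.strip()))
    for child in elem:
        parts.append(to_sexp(child))
        # Text following a child element belongs to the parent.
        if child.tail and child.tail.strip():
            parts.append(quote(child.tail.strip()))
    return "(" + " ".join(parts) + ")"


print(to_sexp(ET.fromstring('<foo a="x" b="red">bar</foo>')))
# prints: (foo #(a "x" b "red") "bar")
```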

One advantage often cited for the use of XML is that it's no longer necessary to write a parser for your data: just plug an XML parser into your system, use XML and Bob's your uncle. This is disingenuous in several ways. Parsers are normally generated automatically from grammars, rather than written by hand, and writing a grammar is no harder than writing an XML DTD. So you need never write a parser, just some BNF statements, if you use a tool such as YACC to generate one. Using XML means that you have to include an XML parser in your system, which means you're locked into XML's high overhead. XML takes significant processor time to write, because it is so verbose, and even more time to read, because its syntax is unnecessarily complex.

For corporate data, XML is usually overkill. Most corporate data is in relational databases, and can be (and often is) serialized perfectly adequately as pipe-separated values. XML cannot be used to serialize data structures more complex than trees, such as graphs, without first transforming them into trees. And, if it's trees you want to serialize, you can do this using S-expressions with a much reduced overhead.
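
For flat, relational rows, pipe-separated values need very little machinery. Here is a minimal Python sketch using the standard csv module with "|" as the delimiter. The helper names dump_psv and load_psv and the sample rows are assumptions for illustration only.

```python
import csv
import io

# Hypothetical rows, as they might come from a relational table.
rows = [("Belgium", "Brussels", 32), ("United Kingdom", "London", 44)]


def dump_psv(records):
    """Serialize an iterable of rows as pipe-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    writer.writerows(records)
    return buffer.getvalue()


def load_psv(text):
    """Read pipe-separated text back into a list of tuples of strings."""
    return [tuple(row) for row in csv.reader(io.StringIO(text), delimiter="|")]


text = dump_psv(rows)
print(text, end="")
print(load_psv(text))
```

Note that every field comes back as a string, so the column types have to be supplied by the database schema. The csv module quotes any field that itself contains a pipe.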

As well as for Lisp data, S-expressions are of course also used as a syntax for Lisp programs. This is a frequently heard objection to the use of Lisp, but despite at least three attempts at giving Lisp a more Algol-like syntax (Lisp 2, CGOL and Dylan), S-expressions are still used for writing Lisp because experienced Lispers appreciate their advantages. Anyway, why should programs have a different syntax from data when programs are data?

This has not escaped the notice of XMLers, who now have a variety of programming languages which use XML as a syntax, ranging from XSLT for manipulating XML through to more general-purpose languages such as Water: "The language is as easy as BASIC and as powerful as LISP." (sic). (Apparently there's a mental block on admitting that Lispers got it right all along; maybe they just can't stand smart-arses. How else could one make sense of the highly proprietary Curl: S-expressions, but with braces instead of parens? But they have it wrong: Lisp can have any syntax it wants, even XML or Curl.)

XML, like Java, is succeeding because of the hype surrounding it, rather than because of its intrinsic merits. It is the new punched cards, and now it is used as the syntax for Ant, the Job Control Language for Java, the new COBOL.

Erik Naggum's thoughts:


This page was last updated on 2007-10-21 at 02:09.

© Copyright Donald Fisk 2006