Thursday, December 02, 2004

XML, ASN.1, and information encodings

Observation: There are quite a few different syntactic formats for files and protocols that carry primarily text and numbers. For example: XML, ASN.1, the line-oriented name-value protocols commonly used on the internet such as HTTP and SMTP, Windows .ini files, Java .properties files, and others too.

The problem: Mainly, it is simply more that developers have to learn. Each format has subtle nuances, like character encodings and escape sequences, that differentiate it from the others. Also, each usually comes with multiple programming API libraries, and each library is different, with its own quirks and/or bugs (and must be learned). And each format has pros and cons in its own right with regard to readability, verbosity, structurability (i.e. attributes? namespaces? entities?), and availability of a schema language.

My solution: First, let's define three layers: the binary encoding, then what I call the info-set (i.e. the abstract model, which is often alluded to by the API), and then the visual representation. Next, write out some basic definitions of all these formats in terms of these three layers. What we need, and this is important, is a common info-set. The XML info-set (roughly, the model the DOM exposes) is featureful and modern and should be the common one. Now re-express (or "map") the other models into the XML info-set. Then define and implement marshallers and unmarshallers between the various binary layers and the common info-set model. These should also try to record insignificant white space and comments so that they are preserved faithfully in case we need to translate back. A native implementation of the DOM with a target format other than XML is another way to get this done. As for the view layer, it turns out that many of the formats have the same binary and view representation, so we don't have to tackle that additional problem.
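To make the mapping idea concrete, here is a minimal sketch in Python, using only the standard library. It unmarshals two of the formats mentioned above (an .ini file and HTTP-style "Name: value" header lines) into one XML element tree, and marshals the tree back out as .ini text. The element names `ini`, `headers`, `section`, and `entry` and the function names are my own assumptions for illustration, not part of any standard. It also does not attempt the white space and comment preservation described above (`configparser` discards both), and it ignores header line folding.

```python
import configparser
import xml.etree.ElementTree as ET


def ini_to_xml(text):
    """Unmarshal .ini text into an XML tree (hypothetical element names)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original case of keys
    parser.read_string(text)
    root = ET.Element("ini")
    for section in parser.sections():
        sec = ET.SubElement(root, "section", name=section)
        for key, value in parser.items(section):
            entry = ET.SubElement(sec, "entry", name=key)
            entry.text = value
    return root


def headers_to_xml(text):
    """Unmarshal 'Name: value' header lines into the same entry shape."""
    root = ET.Element("headers")
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        entry = ET.SubElement(root, "entry", name=name.strip())
        entry.text = value.strip()
    return root


def xml_to_ini(root):
    """Marshal the common tree back into .ini text."""
    lines = []
    for sec in root.findall("section"):
        lines.append("[" + sec.get("name") + "]")
        for entry in sec.findall("entry"):
            lines.append(entry.get("name") + " = " + (entry.text or ""))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    tree = ini_to_xml("[server]\nhost = example.org\nport = 8080\n")
    print(ET.tostring(tree, encoding="unicode"))
    print(xml_to_ini(tree))
```

Once both formats live in the same tree, the same XML tooling (path queries, transforms, schema validation) applies to either, which is the point of having a common info-set.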

DELETE:
I think what perpetuates this problem is that the textual representation for all of them is identical to their binary encoding. "Hogwash" some of you might say, but seriously, I think this is it.
