On persistence and documents

by Basile Starynkevitch. Send comments by email to basile@starynkevitch.net

The audience for this is mostly software developers using Linux. For a more serious example of a persistent system, see my Bismon GPLv3+ system and the draft report about it (skip the first few pages, which are there for H2020 bureaucracy).

We all routinely use persistence, in electronic office documents.

Take for example any reasonably short LibreOffice document (when printed, it gives a few pages). Suppose it is some foo.odt file.
Copy that file elsewhere:

cp foo.odt /tmp/foocopy.odt

Then run the following magic:

cd /tmp
unzip foocopy.odt

Maybe that surprises you, but a LibreOffice document is just a zip archive of a set of (mostly textual) files. IIRC, this is also true of Microsoft Word .docx files.
I did that with a file from my daughter (she routinely uses LibreOffice, while I routinely use LaTeX):

rimski.x86_64 /tmp 8:04 .0 % unzip traduction-chanson-JF-polovtsiennes.odt
Archive: traduction-chanson-JF-polovtsiennes.odt
extracting: mimetype
extracting: Thumbnails/thumbnail.png
inflating: content.xml
inflating: settings.xml
inflating: meta.xml
inflating: styles.xml
inflating: manifest.rdf
creating: Configurations2/popupmenu/
creating: Configurations2/statusbar/
creating: Configurations2/toolbar/
creating: Configurations2/menubar/
creating: Configurations2/floater/
inflating: Configurations2/accelerator/current.xml
creating: Configurations2/images/Bitmaps/
creating: Configurations2/toolpanel/
creating: Configurations2/progressbar/
inflating: META-INF/manifest.xml

As you can observe, a LibreOffice document is just a zip archive of mostly textual files, most of them .xml. The manifest file manifest.rdf is also XML. Take a few moments to look inside these files with a text editor or pager.
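The same inspection can be done from a program. Here is a minimal Python sketch using only the standard zipfile module. The helper names list_members, read_member and make_toy_archive are mine, chosen for illustration, and the toy archive built in memory is a stand-in, not a valid ODT file.

```python
import io
import zipfile


def list_members(archive):
    """Return (name, uncompressed size) pairs for every entry of a zip archive.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


def read_member(archive, name):
    """Return the text of one member (for example 'mimetype'), decoded as UTF-8."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(name).decode("utf-8")


def make_toy_archive():
    """Build a tiny stand-in archive in memory (NOT a valid ODT document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", "<doc><p>hello</p></doc>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    toy = make_toy_archive()
    for name, size in list_members(toy):
        print(f"{size:8d}  {name}")
    print(read_member(toy, "content.xml"))
```

With a real document you would pass its path instead, for example list_members("/tmp/foocopy.odt").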

You all know the principles of XML, even if the details are complex. You can look into these XML files with any editor, since XML is a textual format. You won't understand the details, but you should get a valid intuition. FWIW, the OpenDocument and OOXML specifications are quite complex in their details (the worst being, as usual, the Microsoft-promoted OOXML). But documents are represented (inside your computer) as graphs. Intuitively, if you were given the task of making a word processor from scratch, the first thing to think about would be the fact that a document is a graph, and that the word processor is capable of changing it and displaying it nicely.
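To make that intuition concrete, here is a small Python sketch that parses a piece of XML with the standard xml.etree.ElementTree module and lists the edges of the resulting structure. XML nesting alone gives a tree, which is the simplest kind of graph. The SAMPLE text and the function name edges are invented for illustration. A real content.xml uses XML namespaces, so its tags would appear in the longer {namespace}name form.

```python
import xml.etree.ElementTree as ET

# A made-up document, far simpler than any real OpenDocument content.xml.
SAMPLE = "<doc><section><p/><p/></section><note/></doc>"


def edges(xml_text):
    """Return one (parent_tag, child_tag) pair per nesting relation."""
    root = ET.fromstring(xml_text)
    result = []
    # iter() visits the root and all its descendants in document order.
    for node in root.iter():
        for child in node:
            result.append((node.tag, child.tag))
    return result


if __name__ == "__main__":
    for parent, child in edges(SAMPLE):
        print(f"{parent} -> {child}")
```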

Now, think of it differently. My Bismon is doing something conceptually very similar.
A persisted heap can be intuitively viewed as a "document" (without even claiming to define what a document can be), because documents are, like persistent heaps, some kind of in-memory graph (and that intuition is the key to understanding HTML, XML or SGML). They have nodes, and some of them are shared, with several edges reaching them.

In Bismon, it is conceptually similar. We persist the graph of the persistent heap (you could call that graph a "document", but I won't call it this way) in textual format. For performance reasons, we don't like XML (it is too slow to parse in practice), but the principles are still the same. We prefer our own, hand-crafted, textual format. We do care a lot about parsing time.
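To illustrate the principle only, here is a toy Python sketch that persists a small graph, with a shared node and a cycle, into a line-oriented textual format and loads it back. This is emphatically not the Bismon format, which is not described here. The Node class, the dump and load functions, and the "id label successor-ids" line layout are all assumptions made up for this example.

```python
class Node:
    """A toy heap object: a label plus outgoing edges to other nodes."""

    def __init__(self, label):
        self.label = label
        self.succ = []


def dump(root):
    """Serialize every node reachable from `root` as lines of text.

    Hypothetical format: one line per node, 'id label succ_id succ_id ...'.
    The root always gets id 0. Labels must not contain whitespace.
    """
    ids = {}     # id(node) -> small integer naming the node in the text
    order = []   # nodes in discovery order (also keeps them alive)
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in ids:
            continue  # already numbered: a shared node or a cycle
        ids[id(node)] = len(order)
        order.append(node)
        stack.extend(reversed(node.succ))
    lines = []
    for node in order:
        fields = [str(ids[id(node)]), node.label]
        fields += [str(ids[id(s)]) for s in node.succ]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


def load(text):
    """Rebuild the graph from `dump` output and return its root (id 0)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # First pass: create every node, so that edges can point anywhere.
    nodes = {int(row[0]): Node(row[1]) for row in rows}
    # Second pass: fill in the edges, restoring sharing and cycles.
    for row in rows:
        nodes[int(row[0])].succ = [nodes[int(s)] for s in row[2:]]
    return nodes[0]


if __name__ == "__main__":
    shared, a, b, root = Node("shared"), Node("a"), Node("b"), Node("root")
    a.succ = [shared]
    b.succ = [shared, root]   # second edge to the shared node, plus a cycle
    root.succ = [a, b]
    text = dump(root)
    print(text, end="")
    copy = load(text)
    # Sharing survives the round trip: both paths reach the same object.
    print(copy.succ[0].succ[0] is copy.succ[1].succ[0])
```

Loading in two passes (first create all the nodes, then connect them) is what lets a shared node stay a single object after reloading, instead of being duplicated once per edge reaching it.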