Add XML based Storage and query backend
Motivation and a bit of theory
Even though FW is not an OAIS-conformant archiving tool (OAIS is not to be confused with OASIS), we already have, should have, or will have several kinds of metadata, such as:
- Descriptive metadata: these go into the web page's head section for improved SEO
- title of the topic
<title>
(in addition to the current reversed breadcrumb path)
- author -
<meta name="author"
- abstract (not used in FW yet, but it would be nice for SEO; the abstract should go into
<meta name="description"
)
- tags, keywords -
<meta name="keywords"
btw, TagsPlugin should be in the core because of SEO. Unfortunately the new TagsPlugin needs SQL, which is too bad
- Administrative / Technical / Representational metadata, like:
- encoding (not really needed once all topics are UTF-8)
- information about the topic content markup (e.g. legacy FW, DITA, TEI, OpenDocument, XHTML...; not needed for now)
- maybe in the future, a flag for hidden topics (like dot files in the shell), for topics that should not be displayed in the WebIndex
- ACL
- creation date
- our modification history metadata (like rlog)
- maybe sometime in the future, GPG-signed topics (so public keys, fingerprints, SHA1 checksums...)
- etc...
- Structural metadata: these are twofold
- first, the part that is not about the "content structure" but about relations to other content, like:
- view-template
- topic-parent
- list of attached files
- list of referenced (linked) topics (known at save time; caching them means faster access and can allow extended functionality)
- list of included topics
- attached form name, the attached form fields and their content
- second, the part that is about the structure of the topic itself (which I'm not talking about here), like:
- section definitions
- annotations - in-content comments like in XWiki (which can be hidden or not)
- footnotes
- cross-referenced sections (e.g. some extended version of ExplicitNumberingPlugin)
- etc. - anything that is about the content of the topic
To deal with the above
metadata (remember, I'm not talking about the topic content) we need to consider two things:
- how to exchange them
- how to store them
Serialization
Converting any internal representation of data into a well-known and accepted format
for the purpose of data exchange between different parts of a system or between different systems.
We can have several serialization plugins for the different data-exchange needs, e.g. JSON for browsers and/or Mongo and several other currently popular applications, or serialization can be done as XML with a well-defined schema. Creating serialization plugins is not very hard when the internal data structure has enough granularity.
But serialization is about data exchange, and the major work will still be done through FW. (That is perfectly OK, but sometimes it is much easier to read the files directly.)
FW currently stores topics in filesystem files (and I really hope it will in the future too!). The storage file format (remember, I'm not talking about macros and TML) is
a self-invented format that was created before XML was even defined. So we have:
%META:TOPICINFO{author="ProjectContributor" date="1233365367" format="1.1" version="1"}%
and so on...
It's wonderful to see how FW invented and uses all the things that are needed for a modern application. It is like an old grandfather: it has experience and it works, but it is not very good-looking and knows nothing about common modern tools like GPS...
But now XML is one of the most common formats for structured information.
<?xml version="1.0" encoding="UTF-8"?>
<fosdoc>
<version>1.0</version>
<format>1.1</format>
<web>System</web>
<topic>TopicName</topic>
<parent>System.SomeTopic</parent>
<createdBy>ProjectContributor</createdBy>
<creationDate>1233365367</creationDate>
<lastModified>1233383000</lastModified>
<title>Topic title goes here</title>
<viewTemplate></viewTemplate>
<content>---+ Heading
* body
* with %MACROS / *TML* comes
* here
</content>
</fosdoc>
Although we could keep inventing our own
%META:SOMETHINGINTHEFUTURE
I'm perfectly sure that with XML we
- will get more possible features
- can easily read, write and create FW's files directly with a wide range of existing XML tools, without needing the FW library or developing our own external parsers.
- will get a more extensible format for plugins (the best example is NatSkin: the plugin invents its own parseable structures instead of simply extending one well-defined XML format)
- easy implementation of some advanced metadata without needing to invent
%META:SOMETHINGINTHEFUTURE
- easy transformations (XSLT)
- similar handling of meta elements and content, e.g. $topic->{field}->{name} and $topic->{content}
- easy integration of external systems - it is enough to know how to read and write XML
- extremely easy form data integration with external systems
- less code to maintain (the whole XML read/write stuff is on CPAN)
Drawbacks:
- as someone mentioned on IRC, the current format is more
grep / sed
friendly than XML
- legacy and backward compatibility
So,
- we can lose
sed
compatibility, but can gain much more...
- we can lose backward compatibility (only at the file level), but the conversion tool from the current format into XML is probably a script of a few lines.
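As a rough illustration of the "few lines" conversion idea, the sketch below turns one legacy META line into an XML fragment using only core Perl. The function names (parse_meta_line, xml_escape, meta_to_xml) and the choice to reuse the legacy attribute names as element names are assumptions made for this example, not part of FW. Any special encoding of characters inside legacy values is ignored here.

```perl
use strict;
use warnings;

# Split a legacy line such as %META:TOPICINFO{author="..." date="..."}%
# into its type and a hash of attributes. Returns an empty list on no match.
sub parse_meta_line {
    my ($line) = @_;
    my ( $type, $args ) = $line =~ /^%?META:(\w+)\{(.*)\}%?\s*$/ or return;
    my %attr;
    while ( $args =~ /(\w+)="([^"]*)"/g ) {
        $attr{$1} = $2;
    }
    return ( $type, \%attr );
}

# Escape the characters that are not allowed in XML element text.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical mapping: the META type becomes a lower-case wrapper element
# and every attribute becomes a child element (sorted for stable output).
sub meta_to_xml {
    my ($line) = @_;
    my ( $type, $attr ) = parse_meta_line($line) or return;
    my $tag = lc $type;
    my $xml = "<$tag>\n";
    for my $name ( sort keys %$attr ) {
        $xml .= "  <$name>" . xml_escape( $attr->{$name} ) . "</$name>\n";
    }
    return $xml . "</$tag>\n";
}

print meta_to_xml(
    '%META:TOPICINFO{author="ProjectContributor" date="1233365367" format="1.1" version="1"}%'
);
```

A real converter would also have to handle the topic text, forms and attachments, and map the legacy names onto whatever schema is finally chosen.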
XML schema definition
Of course, we can "invent" our own XML schema as above.
But, imo, it would be best to implement one of the already standardized XML schemas. There are several. I recommend using
METS XML. METS is primarily an exchange format, but it is usable (and in archival systems often used)
as a data storage format too.
Benefits as a storage format - the same as listed above for XML.
Developing something like a MetsExportPlugin (METS as a serialization format) while keeping the old storage file structure is probably worthwhile only when we want to address how to easily exchange topics between different FW installations. It could help enormously with inter-Foswiki communication, and we would not need to invent a "self-made" exchange format.
Benefits of using METS as (one of) the serialization formats:
- Easy full export/import of topics from/to other Foswikis. METS allows embedding any (base64 encoded) objects directly inside the METS file (e.g. attached images can easily be embedded into one XML file), which means easy exchange of the whole topic without "inventing" our own topic-exchange serialization format.
- METS is XML
- Possibility to insert any other metadata standard into METS - I recommend using Dublin Core for descriptive metadata, like DC:CREATOR and DC:TITLE (or MODS)
- no need to invent "attachments", because "structure linking" is integrated directly into METS
- Marketing buzz (Foswiki knows METS)
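To make the embedding point more concrete, here is a minimal sketch of how an attachment could be wrapped as base64 inside a METS file element, using the core MIME::Base64 module. The function name mets_file_element is hypothetical. The sketch assumes that the mets: prefix is bound to the METS namespace in the enclosing document and that the ID and MIME type values contain no characters needing XML escaping.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Build a <mets:file> element that carries the attachment bytes inline
# as base64 inside <mets:FContent><mets:binData>.
sub mets_file_element {
    my ( $id, $mimetype, $bytes ) = @_;

    # The empty second argument suppresses line breaks in the base64 output.
    my $b64 = encode_base64( $bytes, '' );

    return qq{<mets:file ID="$id" MIMETYPE="$mimetype">\n}
      . qq{  <mets:FContent>\n}
      . qq{    <mets:binData>$b64</mets:binData>\n}
      . qq{  </mets:FContent>\n}
      . qq{</mets:file>\n};
}

# Example with a tiny in-memory "attachment"; a real exporter would
# read the attachment file in binary mode instead.
print mets_file_element( 'ATT1', 'text/plain', "hello\n" );
```

In a complete METS document such elements would sit inside a fileGrp within the fileSec, and the structural map would point at them by ID.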
Even though using METS as a serialization format can be helpful, the main point is still to use XML as the raw storage format.
Impact:
- Store/Store2
- Plugins? (Are there plugins that deal directly with files and don't use Store::*?)
- current FW installations - can be converted 1:1 with a "conversion script" without any problems.
Parts of Meta.pm and several other parts of the source could be replaced with a few lines of code, like:
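use strict;
use warnings;
use XML::Simple;

# $topicfile holds the path to a topic stored in the XML format shown above
my $topic   = XMLin($topicfile);
my $content = $topic->{content};
my $author  = $topic->{createdBy};    # element name used in the sample document
my $form    = $topic->{form};
# do_something() stands in for whatever the caller does with the form
do_something( $form->{name} ) if $form && exists $form->{fieldname};
# and so on...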
See also:
Ps: When it comes to the topic content structure, there are many alternatives too.
- The already mentioned DITA, which is interesting because it slices the document into parts (like sections); for other needs there are other alternatives
- TEI
- NLM (Pubmed)
- or good old DocBook...
- OpenDocument
- etc..
But this is another question (and hard to decide) - all of the above is about a well-defined storage format.
--
JozefMojzis - 22 Dec 2011
We discussed this on IRC, and I appreciate anybody interested in Foswiki XML.
I'm involved with several Foswiki <-> XML information exchange efforts at work; and I think I'm getting tangled up in this (the problem of exchanging information for quite narrow, domain-specific applications - e.g. exchanging the structured data we have in formfields in a way that can be absorbed into some external database), versus what Jozef is talking about - using an XML where the value is simply in standardised metadata (authorship), but the content-mappings (or rather, infrastructure/capability for content mappings) for exchanging
content is left unsolved.
SvenDowideit has done tremendous work making the
serialisation of Foswiki pluggable, with
Foswiki::Serialise.
RestPlugin is an example
Foswiki::Serialise
module that can translate to/from JSON.
But making serialisation pluggable is much easier than allowing the content to have different representations (ontology mappings, etc).
An XML-isation effort should consider
CmisPlugin and
SupportDITA
--
PaulHarvey - 18 Dec 2011
After an initial failure on my part to understand where Jozef was coming from, I finally understood. Writing a new (de)serialiser - even in the current codebase - is well do-able, and I
think is all that would be needed - though some tweaks might be required in the VC store, which might read TOPICINFO for fast-reading purposes. Maybe a day's work? The questions in my mind are: Who wants this? Why do they want it? Why haven't they done it themselves? (and the perennial Is anyone prepared to pay me (or anyone else) to do it? Gotta eat)
--
CrawfordCurrie - 19 Dec 2011
Once the Foswiki storage format is all XML, why should I stop there and
not continue and introduce XQUERY instead of the YASSE (yet another selfmade search engine) we have now (warning rhetorical question / devil's advocate)? As you see: once you enter XML it soon becomes quite a game changer in which direction Foswiki could evolve. This
is possible in theory, best demonstrated by
http://www.marklogic.com.
--
MichaelDaum - 19 Dec 2011
I cleaned up and extended my writing a bit, in the hope of clearing up some misunderstandings caused by my limited ability to express myself...
--
JozefMojzis - 22 Dec 2011
I've made this into a feature request, as it fits well into the post
store2 world.
Personally, I'm more interested in using the
JSON
(de-)serialisation that I've already begun coding in my store2 branch on github, but one of the points of what I'm doing is to enable admins to choose what
mix of storage backends, query engines and serialisations are appropriate for their needs.
There's lots of work to do - lots more unit tests to write to ensure that we break nothing, and to allow the existing TML-encoded text format to continue to work - as it
is good for small simple installations.
And a formalised integration speed benchmark would be good - something that has a variety of web/topic/query densities and complexities so we can show users the pros & cons of the different implementations.
Note also that the text file format is only a small part of the store2 pluggables - database and
NoSQL is another (given that Crawford, Paul and I are working on both SQL and
MongoDB, and I will probably do a hadoop query some time in 2012 - assuming that i have time)
--
SvenDowideit - 23 Dec 2011
Four years since last comment. No committed developer. Changing to a parked proposal.
--
Main.GeorgeClark - 13 Feb 2016 - 19:17