Feature Proposal: Once and for all, get rid of insane duplication in meta-data

Motivation

I've been working on the PlainFileStoreContrib recently, and there are a number of problems I want to highlight and discuss. I think we need a strategy going forward dealing with these problems.

As most readers will know, the original Foswiki file store is based around RCS. Here's the current picture of the RCS filestore:

Every topic and attachment is stored in RCS. RCS stores meta-data for each revision of the file: the author (cUID), date, version number and a comment with each modification.

The Foswiki::Store interface provides the following methods to deal with this meta-data:
  • getVersionInfo, which reports author, date, version number and comment.
    • This is driven from the RCS history, if one exists.
    • In the absence of a history for a topic, it falls back to the META:TOPICINFO (of which more later)
    • In the absence of a history for an attachment, it falls back to trying to read META:TOPICINFO from the attachment BUG HERE.
    • in either case if META:TOPICINFO is not found, it falls back to { author => unknown user, date => file modification date }
  • getApproxRevTime, which reports an approximate time for the most recent revision.
    • This is driven purely by the file date of the most recent revision, ignoring the history BUG HERE
  • getRevisionAtTime which retrieves the number of the revision at a given time
    • this is driven from the dates stored in the history

On top of this we have the META: data stored in the topic. This comprises two meta-data fields.
  • META:TOPICINFO which stores a date, author and version number for the topic.
  • META:FILEATTACHMENT which stores a date, author, version number and comment for the attachment.

Clearly there is room for a great deal of confusion here. A topic or attachment can be modified outside Foswiki without updating the META:. However this information is used - it is presented to the user in the UI, and is used by any consumer of META in extensions BUG HERE.

Recent work I did to maintain compatibility between the history and the most recent revision has ameliorated the problems somewhat. However problems continue to bite me. I'm fed up with mucking around with this, so I want to be quite radical.

Implementation

  1. deprecate the date, version, author and comment fields in META:TOPICINFO and META:FILEATTACHMENT. By this I mean "stop writing them out", and ignore them if they are present in topics.
  2. remove any knowledge of META from the store implementations (i.e remove the fallbacks).
  3. rewrite Foswiki::Meta to populate META:TOPICINFO and META:FILEATTACHMENT from Foswiki::Store::getVersionInfo on topic load.
  4. record the version of the topic in which a META:FILEATTACHMENT was most recently updated. This avoids having to iterate through the history to find the first version of the topic containing that attachment.

Impact

  1. getVersionInfo in the RCS store implementations requires a lookup of RCS, which will have a performance penalty.
  2. all store implementations will be required to maintain a meta-database with date, version, author and comment information for all revisions.
  3. topic dates/authors may change in installations where the database is already f**ked

-- Contributors: CrawfordCurrie - 03 Jul 2012

Discussion

Removing any duplication sounds good in my ears. Not so requiring even more interaction with RCS. Reading your proposal first sounded as it would actually reduce the need to call for RCS.

The current situation is that Foswiki's performance significantly slowed down in the recent versions. The difference became even more compelling migrating a client from TWiki-4.0.3 to Foswiki-1.1.4: some simple pages even tripled in execution time. Devel::NYTProf showed that it was due to (a) TablePlugin (there were a hundred TML tables on one page) and (b) getting the maximum number of revisions in the history. The latter is what made me look deeper and as far as I understand the mess every view will always result in at least one fork of rcs.

From what I saw, I had the impression that while a lot was done to be more "correct" in terms of doing the version control thing right, this striving for correctness did not come without a significant performance penalty, the same penalty mentioned in Impact (1).

We need to find a way back to address performance again, maybe by even sacrificing correctness again, but I hope that's not really required.

The most important thing imho to keep in mind is that reading version control information must be treated as an expensive information to get. Even when one call does only take a few milliseconds, we see these routines being called very frequently for one view. The simple view of a page must no longer be depending on the number of revisions of the topic being rendered. We need to get this factor out of the equation. For instance a topic with a thousand revision, no matter what is on it, nearly takes ages to squeeze out to the browser.

We had a lot of store talk but not resulting in improved performance. Sadly, the contrary is the case. So my hope with this proposal is to see this addressed too, while not sacrificing performance by any means just for the matter of YASR (yet another store refactoring).

-- MichaelDaum - 03 Jul 2012

Using RCS to store topics was expedient; but there's a reason most of the rest of the industry has abandoned RCS; it's shit. Foswiki carries a lot of baggage (the fields I want to deprecate) because it's shit (the fields are in essence a cache of the information stored in RCS, but it's a cache that can drift arbitrarily). A lot of the shit has leaked into the implementation of the core because there was no discipline in the interfaces. Sven and I have done a lot of work to build that discipline, and yes, there have been performance penalties. In the words of the old proverb "you can't make an omelette without breaking eggs".

At the moment, because of the way the store works (and has to work because of RCS) there is a huge, pointless penalty forced onto every other store implementation. We have to be brave, and stop hampering ourselves with this crap. I've been trying to work around it for too long, and have reached the point where I just can't do any more without such a change.

Note that I said getVersion has a performance penalty. That's only because of the current implementation. George has already been investigating ways to accelerate the RCSlite store to avoid that penalty; for example, accelerated reading of the ,v file without requiring a complete rebuild of the version history in memory. Another route is to copy what PlainFileStore will be doing, and maintain a meta-database of revision information (in which case, why bother with RCS any more).

Finally "as far as I understand the mess every view will always result in at least one fork of rcs=" - no. The code is written to trust the =META:TOPICINFO unless the .txt file is more recent than the .txt,v. If that is not the case, there will be no read of the history. So there are two additional stat calls (on the .txt and the .txt,v) but that should be all. At least, that's the design; if it does something else then it's a bug and needs to be fixed.

As I've been saying for the last 5 years, performance cannot be the only goal here. We need to start with correctness - which we haven't achieved yet - and then focus on performance.

-- CrawfordCurrie - 04 Jul 2012

Right now, we have the choice between two broken RCS implementations: RcsWrap sux in performance for simple pages, RcsLite sux in performance for attachments and topics with lots of revisions.

I think we are on the same page as long as this YASR results in better performance. I don't have shares in RCS either.

If your PlainFileStoreContrib is there to rescue us all, then let's please add it to the core.

-- MichaelDaum - 04 Jul 2012

It can rescue us, but not without some further work on storing meta-data (specifically, author and comment) for attachments. the changes described herein are mainly to prevent future confusion regarding the status of META.

-- CrawfordCurrie - 04 Jul 2012

Note that there is an alternative to deprecating META, which is to ensure that any call to readTopic fixes up the META fields. This is pretty much what is attempted at the moment. The reason I dislike it is that it effectively a hidden (and unconditional) call to getVersionInfo. I would rather make this call explicit, as when the info is actually required.

-- CrawfordCurrie - 07 Jul 2012

The reason for the recent performance fixes was that the RCS implementations were not using the META:TOPICINFO, and directly dived into the revision control system. This was way too expensive. The META:TOPICINFO has always been regarded as a kind of cache for the last revision's information.

So why would you like to deprecate META? All of META or just TOPICINFO? Which other sensible locations are there to store meta data about a topic (and about attachments)?

-- MichaelDaum - 07 Jul 2012

Just the named fields in TOPICINFO and FILEATTACHMENT. As explained above, this data is stored in the version control system meta-data. Note that the RCS implementations can use TOPICINFO so long as it is trustworthy i.e. as long as the cache version ( the .txt) of the topic is older or the same date as the .txt,v. Otherwise we run into horrendous problems with the wrong information being reported/used.

The reason for the deprecation is to ensure that there is only one source for this information. Currently if you use META:TOPICINFO you will get different information than if getVersionInfo is called. One of these should be the "golden" source of version information, as we are patently failing to keep them in synch.

-- CrawfordCurrie - 15 Jul 2012

Normally, TOPICINFO is in sync with the same info stored in RCS. That's why even by default we have to use META:TOPICINFO no matter what. If not, performance for an RCS store spirals down to unusable because reading the same information from RCS is far too expensive. That's the reality today. See the analysis and fixes in Tasks.Item11983.

-- MichaelDaum - 15 Jul 2012

At the end of the day I don't mind so long as they are consistent (return the same information) for all versions. Currently an RCS store can return a different author (for example) for an old version of a topic in TOPICINFO versus getVersionInfo, depending on what is in the RCS history and what is in META:TOPICINFO. While the store should be self-consistent now (Foswiki 1.2) in the past this has not been the case. By deprecating one of the routes to this info, we reduce the risk of an inconsistency happening again in the future.

Having said that, I don't mind if both mechanisms exist so long as there are unit tests that guarantee their equivalence under all circumstances. However I do not want to have to rely on TOPICINFO as the master of this information, as it is expensive to retrieve in a non-RCS store where the information is stored out of band.

-- CrawfordCurrie - 15 Jul 2012

it looks like someone broke trunk:

http://trunk.foswiki.org/System/WebChanges

-- SvenDowideit - 15 Jul 2012

That's triggered by topics with an invalid version string in META:TOPICINFO. Stuff like $Rev ...$. Foswiki::Store::cleanUpRevID returns zero in that case ... which triggers the assert ... a hidden error being there for about 2 years, maybe even longer.

-- MichaelDaum - 23 Jul 2012

I don't agree with the basic assumption that all META:TOPICINFO and META:FILEATTACHMENT info is bad. In fact relying on the filesystem's file modification times as PlainFileStoreContrib does, is very fragile and found to be buggy in releases before 2.1.9. File modification times can easily be messed around with by tar, rsync and other tools, even easier than editing a txt file. There is a performance bug in all versions before 2.1.9 in Foswiki::UI::View where getRevisionHistory() is called on each click. If a file was detected to be modified out of band will this cause a call into the store's version control system. In case of RCSStoreContrib each click will them cause a fork of rlog. This is not acceptable, even more as this call to getRevisionHistory() is used to please PatternSkin-only pseudo macros REVARGS and REVTITLE.

This all caused serious performance issues which trigged me to reimplement an RCS based storage, called RcsFastStoreContrib, which is based on a few assumptions contradicting the ones of this proposal, i.e. that all META:TOPICINFO is bad. Instead RcsFastStoreContrib trusts the information, at least during view time. If there was an out of band change to a topic or attachment then these are fixed on the next save action. RcsFastStoreContrib does not monitor any file modifications during view time as we simply don't have time for that during a view.

I could have agreed to removing getVersionInfo() as it is a lot easier to just read META:TOPICINFO. This is already the case within the code of getVersionInfo() most of the time in an attempt to reduce calls deeper into the store layers. However this proposal goes far beyond that thus my veto.

-- MichaelDaum - 14 Oct 2024
 
Topic revision: r13 - 14 Oct 2024, MichaelDaum
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License    Legal Imprint    Privacy Policy