Motivation
I've been working on the
PlainFileStoreContrib recently, and there are a number of problems I want to highlight and discuss. I think we need a strategy going forward dealing with these problems.
As most readers will know, the original Foswiki file store is based around RCS. Here's the current picture of the RCS filestore:
Every topic and attachment is stored in RCS. RCS stores meta-data for each revision of the file: the author (cUID), date, version number and a comment with each modification.
The
Foswiki::Store
interface provides the following methods to deal with this meta-data:
-
getVersionInfo
, which reports author, date, version number and comment.
- This is driven from the RCS history, if one exists.
- In the absence of a history for a topic, it falls back to the
META:TOPICINFO
(of which more later)
- In the absence of a history for an attachment, it falls back to trying to read
META:TOPICINFO
from the attachment BUG HERE.
- in either case if
META:TOPICINFO
is not found, it falls back to { author => unknown user, date => file modification date }
-
getApproxRevTime
, which reports an approximate time for the most recent revision.
- This is driven purely by the file date of the most recent revision, ignoring the history BUG HERE
-
getRevisionAtTime
which retrieves the number of the revision at a given time
- this is driven from the dates stored in the history
On top of this we have the META: data stored in the topic. This comprises two meta-data fields.
-
META:TOPICINFO
which stores a date, author and version number for the topic.
-
META:FILEATTACHMENT
which stores a date, author, version number and comment for the attachment.
Clearly there is room for a great deal of confusion here. A topic or attachment can be modified outside Foswiki
without updating the
META:
. However this information is
used - it is presented to the user in the UI, and is used by any consumer of META in extensions
BUG HERE.
Recent work I did to maintain compatibility between the history and the most recent revision has ameliorated the problems somewhat. However problems continue to bite me. I'm fed up with mucking around with this, so I want to be quite radical.
Implementation
- deprecate the
date
, version
, author
and comment
fields in META:TOPICINFO
and META:FILEATTACHMENT
. By this I mean "stop writing them out", and ignore them if they are present in topics.
- remove any knowledge of
META
from the store implementations (i.e remove the fallbacks).
- rewrite
Foswiki::Meta
to populate META:TOPICINFO
and META:FILEATTACHMENT
from Foswiki::Store::getVersionInfo
on topic load.
- record the version of the topic in which a
META:FILEATTACHMENT
was most recently updated. This avoids having to iterate through the history to find the first version of the topic containing that attachment.
Impact
-
getVersionInfo
in the RCS store implementations requires a lookup of RCS, which will have a performance penalty.
- all store implementations will be required to maintain a meta-database with date, version, author and comment information for all revisions.
- topic dates/authors may change in installations where the database is already f**ked
--
Contributors: CrawfordCurrie - 03 Jul 2012
Discussion
Removing any duplication sounds good in my ears. Not so requiring even more interaction with RCS. Reading your proposal first sounded as it would actually reduce the need to call for RCS.
The current situation is that Foswiki's performance significantly slowed down in the recent versions. The difference became even more compelling migrating a client from TWiki-4.0.3 to Foswiki-1.1.4: some simple pages even tripled in execution time. Devel::NYTProf showed that it was due to (a) TablePlugin (there were a hundred TML tables on one page) and (b) getting the maximum number of revisions in the history. The latter is what made me look deeper and as far as I understand the mess every
view
will
always result in at least one fork of
rcs
.
From what I saw, I had the impression that while a lot was done to be more "correct" in terms of doing the version control thing right, this striving for correctness did not come without a significant performance penalty, the same penalty mentioned in Impact (1).
We need to find a way back to address performance again,
maybe by even sacrificing correctness again, but I hope that's not really required.
The most important thing imho to keep in mind is that reading version control information must be treated as an
expensive information to get. Even when one call does only take a few milliseconds, we see these routines being called very frequently for one view. The simple
view
of a page must no longer be depending on the number of revisions of the topic being rendered. We need to get this factor out of the equation. For instance a topic with a thousand revision, no matter what is on it, nearly takes ages to squeeze out to the browser.
We had a lot of store talk but
not resulting in improved performance. Sadly, the contrary is the case.
So my hope with this proposal is to see this addressed too, while
not sacrificing performance by any means just for the matter of YASR (yet another store refactoring).
--
MichaelDaum - 03 Jul 2012
Using RCS to store topics was expedient; but there's a reason most of the rest of the industry has abandoned RCS; it's shit. Foswiki carries a lot of baggage (the fields I want to deprecate)
because it's shit (the fields are in essence a cache of the information stored in RCS, but it's a cache that can drift arbitrarily). A lot of the shit has leaked into the implementation of the core because there was no discipline in the interfaces. Sven and I have done a
lot of work to build that discipline, and yes, there have been performance penalties. In the words of the old proverb "you can't make an omelette without breaking eggs".
At the moment, because of the way the store works (and has to work because of RCS) there is a huge, pointless penalty forced onto every other store implementation. We have to be brave, and stop hampering ourselves with this crap. I've been trying to work around it for too long, and have reached the point where I just can't do any more without such a change.
Note that I said
getVersion
has a performance penalty. That's only because of the current implementation. George has already been investigating ways to accelerate the RCSlite store to avoid that penalty; for example, accelerated reading of the ,v file without requiring a complete rebuild of the version history in memory. Another route is to copy what PlainFileStore will be doing, and maintain a meta-database of revision information (in which case, why bother with RCS any more).
Finally "as far as I understand the mess every
view
will
always result in at least one fork of
rcs=" - no. The code is written to trust the =META:TOPICINFO
unless the
.txt
file is more recent than the
.txt,v
. If that is not the case, there will be no read of the history. So there are two additional
stat
calls (on the .txt and the .txt,v) but that should be all. At least, that's the design; if it does something else then it's a bug and needs to be fixed.
As I've been saying for the last 5 years, performance cannot be the only goal here. We need to start with correctness - which we haven't achieved yet - and
then focus on performance.
--
CrawfordCurrie - 04 Jul 2012
Right now, we have the choice between two
broken RCS implementations: RcsWrap sux in performance for simple pages, RcsLite sux in performance for attachments and topics with lots of revisions.
I think we are on the same page as long as this YASR results in better performance. I don't have shares in RCS either.
If your
PlainFileStoreContrib is there to rescue us all, then let's please add it to the core.
--
MichaelDaum - 04 Jul 2012
It
can rescue us, but not without some further work on storing meta-data (specifically, author and comment) for attachments. the changes described herein are mainly to prevent future confusion regarding the status of META.
--
CrawfordCurrie - 04 Jul 2012
Note that there is an alternative to deprecating META, which is to ensure that any call to
readTopic
fixes up the META fields. This is pretty much what is attempted at the moment. The reason I dislike it is that it effectively a hidden (and unconditional) call to
getVersionInfo
. I would rather make this call explicit, as when the info is actually required.
--
CrawfordCurrie - 07 Jul 2012
The reason for the
recent performance fixes was that the RCS implementations were
not using the META:TOPICINFO, and directly dived into the revision control system. This was way too expensive. The META:TOPICINFO has always been regarded as a kind of cache for the last revision's information.
So why would you like to deprecate META? All of META or just TOPICINFO? Which other sensible locations are there to store meta data about a topic (and about attachments)?
--
MichaelDaum - 07 Jul 2012
Just the named fields in TOPICINFO and FILEATTACHMENT. As explained above, this data is stored in the version control system meta-data. Note that the RCS implementations can use TOPICINFO
so long as it is trustworthy i.e. as long as the cache version ( the
.txt
) of the topic is older or the same date as the
.txt,v
. Otherwise we run into horrendous problems with the wrong information being reported/used.
The reason for the deprecation is to ensure that there is only one source for this information. Currently if you use META:TOPICINFO you will get
different information than if
getVersionInfo
is called.
One of these should be the "golden" source of version information, as we are patently failing to keep them in synch.
--
CrawfordCurrie - 15 Jul 2012
Normally, TOPICINFO
is in sync with the same info stored in RCS. That's why even by default we
have to use META:TOPICINFO no matter what. If not, performance for an RCS store spirals down to unusable
because reading the same information from RCS is far too expensive. That's the reality today. See the analysis and fixes in
Tasks.Item11983.
--
MichaelDaum - 15 Jul 2012
At the end of the day I don't mind so long as they are consistent (return the same information) for all versions. Currently an RCS store can return a different author (for example) for an old version of a topic in
TOPICINFO
versus
getVersionInfo
, depending on what is in the RCS history and what is in META:TOPICINFO. While the store should be self-consistent now (Foswiki 1.2) in the past this has not been the case. By deprecating one of the routes to this info, we reduce the risk of an inconsistency happening again in the future.
Having said that, I don't mind if both mechanisms exist so long as there are unit tests that guarantee their equivalence under all circumstances. However I do
not want to have to rely on TOPICINFO as the master of this information, as it is expensive to retrieve in a non-RCS store where the information is stored out of band.
--
CrawfordCurrie - 15 Jul 2012
it looks like someone broke trunk:
http://trunk.foswiki.org/System/WebChanges
--
SvenDowideit - 15 Jul 2012
That's triggered by topics with an invalid version string in META:TOPICINFO. Stuff like
$Rev ...$
. Foswiki::Store::cleanUpRevID returns zero in that case ... which triggers the assert ... a hidden error being there for about 2 years, maybe even longer.
--
MichaelDaum - 23 Jul 2012
I don't agree with the basic assumption that all META:TOPICINFO and META:FILEATTACHMENT info is bad. In fact relying on the filesystem's file modification times as
PlainFileStoreContrib does, is very fragile and found to be buggy in releases before 2.1.9. File modification times can easily be messed around with by tar, rsync and other tools, even easier than editing a txt file. There is a performance bug in all versions before 2.1.9 in Foswiki::UI::View where
getRevisionHistory()
is called on each click. If a file was detected to be modified out of band will this cause a call into the store's version control system. In case of
RCSStoreContrib each click will them cause a fork of rlog. This is not acceptable, even more as this call to
getRevisionHistory()
is used to please
PatternSkin-only pseudo macros REVARGS and REVTITLE.
This all caused serious performance issues which trigged me to reimplement an RCS based storage, called
RcsFastStoreContrib, which is based on a few assumptions contradicting the ones of this proposal, i.e. that all META:TOPICINFO is bad. Instead
RcsFastStoreContrib trusts the information, at least during view time. If there was an out of band change to a topic or attachment then these are fixed on the next save action.
RcsFastStoreContrib does
not monitor any file modifications during view time as we simply don't have time for that during a view.
I could have agreed to removing
getVersionInfo()
as it is a lot easier to just read META:TOPICINFO. This is already the case within the code of
getVersionInfo()
most of the time in an attempt to reduce calls deeper into the store layers. However this proposal goes far beyond that thus my veto.
--
MichaelDaum - 14 Oct 2024