Item2537: Race condition in updating htpasswd can cause truncated htpasswd file
Priority: Urgent
Current State: Closed
Released In: 1.0.9
Target Release: patch
Applies To: Engine
Component:
Branches:
We have seen in real life the foswiki.org and also the old project .htpasswd file mysteriously being truncated.
I have been looking at the code and it is suddenly obvious why this can happen.
We need a good lock on the file when it is being updated.
What happens is
- One user either registers or changes email address ort password
- Another user does the same
The race is this
- 1. User 1 code opens the .htpasswd file and reads in the entire content
- 2. User 1 code does what it needs to do adding a user or changing password or email address
- 3. User 1 code opens the .htpasswd file in overwrite mode
- 4. User 1 code writes the entire file again
- 5. User 1 code closes file
If user 2 code does the same steps at almost the same time the user 1 code can do two things wrong
- It can read the password file after user 1 code has read it. This will cause the latest registration from user 1 to be overwritten by user 2 code
- It can read the password file in the time between the user 1 code opening and wiping the password code and the file being closed by user 1 code. What user 2 code reads can be anything from an empty file to a partly written file.
The conclusion is that we need to place a file lock on the .htpasswd file from the minute user code decides to read the .htpasswd file till it is written and closed. The only safe way to do this is using a semaphore file as placing a flock on .htpasswd will only minimize but not eliminate the race condition.
I will attempt to implement this.
I will
- Use a lock file
working/htpasswd.lock
- Use standard flock on this lock file
- As defensive programming I will let the code that checks the lockfile assume that a lockfile older than a few seconds is garbage and remove it before creating a new. I will let the code assume that a lockfile which is older than maybe 10 or 30 seconds is left behind by crashed code. This should prevent an installation from hanging.
I have not decided if I want to go all the way and protect all reading. In principle we should but what I want to target is the dataloss more than the casual failure that reading an email address or failed authentication can cause. I would not want the checking for lock files to be part of normal operation and slow Foswiki. I just want the locks to be used for changing content in htpasswd. Ie registrations, email changes and password changes.
--
KennethLavrsen - 29 Dec 2009
I was not sure the timeout would work satisfactory on windows. I could just create a new race condition.
The alternative is to loop in a
my $timeout = 10;
while(! flock($FH, LOCK_EX|LOCK_NB) ) {
sleep 1;
if(--$timeout < 0) {
barf;
}
}
but then what? Give up? Or go ahead?
I think the way I have implemented it will be relatively safe as the filelocks should be wiped when the file handler is declared as a local variable. Files should close and unlock when the file handler variable goes out of scope.
At least that is what I think I have read. Please review and feel free to suggest improvements.
Note that there is still a race condition possible if someone reads during the htpasswd update but the consequences is a "What was that?" and then you try again. So I think we can live with this. Otherwise the more correct way is to also lock when reading or put some intelligence in the lockfile so when just doing reading you read the content of the lock file. Doing exclusive locking on htpasswd during read will probably have a performance hit as it is done often. I think it is not needed as we have not see any reports that we have a problem with the features that just read. The issue was the severe dataloss when htpasswd ended up toasted.
I believe I have fixed this part.
--
KennethLavrsen - 29 Dec 2009
Good initiative, Kenneth!
I don't know if it will work on windows, but we could use shared locks for reading and exclusive locks for writing. And maybe the lock file is not needed. I'll make an experiment and post the patch here for analysis.
--
GilmarSantosJr - 29 Dec 2009
A lock file (created atomically with open(filename, O_CREAT|O_EXCL)) can be used as an exclusive lock if shared locks are not provided by the OS, but it harms performance a lot since only one process is allowed to read at a time.
Using
flock
we can have many simultaneous readers holding shared locks and one writer (and only one writer. no readers) at a time holding an exclusive lock. No lock files needed. According to the perl documentation on
flock
, we can't assume shared locks are fully portable, so if
EINVAL
is received,
_lockPasswd()
silently returns (and we get the old buggy behavior). I coded this way cause I don't know in which systems the shared locks are not available.
Please review the
attached patch and comment. I intentionally don't reuse the file handle created by
_lockPasswd
cause
truncate
is not universally available (it may be needed on writes whose new size is smaller than the previous) and locks are advisory, so the code is supposed to work (
hopefully we won't reach maximum open file descriptors limit).
--
GilmarSantosJr - 29 Dec 2009
What is your patch relative to? I cannot see it reverts my codelines so I am a bit confused.
I have been reading a lot about the subject and all the experts claim that you cannot flock a file safely that multiple programs can read and write access. The reason is that to do the flock you must open the file first. And in the small window between open and flock another process can also access the file.
When I study the code for CGI::Session and some CPAN libs handling similar things they all use a lockfile and use the flock on that instead.
Foswiki today has become unacceptably slow. I fear doing the "right thing" when reading htaccess is just going to make it slower. We should not do it if it has ANY cost to performance.
My fix will only slow down the situations: setting email address, setting password, resetting password and user registration. This happens very rarely and even 0.5 seconds extra time is not a problem. This happens a few times per day and already takes many seconds because emails have to be sent also. But we must avoid making normal reading any slower than it is. Measuring with ab does not tell the truth. No matter how much we test we always come to the conclusion that this new code does not make a measurable difference. But in real life it does. Relative to Cairo we are now 2-3 times slower, and my users are beginning to complain.
For reference about the use of extra lockfile see
http://interglacial.com/tpj/23/ which has an excellent walkthrough of the problem.
I see in your patch
+ flock( $this->{_flock}{fh}, LOCK_UN );
+ close( $this->{_flock}{fh} );
I have read several places that you create a new race condition by unlocking before close. The close unlocks. The LOCK_UN should under normal use never be used.
Another thing
I can see in e.g. setPassword
try {
+ $this->_lockPasswd(LOCK_SH);
my $db = $this->_readPasswd();
$db->{$login}->{pass} = $this->encrypt( $login, $newUserPassword, 1 );
$db->{$login}->{emails} ||= '';
+ $this->_lockPasswd(LOCK_EX);
_savePasswd($db);
}
catch Error::Simple with {
If I understand it right the LOCK_SH does not prevent anyone else from also reading.
So this means that when we readPasswd into $db, another process setting a password can be doing the same.
After the reading the first process does the $this->_lockPasswd(LOCK_EX) and writes the new password file. The 2nd process reaches the LOCK_EX and waits till the first process releases the lock. And then it takes the LOCK_EX and writes the password file.
But the 2nd process read the OLD password file and will add its new password and not include the password entry that was just added. The password file is rewritten from scratch and not appended so anything the first process added is wiped.
Unless I misunderstand how the LOCK_SH works I would say that my LOCK_EX around both the read and the write is the right way anytime we want to write in the htpasswd file. It makes sense to LOCK_SH when we read in a context where we do not use the read data to write back. But only if it can be done with no performance impact.
Locking while reading means that we ALWAYS lock each time we view a topic. The sub fetchUsers is run once each time we look at a topic. I cannot understand why we need to read .htpasswd from the Foswiki code. We have a valid session. It should not be needed at all. But it is. And this is why I do not believe it is a good idea to flock for each read. Even if there is a small risk that reading while the .htpasswd is being written can cause some hick-up.
--
KennethLavrsen - 29 Dec 2009
I branched the code just before your changes to work on it (that's why the patch doesn't revert your code)
About the "two writers" race condition, you're absolutely right. I coded it initially with LOCK_EX, but then I downgraded to LOCK_SH to not deny access to true readers at that window... but it's wrong. It should be LOCK_EX from the beginning. On the LOCK_UN before closing.. I see no problem, but I agree it's better to only close the file.
On the semaphore file, the article states:
Consider when one instance is updating counter.dat just as another new instance is about to read it:
Instance 1 Instance 2
----------------- -----------------
open COUNTER, ">counter.dat"
or die "Can't write-open: $!";
open COUNTER, "<counter.dat"
or die "Can't read-open: $!";
flock COUNTER, LOCK_EX;
my $hits = <COUNTER>;
close(COUNTER);
flock COUNTER, LOCK_EX;
There, the OS dutifully kept two instances at once from having an exclusive lock on the file. But the locking is too late, because instance 1, just by opening the file, has already overwritten counter.dat with a zero-length file, just as instance 2 was about to read it.
Pay attention to the mode that
Instance 1
opens the file. My patch opens the file with "+<", i.e. read-write access and the file is not modified. The update occurs after another
open
, when the lock is already held.
The acquire of a LOCK_SH fails only if a LOCK_EX is in place. In such situation the reader should not read the file and because of this I also locked the file for reading. Besides, the writer should not update the file while it's being read by another instance. Since updates are rare, there is no performance loss.
Since the readers should not read the file while it's being written, they should use locks. Since it's not possible to have shared locks with semaphore files, I decided to use flock directly on the
.htpasswd
.
--
GilmarSantosJr - 29 Dec 2009
Patch updated. Please review and comment.
--
GilmarSantosJr - 30 Dec 2009
I will have a look in a not too distant future. There is a new year party where we are the hosts so I am not sure if I get time the 31st. Then it will be the 1st when I get sober
--
KennethLavrsen - 31 Dec 2009
Got caught in the
BlackListPlugin work and the other core bugs which took all of my weekend. Will look at this tomorrow. From first glance it looks great.
I still need to convince myself that locking the file itself without semaphore does not leave a microscopic window open.
--
KennethLavrsen - 04 Jan 2010
I tried to apply the patch. It is a bit difficult when you do not provide a patch relative to the CURRENT svn and not even to Release01x00 branch which is all I am concerned about right now. My main objective is to get a 1.0.9 released now. Experimental code does not belong in release branch.
With your patch I get
Software error:
close() on unopened filehandle $__ANONIO__ at /var/www/Release01x00/core/lib/Foswiki/Users/HtPasswdUser.pm line 147.
[Tue Jan 5 00:30:27 2010] view: close() on unopened filehandle $__ANONIO__ at /var/www/Release01x00/core/lib/Foswiki/Users/HtPasswdUser.pm line 147.
at /usr/lib/perl5/5.8.8/CGI/Carp.pm line 314
CGI::Carp::realdie('[Tue Jan 5 00:30:27 2010] view: close() on unopened filehand...') called at /usr/lib/perl5/5.8.8/CGI/Carp.pm line 400
CGI::Carp::die('close() on unopened filehandle $__ANONIO__ at /var/www/Releas...') called at /var/www/Release01x00/core/lib/Foswiki.pm line 157
Foswiki::__ANON__('close() on unopened filehandle $__ANONIO__ at /var/www/Releas...') called at /var/www/Release01x00/core/lib/Foswiki/Users/HtPasswdUser.pm line 147
Foswiki::Users::HtPasswdUser::_unlockPasswd('Foswiki::Users::HtPasswdUser=HASH(0xa004a60)') called at /var/www/Release01x00/core/lib/Foswiki/Users/HtPasswdUser.pm line 307
Foswiki::Users::HtPasswdUser::__ANON__() called at /usr/lib/perl5/vendor_perl/5.8.8/Error.pm line 429
Error::subs::try('CODE(0xa16cc68)', 'HASH(0xa07be50)') called at /var/www/Release01x00/core/lib/Foswiki/Users/HtPasswdUser.pm line 308
Foswiki::Users::HtPasswdUser::fetchPass('Foswiki::Users::HtPasswdUser=HASH(0xa004a60)', 'KennethLavrsen') called at /var/www/Release01x00/core/lib/Foswiki/Users/TopicUserMapping.pm line 260
Foswiki::Users::TopicUserMapping::_userReallyExists('Foswiki::Users::TopicUserMapping=HASH(0x9fca1fc)', 'KennethLavrsen') called at /var/www/Release01x00/core/lib/Foswiki/Users/TopicUserMapping.pm line 173
Foswiki::Users::TopicUserMapping::handlesUser('Foswiki::Users::TopicUserMapping=HASH(0x9fca1fc)', 'undef', 'KennethLavrsen', '') called at /var/www/Release01x00/core/lib/Foswiki/Users.pm line 208
Foswiki::Users::_getMapping('Foswiki::Users=HASH(0x9d281fc)', 'undef', 'KennethLavrsen', 'undef', 1) called at /var/www/Release01x00/core/lib/Foswiki/Users.pm line 422
Foswiki::Users::getCanonicalUserID('Foswiki::Users=HASH(0x9d281fc)', 'KennethLavrsen') called at /var/www/Release01x00/core/lib/Foswiki/Users.pm line 256
Foswiki::Users::initialiseUser('Foswiki::Users=HASH(0x9d281fc)', 'KennethLavrsen') called at /var/www/Release01x00/core/lib/Foswiki.pm line 1572
Foswiki::new('Foswiki', 'undef', 'Foswiki::Request=HASH(0x9c58ab4)', 'HASH(0x9c49f98)') called at /var/www/Release01x00/core/lib/Foswiki/UI.pm line 293
Foswiki::UI::_execute('Foswiki::Request=HASH(0x9c58ab4)', 'CODE(0x9c58814)', 'view', 1) called at /var/www/Release01x00/core/lib/Foswiki/UI.pm line 275
Foswiki::UI::handleRequest('Foswiki::Request=HASH(0x9c58ab4)') called at /var/www/Release01x00/core/lib/Foswiki/Engine/CGI.pm line 29
Foswiki::Engine::CGI::run('Foswiki::Engine::CGI=HASH(0x9a2a544)') called at /var/www/Release01x00/core/bin/view line 45
For help, please send mail to the webmaster (root@localhost), giving this error message and the time and date of the error.
I fear there are more bugs lurking because you test one something totally different than I do. Please provide a tested patch for release branch based on current SVN if you want the code in 1.0.9
Otherwise I will go with what we have now.
--
KennethLavrsen - 04 Jan 2010
The code works only if
.htpasswd
already exists, so it's really easier to use a semaphore file. I don't have time this week to modify my proposal and to test OS other than Debian Linux.
--
GilmarSantosJr - 05 Jan 2010
For info - the above errors are with .htpasswd existing. It is my normal test server and all I did was view Main.WebHome.
I will go along with my solution for 1.0.9. We can always develop it more for the 1.1 context where we have the time. My fix may not prevent an occational glitch if a user views a topic at the wrong moment while a new user registers or changes password. But it will prevent the .htpasswd file from getting lost and this was from the beginning the main objective.
--
KennethLavrsen - 05 Jan 2010