[dtrace-discuss] Shared filesystem weirdness

Justin Lloyd jlloyd at digitalglobe.com
Wed Jun 25 11:54:27 PDT 2008


Vlad,

This is exactly what I'd been doing, though I was stumped on what to look for in the fbt trace of lstat64(). However, once I ran sdiff on that trace and one from a system that doesn't have the problem, it pointed to an SNFS issue. Specifically, the first difference in the syscall traces was a call to what seemed to be an SNFS name cache search function. This has helped in dealing with Quantum (who owns SNFS), and based on other testing as well, we now think it may be a known SNFS bug that on rare occasions causes directory corruption. At the very least, you reassured me that I was on the right track in analyzing the problem.

Thanks,
Justin

-----Original Message-----
From: dtrace-discuss-bounces at opensolaris.org [mailto:dtrace-discuss-bounces at opensolaris.org] On Behalf Of Vladimir Marek
Sent: Wednesday, June 25, 2008 12:16 AM
To: dtrace-discuss at opensolaris.org
Subject: Re: [dtrace-discuss] Shared filesystem weirdness

[...]
> On system X, doing a "/bin/ls /foo/bar/duh" (or "cd /foo/bar/duh;
> /bin/ls") lists file f but any command that tries to access the file 
> (e.g. via a stat, open, etc. system call) fails saying file not found.

[...]
> While I'm not looking for a script from anyone, I would appreciate any 
> advice on how to figure out why the kernel (snfs/cvfs driver?) is not 
> able to access the file from system X. Remember that I can use system 
> Y as a control system.

Well, without any filesystem knowledge, I would start by looking at one specific syscall, say 'stat'. First I would find out exactly which syscall it is:

$ dtrace -n 'syscall::stat*:entry{trace(copyinstr(arg0))}'

Then I would record every function executed during the syscall processing, along with the function return values (let's say it's stat64, and you are running 'ls -l /foo/bar/duh/f'):

$ dtrace -x flowindent -n 'syscall::stat64:entry/copyinstr(arg0)=="/foo/bar/duh/f"/{self->go=1}
fbt:::entry/self->go/{}
fbt:::return/self->go/{trace(arg1)}
syscall::stat64:return/self->go/{self->go=0; exit(0)}'

Then compare one run where the syscall succeeded with one where it failed.
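A minimal sketch of that comparison, assuming the output of each run was saved to a file (for example by redirecting dtrace's standard output). The helper names `normalize_trace` and `trace_diff` are hypothetical, not part of DTrace. dtrace prints the CPU id at the start of every probe line, and that id changes between runs and machines, so the sketch strips it before diffing; otherwise nearly every line would be reported as different.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two saved flowindent traces.

# normalize_trace FILE: drop the leading CPU id column that dtrace prints
# on each probe line. Only the leading spaces and digits are removed, so
# the indentation that flowindent adds after the CPU column stays intact.
normalize_trace() {
    sed 's/^ *[0-9][0-9]*//' "$1"
}

# trace_diff GOOD BAD: diff the normalized traces and show the first lines
# of the result. The first hunk marks where the two code paths diverge.
trace_diff() {
    good_norm=$(mktemp) || return 1
    bad_norm=$(mktemp) || { rm -f "$good_norm"; return 1; }
    normalize_trace "$1" > "$good_norm"
    normalize_trace "$2" > "$bad_norm"
    diff "$good_norm" "$bad_norm" | head -n 20
    rm -f "$good_norm" "$bad_norm"
}
```

Usage: `trace_diff good.out bad.out` (file names are placeholders). Return values that are kernel addresses are likely to differ between machines even on identical code paths, so the first differing function entry is usually more telling than the values.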

-- 
	Vlad

