[sam-qfs-discuss] [sam-managers] Disk Archiving File corruption
Prasun Gupta
pgupta at ringling.edu
Tue Dec 4 07:13:53 PST 2007
!!!EUREKA!!!
I have an explanation for what happened, why the file counter reset itself
and the disk archive did not get written to the archive.
We make three copies of data from the file system, each to a different
disk archive. On carefully looking at the log files, I have found
that the file counter starts counting from another copy's counter, and when
it makes this switch, it also does not write the disk archive to the file
system. Hence this ends up generating an inconsistency in the file system.
E.g.
uP6 ----> copy1 (d1/f22)
-----> copy2 (d4/f36)
------------> copy3 (d8/f25)
When the problem happened:
Copy1 took its counter (d8/f26) from copy3's counter. Obviously a bug
in the archiving process.
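
A check like the following might help spot such a switch in the log. This is only a minimal sketch: it assumes the (copy, counter) pairs have already been pulled out of the archiver log in order, and that counters look like "d8/f26" with decimal numbers. The names parse_counter and find_counter_jumps are hypothetical, not part of SAM-QFS.

```python
import re

# Hypothetical counter format "d<number>/f<number>", as in the example above.
COUNTER_RE = re.compile(r"^d(\d+)/f(\d+)$")


def parse_counter(text):
    """Turn 'd8/f26' into the tuple (8, 26)."""
    match = COUNTER_RE.match(text)
    if match is None:
        raise ValueError("unrecognised counter: %r" % text)
    return int(match.group(1)), int(match.group(2))


def find_counter_jumps(entries):
    """Return (copy, previous, current) for every suspicious counter change.

    entries: iterable of (copy_name, counter_text) pairs in log order.
    A change is flagged when a copy's counter moves backwards, or when it
    moves into the directory that another copy is currently using.
    """
    last = {}        # copy name -> last parsed counter
    last_text = {}   # copy name -> last counter as written in the log
    suspicious = []
    for copy_name, text in entries:
        counter = parse_counter(text)
        prev = last.get(copy_name)
        if prev is not None:
            other_dirs = {c[0] for name, c in last.items() if name != copy_name}
            went_back = counter < prev
            took_other = counter[0] != prev[0] and counter[0] in other_dirs
            if went_back or took_other:
                suspicious.append((copy_name, last_text[copy_name], text))
        last[copy_name] = counter
        last_text[copy_name] = text
    return suspicious
```

Fed the three counters from the example followed by copy1's d8/f26, it reports the copy1 change from d1/f22 to d8/f26.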
Has anybody come across this problem, and do you think this issue has been
fixed in patch-03?
I need to think of a workaround for this issue.
--Prasun
-----Original Message-----
From: Prasun Gupta [mailto:pgupta at Ringling.EDU]
Sent: Tuesday, December 04, 2007 9:49 AM
To: Ted.Pogue at Sun.COM
Cc: Prasun Gupta; sam-qfs-discuss at opensolaris.org;
sam-managers at list.ee.ethz.ch
Subject: RE: [sam-qfs-discuss] [sam-managers] Disk Archiving File corruption
This problem was not caused by two independent systems writing to the same
disk volume. There is only one system that writes to it.
The original problem happened because, for some reason, the archiver reset the
counter from d4/f44 to d3/f4 for the file archive, and then it never wrote the
archive files. But the inodes and logs have the files pointing to an archive
file that was never written.
Unfortunately sam-fs does not check the tar file header before staging, which
causes more problems. My current concern is why this happened, and how
I can prevent it or detect it.
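
As one possible detection step, the header at a recorded offset could be checked by hand before trusting a staged file. The sketch below assumes the archive files carry standard ustar-style tar headers (512-byte blocks, "ustar" magic at byte 257, octal checksum at bytes 148-155); whether SAM-QFS disk archives match this exactly is an assumption, and has_valid_tar_header is a hypothetical helper name.

```python
BLOCK = 512  # size of one tar header block


def has_valid_tar_header(path, offset=0):
    """Return True if the 512 bytes at `offset` look like a tar header.

    Checks the 'ustar' magic and the header checksum. The checksum is the
    sum of all header bytes with the checksum field itself counted as
    eight spaces.
    """
    with open(path, "rb") as fh:
        fh.seek(offset)
        header = fh.read(BLOCK)
    if len(header) < BLOCK:
        return False
    if header[257:262] != b"ustar":
        return False
    try:
        stored = int(header[148:156].strip(b" \0"), 8)
    except ValueError:
        return False
    computed = sum(header[:148]) + 8 * 0x20 + sum(header[156:])
    return stored == computed
```

The offset would come from wherever the file's position in the archive is recorded (for example the archiver log); a False result means the data at that position should not be staged as-is.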
Q1) Has anyone seen the archiver suddenly change the file counter it uses
for archiving? Is this counter based on diskvols.seqnum?
Q2) The archiver logs and file inode show the file as successfully archived, but
the archive file was never written. E.g. this problem happened on Oct 28th,
but the files in the archive target diskvols were written on Oct 19th. Normally,
as far as I know, if the archiver writes to a file that already exists, it prints
a warning message and then increments the counter until it finds an empty
archive file slot. Is that correct?
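
The date mismatch described in Q2 suggests a simple cross-check between the log and the disk volume. The following is a rough sketch, assuming the archive file path and timestamp of each log entry have already been extracted (the parsing is left out because the exact log layout is not shown here); check_archive_files and the one-day tolerance are hypothetical choices.

```python
import os
from datetime import datetime, timedelta


def check_archive_files(records, archive_root, tolerance=timedelta(days=1)):
    """Compare logged archive files against what is on the disk volume.

    records: iterable of (relative_path, logged_time) pairs, e.g.
             ("d4/f44", datetime(2007, 10, 28, 12, 0)).
    Returns a list of (relative_path, problem) for files that are missing
    or that were last modified well before the log says they were written.
    """
    problems = []
    for rel_path, logged_time in records:
        full_path = os.path.join(archive_root, rel_path)
        if not os.path.isfile(full_path):
            problems.append((rel_path, "missing"))
            continue
        written = datetime.fromtimestamp(os.path.getmtime(full_path))
        if written < logged_time - tolerance:
            problems.append((rel_path, "older than log entry"))
    return problems
```

An archive file dated Oct 19th for a log entry from Oct 28th would be reported as "older than log entry", which is the symptom described above.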
I applied Patch-03 a few days ago and hope this issue is
resolved, but I would also like a more concrete understanding of what
caused the problem in the first place. Any ideas?
I am concerned that this might happen again, which is what I want to
avoid.
Thanks
Prasun
-----Original Message-----
From: Ted.Pogue at Sun.COM [mailto:Ted.Pogue at Sun.COM]
Sent: Monday, December 03, 2007 8:40 PM
To: Ted Pogue
Cc: Prasun Gupta; sam-qfs-discuss at opensolaris.org;
sam-managers at list.ee.ethz.ch
Subject: Re: [sam-qfs-discuss] [sam-managers] Disk Archiving File corruption
Also, there apparently was an issue with disk archiving corruption that was
fixed in patch 4.6-03. I recommend anyone using disk archiving upgrade
to this patch ASAP.
Ted
Ted Pogue wrote:
> Prasun,
> There is a known issue with the stager checking tar headers for disk
> archives in 4.6.
>
> *Synopsis*: Tar header validation skipped
>
>
> We expect the fix to be in the next 4.6 patch, due out early next year.
>
> Ted
>
>
>
> Prasun Gupta wrote:
>
>> We use disk archiving for backing up data. We discovered a bunch of
>> corrupted files on the filesystem. In troubleshooting, I found out
>> that staging an offline file from an archive (tar file) will give
>> back garbage if the file is not part of the archive.
>> This happened to be the case: a few weeks back, archiver logs showed
>> a file was archived to an archive file, but it never went there.
>> Another strange thing was that the archiver had also reset the counter
>> back to a lower numbered file. These archives were never written to the
>> destination.
>>
>> But the inode of the file and the archive were inconsistent with the
>> filesystem.
>>
>> One remedy I know of is to turn on checksums on
>> the file system, but this is going to put a lot of load on the server.
>>
>> Is there another way ?
>>
>> Why does the staging process not check whether the tar archive file
>> contains the file, instead of returning garbage?
>>
>> Has anyone run into this problem ?
>>
>> I am currently checking the archive log records against the actual
>> archive file listings to determine how widespread this problem might
>> be. Is there a better way to determine file archive inconsistency ?
>>
>>
>> Any Help or pointer will be greatly appreciated, Thanks in advance.
>>
>> --Prasun
>>
>>
> _______________________________________________
> sam-qfs-discuss mailing list
> sam-qfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/sam-qfs-discuss
>