[sam-qfs-discuss] [sam-managers] Re: Disk Archiving File corruption

Ted Pogue Ted.Pogue at Sun.COM
Tue Dec 4 08:12:35 PST 2007


Prasun,
    As Mike said, you should really get a support case open to get proper
assistance.
Ted


Prasun Gupta wrote:
> More Information:
>
> The archiver did not just take the counter from another copy's target
> (diskvolume); it actually wrote the archive to that different disk
> volume.
> The archiver rerouted the path to another copy's destination target.
>
> Sam-fs comes up with surprises all the time.
>
> To recover from this issue, does it make sense to manually copy the
> archive file to the correct location? This would make the pointers correct.
> In the case where an archive file (an existing star archive) is already
> there, can I merge the two archive files into a single superset archive
> file? Will this work?
>
> I am not sure how the stager process gets the file from the archive. Does it
> just use a blind offset, or does it request a file by name from the
> archive? If it uses an offset into the archive, then a manually
> made archive file will not have the correct offsets.
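
The offset concern can be illustrated with a small sketch using Python's standard tarfile module. It says nothing about how the SAM-QFS stager actually locates files, which this thread does not confirm; it only shows that the same members end up at different header offsets when an archive is rebuilt in a different order. The helper names member_offsets and write_tar are hypothetical.

```python
import io
import tarfile


def member_offsets(tar_path):
    """Map each member name to the byte offset of its header in the tar file."""
    with tarfile.open(tar_path, "r:") as tar:
        return {m.name: m.offset for m in tar.getmembers()}


def write_tar(tar_path, files):
    """Write (name, payload) pairs to an uncompressed tar, in the given order."""
    with tarfile.open(tar_path, "w") as tar:
        for name, payload in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
```

Writing the same two files in opposite order to two archives and comparing member_offsets() for each shows different offsets for every member, so any pointer recorded as a raw offset would no longer line up with a hand-merged archive.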
>
>
> Thanks
> --Prasun
>
> -----Original Message-----
> From: Prasun Gupta [mailto:pgupta at Ringling.EDU] 
> Sent: Tuesday, December 04, 2007 10:14 AM
> To: 'Prasun Gupta'; Ted.Pogue at Sun.COM
> Cc: sam-qfs-discuss at opensolaris.org; sam-managers at list.ee.ethz.ch
> Subject: RE: [sam-qfs-discuss] [sam-managers] Disk Archiving File corruption
>
> !!!EUREKA!!!
>
> I have an explanation for what happened: why the file counter reset itself
> and why the disk archive was never written.
>
> We make three copies of data from the file system, each to a different
> disk archive. On carefully looking at the log files, I found
> that the file counter starts counting from another copy's counter, and when
> it makes this switch, it also does not write the disk archive to the file
> system. This leaves the file system inconsistent.
>
> E.g.
>
> uP6 ----> copy1 (d1/f22)
>      -----> copy2 (d4/f36)
>   ------------> copy3 (d8/f25)
>
> When the problem happened:
>
> Copy1 took its counter (d8/f26) from copy3's counter. Obviously a bug
> in the archiving process.
>
>
> Has anybody come across this problem, and do you think this issue has been
> fixed in patch-03?
> I need to think of a workaround for this issue.
>
> --Prasun
>
>
>
> -----Original Message-----
> From: Prasun Gupta [mailto:pgupta at Ringling.EDU] 
> Sent: Tuesday, December 04, 2007 9:49 AM
> To: Ted.Pogue at Sun.COM
> Cc: Prasun Gupta; sam-qfs-discuss at opensolaris.org;
> sam-managers at list.ee.ethz.ch
> Subject: RE: [sam-qfs-discuss] [sam-managers] Disk Archiving File corruption
>
> This problem was not caused by two independent systems writing to the same
> disk volume. There is only one system that writes to it.
>
> The original problem happened because, for some reason, the archiver reset the
> file archive counter from d4/f44 to d3/f4, and then it never wrote the
> archive files. But the inodes and logs have the files pointing to an archive
> file that was never written.
>
> Unfortunately, sam-fs does not check the tar file header before staging, which
> causes more problems. My current concern is: why did this happen, and how
> can I prevent or detect it?
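
As a rough illustration of what a header check before reading file data could look like, here is a minimal sketch in Python. It is not the stager's code. It assumes (hypothetically) that the recorded offset points at the start of a 512-byte tar header, and it uses tarfile.TarInfo.frombuf, which validates the header checksum. The function name header_matches is made up for this example, and names long enough to need GNU or pax extension headers are not handled.

```python
import tarfile

BLOCK = 512  # a tar header occupies one 512-byte block


def header_matches(tar_path, offset, expected_name):
    """Return True if a valid tar header for expected_name starts at offset."""
    try:
        with open(tar_path, "rb") as f:
            f.seek(offset)
            block = f.read(BLOCK)
    except OSError:
        # Archive file missing or unreadable, e.g. it was never written.
        return False
    try:
        # frombuf raises a HeaderError subclass on empty, truncated,
        # all-zero, or checksum-mismatched blocks.
        info = tarfile.TarInfo.frombuf(block, "utf-8", "surrogateescape")
    except tarfile.HeaderError:
        return False
    return info.name == expected_name
```

A check like this returns False both when the archive file does not exist and when the bytes at the offset belong to some other file's data, which are the two cases described in this thread.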
>
> Q1) Has anyone come across the archiver suddenly changing the file counter
> used for archiving? Is this counter based on diskvols.seqnum?
>
>
> Q2) The archiver logs and file inode show the file as successfully archived, but
> the archive file was never written. E.g., this problem happened on Oct. 28th,
> and the archive target diskvols has files written on Oct. 19th. My
> understanding is that if the archiver writes to a file that already exists,
> it normally prints a warning message and then increments the counter until
> it finds an empty archive file slot. Is that correct?
>
> I applied Patch-03 a few days ago and can hope this issue is
> resolved, but I would also like a more concrete understanding of what
> caused the problem in the first place. Any ideas?
>
> I am concerned that this might happen again, which is what I want to
> avoid.
>
> Thanks
> Prasun
>
>
> -----Original Message-----
> From: Ted.Pogue at Sun.COM [mailto:Ted.Pogue at Sun.COM] 
> Sent: Monday, December 03, 2007 8:40 PM
> To: Ted Pogue
> Cc: Prasun Gupta; sam-qfs-discuss at opensolaris.org;
> sam-managers at list.ee.ethz.ch
> Subject: Re: [sam-qfs-discuss] [sam-managers] Disk Archiving File corruption
>
> Also, there apparently was an issue with disk archiving corruption that was
> fixed in patch 4.6-03. I recommend that anyone using disk archiving upgrade
> to this patch ASAP.
> Ted
>
> Ted Pogue wrote:
>   
>> Prasun,
>>     There is a known issue with the stager checking tar headers for disk 
>> archives in 4.6.
>>
>> *Synopsis*: Tar header validation skipped
>>
>>
>> We expect the fix to be in the next 4.6 patch, due out early next year.
>>
>> Ted
>>
>>
>>
>> Prasun Gupta wrote:
>>   
>>     
>>> We use disk archiving for backing up data. We discovered a bunch of 
>>> corrupted files on the filesystem. In troubleshooting, I found out 
>>> that staging an offline file from an archive (tar file) will give 
>>> back garbage if the file is not part of the archive.
>>> This happened to be the case here: a few weeks back, the archiver logs 
>>> showed that a file was archived to an archive file, but it never went 
>>> there. Another strange thing was that the archiver had also reset the 
>>> counter back to a lower-numbered file. These archives were never 
>>> written to the destination.
>>>
>>> But the inode of the file and the archive were inconsistent with the 
>>> filesystem.
>>>
>>> One remedy I know of is to turn on checksums on 
>>> the file system, but this is going to put a lot of load on the server.
>>>
>>> Is there another way?
>>>
>>> Why does the staging process not check whether the tar archive file 
>>> contains the file, instead of returning garbage?
>>>
>>> Has anyone run into this problem?
>>>
>>> I am currently checking the archive log records against the actual 
>>> archive file listings to determine how widespread this problem might 
>>> be. Is there a better way to detect file archive inconsistencies?
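
The log-versus-listing comparison described above could be scripted along these lines. This is only a sketch: it assumes the archiver log has already been parsed into (archive_path, member_name) pairs, which is a hypothetical input format and not the actual log layout, and it assumes the disk archive files are plain tar files readable by Python's tarfile module. The name find_missing is made up for this example.

```python
import tarfile


def find_missing(records):
    """Return the (archive_path, member_name) pairs with no matching tar entry."""
    missing = []
    listings = {}  # archive path -> set of member names, or None if unreadable
    for archive_path, member_name in records:
        if archive_path not in listings:
            try:
                with tarfile.open(archive_path, "r:") as tar:
                    listings[archive_path] = set(tar.getnames())
            except (OSError, tarfile.TarError):
                # Archive file absent or not a readable tar file.
                listings[archive_path] = None
        names = listings[archive_path]
        if names is None or member_name not in names:
            missing.append((archive_path, member_name))
    return missing
```

Each archive is listed only once, so the cost is roughly one pass over every archive file plus one set lookup per log record. A record is reported both when its archive file was never written and when the archive exists but does not contain the named file.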
>>>
>>>
>>> Any help or pointers will be greatly appreciated. Thanks in advance.
>>>
>>> --Prasun
>>>
>>>     
>>>       
>> _______________________________________________
>> sam-qfs-discuss mailing list
>> sam-qfs-discuss at opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/sam-qfs-discuss
>>   
>>     
>
>
>
>
>
>
> --
> Unsubscribe mailto:sam-managers-request at list.ee.ethz.ch?subject=unsubscribe
> Help        mailto:sam-managers-request at list.ee.ethz.ch?subject=help
> Archive     http://lists.ee.ethz.ch/sam-managers
>
>   

