From mike.kupfer at sun.com Wed Nov 4 14:11:43 2009
From: mike.kupfer at sun.com (Mike Kupfer)
Date: Wed, 04 Nov 2009 14:11:43 PST
Subject: [fm-discuss] fmd goes into maintenance, snv_126
Message-ID: <145181101.721257372733058.JavaMail.Twebapp@sf-app1>

I updated my notebook (Sony VAIO) to snv_126 yesterday, and this morning I noticed that fmd was going into maintenance. Here's the text from /var/adm/messages:

Nov 4 09:06:28 loiosh svc.startd[7]: [ID 652011 daemon.warning] svc:/system/fmd:default: Method "/usr/lib/fm/fmd/fmd" failed with exit status 1.
Nov 4 09:06:28 loiosh svc.startd[7]: [ID 748625 daemon.error] system/fmd:default failed: transitioned to maintenance (see 'svcs -xv' for details)

When I looked in the service log, it had

[ Nov 4 09:05:27 Enabled. ]
[ Nov 4 09:06:12 Executing start method ("/usr/lib/fm/fmd/fmd"). ]
Assertion failed: comma != NULL, file ../../common/pcibus/did_props.c, line 347, function dev_for_hostbridge
[ Nov 4 09:06:21 Method "start" exited with status 1. ]
[ Nov 4 09:06:21 Executing start method ("/usr/lib/fm/fmd/fmd"). ]
Assertion failed: comma != NULL, file ../../common/pcibus/did_props.c, line 347, function dev_for_hostbridge
[ Nov 4 09:06:24 Method "start" exited with status 1. ]
[ Nov 4 09:06:24 Executing start method ("/usr/lib/fm/fmd/fmd"). ]
Assertion failed: comma != NULL, file ../../common/pcibus/did_props.c, line 347, function dev_for_hostbridge
[ Nov 4 09:06:28 Method "start" exited with status 1. ]

Should I file a bug?
--
This message posted from opensolaris.org

From Gavin.Maltby at Sun.COM Wed Nov 4 14:36:27 2009
From: Gavin.Maltby at Sun.COM (Gavin Maltby)
Date: Thu, 05 Nov 2009 09:36:27 +1100
Subject: [fm-discuss] fmd goes into maintenance, snv_126
In-Reply-To: <145181101.721257372733058.JavaMail.Twebapp@sf-app1>
References: <145181101.721257372733058.JavaMail.Twebapp@sf-app1>
Message-ID: <4AF201EB.1020102@sun.com>

Hi,

Mike Kupfer wrote:
> I updated my notebook (Sony VAIO) to snv_126 yesterday, and this morning I noticed that fmd was going into maintenance. Here's the text from /var/adm/messages:
>
> Nov 4 09:06:28 loiosh svc.startd[7]: [ID 652011 daemon.warning] svc:/system/fmd:default: Method "/usr/lib/fm/fmd/fmd" failed with exit status 1.
> Nov 4 09:06:28 loiosh svc.startd[7]: [ID 748625 daemon.error] system/fmd:default failed: transitioned to maintenance (see 'svcs -xv' for details)
>
> When I looked in the service log, it had
>
> [ Nov 4 09:05:27 Enabled. ]
> [ Nov 4 09:06:12 Executing start method ("/usr/lib/fm/fmd/fmd"). ]
> Assertion failed: comma != NULL, file ../../common/pcibus/did_props.c, line 347, function dev_for_hostbridge
> [ Nov 4 09:06:21 Method "start" exited with status 1. ]
> [ Nov 4 09:06:21 Executing start method ("/usr/lib/fm/fmd/fmd"). ]
> Assertion failed: comma != NULL, file ../../common/pcibus/did_props.c, line 347, function dev_for_hostbridge
> [ Nov 4 09:06:24 Method "start" exited with status 1. ]
> [ Nov 4 09:06:24 Executing start method ("/usr/lib/fm/fmd/fmd"). ]
> Assertion failed: comma != NULL, file ../../common/pcibus/did_props.c, line 347, function dev_for_hostbridge
> [ Nov 4 09:06:28 Method "start" exited with status 1. ]
>
> Should I file a bug?

Yes, please. What was the last working Nevada version installed on this system?

The assertion is not new, so it appears that something has changed in the underlying devfs tree to upset the parsing of it in the enumerator.
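For anyone reading along, here is a minimal sketch of how an assertion such as `comma != NULL` can trip. It assumes the enumerator expects the last component of a devfs path to carry a unit address of the form `name@device,function` and searches that component for a comma. The helper name, the sample paths, and the Python rendering are all illustrative assumptions; this is not the code in did_props.c.

```python
def split_unit_address(devfs_path):
    """Parse the last component of a devfs-style path.

    Hypothetical helper: assumes components look like "pci@0,0".
    Returns (name, device, function), or None when the component has no
    "@addr" part or the address has no comma, instead of asserting.
    """
    last = devfs_path.rstrip("/").rsplit("/", 1)[-1]
    name, at, addr = last.partition("@")
    if not at:
        return None  # no unit address at all, e.g. "/pci"
    dev, comma, func = addr.partition(",")
    if not comma:
        return None  # unit address present but without a comma
    return name, dev, func


# Sample paths are made up for illustration.
for path in ("/pci@0,0", "/pci@1e,600000", "/pci"):
    print(path, "->", split_unit_address(path))
```

A parser that asserts on the comma instead of returning a "no address" result would abort on the last sample path, which is one way a change in the shape of the device tree could surface as the failure above.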
Could you include prtconf, prtconf -p and ls -lR /devices as attachments in the bug, please.

Are there any core files in /var/fm/fmd?

Cheers

Gavin

From mike.kupfer at sun.com Wed Nov 4 15:26:49 2009
From: mike.kupfer at sun.com (Mike Kupfer)
Date: Wed, 04 Nov 2009 15:26:49 -0800
Subject: [fm-discuss] fmd goes into maintenance, snv_126
In-Reply-To: <4AF201EB.1020102@sun.com>
References: <145181101.721257372733058.JavaMail.Twebapp@sf-app1> <4AF201EB.1020102@sun.com>
Message-ID: <7316.1257377209@sun.com>

>>>>> "Gavin" == Gavin Maltby writes:

Gavin> What was the last working Nevada version installed on this
Gavin> system?

snv_123.

Gavin> Could you include prtconf, prtconf -p and ls -lR /devices as
Gavin> attachments in the bug, please.

Done (CR 6898284).

Gavin> Are there any core files in /var/fm/fmd?

Yes, clustered in sets of 3. The stack traces all appear to be the same (though I didn't look too closely). I attached the most recent core file to the CR.

Let me know if there's anything else you need.

cheers,
mike

From tom at netspot.com.au Wed Nov 4 21:02:20 2009
From: tom at netspot.com.au (Tom Lanyon)
Date: Thu, 5 Nov 2009 15:32:20 +1030
Subject: [fm-discuss] Unable to load disk-monitor plugin / how to change SES indicators?
Message-ID:

Hi all,

I'm trying to discover two things regarding my test system which has a bunch of SATA disks attached to a SAS expander:

* if I have a drive error, how do I know which cXtYdZ logical device maps to which physical disk/bay?

* how can I read temperature information from the drives?

There seems to be some work done in this area by Eric Schrock and Rob Johnston[1], which has led me to the disk-monitor FMA plugin. I am assuming that this plugin will automatically handle temperature monitoring and lighting the fault/locate LEDs but am not entirely sure of this.
I attempted to load the module but received:

# fmadm load /usr/lib/fm/fmd/plugins/disk-monitor.so
fmadm: failed to load /usr/lib/fm/fmd/plugins/disk-monitor.so: module failed to load (consult fmd(1M) log)

I checked the log as instructed, but no errors or warnings were recorded. I know the log is working because when I accidentally tried to load the plugin's .conf file instead of the .so, I *did* receive an error in the fmd log:

Nov 05 2009 15:08:36.125443460 ereport.fm.fmd.mod_init
nvlist version: 0
        version = 0x0
        class = ereport.fm.fmd.mod_init
        ena = 0x751b2ea5e2305401
        msg = failed to load /usr/lib/fm/fmd/plugins/disk-monitor.conf: Operation not supported
        __ttl = 0x1
        __tod = 0x4af256cc 0x77a1d84

Can anyone suggest whether this is indeed what I should be doing, and if so, why can't I load this FMA plugin?

Additionally, even if I get this running - what methods are there to manually identify a drive in the enclosure? ie, how do I send a command to the SES device? There needs to be some level of manual control available for this as I can think of multiple scenarios where I'd need to identify and extract a non-faulty drive from an enclosure.

Regards,
Tom

[1] - http://blogs.sun.com/eschrock/entry/ses_sensors

From Robert.Johnston at Sun.COM Wed Nov 4 23:08:28 2009
From: Robert.Johnston at Sun.COM (Rob Johnston)
Date: Wed, 04 Nov 2009 23:08:28 -0800
Subject: [fm-discuss] Unable to load disk-monitor plugin / how to change SES indicators?
In-Reply-To:
References:
Message-ID: <4AF279EC.40901@sun.com>

Hi Tom,

The disk-monitor module is not actually used to detect or diagnose disk faults, but rather is a response agent designed for the thumper/thor platforms.

The disk-monitor module subscribes to FMA diagnosis and repair events and monitors changes in the disk topology (by listening to hotplug sysevents). In response to these events, it will send requests to the service processor (via IPMI) to update FRU information and flip the disk bay LED's on/off, as appropriate.
In order for the disk-monitor module to operate, it needs to know the disk topology of the system, including, as you alluded to, the mapping of Solaris disk devices to physical disk bays.

For internal SATA/SAS disks, the code that constructs the disk topology currently relies on a set of xml files where we've hard-coded the mapping of drive bays to device nodes for a subset of Sun X64 platforms. For many (but not all[1]) external storage enclosures which support SES, we're able to dynamically derive the disk topology without the aid of any hard-coded information.

In the absence of this disk topology, disk-monitor will bail out during initialization, which is likely happening on your system.

That said, disk error telemetry is actually fed into the Fault Manager from two sources:

1) The disk-transport module, which uses libdiskstatus to check for three failure conditions via uSCSI interfaces:

   over temperature
   predictive failure
   self-test failure

2) The sd driver, which will generate error telemetry for problems detected at the target driver level.

Unfortunately, even though your system will be capable of generating error telemetry for your disks, the system that diagnoses faults from the error telemetry also needs to consume information in the disk topology, so you're still probably out of luck.

Hope this helps,

rob

[1] The full answer as to why we can't derive the topology on all SES storage enclosures is a bit too involved to dive into here, but it basically depends on the complexity of the internal SAS topology of the array in question. If the array presents a single root target at the top of the topology then libses will do the right thing. However, if the topology uses SAS expanders to either multi-attach the disks or to talk to different subsets of disks then SES will present multiple targets at the top of the topology and to libses it may appear as multiple storage enclosures, which causes us to generate an inaccurate topology.
There is a workaround for the latter case - libses provides a means of overriding the interpretation by delivering a small plugin module to ses (either a model-specific plugin for a specific array or a single vendor-specific plugin). There are a couple projects underway

Tom Lanyon wrote:
> Hi all,
>
> I'm trying to discover two things regarding my test system which has a
> bunch of SATA disks attached to a SAS expander:
>
> * if I have a drive error, how do I know which cXtYdZ logical device
> maps to which physical disk/bay?
>
> * how can I read temperature information from the drives?
>
> There seems to be some work done in this area by Eric Schrock and Rob
> Johnston[1], which has led me to the disk-monitor FMA plugin. I am
> assuming that this plugin will automatically handle temperature
> monitoring and lighting the fault/locate LEDs but am not entirely sure
> of this.
>
> I attempted to load the module but received:
>
> # fmadm load /usr/lib/fm/fmd/plugins/disk-monitor.so
> fmadm: failed to load /usr/lib/fm/fmd/plugins/disk-monitor.so:
> module failed to load (consult fmd(1M) log)
>
>
> I checked the log as instructed, but no errors or warnings were
> recorded. I know the log is working because when I accidentally tried to
> load the plugin's .conf file instead of the .so, I *did* receive an
> error in the fmd log:
>
> Nov 05 2009 15:08:36.125443460 ereport.fm.fmd.mod_init
> nvlist version: 0
> version = 0x0
> class = ereport.fm.fmd.mod_init
> ena = 0x751b2ea5e2305401
> msg = failed to load
> /usr/lib/fm/fmd/plugins/disk-monitor.conf: Operation not supported
>
> __ttl = 0x1
> __tod = 0x4af256cc 0x77a1d84
>
>
> Can anyone suggest whether this is indeed what I should be doing, and if
> so, why can't I load this FMA plugin?
>
> Additionally, even if I get this running - what methods are there to
> manually identify a drive in the enclosure? ie, how do I send a command
> to the SES device?
There needs to be some level of manual control
> available for this as I can think of multiple scenarios where I'd need
> to identify and extract a non-faulty drive from an enclosure.
>
> Regards,
> Tom
>
> [1] - http://blogs.sun.com/eschrock/entry/ses_sensors
> _______________________________________________
> fm-discuss mailing list
> fm-discuss at opensolaris.org

From steve.hanson at sun.com Thu Nov 5 03:27:06 2009
From: steve.hanson at sun.com (Steve Hanson)
Date: Thu, 05 Nov 2009 11:27:06 +0000
Subject: [fm-discuss] fmd goes into maintenance, snv_126
In-Reply-To: <7316.1257377209@sun.com>
References: <145181101.721257372733058.JavaMail.Twebapp@sf-app1> <4AF201EB.1020102@sun.com> <7316.1257377209@sun.com>
Message-ID: <4AF2B68A.6060609@sun.com>

It looks like the hostbridge enumerator has always assumed that all pci hostbridges will be attached after calling di_devfs_path(). But here we have

pci, instance #0
pci, instance #5
pci (driver not attached)
pci (driver not attached)
pci (driver not attached)
pci (driver not attached)

Interestingly the unattached devices don't show up in the ls -l output. I guess something has changed in the kernel so that these devices are either no longer attached or are now present in the devinfo snapshot when they weren't before.

Steve

>>>>>>"Gavin" == Gavin Maltby writes:
>>>>>>
>>>>>>
>
>Gavin> What was the last working Nevada version installed on this
>Gavin> system?
>
>snv_123.
>
>Gavin> Could you include prtconf, prtconf -p and ls -lR /devices as
>Gavin> attachments in the bug, please.
>
>Done (CR 6898284).
>
>Gavin> Are there any core files in /var/fm/fmd?
>
>Yes, clustered in sets of 3. The stack traces all appear to be the same
>(though I didn't look too closely). I attached the most recent core
>file to the CR.
>
>Let me know if there's anything else you need.
>
>cheers,
>mike
>_______________________________________________
>fm-discuss mailing list
>fm-discuss at opensolaris.org
>
>

From pfisher at alertlogic.net Thu Nov 19 03:23:55 2009
From: pfisher at alertlogic.net (Paul Fisher)
Date: Thu, 19 Nov 2009 05:23:55 -0600
Subject: [fm-discuss] [Fwd: sluggish opensolaris-b127 with unknown fault]
Message-ID: <4B052ACB.4050600@alertlogic.net>

Hopefully someone can help get to the bottom of what is going on with this machine. Overall it appears that fmd is getting in the way of work getting done.

I have a new 2x Intel 5520 supermicro box (prtdiag/prtconf/scanpci below) that installed extremely slowly (4+ hours) and then once booted is overall acting sluggishly. Load seems to stay at 0.50 all of the time doing nothing, and of that most is system time.

Normal vmstat shows not much going on:

local at dev-storage-01:/test# vmstat 5
 kthr       memory              page                disk          faults      cpu
 r b w     swap     free re mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 1 0 0 65936540 47229788 68 86 17  0  0  0  5 -0 15 21 21  496 1690  949  1  8 91
 0 0 0 65222612 46351108  2 14  0  0  0  0  0  0  8  0  0  338  675  313  1  4 96
 0 0 0 65222452 46351000  0  3  0  0  0  0  0  0 17  0  0  416  301  288  0  2 98
 0 0 0 65222452 46351004  0  0  0  0  0  0  0  0 13  0  0  477  424  336  0  0 99

When I create an 8x mirrored vdev pool (1T Samsung enterprise drives) and do a simple dd test it maxes out at 100M/s, where I'd normally expect 500M/s+ at least.
local at dev-storage-01:/storage# dd if=/dev/zero of=test.zeros bs=1M count=64000
64000+0 records in
64000+0 records out
67108864000 bytes (67 GB) copied, 630.583 s, 106 MB/s

Interestingly, during the dd, the load goes to 7+ and vmstat looks like the box is working really hard to get the bits out to disk:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2073 root     8024K 2140K cpu1     0    0   0:07:09 5.1% dd/1
  1008 root       55M   43M cpu12    0    0   0:29:19 4.6% fmd/20
  1261 root       13M 6288K sleep   59    0   0:01:39 0.1% intrd/1
  2107 local    9556K 2832K cpu5     0    0   0:00:00 0.1% prstat/1
  1924 local    9128K 2652K sleep   49    0   0:00:08 0.1% bash/1
  1919 local      21M 5600K sleep   59    0   0:00:10 0.0% sshd/1
...
Total: 48 processes, 199 lwps, load averages: 8.34, 7.31, 5.23

local at dev-storage-01:~$ vmstat 5
 kthr       memory              page                disk          faults      cpu
 r b w     swap     free re mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 1 0 0 65914512 47202672 66 84 17  0  0  0  5 -0 14 21 21  493 1654  929  1  8 91
 9 0 0 65221928 46350452  7 92  0  0  0  0  0  0  4  0  0  493  459  541  2 47 51
 3 0 0 65221600 46350148  0  1  0  0  0  0  0  0 22  0  0  614  383 1171  4 52 44
 3 0 0 65221600 46350148  0  0  0  0  0  0  0  0 13  0  0 1015  680 2022  3 28 68
 4 0 0 65221600 46350148  0  0  0  0  0  0  0  0 15  0  0  585  360  916  5 52 43
 9 0 0 65221580 46350128  0  0  0  0  0  0  0  0  3  0  0  530  391  758  4 60 36

local at dev-storage-01:~$ mpstat 15
...
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl usr sys  wt idl
  0    0   0    1   90   16   70    3   18  106    0    70   0  63   0  37
  1    0   0    0  249  204   99    2   20  177    0     1   0  52   0  48
  2    0   0    0  189  140   84    1   16  150    0    44   0  60   0  40
  3    0   0    1   61    2   91    2   20  147    0    33   0  46   0  54
  4    2   0    1   36    1   48    1   12   56    0    25   6  45   0  49
  5    0   0    0   40    0   51    1   12   66    0     3   7  48   0  44
  6    0   0    1   37    1   59    1   16   67    0     9   2  48   0  49
  7    0   0    0   52    3   51    3   14   64    0     9  20  36   0  44
  8    0   0    0   66    3   40    2   11  116    0    33   0  50   0  50
  9    0   0    1   47    2   83    0   19  169    0     0   0  38   0  62
 10    0   0    1   43    2   68    0   16  128    0     1   0  34   0  66
 11    0   0  552   13    4   59    0   12   92    0    18   0  49   0  51
 12    0   0    0   58    0   42    0   11   45    0     2   1  69   0  30
 13    0   0    1   46    1   43    1   14   60    0    14  14  52   0  34
 14    1   0    0   53    1   49    1   13   60    0    26   9  44   0  47
 15    0   0    1   39    1   57    0   15   75    0     4   0  47   0  53
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl usr sys  wt idl
  0    0   0    0   51   15  113    0   27  208    0     1   0  43   0  56
  1    0   0    0  300  282   85    2   21  163    0    51   0  54   0  46
  2    0   0    0  201  188  102    3   20  147    0    39   2  63   0  35
  3    0   0    0   26    4  122    1   27  252    0     2   1  35   0  64
  4    1   0    0   14    1   60    1   15   68    0    22   6  42   0  52
  5    0   0    0    8    1   57    1   15   68    0    27   9  43   0  48
  6    0   0    0   11    1   66    1   16   83    0    44   5  45   0  50
  7    0   0    0   12    3   62    3   17   71    0    26  12  43   0  45
  8    0   0    0   27    3   73    2   18  168    0    24   1  45   0  53
  9    0   0    1   30    2  100    4   25  142    1    44   0  45   0  55
 10    0   0    0   25    3   82    3   21  153    0     6   1  36   0  63
 11    0   0    0   16    3   73    2   16  201    0    53   1  50   0  49
 12    0   0    0   11    1   51    1   15   71    0    21   6  47   0  47
 13    0   0   49    7    1   63    1   15   72    0    19   6  47   0  46
 14    0   0    0    6    1   57    1   14   82    0     1   1  46   0  52
 15    0   0    0   12    1   64    2   15   78    0    35  14  45   0  41

I did finally see that fmd was running a lot and took a look at what it thought was going on:

local at dev-storage-01:~$ pfexec fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 18 07:00:37 0ddd2e6a-06c4-6936-fb63-fbe93899370d  SUNOS-8000-FU  Major

Host        : dev-storage-01
Platform    : X8DT3     Chassis_id  : 1234567890
Product_sn  :

Fault class : defect.sunos.eft.undiag.fme
FRU         : None
                  faulty

Description : The diagnosis engine
encountered telemetry for which it was unable to perform a diagnosis.
              Refer to http://sun.com/msg/SUNOS-8000-FU for more information.

Response    : Error reports have been logged for examination by Sun.

Impact      : Automated diagnosis and response for these events will not occur.

Action      : Ensure that the latest Solaris Kernel and Predictive Self-Healing (PSH) patches are installed.

At this point I've reached the end of what I know to do in order to diagnose what is going on with this box. I would really appreciate some additional guidance on what I can look at.

Here is the configuration information on the box:

local at dev-storage-01:~$ uname -a
SunOS dev-storage-01 5.11 snv_127 i86pc i386 i86pc Solaris
local at dev-storage-01:~$ cat /etc/release
OpenSolaris Development snv_127 X86
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 06 November 2009
local at dev-storage-01:~$ prtdiag
System Configuration: Supermicro X8DT3
BIOS Configuration: American Megatrends Inc.
080015 09/24/2009 BMC Configuration: IPMI 1.5 (KCS: Keyboard Controller Style) ==== Processor Sockets ==================================== Version Location Tag -------------------------------- -------------------------- Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 2 Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 1 ==== Memory Device Sockets ================================ Type Status Set Device Locator Bank Locator ----------- ------ --- ------------------- ---------------- other in use 0 P1-DIMM1A BANK0 other in use 0 P1-DIMM1B BANK1 other in use 0 P1-DIMM2A BANK2 other in use 0 P1-DIMM2B BANK3 other in use 0 P1-DIMM3A BANK4 other in use 0 P1-DIMM3B BANK5 other in use 0 P2-DIMM1A BANK6 other in use 0 P2-DIMM1B BANK7 other in use 0 P2-DIMM2A BANK8 other in use 0 P2-DIMM2B BANK9 other in use 0 P2-DIMM3A BANK10 other in use 0 P2-DIMM3B BANK11 ==== On-Board Devices ===================================== ==== Upgradeable Slots ==================================== ID Status Type Description --- --------- ---------------- ---------------------------- 1 available PCI PCI#1 2 in use PCI Express PCI-E#2 3 available PCI PCI#3 4 in use PCI Express PCI#4 5 in use PCI Express PCI-E#5 6 available PCI Express PCI-E#6 local at dev-storage-01:~$ prtconf System Configuration: Sun Microsystems i86pc Memory size: 49143 Megabytes System Peripherals (Software Nodes): i86pc scsi_vhci, instance #0 fw, instance #0 cpu, instance #0 cpu, instance #1 cpu, instance #2 cpu, instance #3 cpu, instance #4 cpu, instance #5 cpu, instance #6 cpu, instance #7 cpu, instance #8 cpu, instance #9 cpu, instance #10 cpu, instance #11 cpu, instance #12 cpu, instance #13 cpu, instance #14 cpu, instance #15 sb, instance #1 pci, instance #0 pci15d9,1 (driver not attached) pci8086,3408, instance #0 pci15d9,10c9, instance #0 pci15d9,10c9, instance #1 pci8086,340a (driver not attached) pci8086,340c, instance #2 pci1000,3140, instance #1 sd, instance #5 sd, instance #6 sd, instance #7 sd, instance #8 sd, instance #9 sd, 
instance #10 sd, instance #11 sd, instance #12 pci8086,340e (driver not attached) pci8086,343a (driver not attached) pci8086,343b (driver not attached) pci8086,343c (driver not attached) pci8086,343d (driver not attached) pci8086,3418 (driver not attached) pci8086,3419 (driver not attached) pci8086,341a (driver not attached) pci8086,341c (driver not attached) pci8086,341d (driver not attached) pci8086,341e (driver not attached) pci8086,3439 (driver not attached) pci8086,342e (driver not attached) pci8086,3422 (driver not attached) pci8086,3423, instance #0 pci8086,3438 (driver not attached) pci15d9,1 (driver not attached) pci15d9,1 (driver not attached) pci15d9,1 (driver not attached) pci15d9,1 (driver not attached) pci15d9,1 (driver not attached) pci15d9,1 (driver not attached) pci15d9,1 (driver not attached) pci15d9,1 (driver not attached) pci15d9,1, instance #0 pci15d9,1, instance #1 pci15d9,1, instance #2 device, instance #0 keyboard, instance #0 mouse, instance #1 pci15d9,1, instance #0 pci8086,3a40, instance #4 pci1000,3140, instance #0 sd, instance #13 sd, instance #14 sd, instance #15 sd, instance #16 sd, instance #17 sd, instance #18 sd, instance #19 sd, instance #20 pci15d9,1, instance #3 pci15d9,1, instance #4 pci15d9,1, instance #5 pci15d9,1, instance #1 pci8086,244e, instance #0 display, instance #0 isa, instance #0 motherboard (driver not attached) asy, instance #0 asy, instance #1 asy, instance #2 i8042, instance #0 keyboard, instance #0 mouse, instance #0 motherboard (driver not attached) pit_beep, instance #0 pci15d9,1, instance #0 cdrom, instance #0 disk, instance #1 disk, instance #2 disk, instance #3 disk, instance #4 pci15d9,1 (driver not attached) used-resources (driver not attached) pseudo, instance #0 options, instance #0 xsvc, instance #0 agpgart, instance #0 local at dev-storage-01:~$ pfexec scanpci pci bus 0x0000 cardnum 0x00 function 0x00: vendor 0x8086 device 0x3406 Intel Corporation QuickPath Architecture I/O Hub to ESI Port pci bus 
0x0000 cardnum 0x01 function 0x00: vendor 0x8086 device 0x3408 Intel Corporation QuickPath Architecture I/O Hub PCI Express Root Port 1 pci bus 0x0000 cardnum 0x03 function 0x00: vendor 0x8086 device 0x340a Intel Corporation QuickPath Architecture I/O Hub PCI Express Root Port 3 pci bus 0x0000 cardnum 0x05 function 0x00: vendor 0x8086 device 0x340c Intel Corporation QuickPath Architecture I/O Hub PCI Express Root Port 5 pci bus 0x0000 cardnum 0x07 function 0x00: vendor 0x8086 device 0x340e Intel Corporation QuickPath Architecture I/O Hub PCI Express Root Port 7 pci bus 0x0000 cardnum 0x0d function 0x00: vendor 0x8086 device 0x343a Intel Corporation Device unknown pci bus 0x0000 cardnum 0x0d function 0x01: vendor 0x8086 device 0x343b Intel Corporation Device unknown pci bus 0x0000 cardnum 0x0d function 0x02: vendor 0x8086 device 0x343c Intel Corporation Device unknown pci bus 0x0000 cardnum 0x0d function 0x03: vendor 0x8086 device 0x343d Intel Corporation Device unknown pci bus 0x0000 cardnum 0x0d function 0x04: vendor 0x8086 device 0x3418 Intel Corporation Quickpath Interconnect Physical Layer Port 0 pci bus 0x0000 cardnum 0x0d function 0x05: vendor 0x8086 device 0x3419 Intel Corporation Quickpath Interconnect Physical Layer Port 1 pci bus 0x0000 cardnum 0x0d function 0x06: vendor 0x8086 device 0x341a Intel Corporation Device unknown pci bus 0x0000 cardnum 0x0e function 0x00: vendor 0x8086 device 0x341c Intel Corporation Device unknown pci bus 0x0000 cardnum 0x0e function 0x01: vendor 0x8086 device 0x341d Intel Corporation Device unknown pci bus 0x0000 cardnum 0x0e function 0x02: vendor 0x8086 device 0x341e Intel Corporation Device unknown pci bus 0x0000 cardnum 0x0e function 0x04: vendor 0x8086 device 0x3439 Intel Corporation Device unknown pci bus 0x0000 cardnum 0x14 function 0x00: vendor 0x8086 device 0x342e Intel Corporation QuickPath Architecture I/O Hub System Management Registers pci bus 0x0000 cardnum 0x14 function 0x01: vendor 0x8086 device 0x3422 Intel 
Corporation QuickPath Architecture I/O Hub GPIO and Scratch Pad Registers pci bus 0x0000 cardnum 0x14 function 0x02: vendor 0x8086 device 0x3423 Intel Corporation QuickPath Architecture I/O Hub Control Status and RAS Registers pci bus 0x0000 cardnum 0x14 function 0x03: vendor 0x8086 device 0x3438 Intel Corporation QuickPath Architecture I/O Hub Throttle Registers pci bus 0x0000 cardnum 0x16 function 0x00: vendor 0x8086 device 0x3430 Intel Corporation DMA Engine pci bus 0x0000 cardnum 0x16 function 0x01: vendor 0x8086 device 0x3431 Intel Corporation DMA Engine pci bus 0x0000 cardnum 0x16 function 0x02: vendor 0x8086 device 0x3432 Intel Corporation DMA Engine pci bus 0x0000 cardnum 0x16 function 0x03: vendor 0x8086 device 0x3433 Intel Corporation DMA Engine pci bus 0x0000 cardnum 0x16 function 0x04: vendor 0x8086 device 0x3429 Intel Corporation DMA Engine pci bus 0x0000 cardnum 0x16 function 0x05: vendor 0x8086 device 0x342a Intel Corporation DMA Engine pci bus 0x0000 cardnum 0x16 function 0x06: vendor 0x8086 device 0x342b Intel Corporation DMA Engine pci bus 0x0000 cardnum 0x16 function 0x07: vendor 0x8086 device 0x342c Intel Corporation DMA Engine pci bus 0x0000 cardnum 0x1a function 0x00: vendor 0x8086 device 0x3a37 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 pci bus 0x0000 cardnum 0x1a function 0x01: vendor 0x8086 device 0x3a38 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5 pci bus 0x0000 cardnum 0x1a function 0x02: vendor 0x8086 device 0x3a39 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6 pci bus 0x0000 cardnum 0x1a function 0x07: vendor 0x8086 device 0x3a3c Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2 pci bus 0x0000 cardnum 0x1c function 0x00: vendor 0x8086 device 0x3a40 Intel Corporation 82801JI (ICH10 Family) PCI Express Port 1 pci bus 0x0000 cardnum 0x1d function 0x00: vendor 0x8086 device 0x3a34 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1 pci bus 0x0000 cardnum 
0x1d function 0x01: vendor 0x8086 device 0x3a35 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2 pci bus 0x0000 cardnum 0x1d function 0x02: vendor 0x8086 device 0x3a36 Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3 pci bus 0x0000 cardnum 0x1d function 0x07: vendor 0x8086 device 0x3a3a Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1 pci bus 0x0000 cardnum 0x1e function 0x00: vendor 0x8086 device 0x244e Intel Corporation 82801 PCI Bridge pci bus 0x0000 cardnum 0x1f function 0x00: vendor 0x8086 device 0x3a16 Intel Corporation 82801JIR (ICH10R) LPC Interface Controller pci bus 0x0000 cardnum 0x1f function 0x02: vendor 0x8086 device 0x3a22 Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller pci bus 0x0000 cardnum 0x1f function 0x03: vendor 0x8086 device 0x3a30 Intel Corporation 82801JI (ICH10 Family) SMBus Controller pci bus 0x0001 cardnum 0x03 function 0x00: vendor 0x102b device 0x0532 Matrox Graphics, Inc. MGA G200eW WPCM450 pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x1000 device 0x0058 LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS pci bus 0x0004 cardnum 0x00 function 0x00: vendor 0x1000 device 0x0058 LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS pci bus 0x0006 cardnum 0x00 function 0x00: vendor 0x8086 device 0x10c9 Intel Corporation 82576 Gigabit ET Dual Port Server Adapter pci bus 0x0006 cardnum 0x00 function 0x01: vendor 0x8086 device 0x10c9 Intel Corporation 82576 Gigabit ET Dual Port Server Adapter pci bus 0x00fe cardnum 0x00 function 0x00: vendor 0x8086 device 0x2c40 Intel Corporation QuickPath Architecture Generic Non-Core Registers pci bus 0x00fe cardnum 0x00 function 0x01: vendor 0x8086 device 0x2c01 Intel Corporation QuickPath Architecture System Address Decoder pci bus 0x00fe cardnum 0x02 function 0x00: vendor 0x8086 device 0x2c10 Intel Corporation QPI Link 0 pci bus 0x00fe cardnum 0x02 function 0x01: vendor 0x8086 device 0x2c11 Intel Corporation QPI Physical 
0 pci bus 0x00fe cardnum 0x02 function 0x04: vendor 0x8086 device 0x2c14 Intel Corporation QPI Link 1 pci bus 0x00fe cardnum 0x02 function 0x05: vendor 0x8086 device 0x2c15 Intel Corporation QPI Physical 1 pci bus 0x00fe cardnum 0x03 function 0x00: vendor 0x8086 device 0x2c18 Intel Corporation QuickPath Memory Controller pci bus 0x00fe cardnum 0x03 function 0x01: vendor 0x8086 device 0x2c19 Intel Corporation QuickPath Memory Controller Target Address Decoder pci bus 0x00fe cardnum 0x03 function 0x02: vendor 0x8086 device 0x2c1a Intel Corporation QuickPath Memory Controller RAS Registers pci bus 0x00fe cardnum 0x03 function 0x04: vendor 0x8086 device 0x2c1c Intel Corporation QuickPath Memory Controller Test Registers pci bus 0x00fe cardnum 0x04 function 0x00: vendor 0x8086 device 0x2c20 Intel Corporation QuickPath Memory Controller Channel 0 Control Registers pci bus 0x00fe cardnum 0x04 function 0x01: vendor 0x8086 device 0x2c21 Intel Corporation QuickPath Memory Controller Channel 0 Address Registers pci bus 0x00fe cardnum 0x04 function 0x02: vendor 0x8086 device 0x2c22 Intel Corporation QuickPath Memory Controller Channel 0 Rank Registers pci bus 0x00fe cardnum 0x04 function 0x03: vendor 0x8086 device 0x2c23 Intel Corporation QuickPath Memory Controller Channel 0 Thermal Control Registers pci bus 0x00fe cardnum 0x05 function 0x00: vendor 0x8086 device 0x2c28 Intel Corporation QuickPath Memory Controller Channel 1 Control Registers pci bus 0x00fe cardnum 0x05 function 0x01: vendor 0x8086 device 0x2c29 Intel Corporation QuickPath Memory Controller Channel 1 Address Registers pci bus 0x00fe cardnum 0x05 function 0x02: vendor 0x8086 device 0x2c2a Intel Corporation QuickPath Memory Controller Channel 1 Rank Registers pci bus 0x00fe cardnum 0x05 function 0x03: vendor 0x8086 device 0x2c2b Intel Corporation QuickPath Memory Controller Channel 1 Thermal Control Registers pci bus 0x00fe cardnum 0x06 function 0x00: vendor 0x8086 device 0x2c30 Intel Corporation QuickPath 
Memory Controller Channel 2 Control Registers pci bus 0x00fe cardnum 0x06 function 0x01: vendor 0x8086 device 0x2c31 Intel Corporation QuickPath Memory Controller Channel 2 Address Registers pci bus 0x00fe cardnum 0x06 function 0x02: vendor 0x8086 device 0x2c32 Intel Corporation QuickPath Memory Controller Channel 2 Rank Registers pci bus 0x00fe cardnum 0x06 function 0x03: vendor 0x8086 device 0x2c33 Intel Corporation QuickPath Memory Controller Channel 2 Thermal Control Registers pci bus 0x00ff cardnum 0x00 function 0x00: vendor 0x8086 device 0x2c40 Intel Corporation QuickPath Architecture Generic Non-Core Registers pci bus 0x00ff cardnum 0x00 function 0x01: vendor 0x8086 device 0x2c01 Intel Corporation QuickPath Architecture System Address Decoder pci bus 0x00ff cardnum 0x02 function 0x00: vendor 0x8086 device 0x2c10 Intel Corporation QPI Link 0 pci bus 0x00ff cardnum 0x02 function 0x01: vendor 0x8086 device 0x2c11 Intel Corporation QPI Physical 0 pci bus 0x00ff cardnum 0x02 function 0x04: vendor 0x8086 device 0x2c14 Intel Corporation QPI Link 1 pci bus 0x00ff cardnum 0x02 function 0x05: vendor 0x8086 device 0x2c15 Intel Corporation QPI Physical 1 pci bus 0x00ff cardnum 0x03 function 0x00: vendor 0x8086 device 0x2c18 Intel Corporation QuickPath Memory Controller pci bus 0x00ff cardnum 0x03 function 0x01: vendor 0x8086 device 0x2c19 Intel Corporation QuickPath Memory Controller Target Address Decoder pci bus 0x00ff cardnum 0x03 function 0x02: vendor 0x8086 device 0x2c1a Intel Corporation QuickPath Memory Controller RAS Registers pci bus 0x00ff cardnum 0x03 function 0x04: vendor 0x8086 device 0x2c1c Intel Corporation QuickPath Memory Controller Test Registers pci bus 0x00ff cardnum 0x04 function 0x00: vendor 0x8086 device 0x2c20 Intel Corporation QuickPath Memory Controller Channel 0 Control Registers pci bus 0x00ff cardnum 0x04 function 0x01: vendor 0x8086 device 0x2c21 Intel Corporation QuickPath Memory Controller Channel 0 Address Registers pci bus 0x00ff 
cardnum 0x04 function 0x02: vendor 0x8086 device 0x2c22
 Intel Corporation QuickPath Memory Controller Channel 0 Rank Registers

pci bus 0x00ff cardnum 0x04 function 0x03: vendor 0x8086 device 0x2c23
 Intel Corporation QuickPath Memory Controller Channel 0 Thermal Control Registers

pci bus 0x00ff cardnum 0x05 function 0x00: vendor 0x8086 device 0x2c28
 Intel Corporation QuickPath Memory Controller Channel 1 Control Registers

pci bus 0x00ff cardnum 0x05 function 0x01: vendor 0x8086 device 0x2c29
 Intel Corporation QuickPath Memory Controller Channel 1 Address Registers

pci bus 0x00ff cardnum 0x05 function 0x02: vendor 0x8086 device 0x2c2a
 Intel Corporation QuickPath Memory Controller Channel 1 Rank Registers

pci bus 0x00ff cardnum 0x05 function 0x03: vendor 0x8086 device 0x2c2b
 Intel Corporation QuickPath Memory Controller Channel 1 Thermal Control Registers

pci bus 0x00ff cardnum 0x06 function 0x00: vendor 0x8086 device 0x2c30
 Intel Corporation QuickPath Memory Controller Channel 2 Control Registers

pci bus 0x00ff cardnum 0x06 function 0x01: vendor 0x8086 device 0x2c31
 Intel Corporation QuickPath Memory Controller Channel 2 Address Registers

pci bus 0x00ff cardnum 0x06 function 0x02: vendor 0x8086 device 0x2c32
 Intel Corporation QuickPath Memory Controller Channel 2 Rank Registers

pci bus 0x00ff cardnum 0x06 function 0x03: vendor 0x8086 device 0x2c33
 Intel Corporation QuickPath Memory Controller Channel 2 Thermal Control Registers

-- 
paul

From steve.hanson at sun.com Thu Nov 19 03:44:58 2009
From: steve.hanson at sun.com (Steve Hanson)
Date: Thu, 19 Nov 2009 11:44:58 +0000
Subject: [fm-discuss] [Fwd: sluggish opensolaris-b127 with unknown fault]
In-Reply-To: <4B052ACB.4050600@alertlogic.net>
References: <4B052ACB.4050600@alertlogic.net>
Message-ID: <4B052FBA.4020300@sun.com>

Hi Paul,

Can you send the "fmdump -eV" output?

Steve

> Hopefully someone can help get to the bottom of what is going on with
> this machine.
Overall it appears that fmd is getting in the way of
> work getting done.
>
> I have a new 2x Intel 5520 supermicro box (prtdiag/prtconf/scanpci
> below) that installed extremely slowly (4+ hours) and then once booted
> is overall acting sluggishly. Load seems to stay at 0.50 all of the
> time doing nothing, and of that most is system time.
>
> Normal vmstat shows not much going on:
>
> local@dev-storage-01:/test# vmstat 5
>  kthr      memory               page             disk        faults         cpu
>  r b w     swap     free re mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
>  1 0 0 65936540 47229788 68 86 17  0  0  0  5 -0 15 21 21  496 1690  949  1  8 91
>  0 0 0 65222612 46351108  2 14  0  0  0  0  0  0  8  0  0  338  675  313  1  4 96
>  0 0 0 65222452 46351000  0  3  0  0  0  0  0  0 17  0  0  416  301  288  0  2 98
>  0 0 0 65222452 46351004  0  0  0  0  0  0  0  0 13  0  0  477  424  336  0  0 99
>
> When I create an 8x mirrored vdev pool (1T samsung enterprise drives) and
> do a simple dd test it maxes out at 100M/s, where I'd normally expect
> 500M/s+ at least.
>
> local@dev-storage-01:/storage# dd if=/dev/zero of=test.zeros bs=1M count=64000
> 64000+0 records in
> 64000+0 records out
> 67108864000 bytes (67 GB) copied, 630.583 s, 106 MB/s
>
> Interestingly, during the dd, the load goes to 7+ and vmstat looks like
> the box is working really hard to get the bits out to disk:
>
>    PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
>   2073 root     8024K 2140K cpu1     0    0   0:07:09 5.1% dd/1
>   1008 root       55M   43M cpu12    0    0   0:29:19 4.6% fmd/20
>   1261 root       13M 6288K sleep   59    0   0:01:39 0.1% intrd/1
>   2107 local    9556K 2832K cpu5     0    0   0:00:00 0.1% prstat/1
>   1924 local    9128K 2652K sleep   49    0   0:00:08 0.1% bash/1
>   1919 local      21M 5600K sleep   59    0   0:00:10 0.0% sshd/1
> ...
> Total: 48 processes, 199 lwps, load averages: 8.34, 7.31, 5.23 > > local at dev-storage-01:~$ vmstat 5 > kthr memory page disk > faults cpu > r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs > us sy id > 1 0 0 65914512 47202672 66 84 17 0 0 0 5 -0 14 21 21 493 1654 929 > 1 8 91 > 9 0 0 65221928 46350452 7 92 0 0 0 0 0 0 4 0 0 493 459 541 > 2 47 51 > 3 0 0 65221600 46350148 0 1 0 0 0 0 0 0 22 0 0 614 383 1171 > 4 52 44 > 3 0 0 65221600 46350148 0 0 0 0 0 0 0 0 13 0 0 1015 680 2022 > 3 28 68 > 4 0 0 65221600 46350148 0 0 0 0 0 0 0 0 15 0 0 585 360 916 > 5 52 43 > 9 0 0 65221580 46350128 0 0 0 0 0 0 0 0 3 0 0 530 391 758 > 4 60 36 > > local at dev-storage-01:~$ mpstat 15 > ... > CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys > wt idl > 0 0 0 1 90 16 70 3 18 106 0 70 0 63 > 0 37 > 1 0 0 0 249 204 99 2 20 177 0 1 0 52 > 0 48 > 2 0 0 0 189 140 84 1 16 150 0 44 0 60 > 0 40 > 3 0 0 1 61 2 91 2 20 147 0 33 0 46 > 0 54 > 4 2 0 1 36 1 48 1 12 56 0 25 6 45 > 0 49 > 5 0 0 0 40 0 51 1 12 66 0 3 7 48 > 0 44 > 6 0 0 1 37 1 59 1 16 67 0 9 2 48 > 0 49 > 7 0 0 0 52 3 51 3 14 64 0 9 20 36 > 0 44 > 8 0 0 0 66 3 40 2 11 116 0 33 0 50 > 0 50 > 9 0 0 1 47 2 83 0 19 169 0 0 0 38 > 0 62 > 10 0 0 1 43 2 68 0 16 128 0 1 0 34 > 0 66 > 11 0 0 552 13 4 59 0 12 92 0 18 0 49 > 0 51 > 12 0 0 0 58 0 42 0 11 45 0 2 1 69 > 0 30 > 13 0 0 1 46 1 43 1 14 60 0 14 14 52 > 0 34 > 14 1 0 0 53 1 49 1 13 60 0 26 9 44 > 0 47 > 15 0 0 1 39 1 57 0 15 75 0 4 0 47 > 0 53 > CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys > wt idl > 0 0 0 0 51 15 113 0 27 208 0 1 0 43 > 0 56 > 1 0 0 0 300 282 85 2 21 163 0 51 0 54 > 0 46 > 2 0 0 0 201 188 102 3 20 147 0 39 2 63 > 0 35 > 3 0 0 0 26 4 122 1 27 252 0 2 1 35 > 0 64 > 4 1 0 0 14 1 60 1 15 68 0 22 6 42 > 0 52 > 5 0 0 0 8 1 57 1 15 68 0 27 9 43 > 0 48 > 6 0 0 0 11 1 66 1 16 83 0 44 5 45 > 0 50 > 7 0 0 0 12 3 62 3 17 71 0 26 12 43 > 0 45 > 8 0 0 0 27 3 73 2 18 168 0 24 1 45 > 0 53 > 9 0 0 1 30 2 100 4 25 142 1 44 0 45 > 0 55 > 10 
0 0  0 25 3 82 3 21 153 0  6  1 36 0 63
> 11 0 0  0 16 3 73 2 16 201 0 53  1 50 0 49
> 12 0 0  0 11 1 51 1 15  71 0 21  6 47 0 47
> 13 0 0 49  7 1 63 1 15  72 0 19  6 47 0 46
> 14 0 0  0  6 1 57 1 14  82 0  1  1 46 0 52
> 15 0 0  0 12 1 64 2 15  78 0 35 14 45 0 41
>
>
> I did finally see that fmd was running a lot and took a look at what it
> thought was going on:
>
> local@dev-storage-01:~$ pfexec fmadm faulty
> --------------- ------------------------------------ -------------- ---------
> TIME            EVENT-ID                             MSG-ID         SEVERITY
> --------------- ------------------------------------ -------------- ---------
> Nov 18 07:00:37 0ddd2e6a-06c4-6936-fb63-fbe93899370d SUNOS-8000-FU  Major
>
> Host        : dev-storage-01
> Platform    : X8DT3   Chassis_id  : 1234567890
> Product_sn  :
>
> Fault class : defect.sunos.eft.undiag.fme
> FRU         : None
>                   faulty
>
> Description : The diagnosis engine encountered telemetry for which it was
>               unable to perform a diagnosis. Refer to
>               http://sun.com/msg/SUNOS-8000-FU for more information.
>
> Response    : Error reports have been logged for examination by Sun.
>
> Impact      : Automated diagnosis and response for these events will not
>               occur.
>
> Action      : Ensure that the latest Solaris Kernel and Predictive Self-Healing
>               (PSH) patches are installed.
>
> At this point I've reached the end of what I know to do in order to
> diagnose what is going on with this box. I would really appreciate some
> additional guidance on what I can look at.
>
>
> Here is the configuration information on the box:
>
> local@dev-storage-01:~$ uname -a
> SunOS dev-storage-01 5.11 snv_127 i86pc i386 i86pc Solaris
> local@dev-storage-01:~$ cat /etc/release
>                        OpenSolaris Development snv_127 X86
>            Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
>                         Use is subject to license terms.
>                            Assembled 06 November 2009
>
> local@dev-storage-01:~$ prtdiag
> System Configuration: Supermicro X8DT3
> BIOS Configuration: American Megatrends Inc.
080015 09/24/2009 > BMC Configuration: IPMI 1.5 (KCS: Keyboard Controller Style) > > ==== Processor Sockets ==================================== > > Version Location Tag > -------------------------------- -------------------------- > Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 2 > Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 1 > > ==== Memory Device Sockets ================================ > > Type Status Set Device Locator Bank Locator > ----------- ------ --- ------------------- ---------------- > other in use 0 P1-DIMM1A BANK0 > other in use 0 P1-DIMM1B BANK1 > other in use 0 P1-DIMM2A BANK2 > other in use 0 P1-DIMM2B BANK3 > other in use 0 P1-DIMM3A BANK4 > other in use 0 P1-DIMM3B BANK5 > other in use 0 P2-DIMM1A BANK6 > other in use 0 P2-DIMM1B BANK7 > other in use 0 P2-DIMM2A BANK8 > other in use 0 P2-DIMM2B BANK9 > other in use 0 P2-DIMM3A BANK10 > other in use 0 P2-DIMM3B BANK11 > > ==== On-Board Devices ===================================== > > ==== Upgradeable Slots ==================================== > > ID Status Type Description > --- --------- ---------------- ---------------------------- > 1 available PCI PCI#1 > 2 in use PCI Express PCI-E#2 > 3 available PCI PCI#3 > 4 in use PCI Express PCI#4 > 5 in use PCI Express PCI-E#5 > 6 available PCI Express PCI-E#6 > > > > local at dev-storage-01:~$ prtconf > System Configuration: Sun Microsystems i86pc > Memory size: 49143 Megabytes > System Peripherals (Software Nodes): > > i86pc > scsi_vhci, instance #0 > fw, instance #0 > cpu, instance #0 > cpu, instance #1 > cpu, instance #2 > cpu, instance #3 > cpu, instance #4 > cpu, instance #5 > cpu, instance #6 > cpu, instance #7 > cpu, instance #8 > cpu, instance #9 > cpu, instance #10 > cpu, instance #11 > cpu, instance #12 > cpu, instance #13 > cpu, instance #14 > cpu, instance #15 > sb, instance #1 > pci, instance #0 > pci15d9,1 (driver not attached) > pci8086,3408, instance #0 > pci15d9,10c9, instance #0 > pci15d9,10c9, instance #1 > pci8086,340a (driver not 
attached) > pci8086,340c, instance #2 > pci1000,3140, instance #1 > sd, instance #5 > sd, instance #6 > sd, instance #7 > sd, instance #8 > sd, instance #9 > sd, instance #10 > sd, instance #11 > sd, instance #12 > pci8086,340e (driver not attached) > pci8086,343a (driver not attached) > pci8086,343b (driver not attached) > pci8086,343c (driver not attached) > pci8086,343d (driver not attached) > pci8086,3418 (driver not attached) > pci8086,3419 (driver not attached) > pci8086,341a (driver not attached) > pci8086,341c (driver not attached) > pci8086,341d (driver not attached) > pci8086,341e (driver not attached) > pci8086,3439 (driver not attached) > pci8086,342e (driver not attached) > pci8086,3422 (driver not attached) > pci8086,3423, instance #0 > pci8086,3438 (driver not attached) > pci15d9,1 (driver not attached) > pci15d9,1 (driver not attached) > pci15d9,1 (driver not attached) > pci15d9,1 (driver not attached) > pci15d9,1 (driver not attached) > pci15d9,1 (driver not attached) > pci15d9,1 (driver not attached) > pci15d9,1 (driver not attached) > pci15d9,1, instance #0 > pci15d9,1, instance #1 > pci15d9,1, instance #2 > device, instance #0 > keyboard, instance #0 > mouse, instance #1 > pci15d9,1, instance #0 > pci8086,3a40, instance #4 > pci1000,3140, instance #0 > sd, instance #13 > sd, instance #14 > sd, instance #15 > sd, instance #16 > sd, instance #17 > sd, instance #18 > sd, instance #19 > sd, instance #20 > pci15d9,1, instance #3 > pci15d9,1, instance #4 > pci15d9,1, instance #5 > pci15d9,1, instance #1 > pci8086,244e, instance #0 > display, instance #0 > isa, instance #0 > motherboard (driver not attached) > asy, instance #0 > asy, instance #1 > asy, instance #2 > i8042, instance #0 > keyboard, instance #0 > mouse, instance #0 > motherboard (driver not attached) > pit_beep, instance #0 > pci15d9,1, instance #0 > cdrom, instance #0 > disk, instance #1 > disk, instance #2 > disk, instance #3 > disk, instance #4 > pci15d9,1 (driver not attached) > 
used-resources (driver not attached) > pseudo, instance #0 > options, instance #0 > xsvc, instance #0 > agpgart, instance #0 > > local at dev-storage-01:~$ pfexec scanpci > > pci bus 0x0000 cardnum 0x00 function 0x00: vendor 0x8086 device 0x3406 > Intel Corporation QuickPath Architecture I/O Hub to ESI Port > > pci bus 0x0000 cardnum 0x01 function 0x00: vendor 0x8086 device 0x3408 > Intel Corporation QuickPath Architecture I/O Hub PCI Express Root > Port 1 > > pci bus 0x0000 cardnum 0x03 function 0x00: vendor 0x8086 device 0x340a > Intel Corporation QuickPath Architecture I/O Hub PCI Express Root > Port 3 > > pci bus 0x0000 cardnum 0x05 function 0x00: vendor 0x8086 device 0x340c > Intel Corporation QuickPath Architecture I/O Hub PCI Express Root > Port 5 > > pci bus 0x0000 cardnum 0x07 function 0x00: vendor 0x8086 device 0x340e > Intel Corporation QuickPath Architecture I/O Hub PCI Express Root > Port 7 > > pci bus 0x0000 cardnum 0x0d function 0x00: vendor 0x8086 device 0x343a > Intel Corporation Device unknown > > pci bus 0x0000 cardnum 0x0d function 0x01: vendor 0x8086 device 0x343b > Intel Corporation Device unknown > > pci bus 0x0000 cardnum 0x0d function 0x02: vendor 0x8086 device 0x343c > Intel Corporation Device unknown > > pci bus 0x0000 cardnum 0x0d function 0x03: vendor 0x8086 device 0x343d > Intel Corporation Device unknown > > pci bus 0x0000 cardnum 0x0d function 0x04: vendor 0x8086 device 0x3418 > Intel Corporation Quickpath Interconnect Physical Layer Port 0 > > pci bus 0x0000 cardnum 0x0d function 0x05: vendor 0x8086 device 0x3419 > Intel Corporation Quickpath Interconnect Physical Layer Port 1 > > pci bus 0x0000 cardnum 0x0d function 0x06: vendor 0x8086 device 0x341a > Intel Corporation Device unknown > > pci bus 0x0000 cardnum 0x0e function 0x00: vendor 0x8086 device 0x341c > Intel Corporation Device unknown > > pci bus 0x0000 cardnum 0x0e function 0x01: vendor 0x8086 device 0x341d > Intel Corporation Device unknown > > pci bus 0x0000 cardnum 0x0e 
function 0x02: vendor 0x8086 device 0x341e > Intel Corporation Device unknown > > pci bus 0x0000 cardnum 0x0e function 0x04: vendor 0x8086 device 0x3439 > Intel Corporation Device unknown > > pci bus 0x0000 cardnum 0x14 function 0x00: vendor 0x8086 device 0x342e > Intel Corporation QuickPath Architecture I/O Hub System Management > Registers > > pci bus 0x0000 cardnum 0x14 function 0x01: vendor 0x8086 device 0x3422 > Intel Corporation QuickPath Architecture I/O Hub GPIO and Scratch Pad > Registers > > pci bus 0x0000 cardnum 0x14 function 0x02: vendor 0x8086 device 0x3423 > Intel Corporation QuickPath Architecture I/O Hub Control Status and > RAS Registers > > pci bus 0x0000 cardnum 0x14 function 0x03: vendor 0x8086 device 0x3438 > Intel Corporation QuickPath Architecture I/O Hub Throttle Registers > > pci bus 0x0000 cardnum 0x16 function 0x00: vendor 0x8086 device 0x3430 > Intel Corporation DMA Engine > > pci bus 0x0000 cardnum 0x16 function 0x01: vendor 0x8086 device 0x3431 > Intel Corporation DMA Engine > > pci bus 0x0000 cardnum 0x16 function 0x02: vendor 0x8086 device 0x3432 > Intel Corporation DMA Engine > > pci bus 0x0000 cardnum 0x16 function 0x03: vendor 0x8086 device 0x3433 > Intel Corporation DMA Engine > > pci bus 0x0000 cardnum 0x16 function 0x04: vendor 0x8086 device 0x3429 > Intel Corporation DMA Engine > > pci bus 0x0000 cardnum 0x16 function 0x05: vendor 0x8086 device 0x342a > Intel Corporation DMA Engine > > pci bus 0x0000 cardnum 0x16 function 0x06: vendor 0x8086 device 0x342b > Intel Corporation DMA Engine > > pci bus 0x0000 cardnum 0x16 function 0x07: vendor 0x8086 device 0x342c > Intel Corporation DMA Engine > > pci bus 0x0000 cardnum 0x1a function 0x00: vendor 0x8086 device 0x3a37 > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 > > pci bus 0x0000 cardnum 0x1a function 0x01: vendor 0x8086 device 0x3a38 > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5 > > pci bus 0x0000 cardnum 0x1a function 0x02: vendor 0x8086 
device 0x3a39 > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6 > > pci bus 0x0000 cardnum 0x1a function 0x07: vendor 0x8086 device 0x3a3c > Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2 > > pci bus 0x0000 cardnum 0x1c function 0x00: vendor 0x8086 device 0x3a40 > Intel Corporation 82801JI (ICH10 Family) PCI Express Port 1 > > pci bus 0x0000 cardnum 0x1d function 0x00: vendor 0x8086 device 0x3a34 > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1 > > pci bus 0x0000 cardnum 0x1d function 0x01: vendor 0x8086 device 0x3a35 > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2 > > pci bus 0x0000 cardnum 0x1d function 0x02: vendor 0x8086 device 0x3a36 > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3 > > pci bus 0x0000 cardnum 0x1d function 0x07: vendor 0x8086 device 0x3a3a > Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1 > > pci bus 0x0000 cardnum 0x1e function 0x00: vendor 0x8086 device 0x244e > Intel Corporation 82801 PCI Bridge > > pci bus 0x0000 cardnum 0x1f function 0x00: vendor 0x8086 device 0x3a16 > Intel Corporation 82801JIR (ICH10R) LPC Interface Controller > > pci bus 0x0000 cardnum 0x1f function 0x02: vendor 0x8086 device 0x3a22 > Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller > > pci bus 0x0000 cardnum 0x1f function 0x03: vendor 0x8086 device 0x3a30 > Intel Corporation 82801JI (ICH10 Family) SMBus Controller > > pci bus 0x0001 cardnum 0x03 function 0x00: vendor 0x102b device 0x0532 > Matrox Graphics, Inc. 
MGA G200eW WPCM450 > > pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x1000 device 0x0058 > LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS > > pci bus 0x0004 cardnum 0x00 function 0x00: vendor 0x1000 device 0x0058 > LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS > > pci bus 0x0006 cardnum 0x00 function 0x00: vendor 0x8086 device 0x10c9 > Intel Corporation 82576 Gigabit ET Dual Port Server Adapter > > pci bus 0x0006 cardnum 0x00 function 0x01: vendor 0x8086 device 0x10c9 > Intel Corporation 82576 Gigabit ET Dual Port Server Adapter > > pci bus 0x00fe cardnum 0x00 function 0x00: vendor 0x8086 device 0x2c40 > Intel Corporation QuickPath Architecture Generic Non-Core Registers > > pci bus 0x00fe cardnum 0x00 function 0x01: vendor 0x8086 device 0x2c01 > Intel Corporation QuickPath Architecture System Address Decoder > > pci bus 0x00fe cardnum 0x02 function 0x00: vendor 0x8086 device 0x2c10 > Intel Corporation QPI Link 0 > > pci bus 0x00fe cardnum 0x02 function 0x01: vendor 0x8086 device 0x2c11 > Intel Corporation QPI Physical 0 > > pci bus 0x00fe cardnum 0x02 function 0x04: vendor 0x8086 device 0x2c14 > Intel Corporation QPI Link 1 > > pci bus 0x00fe cardnum 0x02 function 0x05: vendor 0x8086 device 0x2c15 > Intel Corporation QPI Physical 1 > > pci bus 0x00fe cardnum 0x03 function 0x00: vendor 0x8086 device 0x2c18 > Intel Corporation QuickPath Memory Controller > > pci bus 0x00fe cardnum 0x03 function 0x01: vendor 0x8086 device 0x2c19 > Intel Corporation QuickPath Memory Controller Target Address Decoder > > pci bus 0x00fe cardnum 0x03 function 0x02: vendor 0x8086 device 0x2c1a > Intel Corporation QuickPath Memory Controller RAS Registers > > pci bus 0x00fe cardnum 0x03 function 0x04: vendor 0x8086 device 0x2c1c > Intel Corporation QuickPath Memory Controller Test Registers > > pci bus 0x00fe cardnum 0x04 function 0x00: vendor 0x8086 device 0x2c20 > Intel Corporation QuickPath Memory Controller Channel 0 Control > Registers > > pci bus 
0x00fe cardnum 0x04 function 0x01: vendor 0x8086 device 0x2c21 > Intel Corporation QuickPath Memory Controller Channel 0 Address > Registers > > pci bus 0x00fe cardnum 0x04 function 0x02: vendor 0x8086 device 0x2c22 > Intel Corporation QuickPath Memory Controller Channel 0 Rank Registers > > pci bus 0x00fe cardnum 0x04 function 0x03: vendor 0x8086 device 0x2c23 > Intel Corporation QuickPath Memory Controller Channel 0 Thermal > Control Registers > > pci bus 0x00fe cardnum 0x05 function 0x00: vendor 0x8086 device 0x2c28 > Intel Corporation QuickPath Memory Controller Channel 1 Control > Registers > > pci bus 0x00fe cardnum 0x05 function 0x01: vendor 0x8086 device 0x2c29 > Intel Corporation QuickPath Memory Controller Channel 1 Address > Registers > > pci bus 0x00fe cardnum 0x05 function 0x02: vendor 0x8086 device 0x2c2a > Intel Corporation QuickPath Memory Controller Channel 1 Rank Registers > > pci bus 0x00fe cardnum 0x05 function 0x03: vendor 0x8086 device 0x2c2b > Intel Corporation QuickPath Memory Controller Channel 1 Thermal > Control Registers > > pci bus 0x00fe cardnum 0x06 function 0x00: vendor 0x8086 device 0x2c30 > Intel Corporation QuickPath Memory Controller Channel 2 Control > Registers > > pci bus 0x00fe cardnum 0x06 function 0x01: vendor 0x8086 device 0x2c31 > Intel Corporation QuickPath Memory Controller Channel 2 Address > Registers > > pci bus 0x00fe cardnum 0x06 function 0x02: vendor 0x8086 device 0x2c32 > Intel Corporation QuickPath Memory Controller Channel 2 Rank Registers > > pci bus 0x00fe cardnum 0x06 function 0x03: vendor 0x8086 device 0x2c33 > Intel Corporation QuickPath Memory Controller Channel 2 Thermal > Control Registers > > pci bus 0x00ff cardnum 0x00 function 0x00: vendor 0x8086 device 0x2c40 > Intel Corporation QuickPath Architecture Generic Non-Core Registers > > pci bus 0x00ff cardnum 0x00 function 0x01: vendor 0x8086 device 0x2c01 > Intel Corporation QuickPath Architecture System Address Decoder > > pci bus 0x00ff cardnum 0x02 
function 0x00: vendor 0x8086 device 0x2c10 > Intel Corporation QPI Link 0 > > pci bus 0x00ff cardnum 0x02 function 0x01: vendor 0x8086 device 0x2c11 > Intel Corporation QPI Physical 0 > > pci bus 0x00ff cardnum 0x02 function 0x04: vendor 0x8086 device 0x2c14 > Intel Corporation QPI Link 1 > > pci bus 0x00ff cardnum 0x02 function 0x05: vendor 0x8086 device 0x2c15 > Intel Corporation QPI Physical 1 > > pci bus 0x00ff cardnum 0x03 function 0x00: vendor 0x8086 device 0x2c18 > Intel Corporation QuickPath Memory Controller > > pci bus 0x00ff cardnum 0x03 function 0x01: vendor 0x8086 device 0x2c19 > Intel Corporation QuickPath Memory Controller Target Address Decoder > > pci bus 0x00ff cardnum 0x03 function 0x02: vendor 0x8086 device 0x2c1a > Intel Corporation QuickPath Memory Controller RAS Registers > > pci bus 0x00ff cardnum 0x03 function 0x04: vendor 0x8086 device 0x2c1c > Intel Corporation QuickPath Memory Controller Test Registers > > pci bus 0x00ff cardnum 0x04 function 0x00: vendor 0x8086 device 0x2c20 > Intel Corporation QuickPath Memory Controller Channel 0 Control > Registers > > pci bus 0x00ff cardnum 0x04 function 0x01: vendor 0x8086 device 0x2c21 > Intel Corporation QuickPath Memory Controller Channel 0 Address > Registers > > pci bus 0x00ff cardnum 0x04 function 0x02: vendor 0x8086 device 0x2c22 > Intel Corporation QuickPath Memory Controller Channel 0 Rank Registers > > pci bus 0x00ff cardnum 0x04 function 0x03: vendor 0x8086 device 0x2c23 > Intel Corporation QuickPath Memory Controller Channel 0 Thermal > Control Registers > > pci bus 0x00ff cardnum 0x05 function 0x00: vendor 0x8086 device 0x2c28 > Intel Corporation QuickPath Memory Controller Channel 1 Control > Registers > > pci bus 0x00ff cardnum 0x05 function 0x01: vendor 0x8086 device 0x2c29 > Intel Corporation QuickPath Memory Controller Channel 1 Address > Registers > > pci bus 0x00ff cardnum 0x05 function 0x02: vendor 0x8086 device 0x2c2a > Intel Corporation QuickPath Memory Controller Channel 1 
Rank Registers
>
> pci bus 0x00ff cardnum 0x05 function 0x03: vendor 0x8086 device 0x2c2b
>  Intel Corporation QuickPath Memory Controller Channel 1 Thermal Control Registers
>
> pci bus 0x00ff cardnum 0x06 function 0x00: vendor 0x8086 device 0x2c30
>  Intel Corporation QuickPath Memory Controller Channel 2 Control Registers
>
> pci bus 0x00ff cardnum 0x06 function 0x01: vendor 0x8086 device 0x2c31
>  Intel Corporation QuickPath Memory Controller Channel 2 Address Registers
>
> pci bus 0x00ff cardnum 0x06 function 0x02: vendor 0x8086 device 0x2c32
>  Intel Corporation QuickPath Memory Controller Channel 2 Rank Registers
>
> pci bus 0x00ff cardnum 0x06 function 0x03: vendor 0x8086 device 0x2c33
>  Intel Corporation QuickPath Memory Controller Channel 2 Thermal Control Registers
>
> -- 
> paul
>
> _______________________________________________
> fm-discuss mailing list
> fm-discuss at opensolaris.org

From Scott.Davenport at Sun.COM Thu Nov 19 11:35:10 2009
From: Scott.Davenport at Sun.COM (Scott Davenport)
Date: Thu, 19 Nov 2009 11:35:10 -0800
Subject: [fm-discuss] [Fwd: sluggish opensolaris-b127 with unknown fault]
In-Reply-To: <4B052FBA.4020300@sun.com>
References: <4B052ACB.4050600@alertlogic.net> <4B052FBA.4020300@sun.com>
Message-ID: <1258659310.29690.7.camel@prax>

On Thu, 2009-11-19 at 11:44 +0000, Steve Hanson wrote:
> Hi Paul,
>
> Can you send the "fmdump -eV" output?

As well as 'fmtopo' output. There's a disconnect between the incoming
telemetry and system topology.

-scott

>
> > Hopefully someone can help get to the bottom of what is going on with
> > this machine. Overall it appears that fmd is getting in the way of
> > work getting done.
> >
> > I have a new 2x Intel 5520 supermicro box (prtdiag/prtconf/scanpci
> > below) that installed extremely slowly (4+ hours) and then once booted
> > is overall acting sluggishly. Load seems to stay at 0.50 all of the
> > time doing nothing, and of that most is system time.
> > > > Normal vmstat show not much going on: > > > > local at dev-storage-01:/test# vmstat 5 > > kthr memory page disk > > faults cpu > > r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs > > us sy id > > 1 0 0 65936540 47229788 68 86 17 0 0 0 5 -0 15 21 21 496 1690 949 > > 1 8 91 > > 0 0 0 65222612 46351108 2 14 0 0 0 0 0 0 8 0 0 338 675 313 > > 1 4 96 > > 0 0 0 65222452 46351000 0 3 0 0 0 0 0 0 17 0 0 416 301 288 > > 0 2 98 > > 0 0 0 65222452 46351004 0 0 0 0 0 0 0 0 13 0 0 477 424 336 > > 0 0 99 > > > > When I create a 8x mirrored vdev pool (1T samsung enterise drives) and > > do a simple dd test it maxes out at 100M/s, where I'd normally expect > > 500M/s+ at least. > > > > local at dev-storage-01:/storage# dd if=/dev/zero of=test.zeros bs=1M > > count=64000 > > 64000+0 records in > > 64000+0 records out > > 67108864000 bytes (67 GB) copied, 630.583 s, 106 MB/s > > > > Interestingly, during the dd, the load goes to 7+ and vmstat looks like > > the box is working really hard to get the bits out to disk: > > > > PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP > > > > 2073 root 8024K 2140K cpu1 0 0 0:07:09 5.1% dd/1 > > 1008 root 55M 43M cpu12 0 0 0:29:19 4.6% fmd/20 > > 1261 root 13M 6288K sleep 59 0 0:01:39 0.1% intrd/1 > > 2107 local 9556K 2832K cpu5 0 0 0:00:00 0.1% prstat/1 > > 1924 local 9128K 2652K sleep 49 0 0:00:08 0.1% bash/1 > > 1919 local 21M 5600K sleep 59 0 0:00:10 0.0% sshd/1 > > ... 
> > Total: 48 processes, 199 lwps, load averages: 8.34, 7.31, 5.23 > > > > local at dev-storage-01:~$ vmstat 5 > > kthr memory page disk > > faults cpu > > r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs > > us sy id > > 1 0 0 65914512 47202672 66 84 17 0 0 0 5 -0 14 21 21 493 1654 929 > > 1 8 91 > > 9 0 0 65221928 46350452 7 92 0 0 0 0 0 0 4 0 0 493 459 541 > > 2 47 51 > > 3 0 0 65221600 46350148 0 1 0 0 0 0 0 0 22 0 0 614 383 1171 > > 4 52 44 > > 3 0 0 65221600 46350148 0 0 0 0 0 0 0 0 13 0 0 1015 680 2022 > > 3 28 68 > > 4 0 0 65221600 46350148 0 0 0 0 0 0 0 0 15 0 0 585 360 916 > > 5 52 43 > > 9 0 0 65221580 46350128 0 0 0 0 0 0 0 0 3 0 0 530 391 758 > > 4 60 36 > > > > local at dev-storage-01:~$ mpstat 15 > > ... > > CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys > > wt idl > > 0 0 0 1 90 16 70 3 18 106 0 70 0 63 > > 0 37 > > 1 0 0 0 249 204 99 2 20 177 0 1 0 52 > > 0 48 > > 2 0 0 0 189 140 84 1 16 150 0 44 0 60 > > 0 40 > > 3 0 0 1 61 2 91 2 20 147 0 33 0 46 > > 0 54 > > 4 2 0 1 36 1 48 1 12 56 0 25 6 45 > > 0 49 > > 5 0 0 0 40 0 51 1 12 66 0 3 7 48 > > 0 44 > > 6 0 0 1 37 1 59 1 16 67 0 9 2 48 > > 0 49 > > 7 0 0 0 52 3 51 3 14 64 0 9 20 36 > > 0 44 > > 8 0 0 0 66 3 40 2 11 116 0 33 0 50 > > 0 50 > > 9 0 0 1 47 2 83 0 19 169 0 0 0 38 > > 0 62 > > 10 0 0 1 43 2 68 0 16 128 0 1 0 34 > > 0 66 > > 11 0 0 552 13 4 59 0 12 92 0 18 0 49 > > 0 51 > > 12 0 0 0 58 0 42 0 11 45 0 2 1 69 > > 0 30 > > 13 0 0 1 46 1 43 1 14 60 0 14 14 52 > > 0 34 > > 14 1 0 0 53 1 49 1 13 60 0 26 9 44 > > 0 47 > > 15 0 0 1 39 1 57 0 15 75 0 4 0 47 > > 0 53 > > CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys > > wt idl > > 0 0 0 0 51 15 113 0 27 208 0 1 0 43 > > 0 56 > > 1 0 0 0 300 282 85 2 21 163 0 51 0 54 > > 0 46 > > 2 0 0 0 201 188 102 3 20 147 0 39 2 63 > > 0 35 > > 3 0 0 0 26 4 122 1 27 252 0 2 1 35 > > 0 64 > > 4 1 0 0 14 1 60 1 15 68 0 22 6 42 > > 0 52 > > 5 0 0 0 8 1 57 1 15 68 0 27 9 43 > > 0 48 > > 6 0 0 0 11 1 66 1 16 83 0 44 5 45 > 
> 0 50 > > 7 0 0 0 12 3 62 3 17 71 0 26 12 43 > > 0 45 > > 8 0 0 0 27 3 73 2 18 168 0 24 1 45 > > 0 53 > > 9 0 0 1 30 2 100 4 25 142 1 44 0 45 > > 0 55 > > 10 0 0 0 25 3 82 3 21 153 0 6 1 36 > > 0 63 > > 11 0 0 0 16 3 73 2 16 201 0 53 1 50 > > 0 49 > > 12 0 0 0 11 1 51 1 15 71 0 21 6 47 > > 0 47 > > 13 0 0 49 7 1 63 1 15 72 0 19 6 47 > > 0 46 > > 14 0 0 0 6 1 57 1 14 82 0 1 1 46 > > 0 52 > > 15 0 0 0 12 1 64 2 15 78 0 35 14 45 > > 0 41 > > > > > > I did finally see that fmd was running a lot and took a look at what it > > thought was going on: > > > > local at dev-storage-01:~$ pfexec fmadm faulty > > --------------- ------------------------------------ -------------- > > --------- > > TIME EVENT-ID MSG-ID > > SEVERITY > > --------------- ------------------------------------ -------------- > > --------- > > Nov 18 07:00:37 0ddd2e6a-06c4-6936-fb63-fbe93899370d SUNOS-8000-FU > > Major > > > > Host : dev-storage-01 > > Platform : X8DT3 Chassis_id : 1234567890 > > Product_sn : > > > > Fault class : defect.sunos.eft.undiag.fme > > FRU : None > > faulty > > > > Description : The diagnosis engine encountered telemetry for which it was > > unable to perform a diagnosis. Refer to > > http://sun.com/msg/SUNOS-8000-FU for more information. > > > > Response : Error reports have been logged for examination by Sun. > > > > Impact : Automated diagnosis and response for these events will not > > occur. > > > > Action : Ensure that the latest Solaris Kernel and Predictive > > Self-Healing > > (PSH) patches are installed. > > > > At this point I've reached the end of what I know to do in order to > > diagnose what is going on with this box. I would really appreciate some > > additional guidance on what I can to look at. 
> > > > > > Here is the configuration information on the box: > > > > local at dev-storage-01:~$ uname -a > > SunOS dev-storage-01 5.11 snv_127 i86pc i386 i86pc Solaris > > local at dev-storage-01:~$ cat /etc/release > > OpenSolaris Development snv_127 X86 > > Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. > > Use is subject to license terms. > > Assembled 06 November 2009 > > > > local at dev-storage-01:~$ prtdiag > > System Configuration: Supermicro X8DT3 > > BIOS Configuration: American Megatrends Inc. 080015 09/24/2009 > > BMC Configuration: IPMI 1.5 (KCS: Keyboard Controller Style) > > > > ==== Processor Sockets ==================================== > > > > Version Location Tag > > -------------------------------- -------------------------- > > Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 2 > > Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 1 > > > > ==== Memory Device Sockets ================================ > > > > Type Status Set Device Locator Bank Locator > > ----------- ------ --- ------------------- ---------------- > > other in use 0 P1-DIMM1A BANK0 > > other in use 0 P1-DIMM1B BANK1 > > other in use 0 P1-DIMM2A BANK2 > > other in use 0 P1-DIMM2B BANK3 > > other in use 0 P1-DIMM3A BANK4 > > other in use 0 P1-DIMM3B BANK5 > > other in use 0 P2-DIMM1A BANK6 > > other in use 0 P2-DIMM1B BANK7 > > other in use 0 P2-DIMM2A BANK8 > > other in use 0 P2-DIMM2B BANK9 > > other in use 0 P2-DIMM3A BANK10 > > other in use 0 P2-DIMM3B BANK11 > > > > ==== On-Board Devices ===================================== > > > > ==== Upgradeable Slots ==================================== > > > > ID Status Type Description > > --- --------- ---------------- ---------------------------- > > 1 available PCI PCI#1 > > 2 in use PCI Express PCI-E#2 > > 3 available PCI PCI#3 > > 4 in use PCI Express PCI#4 > > 5 in use PCI Express PCI-E#5 > > 6 available PCI Express PCI-E#6 > > > > > > > > local at dev-storage-01:~$ prtconf > > System Configuration: Sun Microsystems i86pc > > Memory 
size: 49143 Megabytes > > System Peripherals (Software Nodes): > > > > i86pc > > scsi_vhci, instance #0 > > fw, instance #0 > > cpu, instance #0 > > cpu, instance #1 > > cpu, instance #2 > > cpu, instance #3 > > cpu, instance #4 > > cpu, instance #5 > > cpu, instance #6 > > cpu, instance #7 > > cpu, instance #8 > > cpu, instance #9 > > cpu, instance #10 > > cpu, instance #11 > > cpu, instance #12 > > cpu, instance #13 > > cpu, instance #14 > > cpu, instance #15 > > sb, instance #1 > > pci, instance #0 > > pci15d9,1 (driver not attached) > > pci8086,3408, instance #0 > > pci15d9,10c9, instance #0 > > pci15d9,10c9, instance #1 > > pci8086,340a (driver not attached) > > pci8086,340c, instance #2 > > pci1000,3140, instance #1 > > sd, instance #5 > > sd, instance #6 > > sd, instance #7 > > sd, instance #8 > > sd, instance #9 > > sd, instance #10 > > sd, instance #11 > > sd, instance #12 > > pci8086,340e (driver not attached) > > pci8086,343a (driver not attached) > > pci8086,343b (driver not attached) > > pci8086,343c (driver not attached) > > pci8086,343d (driver not attached) > > pci8086,3418 (driver not attached) > > pci8086,3419 (driver not attached) > > pci8086,341a (driver not attached) > > pci8086,341c (driver not attached) > > pci8086,341d (driver not attached) > > pci8086,341e (driver not attached) > > pci8086,3439 (driver not attached) > > pci8086,342e (driver not attached) > > pci8086,3422 (driver not attached) > > pci8086,3423, instance #0 > > pci8086,3438 (driver not attached) > > pci15d9,1 (driver not attached) > > pci15d9,1 (driver not attached) > > pci15d9,1 (driver not attached) > > pci15d9,1 (driver not attached) > > pci15d9,1 (driver not attached) > > pci15d9,1 (driver not attached) > > pci15d9,1 (driver not attached) > > pci15d9,1 (driver not attached) > > pci15d9,1, instance #0 > > pci15d9,1, instance #1 > > pci15d9,1, instance #2 > > device, instance #0 > > keyboard, instance #0 > > mouse, instance #1 > > pci15d9,1, instance #0 > > pci8086,3a40, 
instance #4 > > pci1000,3140, instance #0 > > sd, instance #13 > > sd, instance #14 > > sd, instance #15 > > sd, instance #16 > > sd, instance #17 > > sd, instance #18 > > sd, instance #19 > > sd, instance #20 > > pci15d9,1, instance #3 > > pci15d9,1, instance #4 > > pci15d9,1, instance #5 > > pci15d9,1, instance #1 > > pci8086,244e, instance #0 > > display, instance #0 > > isa, instance #0 > > motherboard (driver not attached) > > asy, instance #0 > > asy, instance #1 > > asy, instance #2 > > i8042, instance #0 > > keyboard, instance #0 > > mouse, instance #0 > > motherboard (driver not attached) > > pit_beep, instance #0 > > pci15d9,1, instance #0 > > cdrom, instance #0 > > disk, instance #1 > > disk, instance #2 > > disk, instance #3 > > disk, instance #4 > > pci15d9,1 (driver not attached) > > used-resources (driver not attached) > > pseudo, instance #0 > > options, instance #0 > > xsvc, instance #0 > > agpgart, instance #0 > > > > local at dev-storage-01:~$ pfexec scanpci > > > > pci bus 0x0000 cardnum 0x00 function 0x00: vendor 0x8086 device 0x3406 > > Intel Corporation QuickPath Architecture I/O Hub to ESI Port > > > > pci bus 0x0000 cardnum 0x01 function 0x00: vendor 0x8086 device 0x3408 > > Intel Corporation QuickPath Architecture I/O Hub PCI Express Root > > Port 1 > > > > pci bus 0x0000 cardnum 0x03 function 0x00: vendor 0x8086 device 0x340a > > Intel Corporation QuickPath Architecture I/O Hub PCI Express Root > > Port 3 > > > > pci bus 0x0000 cardnum 0x05 function 0x00: vendor 0x8086 device 0x340c > > Intel Corporation QuickPath Architecture I/O Hub PCI Express Root > > Port 5 > > > > pci bus 0x0000 cardnum 0x07 function 0x00: vendor 0x8086 device 0x340e > > Intel Corporation QuickPath Architecture I/O Hub PCI Express Root > > Port 7 > > > > pci bus 0x0000 cardnum 0x0d function 0x00: vendor 0x8086 device 0x343a > > Intel Corporation Device unknown > > > > pci bus 0x0000 cardnum 0x0d function 0x01: vendor 0x8086 device 0x343b > > Intel Corporation Device 
unknown > > > > pci bus 0x0000 cardnum 0x0d function 0x02: vendor 0x8086 device 0x343c > > Intel Corporation Device unknown > > > > pci bus 0x0000 cardnum 0x0d function 0x03: vendor 0x8086 device 0x343d > > Intel Corporation Device unknown > > > > pci bus 0x0000 cardnum 0x0d function 0x04: vendor 0x8086 device 0x3418 > > Intel Corporation Quickpath Interconnect Physical Layer Port 0 > > > > pci bus 0x0000 cardnum 0x0d function 0x05: vendor 0x8086 device 0x3419 > > Intel Corporation Quickpath Interconnect Physical Layer Port 1 > > > > pci bus 0x0000 cardnum 0x0d function 0x06: vendor 0x8086 device 0x341a > > Intel Corporation Device unknown > > > > pci bus 0x0000 cardnum 0x0e function 0x00: vendor 0x8086 device 0x341c > > Intel Corporation Device unknown > > > > pci bus 0x0000 cardnum 0x0e function 0x01: vendor 0x8086 device 0x341d > > Intel Corporation Device unknown > > > > pci bus 0x0000 cardnum 0x0e function 0x02: vendor 0x8086 device 0x341e > > Intel Corporation Device unknown > > > > pci bus 0x0000 cardnum 0x0e function 0x04: vendor 0x8086 device 0x3439 > > Intel Corporation Device unknown > > > > pci bus 0x0000 cardnum 0x14 function 0x00: vendor 0x8086 device 0x342e > > Intel Corporation QuickPath Architecture I/O Hub System Management > > Registers > > > > pci bus 0x0000 cardnum 0x14 function 0x01: vendor 0x8086 device 0x3422 > > Intel Corporation QuickPath Architecture I/O Hub GPIO and Scratch Pad > > Registers > > > > pci bus 0x0000 cardnum 0x14 function 0x02: vendor 0x8086 device 0x3423 > > Intel Corporation QuickPath Architecture I/O Hub Control Status and > > RAS Registers > > > > pci bus 0x0000 cardnum 0x14 function 0x03: vendor 0x8086 device 0x3438 > > Intel Corporation QuickPath Architecture I/O Hub Throttle Registers > > > > pci bus 0x0000 cardnum 0x16 function 0x00: vendor 0x8086 device 0x3430 > > Intel Corporation DMA Engine > > > > pci bus 0x0000 cardnum 0x16 function 0x01: vendor 0x8086 device 0x3431 > > Intel Corporation DMA Engine > > > > pci 
bus 0x0000 cardnum 0x16 function 0x02: vendor 0x8086 device 0x3432 > > Intel Corporation DMA Engine > > > > pci bus 0x0000 cardnum 0x16 function 0x03: vendor 0x8086 device 0x3433 > > Intel Corporation DMA Engine > > > > pci bus 0x0000 cardnum 0x16 function 0x04: vendor 0x8086 device 0x3429 > > Intel Corporation DMA Engine > > > > pci bus 0x0000 cardnum 0x16 function 0x05: vendor 0x8086 device 0x342a > > Intel Corporation DMA Engine > > > > pci bus 0x0000 cardnum 0x16 function 0x06: vendor 0x8086 device 0x342b > > Intel Corporation DMA Engine > > > > pci bus 0x0000 cardnum 0x16 function 0x07: vendor 0x8086 device 0x342c > > Intel Corporation DMA Engine > > > > pci bus 0x0000 cardnum 0x1a function 0x00: vendor 0x8086 device 0x3a37 > > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 > > > > pci bus 0x0000 cardnum 0x1a function 0x01: vendor 0x8086 device 0x3a38 > > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5 > > > > pci bus 0x0000 cardnum 0x1a function 0x02: vendor 0x8086 device 0x3a39 > > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6 > > > > pci bus 0x0000 cardnum 0x1a function 0x07: vendor 0x8086 device 0x3a3c > > Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2 > > > > pci bus 0x0000 cardnum 0x1c function 0x00: vendor 0x8086 device 0x3a40 > > Intel Corporation 82801JI (ICH10 Family) PCI Express Port 1 > > > > pci bus 0x0000 cardnum 0x1d function 0x00: vendor 0x8086 device 0x3a34 > > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1 > > > > pci bus 0x0000 cardnum 0x1d function 0x01: vendor 0x8086 device 0x3a35 > > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2 > > > > pci bus 0x0000 cardnum 0x1d function 0x02: vendor 0x8086 device 0x3a36 > > Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3 > > > > pci bus 0x0000 cardnum 0x1d function 0x07: vendor 0x8086 device 0x3a3a > > Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1 > > > > pci bus 
0x0000 cardnum 0x1e function 0x00: vendor 0x8086 device 0x244e > > Intel Corporation 82801 PCI Bridge > > > > pci bus 0x0000 cardnum 0x1f function 0x00: vendor 0x8086 device 0x3a16 > > Intel Corporation 82801JIR (ICH10R) LPC Interface Controller > > > > pci bus 0x0000 cardnum 0x1f function 0x02: vendor 0x8086 device 0x3a22 > > Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller > > > > pci bus 0x0000 cardnum 0x1f function 0x03: vendor 0x8086 device 0x3a30 > > Intel Corporation 82801JI (ICH10 Family) SMBus Controller > > > > pci bus 0x0001 cardnum 0x03 function 0x00: vendor 0x102b device 0x0532 > > Matrox Graphics, Inc. MGA G200eW WPCM450 > > > > pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x1000 device 0x0058 > > LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS > > > > pci bus 0x0004 cardnum 0x00 function 0x00: vendor 0x1000 device 0x0058 > > LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS > > > > pci bus 0x0006 cardnum 0x00 function 0x00: vendor 0x8086 device 0x10c9 > > Intel Corporation 82576 Gigabit ET Dual Port Server Adapter > > > > pci bus 0x0006 cardnum 0x00 function 0x01: vendor 0x8086 device 0x10c9 > > Intel Corporation 82576 Gigabit ET Dual Port Server Adapter > > > > pci bus 0x00fe cardnum 0x00 function 0x00: vendor 0x8086 device 0x2c40 > > Intel Corporation QuickPath Architecture Generic Non-Core Registers > > > > pci bus 0x00fe cardnum 0x00 function 0x01: vendor 0x8086 device 0x2c01 > > Intel Corporation QuickPath Architecture System Address Decoder > > > > pci bus 0x00fe cardnum 0x02 function 0x00: vendor 0x8086 device 0x2c10 > > Intel Corporation QPI Link 0 > > > > pci bus 0x00fe cardnum 0x02 function 0x01: vendor 0x8086 device 0x2c11 > > Intel Corporation QPI Physical 0 > > > > pci bus 0x00fe cardnum 0x02 function 0x04: vendor 0x8086 device 0x2c14 > > Intel Corporation QPI Link 1 > > > > pci bus 0x00fe cardnum 0x02 function 0x05: vendor 0x8086 device 0x2c15 > > Intel Corporation QPI Physical 1 > > > > pci 
bus 0x00fe cardnum 0x03 function 0x00: vendor 0x8086 device 0x2c18 > > Intel Corporation QuickPath Memory Controller > > > > pci bus 0x00fe cardnum 0x03 function 0x01: vendor 0x8086 device 0x2c19 > > Intel Corporation QuickPath Memory Controller Target Address Decoder > > > > pci bus 0x00fe cardnum 0x03 function 0x02: vendor 0x8086 device 0x2c1a > > Intel Corporation QuickPath Memory Controller RAS Registers > > > > pci bus 0x00fe cardnum 0x03 function 0x04: vendor 0x8086 device 0x2c1c > > Intel Corporation QuickPath Memory Controller Test Registers > > > > pci bus 0x00fe cardnum 0x04 function 0x00: vendor 0x8086 device 0x2c20 > > Intel Corporation QuickPath Memory Controller Channel 0 Control > > Registers > > > > pci bus 0x00fe cardnum 0x04 function 0x01: vendor 0x8086 device 0x2c21 > > Intel Corporation QuickPath Memory Controller Channel 0 Address > > Registers > > > > pci bus 0x00fe cardnum 0x04 function 0x02: vendor 0x8086 device 0x2c22 > > Intel Corporation QuickPath Memory Controller Channel 0 Rank Registers > > > > pci bus 0x00fe cardnum 0x04 function 0x03: vendor 0x8086 device 0x2c23 > > Intel Corporation QuickPath Memory Controller Channel 0 Thermal > > Control Registers > > > > pci bus 0x00fe cardnum 0x05 function 0x00: vendor 0x8086 device 0x2c28 > > Intel Corporation QuickPath Memory Controller Channel 1 Control > > Registers > > > > pci bus 0x00fe cardnum 0x05 function 0x01: vendor 0x8086 device 0x2c29 > > Intel Corporation QuickPath Memory Controller Channel 1 Address > > Registers > > > > pci bus 0x00fe cardnum 0x05 function 0x02: vendor 0x8086 device 0x2c2a > > Intel Corporation QuickPath Memory Controller Channel 1 Rank Registers > > > > pci bus 0x00fe cardnum 0x05 function 0x03: vendor 0x8086 device 0x2c2b > > Intel Corporation QuickPath Memory Controller Channel 1 Thermal > > Control Registers > > > > pci bus 0x00fe cardnum 0x06 function 0x00: vendor 0x8086 device 0x2c30 > > Intel Corporation QuickPath Memory Controller Channel 2 Control > > 
Registers > > > > pci bus 0x00fe cardnum 0x06 function 0x01: vendor 0x8086 device 0x2c31 > > Intel Corporation QuickPath Memory Controller Channel 2 Address > > Registers > > > > pci bus 0x00fe cardnum 0x06 function 0x02: vendor 0x8086 device 0x2c32 > > Intel Corporation QuickPath Memory Controller Channel 2 Rank Registers > > > > pci bus 0x00fe cardnum 0x06 function 0x03: vendor 0x8086 device 0x2c33 > > Intel Corporation QuickPath Memory Controller Channel 2 Thermal > > Control Registers > > > > pci bus 0x00ff cardnum 0x00 function 0x00: vendor 0x8086 device 0x2c40 > > Intel Corporation QuickPath Architecture Generic Non-Core Registers > > > > pci bus 0x00ff cardnum 0x00 function 0x01: vendor 0x8086 device 0x2c01 > > Intel Corporation QuickPath Architecture System Address Decoder > > > > pci bus 0x00ff cardnum 0x02 function 0x00: vendor 0x8086 device 0x2c10 > > Intel Corporation QPI Link 0 > > > > pci bus 0x00ff cardnum 0x02 function 0x01: vendor 0x8086 device 0x2c11 > > Intel Corporation QPI Physical 0 > > > > pci bus 0x00ff cardnum 0x02 function 0x04: vendor 0x8086 device 0x2c14 > > Intel Corporation QPI Link 1 > > > > pci bus 0x00ff cardnum 0x02 function 0x05: vendor 0x8086 device 0x2c15 > > Intel Corporation QPI Physical 1 > > > > pci bus 0x00ff cardnum 0x03 function 0x00: vendor 0x8086 device 0x2c18 > > Intel Corporation QuickPath Memory Controller > > > > pci bus 0x00ff cardnum 0x03 function 0x01: vendor 0x8086 device 0x2c19 > > Intel Corporation QuickPath Memory Controller Target Address Decoder > > > > pci bus 0x00ff cardnum 0x03 function 0x02: vendor 0x8086 device 0x2c1a > > Intel Corporation QuickPath Memory Controller RAS Registers > > > > pci bus 0x00ff cardnum 0x03 function 0x04: vendor 0x8086 device 0x2c1c > > Intel Corporation QuickPath Memory Controller Test Registers > > > > pci bus 0x00ff cardnum 0x04 function 0x00: vendor 0x8086 device 0x2c20 > > Intel Corporation QuickPath Memory Controller Channel 0 Control > > Registers > > > > pci bus 0x00ff 
cardnum 0x04 function 0x01: vendor 0x8086 device 0x2c21 > > Intel Corporation QuickPath Memory Controller Channel 0 Address > > Registers > > > > pci bus 0x00ff cardnum 0x04 function 0x02: vendor 0x8086 device 0x2c22 > > Intel Corporation QuickPath Memory Controller Channel 0 Rank Registers > > > > pci bus 0x00ff cardnum 0x04 function 0x03: vendor 0x8086 device 0x2c23 > > Intel Corporation QuickPath Memory Controller Channel 0 Thermal > > Control Registers > > > > pci bus 0x00ff cardnum 0x05 function 0x00: vendor 0x8086 device 0x2c28 > > Intel Corporation QuickPath Memory Controller Channel 1 Control > > Registers > > > > pci bus 0x00ff cardnum 0x05 function 0x01: vendor 0x8086 device 0x2c29 > > Intel Corporation QuickPath Memory Controller Channel 1 Address > > Registers > > > > pci bus 0x00ff cardnum 0x05 function 0x02: vendor 0x8086 device 0x2c2a > > Intel Corporation QuickPath Memory Controller Channel 1 Rank Registers > > > > pci bus 0x00ff cardnum 0x05 function 0x03: vendor 0x8086 device 0x2c2b > > Intel Corporation QuickPath Memory Controller Channel 1 Thermal > > Control Registers > > > > pci bus 0x00ff cardnum 0x06 function 0x00: vendor 0x8086 device 0x2c30 > > Intel Corporation QuickPath Memory Controller Channel 2 Control > > Registers > > > > pci bus 0x00ff cardnum 0x06 function 0x01: vendor 0x8086 device 0x2c31 > > Intel Corporation QuickPath Memory Controller Channel 2 Address > > Registers > > > > pci bus 0x00ff cardnum 0x06 function 0x02: vendor 0x8086 device 0x2c32 > > Intel Corporation QuickPath Memory Controller Channel 2 Rank Registers > > > > pci bus 0x00ff cardnum 0x06 function 0x03: vendor 0x8086 device 0x2c33 > > Intel Corporation QuickPath Memory Controller Channel 2 Thermal > > Control Registers > > > > -- > > paul > > > > _______________________________________________ > > fm-discuss mailing list > > fm-discuss at opensolaris.org > > > _______________________________________________ > fm-discuss mailing list > fm-discuss at 
opensolaris.org

From maxpil at gmail.com Mon Nov 23 16:40:01 2009
From: maxpil at gmail.com (Max Levine)
Date: Mon, 23 Nov 2009 19:40:01 -0500
Subject: [fm-discuss] fmadm not reporting PS failure?
Message-ID: <21700e00911231640y6487100eh19bfe4212774d142@mail.gmail.com>

I tested removing a power cord from a V440 PS, and fmadm doesn't seem
to report the failure.

- Is fmadm supposed to report PSU failures?
- What is the recommended solution to monitor PS failures?

From Doug.Baker at Sun.COM Tue Nov 24 00:22:57 2009
From: Doug.Baker at Sun.COM (Doug Baker - Sun UK - Support Engineer)
Date: Tue, 24 Nov 2009 08:22:57 +0000
Subject: [fm-discuss] fmadm not reporting PS failure?
In-Reply-To: <21700e00911231640y6487100eh19bfe4212774d142@mail.gmail.com>
References: <21700e00911231640y6487100eh19bfe4212774d142@mail.gmail.com>
Message-ID: <4B0B97E1.9030905@sun.com>

Max Levine wrote:
> I tested removing a power cord from a V440 PS, and fmadm doesn't seem
> to report the failure.
>
> - Is fmadm supposed to report PSU failures?

Not on that platform.

> - What is the recommended solution to monitor PS failures?

You need to monitor at the Solaris level using SunMC or something
similar.

Regards,

Douglas

> _______________________________________________
> fm-discuss mailing list
> fm-discuss at opensolaris.org

--
Dr Doug Baker
Sun Microsystems Systems Support Engineer.
UK Mission Critical Solution Centre.
Tel : 0870 600 3222

From steve.hanson at sun.com Tue Nov 24 01:37:31 2009
From: steve.hanson at sun.com (Steve Hanson)
Date: Tue, 24 Nov 2009 09:37:31 +0000
Subject: [fm-discuss] fmadm not reporting PS failure?
In-Reply-To: <4B0B97E1.9030905@sun.com>
References: <21700e00911231640y6487100eh19bfe4212774d142@mail.gmail.com>
 <4B0B97E1.9030905@sun.com>
Message-ID: <4B0BA95B.5070905@sun.com>

Doug Baker - Sun UK - Support Engineer wrote:
> Max Levine wrote:
>
>> I tested removing a power cord from a V440 PS, and fmadm doesn't seem
>> to report the failure.
>>
>> - Is fmadm supposed to report PSU failures?
>
> Not on that platform.
>
>> - What is the recommended solution to monitor PS failures?
>
> You need to monitor at the Solaris level using SunMC or something
> similar.

I think "prtdiag -v" may also show the power supply faults.

Steve

>
> Regards,
>
> Douglas
>
>> _______________________________________________
>> fm-discuss mailing list
>> fm-discuss at opensolaris.org
>
>
>

From maxpil at gmail.com Tue Nov 24 05:08:34 2009
From: maxpil at gmail.com (Max Levine)
Date: Tue, 24 Nov 2009 08:08:34 -0500
Subject: [fm-discuss] fmadm not reporting PS failure?
In-Reply-To: <4B0B97E1.9030905@sun.com>
References: <21700e00911231640y6487100eh19bfe4212774d142@mail.gmail.com>
 <4B0B97E1.9030905@sun.com>
Message-ID: <21700e00911240508t55445fcfs7af212fe21c24115@mail.gmail.com>

Is there a list of systems on which fmadm supports reporting PS failures?

On Tue, Nov 24, 2009 at 3:22 AM, Doug Baker - Sun UK - Support Engineer wrote:
> Max Levine wrote:
>>
>> I tested removing a power cord from a V440 PS, and fmadm doesn't seem
>> to report the failure.
>>
>> - Is fmadm supposed to report PSU failures?
>
> Not on that platform.
>
>> - What is the recommended solution to monitor PS failures?
>
> You need to monitor at the Solaris level using SunMC or something similar.
>
> Regards,
>
> Douglas
>
>> _______________________________________________
>> fm-discuss mailing list
>> fm-discuss at opensolaris.org
>
>
> --
> Dr Doug Baker
> Sun Microsystems Systems Support Engineer.
> UK Mission Critical Solution Centre.
> Tel : 0870 600 3222
>

From maxpil at gmail.com Tue Nov 24 05:14:15 2009
From: maxpil at gmail.com (Max Levine)
Date: Tue, 24 Nov 2009 08:14:15 -0500
Subject: [fm-discuss] fmadm not reporting PS failure?
In-Reply-To: <4B0B97E1.9030905@sun.com>
References: <21700e00911231640y6487100eh19bfe4212774d142@mail.gmail.com>
 <4B0B97E1.9030905@sun.com>
Message-ID: <21700e00911240514o6e95dfe9mfae0fb881c1b059d@mail.gmail.com>

Does fmadm support cpu/mem diagnostics on this platform?

On Tue, Nov 24, 2009 at 3:22 AM, Doug Baker - Sun UK - Support Engineer wrote:
> Max Levine wrote:
>>
>> I tested removing a power cord from a V440 PS, and fmadm doesn't seem
>> to report the failure.
>>
>> - Is fmadm supposed to report PSU failures?
>
> Not on that platform.
>
>> - What is the recommended solution to monitor PS failures?
>
> You need to monitor at the Solaris level using SunMC or something similar.
>
> Regards,
>
> Douglas
>
>> _______________________________________________
>> fm-discuss mailing list
>> fm-discuss at opensolaris.org
>
>
> --
> Dr Doug Baker
> Sun Microsystems Systems Support Engineer.
> UK Mission Critical Solution Centre.
> Tel : 0870 600 3222
>

From Doug.Baker at Sun.COM Tue Nov 24 05:32:37 2009
From: Doug.Baker at Sun.COM (Doug Baker - Sun UK - Support Engineer)
Date: Tue, 24 Nov 2009 13:32:37 +0000
Subject: [fm-discuss] fmadm not reporting PS failure?
In-Reply-To: <21700e00911240514o6e95dfe9mfae0fb881c1b059d@mail.gmail.com>
References: <21700e00911231640y6487100eh19bfe4212774d142@mail.gmail.com>
 <4B0B97E1.9030905@sun.com>
 <21700e00911240514o6e95dfe9mfae0fb881c1b059d@mail.gmail.com>
Message-ID: <4B0BE075.4030402@sun.com>

Max Levine wrote:
> Does fmadm support cpu/mem diagnostics on this platform?

Yes, there are cpu/mem diagnostics on the usIIIi platforms.

No, I am not aware of a list of which platforms have specific functionality.

Regards,

Douglas

>
> On Tue, Nov 24, 2009 at 3:22 AM, Doug Baker - Sun UK - Support
> Engineer wrote:
>> Max Levine wrote:
>>> I tested removing a power cord from a V440 PS, and fmadm doesn't seem
>>> to report the failure.
>>>
>>> - Is fmadm supposed to report PSU failures?
>> Not on that platform.
>>
>>> - What is the recommended solution to monitor PS failures?
>> You need to monitor at the Solaris level using SunMC or something similar.
>>
>> Regards,
>>
>> Douglas
>>
>>> _______________________________________________
>>> fm-discuss mailing list
>>> fm-discuss at opensolaris.org
>>
>> --
>> Dr Doug Baker
>> Sun Microsystems Systems Support Engineer.
>> UK Mission Critical Solution Centre.
>> Tel : 0870 600 3222
>>

--
Dr Doug Baker
Sun Microsystems Systems Support Engineer.
UK Mission Critical Solution Centre.
Tel : 0870 600 3222

From pfisher at alertlogic.net Tue Nov 24 07:56:21 2009
From: pfisher at alertlogic.net (Paul Fisher)
Date: Tue, 24 Nov 2009 09:56:21 -0600
Subject: [fm-discuss] further help diagnosing ereport.cpu.intel.quickpath.mem_ce
Message-ID: <4B0C0225.3050009@alertlogic.net>

I am still trying to diagnose the issues with these two systems (dual
Intel 5520/supermicro X8DT3). When I have two PC3-8500 ECC RDIMMs
populated per channel (6, 4G RDIMMs/socket), I see the following
fmdump -e reports reliably every 10 seconds:

Nov 24 2009 09:34:40.410002797 ereport.cpu.intel.quickpath.mem_ce
Nov 24 2009 09:34:40.410013341 ereport.cpu.intel.quickpath.mem_ce
Nov 24 2009 09:34:40.410025191 ereport.cpu.intel.quickpath.mem_ce
Nov 24 2009 09:34:40.410060807 ereport.cpu.intel.quickpath.mem_ce
Nov 24 2009 09:34:40.410071957 ereport.cpu.intel.quickpath.mem_ce
Nov 24 2009 09:34:40.410134184 ereport.cpu.intel.quickpath.mem_ce

(I've included the details of these reports at the bottom of this email.)

While the reports seem to indicate that there is a problem on the
second row RDIMMs, swapping the memory sticks around yields the same
problem. In fact, even when I have only one RDIMM in each channel the
machine will experience occasional ereports on memory.

I have tried forcing the BIOS to run the memory @ DDR-1066 and DDR-800
and still have the same behavior, so it would seem to not be related
to memory quality.
The memory modules appear to be reasonable quality stuff: Hynix I have tried opensolaris b127, b126, 2009.06, and S10U8, all with variations of these same symptoms. I am really hoping that someone can either give some guidance on further diagnosing what is going on, or can point me to which mailing list on which I should follow-up. Here is the detailed fmdump ereport information: bash-3.00# cat /tmp/fmdump.eV.sample Nov 24 2009 09:34:40.410002797 ereport.cpu.intel.quickpath.mem_ce nvlist version: 0 class = ereport.cpu.intel.quickpath.mem_ce ena = 0xb34793a6b810401 detector = (embedded nvlist) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (end detector) compound_errorname = MC_CH_RD_ERR IA32_MCG_STATUS = 0x0 machine_check_in_progress = 0 bank_number = 0x8 bank_msr_offset = 0x420 IA32_MCi_STATUS = 0xcc0d5c800001009f overflow = 1 error_uncorrected = 0 error_enabled = 0 processor_context_corrupt = 0 error_code = 0x9f model_specific_error_code = 0x1 threshold_based_error_status = No tracking IA32_MCi_ADDR = 0xc3d099c80 IA32_MCi_MISC = 0x294e5def00015840 ECC-syndrome = 0x294e5def physaddr = 0xc3d099c80 resource = (array of embedded nvlists) (start resource[0]) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (start hc-list[3]) nvlist version: 0 hc-name = dram-channel hc-id = 0 (end hc-list[3]) (start hc-list[4]) nvlist version: 0 hc-name = dimm hc-id = 1 (end hc-list[4]) (start hc-list[5]) nvlist 
version: 0 hc-name = rank hc-id = 5 (end hc-list[5]) hc-specific = (embedded nvlist) nvlist version: 0 offset = 0x7fc0c400 (end hc-specific) (end resource[0]) mem_cor_ecc_counter = 0x69f 0x797 0x0 0x0 0x3 0x0 mem_cor_ecc_counter_last = 0x743 0x4ee 0x0 0x0 0x3 0x0 __ttl = 0x1 __tod = 0x4b0bfd10 0x1870256d Nov 24 2009 09:34:40.410013341 ereport.cpu.intel.quickpath.mem_ce nvlist version: 0 class = ereport.cpu.intel.quickpath.mem_ce ena = 0xb34793d17612001 detector = (embedded nvlist) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (end detector) compound_errorname = MC_CH_RD_ERR IA32_MCG_STATUS = 0x0 machine_check_in_progress = 0 bank_number = 0x8 bank_msr_offset = 0x420 IA32_MCi_STATUS = 0xcc0001400001009f overflow = 1 error_uncorrected = 0 error_enabled = 0 processor_context_corrupt = 0 error_code = 0x9f model_specific_error_code = 0x1 threshold_based_error_status = No tracking IA32_MCi_ADDR = 0xc3e0d3300 IA32_MCi_MISC = 0xa13dcb9300015840 ECC-syndrome = 0xa13dcb93 physaddr = 0xc3e0d3300 resource = (array of embedded nvlists) (start resource[0]) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (start hc-list[3]) nvlist version: 0 hc-name = dram-channel hc-id = 0 (end hc-list[3]) (start hc-list[4]) nvlist version: 0 hc-name = dimm hc-id = 1 (end hc-list[4]) (start hc-list[5]) nvlist version: 0 hc-name = rank hc-id = 5 (end hc-list[5]) hc-specific = (embedded nvlist) nvlist version: 0 offset = 
0x7fd66b80 (end hc-specific) (end resource[0]) mem_cor_ecc_counter = 0x6a0 0x7a6 0x0 0x0 0x3 0x0 mem_cor_ecc_counter_last = 0x69f 0x797 0x0 0x0 0x3 0x0 __ttl = 0x1 __tod = 0x4b0bfd10 0x18704e9d Nov 24 2009 09:34:40.410025191 ereport.cpu.intel.quickpath.mem_ce nvlist version: 0 class = ereport.cpu.intel.quickpath.mem_ce ena = 0xb34794013414401 detector = (embedded nvlist) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (end detector) compound_errorname = MC_CH_RD_ERR IA32_MCG_STATUS = 0x0 machine_check_in_progress = 0 bank_number = 0x8 bank_msr_offset = 0x420 IA32_MCi_STATUS = 0xcc0000800001009f overflow = 1 error_uncorrected = 0 error_enabled = 0 processor_context_corrupt = 0 error_code = 0x9f model_specific_error_code = 0x1 threshold_based_error_status = No tracking IA32_MCi_ADDR = 0xc3cebadc0 IA32_MCi_MISC = 0x7f59c35b00011280 ECC-syndrome = 0x7f59c35b physaddr = 0xc3cebadc0 resource = (array of embedded nvlists) (start resource[0]) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (start hc-list[3]) nvlist version: 0 hc-name = dram-channel hc-id = 0 (end hc-list[3]) (start hc-list[4]) nvlist version: 0 hc-name = dimm hc-id = 1 (end hc-list[4]) (start hc-list[5]) nvlist version: 0 hc-name = rank hc-id = 5 (end hc-list[5]) hc-specific = (embedded nvlist) nvlist version: 0 offset = 0x7fbe49c0 (end hc-specific) (end resource[0]) mem_cor_ecc_counter = 0x6a0 0x7a9 0x0 0x0 0x3 0x0 
mem_cor_ecc_counter_last = 0x6a0 0x7a6 0x0 0x0 0x3 0x0 __ttl = 0x1 __tod = 0x4b0bfd10 0x18707ce7 Nov 24 2009 09:34:40.410060807 ereport.cpu.intel.quickpath.mem_ce nvlist version: 0 class = ereport.cpu.intel.quickpath.mem_ce ena = 0xb3479489c416401 detector = (embedded nvlist) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (end detector) compound_errorname = MC_CH_RD_ERR IA32_MCG_STATUS = 0x0 machine_check_in_progress = 0 bank_number = 0x8 bank_msr_offset = 0x420 IA32_MCi_STATUS = 0xcc0000800001009f overflow = 1 error_uncorrected = 0 error_enabled = 0 processor_context_corrupt = 0 error_code = 0x9f model_specific_error_code = 0x1 threshold_based_error_status = No tracking IA32_MCi_ADDR = 0xc3ad95680 IA32_MCi_MISC = 0x5ac6cb0500011080 ECC-syndrome = 0x5ac6cb05 physaddr = 0xc3ad95680 resource = (array of embedded nvlists) (start resource[0]) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (start hc-list[3]) nvlist version: 0 hc-name = dram-channel hc-id = 0 (end hc-list[3]) (start hc-list[4]) nvlist version: 0 hc-name = dimm hc-id = 1 (end hc-list[4]) (start hc-list[5]) nvlist version: 0 hc-name = rank hc-id = 5 (end hc-list[5]) hc-specific = (embedded nvlist) nvlist version: 0 offset = 0x7f921200 (end hc-specific) (end resource[0]) mem_cor_ecc_counter = 0x6a2 0x7c3 0x0 0x0 0x3 0x0 mem_cor_ecc_counter_last = 0x6a0 0x7a9 0x0 0x0 0x3 0x0 __ttl = 0x1 __tod = 0x4b0bfd10 0x18710807 Nov 24 2009 
09:34:40.410071957 ereport.cpu.intel.quickpath.mem_ce nvlist version: 0 class = ereport.cpu.intel.quickpath.mem_ce ena = 0xb34794b6cb10001 detector = (embedded nvlist) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (end detector) compound_errorname = MC_CH_RD_ERR IA32_MCG_STATUS = 0x0 machine_check_in_progress = 0 bank_number = 0x8 bank_msr_offset = 0x420 IA32_MCi_STATUS = 0xcc0000c00001009f overflow = 1 error_uncorrected = 0 error_enabled = 0 processor_context_corrupt = 0 error_code = 0x9f model_specific_error_code = 0x1 threshold_based_error_status = No tracking IA32_MCi_ADDR = 0xc2ff7dbc0 IA32_MCi_MISC = 0x431bdc5f00011180 ECC-syndrome = 0x431bdc5f physaddr = 0xc2ff7dbc0 resource = (array of embedded nvlists) (start resource[0]) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (start hc-list[3]) nvlist version: 0 hc-name = dram-channel hc-id = 0 (end hc-list[3]) (start hc-list[4]) nvlist version: 0 hc-name = dimm hc-id = 1 (end hc-list[4]) (start hc-list[5]) nvlist version: 0 hc-name = rank hc-id = 5 (end hc-list[5]) hc-specific = (embedded nvlist) nvlist version: 0 offset = 0x7ea9f3c0 (end hc-specific) (end resource[0]) mem_cor_ecc_counter = 0x6a5 0x7d1 0x0 0x0 0x3 0x0 mem_cor_ecc_counter_last = 0x6a2 0x7c3 0x0 0x0 0x3 0x0 __ttl = 0x1 __tod = 0x4b0bfd10 0x18713395 Nov 24 2009 09:34:40.410134184 ereport.cpu.intel.quickpath.mem_ce nvlist version: 0 class = ereport.cpu.intel.quickpath.mem_ce 
ena = 0xb34795aa2712401 detector = (embedded nvlist) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (end detector) compound_errorname = MC_CH_RD_ERR IA32_MCG_STATUS = 0x0 machine_check_in_progress = 0 bank_number = 0x8 bank_msr_offset = 0x420 IA32_MCi_STATUS = 0xcc0003c00001009f overflow = 1 error_uncorrected = 0 error_enabled = 0 processor_context_corrupt = 0 error_code = 0x9f model_specific_error_code = 0x1 threshold_based_error_status = No tracking IA32_MCi_ADDR = 0xc2e261440 IA32_MCi_MISC = 0x48f99d1500015e46 ECC-syndrome = 0x48f99d15 physaddr = 0xc2e261440 resource = (array of embedded nvlists) (start resource[0]) nvlist version: 0 version = 0x0 scheme = hc hc-list = (array of embedded nvlists) (start hc-list[0]) nvlist version: 0 hc-name = motherboard hc-id = 0 (end hc-list[0]) (start hc-list[1]) nvlist version: 0 hc-name = chip hc-id = 1 (end hc-list[1]) (start hc-list[2]) nvlist version: 0 hc-name = memory-controller hc-id = 0 (end hc-list[2]) (start hc-list[3]) nvlist version: 0 hc-name = dram-channel hc-id = 0 (end hc-list[3]) (start hc-list[4]) nvlist version: 0 hc-name = dimm hc-id = 1 (end hc-list[4]) (start hc-list[5]) nvlist version: 0 hc-name = rank hc-id = 5 (end hc-list[5]) hc-specific = (embedded nvlist) nvlist version: 0 offset = 0x7e832140 (end hc-specific) (end resource[0]) mem_cor_ecc_counter = 0x6a7 0x7d8 0x0 0x0 0x3 0x0 mem_cor_ecc_counter_last = 0x6a5 0x7d1 0x0 0x0 0x3 0x0 __ttl = 0x1 __tod = 0x4b0bfd10 0x187226a8 bash-3.00# -- paul From Adrian.Frost at Sun.COM Tue Nov 24 09:46:04 2009 From: Adrian.Frost at Sun.COM (Adrian Frost) Date: Tue, 24 Nov 2009 17:46:04 +0000 Subject: [fm-discuss] further help diagnosing ereport.cpu.intel.quickpath.mem_ce 
In-Reply-To: <4B0C0225.3050009@alertlogic.net>
References: <4B0C0225.3050009@alertlogic.net>
Message-ID: <4B0C1BDC.9030606@sun.com>

Paul Fisher wrote:
> I am still trying to diagnose the issues with these two systems (dual
> Intel 5520/supermicro X8DT3). When I have two PC3-8500 ECC RDIMMs
> populated per channel (6, 4G RDIMMs/socket), I see the following
> fmdump -e reports reliably every 10 seconds:
>
> Nov 24 2009 09:34:40.410002797 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410013341 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410025191 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410060807 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410071957 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410134184 ereport.cpu.intel.quickpath.mem_ce
>
> (I've included the details of these reports at the bottom of this email.)
>
> While the reports seem to indicate that there is a problem on the
> second row RDIMMs, swapping the memory sticks around yields the same
> problem. In fact, even when I have only one RDIMM in each channel the
> machine will experience occasional ereports on memory.
>
> I have tried forcing the BIOS to run the memory @ DDR-1066 and DDR-800
> and still have the same behavior, so it would seem to not be related
> to memory quality. The memory modules appear to be reasonable quality
> stuff:
>
> Hynix
>
> I have tried opensolaris b127, b126, 2009.06, and S10U8, all with
> variations of these same symptoms. I am really hoping that someone
> can either give some guidance on further diagnosing what is going on,
> or can point me to which mailing list on which I should follow-up.

I think you are looking at a hardware issue, not a software one. You are seeing a significant number of correctable memory errors. I would guess that the DIMMs are not compatible with the motherboard, perhaps requiring too much power.
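
One way to confirm that these reports describe corrected rather than uncorrected errors is to decode the IA32_MCi_STATUS value by hand. The following is a minimal Python sketch, not part of fmd or fmdump. It assumes the architectural bit layout of IA32_MCi_STATUS from the Intel SDM (VAL in bit 63, OVER in 62, UC in 61, EN in 60, MISCV in 59, ADDRV in 58, PCC in 57, the model-specific code in bits 31:16, and the MCA error code in bits 15:0). The function name `decode_mci_status` is hypothetical; the sample value is copied from the first ereport.

```python
def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its architectural fields.

    Assumption: bit positions follow the Intel SDM machine-check
    architecture chapter; model-specific bits are not interpreted here.
    """
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),                      # VAL
        "overflow": bit(62),                   # OVER
        "error_uncorrected": bit(61),          # UC
        "error_enabled": bit(60),              # EN
        "misc_valid": bit(59),                 # MISCV
        "addr_valid": bit(58),                 # ADDRV
        "processor_context_corrupt": bit(57),  # PCC
        "model_specific_error_code": (status >> 16) & 0xFFFF,
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # IA32_MCi_STATUS from the first ereport in the thread.
    for name, value in decode_mci_status(0xCC0D5C800001009F).items():
        print(f"{name} = {int(value):#x}")
```

For the sample value, error_uncorrected and processor_context_corrupt decode to 0 and overflow decodes to 1, which agrees with the fields fmdump prints for the same ereport.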
From pfisher at alertlogic.net Tue Nov 24 11:40:20 2009
From: pfisher at alertlogic.net (Paul Fisher)
Date: Tue, 24 Nov 2009 13:40:20 -0600
Subject: [fm-discuss] further help diagnosing ereport.cpu.intel.quickpath.mem_ce
In-Reply-To: <4B0C0225.3050009@alertlogic.net>
References: <4B0C0225.3050009@alertlogic.net>
Message-ID: <4B0C36A4.3060102@alertlogic.net>

After a little more digging, now I am really confused about what is going on. It turns out that the exact memory modules are Hynix HMT151R7BFR4C-G7, which are on the verified list for this particular motherboard:

http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0&mspd=1.066&mtyp=32&id=144D7111B7126A9A938F3113403B15D6

So now I am left wondering if there is some subtle incompatibility with the chipset (Intel 5520 (Tylersburg)) and current Solaris/Opensolaris.

Is there anything someone can suggest to determine this with certainty? Or are the ereports "absolute" in the sense that the memory is simply having these errors?

Paul Fisher wrote:
> I am still trying to diagnose the issues with these two systems (dual
> Intel 5520/supermicro X8DT3). When I have two PC3-8500 ECC RDIMMs
> populated per channel (6, 4G RDIMMs/socket), I see the following fmdump
> -e reports reliably every 10 seconds:
>
> Nov 24 2009 09:34:40.410002797 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410013341 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410025191 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410060807 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410071957 ereport.cpu.intel.quickpath.mem_ce
> Nov 24 2009 09:34:40.410134184 ereport.cpu.intel.quickpath.mem_ce
>
> (I've included the details of these reports at the bottom of this email.)
>
> While the reports seem to indicate that there is a problem on the second
> row RDIMMs, swapping the memory sticks around yields the same problem.
> In fact, even when I have only one RDIMM in each channel the machine will
> experience occasional ereports on memory.
>
> I have tried forcing the BIOS to run the memory @ DDR-1066 and DDR-800
> and still have the same behavior, so it would seem to not be related to
> memory quality. The memory modules appear to be reasonable quality stuff:
>
> Hynix
>
> I have tried opensolaris b127, b126, 2009.06, and S10U8, all with
> variations of these same symptoms. I am really hoping that someone can
> either give some guidance on further diagnosing what is going on, or can
> point me to which mailing list on which I should follow-up.
From steve.hanson at sun.com Tue Nov 24 12:15:22 2009
From: steve.hanson at sun.com (Steve Hanson)
Date: Tue, 24 Nov 2009 20:15:22 +0000
Subject: [fm-discuss] further help diagnosing ereport.cpu.intel.quickpath.mem_ce
In-Reply-To: <4B0C36A4.3060102@alertlogic.net>
References: <4B0C0225.3050009@alertlogic.net> <4B0C36A4.3060102@alertlogic.net>
Message-ID: <4B0C3EDA.4080601@sun.com>

Paul Fisher wrote:
> After a little more digging, now I am really confused about what is
> going on. It turns out that the exact memory modules are Hynix
> HMT151R7BFR4C-G7, which are on the verified list for this particular
> motherboard:
>
> http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0&mspd=1.066&mtyp=32&id=144D7111B7126A9A938F3113403B15D6
>
>
> So now I am left wondering if there is some subtle incompatibility
> with the chipset (Intel 5520 (Tylersburg)) and current
> Solaris/Opensolaris.
>
> Is there anything someone can suggest to determine this with
> certainty? Or are the ereports "absolute" in the sense that the
> memory is simply having these errors?

It does look like the memory is really having these errors. The syndrome, physaddr, ecc_counter, etc. are all constantly changing, so it doesn't seem like they are just "stuck" values or anything.
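
The observation above, that the syndrome, physaddr, and counter values keep changing, can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the values are copied by hand from the six ereports in the original message, only the first two mem_cor_ecc_counter entries are used (the others do not change in this sample), the meaning of each counter entry is not interpreted, and counter wrap or reset is not handled. The names `SAMPLES`, `all_distinct`, and `counter_deltas` are hypothetical.

```python
# (physaddr, ECC-syndrome, first two mem_cor_ecc_counter entries),
# copied from the six ereports in the thread, oldest first.
SAMPLES = [
    (0xC3D099C80, 0x294E5DEF, (0x69F, 0x797)),
    (0xC3E0D3300, 0xA13DCB93, (0x6A0, 0x7A6)),
    (0xC3CEBADC0, 0x7F59C35B, (0x6A0, 0x7A9)),
    (0xC3AD95680, 0x5AC6CB05, (0x6A2, 0x7C3)),
    (0xC2FF7DBC0, 0x431BDC5F, (0x6A5, 0x7D1)),
    (0xC2E261440, 0x48F99D15, (0x6A7, 0x7D8)),
]


def all_distinct(values):
    """Return True if no value repeats, i.e. nothing looks 'stuck'."""
    values = list(values)
    return len(set(values)) == len(values)


def counter_deltas(samples):
    """Per-entry counter increase between consecutive ereports."""
    counters = [sample[2] for sample in samples]
    return [
        tuple(cur - prev for cur, prev in zip(later, earlier))
        for earlier, later in zip(counters, counters[1:])
    ]


if __name__ == "__main__":
    print("physaddr distinct:", all_distinct(s[0] for s in SAMPLES))
    print("syndrome distinct:", all_distinct(s[1] for s in SAMPLES))
    for delta in counter_deltas(SAMPLES):
        print("counter delta:", delta)
```

On these six samples every physaddr and syndrome is distinct, and the two counter entries grow by 8 and 65 in total between the first and last report.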
From the ecc_counters you do seem to be getting a large number of errors, more than a few every ten seconds (we only poll the CE counter registers every few seconds, so there could be far more than we are reporting). I have seen problems like this before where there is something like a solder problem on the socket causing a short. This gets a CE on almost every other access, but the ECC mechanism prevents this from turning into a UE.

Steve

>
>
> Paul Fisher wrote:
>
>> I am still trying to diagnose the issues with these two systems (dual
>> Intel 5520/supermicro X8DT3). When I have two PC3-8500 ECC RDIMMs
>> populated per channel (6, 4G RDIMMs/socket), I see the following fmdump
>> -e reports reliably every 10 seconds:
>>
>> Nov 24 2009 09:34:40.410002797 ereport.cpu.intel.quickpath.mem_ce
>> Nov 24 2009 09:34:40.410013341 ereport.cpu.intel.quickpath.mem_ce
>> Nov 24 2009 09:34:40.410025191 ereport.cpu.intel.quickpath.mem_ce
>> Nov 24 2009 09:34:40.410060807 ereport.cpu.intel.quickpath.mem_ce
>> Nov 24 2009 09:34:40.410071957 ereport.cpu.intel.quickpath.mem_ce
>> Nov 24 2009 09:34:40.410134184 ereport.cpu.intel.quickpath.mem_ce
>>
>> (I've included the details of these reports at the bottom of this
>> email.)
>>
>> While the reports seem to indicate that there is a problem on the second
>> row RDIMMs, swapping the memory sticks around yields the same problem.
>> In fact, even when I have only one RDIMM in each channel the machine will
>> experience occasional ereports on memory.
>>
>> I have tried forcing the BIOS to run the memory @ DDR-1066 and DDR-800
>> and still have the same behavior, so it would seem to not be related to
>> memory quality. The memory modules appear to be reasonable quality
>> stuff:
>>
>> Hynix
>>
>> I have tried opensolaris b127, b126, 2009.06, and S10U8, all with
>> variations of these same symptoms.
I am really hoping that someone can >> either give some guidance on further diagnosing what is going on, or can >> point me to which mailing list on which I should follow-up. >> >> Here is the detailed fmdump ereport information: >> >> bash-3.00# cat /tmp/fmdump.eV.sample >> Nov 24 2009 09:34:40.410002797 ereport.cpu.intel.quickpath.mem_ce >> nvlist version: 0 >> class = ereport.cpu.intel.quickpath.mem_ce >> ena = 0xb34793a6b810401 >> detector = (embedded nvlist) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> >> (end detector) >> >> compound_errorname = MC_CH_RD_ERR >> IA32_MCG_STATUS = 0x0 >> machine_check_in_progress = 0 >> bank_number = 0x8 >> bank_msr_offset = 0x420 >> IA32_MCi_STATUS = 0xcc0d5c800001009f >> overflow = 1 >> error_uncorrected = 0 >> error_enabled = 0 >> processor_context_corrupt = 0 >> error_code = 0x9f >> model_specific_error_code = 0x1 >> threshold_based_error_status = No tracking >> IA32_MCi_ADDR = 0xc3d099c80 >> IA32_MCi_MISC = 0x294e5def00015840 >> ECC-syndrome = 0x294e5def >> physaddr = 0xc3d099c80 >> resource = (array of embedded nvlists) >> (start resource[0]) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> (start hc-list[3]) >> nvlist version: 0 >> hc-name = dram-channel >> hc-id = 0 >> (end hc-list[3]) >> (start hc-list[4]) >> nvlist version: 0 
>> hc-name = dimm >> hc-id = 1 >> (end hc-list[4]) >> (start hc-list[5]) >> nvlist version: 0 >> hc-name = rank >> hc-id = 5 >> (end hc-list[5]) >> >> hc-specific = (embedded nvlist) >> nvlist version: 0 >> offset = 0x7fc0c400 >> (end hc-specific) >> >> (end resource[0]) >> >> mem_cor_ecc_counter = 0x69f 0x797 0x0 0x0 0x3 0x0 >> mem_cor_ecc_counter_last = 0x743 0x4ee 0x0 0x0 0x3 0x0 >> __ttl = 0x1 >> __tod = 0x4b0bfd10 0x1870256d >> >> Nov 24 2009 09:34:40.410013341 ereport.cpu.intel.quickpath.mem_ce >> nvlist version: 0 >> class = ereport.cpu.intel.quickpath.mem_ce >> ena = 0xb34793d17612001 >> detector = (embedded nvlist) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> >> (end detector) >> >> compound_errorname = MC_CH_RD_ERR >> IA32_MCG_STATUS = 0x0 >> machine_check_in_progress = 0 >> bank_number = 0x8 >> bank_msr_offset = 0x420 >> IA32_MCi_STATUS = 0xcc0001400001009f >> overflow = 1 >> error_uncorrected = 0 >> error_enabled = 0 >> processor_context_corrupt = 0 >> error_code = 0x9f >> model_specific_error_code = 0x1 >> threshold_based_error_status = No tracking >> IA32_MCi_ADDR = 0xc3e0d3300 >> IA32_MCi_MISC = 0xa13dcb9300015840 >> ECC-syndrome = 0xa13dcb93 >> physaddr = 0xc3e0d3300 >> resource = (array of embedded nvlists) >> (start resource[0]) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> 
(end hc-list[2]) >> (start hc-list[3]) >> nvlist version: 0 >> hc-name = dram-channel >> hc-id = 0 >> (end hc-list[3]) >> (start hc-list[4]) >> nvlist version: 0 >> hc-name = dimm >> hc-id = 1 >> (end hc-list[4]) >> (start hc-list[5]) >> nvlist version: 0 >> hc-name = rank >> hc-id = 5 >> (end hc-list[5]) >> >> hc-specific = (embedded nvlist) >> nvlist version: 0 >> offset = 0x7fd66b80 >> (end hc-specific) >> >> (end resource[0]) >> >> mem_cor_ecc_counter = 0x6a0 0x7a6 0x0 0x0 0x3 0x0 >> mem_cor_ecc_counter_last = 0x69f 0x797 0x0 0x0 0x3 0x0 >> __ttl = 0x1 >> __tod = 0x4b0bfd10 0x18704e9d >> >> Nov 24 2009 09:34:40.410025191 ereport.cpu.intel.quickpath.mem_ce >> nvlist version: 0 >> class = ereport.cpu.intel.quickpath.mem_ce >> ena = 0xb34794013414401 >> detector = (embedded nvlist) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> >> (end detector) >> >> compound_errorname = MC_CH_RD_ERR >> IA32_MCG_STATUS = 0x0 >> machine_check_in_progress = 0 >> bank_number = 0x8 >> bank_msr_offset = 0x420 >> IA32_MCi_STATUS = 0xcc0000800001009f >> overflow = 1 >> error_uncorrected = 0 >> error_enabled = 0 >> processor_context_corrupt = 0 >> error_code = 0x9f >> model_specific_error_code = 0x1 >> threshold_based_error_status = No tracking >> IA32_MCi_ADDR = 0xc3cebadc0 >> IA32_MCi_MISC = 0x7f59c35b00011280 >> ECC-syndrome = 0x7f59c35b >> physaddr = 0xc3cebadc0 >> resource = (array of embedded nvlists) >> (start resource[0]) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) 
>> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> (start hc-list[3]) >> nvlist version: 0 >> hc-name = dram-channel >> hc-id = 0 >> (end hc-list[3]) >> (start hc-list[4]) >> nvlist version: 0 >> hc-name = dimm >> hc-id = 1 >> (end hc-list[4]) >> (start hc-list[5]) >> nvlist version: 0 >> hc-name = rank >> hc-id = 5 >> (end hc-list[5]) >> >> hc-specific = (embedded nvlist) >> nvlist version: 0 >> offset = 0x7fbe49c0 >> (end hc-specific) >> >> (end resource[0]) >> >> mem_cor_ecc_counter = 0x6a0 0x7a9 0x0 0x0 0x3 0x0 >> mem_cor_ecc_counter_last = 0x6a0 0x7a6 0x0 0x0 0x3 0x0 >> __ttl = 0x1 >> __tod = 0x4b0bfd10 0x18707ce7 >> >> Nov 24 2009 09:34:40.410060807 ereport.cpu.intel.quickpath.mem_ce >> nvlist version: 0 >> class = ereport.cpu.intel.quickpath.mem_ce >> ena = 0xb3479489c416401 >> detector = (embedded nvlist) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> >> (end detector) >> >> compound_errorname = MC_CH_RD_ERR >> IA32_MCG_STATUS = 0x0 >> machine_check_in_progress = 0 >> bank_number = 0x8 >> bank_msr_offset = 0x420 >> IA32_MCi_STATUS = 0xcc0000800001009f >> overflow = 1 >> error_uncorrected = 0 >> error_enabled = 0 >> processor_context_corrupt = 0 >> error_code = 0x9f >> model_specific_error_code = 0x1 >> threshold_based_error_status = No tracking >> IA32_MCi_ADDR = 0xc3ad95680 >> IA32_MCi_MISC = 0x5ac6cb0500011080 >> ECC-syndrome = 0x5ac6cb05 >> physaddr = 0xc3ad95680 >> resource = (array of embedded nvlists) >> (start resource[0]) >> nvlist version: 0 >> version = 0x0 >> scheme = hc 
>> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> (start hc-list[3]) >> nvlist version: 0 >> hc-name = dram-channel >> hc-id = 0 >> (end hc-list[3]) >> (start hc-list[4]) >> nvlist version: 0 >> hc-name = dimm >> hc-id = 1 >> (end hc-list[4]) >> (start hc-list[5]) >> nvlist version: 0 >> hc-name = rank >> hc-id = 5 >> (end hc-list[5]) >> >> hc-specific = (embedded nvlist) >> nvlist version: 0 >> offset = 0x7f921200 >> (end hc-specific) >> >> (end resource[0]) >> >> mem_cor_ecc_counter = 0x6a2 0x7c3 0x0 0x0 0x3 0x0 >> mem_cor_ecc_counter_last = 0x6a0 0x7a9 0x0 0x0 0x3 0x0 >> __ttl = 0x1 >> __tod = 0x4b0bfd10 0x18710807 >> >> Nov 24 2009 09:34:40.410071957 ereport.cpu.intel.quickpath.mem_ce >> nvlist version: 0 >> class = ereport.cpu.intel.quickpath.mem_ce >> ena = 0xb34794b6cb10001 >> detector = (embedded nvlist) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> >> (end detector) >> >> compound_errorname = MC_CH_RD_ERR >> IA32_MCG_STATUS = 0x0 >> machine_check_in_progress = 0 >> bank_number = 0x8 >> bank_msr_offset = 0x420 >> IA32_MCi_STATUS = 0xcc0000c00001009f >> overflow = 1 >> error_uncorrected = 0 >> error_enabled = 0 >> processor_context_corrupt = 0 >> error_code = 0x9f >> model_specific_error_code = 0x1 >> threshold_based_error_status = No tracking >> IA32_MCi_ADDR = 0xc2ff7dbc0 >> IA32_MCi_MISC = 0x431bdc5f00011180 >> 
ECC-syndrome = 0x431bdc5f >> physaddr = 0xc2ff7dbc0 >> resource = (array of embedded nvlists) >> (start resource[0]) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> (start hc-list[3]) >> nvlist version: 0 >> hc-name = dram-channel >> hc-id = 0 >> (end hc-list[3]) >> (start hc-list[4]) >> nvlist version: 0 >> hc-name = dimm >> hc-id = 1 >> (end hc-list[4]) >> (start hc-list[5]) >> nvlist version: 0 >> hc-name = rank >> hc-id = 5 >> (end hc-list[5]) >> >> hc-specific = (embedded nvlist) >> nvlist version: 0 >> offset = 0x7ea9f3c0 >> (end hc-specific) >> >> (end resource[0]) >> >> mem_cor_ecc_counter = 0x6a5 0x7d1 0x0 0x0 0x3 0x0 >> mem_cor_ecc_counter_last = 0x6a2 0x7c3 0x0 0x0 0x3 0x0 >> __ttl = 0x1 >> __tod = 0x4b0bfd10 0x18713395 >> >> Nov 24 2009 09:34:40.410134184 ereport.cpu.intel.quickpath.mem_ce >> nvlist version: 0 >> class = ereport.cpu.intel.quickpath.mem_ce >> ena = 0xb34795aa2712401 >> detector = (embedded nvlist) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> >> (end detector) >> >> compound_errorname = MC_CH_RD_ERR >> IA32_MCG_STATUS = 0x0 >> machine_check_in_progress = 0 >> bank_number = 0x8 >> bank_msr_offset = 0x420 >> IA32_MCi_STATUS = 0xcc0003c00001009f >> overflow = 1 >> error_uncorrected = 0 >> error_enabled = 0 >> processor_context_corrupt = 0 >> 
error_code = 0x9f >> model_specific_error_code = 0x1 >> threshold_based_error_status = No tracking >> IA32_MCi_ADDR = 0xc2e261440 >> IA32_MCi_MISC = 0x48f99d1500015e46 >> ECC-syndrome = 0x48f99d15 >> physaddr = 0xc2e261440 >> resource = (array of embedded nvlists) >> (start resource[0]) >> nvlist version: 0 >> version = 0x0 >> scheme = hc >> hc-list = (array of embedded nvlists) >> (start hc-list[0]) >> nvlist version: 0 >> hc-name = motherboard >> hc-id = 0 >> (end hc-list[0]) >> (start hc-list[1]) >> nvlist version: 0 >> hc-name = chip >> hc-id = 1 >> (end hc-list[1]) >> (start hc-list[2]) >> nvlist version: 0 >> hc-name = memory-controller >> hc-id = 0 >> (end hc-list[2]) >> (start hc-list[3]) >> nvlist version: 0 >> hc-name = dram-channel >> hc-id = 0 >> (end hc-list[3]) >> (start hc-list[4]) >> nvlist version: 0 >> hc-name = dimm >> hc-id = 1 >> (end hc-list[4]) >> (start hc-list[5]) >> nvlist version: 0 >> hc-name = rank >> hc-id = 5 >> (end hc-list[5]) >> >> hc-specific = (embedded nvlist) >> nvlist version: 0 >> offset = 0x7e832140 >> (end hc-specific) >> >> (end resource[0]) >> >> mem_cor_ecc_counter = 0x6a7 0x7d8 0x0 0x0 0x3 0x0 >> mem_cor_ecc_counter_last = 0x6a5 0x7d1 0x0 0x0 0x3 0x0 >> __ttl = 0x1 >> __tod = 0x4b0bfd10 0x187226a8 >> >> bash-3.00# >> >> >> -- >> paul >>
>
> _______________________________________________
> fm-discuss mailing list
> fm-discuss at opensolaris.org
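
Steve's observation that the ecc_counter values keep changing can be checked by hand from the quoted ereports: each report carries both mem_cor_ecc_counter and mem_cor_ecc_counter_last, so the per-slot difference gives the number of corrected errors counted between two polls. The following is only a rough sketch in Python, not anything fmd itself does. The function names are made up for illustration, and it assumes the fields are plain counters listed in the same slot order in both fields, which the thread does not confirm.

```python
def parse_counters(field: str) -> list[int]:
    """Parse a space-separated list of hex values, e.g. '0x6a7 0x7d8 0x0'."""
    return [int(token, 16) for token in field.split()]


def counter_deltas(current: str, last: str) -> list[int]:
    """Return the per-slot change between the previous and the current poll.

    Assumption: both fields list the same slots in the same order.
    A negative value would mean the counter was reset or wrapped.
    """
    cur = parse_counters(current)
    prev = parse_counters(last)
    if len(cur) != len(prev):
        raise ValueError("counter lists differ in length")
    return [c - p for c, p in zip(cur, prev)]


# Values copied from the last ereport in the quoted fmdump output.
deltas = counter_deltas(
    "0x6a7 0x7d8 0x0 0x0 0x3 0x0",  # mem_cor_ecc_counter
    "0x6a5 0x7d1 0x0 0x0 0x3 0x0",  # mem_cor_ecc_counter_last
)
print(deltas)  # [2, 7, 0, 0, 0, 0]
```

Only the first two slots move between polls in this sample, which is consistent with the errors being real and ongoing rather than a stuck value. Note that the first quoted ereport has a "last" value (0x743) larger than the current one (0x69f) in the first slot, so a real tool would need to decide how to treat resets or wraps; this sketch simply reports the negative difference.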