[fm-discuss] errlog growing out of control; PCIe errors on NHM/IOH mobo
Erwin Tsaur
Erwin.Tsaur at Sun.COM
Wed Jul 8 16:40:01 PDT 2009
Chris Worley wrote:
> On Wed, Jul 8, 2009 at 5:02 PM, Erwin Tsaur<Erwin.Tsaur at sun.com> wrote:
>
>> I didn't realize that the Root Port was seeing the same thing.. :(
>>
>> Add the same line to pcie_pci.conf
>>
>
> I did, and rebooted:
>
> # tail /kernel/drv/pcie_pci.conf
> # Force load driver to support hotplug activity
> #
> ddi-forceattach=1;
>
> #
> # force interrupt priorities to be one
> # otherwise this driver binds as bridge device with priority 12
> #
> interrupt-priorities=1;
> pcie_ce_mask=0x1040;
>
> ... still errors being logged (see attached).
>
Geesh, yet another type of CE. Change the mask from 0x1040 to 0x1041.
If you get tired of this, change the mask to -1. :)
With this you shouldn't see any more ereports from these devices, unless
there is a UE from that link.
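
For reference, a minimal sketch of what those mask values select. It
assumes pcie_ce_mask is applied bit-for-bit to the PCIe AER Correctable
Error Mask register, which this thread does not confirm. The bit names
are the ones defined as of PCIe 2.0, and the helper name decode_ce_mask
is made up for illustration.

```python
# Bit positions in the PCIe AER Correctable Error Status/Mask register
# (as defined as of PCIe 2.0). Assumption: pcie_ce_mask maps directly
# onto this register layout.
CE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_ce_mask(mask):
    """Return the names of the correctable errors selected by a mask value."""
    mask &= 0xFFFFFFFF  # treat -1 as "all 32 bits set"
    return [name for bit, name in sorted(CE_BITS.items()) if mask & (1 << bit)]


if __name__ == "__main__":
    for value in (0x1040, 0x1041, -1):
        print(hex(value & 0xFFFFFFFF), decode_ce_mask(value))
```

Under that assumption, 0x1040 covers Bad TLP and Replay Timer Timeout
(the btlp and rto errors mentioned earlier), 0x1041 adds Receiver Error,
and -1 selects every correctable error bit.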
>
>> Good news is that the leaf device is no longer spamming with CEs.
>>
>
> Yes, I've made a lot of progress so far on this, thanks!
>
>
>> You can also limit which RPs' CEs get turned off; see the "driver.conf" man
>> page. As written, this will mask 0x1040 on all the RPs.
>>
>> I have to warn again: though they are technically CEs and no damage was
>> done, there are probably performance impacts. Masking the CEs won't
>> correct that, but it will save your hard drive and also significantly improve
>> performance, since the OS won't be interrupted hundreds of times a second.
>> Unfortunately I don't know of any fix, unless there is a vendor-specific
>> method to fix the underlying HW issue.
>>
>
> Intel is usually pretty good about fixing issues like this (unless
> it's caused by Supermicro's layout).
>
Intel is pretty good about this. These are low level physical layer
errors, so it could very well be a layout issue.
> I'm not too worried about the NIC's performance, as long as it works...
> I do need to measure system performance in other respects, so
> decreasing the shower of interrupts (CPU overhead) is important.
>
> Chris
>
>> Chris Worley wrote:
>>
>>> On Wed, Jul 8, 2009 at 4:29 PM, Erwin Tsaur<Erwin.Tsaur at sun.com> wrote:
>>>
>>>
>>>> Chris Worley wrote:
>>>>
>>>>
>>>>> On Wed, Jul 8, 2009 at 3:42 PM, Erwin Tsaur<Erwin.Tsaur at sun.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> ok I severely underestimated how much 1000 lines are, but I got the
>>>>>> info I needed, mostly. Now I'm wondering if there is an erratum in the
>>>>>> Root Port causing this.
>>>>>>
>>>>>> /pci@0,0/pci8086,3408@1/pci15d9,10c9, I know it's a NIC device. Both
>>>>>> nodes are complaining of CE errors.
>>>>>>
>>>>>> The best way to disable reporting of those 2 CEs is to add the
>>>>>> following line to the driver's .conf file.
>>>>>>
>>>>>>
>>>>>>
>>>>> This is the igb driver (SUNWigb package). It doesn't have a conf file.
>>>>>
>>>>> It's for the Intel 82576 Dual-Port GigE on-board NIC. This driver
>>>>> didn't work prior to the update.
>>>>>
>>>>> So, I made a .conf file thusly and rebooted:
>>>>>
>>>>> /usr/kernel/drv# cat >igb.conf
>>>>> pcie_ce_mask=0x1040;
>>>>>
>>>>> ... no difference. Still lots of errors reported (attached last 1000
>>>>> lines).
>>>>>
>>>>>
>>>>>
>>>> It's not picking up the conf property...
>>>> According to the pkgdef, I think the correct place is
>>>> /kernel/drv/igb.conf
>>>>
>>>> It should be in the same place as the igb driver.
>>>>
>>>>
>>> Okay, it was there, and I changed it and rebooted:
>>>
>>> root@opensolaris:~# tail /kernel/drv/igb.conf
>>> # For example, if you see,
>>> # "/pci@0,0/pci10de,5d@d/pci8086,0@0" 0 "igb"
>>> # "/pci@0,0/pci10de,5d@d/pci8086,0@0,1" 1 "igb"
>>> #
>>> # name = "pciex8086,10a7" parent = "/pci@0,0/pci10de,5d@d" unit-address = "0"
>>> # flow_control = 1;
>>> # name = "pciex8086,10a7" parent = "/pci@0,0/pci10de,5d@d" unit-address = "0,1"
>>> # flow_control = 3;
>>> pcie_ce_mask=0x1040;
>>>
>>> Still, no joy... the last 1K lines attached.
>>>
>>> Thanks,
>>>
>>> Chris
>>>
>>>
>>>>
>>>>> Chris
>>>>>
>>>>>
>>>>>
>>>>>> pcie_ce_mask=0x1040;
>>>>>>
>>>>>> It requires reboot.
>>>>>>
>>>>>> If you need to do it on a live system let me know; the instructions are
>>>>>> a bit more complicated.
>>>>>>
>>>>>> Chris Worley wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Wed, Jul 8, 2009 at 2:57 PM, Erwin Tsaur<Erwin.Tsaur at sun.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Chris Worley wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Wed, Jul 8, 2009 at 2:42 PM, Erwin Tsaur<Erwin.Tsaur at sun.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Chris Worley wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 8, 2009 at 1:10 PM, Erwin Tsaur<Erwin.Tsaur at sun.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Chris Worley wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> (Sorry for the misleading "Subject" in the initial post. I would like
>>>>>>>>>>>>> to know a more appropriate place to post, since fm is just the
>>>>>>>>>>>>> messenger here.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> More to add: fmadm faulty may be saying something about a bad PCIe
>>>>>>>>>>>>> slot or device (is there an "lspci" in OpenSolaris?):
>>>>>>>>>>>>>
>>>>>>>>>>>>> # fmadm faulty
>>>>>>>>>>>>> --------------- ------------------------------------  -------------- ---------
>>>>>>>>>>>>> TIME            EVENT-ID                              MSG-ID         SEVERITY
>>>>>>>>>>>>> --------------- ------------------------------------  -------------- ---------
>>>>>>>>>>>>> Jul 07 07:55:42 016cf20c-d572-42c1-f217-9eb8d439b73c  PCIEX-8000-KP  Major
>>>>>>>>>>>>>
>>>>>>>>>>>>> Fault class : fault.io.pciex.device-interr-corr max 29%
>>>>>>>>>>>>>               fault.io.pciex.bus-linkerr-corr max 14%
>>>>>>>>>>>>> Affects     : dev:////pci@0,0/pci8086,3408@1/pci15d9,10c9@0
>>>>>>>>>>>>>               dev:////pci@0,0/pci8086,3408@1/pci15d9,10c9@0,1
>>>>>>>>>>>>>               dev:////pci@0,0/pci8086,3408@1
>>>>>>>>>>>>>                   faulted but still in service
>>>>>>>>>>>>> FRU         : "MB"
>>>>>>>>>>>>>               (hc://:product-id=X8DTH-i-6-iF-6F:chassis-id=1234567890:server-id=opensolaris/motherboard=0)
>>>>>>>>>>>>>                   faulty
>>>>>>>>>>>>>
>>>>>>>>>>>>> Description : Too many recovered bus errors have been detected, which
>>>>>>>>>>>>>               indicates a problem with the specified bus or with the
>>>>>>>>>>>>>               specified transmitting device. This may degrade into an
>>>>>>>>>>>>>               unrecoverable fault.
>>>>>>>>>>>>>               Refer to http://sun.com/msg/PCIEX-8000-KP for more
>>>>>>>>>>>>>               information.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Response    : One or more device instances may be disabled
>>>>>>>>>>>>>
>>>>>>>>>>>>> Impact      : Loss of services provided by the device instances
>>>>>>>>>>>>>               associated with this fault
>>>>>>>>>>>>>
>>>>>>>>>>>>> Action      : If a plug-in card is involved check for badly-seated
>>>>>>>>>>>>>               cards or bent pins. Otherwise schedule a repair procedure
>>>>>>>>>>>>>               to replace the affected device. Use fmadm faulty to
>>>>>>>>>>>>>               identify the device or contact Sun for support.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How bad is this error? I need to put some adapters in, but it sounds
>>>>>>>>>>>>> like the OS doesn't handle the NHM's IOH (or is it really detecting a
>>>>>>>>>>>>> HW issue?).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> The OS does handle these issues, and unfortunately it is a HW issue.
>>>>>>>>>>>> This is likely to eventually cause your system to panic or fill up
>>>>>>>>>>>> your hard drive. I'm assuming you are seeing a lot of BTLP and RTO
>>>>>>>>>>>> errors. If anything, these errors are a performance killer. Not only
>>>>>>>>>>>> is the RTO/BTLP error telling you that many packets require
>>>>>>>>>>>> retransmission, the OS also has to constantly go out and scan and
>>>>>>>>>>>> clean up the fabric.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> This system is triple boot: RHEL5.3, W2008S, and OpenSolaris.
>>>>>>>>>>>
>>>>>>>>>>> The errors in OpenSolaris occur if no cards are installed in the
>>>>>>>>>>> bus.
>>>>>>>>>>>
>>>>>>>>>>> The other OSes don't report any errors w/ or w/o cards in the bus.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> This doesn't happen when there are no cards installed, since the error
>>>>>>>>>> is literally complaining about packets received between 2 devices. Are
>>>>>>>>>> you sure you are correctly identifying the right slot?
>>>>>>>>>>
>>>>>>>>>> I believe only OpenSolaris even detects these errors, which is why the
>>>>>>>>>> other OSes don't report any errors. It doesn't mean that errors aren't
>>>>>>>>>> occurring though.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> It would also be nice to throttle the errlog so it doesn't fill the
>>>>>>>>>>>>> disk an hour after boot. Is this possible?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> No throttling is possible, but you could turn it off. That is highly
>>>>>>>>>>>> not recommended, though; it's better to fix the issue. It really could
>>>>>>>>>>>> just be a badly seated card.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> How do I disable the errors?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> We need to figure out exactly what your error is first. Please provide
>>>>>>>>>> the "fmdump -eV" log. If it is huge, the last 500-1000 lines should be
>>>>>>>>>> enough.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Would that produce the same as the incantation shown earlier?:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I think all the CEs would produce the same message. I also need to know
>>>>>>>> the exact device.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Last 1000 lines (of ~20 million) attached.
>>>>>>>
>>>>>>> There are some boards in the bus at this time, but the same error
>>>>>>> occurs w/o them, and their drivers are not yet loaded. Everything
>>>>>>> else is built-in to the motherboard.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>> <snip>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>
>>