[fm-discuss] fmd gone insane
Mike Shapiro
mws at sun.com
Wed Dec 5 13:04:37 PST 2007
On Wed, Dec 05, 2007 at 12:27:54PM -0800, Erwin wrote:
> From the ereport, you are getting 2 CEs.
>
> aer_ce = 0x81
>
> 0x1 = Receiver Error. This is a known problem with PLX switches, and you should already have a workaround that automatically masks it.
>
> 0x80 = Bad DLLP Status. This is caused when the Link Layer detects a bad CRC check. Usually due to bad signal integrity that results in bit flips. PCI-Express's ack/nack protocol will retry the packet resulting in no information loss.
>
> You could either have a dirty link or the card below this slot is producing bad data. There is a risk that eventually this may cause an Uncorrectable error and your system may go down.
>
> You are probably not experiencing any performance loss because the fabric hasn't been saturated, it's a pretty big pipe.
>
> If you want to ignore these errors, add the following to /etc/system:
> set pcie:pcie_aer_ce_mask = 0x81
>
> This will mask both "Receiver Errors" as Cindi suggested earlier as well as "Bad DLLP". Turning off FMD just means the ereports that are being sent will get lost. But your system is still getting a bunch of errors and SW is taking up cycles trying to clean it up.
>
> Erwin
> set pcie:pcie_aer_ce_mask = 1
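As a quick check of the decoding quoted above, here is a minimal sketch
that splits an aer_ce value into its set bits. The bit positions follow the
PCI Express AER Correctable Error Status register layout; the table and
function names are made up for illustration and are not part of any
Solaris tool.

```python
# Hypothetical helper: decode a PCIe AER correctable-error status value.
AER_CE_BITS = {
    0x0001: "Receiver Error",
    0x0040: "Bad TLP",
    0x0080: "Bad DLLP",
    0x0100: "REPLAY_NUM Rollover",
    0x1000: "Replay Timer Timeout",
}


def decode_aer_ce(value):
    """Return the names of the known CE bits set in value, lowest bit first."""
    return [name for bit, name in sorted(AER_CE_BITS.items()) if value & bit]


print(decode_aer_ce(0x81))
```

For 0x81 this reports Receiver Error and Bad DLLP, which matches the two
CEs described in the quoted text and the two bits covered by the 0x81 mask.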
The kernel needs to protect itself against CE trap storms. We used to have
the same issue on SPARC with CE traps. If the hardware is designed such
that it can emit the same CE trap again and again, and this isn't a
software bug where we fail to clear some state that would prevent the
re-trap, then the kernel software for any subsystem like this needs one of
two things. Either it has a built-in throttle, so that you can't generate
more than a certain number of CEs per unit time, or it offers an interface
to shut off CE traps, and fmd has a SERD engine that causes a module to
poke the kernel to shut them off in a storm scenario.
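
To make the second option concrete, here is a rough user-space sketch of
the "N events within time T" idea behind a SERD engine. It is only an
illustration under stated assumptions: the class name, the firing rule
(fire once when at least n events fall inside a t-second window) and the
mask_ce_traps() stand-in are all hypothetical, not fmd's actual API or a
real kernel interface.

```python
from collections import deque


class SerdEngine:
    """Fires once when at least n events fall within a t-second window."""

    def __init__(self, n, t):
        self.n = n
        self.t = t
        self.events = deque()
        self.fired = False

    def record(self, now):
        """Record an event at time now (seconds); True only on the firing event."""
        self.events.append(now)
        # Drop events that have aged out of the window ending at now.
        while self.events and now - self.events[0] > self.t:
            self.events.popleft()
        if not self.fired and len(self.events) >= self.n:
            self.fired = True
            return True
        return False

    def reset(self):
        """Forget all recorded events and re-arm the engine."""
        self.events.clear()
        self.fired = False


def mask_ce_traps(mask):
    """Hypothetical stand-in for asking the kernel to shut off CE traps."""
    return "masking CE bits 0x%x" % mask


eng = SerdEngine(n=3, t=10.0)
for ts in (0.0, 20.0, 21.0, 22.0, 23.0):
    if eng.record(ts):
        print(ts, mask_ce_traps(0x81))
```

With these timestamps the lone event at 0.0 ages out, and the engine fires
on the third event of the later burst (at 22.0), then stays quiet until it
is reset. A real design would also need to decide when to unmask again.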
-Mike
--
Mike Shapiro, Solaris Kernel Development. blogs.sun.com/mws/