Ampere - 10GbE NICs creating lots of errors - Intel X710 - Ubuntu 22.04

Hello, I found an issue with the default Ubuntu 22.04 build that ships on the Ampere ADLINK workstations: it logs lots of dmesg errors, as seen below.

[171852.910737] pcieport 000c:00:01.0: AER: Multiple Corrected error message received from 000c:01:00.0                                                                       
[171852.910765] i40e 000c:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)                                                                 
[171852.910773] i40e 000c:01:00.0:   device [8086:1572] error status/mask=00001000/00002000                                                                                   
[171852.910782] i40e 000c:01:00.0:    [12] Timeout                                                                                                                            
[171852.910792] i40e 000c:01:00.0: AER:   Error of this Agent is reported first                                                                                               
[171852.917868] i40e 000c:01:00.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)                                                                 
[171852.917871] i40e 000c:01:00.1:   device [8086:1572] error status/mask=00001000/00002000                                                                                   
[171852.917875] i40e 000c:01:00.1:    [12] Timeout                                                                                                                            
[171856.910532] pcieport 000c:00:01.0: AER: Corrected error message received from 000c:01:00.0

000c:01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
        Subsystem: QNAP Systems, Inc. Ethernet Controller X710 for 10GbE SFP+                                                                                                 
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+                                                                 
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-                                                                  
        Latency: 0                                                                                                                                                            
        Interrupt: pin A routed to IRQ 96                                                                                                                                     
        NUMA node: 0                                                                                                                                                          
        IOMMU group: 20                                                                                                                                                       
        Region 0: Memory at 300000800000 (64-bit, prefetchable) [size=8M]                                                                                                     
        Region 3: Memory at 300001808000 (64-bit, prefetchable) [size=32K]                                                                                                    
        Expansion ROM at 40080000 [disabled] [size=512K]                                                                                                                      
        Capabilities: <access denied>                                                                                                                                         
        Kernel driver in use: i40e                                                                                                                                            
        Kernel modules: i40e  

Any ideas?
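
For readers decoding the log above: the `error status/mask=00001000/00002000` pair is a bitmask over the PCIe AER correctable error registers. The sketch below is a minimal decoder, assuming the standard bit layout from the PCIe specification; the helper names `decode_correctable` and `parse_status_mask` are made up for illustration and are not part of any tool mentioned in this thread.

```python
# Bit positions of the PCIe AER Correctable Error Status/Mask registers
# (standard layout; the names follow the PCIe specification wording).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(value: int) -> list[str]:
    """Return the names of all known bits set in a correctable error register."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def parse_status_mask(line: str) -> tuple[int, int]:
    """Extract (status, mask) from an 'error status/mask=XXXXXXXX/XXXXXXXX' log line."""
    field = line.rsplit("status/mask=", 1)[1].split()[0]
    status_hex, mask_hex = field.split("/")
    return int(status_hex, 16), int(mask_hex, 16)


# Sample line copied from the dmesg output in this post.
line = "i40e 000c:01:00.0:   device [8086:1572] error status/mask=00001000/00002000"
status, mask = parse_status_mask(line)
print("reported:", decode_correctable(status))
print("masked:  ", decode_correctable(mask))
```

For the line shown, the status decodes to the replay timer timeout (the `[12] Timeout` entry in the log) and the mask to the advisory non-fatal bit.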

Wow, that is ugly. Yikes!

Try this:
‘sudo lspci -vvv -s 000c:01:00.0’ and post the full results. I’ll have a look at it.

I helped develop that driver while I was at Intel, so I have some familiarity with it.

Oh, you’ll have to sudo that command to get past the “<access denied>” for capabilities.

Thanks for the prompt reply. Here's the output.

It's a dual-port SFP+ QNAP PCIe adapter. I have an SFP+ module installed in the first port, on a link of about 10 km; I tested the module in a switch and it was fine. The second port has a 1G fiber module, but nothing is connected to it.

darragh@armbuilder:~$ sudo lspci -vvv -s 000c:01:00.0
000c:01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
        Subsystem: QNAP Systems, Inc. Ethernet Controller X710 for 10GbE SFP+
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 96
        NUMA node: 0
        IOMMU group: 15
        Region 0: Memory at 300000000000 (64-bit, prefetchable) [size=8M]
        Region 3: Memory at 300001800000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at 40000000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00001000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x8 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Device Serial Number 9c-97-8a-ff-ff-be-1b-aa
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 16, stride: 1, Device ID: 154c
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000300001000000 (64-bit, prefetchable)
                Region 3: Memory at 0000300001810000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                No steering table available
        Capabilities: [1b0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [1d0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: i40e
        Kernel modules: i40e

darragh@armbuilder:~$ 

Second port:

darragh@armbuilder:~$ sudo lspci -vvv -s 000c:01:00.1
000c:01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
        Subsystem: QNAP Systems, Inc. Ethernet Controller X710 for 10GbE SFP+
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 96
        NUMA node: 0
        IOMMU group: 20
        Region 0: Memory at 300000800000 (64-bit, prefetchable) [size=8M]
        Region 3: Memory at 300001808000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at 40080000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00001000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x8 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout+ AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Device Serial Number 9c-97-8a-ff-ff-be-1b-aa
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 01
                VF offset: 79, stride: 1, Device ID: 154c
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000300001400000 (64-bit, prefetchable)
                Region 3: Memory at 0000300001910000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                No steering table available
        Capabilities: [1b0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: i40e
        Kernel modules: i40e
darragh@armbuilder:~$ 

Here's the ethtool output:

darragh@armbuilder:~$ sudo ethtool enP12p1s0f0 
Settings for enP12p1s0f0:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseLR/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10000baseLR/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: 10000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: FIBRE
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
darragh@armbuilder:~$ 

Let me know if you need any more info

Hi, this looks odd:

Your link capabilities show 8GT/s and link status shows 8GT/s, but the target link speed is 2.5GT/s.

Do you have another slot you could use? Target link speed should match with capabilities and link status in most cases.
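
The comparison described above can be sketched as a small script that pulls the three speeds out of `lspci -vvv` text and flags a mismatch. This is only an illustration: the function name `link_speeds` and the regular expressions are assumptions based on the output format pasted in this thread, and other lspci versions may print these lines differently.

```python
import re

# Excerpt of the first port's output, copied from the lspci paste above.
SAMPLE = """\
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <16us
        LnkSta: Speed 8GT/s (ok), Width x8 (ok)
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
"""


def link_speeds(lspci_text: str) -> dict[str, str]:
    """Return the speeds found on the LnkCap, LnkSta and LnkCtl2 lines."""
    patterns = {
        # 'LnkCap:' does not match 'LnkCap2:' because of the colon.
        "LnkCap": r"LnkCap:.*?Speed ([\d.]+GT/s)",
        "LnkSta": r"LnkSta:\s*Speed ([\d.]+GT/s)",
        "LnkCtl2": r"LnkCtl2:\s*Target Link Speed: ([\d.]+GT/s)",
    }
    speeds = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, lspci_text)
        if match:
            speeds[name] = match.group(1)
    return speeds


speeds = link_speeds(SAMPLE)
mismatch = len(set(speeds.values())) > 1
print(speeds, "mismatch" if mismatch else "all matching")
```

To run it against a live system, the text could come from `sudo lspci -vvv -s <address>` instead of the embedded sample.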

I have an X520 installed on this machine, so while it’s only PCIe gen 2 at 5GT/s all three are matching.

LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkSta: Speed 5GT/s (ok), Width x8 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
		 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
		 Compliance De-emphasis: -6dB

Weird formatting - sorry about that.

Thanks Greg. Interesting. It could be down to where I have the card placed: it's in a PCIe x16 (full-width) slot, as the card is a little longer than the PCIe x4 slot.
https://www.digitec.ch/en/s1/product/qnap-qxg-10g2sf-x710-sfp-pci-express-30-x8-network-cards-37863007

This is the motherboard I'm using; the card is in the middle PCIe x16 slot:
aadp-docs – I-Pi SMARC

Do you think that has an impact?

I’m in the office again tomorrow, so I can swap the slots and see if it makes any difference.

There have been some PCIe reliability issues on the AADP daughter board. If space permits, use the QNAP card in the PCIe slot closest to the COM-HPC/SoC. The PCIe link training on boot was supposed to be better on that slot.

Hi Darragh,

It appears that vikingforties has a good suggestion and some knowledge of the issue. I’d follow his advice since I’m not familiar with that board.

Good luck!

A common fix is to disable ASPM; some cards have ASPM issues in some circumstances.

For example:
GRUB_CMDLINE_LINUX_DEFAULT="console=ttyAMA0 pcie_aspm=off"

ref.
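
On Ubuntu, a change to `GRUB_CMDLINE_LINUX_DEFAULT` normally takes effect only after `sudo update-grub` and a reboot. A minimal sketch for confirming that the running kernel actually received the parameter is below; the function names are hypothetical, and the sample string is illustrative rather than taken from the poster's machine.

```python
from pathlib import Path


def aspm_disabled(cmdline: str) -> bool:
    """True if the kernel command line contains the pcie_aspm=off parameter."""
    return "pcie_aspm=off" in cmdline.split()


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text().strip()


# Illustrative command line; on a real system use running_cmdline() instead.
example = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda2 ro console=ttyAMA0 pcie_aspm=off"
print("ASPM disabled on command line:", aspm_disabled(example))
```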

It’s normal. I got errors like that from NVMe drives too, and I just disabled the error reporting. Here is how:

My systemd unit file looks like this. You can do the same with any other NVMe device.

root@voyager:~# cat /usr/lib/systemd/system/baonq-disable-nvme-aer.service
[Unit]
Description=Fix for AER's excessive logging for NVMe storage (Samsung Samsung SSD 990 PRO M2 NVMe)
After=systemd-modules-load.service

[Service]
Type=oneshot
# Change your device and vendor (or bus/slot/function accordingly)
ExecStart=/usr/bin/setpci -v -d 1def:e117 CAP_EXP+0x8.w=0x283e
RemainAfterExit=yes

[Install]
WantedBy=network.target
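
A note for anyone adapting the unit above to the X710: `CAP_EXP+0x8` is the PCIe Device Control register, and `0x283e` overwrites the whole 16-bit word, not just the correctable-error bit. The sketch below decodes such a value using the standard Device Control bit layout. The function names are hypothetical, and the value `0x211f` is an assumption reconstructed from the `DevCtl` lines in the lspci output earlier in this thread; read the real register with setpci before writing anything.

```python
def decode_devctl(value: int) -> dict[str, object]:
    """Decode a 16-bit PCIe Device Control register value into named fields."""
    return {
        "CorrErr": bool(value & 0x0001),      # correctable error reporting enable
        "NonFatalErr": bool(value & 0x0002),
        "FatalErr": bool(value & 0x0004),
        "UnsupReq": bool(value & 0x0008),
        "RlxdOrd": bool(value & 0x0010),
        "MaxPayload": 128 << ((value >> 5) & 0x7),    # bytes
        "ExtTag": bool(value & 0x0100),
        "PhantFunc": bool(value & 0x0200),
        "AuxPwr": bool(value & 0x0400),
        "NoSnoop": bool(value & 0x0800),
        "MaxReadReq": 128 << ((value >> 12) & 0x7),   # bytes
    }


def clear_corr_err(current: int) -> int:
    """Return the register value with only the CorrErr enable bit cleared."""
    return current & ~0x0001 & 0xFFFF


# Value written by the unit file above (tuned for that poster's NVMe drive).
print(decode_devctl(0x283E))

# Assumed current X710 value, inferred from the DevCtl lines in this thread.
x710_current = 0x211F
print(hex(clear_corr_err(x710_current)), decode_devctl(clear_corr_err(x710_current)))
```

Decoded, `0x283e` also selects a 256-byte MaxPayload and sets NoSnoop, which differs from the X710's current 128-byte setting, so copying the value verbatim changes more than the logging. The setpci man page also describes a `bits:mask` value form that changes only the masked bits, which may be a safer way to clear bit 0 alone.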