Skip to content

Comments

fix pci devices duplicates#82

Open
GregWhiteyBialas wants to merge 2 commits intojenningsloy318:masterfrom
GregWhiteyBialas:master
Open

fix pci devices duplicates#82
GregWhiteyBialas wants to merge 2 commits intojenningsloy318:masterfrom
GregWhiteyBialas:master

Conversation

@GregWhiteyBialas
Copy link

Hi,
on some systems scraping fails and crashes exporter because of duplicated PCI devices in redfish output.

This patch aims to workaround this.

Co-authored-by: Will Szumski <will@stackhpc.com>
for _, pcieDevice := range pcieDevices {
_, exists := processed[pcieDevice.ODataID]
if exists {
systemLogContext.WithField("operation", "system.PCIeDevices()").Info(fmt.Sprintf("Ignoring duplicate pci device: %s", pcieDevice.ODataID))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you see the duplicate devices in the Redfish response?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Here is sample of what I got, when I made curl call:
"PCIeDevices": [ { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-7" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-4" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-8" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-7" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-2" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-8" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-1" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-3" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-0" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-8" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-7" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-7" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-3" }, { "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-8" },

@stmcginnis
Copy link

Formatted to make it a little easier to read:

PCIeDevices:
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-7
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-4
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-8
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-7
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-2
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-8
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-1
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-3
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/128-0
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-8
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-7
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-7
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-3
  - "@odata.id": /redfish/v1/Systems/System.Embedded.1/PCIeDevices/160-8

@GregWhiteyBialas
Copy link
Author

@stmcginnis Thanks!

I shouldn't be so laconic in description of this PR. So I am correcting myself. I will truncate all outputs for redability:

Curl of curl -s -k -u $REDFISH_USER:$REDFISH_PASS https://$REDFISH_HOST/redfish/v1/Systems/System.Embedded.1/
gives a lot of output in which we can find:

"PCIeDevices": [
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0"

and:

"PCIeFunctions": [
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0"
    },
    {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1"
    },

As you can see PCIeDevices/129-0 is listed twice without any data that allow us to differentiate in PCIeDevices but it is easy distinguishable in PCIeFunctions

Curl-ing this endpoint gives:

$ curl -s -k -u $REDFISH_USER:$REDFISH_PASS https://$REDFISH_HOST/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0| jq .
{
  "@odata.context": "/redfish/v1/$metadata#PCIeDevice.PCIeDevice",
  "@odata.etag": "1705495544",
  "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0",
  "@odata.type": "#PCIeDevice.v1_4_0.PCIeDevice",
  "Assembly": {
    "@odata.id": "/redfish/v1/Chassis/System.Embedded.1/Assembly"
  },
  "AssetTag": null,
  "Description": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "DeviceType": "MultiFunction",
  "FirmwareVersion": "21.65.33.33",
  "Id": "129-0",
  "Links": {
    "Chassis": [
      {
        "@odata.id": "/redfish/v1/Chassis/System.Embedded.1"
      }
    ],
    "Chassis@odata.count": 1,
    "PCIeFunctions": [
      {
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0"
      },
      {
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1"
      }
    ],
    "PCIeFunctions@odata.count": 2
  },
  "Manufacturer": "Broadcom Inc. and subsidiaries",
  "Model": null,
  "Name": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "PartNumber": "03T135449",
  "SKU": null,
  "SerialNumber": "WSEDERP01TX",
  "Status": {
    "Health": "OK",
    "HealthRollup": "OK",
    "State": "Enabled"
  }
}

So we see that this device has two functions. Which can be accessed by curl:

$ curl -s -k -u $REDFISH_USER:$REDFISH_PASS https://$REDFISH_HOST/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1 | jq .
{
  "@odata.context": "/redfish/v1/$metadata#PCIeFunction.PCIeFunction",
  "@odata.etag": "1705676",
  "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1",
  "@odata.type": "#PCIeFunction.v1_2_3.PCIeFunction",
  "ClassCode": "0x000002",
  "Description": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "DeviceClass": "NetworkController",
  "DeviceId": "0x16d8",
  "FunctionId": 1,
  "FunctionType": "Physical",
  "Id": "129-0-1",
  "Links": {
    "Drives": [],
    "Drives@odata.count": 0,
    "EthernetInterfaces": [
      {
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/EthernetInterfaces/NIC.Slot.5-2-1"
      }
    ],
    "EthernetInterfaces@odata.count": 1,
    "PCIeDevice": {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0"
    },
    "StorageControllers": [],
    "StorageControllers@odata.count": 0
  },
  "Name": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "Oem": {
    "Dell": {
      "@odata.type": "#DellOem.v1_2_0.DellOemResources",
      "DellPCIeFunction": {
        "@odata.context": "/redfish/v1/$metadata#DellPCIeFunction.DellPCIeFunction",
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-1/Oem/Dell/DellPCIeFunctions/NIC.Slot.5-2-1",
        "@odata.type": "#DellPCIeFunction.v1_3_0.DellPCIeFunction",
        "DataBusWidth": "16XOrX16",
        "Description": "An instance of DellPCIeFunction will have PCI device specific data.",
        "Id": "NIC.Slot.5-2-1",
        "LastSystemInventoryTime": "2024-01-17T12:45:34+00:00",
        "LastUpdateTime": "2023-01-24T10:02:15+00:00",
        "Name": "DellPCIeFunction",
        "SlotLength": "LongLength",
        "SlotType": "PCIExpressGen4"
      }
    }
  },
  "RevisionId": "0x01",
  "Status": {
    "Health": "OK",
    "HealthRollup": "OK",
    "State": "Enabled"
  },
  "SubsystemId": "0x4161",
  "SubsystemVendorId": "0x14e4",
  "VendorId": "0x14e4"
}
$ curl -s -k -u $REDFISH_USER:$REDFISH_PASS https://$REDFISH_HOST/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0 | jq .
{
  "@odata.context": "/redfish/v1/$metadata#PCIeFunction.PCIeFunction",
  "@odata.etag": "17046744",
  "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0",
  "@odata.type": "#PCIeFunction.v1_2_3.PCIeFunction",
  "ClassCode": "0x000002",
  "Description": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "DeviceClass": "NetworkController",
  "DeviceId": "0x16d8",
  "FunctionId": 0,
  "FunctionType": "Physical",
  "Id": "129-0-0",
  "Links": {
    "Drives": [],
    "Drives@odata.count": 0,
    "EthernetInterfaces": [
      {
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/EthernetInterfaces/NIC.Slot.5-1-1"
      }
    ],
    "EthernetInterfaces@odata.count": 1,
    "PCIeDevice": {
      "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0"
    },
    "StorageControllers": [],
    "StorageControllers@odata.count": 0
  },
  "Name": "BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller",
  "Oem": {
    "Dell": {
      "@odata.type": "#DellOem.v1_2_0.DellOemResources",
      "DellPCIeFunction": {
        "@odata.context": "/redfish/v1/$metadata#DellPCIeFunction.DellPCIeFunction",
        "@odata.id": "/redfish/v1/Systems/System.Embedded.1/PCIeDevices/129-0/PCIeFunctions/129-0-0/Oem/Dell/DellPCIeFunctions/NIC.Slot.5-1-1",
        "@odata.type": "#DellPCIeFunction.v1_3_0.DellPCIeFunction",
        "DataBusWidth": "16XOrX16",
        "Description": "An instance of DellPCIeFunction will have PCI device specific data.",
        "Id": "NIC.Slot.5-1-1",
        "LastSystemInventoryTime": "2024-01-17T12:45:34+00:00",
        "LastUpdateTime": "2023-01-24T10:02:15+00:00",
        "Name": "DellPCIeFunction",
        "SlotLength": "LongLength",
        "SlotType": "PCIExpressGen4"
      }
    }
  },
  "RevisionId": "0x01",
  "Status": {
    "Health": "OK",
    "HealthRollup": "OK",
    "State": "Enabled"
  },
  "SubsystemId": "0x4161",
  "SubsystemVendorId": "0x14e4",
  "VendorId": "0x14e4"
}

I see in code, that list of PCI devices isn't used to get PCI devices functions, which is pulled from other API endpoint, so this fix shouldn't have side effect in PCIFunction metrics not scraped.

I am not sure if this is bug in this version of firmware or specific hardware configuration which results in duplicated items.

@stmcginnis
Copy link

I think I've seen this bug in a vendors firmware implementation. It would be good to raise it with your vendor so they can fix it. Redfish definitely should not be reporting the same device twice.

Still good if there is a way to make handling of that condition more robust though.

Copy link

@dougszumski dougszumski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, a single PCI device with multiple functions, can, at least for one set of hardware, appear multiple times in the list of PCI devices (with the same identifier). This causes a failure in the current code. This fix works around the issue by skipping any 'duplicate' devices in a backwards compatible way. I have one minor suggestion, but otherwise I'm in favour of merging.

wg5.Add(len(pcieDevices))
for _, pcieDevice := range pcieDevices {
_, exists := processed[pcieDevice.ODataID]
if exists {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: In the absence of any unit tests it would be helpful to add the scrape data where we see this issue to the sampleOut folder.

A comment could also be added here, referencing the sample data.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I have added sample data and comment about it.

@GregWhiteyBialas
Copy link
Author

@jenningsloy318 is it possible to move this forward somehow?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants