UCS B200 memory fault – DIMM E2 on server 2/2 operability: inoperable

We saw this fault on a UCS B200 blade server: DIMM E2 on server 2/2 operability: inoperable

We had installed an ESXi 5 host on this blade and it had worked fine for some time. However, it then crashed a couple of times (pink screen on the ESXi console), and we started wondering what was going on.

We recovered it by rebooting the blade, but that was not a final solution.

We were told that the cause could be hardware related, so we decided to take a look at UCS Manager.

Under Faults and Alarms we saw one entry for this particular blade server, namely a problem with the DIMM operability:

[Screenshot: DIMM operability: inoperable fault – UCSM]

Searching on cisco.com, we found that an inoperable DIMM needs to be replaced. The Cisco documentation for UCS fault code F0185 describes what this fault means on the UCS C-series.

We opened a ticket with Cisco TAC to have this confirmed and, indeed, they said that we had to replace the DIMM, as it had passed the threshold of ECC errors.

We created an RMA, but while we were waiting for it to arrive, we decided to switch this DIMM to another bank within the same B200 blade server.

The DIMM appeared to work normally in its new location and has continued to do so (this happened three months before this post was written). The fault was cleared:

[Screenshot: DIMM inoperable fault recovered – UCSM]

However, TAC warned us that the error might reappear on the same DIMM in the future and suggested that we monitor the server for some time.

You can also check the memory status from the UCSM CLI with these commands:

SYSFAB-A# scope server 2/2
SYSFAB-A /chassis/server # show memory
DIMM Location Presence Overall Status Type Capacity (MB) Clock
---- -------- -------- -------------- ---- ------------- -----
1    A0       Equipped Operable       DDR3 8192          1600
2    A1       Equipped Operable       DDR3 8192          1600
3    A2       Equipped Operable       DDR3 16384         1600
4    B0       Equipped Operable       DDR3 8192          1600
5    B1       Equipped Operable       DDR3 8192          1600
6    B2       Equipped Operable       DDR3 16384         1600
7    C0       Equipped Operable       DDR3 8192          1600
8    C1       Equipped Operable       DDR3 8192          1600
9    C2       Equipped Operable       DDR3 16384         1600
10   D0       Equipped Operable       DDR3 8192          1600
11   D1       Equipped Operable       DDR3 8192          1600
12   D2       Equipped Operable       DDR3 16384         1600
13   E0       Equipped Operable       DDR3 8192          1600
14   E1       Equipped Operable       DDR3 8192          1600
15   E2       Equipped Operable       DDR3 16384         1600
16   F0       Equipped Operable       DDR3 8192          1600
17   F1       Equipped Operable       DDR3 8192          1600
18   F2       Equipped Operable       DDR3 16384         1600
19   G0       Equipped Operable       DDR3 8192          1600
20   G1       Equipped Operable       DDR3 8192          1600
21   G2       Equipped Operable       DDR3 16384         1600
22   H0       Equipped Operable       DDR3 8192          1600
23   H1       Equipped Operable       DDR3 8192          1600
24   H2       Equipped Operable       DDR3 16384         1600
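
Since TAC suggested keeping an eye on the server, it can help to check this output automatically rather than by eye. The following is only a rough sketch in Python, not anything provided by Cisco: it assumes you have already captured the text of `show memory` yourself (for example, by saving an SSH session to a file), and the function names `parse_show_memory` and `unhealthy_dimms` are made up for this example. The sample row with status `Inoperable` is hypothetical too; the exact status string the CLI prints for a failed DIMM may differ.

```python
import re

# Matches one data row of the `show memory` table, for example:
# "15   E2       Equipped Operable       DDR3 16384         1600"
# Rows whose capacity or clock are not numeric will not match and are skipped.
ROW = re.compile(
    r"^\s*(?P<dimm>\d+)\s+(?P<location>[A-Z]\d+)\s+(?P<presence>\S+)\s+"
    r"(?P<status>.+?)\s+(?P<type>\S+)\s+(?P<capacity_mb>\d+)\s+(?P<clock>\d+)\s*$"
)


def parse_show_memory(text):
    """Return one dict per DIMM row; prompts, headers and separators are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            row = match.groupdict()
            row["capacity_mb"] = int(row["capacity_mb"])
            rows.append(row)
    return rows


def unhealthy_dimms(rows):
    """Return installed DIMMs whose overall status is anything other than 'Operable'."""
    return [r for r in rows if r["presence"] == "Equipped" and r["status"] != "Operable"]


# Hypothetical capture: the E2 row is edited by hand to show a failed DIMM.
SAMPLE = """\
DIMM Location Presence Overall Status Type Capacity (MB) Clock
---- -------- -------- -------------- ---- ------------- -----
14 E1 Equipped Operable DDR3 8192 1600
15 E2 Equipped Inoperable DDR3 16384 1600
"""

if __name__ == "__main__":
    for dimm in unhealthy_dimms(parse_show_memory(SAMPLE)):
        print(f"DIMM {dimm['location']}: {dimm['status']}")  # prints "DIMM E2: Inoperable"
```

Run against a capture taken on a schedule, a non-empty result from `unhealthy_dimms` is the signal to go back to UCSM and look at the fault in detail.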

When the RMA arrived, we replaced the DIMM with the new one and returned the faulty part to Cisco. Case closed.

Comments

  1. Scott says

    We have seen that too. It was the first time that a memory DIMM caused a purple screen for us. We had to replace the system board twice. We have also been seeing a lot of memory issues that keep coming up in Cisco UCS. I don’t know what it is with the memory and the boards, but we never saw this many memory DIMM issues with HP before.
