Rundall, Jacob D
2017-05-17 19:12:49 UTC
I'm curious if anybody can help me figure out how to use xCAT to view "Active Events" for Lenovo System x servers, as shown in the web interface of the IMM. Using pasu gets me partway there, as follows:
pasu mynode immapp showimmlog | grep "Severity:5"
There are a few shortcomings, though, as compared to the web interface of the IMM:
1. pasu shows me past events that are no longer active. The recovery events are lower severity, so they don't make it through the grep, and it's not obvious from this command alone that the events have been recovered from.
2. pasu only returns items with some kind of sequence number rather than a date and time.
3. The web interface sometimes also has "Additional Information for Event", which I cannot figure out how to view using pasu.
Here is an example of what I can see in the IMM web interface:
Error System 25 June 2016, 03:14:40.788 AM An Uncorrectable Error has occurred on PCIs.
Error System 25 June 2016, 03:15:13.638 AM Fault in slot 3 on system System x3650 M5. <more>
Clicking "more" on the latter provides the following additional information:
[S.68005] An error has been detected by the the IIO core logic on CPU 1. The Global Fatal Error Status register contains 0x0. The Global Non-Fatal Error Status register contains 0x40. Please check error logs for the presence of additional downstream device error data.
And here's the output that I get using my pasu command shown above (with grep):
monitor01: 19 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 22 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 27 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 49 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 56 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 125 | Severity:5 | Message:A Fatal Bus Error has occurred on bus CPU 2 PECI.
monitor01: 126 | Severity:5 | Message:An Uncorrectable Error has occurred on PCIs.
monitor01: 128 | Severity:5 | Message:Fault in slot 3 on system System x3650 M5.
monitor01: 138 | Severity:5 | Message:A Fatal Bus Error has occurred on bus CPU 2 PECI.
monitor01: 164 | Severity:5 | Message:A Fatal Bus Error has occurred on bus CPU 2 PECI.
Events 126 and 128 clearly correspond to what is shown as "Active Events" in the web interface. But it's not obvious that the others are not active unless I dig deeper into the IMM log (e.g., without filtering through grep). When I do that, I can eventually find subsequent recovery events for the other severity 5 events, which explains why they are not considered "active".
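To illustrate the manual digging described above, here is a minimal Python sketch that pairs each severity 5 event with a later recovery event in the unfiltered log. It is not an xCAT or pasu feature. It assumes every line has the shape shown in the sample output (node: sequence | Severity:N | Message:text), that lines arrive in sequence order, and that a recovery message is the same text ending in "has deasserted." instead of "has asserted." That last point is only a guess; recovery events for the bus and PCI errors may be worded differently, so the matching rule would need adjusting against a real log. The function name unrecovered_events is made up for this example.

```python
import re

# Assumed line shape, copied from the sample pasu output:
#   "<node>: <seq> | Severity:<n> | Message:<text>"
LINE_RE = re.compile(
    r"^(?P<node>[^:]+):\s*(?P<seq>\d+)\s*\|\s*Severity:(?P<sev>\d+)"
    r"\s*\|\s*Message:(?P<msg>.*)$"
)

ASSERTED = " has asserted."
DEASSERTED = " has deasserted."  # assumed wording of a recovery event


def unrecovered_events(lines, min_severity=5):
    """Return (node, seq, message) for events with no later recovery line.

    Hypothetical helper: events are keyed by node plus message subject,
    and a later "... has deasserted." line with the same subject clears it.
    """
    open_events = {}  # (node, subject) -> (seq, message)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a log entry
        node = match.group("node").strip()
        seq = int(match.group("seq"))
        sev = int(match.group("sev"))
        msg = match.group("msg").strip()

        if msg.endswith(DEASSERTED):
            # Recovery of any severity clears the matching open event.
            subject = msg[: -len(DEASSERTED)]
            open_events.pop((node, subject), None)
        elif sev >= min_severity:
            subject = msg[: -len(ASSERTED)] if msg.endswith(ASSERTED) else msg
            # A repeated assertion overwrites the earlier one.
            open_events[(node, subject)] = (seq, msg)

    return sorted(
        (node, seq, msg) for (node, _), (seq, msg) in open_events.items()
    )
```

One way to use it would be to save the unfiltered output of "pasu mynode immapp showimmlog" to a file and pass the file's lines to unrecovered_events. Messages without a matching recovery pattern stay in the result, so the list errs on the side of showing too much rather than hiding an active event.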
On a related note, does anyone know of a way with xCAT (pasu or otherwise) to view status/info about the following via the command-line from an xCAT management node:
1. IMM web interface: System Status -> System Information -> Check Log LED [I suspect the status here corresponds to the status of the "Check log LED" on the front of the server].
2. Front of the server: "System-error LED"
3. IMM web interface: System Status -> Hardware Health: status of each component type (i.e., "Cooling Devices", "Power Modules", "Local Storage", "Processors", "Memory", "System")
Thanks very much,
Jake Rundall