Discussion:
[xcat-user] using xCAT to view "Active Events" for Lenovo System x servers
Rundall, Jacob D
2017-05-17 19:12:49 UTC
I’m curious if anybody can help me figure out how to use xCAT to view “Active Events” for Lenovo System x servers, as shown in the web interface of the IMM. Using pasu gets me somewhere, as follows:
pasu mynode immapp showimmlog | grep "Severity:5"
There are a few shortcomings, though, as compared to the web interface of the IMM:

1. pasu shows me past events that are no longer active (and the recovery events are lower severity so they don’t make it through the grep, so it’s not obvious that the events have been recovered from, at least not with this command).
2. pasu only returns items with some kind of sequence number rather than a date and time.
3. The web interface also sometimes has “Additional Information for Event” as well, which I cannot figure out how to view using pasu.

Here is an example of what I can see in the IMM web interface:
Error System 25 June 2016, 03:14:40.788 AM An Uncorrectable Error has occurred on PCIs.
Error System 25 June 2016, 03:15:13.638 AM Fault in slot 3 on system System x3650 M5. <more>

Clicking “more” on the latter provides the following additional information:
[S.68005] An error has been detected by the the IIO core logic on CPU 1. The Global Fatal Error Status register contains 0x0. The Global Non-Fatal Error Status register contains 0x40. Please check error logs for the presence of additional downstream device error data.

And here’s the output that I get using my pasu command shown above (with grep):
monitor01: 19 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 22 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 27 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 49 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 56 | Severity:5 | Message:Redundancy Lost for Power Unit has asserted.
monitor01: 125 | Severity:5 | Message:A Fatal Bus Error has occurred on bus CPU 2 PECI.
monitor01: 126 | Severity:5 | Message:An Uncorrectable Error has occurred on PCIs.
monitor01: 128 | Severity:5 | Message:Fault in slot 3 on system System x3650 M5.
monitor01: 138 | Severity:5 | Message:A Fatal Bus Error has occurred on bus CPU 2 PECI.
monitor01: 164 | Severity:5 | Message:A Fatal Bus Error has occurred on bus CPU 2 PECI.

Events 126 and 128 clearly correspond to what is shown as “Active Events” in the web interface. But it’s not obvious that the others are not active unless I dig deeper into the IMM log (e.g., without filtering through grep). When I do that, I can eventually find subsequent recovery events for the other severity 5 events, which show why they are not considered “active”.
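
As a possible alternative to the plain grep, the pasu lines could be parsed into fields so that filtering and later comparison by sequence number become easier. This is only a minimal sketch: the line shape is assumed from the sample output above, and the names ImmLogEntry, parse_line and entries_at_severity are hypothetical helpers, not part of xCAT or pasu.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed line shape, taken from the sample output in this message:
#   "<node>: <sequence> | Severity:<n> | Message:<text>"
LINE_RE = re.compile(
    r"^(?P<node>\S+):\s*(?P<seq>\d+)\s*\|\s*Severity:(?P<sev>\d+)"
    r"\s*\|\s*Message:(?P<msg>.*)$"
)


class ImmLogEntry(NamedTuple):
    """One parsed log line (hypothetical helper type)."""
    node: str
    seq: int
    severity: int
    message: str


def parse_line(line: str) -> Optional[ImmLogEntry]:
    """Return the parsed entry, or None if the line has a different shape."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return ImmLogEntry(m["node"], int(m["seq"]), int(m["sev"]), m["msg"].strip())


def entries_at_severity(text: str, severity: int) -> List[ImmLogEntry]:
    """Keep only entries whose severity equals the requested value."""
    parsed = (parse_line(line) for line in text.splitlines())
    return [e for e in parsed if e is not None and e.severity == severity]
```

The sketch does not try to pair an event with its recovery event, because the wording of the recovery messages is not shown in this thread; it only replaces the grep step with structured output.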


On a related note, does anyone know of a way with xCAT (pasu or otherwise) to view status/info about the following from the command line on an xCAT management node:

1. IMM web interface: System Status -> System Information -> Check Log LED [I suspect the status here corresponds to the status of the “Check log LED” on the front of the server].
2. Front of the server: “System-error LED”
3. IMM web interface: System Status -> Hardware Health: status of each component type (i.e., “Cooling Devices”, “Power Modules”, “Local Storage”, “Processors”, “Memory”, “System”)


Thanks very much,

Jake Rundall
Jarrod Johnson
2017-05-17 19:18:38 UTC
In confluent, a new command was added:

# nodehealth n1
n1: critical (Mezz Exp 2 Fault:Critical)
[***@odin ~]# nodehealth r1
r1: ok

In xCAT, rvitals with the leds option can do a serviceable job of showing the error lights:
# rvitals <noderange> leds

For example:
# rvitals n1 leds
n1: LED 0x0000 (Fault) active to indicate system error condition.
n1: LED 0271 (Mezz Exp 2) active to indicate Sensor 0x62 (Mezz Exp 2 Fault) error.


Christian Caruthers
2017-05-17 19:25:04 UTC
Have you looked at 'rvitals mynode leds'?

Regards,
Christian Caruthers
Lenovo Professional Services
Mobile: 757-289-9872

Rundall, Jacob D
2017-05-17 20:34:27 UTC
Thanks, Jarrod and Christian. I was not aware of rvitals but that seems to do exactly what I need. And it also shows me that I need to spend some more time reading the xCAT docs, including the Hardware Management section. Doh!

Jake

Gilad Berman
2017-05-23 14:33:16 UTC
Picking up on this a bit late... but this might also be useful.

First, Confluent does a great job with nodehealth, and indeed you do not need more.
However, if you want another method of getting the active events, an SNMP query is pretty simple, and Lenovo creates a table with all active events:

(This is a quote from another email, sorry in advance for bad context 😊 )

IMM status:
For IMM, there is a simple SNMP query that will provide the status and the currently active errors:
# snmpwalk -c public -v 1 -m IMM-MIB 10.10.102.122 .1.3.6.1.4.1.2.3.51.3.1.4
IMM-MIB::systemHealthStat.0 = INTEGER: critical(2)
IMM-MIB::systemHealthSummaryIndex.1 = INTEGER: 1
IMM-MIB::systemHealthSummarySeverity.1 = STRING: "Error"
IMM-MIB::systemHealthSummaryDescription.1 = STRING: "The Drive 4 has been disabled due to a detected fault."

This query actually contains both the status and the table of errors. If you want to get only the status:
# snmpwalk -c public -v 1 -m IMM-MIB 10.10.102.122 .1.3.6.1.4.1.2.3.51.3.1.4.1
IMM-MIB::systemHealthStat.0 = INTEGER: critical(2)
Possible values: INTEGER {nonRecoverable(0), critical(2), nonCritical(4), normal(255)}

And if you want to get only the errors (I guess it makes sense to check the status first and check the errors only if there is an issue):
# snmpwalk -c public -v 1 -m IMM-MIB 10.10.102.122 .1.3.6.1.4.1.2.3.51.3.1.4.2
IMM-MIB::systemHealthSummaryIndex.1 = INTEGER: 1
IMM-MIB::systemHealthSummarySeverity.1 = STRING: "Error"
IMM-MIB::systemHealthSummaryDescription.1 = STRING: "The Drive 4 has been disabled due to a detected fault."
Note that this is a table, so you will get all the errors in one output. If you want them one by one, you need to add the index number at the end of the OID.
Note that this table is empty if the status is normal.
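
For scripting, the snmpwalk output above could be turned into a status plus a list of active errors. The following is a rough sketch only: it assumes the output looks exactly like the sample in this message (names resolved through IMM-MIB), and parse_health is a hypothetical helper, not something shipped with the IMM or net-snmp.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed line shape, from the sample output in this message:
#   "IMM-MIB::<object>.<index> = <TYPE>: <value>"
VARBIND_RE = re.compile(
    r"^IMM-MIB::(?P<name>\w+)\.(?P<index>\d+)\s*=\s*\w+:\s*(?P<value>.*)$"
)


def parse_health(text: str) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Return (status, [(severity, description), ...]) from snmpwalk text.

    status is None if no systemHealthStat line is present; the event list
    is empty when the summary table has no rows.
    """
    status: Optional[str] = None
    severity: Dict[int, str] = {}
    description: Dict[int, str] = {}
    for line in text.splitlines():
        m = VARBIND_RE.match(line.strip())
        if m is None:
            continue
        index = int(m["index"])
        value = m["value"].strip().strip('"')  # drop the quotes around STRING values
        if m["name"] == "systemHealthStat":
            status = value
        elif m["name"] == "systemHealthSummarySeverity":
            severity[index] = value
        elif m["name"] == "systemHealthSummaryDescription":
            description[index] = value
    # Pair severity and description by table index, in index order.
    events = [(severity.get(i, ""), description[i]) for i in sorted(description)]
    return status, events
```

Feeding it the output of the first query above would give the overall status string and one (severity, description) pair per row of the table, which matches the "check status first, then errors" approach.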

The IMM-MIB is available with the IMM FW.

Hope this helps. We use it quite often (where Confluent is not available, and because SNMP is fast). Let me know if you need more details.

Gilad Berman
HPC Architect
Lenovo EMEA
+972-52-2554262
***@lenovo.com
 

_______________________________________________
xCAT-user mailing list
xCAT-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
