Discussion:
[xcat-user] Confluent as console server. Consoles hangs ~after 24h.
banuchka
2017-04-13 07:22:28 UTC
Hi,

I'm trying to migrate completely from conserver to confluent, but I've run
into some strange behaviour.
Some of my consoles hang after about 24 hours: no new messages appear in
their logs or in rcons.
I write timestamped messages from the OS to /dev/console every 30-60 minutes
and watch for them as a console availability check.
If I open rcons and hit Enter, the console wakes up after a few seconds
(strange). I didn't see this happen with conserver, but maybe I'm wrong...
Some details:
- As far as I can see, most of the consoles that hang are Dell iDRAC. It
doesn't matter whether RacSerial or IPMISerial is in use.
- racreset hard / ipmitool bmc reset didn't help.
- Hitting Enter in the console wakes it up (for example, I can send
\r\n\f with expect, but that is an ugly workaround).
- I haven't tried cleaning confluent's configuration and restarting it. I'm
not sure it would help.
- HP consoles work well over the same IPMI.
- A few consoles with a custom plugin work fine as well.

So maybe my question isn't really about confluent, but if any of you know
about similar problems, please share! ;)
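
For readers who want to reproduce this kind of availability check, here is a minimal Python sketch. It is an illustration only, not the monitoring actually used here: the line format `MONITORING_TEST <node> <epoch>` is taken from the console output quoted later in this thread, while the function names, the `/dev/console` writer and the two-hour staleness threshold are assumptions.

```python
import re
import time

# Heartbeat format as it appears in the console output later in this thread:
#   MONITORING_TEST <node> <unix epoch seconds>
HEARTBEAT_RE = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def format_heartbeat(node, now=None):
    """Build one heartbeat line (hypothetical helper)."""
    stamp = int(time.time() if now is None else now)
    return "MONITORING_TEST %s %d" % (node, stamp)


def write_heartbeat(node, device="/dev/console"):
    """Run on the node itself, e.g. from cron every 30-60 minutes."""
    with open(device, "w") as console:
        console.write(format_heartbeat(node) + "\r\n")


def last_heartbeat(log_text, node):
    """Return the newest epoch logged for `node`, or None if there is none."""
    stamps = [int(ts) for name, ts in HEARTBEAT_RE.findall(log_text) if name == node]
    return max(stamps) if stamps else None


def is_stale(log_text, node, now, max_age=2 * 3600):
    """True when no heartbeat for `node` arrived within `max_age` seconds."""
    newest = last_heartbeat(log_text, node)
    return newest is None or now - newest > max_age
```

On the console server side, `log_text` would be whatever the console server has logged for the node; a console that keeps reporting `is_stale(...)` as true is a candidate for the hang described above.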
--
banuchka
banuchka
2017-04-13 10:30:13 UTC
It looks like this problem has been seen before.
The fix then was to use ipmitool with keepalive (the one from the xcat repos).
Here pyghmi is used; maybe that is the reason?

-- 
banuchka
banuchka
2017-04-13 16:03:19 UTC
It is a Dell-related problem (I'm not 100% sure), but

Confluent from current master is behaving well :)
Thanks for the very nice tool "confluentdbutil".

-- 
banuchka
banuchka
2017-04-14 09:55:01 UTC
My last reply was incorrect. The problem is still here. I'm trying to find
something useful between confluent and pyghmi...
Restarting confluent clears the hangs and reopens all connections.
I don't think restarting confluent once or twice every 24 hours is the best option.
--
banuchka
Jarrod Johnson
2017-04-14 11:37:57 UTC
If you press ctrl-e, c, o, does that restore the console once it has hung?

Can you tell whether it hangs after exactly 24 hours, on the dot?

When the console is hung, does ‘ipmitool sol activate’ say ‘session already active’?

Do any interesting events crop up in /var/log/confluent/consoles/<nodename>?

Pyghmi does keepalive as well, and if keepalive were the problem, the hang should show up much sooner than 24 hours. In fact, it should be checking every couple of minutes that the SOL payload is active and owned by confluent specifically.

banuchka
2017-04-14 11:52:48 UTC
I can see keepalive messages going to the consoles, and I've also seen “bad udp cksum”; I hope that is the problem.

I've turned off TX/RX offloading (and generic offloading as well) on my Ethernet card. Now tcpdump on port 623 looks much better.

-- 
banuchka
Jarrod Johnson
2017-04-14 12:05:59 UTC
A bad UDP checksum is a capture artifact of wireshark (tcpdump shows the same thing) when offloading is enabled. The capture sees the data before the TX-offloaded checksumming occurs, so the TX looks incorrect, but that is intentional because the hardware fills it in instead. RX should look fine, and every TX should *look* like a bad checksum if offload is enabled, but it is actually going to be fine on the wire.
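
One way to check a capture against this is to count the checksum verdicts per direction. The Python sketch below is an illustration under assumptions: it expects two-line `tcpdump -v` text output in the style of the excerpts elsewhere in this thread, it only looks for the markers `udp sum ok` and `bad udp cksum`, and the function name and the `local_ip` parameter are hypothetical.

```python
from collections import Counter


def checksum_verdicts(lines, local_ip):
    """Count UDP checksum verdicts per direction in `tcpdump -v` text output.

    Only the address lines ("src.port > dst.port: [...]") are inspected.
    `local_ip` is the address of the capturing host (an assumption of this sketch).
    """
    counts = Counter()
    for line in lines:
        text = line.strip()
        if " > " not in text:
            continue  # header line with timestamp, ttl, length, ...
        if "bad udp cksum" in text:
            verdict = "bad"
        elif "udp sum ok" in text:
            verdict = "ok"
        else:
            continue
        source = text.split(" > ", 1)[0]
        # tcpdump prints "a.b.c.d.port"; drop the trailing port.
        source_ip = source.rsplit(".", 1)[0]
        direction = "tx" if source_ip == local_ip else "rx"
        counts[(direction, verdict)] += 1
    return counts
```

If offload is the only cause, bad verdicts should appear for `tx` only; bad checksums on `rx` would point at a real problem on the wire.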


banuchka
2017-04-14 12:13:58 UTC
I need a few hours to see how things go.
Maybe I was wrong to disable offloading... but it looks stable now, no hangs.
--
banuchka
banuchka
2017-04-14 13:29:12 UTC
I'm out of ideas, so let me show you everything I see.

Inside rcons I see:

MONITORING_TEST dbb54 1492160401 <= the last message I sent from the OS (fuller log below)

tcpdump (keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80




13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
  Porᅵ
---

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

and, like magic, rcons:
---
  Porᅵlo]0;console: dbb54 [13:25]


dbb54 login:
---

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

Can you tell that it goes after exactly 24hours on the dot?

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Password:
Info: SOL payload already active on another session

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]
(many of our own messages)
^MMONITORING_TEST dbb54 1492160401 | <== this is the last message from the OS; # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
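
To line up the connect/disconnect events with the last heartbeat, the bracketed lines can be pulled out of the log with a few lines of Python. This is a sketch under assumptions: the `[MM/DD HH:MM:SS text]` layout is inferred only from the excerpt above, not from any documented confluent log format, and the function names are hypothetical.

```python
import re

# Event lines as they appear in the excerpt: [MM/DD HH:MM:SS free text]
EVENT_RE = re.compile(r"^\[(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) ([^\]]+)\]\s*$")


def parse_events(log_text):
    """Return (date, time, message) tuples for every bracketed event line."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(match.groups())
    return events


def reconnects_after(events, date, time_of_day):
    """Events at or after the given 'MM/DD' and 'HH:MM:SS' (same year assumed)."""
    return [e for e in events if (e[0], e[1]) >= (date, time_of_day)]
```

Calling `reconnects_after(parse_events(text), "04/14", "09:00:01")` lists what the console server logged from the time of the last heartbeat onward, assuming the log timestamps and the heartbeat use the same timezone.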


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

Yes, that's correct.


 

-- 
banuchka
Jarrod Johnson
2017-04-14 14:46:08 UTC
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

I’m out of ideas, let me show you all i see.

Inside rcons i see:

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

tcpdump(keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80




13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
Porᅵ
—

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

and Magic, rcons:
---
Porï¿œlo]0;console: dbb54 [13:25]


dbb54 login:
---


On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you ctrl-e, c, o, does it restore the console after the time?

Can you tell that it goes after exactly 24hours on the dot?

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


 many our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours. In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

Yes, that's correct.


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

My last reply was incorrect. The problem is still here. I'm trying to find something useful in confluent/pyghmi...
Restarting confluent clears the hangs and reopens all connections.
I don't think restarting confluent once or twice every 24h is the best option.
--
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:
It is a Dell-related problem, though I'm not 100% sure, but

Confluent from current master is doing things well :)
Thanks for the pretty nice tool “confluentdbutil”.


banuchka
2017-04-14 14:52:27 UTC
Permalink
The console starts showing garbage after I hit <enter> inside rcons.
What do you mean by “restarting the console”?
The console resumes working after:
- <enter> inside rcons/confetty
- a BMC reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers, we did not have these problems.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

Jarrod Johnson
2017-04-14 14:53:32 UTC
Permalink
Press ‘ctrl-e, then c, then o’ to reconnect.

Was conserver set to on-demand or full logging?

banuchka
2017-04-14 14:54:52 UTC
Permalink
Full logging

banuchka
2017-04-14 15:00:22 UTC
Permalink
Reopening the console did the trick as well...

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Console starts showing garbage after <enter> inside rcons.

What do you mean when said “restarting console”?

Console continue its work after:

- <enter> inside rcons/confetty

- bmc reset (console disconnected/console connected)

 

You’re absolutely right with ipmitool and conserver with the same servers we were out of such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I’m out of ideas, let me show you all i see.

 

Inside rcons i see:

 

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

 

tcpdump(keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 




 

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

—

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
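A capture like the one above is easier to read once it is reduced to direction and UDP payload size. The following is only a rough sketch: `parse_capture` and the `from_bmc`/`to_bmc` labels are names invented here, and it assumes the two-line `tcpdump -v` layout shown in this thread. In the sample, the single 176-byte packet from the BMC stands out from the 64/80-byte exchanges; it presumably carries the console output that appears in rcons right afterwards, but that is a guess and not something the capture proves.

```python
import re
from collections import Counter

# First line of each tcpdump -v record: timestamp followed by the IP header summary.
HEADER = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d+) IP \(")
# Second line: "src.port > dst.port: ... UDP, length N".
FLOW = re.compile(r"^\s*(\S+)\.(\d+) > (\S+)\.(\d+): .*UDP, length (\d+)")


def parse_capture(text, bmc_ip):
    """Return a list of (timestamp, direction, udp_length) tuples."""
    packets = []
    stamp = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            stamp = header.group(1)
            continue
        flow = FLOW.match(line)
        if flow and stamp is not None:
            # Direction is decided by comparing the source address with the BMC.
            direction = "from_bmc" if flow.group(1) == bmc_ip else "to_bmc"
            packets.append((stamp, direction, int(flow.group(5))))
            stamp = None
    return packets


# Four records copied from the capture in this thread.
SAMPLE = """\
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
"""

packets = parse_capture(SAMPLE, bmc_ip="10.10.106.155")
sizes = Counter((direction, length) for _, direction, length in packets)
largest = max(packets, key=lambda p: p[2])
print(sizes)
print("largest payload:", largest)
```

Feeding the whole capture through the same function would show how regular the small exchanges are compared with the occasional larger packet.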

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]

... many of our own messages ...

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]
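The availability check described in this thread (a timestamped marker written to /dev/console, then looked up in the console log) can be sketched roughly as follows. This is a hypothetical helper, not part of confluent: the function names and the two-hour threshold are assumptions, and only the `MONITORING_TEST <node> <epoch>` marker format is taken from the log excerpt above.

```python
import re
from datetime import datetime, timezone

# Marker format as it appears in the console log: MONITORING_TEST <node> <epoch>
MARKER = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text):
    """Return (node, UTC datetime) of the last marker in the log, or None."""
    matches = MARKER.findall(log_text)
    if not matches:
        return None
    node, epoch = matches[-1]
    return node, datetime.fromtimestamp(int(epoch), tz=timezone.utc)


def is_stale(log_text, now, max_age_seconds=2 * 3600):
    """True if no marker was logged within max_age_seconds before `now`."""
    heartbeat = last_heartbeat(log_text)
    if heartbeat is None:
        return True
    return (now - heartbeat[1]).total_seconds() > max_age_seconds


# Excerpt modelled on the log shown in this thread.
SAMPLE_LOG = """\
[04/13 15:17:21 console connected]
^MMONITORING_TEST dbb54 1492160401
^M
[04/14 09:05:13 console connected]
"""

node, seen = last_heartbeat(SAMPLE_LOG)
print(node, seen.isoformat())

# `now` is passed in explicitly so the check is easy to test.
check_time = datetime(2017, 4, 14, 13, 14, 30, tzinfo=timezone.utc)
print("stale:", is_stale(SAMPLE_LOG, check_time))
```

With markers sent every 30-60 minutes, a threshold of about two intervals avoids false alarms from a single late message; the right value depends on the actual schedule.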


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. The problem is still here. I’m trying to find something useful between confluent and pyghmi...

Restarting confluent clears the hangs and reopens all connections.

I don’t think restarting confluent once or twice every 24 hours is the best option.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It is a Dell-related problem, not 100% but...

Confluent from current master is doing things well :)

Thanks for the pretty nice tool “confluentdbutil”.

 

Jarrod Johnson
2017-04-14 15:01:46 UTC
Permalink
So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Reopening the console did the trick as well...


On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:
‘ctrl-e, then c, then o’ to reconnect.

Was conserver ondemand or full logging?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting console”?
The console continues working after:
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such trouble.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

banuchka
2017-04-14 15:08:42 UTC
Permalink
Yes, reopening causes it to work again, without any garbage... so it looks like a normal console :)
Hitting <enter> first produces garbage output (ᅵᅵ Porᅵlo) and only then a *normal console*...

Jarrod Johnson
2017-04-14 15:14:17 UTC
Permalink
And to be clear, the corruption only starts after a long period of time of being continuously connected?

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.
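Comparing two `ipmitool sol info` outputs by eye is error-prone, so a small diff helper may be useful. This is only a sketch under assumptions: `parse_sol_info` and `diff_sol_info` are names made up here, the parser assumes the `Name : value` layout that ipmitool prints, and the two sample texts below use invented values that are not taken from any system in this thread.

```python
def parse_sol_info(text):
    """Parse 'Name : value' lines from `ipmitool sol info` output into a dict."""
    params = {}
    for line in text.splitlines():
        # Skip informational lines and anything that is not a key/value pair.
        if line.startswith("Info:") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if not value.strip():
            continue  # e.g. the "Password:" prompt
        params[key.strip()] = value.strip()
    return params


def diff_sol_info(working, broken):
    """Return {parameter: (working_value, broken_value)} for differing parameters."""
    a, b = parse_sol_info(working), parse_sol_info(broken)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Invented sample values, for illustration only.
WORKING = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Retry Count                     : 7
Retry Interval (ms)             : 480
"""
BROKEN = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Retry Count                     : 7
Retry Interval (ms)             : 500
"""

changes = diff_sol_info(WORKING, BROKEN)
for name, (good, bad) in changes.items():
    print(f"{name}: working={good} corrupted={bad}")
```

An empty result would suggest that the SOL configuration itself is unchanged between the two states and that the difference lies elsewhere.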

banuchka
2017-04-14 15:27:58 UTC
Permalink
On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

Corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
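Comparing the two captures by eye is error-prone, so here is a minimal Python sketch that parses the "Key : value" lines of `ipmitool sol info` and reports any parameter that differs between two captures. The function names (`parse_sol_info`, `diff_sol_info`) and the shortened sample strings are illustrative assumptions, not part of ipmitool, confluent, or pyghmi.

```python
def parse_sol_info(text):
    """Turn the 'Key : value' lines of `ipmitool sol info` into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, the echoed command, the password prompt and the notice line.
        if not line or line.startswith(("#", "Password:", "Info:")):
            continue
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def diff_sol_info(first, second):
    """Return {key: (first_value, second_value)} for every differing parameter."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k))
            for k in keys if first.get(k) != second.get(k)}


# Shortened copies of the two captures posted above (assumed representative).
CORRUPTED = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Port                    : 623
"""
WORKING = CORRUPTED  # the posted "working" capture is line-for-line the same

if __name__ == "__main__":
    print(diff_sol_info(parse_sol_info(CORRUPTED), parse_sol_info(WORKING)))
```

An empty result means the SOL configuration did not change between the corrupted and working states, which points away from these parameters drifting over time.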


 

Jarrod Johnson
2017-04-14 15:30:21 UTC
Permalink
Hmm, what’s the baud rate the console is actually running at? Odd to see the volatile and non volatile bit rates not be the same.

banuchka
2017-04-14 15:36:22 UTC
Permalink
115200

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

That is strange, right?
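To make the mismatch explicit, here is a small Python sketch that reads `BaudRate` from the `idracadm7 get iDRAC.IPMISerial` output and compares it with the SOL bit rates that `ipmitool sol info` reports in kbps. The helper names and the hard-coded sample values are assumptions for illustration; the numbers are copied from the outputs posted in this thread.

```python
def idrac_baud(text):
    """Return BaudRate (bits per second) from 'key=value' idracadm7 output."""
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "BaudRate":
            return int(value)
    return None


def kbps_to_bps(value):
    """Convert a kbps string such as '38.4' to an integer bit rate."""
    return round(float(value) * 1000)


def mismatched_rates(idrac_text, sol_rates_kbps):
    """Return the SOL rates (in bps) that differ from the iDRAC serial baud rate."""
    baud = idrac_baud(idrac_text)
    return {name: kbps_to_bps(rate)
            for name, rate in sol_rates_kbps.items()
            if kbps_to_bps(rate) != baud}


# Values taken from the outputs posted in this thread.
IDRAC_OUTPUT = """\
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
FlowControl=RTS/CTS
"""
SOL_RATES = {
    "Volatile Bit Rate (kbps)": "38.4",
    "Non-Volatile Bit Rate (kbps)": "115.2",
}

if __name__ == "__main__":
    print(mismatched_rates(IDRAC_OUTPUT, SOL_RATES))
```

With the values above, only the volatile bit rate (38400) disagrees with the 115200 configured on the iDRAC. Whether that mismatch is related to the garbage characters is not established in this thread.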


I can open rcons and hit enter, after few secs console is waking up(strange). I didnt see it happen with conserver or maybe im wrong...

Some details:

- as i can see the bigest part of consoles with hangs behaviour are Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in use.

- racreset hard/ipmitool bmc reset didnt do the things

- hit enter to console wake it up(for example with expect i can send \r\n\f, but it looks bad)

- i didnt try to clean confluent's conf and restart it. Not sure it may help.

- HP consoles works well, same ipmi

- few consoles with custom pluging works good as well

 

So maybe my question is not about confluent, but if some of you have some knowledge about same problems please share it! ;)

 

-- 
banuchka

------------------------------------------------------------------------------ 
Check out the vibrant tech community on one of the world's most 
engaging tech sites, Slashdot.org! http://sdm.link/slashdot_______________________________________________ 
xCAT-user mailing list 
xCAT-***@lists.sourceforge.net 
https://lists.sourceforge.net/lists/listinfo/xcat-user 

 

 

 

Jarrod Johnson
2017-04-14 15:38:42 UTC
Permalink
If you do have any consoles in the corrupted state, I would be interested to see what happens if you run:
ipmitool sol set volatile-bit-rate 115.2 1

This changes the volatile bit rate to match the non-volatile bit rate, to see whether the corruption goes away.
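
To compare the two bit rates across many BMCs without reading each report by eye, the text printed by ipmitool sol info can be parsed with a few lines of Python. This is only a rough sketch: the key names are copied from the ipmitool output quoted in this thread, the function names parse_sol_info and bit_rate_mismatch are made up for illustration, and a mismatch is merely something worth noting, not a confirmed cause of the corruption.

```python
def parse_sol_info(text):
    """Turn 'Key : value' lines of `ipmitool sol info` output into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and informational lines such as "Info: SOL parameter ..."
        if not line or line.startswith("Info:"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def bit_rate_mismatch(info):
    """Return (volatile, non_volatile) if both are present and differ, else None."""
    volatile = info.get("Volatile Bit Rate (kbps)")
    non_volatile = info.get("Non-Volatile Bit Rate (kbps)")
    if volatile is None or non_volatile is None or volatile == non_volatile:
        return None
    return volatile, non_volatile


if __name__ == "__main__":
    # Sample text shaped like the ipmitool output quoted in this thread.
    sample = (
        "Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01\n"
        "Volatile Bit Rate (kbps)        : 38.4\n"
        "Non-Volatile Bit Rate (kbps)    : 115.2\n"
    )
    print(bit_rate_mismatch(parse_sol_info(sample)))
```

The output of a real ipmitool run could be captured (for example with subprocess) and passed to parse_sol_info as a string; the sketch deliberately leaves that part out.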

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

115200

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

that is strange, right


On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, what’s the baud rate the console is actually running at? Odd to see the volatile and non volatile bit rates not be the same.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.




On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress : set-complete

Enabled : true

Force Encryption : true

Force Authentication : false

Privilege Level : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold : 255

Retry Count : 7

Retry Interval (ms) : 480

Volatile Bit Rate (kbps) : 38.4

Non-Volatile Bit Rate (kbps) : 115.2

Payload Channel : 1 (0x01)

Payload Port : 623



Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress : set-complete

Enabled : true

Force Encryption : true

Force Authentication : false

Privilege Level : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold : 255

Retry Count : 7

Retry Interval (ms) : 480

Volatile Bit Rate (kbps) : 38.4

Non-Volatile Bit Rate (kbps) : 115.2

Payload Channel : 1 (0x01)

Payload Port : 623


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Yes, reopen causes it to work again, without any garbage
 so looks like normal console :)
Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal console* before...


On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Reopen console did the trick as well...


On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
‘ctrl-e, then c, then o’ to reconnect.

Was conserver ondemand or full logging?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Console starts showing garbage after <enter> inside rcons.
What do you mean when said “restarting console”?
Console continue its work after:
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right with ipmitool and conserver with the same servers we were out of such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

I’m out of ideas, let me show you all i see.

Inside rcons i see:

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

tcpdump(keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80




13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
Porᅵ
—

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

and Magic, rcons:
---
Porï¿œlo]0;console: dbb54 [13:25]


dbb54 login:
---


On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you ctrl-e, c, o, does it restore the console after the time?

Can you tell that it goes after exactly 24hours on the dot?

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


 many our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours. In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net<mailto:xcat-***@lists.sourceforge.net>
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

My last reply was incorrect. Problems still here. Im trying to find something usefull inbetween confluent/pyghmi...
Confluent restart solves hangs/reopen all connections.
I think it isnt the best option to restart confluent 1 or 2 times in 24h.
--
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
It is Dell’s related problem, not 100% but

Confluent from current master is doing things well :)
Thanks for pretty nice tool “confluentdbutil".


On 13 April 2017 at 11:30:14, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Looks like that problem was before
 The fix was to use ipmitool with keepalive(one from xcat repos).
Here pyghmi is used maybe that the reason?


On 13 April 2017 at 08:22:28, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Hi,

Im trying to completely migrate from conserver to confluent, but catch strange behaviour.
Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.
I send messages with timestamp from OS >/dev/console every 30-60min and take a look on them for monitoring purposes(consoles availability monitoring).
I can open rcons and hit enter, after few secs console is waking up(strange). I didnt see it happen with conserver or maybe im wrong...
Some details:
- as i can see the bigest part of consoles with hangs behaviour are Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in use.
- racreset hard/ipmitool bmc reset didnt do the things
- hit enter to console wake it up(for example with expect i can send \r\n\f, but it looks bad)
- i didnt try to clean confluent's conf and restart it. Not sure it may help.
- HP consoles works well, same ipmi
- few consoles with custom pluging works good as well

So maybe my question is not about confluent, but if some of you have some knowledge about same problems please share it! ;)
--
banuchka
banuchka
2017-04-14 15:45:19 UTC
Permalink
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

and nothing happened.

In the console’s log:
---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
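
To see how often a console reconnects, the bracketed events in a log excerpt like the one above can be pulled out with a short script. This is a rough sketch: the regular expression is inferred only from the log lines quoted in this thread, not from confluent documentation, and extract_events is a made-up name used for illustration.

```python
import re

# Pattern inferred from the excerpt in this thread, e.g.
# "[04/14 12:49:12 console disconnected]". It is an assumption,
# not a documented confluent log format.
EVENT_RE = re.compile(r"\[(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) ([^\]]+)\]")


def extract_events(log_text):
    """Return a list of (date, time, event) tuples found in log_text."""
    return EVENT_RE.findall(log_text)


if __name__ == "__main__":
    sample = (
        "[04/14 15:31:47 console disconnected]"
        "[04/14 15:36:24 console connected]"
        "[04/14 15:42:08 connection by xcat_console]"
    )
    for date, time, event in extract_events(sample):
        print(date, time, event)
```

Because the pattern matches each bracketed event on its own, it works whether the events are run together on one line or written one per line.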

Jarrod Johnson
2017-04-14 15:56:03 UTC
Permalink
Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

If it pans out, this should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?). It would require a service confluent restart to see if it has the desired effect.

Sorry I haven’t tested this and can’t think of a root cause, but I am going to take some time off for the weekend.

I would also be curious whether the same ipmitool process is still running a day later when you check (e.g. whether ipmitool is exiting and getting restarted). I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.
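
The counter logic in the diff above can be looked at in isolation. The following is a simplified stand-alone sketch, not pyghmi code: the class name SolKeepaliveGuard and its methods are hypothetical, and only the comparison and the increment mirror the proposed change. The figures in the post (more than 180 checks, roughly 90 minutes) imply about 30 seconds per check; that interval is inferred from the post, not taken from pyghmi.

```python
class SolKeepaliveGuard:
    """Hypothetical stand-in for the counter added in the proposed diff."""

    def __init__(self, max_checks=180):
        self.max_checks = max_checks
        self.count = 0

    def reset(self):
        # Mirrors resetting keepalivecount when a session is (re)established.
        self.count = 0

    def check(self, current_owner, my_session_id):
        """Return True to keep the session, False to force a reconnect."""
        if current_owner != my_session_id or self.count > self.max_checks:
            # Either another session owns the payload, or the counter
            # has passed its limit and the session is deliberately dropped.
            return False
        self.count += 1
        return True


if __name__ == "__main__":
    guard = SolKeepaliveGuard()
    checks = 0
    while guard.check(current_owner=7, my_session_id=7):
        checks += 1
    print("successful checks before forced reconnect:", checks)
    # Interval implied by "180 checks in roughly 90 minutes" (an inference).
    print("implied seconds per check:", 90 * 60 / 180)
```

Because the comparison is "greater than 180" and happens before the increment, the sketch lets 181 checks succeed before the next one forces the reconnect, which is consistent with "roughly every 90 minutes" at about 30 seconds per check.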

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 115.2
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

and nothing happened

in the console’s log
—
[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected][04/14 13:01:02 console disconnected][04/14 13:01:02 console connected][04/14 13:03:54 console disconnected][04/14 13:04:15 console connected][04/14 13:38:37 console connected][04/14 15:31:47 console disconnected][04/14 15:36:24 console connected][04/14 15:42:08 connection by xcat_console]
---
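
To make connect/disconnect churn like the above easier to read, a small Python sketch can split the bracketed markers into (timestamp, event) pairs. The marker format is taken only from the excerpt in this thread, not from confluent documentation, and console_events is a hypothetical helper name. Bracketed text printed by the node itself could match as well, so treat the output as a hint.

```python
import re

# Matches markers such as "[04/14 12:49:12 console disconnected]".
# Assumption: format inferred from the log excerpt in this thread.
EVENT_RE = re.compile(r'\[(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ([^\]]+)\]')


def console_events(log_text):
    """Return a list of (timestamp, event) tuples found in log_text."""
    return EVENT_RE.findall(log_text)


def count_events(log_text, event):
    """Count how often a given event string appears in the log."""
    return sum(1 for _, name in console_events(log_text) if name == event)


if __name__ == '__main__':
    sample = ('[04/14 12:49:12 console disconnected]'
              '[04/14 12:49:29 console connected]'
              '[04/14 15:42:08 connection by xcat_console]')
    for stamp, name in console_events(sample):
        print(stamp, name)
```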


On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you do have any in corrupted state, would be interested to see what happens if you do:
ipmitool sol set volatile-bit-rate 115.2 1


To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.
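
For checking many nodes, the comparison of the two bit rates can be scripted. The Python sketch below parses text in the shape of the 'ipmitool sol info' output pasted in this thread and reports whether the volatile and non-volatile rates agree. parse_sol_info and bit_rates_match are hypothetical helper names. Note that in this thread the same mismatch showed up in both the working and the corrupted state, so this is a consistency check only, not a diagnosis.

```python
SAMPLE = """Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
"""


def parse_sol_info(text):
    """Turn 'key : value' lines into a dict, skipping 'Info:' notices."""
    info = {}
    for line in text.splitlines():
        if line.startswith('Info:') or ':' not in line:
            continue
        key, _, value = line.partition(':')
        info[key.strip()] = value.strip()
    return info


def bit_rates_match(info):
    """True if both rates are present and equal."""
    volatile = info.get('Volatile Bit Rate (kbps)')
    nonvolatile = info.get('Non-Volatile Bit Rate (kbps)')
    return volatile is not None and volatile == nonvolatile


if __name__ == '__main__':
    info = parse_sol_info(SAMPLE)
    print('bit rates match:', bit_rates_match(info))
```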

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

115200

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

that is strange, right


On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, what’s the baud rate the console is actually running at? Odd to see the volatile and non volatile bit rates not be the same.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.




On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623



Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Yes, reopen causes it to work again, without any garbage
 so looks like normal console :)
Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal console* before...


On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Reopen console did the trick as well...


On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
‘ctrl-e, then c, then o’ to reconnect.

Was conserver ondemand or full logging?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Console starts showing garbage after <enter> inside rcons.
What do you mean when said “restarting console”?
Console continue its work after:
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right with ipmitool and conserver with the same servers we were out of such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

I’m out of ideas, let me show you all i see.

Inside rcons i see:

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

tcpdump(keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80




13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
Porᅵ
—

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

and Magic, rcons:
---
Porᅵlo]0;console: dbb54 [13:25]


dbb54 login:
---


On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you ctrl-e, c, o, does it restore the console after the time?

Can you tell that it goes after exactly 24hours on the dot?

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


 many our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours. In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct
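
The ownership check mentioned here is the same one visible in the patch: the first four bytes of the response data are read as a little-endian unsigned 32-bit session id and compared with the local session id. A minimal Python sketch of just that decoding step follows. payload_owner and owned_by are hypothetical names, and the input is assumed to be a list of integer byte values as the patch context suggests.

```python
import struct


def payload_owner(data):
    """Decode the first four bytes as a little-endian unsigned 32-bit id."""
    return struct.unpack('<I', struct.pack('4B', *data[:4]))[0]


def owned_by(data, sessionid):
    """True if the decoded owner id equals our own session id."""
    return payload_owner(data) == sessionid


if __name__ == '__main__':
    response_data = [0x78, 0x56, 0x34, 0x12, 0x00]  # made-up example bytes
    print(hex(payload_owner(response_data)))
```

If the decoded id differs from the session's own id, the patch treats the SOL payload as deactivated or taken over by another session.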


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net<mailto:xcat-***@lists.sourceforge.net>
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

My last reply was incorrect. Problems still here. Im trying to find something usefull inbetween confluent/pyghmi...
Confluent restart solves hangs/reopen all connections.
I think it isnt the best option to restart confluent 1 or 2 times in 24h.
--
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
It is Dell’s related problem, not 100% but

Confluent from current master is doing things well :)
Thanks for pretty nice tool “confluentdbutil".


On 13 April 2017 at 11:30:14, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Looks like that problem was before
 The fix was to use ipmitool with keepalive(one from xcat repos).
Here pyghmi is used maybe that the reason?


On 13 April 2017 at 08:22:28, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Hi,

Im trying to completely migrate from conserver to confluent, but catch strange behaviour.
Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.
I send messages with timestamp from OS >/dev/console every 30-60min and take a look on them for monitoring purposes(consoles availability monitoring).
I can open rcons and hit enter, after few secs console is waking up(strange). I didnt see it happen with conserver or maybe im wrong...
Some details:
- as i can see the bigest part of consoles with hangs behaviour are Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in use.
- racreset hard/ipmitool bmc reset didnt do the things
- hit enter to console wake it up(for example with expect i can send \r\n\f, but it looks bad)
- i didnt try to clean confluent's conf and restart it. Not sure it may help.
- HP consoles works well, same ipmi
- few consoles with custom pluging works good as well

So maybe my question is not about confluent, but if some of you have some knowledge about same problems please share it! ;)
--
banuchka
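
The availability check described in this message (timestamped lines written to /dev/console and later looked up in the console log) could be automated roughly as follows. This is a sketch under assumptions: the line format 'MONITORING_TEST <node> <epoch seconds>' is taken from the log excerpts in this thread, the one-hour threshold is arbitrary, and last_heartbeat and is_stale are hypothetical names.

```python
import re
import time

# Assumption: heartbeat lines look like "MONITORING_TEST dbb54 1492160401".
HEARTBEAT_RE = re.compile(r'MONITORING_TEST (\S+) (\d+)')


def last_heartbeat(log_text, node):
    """Return the newest heartbeat epoch for node, or None if absent."""
    stamps = [int(ts) for name, ts in HEARTBEAT_RE.findall(log_text)
              if name == node]
    return max(stamps) if stamps else None


def is_stale(log_text, node, now=None, max_age_s=3600):
    """True if no heartbeat was logged for node within max_age_s seconds."""
    now = time.time() if now is None else now
    last = last_heartbeat(log_text, node)
    return last is None or now - last > max_age_s


if __name__ == '__main__':
    log = '\rMONITORING_TEST dbb54 1492160401\n\r\n'
    print(is_stale(log, 'dbb54', now=1492160401 + 7200))
```

A console that has silently stopped delivering output would show up as stale once the newest logged heartbeat is older than the chosen threshold.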
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
xCAT-user mailing list
xCAT-***@lists.sourceforge.net<mailto:xCAT-***@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/xcat-user
banuchka
2017-04-14 16:01:29 UTC
Permalink
Thank you Jarrod, I’ll try to apply the patch and let you know afterwards. Yes, I hope 90 minutes is enough.

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

 

If it pans out, it should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?). It would require a 'service confluent restart' to see if it had the desired effect.

 

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

 

I would be curious if the same ipmitool is running a day later than a check (e.g. if ipmitool is exiting and getting restarted).  I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

 

in the console’s log

—

[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected][04/14 13:01:02 console disconnected][04/14 13:01:02 console connected][04/14 13:03:54 console disconnected][04/14 13:04:15 console connected][04/14 13:38:37 console connected][04/14 15:31:47 console disconnected][04/14 15:36:24 console connected][04/14 15:42:08 connection by xcat_console]

---

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial

[Key=iDRAC.Embedded.1#IPMISerial.1]

BaudRate=115200

ChanPrivLimit=4

ConnectionMode=Terminal

DeleteControl=Disabled

EchoControl=Enabled

FlowControl=RTS/CTS

HandshakeControl=Enabled

InputNewLineSeq=1

LineEdit=Enabled

NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623

 

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Yes, reopen causes it to work again,  without any garbage
 so looks like normal console :)

Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal console* before...

 

On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:

So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Reopen console did the trick as well...

 

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Console starts showing garbage after <enter> inside rcons.

What do you mean when said “restarting console”?

Console continue its work after:

- <enter> inside rcons/confetty

- bmc reset (console disconnected/console connected)

 

You’re absolutely right with ipmitool and conserver with the same servers we were out of such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I’m out of ideas, let me show you all i see.

 

Inside rcons i see:

 

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

 

tcpdump(keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 




 

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

—

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176

13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


 many our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. Problems still here. Im trying to find something usefull inbetween confluent/pyghmi...

Confluent restart solves hangs/reopen all connections.

I think it isnt the best option to restart confluent 1 or 2 times in 24h.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It is Dell’s related problem, not 100% but


Confluent from current master is doing things well :) 

Thanks for pretty nice tool “confluentdbutil".

 

On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:

Looks like that problem was before
 The fix was to use ipmitool with keepalive(one from xcat repos).

Here pyghmi is used maybe that the reason?

 

On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:

Hi, 

 

Im trying to completely migrate from conserver to confluent, but catch strange behaviour.

Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.

I send messages with timestamp from OS >/dev/console every 30-60min and take a look on them for monitoring purposes(consoles availability monitoring).

I can open rcons and hit enter, after few secs console is waking up(strange). I didnt see it happen with conserver or maybe im wrong...

Some details:

- as i can see the bigest part of consoles with hangs behaviour are Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in use.

- racreset hard/ipmitool bmc reset didnt do the things

- hit enter to console wake it up(for example with expect i can send \r\n\f, but it looks bad)

- i didnt try to clean confluent's conf and restart it. Not sure it may help.

- HP consoles works well, same ipmi

- few consoles with custom pluging works good as well

 

So maybe my question is not about confluent, but if some of you have some knowledge about same problems please share it! ;)

 

-- 
banuchka

Jarrod Johnson
2017-04-14 18:26:47 UTC
Permalink
Very interested in the outcome, and thank you for working through it. I am also interested in what you have liked, would like, and have disliked about confluent.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Thank you Jarrod, I’ll try to apply the patch and let you know afterwards. Yes, I hope 90 minutes is enough.


On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

If it pans out, this should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?). It would require a ‘service confluent restart’ to see if it had the desired effect.
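
To put a number on the threshold in the patch, here is a small standalone sketch of the same counter logic. It is not pyghmi code: `KeepaliveGuard` is a hypothetical stand-in, and the 30-second check interval is an assumption inferred only from "180" and "roughly every 90 minutes".

```python
# Illustrative sketch only: KeepaliveGuard is a hypothetical stand-in for the
# counter added by the patch, not part of pyghmi.
KEEPALIVE_LIMIT = 180          # threshold used in the patch
ASSUMED_INTERVAL_SECONDS = 30  # assumption: inferred from "roughly 90 minutes"


class KeepaliveGuard:
    def __init__(self, limit=KEEPALIVE_LIMIT):
        self.limit = limit
        self.count = 0

    def reset(self):
        """Called whenever the SOL payload is (re)activated."""
        self.count = 0

    def check(self, session_still_owned):
        """Return True if the session may continue, False to force a reconnect."""
        if not session_still_owned or self.count > self.limit:
            return False
        self.count += 1
        return True


def checks_until_forced_reconnect(guard):
    """Count the successful checks before the guard forces a reconnect."""
    passed = 0
    while guard.check(session_still_owned=True):
        passed += 1
    return passed


guard = KeepaliveGuard()
passed = checks_until_forced_reconnect(guard)
minutes = passed * ASSUMED_INTERVAL_SECONDS / 60
print(f"{passed} checks, about {minutes} minutes between forced reconnects")
```

With a different real check interval the reconnect period scales accordingly, so the limit would need adjusting to keep it near 90 minutes.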

Sorry I haven’t tested this and can’t think of a root cause, but I’m going to take some time off for the weekend.

I would be curious whether the same ipmitool process is still running a day later when you check (e.g. whether ipmitool is exiting and getting restarted). I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 115.2
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

and nothing happened

in the console’s log
---
[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected][04/14 13:01:02 console disconnected][04/14 13:01:02 console connected][04/14 13:03:54 console disconnected][04/14 13:04:15 console connected][04/14 13:38:37 console connected][04/14 15:31:47 console disconnected][04/14 15:36:24 console connected][04/14 15:42:08 connection by xcat_console]
---
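
The run of bracketed markers above is easier to read once split into individual events. A minimal sketch, assuming the markers always have the `[MM/DD HH:MM:SS text]` shape shown in this thread; the `year` parameter is an assumption, since the log markers do not carry a year.

```python
import re
from datetime import datetime

# Matches markers such as "[04/14 12:49:12 console disconnected]".
EVENT_RE = re.compile(r"\[(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ([^\]]+)\]")


def parse_events(text, year=2017):
    """Return (timestamp, description) tuples for every bracketed marker."""
    events = []
    for stamp, description in EVENT_RE.findall(text):
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        events.append((when, description))
    return events


sample = (
    "[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected]"
    "[04/14 13:01:02 console disconnected][04/14 13:01:02 console connected]"
)
events = parse_events(sample)
for when, description in events:
    print(when.isoformat(), description)
```

Pairing each "console disconnected" with the following "console connected" gives the reconnect delay, which is 17 seconds for the first pair in the sample.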


On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:
If you do have any in a corrupted state, I would be interested to see what happens if you run:
ipmitool sol set volatile-bit-rate 115.2 1


That changes the volatile bit rate to match the non-volatile bit rate, to see if the corruption goes away.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

115200

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

that is strange, right


On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:
Hmm, what’s the baud rate the console is actually running at? It is odd to see the volatile and non-volatile bit rates differ.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.




On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:
And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
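
The two dumps above are the same in every field, including the 38.4 vs 115.2 bit-rate mismatch. Comparing such output by eye is error-prone, so here is a rough sketch that turns the `Key : value` lines into a dict and flags the mismatch. The helper names are made up for illustration, and it only assumes the output layout pasted in this thread.

```python
def parse_sol_info(output):
    """Turn the 'Key : value' lines of `ipmitool sol info` output into a dict."""
    settings = {}
    for line in output.splitlines():
        # Skip the "Info:" notice, prompts and anything without a separator.
        if line.startswith("Info:") or " : " not in line:
            continue
        key, value = line.split(" : ", 1)
        settings[key.strip()] = value.strip()
    return settings


def bit_rates_differ(settings):
    """True when the volatile and non-volatile bit rates disagree."""
    volatile = settings["Volatile Bit Rate (kbps)"]
    non_volatile = settings["Non-Volatile Bit Rate (kbps)"]
    return volatile != non_volatile


sample = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
"""
settings = parse_sol_info(sample)
print("bit rates differ:", bit_rates_differ(settings))
```

Running `parse_sol_info` on a "corrupted" and a "working" dump and comparing the two dicts would confirm that the SOL parameters themselves do not change between the two states.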


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Yes, reopening makes it work again, without any garbage... so it looks like a normal console :)
Hitting <enter> first produces garbage output (ᅵᅵ Porᅵlo) and then a *normal console* afterwards...


On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:
So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Reopening the console did the trick as well...


On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:
‘ctrl-e, then c, then o’ to reconnect.

Was conserver ondemand or full logging?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting the console”?
The console continues working after:
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such trouble.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

I’m out of ideas, so let me show you everything I see.

Inside rcons I see:

MONITORING_TEST dbb54 1492160401 <= last message I sent from the OS (more detailed log below)

tcpdump(keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80




13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
Porᅵ
---

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

and Magic, rcons:
---
Porï¿œlo]0;console: dbb54 [13:25]


dbb54 login:
---
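
To check whether the traffic in the captures above really is a regular keepalive, the gaps between packets sent by the console server can be computed from the tcpdump text. A rough sketch, assuming the verbose two-line layout pasted above (timestamp line, then an indented `src > dst:` line); it does not handle a capture that crosses midnight.

```python
import re
from datetime import datetime

STAMP_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")
FLOW_RE = re.compile(r"^\s*(\S+) > (\S+): ")


def outgoing_gaps(capture, client_prefix):
    """Seconds between consecutive packets whose source starts with client_prefix."""
    stamps = []
    pending = None  # timestamp waiting for its "src > dst" line
    for line in capture.splitlines():
        if not line.strip():
            continue
        stamp = STAMP_RE.match(line)
        if stamp:
            pending = datetime.strptime(stamp.group(1), "%H:%M:%S.%f")
            continue
        flow = FLOW_RE.match(line)
        if flow and pending is not None and flow.group(1).startswith(client_prefix):
            stamps.append(pending)
        pending = None
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


capture = """\
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
"""
gaps = outgoing_gaps(capture, "10.10.114.30.")
print([round(gap, 3) for gap in gaps])
```

For the first two outgoing packets in the capture the gap is about 27 seconds.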


On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:
If you press ctrl-e, c, o, does it restore the console after that time?

Can you tell whether it happens after exactly 24 hours on the dot?

When the console is hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


... many of our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]
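
The number at the end of the MONITORING_TEST line is a Unix timestamp. For cross-checking it against the bracketed log times, this small sketch decodes it in UTC (the helper name is just for illustration):

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


last_message = epoch_to_utc(1492160401)
print(last_message.isoformat())  # 2017-04-14T09:00:01+00:00
```

That matches the "Fri Apr 14 09:00:01 UTC 2017" quoted next to the last message.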


Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours. In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

Yes, that’s correct


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

My last reply was incorrect. The problem is still here. I’m trying to find something useful between confluent and pyghmi...
Restarting confluent resolves the hangs by reopening all connections.
I don’t think restarting confluent once or twice every 24h is the best option.
--
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:
It is a Dell-related problem, not 100% but...

Confluent from current master is doing things well :)
Thanks for the pretty nice tool "confluentdbutil".


On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:
Looks like this problem has been seen before... The fix was to use ipmitool with keepalive (the one from the xcat repos).
Here pyghmi is used; maybe that is the reason?


On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:
Hi,

I’m trying to migrate completely from conserver to confluent, but I’ve run into strange behaviour.
Some of my consoles hang after about 24 hours, so no new messages appear in their logs or in rcons.
I send timestamped messages from the OS to /dev/console every 30-60 minutes and watch for them for monitoring purposes (console availability monitoring).
I can open rcons and hit enter, and after a few seconds the console wakes up (strange). I didn’t see this happen with conserver, or maybe I’m wrong...
Some details:
- As far as I can see, most of the consoles that hang are Dell iDRAC. It doesn’t matter whether RacSerial or IPMISerial is in use.
- racreset hard / ipmitool bmc reset didn’t help.
- Hitting enter in the console wakes it up (for example, with expect I can send \r\n\f, but that looks bad).
- I didn’t try cleaning confluent's configuration and restarting it. Not sure it would help.
- HP consoles work well with the same IPMI.
- A few consoles with a custom plugin work well too.

So maybe my question is not about confluent, but if any of you know about similar problems, please share! ;)
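
The availability check described in this message can be sketched as follows. The heartbeat format `MONITORING_TEST <node> <epoch>` is the one that appears in the console logs in this thread; the function names and the two-hour staleness threshold are assumptions made for illustration, not part of the actual monitoring setup.

```python
import re

# Heartbeat lines in the console log look like "MONITORING_TEST dbb54 1492160401".
HEARTBEAT_RE = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text):
    """Return (node, epoch) of the newest heartbeat in the log, or None."""
    matches = HEARTBEAT_RE.findall(log_text)
    if not matches:
        return None
    node, epoch = max(matches, key=lambda match: int(match[1]))
    return node, int(epoch)


def is_stale(log_text, now, max_age_seconds=2 * 60 * 60):
    """True if no heartbeat was logged within max_age_seconds before now.

    The default threshold is an assumption: two missed 60-minute heartbeats.
    """
    newest = last_heartbeat(log_text)
    return newest is None or now - newest[1] > max_age_seconds


log_text = (
    "^MMONITORING_TEST dbb54 1492156801\n"
    "^MMONITORING_TEST dbb54 1492160401\n"
    "[04/14 09:05:13 console connected]\n"
)
print(last_heartbeat(log_text))
print(is_stale(log_text, now=1492160401 + 4 * 60 * 60))
```

In practice `now` would come from `time.time()` and `log_text` from the node's file under /var/log/confluent/consoles/.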
--
banuchka
banuchka
2017-04-14 18:53:48 UTC
Permalink
The last idea doesn’t solve it for me. The patch itself works as intended, though: confluent does disconnect and reconnect after the fixed interval. But for now it is 100% correct to say that it is a problem with the iDRAC firmware.
From the release notes for the latest firmware:
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it is something like what I have here. So I upgraded a few hosts and will give them plenty of time to show results.
Thanks for your answers, help and time... it is a very interesting quest :)

A bit more about Confluent:
- Interesting ambitions
- Python vs. Perl, that’s good
- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to have in Confluent

On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:

Very interested in the outcome.  And thank you for working through it.  Also interested what you have liked, would like, and have disliked about confluent.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Thank you Jarrod, i’ll try to add patch and let you know after. Hope 90 minutes is enough, yes.

 

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I’m loathe to do a workaround, but in console.py (find /usr –name console.py) , might be interesting to see how a change like the following:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py

index 95e8551..a5f6062 100644

--- a/pyghmi/ipmi/console.py

+++ b/pyghmi/ipmi/console.py

@@ -42,6 +42,7 @@ class Console(object):

     def __init__(self, bmc, userid, password,

                  iohandler, port=623,

                  force=False, kg=None):

+        self.keepalivecount = 0

         self.keepaliveid = None

         self.connected = False

         self.broken = False

@@ -70,6 +71,7 @@ class Console(object):

         if 'error' in response:

             self._print_error(response['error'])

             return

+        self.keepalivecount = 0

         #Send activate sol payload directive

         #netfn= 6 (application)

         #command = 0x48 (activate payload)

@@ -150,11 +152,12 @@ class Console(object):

             return

         currowner = struct.unpack(

             "<I", struct.pack('4B', *response['data'][:4]))

-        if currowner[0] != self.ipmi_session.sessionid:

+        if currowner[0] != self.ipmi_session.sessionid or  self.keepalivecount > 180:

             # the session is deactivated or active for something else

             self.activated = False

             self._print_error('SOL deactivated')

             return

+        self.keepalivecount += 1

         # ok, still here, that means session is alive, but another

         # common issue is firmware messing with mux on reboot

         # this would be a nice thing to check, but the serial channel

 

If it would pan out, should cause the console session to disconnect itself roughly every 90 minutes and trigger reconnect (is 90 minutes short enough in your case?)  Would require a service confluent restart to see if it had the desired effect.

 

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

 

I would be curious if the same ipmitool is running a day later than a check (e.g. if ipmitool is exiting and getting restarted).  I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1

cloud53.ulan:/home/banuchka # ipmitool sol info 1

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 115.2

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

 

in the console’s log

—

[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected][04/14 13:01:02 console disconnected][04/14 13:01:02 console connected][04/14 13:03:54 console disconnected][04/14 13:04:15 console connected][04/14 13:38:37 console connected][04/14 15:31:47 console disconnected][04/14 15:36:24 console connected][04/14 15:42:08 connection by xcat_console]

---

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial

[Key=iDRAC.Embedded.1#IPMISerial.1]

BaudRate=115200

ChanPrivLimit=4

ConnectionMode=Terminal

DeleteControl=Disabled

EchoControl=Enabled

FlowControl=RTS/CTS

HandshakeControl=Enabled

InputNewLineSeq=1

LineEdit=Enabled

NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

 

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Yes, reopen causes it to work again,  without any garbage
 so looks like normal console :)

Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal console* before...

 

On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:

So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Reopen console did the trick as well...

 

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Console starts showing garbage after <enter> inside rcons.

What do you mean when said “restarting console”?

Console continue its work after:

- <enter> inside rcons/confetty

- bmc reset (console disconnected/console connected)

 

You’re absolutely right with ipmitool and conserver with the same servers we were out of such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I’m out of ideas, let me show you all i see.

 

Inside rcons i see:

 

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

 

tcpdump(keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 




 

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

—

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176

13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


 many our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. Problems still here. Im trying to find something usefull inbetween confluent/pyghmi...

Confluent restart solves hangs/reopen all connections.

I think it isnt the best option to restart confluent 1 or 2 times in 24h.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It is Dell’s related problem, not 100% but


Confluent from current master is doing things well :) 

Thanks for pretty nice tool “confluentdbutil".

 

On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:

Looks like that problem was before
 The fix was to use ipmitool with keepalive(one from xcat repos).

Here pyghmi is used maybe that the reason?

 

On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:

Hi, 

 

Im trying to completely migrate from conserver to confluent, but catch strange behaviour.

Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.

I send messages with timestamp from OS >/dev/console every 30-60min and take a look on them for monitoring purposes(consoles availability monitoring).

I can open rcons and hit enter, after few secs console is waking up(strange). I didnt see it happen with conserver or maybe im wrong...

Some details:

- as i can see the bigest part of consoles with hangs behaviour are Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in use.

- racreset hard/ipmitool bmc reset didnt do the things

- hit enter to console wake it up(for example with expect i can send \r\n\f, but it looks bad)

- i didnt try to clean confluent's conf and restart it. Not sure it may help.

- HP consoles works well, same ipmi

- few consoles with custom pluging works good as well

 

So maybe my question is not about confluent, but if some of you have some knowledge about same problems please share it! ;)

 

-- 
banuchka

------------------------------------------------------------------------------ 
Check out the vibrant tech community on one of the world's most 
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
xCAT-user mailing list 
xCAT-***@lists.sourceforge.net 
https://lists.sourceforge.net/lists/listinfo/xcat-user 

 

 

 

Jarrod Johnson
2017-04-14 19:58:48 UTC
Permalink
Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

Let me know if the firmware exploration works out. That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way. The ‘works with ipmitool’ though has me scratching my head.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The last idea doesn’t work for me. By the way, the idea itself works great: confluent does disconnect/connect after the constant time. But for now it is 100% correct to say that it is a problem with the iDRAC fw.
from release notes for last fw:
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it’s something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show me results.
Thanks for your answers, help and time... it is a very interesting quest :)

Bit more about Confluent:
- Interesting ambitions
- Python VS Perl, that’s good
- I think log files (not just trace, stderr, stdout) and documentation (source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent


On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Very interested in the outcome. And thank you for working through it. Also interested what you have liked, would like, and have disliked about confluent.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Thank you Jarrod, i’ll try to add patch and let you know after. Hope 90 minutes is enough, yes.


On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

If it would pan out, should cause the console session to disconnect itself roughly every 90 minutes and trigger reconnect (is 90 minutes short enough in your case?) Would require a service confluent restart to see if it had the desired effect.
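
As a rough illustration of the counter logic in the patch above, the following stand-alone sketch simulates the keepalive check without any BMC. The class name `FakeConsole`, the method `check_keepalive` and the 30-second interval are assumptions made for illustration (the interval is only inferred from 180 checks being described as roughly 90 minutes); none of this is pyghmi code.

```python
# Stand-alone simulation of the patched keepalive check (not pyghmi itself).
KEEPALIVE_LIMIT = 180          # threshold used in the patch
ASSUMED_INTERVAL_SECONDS = 30  # assumption: 180 checks ~ 90 minutes


class FakeConsole:
    """Hypothetical stand-in for pyghmi's Console, tracking only the counter."""

    def __init__(self):
        self.keepalivecount = 0
        self.activated = True

    def check_keepalive(self, owner_is_us=True):
        # Mirrors the patched condition: foreign owner OR too many checks.
        if not owner_is_us or self.keepalivecount > KEEPALIVE_LIMIT:
            self.activated = False  # the caller is expected to reconnect
            return
        self.keepalivecount += 1


def checks_until_forced_disconnect():
    """Count keepalive checks until the counter alone forces a disconnect."""
    console = FakeConsole()
    checks = 0
    while console.activated:
        console.check_keepalive()
        checks += 1
    return checks


if __name__ == "__main__":
    n = checks_until_forced_disconnect()
    print(n, "checks, about", n * ASSUMED_INTERVAL_SECONDS / 60, "minutes")
```

Because the counter is compared before it is incremented, the forced disconnect lands on the 182nd check in this simulation, which at the assumed interval is about 91 minutes and matches the "roughly every 90 minutes" estimate.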

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

I would be curious if the same ipmitool is running a day later than a check (e.g. if ipmitool is exiting and getting restarted). I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 115.2
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

and nothing happened

in the console’s log
---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
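
Bracketed markers like the ones in the console log above can be pulled apart programmatically, whether or not they are run together on one line. The sketch below is a minimal example; the regular expression and the function names are assumptions based only on the marker format shown in this thread, not on any documented confluent log format.

```python
import re

# One bracketed marker: "[MM/DD HH:MM:SS message]"
ENTRY = re.compile(r"\[(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) ([^\]]+)\]")


def split_events(raw):
    """Return (date, time, text) tuples for every bracketed marker in raw."""
    return ENTRY.findall(raw)


def count_disconnects(raw):
    """Count how many markers report a console disconnect."""
    return sum(1 for _, _, text in split_events(raw)
               if text == "console disconnected")


if __name__ == "__main__":
    raw = ("[04/14 12:49:12 console disconnected]"
           "[04/14 12:49:29 console connected]"
           "[04/14 13:01:02 console disconnected]"
           "[04/14 13:01:02 console connected]")
    for event in split_events(raw):
        print(event)
    print("disconnects:", count_disconnects(raw))
```

Counting disconnect markers per node over a day is one simple way to compare how often each BMC drops its session.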


On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you do have any in corrupted state, would be interested to see what happens if you do:
ipmitool sol set volatile-bit-rate 115.2 1


To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

115200

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

that is strange, right


On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, what’s the baud rate the console is actually running at? Odd to see the volatile and non volatile bit rates not be the same.
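
For anyone wanting to spot this kind of mismatch across many nodes, here is a small sketch that parses saved `ipmitool sol info` output and compares the two bit-rate fields. The function names are hypothetical and the parser assumes the "Key : value" layout shown in the outputs posted in this thread; it does not call ipmitool itself.

```python
SAMPLE = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Port : 623
"""


def parse_sol_info(text):
    """Turn 'Key : value' lines of saved `ipmitool sol info` output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines without the ' : ' separator, e.g. 'Info: ...'
            info[key.strip()] = value.strip()
    return info


def bit_rates_differ(info):
    """True when the volatile and non-volatile SOL bit rates disagree."""
    return info["Volatile Bit Rate (kbps)"] != info["Non-Volatile Bit Rate (kbps)"]


if __name__ == "__main__":
    info = parse_sol_info(SAMPLE)
    print(info["Volatile Bit Rate (kbps)"], info["Non-Volatile Bit Rate (kbps)"])
    print("mismatch:", bit_rates_differ(info))
```

The values are compared as strings on purpose, so the sketch makes no assumption about which bit rates a given BMC may report.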

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.




On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress : set-complete

Enabled : true

Force Encryption : true

Force Authentication : false

Privilege Level : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold : 255

Retry Count : 7

Retry Interval (ms) : 480

Volatile Bit Rate (kbps) : 38.4

Non-Volatile Bit Rate (kbps) : 115.2

Payload Channel : 1 (0x01)

Payload Port : 623



Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress : set-complete

Enabled : true

Force Encryption : true

Force Authentication : false

Privilege Level : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold : 255

Retry Count : 7

Retry Interval (ms) : 480

Volatile Bit Rate (kbps) : 38.4

Non-Volatile Bit Rate (kbps) : 115.2

Payload Channel : 1 (0x01)

Payload Port : 623


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Yes, reopen causes it to work again, without any garbage... so it looks like a normal console :)
Hitting <enter> first causes garbage output (ᅵᅵ Porᅵlo) and then a *normal console*...


On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Reopen console did the trick as well...


On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
‘ctrl-e, then c, then o’ to reconnect.

Was conserver ondemand or full logging?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting console”?
The console continues to work after:
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

I’m out of ideas, let me show you all i see.

Inside rcons i see:

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)
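
The availability check described in this thread (a timestamped message written to the console at regular intervals) could be automated along the following lines. This is a minimal sketch: the regular expression follows the `MONITORING_TEST <host> <epoch>` lines shown here, while the function names and the one-hour threshold are assumptions chosen to match the stated 30-60 minute interval.

```python
import re
from datetime import datetime, timezone

HEARTBEAT = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text):
    """Return (host, datetime) of the last heartbeat line, or None."""
    matches = HEARTBEAT.findall(log_text)
    if not matches:
        return None
    host, epoch = matches[-1]
    return host, datetime.fromtimestamp(int(epoch), tz=timezone.utc)


def is_stale(log_text, now, max_age_seconds=3600):
    """True when no heartbeat was logged within max_age_seconds of now."""
    found = last_heartbeat(log_text)
    if found is None:
        return True
    return (now - found[1]).total_seconds() > max_age_seconds


if __name__ == "__main__":
    log = "^MMONITORING_TEST dbb54 1492160401\n^M\n[04/14 09:05:13 console connected]\n"
    print(last_heartbeat(log))
    print(is_stale(log, datetime(2017, 4, 14, 13, 14, 30, tzinfo=timezone.utc)))
```

Run against each node's console log, a check like this flags consoles that have gone quiet without anyone having to open rcons.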

tcpdump(keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
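
To check whether these exchanges really are periodic keepalives, the gaps between outbound packets can be computed from the capture. The sketch below assumes the two-line verbose tcpdump layout shown above (a timestamped header line followed by an address line); the function name and the source prefix argument are hypothetical.

```python
import re
from datetime import datetime

# Timestamp at the start of a tcpdump header line.
STAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")

SAMPLE = [
    "13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)",
    "10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64",
    "13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)",
    "10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80",
    "13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)",
    "10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64",
]


def outbound_gaps(lines, src_prefix):
    """Seconds between successive packets whose source starts with src_prefix."""
    stamps = []
    pending = None
    for line in lines:
        m = STAMP.match(line)
        if m:
            pending = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        else:
            if pending is not None and line.strip().startswith(src_prefix):
                stamps.append(pending)
            pending = None
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    print(outbound_gaps(SAMPLE, "10.10.114.30."))
```

The sketch only handles captures within a single day, since the timestamps carry no date.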

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
Porᅵ
---

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

and Magic, rcons:
---
Porï¿œlo]0;console: dbb54 [13:25]


dbb54 login:
---


On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you ctrl-e, c, o, does it restore the console after the time?

Can you tell that it goes after exactly 24hours on the dot?

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


 many our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours. In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net<mailto:xcat-***@lists.sourceforge.net>
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

My last reply was incorrect. Problems still here. Im trying to find something usefull inbetween confluent/pyghmi...
Confluent restart solves hangs/reopen all connections.
I think it isnt the best option to restart confluent 1 or 2 times in 24h.
--
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
It is Dell’s related problem, not 100% but

Confluent from current master is doing things well :)
Thanks for pretty nice tool “confluentdbutil".


On 13 April 2017 at 11:30:14, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Looks like that problem was before
 The fix was to use ipmitool with keepalive(one from xcat repos).
Here pyghmi is used maybe that the reason?


On 13 April 2017 at 08:22:28, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Hi,

Im trying to completely migrate from conserver to confluent, but catch strange behaviour.
Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.
I send messages with timestamp from OS >/dev/console every 30-60min and take a look on them for monitoring purposes(consoles availability monitoring).
I can open rcons and hit enter, after few secs console is waking up(strange). I didnt see it happen with conserver or maybe im wrong...
Some details:
- as i can see the bigest part of consoles with hangs behaviour are Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in use.
- racreset hard/ipmitool bmc reset didnt do the things
- hit enter to console wake it up(for example with expect i can send \r\n\f, but it looks bad)
- i didnt try to clean confluent's conf and restart it. Not sure it may help.
- HP consoles works well, same ipmi
- few consoles with custom pluging works good as well

So maybe my question is not about confluent, but if some of you have some knowledge about same problems please share it! ;)
--
banuchka
banuchka
2017-04-19 10:32:58 UTC
Permalink
Hi,

I’m trying to use a plugin for confluent that wraps a simple “ipmitool sol activate” (placed in /opt/confluent/lib/python/confluent/plugins/console/). It is my last attempt to understand what’s going on here.
The FW upgrade didn’t help overall.
With the current pyghmi setup I see lots of “log on/log off” messages in the BMC’s logs, which doesn’t happen when I’m using ipmitool.
I’m out of ideas right now...

On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:

Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

 

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

 

Let me know if the firmware exploration works out.  That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way.  The ‘works with ipmitool’ though has me scratching my head.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Last idea doesn’t work for me. So by the way idea as is is working great – confluent does disconnect/connect after time in constant. But for now it is 100% correct to say – it is a problem with IDRAC fw.

from release notes for last fw:

===

- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.

===

I’m not sure, but maybe its something like i have here. So did the upgrade on few hosts and give them plenty of time to show me results.

Thanks for your answers, help and time... it is a very interesting quest :)

 

Bit more about Confluent:

- Interesting ambitions 

- Python VS Perl, thats good

- I think log files (not just trace, stderr, stdout) and documentation (source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent

 

On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:

Very interested in the outcome.  And thank you for working through it.  Also interested what you have liked, would like, and have disliked about confluent.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Thank you Jarrod, i’ll try to add patch and let you know after. Hope 90 minutes is enough, yes.

 

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py

index 95e8551..a5f6062 100644

--- a/pyghmi/ipmi/console.py

+++ b/pyghmi/ipmi/console.py

@@ -42,6 +42,7 @@ class Console(object):

     def __init__(self, bmc, userid, password,

                  iohandler, port=623,

                  force=False, kg=None):

+        self.keepalivecount = 0

         self.keepaliveid = None

         self.connected = False

         self.broken = False

@@ -70,6 +71,7 @@ class Console(object):

         if 'error' in response:

             self._print_error(response['error'])

             return

+        self.keepalivecount = 0

         #Send activate sol payload directive

         #netfn= 6 (application)

         #command = 0x48 (activate payload)

@@ -150,11 +152,12 @@ class Console(object):

             return

         currowner = struct.unpack(

             "<I", struct.pack('4B', *response['data'][:4]))

-        if currowner[0] != self.ipmi_session.sessionid:

+        if currowner[0] != self.ipmi_session.sessionid or  self.keepalivecount > 180:

             # the session is deactivated or active for something else

             self.activated = False

             self._print_error('SOL deactivated')

             return

+        self.keepalivecount += 1

         # ok, still here, that means session is alive, but another

         # common issue is firmware messing with mux on reboot

         # this would be a nice thing to check, but the serial channel

 

If it would pan out, should cause the console session to disconnect itself roughly every 90 minutes and trigger reconnect (is 90 minutes short enough in your case?)  Would require a service confluent restart to see if it had the desired effect.

 

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

 

I would be curious if the same ipmitool is running a day later than a check (e.g. if ipmitool is exiting and getting restarted).  I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1

cloud53.ulan:/home/banuchka # ipmitool sol info 1

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 115.2

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

 

in the console’s log

---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial

[Key=iDRAC.Embedded.1#IPMISerial.1]

BaudRate=115200

ChanPrivLimit=4

ConnectionMode=Terminal

DeleteControl=Disabled

EchoControl=Enabled

FlowControl=RTS/CTS

HandshakeControl=Enabled

InputNewLineSeq=1

LineEdit=Enabled

NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

 

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Yes, reopen causes it to work again,  without any garbage
 so looks like normal console :)

Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal console* before...

 

On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:

So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Reopen console did the trick as well...

 

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Console starts showing garbage after <enter> inside rcons.

What do you mean when said “restarting console”?

Console continue its work after:

- <enter> inside rcons/confetty

- bmc reset (console disconnected/console connected)

 

You’re absolutely right with ipmitool and conserver with the same servers we were out of such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I’m out of ideas, let me show you all i see.

 

Inside rcons i see:

 

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

 

tcpdump(keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 




 

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

—

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176

13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
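
To answer the "keepalive?" question from a capture like the one above, the spacing between packets sent to the BMC can be measured. The following is only a rough sketch: the function name and the regular expressions are made up for illustration, and it assumes tcpdump's two-line verbose format exactly as pasted in this message (sample lines copied from the capture).

```python
import re
from datetime import datetime

# Three packets copied from the capture above (management host <-> BMC, UDP 623).
CAPTURE = """\
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
"""

HEADER = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")
FLOW = re.compile(r"^\s+(\S+) > (\S+): ")


def outbound_gaps(text, bmc_endpoint):
    """Return the seconds between consecutive packets sent to bmc_endpoint."""
    stamps = []
    pending = None  # timestamp of the header line waiting for its flow line
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between capture lines
        header = HEADER.match(line)
        if header:
            pending = datetime.strptime(header.group(1), "%H:%M:%S.%f")
            continue
        flow = FLOW.match(line)
        if flow and pending is not None and flow.group(2) == bmc_endpoint:
            stamps.append(pending)
        pending = None
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


gaps = outbound_gaps(CAPTURE, "10.10.106.155.623")
print(gaps)
```

Regular gaps of a few tens of seconds while the console is idle would be consistent with keepalive traffic, but the capture alone does not show what the packets contain.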

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


… many our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -d @1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]
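
The availability check described in this thread (write a timestamped MONITORING_TEST line to /dev/console, then look for it in the console log) could be automated roughly as below. This is a sketch under assumptions: the helper names and the one-hour threshold are hypothetical, and the line format "MONITORING_TEST <node> <epoch>" is taken from the log excerpt above.

```python
import re

# Matches the heartbeat lines seen in the console log, e.g.
# "MONITORING_TEST dbb54 1492160401"
MARKER = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text, node):
    """Return the newest epoch timestamp logged for node, or None."""
    stamps = [int(ts) for name, ts in MARKER.findall(log_text) if name == node]
    return max(stamps) if stamps else None


def is_stale(log_text, node, now, max_age=3600):
    """True if no heartbeat for node is newer than max_age seconds.

    max_age=3600 is an assumed threshold matching the 30-60 minute
    sending interval mentioned in the thread.
    """
    last = last_heartbeat(log_text, node)
    return last is None or now - last > max_age


SAMPLE = """\
[04/13 15:17:21 console connected]
MONITORING_TEST dbb54 1492160401
[04/14 09:05:13 console connected]
"""

print(last_heartbeat(SAMPLE, "dbb54"))
print(is_stale(SAMPLE, "dbb54", now=1492160401 + 4 * 3600))
```

In practice the log text would be read from /var/log/confluent/consoles/<nodename> and `now` taken from `time.time()`.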


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. The problems are still here. I’m trying to find something useful between confluent and pyghmi...

Restarting confluent clears the hangs and reopens all connections.

I don’t think restarting confluent once or twice every 24h is the best option.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It is Dell’s related problem, not 100% but…


Confluent from current master is doing things well :) 

Thanks for pretty nice tool “confluentdbutil".

 

On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:

Looks like that problem was before… The fix was to use ipmitool with keepalive(one from xcat repos).

Here pyghmi is used maybe that the reason?

 

On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:

Hi,

 

Im trying to completely migrate from conserver to confluent, but catch strange behaviour.

Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.

I send messages with timestamp from OS >/dev/console every 30-60min and take a look on them for monitoring purposes(consoles availability monitoring).

I can open rcons and hit enter, after few secs console is waking up(strange). I didnt see it happen with conserver or maybe im wrong...

Some details:

- as i can see the bigest part of consoles with hangs behaviour are Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in use.

- racreset hard/ipmitool bmc reset didnt do the things

- hit enter to console wake it up(for example with expect i can send \r\n\f, but it looks bad)

- i didnt try to clean confluent's conf and restart it. Not sure it may help.

- HP consoles works well, same ipmi

- few consoles with custom pluging works good as well

 

So maybe my question is not about confluent, but if some of you have some knowledge about same problems please share it! ;)

 

--
banuchka


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
xCAT-user mailing list
xCAT-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user

 

 

 

Jarrod Johnson
2017-04-19 12:58:24 UTC
Permalink
I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER, SECRET_HARDWAREMANAGEMENTUSER,
etc).

In case you have a question, here's one example:
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE


As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node* commands,
imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.

Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guess based
on comparative analysis.

-----Original Message-----
From: banuchka <***@gmail.com>
To: xCAT Users Mailing list <xcat-***@lists.sourceforge.net>, Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100

Hi,

I’m trying a plugin for confluent that simply runs "ipmitool sol
activate" (placed in
/opt/confluent/lib/python/confluent/plugins/console/). It is my last
attempt to understand what’s going on here.
The FW upgrade didn’t help overall.
With the current pyghmi setup I see lots of “log on/log off” messages
in the BMC’s logs, which doesn’t happen when I’m using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a big push in the coming weeks; it will have at
least an ‘events’ log along with a lot more function.
 
Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).
 
Let me know if the firmware exploration works out.  That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way.  The
‘works with ipmitool’ though has me scratching my head.
 
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
The last idea doesn’t solve it for me. The patch itself works as
intended, though: confluent does disconnect/connect after the constant
interval. But for now it is 100% correct to say that it is a problem
with iDRAC fw.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe its something like i have here. So did the
upgrade on few hosts and give them plenty of time to show me results.
Thanks for your answers, help and time… it is very interesting quest
:)
 
- Interesting ambitions 
- Python VS Perl, thats good
- I think log files(not just trace, stderr, stdout) and
documentation(source on Github is the best doc o know, but…) are
things that i would like to be in Confluent
 
Very interested in the outcome.  And thank you for working through
it.  Also interested what you have liked, would like, and have
disliked about confluent.
 
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Thank you Jarrod, i’ll try to add patch and let you know after. Hope
90 minutes is enough, yes.
 
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
 
I’m loath to do a workaround, but in console.py (find /usr -name
console.py), it might be interesting to see how a change like the
following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or 
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel
 
If it would pan out, should cause the console session to disconnect
itself roughly every 90 minutes and trigger reconnect (is 90 minutes
short enough in your case?)  Would require a service confluent
restart to see if it had the desired effect.
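
The idea behind the workaround (count successful keepalive checks and deliberately drop the SOL session once a limit is reached, so that the normal reconnect logic kicks in) can be modelled in isolation. The sketch below is not pyghmi's code: the class, its method names and the threshold are hypothetical, and the actual limit used in the patch is not visible in the diff as archived. With a check every couple of minutes, a limit of about 45 would give roughly 90 minutes, but that figure is an assumption.

```python
class SolWatchdog:
    """Toy model of a keepalive counter that forces a periodic reconnect."""

    def __init__(self, max_keepalives):
        # max_keepalives is a hypothetical tunable, not a pyghmi setting.
        self.max_keepalives = max_keepalives
        self.keepalivecount = 0
        self.activated = True

    def on_activate(self):
        """Called when the SOL payload is (re)activated: restart the count."""
        self.keepalivecount = 0
        self.activated = True

    def on_keepalive(self, owner_is_us):
        """Process one keepalive check; return True if the session stays up.

        The session is dropped when another session owns the SOL payload
        or when the keepalive limit has been reached.
        """
        if not owner_is_us or self.keepalivecount >= self.max_keepalives:
            self.activated = False
            return False
        self.keepalivecount += 1
        return True


watchdog = SolWatchdog(max_keepalives=3)
results = [watchdog.on_keepalive(owner_is_us=True) for _ in range(4)]
print(results)
```

After a forced drop, the caller would reconnect and call `on_activate()`, which restarts the count.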
 
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
 
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted).  I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
 
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
 
and nothing happened
 
in the console’s log

[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
 
ipmitool sol set volatile-bit-rate 115.2 1
 
 
To change the volatile bit rate to match the non-volatile bit rate
and see if the corruption goes away.
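
Checking many nodes for this mismatch by eye gets tedious, so the comparison can be scripted against saved "ipmitool sol info" output. This is a minimal sketch: the function names are made up, and it assumes the "key : value" layout shown in the outputs quoted in this thread.

```python
def parse_sol_info(text):
    """Parse 'key : value' lines from `ipmitool sol info` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines such as the leading "Info: ..." notice
            fields[key.strip()] = value.strip()
    return fields


def bit_rates_match(fields):
    """True when the volatile and non-volatile SOL bit rates are equal."""
    return fields["Volatile Bit Rate (kbps)"] == fields["Non-Volatile Bit Rate (kbps)"]


# Trimmed sample taken from the output quoted in this thread.
BEFORE = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
"""
AFTER = BEFORE.replace(": 38.4", ": 115.2")

print(bit_rates_match(parse_sol_info(BEFORE)))
print(bit_rates_match(parse_sol_info(AFTER)))
```

A node that reports a mismatch would be a candidate for "ipmitool sol set volatile-bit-rate 115.2 1" as shown above.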
 
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
115200
 
idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF
 
that is strange, right
 
Hmm, what’s the baud rate the console is actually running at?  Odd to
see the volatile and non volatile bit rates not be the same.
 
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
 
 
And to be clear, the corruption only starts after a long period of
time of being continuously connected?
Yes, that is correct
 
I might be interested in seeing ipmitool sol info 1 output against a
system while it is working versus showing corrupted info.
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
 
Yes, reopen causes it to work again,  without any garbage… so looks
like normal console :)
Hit <enter> causes at first garbage output(�� Por�lo) and *normal
console* before...
 
So reopen causes it to work again, and before, it’s not *hung*, but
erratic with garbage characters and occasional blips of sanity?
 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
 
Reopen console did the trick as well...
 
‘ctrl-e, then c, then o’ to reconnect.
 
Was conserver ondemand or full logging?
 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
 
Console starts showing garbage after <enter> inside rcons.
What do you mean when said “restarting console”?
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)
 
You’re absolutely right with ipmitool and conserver with the same
servers we were out of such troubles.
So the console starts showing garbage?  Restarting the console causes
the garbage to go away?
 
You said that ipmitool with a certain configuration did not trigger this?
 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
 
I’m out of ideas, let me show you all i see.
 
 
MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS
(more complex log below)
 
 
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
80
 

 
13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
80
 
---
MONITORING_TEST dbb54 1492160401
 
��
  Por�

 
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length
64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length
80
 
---
  Por�lo]0;console: dbb54 [13:25]
 
 
---
 
If you ctrl-e, c, o, does it restore the console after the time?
 
Can you tell that it goes after exactly 24hours on the dot?
 
When console hung, does ‘ipmitool sol activate’ say ‘session already
active’?
Yes, 
# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Info: SOL payload already active on another session
 
Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?
[04/13 15:17:21 console connected]
… many our own messages
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
 
Pyghmi will do keepalive as well, and if that’s the problem, it
should be much shorter than 24 hours.  In fact, it should be checking
if the SOL payload is active and owned by confluent specifically
every couple of minutes.
yes, thats correct
 
Sent: Friday, April 14, 2017 5:55 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
 
My last reply was incorrect. Problems still here. Im trying to find
something usefull inbetween confluent/pyghmi...
Confluent restart solves hangs/reopen all connections.
I think it isnt the best option to restart confluent 1 or 2 times in 24h.
-- 
banuchka
It is Dell’s related problem, not 100% but…
Confluent from current master is doing things well :) 
Thanks for pretty nice tool “confluentdbutil".
 
Looks like that problem was before… The fix was to use ipmitool with
keepalive(one from xcat repos).
Here pyghmi is used maybe that the reason?
 
Hi, 
 
Im trying to completely migrate from conserver to confluent, but catch strange behaviour.
Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.
I send messages with timestamp from OS >/dev/console every 30-60min
and take a look on them for monitoring purposes(consoles availability
monitoring).
I can open rcons and hit enter, after few secs console is waking
up(strange). I didnt see it happen with conserver or maybe im
wrong...
- as i can see the bigest part of consoles with hangs behaviour are
Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in
use.
- racreset hard/ipmitool bmc reset didnt do the things
- hit enter to console wake it up(for example with expect i can send
\r\n\f, but it looks bad)
- i didnt try to clean confluent's conf and restart it. Not sure it may help.
- HP consoles works well, same ipmi
- few consoles with custom pluging works good as well
 
So maybe my question is not about confluent, but if some of you have
some knowledge about same problems please share it! ;)
 
-- 
banuchka
banuchka
2017-04-19 13:07:57 UTC
Permalink
Thanks Jarrod, I already have a few “plugins” for old Sun servers without SOL, so it isn’t a big problem to create another one.
I really appreciate your help.
One more thing: I’m trying to make the baud rates consistent on all servers, because as far as I can see the DRAC has at least 3 places with that setting (I’m not sure this is the problem, but reading and writing at different speeds is not good practice).
I’ll try your advice as well and let you know.

On 19 April 2017 at 13:59:59, Jarrod Johnson (***@lenovo.com) wrote:

I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER, SECRET_HARDWAREMANAGEMENTUSER,
etc).

In case you have a question, here's one example:
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE


As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node* commands,
imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.

Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guessing based
on comparative analysis.

-----Original Message-----
From: banuchka <***@gmail.com>
To: xCAT Users Mailing list <xcat-***@lists.sourceforge.net>, Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100

Hi,

I’m trying to use plugin for confluent with simple "ipmitool sol
activate” (placed here
/opt/confluent/lib/python/confluent/plugins/console/). It is last
attempt to understand whats going on here.
FW upgrade didn’t help me globally.
With current setup with pyghmi i see lots of “log on/log off” messages
in BMC’s logs that doesn’t happen when im using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a bit push in the coming weeks it will have at
least an ‘events’ log along with a lot more function.
 
Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).
 
Let me know if the firmware exploration works out.  That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way.  The
‘works with ipmitool’ though has me scratching my head.
 
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Last idea doesn’t work for me. So by the way idea as is is working
great – confluent does disconnect/connect after time in constant. But
for now it is 100% correct to say – it is a problem with IDRAC fw.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe its something like i have here. So did the
upgrade on few hosts and give them plenty of time to show me results.
Thanks for your answers, help and time… it is very interesting quest :)
 
- Interesting ambitions 
- Python VS Perl, thats good
- I think log files(not just trace, stderr, stdout) and
documentation(source on Github is the best doc o know, but…) are
things that i would like to be in Confluent
 
Very interested in the outcome.  And thank you for working through
it.  Also interested what you have liked, would like, and have
disliked about confluent.
 
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Thank you Jarrod, i’ll try to add patch and let you know after. Hope
90 minutes is enough, yes.
 
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
 
I’m loath to do a workaround, but in console.py (find /usr -name
console.py), it might be interesting to see how a change like the
following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or 
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel
 
If it pans out, it should cause the console session to disconnect
itself roughly every 90 minutes and trigger a reconnect (is 90 minutes
short enough in your case?). It would require a service confluent
restart to see if it had the desired effect.
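
The idea behind the patch can be sketched in isolation. This is only a toy model, not pyghmi's real class: the condition after `or` is cut off in the diff above, so the threshold used here (`max_keepalives`) and the 2-minute keepalive interval are assumptions chosen to land near 90 minutes.

```python
class SolSessionSketch:
    """Toy stand-in for a SOL console session (hypothetical, not pyghmi)."""

    def __init__(self, session_id, max_keepalives):
        self.session_id = session_id
        self.max_keepalives = max_keepalives  # assumed threshold
        self.keepalive_count = 0
        self.activated = True
        self.errors = []

    def on_activate(self):
        # A fresh activation restarts the count, as the patch does.
        self.keepalive_count = 0
        self.activated = True

    def on_keepalive(self, current_owner):
        """Handle one periodic 'who owns the SOL payload' answer.

        Returns False when the session should be torn down so that the
        caller can reconnect.
        """
        if (current_owner != self.session_id
                or self.keepalive_count >= self.max_keepalives):
            self.activated = False
            self.errors.append('SOL deactivated')
            return False
        self.keepalive_count += 1
        return True


def keepalives_for(interval_minutes, keepalive_minutes):
    """How many keepalives fit into the desired forced-reconnect interval."""
    return max(1, interval_minutes // keepalive_minutes)
```

With an assumed 2-minute keepalive, `keepalives_for(90, 2)` gives 45, so the 46th check reports the session as deactivated even though the BMC still lists it as the owner.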
 
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
 
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted).  I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
 
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
 
and nothing happened
 
in the console’s log
---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
 
ipmitool sol set volatile-bit-rate 115.2 1
 
 
To change the volatile bit rate to match the non-volatile bit rate
and see if the corruption goes away.
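
A mismatch like this can also be spotted automatically. The following is a rough sketch that parses the text printed by `ipmitool sol info` (the field names are taken from the output quoted in this thread; the helper names are made up for illustration).

```python
SAMPLE = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
"""


def parse_sol_info(text):
    """Turn the 'Key : value' lines of `ipmitool sol info` into a dict."""
    info = {}
    for line in text.splitlines():
        # Skip the leading 'Info:' notice and anything without ' : '.
        if line.startswith('Info:') or ' : ' not in line:
            continue
        key, value = line.split(' : ', 1)
        info[key.strip()] = value.strip()
    return info


def bit_rates_match(info):
    """True only if both rates are present and identical."""
    volatile = info.get('Volatile Bit Rate (kbps)')
    non_volatile = info.get('Non-Volatile Bit Rate (kbps)')
    return volatile is not None and volatile == non_volatile


if __name__ == '__main__':
    print(bit_rates_match(parse_sol_info(SAMPLE)))
```

Feeding it the output captured before and after `ipmitool sol set volatile-bit-rate` would show whether the change actually took effect on the BMC.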
 
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
115200
 
idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF
 
that is strange, right
 
Hmm, what’s the baud rate the console is actually running at?  Odd to
see the volatile and non volatile bit rates not be the same.
 
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
 
 
And to be clear, the corruption only starts after a long period of
time of being continuously connected?
Yes, that is correct
 
I might be interested in seeing ipmitool sol info 1 output against a
system while it is working versus showing corrupted info.
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Yes, reopening causes it to work again, without any garbage... so it looks
like a normal console :)
Hitting <enter> at first causes garbage output (ᅵᅵ Porᅵlo) and a *normal
console* after that...
 
So reopen causes it to work again, and before, it’s not *hung*, but
erratic with garbage characters and occasional blips of sanity?
 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Reopening the console did the trick as well...
 
‘ctrl-e, then c, then o’ to reconnect.
 
Was conserver ondemand or full logging?
 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
The console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting the console”?
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)
 
You’re absolutely right: with ipmitool and conserver on the same
servers we had no such troubles.
So the console starts showing garbage?  Restarting the console causes
the garbage to go away?
 
You said that ipmitool with a certain configuration did not trigger this?
 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
I’m out of ideas, so let me show you everything I see.
 
 
MONITORING_TEST dbb54 1492160401 <= last message I sent from the OS
(a more detailed log is below)
 
 
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

...

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
MONITORING_TEST dbb54 1492160401
 
ᅵᅵ
  Porᅵ
---
 
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
  Porᅵlo]0;console: dbb54 [13:25]
 
 
---
 
If you ctrl-e, c, o, does it restore the console after the time?
 
Can you tell that it goes after exactly 24hours on the dot?
 
When console hung, does ‘ipmitool sol activate’ say ‘session already
active’?
Yes, 
# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Info: SOL payload already active on another session
 
Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?
[04/13 15:17:21 console connected]
... many of our own messages ...
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
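
To look for patterns in these markers across many nodes, the bracketed events can be pulled out of a console log with a small script. This is a sketch that assumes the `[MM/DD HH:MM:SS event]` shape seen in the excerpt above, with each marker on one line; the function names are illustrative only.

```python
import re
from collections import Counter

# Matches markers such as "[04/14 09:13:38 console disconnected]".
EVENT_RE = re.compile(r'\[(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ([^\]\n]+)\]')

SAMPLE = """\
[04/13 15:17:21 console connected]
^MMONITORING_TEST dbb54 1492160401 |
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
"""


def console_events(log_text):
    """Return (timestamp, event) pairs for every bracketed marker."""
    return EVENT_RE.findall(log_text)


def count_events(log_text):
    """Count how often each kind of event occurs."""
    return Counter(event for _, event in console_events(log_text))


if __name__ == '__main__':
    for event, count in count_events(SAMPLE).items():
        print(count, event)
```

Several "console connected" markers in a row without a "console disconnected" in between, as in this excerpt, are the kind of thing such a count makes easy to notice.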
 
Pyghmi will do keepalive as well, and if that’s the problem, it
should be much shorter than 24 hours.  In fact, it should be checking
if the SOL payload is active and owned by confluent specifically
every couple of minutes.
yes, thats correct
 
Sent: Friday, April 14, 2017 5:55 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
My last reply was incorrect. The problems are still here. I’m trying to find
something useful between confluent and pyghmi...
A confluent restart solves the hangs (it reopens all connections).
I don’t think restarting confluent once or twice every 24h is the best option.
-- 
banuchka
It is a Dell-related problem, not 100% but...
Confluent from current master is doing things well :)
Thanks for the pretty nice tool “confluentdbutil”.
 
Looks like this problem has been seen before... The fix was to use ipmitool with
keepalive (the one from the xCAT repos).
Here pyghmi is used; maybe that is the reason?
 
Hi,
 
I’m trying to migrate completely from conserver to confluent, but I’ve run into strange behaviour.
Some of my consoles hang after ~24h, so no new messages appear in their logs or in rcons.
I send timestamped messages from the OS to /dev/console every 30-60 min
and check for them for monitoring purposes (console availability
monitoring).
I can open rcons and hit enter, and after a few seconds the console wakes
up (strange). I didn’t see this happen with conserver, or maybe I’m
wrong...
Some details:
- as far as I can see, most of the consoles that hang are
Dell iDRAC. It doesn’t matter whether RacSerial or IPMISerial is in
use.
- racreset hard/ipmitool bmc reset didn’t help
- hitting enter in the console wakes it up (for example, with expect I can send
\r\n\f, but it looks bad)
- I didn’t try to clean confluent's conf and restart it. Not sure it would help.
- HP consoles work well, same ipmi
- a few consoles with custom plugins work well too
 
So maybe my question is not about confluent, but if any of you know
about similar problems, please share! ;)
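
The availability check described in this message can be sketched as follows. It assumes the marker format seen elsewhere in the thread (`MONITORING_TEST <host> <number>`) and treats the number as a Unix timestamp in seconds, which is an assumption; the two-hour limit is likewise just an illustrative default.

```python
import re
import time

# Marker written from the OS to /dev/console, e.g.
# "MONITORING_TEST dbb54 1492160401"
MARKER_RE = re.compile(r'MONITORING_TEST (\S+) (\d+)')


def last_marker(log_text):
    """Return (host, epoch) of the newest marker in the log, or None."""
    matches = MARKER_RE.findall(log_text)
    if not matches:
        return None
    host, epoch = max(matches, key=lambda m: int(m[1]))
    return host, int(epoch)


def console_is_stale(log_text, now=None, max_age_seconds=2 * 3600):
    """True if no marker was logged within the allowed age."""
    marker = last_marker(log_text)
    if marker is None:
        return True
    now = time.time() if now is None else now
    return now - marker[1] > max_age_seconds


if __name__ == '__main__':
    log = "^MMONITORING_TEST dbb54 1492160401 |\n^M\n"
    print(last_marker(log), console_is_stale(log))
```

Run against each file under /var/log/confluent/consoles/, a check like this would flag a console that has gone quiet well before anyone opens rcons.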
 
-- 
banuchka
xCAT-user mailing list 
https://lists.sourceforge.net/lists/listinfo/xcat-user 
 
 
 
banuchka
2017-04-19 13:38:04 UTC
Permalink
A bit of follow-up:
the experiment with nodehealth + echo > /dev/console + rcons didn’t hang the console... maybe it needs more time. I’ll keep it running inside a tmux session for a bit longer.
On 19 April 2017 at 14:07:58, banuchka (***@gmail.com) wrote:

Thanks Jarrod, I already have a few “plugins” for old Sun servers without SOL, so it isn’t a big problem to create another one.
I really appreciate your help.
One more thing: I’m trying to fix all the baud rates on the servers, because as far as I can see there are at least 3 places with that setting on the DRAC (I’m not sure this is a problem, but it’s not good practice to read and write at different speeds).
I’ll try your advice as well and let you know.
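
Checking that those places agree can be scripted. The sketch below compares the `BaudRate` from the racadm `iDRAC.IPMISerial` group with SOL bit rates given in kbps, using the output formats quoted earlier in this thread; the helper names are made up, and other places where the rate may be set are not covered.

```python
RACADM_SAMPLE = """\
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ConnectionMode=Terminal
FlowControl=RTS/CTS
"""


def parse_racadm_group(text):
    """Parse 'Name=value' lines of a racadm 'get' listing into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and the '[Key=...]' header line.
        if not line or line.startswith('[') or '=' not in line:
            continue
        key, value = line.split('=', 1)
        settings[key] = value
    return settings


def kbps_to_baud(kbps_text):
    """Convert an ipmitool-style rate such as '115.2' to 115200."""
    return int(round(float(kbps_text) * 1000))


def rates_agree(racadm_text, sol_kbps_values):
    """True if every SOL rate equals the serial BaudRate."""
    baud = int(parse_racadm_group(racadm_text)['BaudRate'])
    return all(kbps_to_baud(v) == baud for v in sol_kbps_values)


if __name__ == '__main__':
    # Volatile and non-volatile rates as reported by `ipmitool sol info`.
    print(rates_agree(RACADM_SAMPLE, ['38.4', '115.2']))
```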

On 19 April 2017 at 13:59:59, Jarrod Johnson (***@lenovo.com) wrote:

I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,
etc).

In case you have a question, here's one example:
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE


As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node* commands,
imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.
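
One way to run that experiment repeatably is a small driver that fires a burst of `nodehealth` calls while the console stays open. This is only a sketch: it assumes `nodehealth <node>` is on the PATH and takes the node name as its argument, and the `runner` parameter is a hypothetical hook so the loop can be exercised without a real cluster.

```python
import subprocess
import time


def run_health_burst(node, rounds=5, pause_seconds=1.0,
                     runner=subprocess.run):
    """Run `nodehealth <node>` several times and collect return codes."""
    codes = []
    for i in range(rounds):
        result = runner(['nodehealth', node],
                        capture_output=True, text=True)
        codes.append(result.returncode)
        if i < rounds - 1:
            time.sleep(pause_seconds)
    return codes


if __name__ == '__main__':
    # 'dbb54' is the node name that appears in the logs in this thread.
    print(run_health_burst('dbb54', rounds=3))
```

After each burst, writing a marker to /dev/console on the node and checking that it shows up in the console log would show whether the extra IPMI traffic on the shared session pushes the BMC into the bad state.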

Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guessing based
on comparative analysis.

-----Original Message-----
From: banuchka <***@gmail.com>
To: xCAT Users Mailing list <xcat-***@lists.sourceforge.net>, Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100

Hi,

I’m trying to use a plugin for confluent with a simple “ipmitool sol
activate” (placed in
/opt/confluent/lib/python/confluent/plugins/console/). It is a last
attempt to understand what’s going on here.
The FW upgrade didn’t help me overall.
With the current pyghmi setup I see lots of “log on/log off” messages
in the BMC’s logs, which doesn’t happen when I’m using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a big push in the coming weeks; it will have at
least an ‘events’ log along with a lot more function.
 
Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).
 
Let me know if the firmware exploration works out.  That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way.  The
‘works with ipmitool’ though has me scratching my head.
 
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Last idea doesn’t work for me. So by the way idea as is is working
great – confluent does disconnect/connect after time in constant. But
for now it is 100% correct to say – it is a problem with IDRAC fw.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe its something like i have here. So did the
upgrade on few hosts and give them plenty of time to show me results.
Thanks for your answers, help and time
 it is very interesting quest :)
 
- Interesting ambitions 
- Python VS Perl, thats good
- I think log files(not just trace, stderr, stdout) and
documentation(source on Github is the best doc o know, but
) are
things that i would like to be in Confluent
 
Very interested in the outcome.  And thank you for working through
it.  Also interested what you have liked, would like, and have
disliked about confluent.
 
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Thank you Jarrod, i’ll try to add patch and let you know after. Hope
90 minutes is enough, yes.
 
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
 
I’m loathe to do a workaround, but in console.py (find /usr –name
console.py) , might be interesting to see how a change like the
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or 
             # the session is deactivated or active for something
else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial
channel
 
If it would pan out, should cause the console session to disconnect
itself roughly every 90 minutes and trigger reconnect (is 90 minutes
short enough in your case?)  Would require a service confluent
restart to see if it had the desired effect.
 
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
 
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted).  I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
 
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
 
and nothing happened
 
in the console’s log
—
[04/14 12:49:12 console disconnected][04/14 12:49:29 console
connected][04/14 13:01:02 console disconnected][04/14 13:01:02
console connected][04/14 13:03:54 console disconnected][04/14
13:04:15 console connected][04/14 13:38:37 console connected][04/14
15:31:47 console disconnected][04/14 15:36:24 console
connected][04/14 15:42:08 connection by xcat_console]
---
 
ipmitool sol set volatile-bit-rate 115.2 1
 
 
To change the volatile bit rate to match the non-volatile bit rate
and see if the corruption goes away.
 
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
115200
 
idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF
 
that is strange, right
 
Hmm, what’s the baud rate the console is actually running at?  Odd to
see the volatile and non volatile bit rates not be the same.
 
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
 
 
And to be clear, the corruption only starts after a long period of
time of being continuously connected?
Yes, that is correct
 
I might be interested in seeing ipmitool sol info 1 output against a
system while it is working versus showing corrupted info.
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Yes, reopen causes it to work again,  without any garbage
 so looks
like normal console :)
Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal
console* before...
 
So reopen causes it to work again, and before, it’s not *hung*, but
erratic with garbage characters and occasional blips of sanity?
 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Reopen console did the trick as well...
 
‘ctrl-e, then c, then o’ to reconnect.
 
Was conserver ondemand or full logging?
 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Console starts showing garbage after <enter> inside rcons.
What do you mean when said “restarting console”?
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)
 
You’re absolutely right with ipmitool and conserver with the same
servers we were out of such troubles.
So the console starts showing garbage?  Restarting the console causes
the garbage to go away?
 
You said that ipmitool with a certain configuration did not trigger this?
 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
I’m out of ideas, let me show you all i see.
 
 
MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS
(more complex log below)
 
 
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 


 
13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
MONITORING_TEST dbb54 1492160401
 
ᅵᅵ
  Porᅵ
—
 
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
  Porᅵlo]0;console: dbb54 [13:25]
 
 
---
 
If you ctrl-e, c, o, does it restore the console after the time?
 
Can you tell that it goes after exactly 24hours on the dot?
 
When console hung, does ‘ipmitool sol activate’ say ‘session already
active’?
Yes, 
# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Info: SOL payload already active on another session
 
Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?
[04/13 15:17:21 console connected]

 many our own messages
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
 
Pyghmi will do keepalive as well, and if that’s the problem, it
should be much shorter than 24 hours.  In fact, it should be checking
if the SOL payload is active and owned by confluent specifically
every couple of minutes.
yes, thats correct
 
Sent: Friday, April 14, 2017 5:55 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
My last reply was incorrect. Problems still here. Im trying to find
something usefull inbetween confluent/pyghmi...
Confluent restart solves hangs/reopen all connections.
I think it isnt the best option to restart confluent 1 or 2 times in 24h.
-- 
banuchka
It is Dell’s related problem, not 100% but

Confluent from current master is doing things well :) 
Thanks for pretty nice tool “confluentdbutil".
 
Looks like that problem was before
 The fix was to use ipmitool with
keepalive(one from xcat repos).
Here pyghmi is used maybe that the reason?
 
Hi, 
 
Im trying to completely migrate from conserver to confluent, but catch strange behaviour.
Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.
I send messages with timestamp from OS >/dev/console every 30-60min
and take a look on them for monitoring purposes(consoles availability
monitoring).
I can open rcons and hit enter, after few secs console is waking
up(strange). I didnt see it happen with conserver or maybe im
wrong...
- as i can see the bigest part of consoles with hangs behaviour are
Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in
use.
- racreset hard/ipmitool bmc reset didnt do the things
- hit enter to console wake it up(for example with expect i can send
\r\n\f, but it looks bad)
- i didnt try to clean confluent's conf and restart it. Not sure it may help.
- HP consoles works well, same ipmi
- few consoles with custom pluging works good as well
 
So maybe my question is not about confluent, but if some of you have
some knowledge about same problems please share it! ;)
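
The availability check described in this message (a timestamped line written to /dev/console, then looked up in the console log) can be sketched as below. This is an illustration only: the `MONITORING_TEST <node> <epoch>` layout is copied from the log lines quoted in the thread, while the function names and the one-hour default threshold are assumptions.

```python
import re
from datetime import datetime, timezone

# Matches lines such as "^MMONITORING_TEST dbb54 1492160401" from the thread.
HEARTBEAT_RE = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text, node):
    """Return the newest heartbeat epoch logged for `node`, or None."""
    stamps = [int(ts) for name, ts in HEARTBEAT_RE.findall(log_text) if name == node]
    return max(stamps) if stamps else None


def is_stale(log_text, node, now_epoch, max_age_seconds=3600):
    """True if no heartbeat for `node` arrived within `max_age_seconds`."""
    newest = last_heartbeat(log_text, node)
    return newest is None or now_epoch - newest > max_age_seconds


def as_utc(epoch):
    """Render an epoch value as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)
```

For the last heartbeat quoted in the thread, `as_utc(1492160401)` gives 2017-04-14 09:00:01 UTC.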
 
-- 
banuchka
-------------------------------------------------------------------
----------- 
Check out the vibrant tech community on one of the world's most 
engaging tech sites, Slashdot.org! http://sdm.link/slashdot__________
_____________________________________ 
xCAT-user mailing list 
https://lists.sourceforge.net/lists/listinfo/xcat-user 
 
 
 
Jarrod Johnson
2017-04-19 13:39:08 UTC
Permalink
Ok, also were those login/logouts always there, or only after that ‘try to suicide every 90 minutes’ experiment?

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, April 19, 2017 9:38 AM
To: xcat-***@lists.sourceforge.net; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

A bit of follow-up:
The experiment with nodehealth + echo > /dev/console + rcons didn’t hang the console; maybe it needs more time. I'll leave it running inside a tmux session for a bit longer.


On 19 April 2017 at 14:07:58, banuchka (***@gmail.com) wrote:
Thanks Jarrod, I already have a few “plugins” for old Sun servers without SOL, so it isn’t a big problem to create another one.
I really appreciate your help.
One more thing: I’m trying to fix all the BaudRates on the servers, because as far as I can see there are at least 3 places with that setting on the DRAC (I’m not sure this is a problem, but it’s not good practice to read and write at different speeds).
I’ll try your advice as well and let you know.


On 19 April 2017 at 13:59:59, Jarrod Johnson (***@lenovo.com) wrote:
I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,
etc).

In case you have a question, here's one example:
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE


As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node* commands,
imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.

Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guessing based
on comparative analysis.

-----Original Message-----
From: banuchka <***@gmail.com>
To: xCAT Users Mailing list <xcat-***@lists.sourceforge.net>, Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100

Hi,

I’m trying to use a plugin for confluent with a simple “ipmitool sol
activate” (placed in
/opt/confluent/lib/python/confluent/plugins/console/). It is a last
attempt to understand what's going on here.
The FW upgrade didn’t help overall.
With the current pyghmi setup I see lots of “log on/log off” messages
in the BMC’s logs, which doesn’t happen when I'm using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a bit push in the coming weeks it will have at
least an ‘events’ log along with a lot more function.
Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).
Let me know if the firmware exploration works out. That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way. The
‘works with ipmitool’ though has me scratching my head.
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Last idea doesn’t work for me. So by the way idea as is is working
great – confluent does disconnect/connect after time in constant. But
for now it is 100% correct to say – it is a problem with IDRAC fw.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe its something like i have here. So did the
upgrade on few hosts and give them plenty of time to show me results.
Thanks for your answers, help and time
 it is very interesting quest :)
- Interesting ambitions
- Python VS Perl, thats good
- I think log files(not just trace, stderr, stdout) and
documentation(source on Github is the best doc o know, but
) are
things that i would like to be in Confluent
Very interested in the outcome. And thank you for working through
it. Also interested what you have liked, would like, and have
disliked about confluent.
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Thank you Jarrod, i’ll try to add patch and let you know after. Hope
90 minutes is enough, yes.
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
I’m loath to do a workaround, but in console.py (find /usr -name
console.py), it might be interesting to see how a change like the
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel
If it would pan out, should cause the console session to disconnect
itself roughly every 90 minutes and trigger reconnect (is 90 minutes
short enough in your case?) Would require a service confluent
restart to see if it had the desired effect.
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted). I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 115.2
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
and nothing happened
in the console’s log
—
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
ipmitool sol set volatile-bit-rate 115.2 1
To change the volatile bit rate to match the non-volatile bit rate
and see if the corruption goes away.
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
115200
idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF
that is strange, right
Hmm, what’s the baud rate the console is actually running at? Odd to
see the volatile and non volatile bit rates not be the same.
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
And to be clear, the corruption only starts after a long period of
time of being continuously connected?
Yes, that is correct
I might be interested in seeing ipmitool sol info 1 output against a
system while it is working versus showing corrupted info.
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Yes, reopen causes it to work again, without any garbage
 so looks
like normal console :)
Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal
console* before...
So reopen causes it to work again, and before, it’s not *hung*, but
erratic with garbage characters and occasional blips of sanity?
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Reopen console did the trick as well...
‘ctrl-e, then c, then o’ to reconnect.
Was conserver ondemand or full logging?
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Console starts showing garbage after <enter> inside rcons.
What do you mean when said “restarting console”?
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)
You’re absolutely right with ipmitool and conserver with the same
servers we were out of such troubles.
So the console starts showing garbage? Restarting the console causes
the garbage to go away?
You said that ipmitool with a certain configuration did not trigger this?
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
I’m out of ideas, let me show you all i see.
MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS
(more complex log below)
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80


13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
---
MONITORING_TEST dbb54 1492160401
ᅵᅵ
Porᅵ
—
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 204)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
---
Porᅵlo]0;console: dbb54 [13:25]
---
If you ctrl-e, c, o, does it restore the console after the time?
Can you tell that it goes after exactly 24hours on the dot?
When console hung, does ‘ipmitool sol activate’ say ‘session already
active’?
Yes,
# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Info: SOL payload already active on another session
Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?
[04/13 15:17:21 console connected]

 many our own messages
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
Pyghmi will do keepalive as well, and if that’s the problem, it
should be much shorter than 24 hours. In fact, it should be checking
if the SOL payload is active and owned by confluent specifically
every couple of minutes.
yes, thats correct
Sent: Friday, April 14, 2017 5:55 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
My last reply was incorrect. Problems still here. Im trying to find
something usefull inbetween confluent/pyghmi...
Confluent restart solves hangs/reopen all connections.
I think it isnt the best option to restart confluent 1 or 2 times in 24h.
--
banuchka
It is Dell’s related problem, not 100% but

Confluent from current master is doing things well :)
Thanks for pretty nice tool “confluentdbutil".
Looks like that problem was before
 The fix was to use ipmitool with
keepalive(one from xcat repos).
Here pyghmi is used maybe that the reason?
Hi,
Im trying to completely migrate from conserver to confluent, but catch strange behaviour.
Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.
I send messages with timestamp from OS >/dev/console every 30-60min
and take a look on them for monitoring purposes(consoles availability
monitoring).
I can open rcons and hit enter, after few secs console is waking
up(strange). I didnt see it happen with conserver or maybe im
wrong...
- as i can see the bigest part of consoles with hangs behaviour are
Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in
use.
- racreset hard/ipmitool bmc reset didnt do the things
- hit enter to console wake it up(for example with expect i can send
\r\n\f, but it looks bad)
- i didnt try to clean confluent's conf and restart it. Not sure it may help.
- HP consoles works well, same ipmi
- few consoles with custom pluging works good as well
So maybe my question is not about confluent, but if some of you have
some knowledge about same problems please share it! ;)
--
banuchka
banuchka
2017-04-19 13:46:40 UTC
Permalink
I've disabled the “ninety-minute suicide”, but it was a fun experiment :)
My guess is that I may be hanging the SOL session/BMC when I echo to the console at an incorrect baud rate, but maybe I’m wrong. I'll try to check it...

On 19 April 2017 at 14:41:30, Jarrod Johnson (***@lenovo.com) wrote:

Ok, also were those login/logouts always there, or only after that ‘try to suicide every 90 minutes’ experiment?

 

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, April 19, 2017 9:38 AM
To: xcat-***@lists.sourceforge.net; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

A bit of follow-up:

The experiment with nodehealth + echo > /dev/console + rcons didn’t hang the console; maybe it needs more time. I'll leave it running inside a tmux session for a bit longer.

 

On 19 April 2017 at 14:07:58, banuchka (***@gmail.com) wrote:

Thanks Jarrod, I already have a few “plugins” for old Sun servers without SOL, so it isn’t a big problem to create another one.

I really appreciate your help.

One more thing: I’m trying to fix all the BaudRates on the servers, because as far as I can see there are at least 3 places with that setting on the DRAC (I’m not sure this is a problem, but it’s not good practice to read and write at different speeds).

I’ll try your advice as well and let you know.

 

On 19 April 2017 at 13:59:59, Jarrod Johnson (***@lenovo.com) wrote:

I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,
etc).

In case you have a question, here's one example:
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE


As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node* commands,
imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.

Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guessing based
on comparative analysis.

-----Original Message-----
From: banuchka <***@gmail.com>
To: xCAT Users Mailing list <xcat-***@lists.sourceforge.net>, Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100

Hi,

I’m trying to use a plugin for confluent with a simple “ipmitool sol
activate” (placed in
/opt/confluent/lib/python/confluent/plugins/console/). It is a last
attempt to understand what's going on here.
The FW upgrade didn’t help overall.
With the current pyghmi setup I see lots of “log on/log off” messages
in the BMC’s logs, which doesn’t happen when I'm using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a bit push in the coming weeks it will have at
least an ‘events’ log along with a lot more function.
 
Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).
 
Let me know if the firmware exploration works out.  That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way.  The
‘works with ipmitool’ though has me scratching my head.
 
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Last idea doesn’t work for me. So by the way idea as is is working
great – confluent does disconnect/connect after time in constant. But
for now it is 100% correct to say – it is a problem with IDRAC fw.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe its something like i have here. So did the
upgrade on few hosts and give them plenty of time to show me results.
Thanks for your answers, help and time
 it is very interesting quest :)
 
- Interesting ambitions 
- Python VS Perl, thats good
- I think log files(not just trace, stderr, stdout) and
documentation(source on Github is the best doc o know, but
) are
things that i would like to be in Confluent
 
Very interested in the outcome.  And thank you for working through
it.  Also interested what you have liked, would like, and have
disliked about confluent.
 
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Thank you Jarrod, i’ll try to add patch and let you know after. Hope
90 minutes is enough, yes.
 
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
 
I’m loath to do a workaround, but in console.py (find /usr -name
console.py), it might be interesting to see how a change like the
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or 
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel
 
If it would pan out, should cause the console session to disconnect
itself roughly every 90 minutes and trigger reconnect (is 90 minutes
short enough in your case?)  Would require a service confluent
restart to see if it had the desired effect.
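
The idea behind the patch (count keepalives, and once enough have passed, treat the session as dead so it reconnects) can be shown in isolation. This is a standalone sketch, not pyghmi code: the diff as quoted is truncated, so the real threshold and keepalive interval are not visible, and the class name, method names and the interval passed in are all assumptions.

```python
class KeepaliveRecycler:
    """Count keepalive ticks and signal when the session should be recycled."""

    def __init__(self, keepalive_interval_s, max_session_age_s=90 * 60):
        if keepalive_interval_s <= 0:
            raise ValueError("keepalive interval must be positive")
        # Number of keepalives that fit into the allowed session age.
        self.limit = max(1, max_session_age_s // keepalive_interval_s)
        self.keepalivecount = 0

    def on_connect(self):
        """Reset the counter whenever the SOL payload is (re)activated."""
        self.keepalivecount = 0

    def on_keepalive(self):
        """Record one keepalive; return True when it is time to reconnect."""
        self.keepalivecount += 1
        return self.keepalivecount >= self.limit
```

With a hypothetical 60-second keepalive, `KeepaliveRecycler(60)` asks for a reconnect on the 90th keepalive, which matches the "roughly every 90 minutes" described here.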
 
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
 
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted).  I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
 
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
 
and nothing happened
 
in the console’s log
—
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
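
Bracketed events like the ones in this console log can be pulled out mechanically, which helps when comparing many nodes. The sketch below is an illustration only: it assumes the `[MM/DD HH:MM:SS message]` layout seen in this thread, takes the year as a parameter because the log does not record one, and the function names are invented. What a "console connected" without a preceding disconnect actually means is not established in the thread; the helper only lists such entries.

```python
import re
from datetime import datetime

EVENT_RE = re.compile(r"\[(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ([^\]]+)\]")


def parse_events(log_text, year=2017):
    """Return (datetime, message) pairs for bracketed events in a console log."""
    # Mail wrapping can split an event across lines, so collapse whitespace.
    flat = " ".join(log_text.split())
    events = []
    for stamp, message in EVENT_RE.findall(flat):
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        events.append((when, message))
    return events


def unmatched_connects(events):
    """Timestamps of 'console connected' events seen while already connected."""
    suspicious = []
    connected = False
    for when, message in events:
        if message == "console connected":
            if connected:
                suspicious.append(when)
            connected = True
        elif message == "console disconnected":
            connected = False
    return suspicious
```

Other messages, such as "connection by xcat_console", are parsed but ignored by `unmatched_connects`.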
 
ipmitool sol set volatile-bit-rate 115.2 1
 
 
To change the volatile bit rate to match the non-volatile bit rate
and see if the corruption goes away.
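
Checking for this mismatch across many BMCs is easy to script once the `ipmitool sol info` output is parsed. A minimal sketch, assuming the `key : value` layout shown in this thread; the function names are hypothetical and running ipmitool itself is left out.

```python
def parse_sol_info(text):
    """Parse `ipmitool sol info` output into a dict of stripped strings."""
    info = {}
    for line in text.splitlines():
        # "Info:" lines are diagnostics, not SOL parameters.
        if line.startswith("Info:") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info


def bit_rates_match(info):
    """True when the volatile and non-volatile SOL bit rates are equal."""
    return info["Volatile Bit Rate (kbps)"] == info["Non-Volatile Bit Rate (kbps)"]
```

For the output quoted in the thread, `bit_rates_match` is false before the `sol set volatile-bit-rate` command (38.4 versus 115.2) and true after it.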
 
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
115200
 
idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF
 
that is strange, right
 
Hmm, what’s the baud rate the console is actually running at?  Odd to
see the volatile and non volatile bit rates not be the same.
 
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
 
 
And to be clear, the corruption only starts after a long period of
time of being continuously connected?
Yes, that is correct
 
I might be interested in seeing ipmitool sol info 1 output against a
system while it is working versus showing corrupted info.
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Yes, reopening causes it to work again, without any garbage... so it looks
like a normal console :)
Hitting <enter> at first produces garbage output (ᅵᅵ Porᅵlo) and then a *normal
console* afterwards...
 
So reopen causes it to work again, and before, it’s not *hung*, but
erratic with garbage characters and occasional blips of sanity?
 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Reopening the console did the trick as well...
 
‘ctrl-e, then c, then o’ to reconnect.
 
Was conserver ondemand or full logging?
 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
The console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting the console”?
- <enter> inside rcons/confetty
- BMC reset (console disconnected/console connected)
 
You’re absolutely right: with ipmitool and conserver on the same
servers we had no such trouble.
So the console starts showing garbage?  Restarting the console causes
the garbage to go away?
 
You said that ipmitool with a certain configuration did not trigger this?
 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
I’m out of ideas, so let me show you everything I see.
 
 
MONITORING_TEST dbb54 1492160401 <= last message I sent from the OS
(more complete log below)
 
 
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
...
 
13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
MONITORING_TEST dbb54 1492160401
 
ᅵᅵ
  Porᅵ
---
 
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
  Porᅵlo]0;console: dbb54 [13:25]
 
 
---
 
If you ctrl-e, c, o, does it restore the console after that time?
 
Can you tell whether it fails after exactly 24 hours on the dot?
 
When the console is hung, does ‘ipmitool sol activate’ say ‘session already
active’?
Yes, 
# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Info: SOL payload already active on another session
 
Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?
[04/13 15:17:21 console connected]
... many of our own messages
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
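
For illustration only (not from the original poster): a small Python sketch that pulls the bracketed connect/disconnect events out of a console log such as /var/log/confluent/consoles/<nodename>, which makes it easier to line the events up against the time the console went quiet. The event format is inferred purely from the excerpts quoted in this thread, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches entries like "[04/14 09:13:38 console disconnected]".
# \s+ also tolerates entries that e-mail wrapping split across two lines.
EVENT_RE = re.compile(r"\[(\d{2}/\d{2})\s+(\d{2}:\d{2}:\d{2})\s+([^\]]+)\]")


def console_events(log_text):
    """Return (date, time, event) tuples in the order they appear."""
    events = []
    for date, time, what in EVENT_RE.findall(log_text):
        # Collapse any embedded line breaks or runs of spaces.
        events.append((date, time, " ".join(what.split())))
    return events


def count_events(log_text):
    """Count how often each kind of event occurs."""
    return Counter(what for _, _, what in console_events(log_text))


if __name__ == "__main__":
    sample = (
        "[04/14 09:13:38 console disconnected]\n"
        "[04/14 09:14:54 console connected]\n"
        "[04/14 10:15:13 connection by xcat_console]\n"
    )
    for date, time, what in console_events(sample):
        print(date, time, what)
```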
 
Pyghmi will do keepalive as well, and if that’s the problem, it
should be much shorter than 24 hours.  In fact, it should be checking
if the SOL payload is active and owned by confluent specifically
every couple of minutes.
Yes, that’s correct
 
Sent: Friday, April 14, 2017 5:55 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
My last reply was incorrect. The problems are still here. I’m trying to find
something useful between confluent and pyghmi...
Restarting confluent fixes the hangs and reopens all connections.
I don’t think restarting confluent once or twice every 24h is the best option.
-- 
banuchka
It is a Dell-related problem, not 100% but... Confluent from current master is doing things well :)
Thanks for the pretty nice tool “confluentdbutil”.
 
Looks like that problem has been seen before... The fix was to use ipmitool with
keepalive (the one from the xCAT repos).
Here pyghmi is used; maybe that is the reason?
 
Hi, 
 
I’m trying to migrate completely from conserver to confluent, but I’ve run into strange behaviour.
Some of my consoles hang after ~24h, so no new messages appear in their logs or in rcons.
I send timestamped messages from the OS to /dev/console every 30-60 min
and check for them for monitoring purposes (console availability
monitoring).
I can open rcons and hit enter, and after a few seconds the console wakes
up (strange). I didn’t see this happen with conserver, or maybe I’m
wrong...
- As far as I can see, most of the consoles that hang are
Dell iDRAC. It doesn’t matter whether RacSerial or IPMISerial is in
use.
- racreset hard/ipmitool bmc reset didn’t help
- Hitting enter in the console wakes it up (for example, with expect I can send
\r\n\f, but that looks bad)
- I didn’t try cleaning confluent’s configuration and restarting it. Not sure it would help.
- HP consoles work well, same IPMI
- A few consoles with a custom plugin work well too
 
So maybe my question is not about confluent, but if any of you have
some knowledge of the same problem, please share it! ;)
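
For illustration only: a minimal Python sketch of the availability check described above, which looks for the newest heartbeat line in a console log and decides whether it is too old. It assumes the heartbeat has the form "MONITORING_TEST <node> <epoch seconds>", as in the log excerpts in this thread; the two-hour threshold and the function names are assumptions.

```python
import re

# Heartbeat lines look like "MONITORING_TEST dbb54 1492160401" in this thread.
HEARTBEAT_RE = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text, node):
    """Return the newest heartbeat timestamp for node, or None if absent."""
    stamps = [int(ts) for name, ts in HEARTBEAT_RE.findall(log_text)
              if name == node]
    return max(stamps) if stamps else None


def is_stale(last_seen, now, max_age_s=2 * 3600):
    """True if no heartbeat was seen, or the newest one is older than max_age_s.

    The two-hour default is an assumption: twice the longest sending
    interval (60 min) mentioned in the message.
    """
    return last_seen is None or now - last_seen > max_age_s


if __name__ == "__main__":
    import time
    sample = "MONITORING_TEST dbb54 1492160401\n"
    seen = last_heartbeat(sample, "dbb54")
    print("stale" if is_stale(seen, int(time.time())) else "ok")
```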
 
-- 
banuchka
banuchka
2017-04-19 13:50:55 UTC
Permalink
And one more thing about Confluent:
Is it expected behaviour that when I run “makeconfluent” / “makeconfluent -l” (with the confluent service running) to regenerate nodes or add new nodes, confluent shuts down?
For now I have written a wrapper for that procedure (makeconfluent -d for unneeded nodes, makeconfluent nodelist for new nodes).

On 19 April 2017 at 14:41:30, Jarrod Johnson (***@lenovo.com) wrote:

Ok, also were those login/logouts always there, or only after that ‘try to suicide every 90 minutes’ experiment?

 

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, April 19, 2017 9:38 AM
To: xcat-***@lists.sourceforge.net; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

A bit of follow-up:

The experiment with nodehealth + echo > /dev/console + rcons didn’t hang the console... maybe it needs more time. I’ll leave it running inside a tmux session for a bit longer.

 

On 19 April 2017 at 14:07:58, banuchka (***@gmail.com) wrote:

Thanks Jarrod, I already have a few “plugins” for old Sun servers without SOL, so it isn’t a big problem to create another one.

I really appreciate your help.

One more thing: I’m trying to fix all the BaudRates on the servers, because as far as I can see there are at least 3 places with that setting on the DRAC (I’m not sure this is a problem, but it’s not good practice to read and write at different speeds).

I’ll try your advice as well and let you know.

 

On 19 April 2017 at 13:59:59, Jarrod Johnson (***@lenovo.com) wrote:

I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,
etc).

In case you have a question, here's one example:
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE


As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node* commands,
imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.

Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guess based
on comparative analysis.

-----Original Message-----
From: banuchka <***@gmail.com>
To: xCAT Users Mailing list <xcat-***@lists.sourceforge.net>, Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100

Hi,

I’m trying to use a plugin for confluent with a simple “ipmitool sol
activate” (placed here:
/opt/confluent/lib/python/confluent/plugins/console/). It is my last
attempt to understand what’s going on here.
The firmware upgrade didn’t help overall.
With the current setup with pyghmi I see lots of “log on/log off” messages
in the BMC’s logs, which doesn’t happen when I’m using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a big push in the coming weeks; it will have at
least an ‘events’ log along with a lot more function.
 
Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).
 
Let me know if the firmware exploration works out.  That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way.  The
‘works with ipmitool’ though has me scratching my head.
 
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
The last idea doesn’t work for me. By the way, the idea itself works
great: confluent does disconnect/connect after the fixed time. But
for now it is 100% correct to say that it is a problem with the iDRAC firmware.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it’s something like what I have here. So I did the
upgrade on a few hosts and will give them plenty of time to show me results.
Thanks for your answers, help and time... it is a very interesting quest :)
 
- Interesting ambitions
- Python vs. Perl, that’s good
- I think log files (not just trace, stderr, stdout) and
documentation (the source on GitHub is the best doc I know, but...) are
things that I would like to see in Confluent
 
Very interested in the outcome.  And thank you for working through
it.  Also interested what you have liked, would like, and have
disliked about confluent.
 
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Thank you Jarrod, I’ll try adding the patch and let you know afterwards. I hope
90 minutes is enough, yes.
 
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
 
I’m loath to do a workaround, but in console.py (find /usr -name
console.py), it might be interesting to see how a change like the following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or 
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel
 
If it pans out, it should cause the console session to disconnect
itself roughly every 90 minutes and trigger a reconnect (is 90 minutes
short enough in your case?)  It would require a service confluent
restart to see if it has the desired effect.
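
For illustration only: a standalone Python sketch of the idea behind the patch, namely counting keepalive checks and treating the session as dead once a limit is reached so that the caller reconnects. This is not pyghmi code. The class and method names are hypothetical, and the limit is an assumption, since the actual condition in the quoted diff is cut off and the keepalive interval is only described as "every couple of minutes".

```python
class SolKeepaliveCounter:
    """Force a periodic reconnect by limiting keepalive checks per session."""

    def __init__(self, keepalive_interval_s, max_session_age_s):
        # Number of successful keepalive checks allowed before giving up.
        self.max_checks = max(1, max_session_age_s // keepalive_interval_s)
        self.keepalivecount = 0

    def on_activate(self):
        """Reset the counter whenever the SOL payload is (re)activated."""
        self.keepalivecount = 0

    def on_keepalive(self, owner_id, my_session_id):
        """Return True when the session should be treated as deactivated."""
        if owner_id != my_session_id or self.keepalivecount >= self.max_checks:
            return True
        self.keepalivecount += 1
        return False


if __name__ == "__main__":
    # Assumed numbers: one check every 2 minutes, reconnect after ~90 minutes.
    counter = SolKeepaliveCounter(keepalive_interval_s=120,
                                  max_session_age_s=90 * 60)
    counter.on_activate()
    checks = 0
    while not counter.on_keepalive(owner_id=1, my_session_id=1):
        checks += 1
    print("reconnect after", checks, "successful checks")
```

With the assumed numbers the session would be dropped on the check after the 45th successful one, i.e. roughly 90 minutes after activation.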
 
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
 
I would be curious whether the same ipmitool process is still running a
day after a check (e.g. whether ipmitool is exiting and getting restarted).  I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
 
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
 
and nothing happened
 
Jarrod Johnson
2017-04-19 13:53:07 UTC
Permalink
Confluent shouldn’t shut down or even restart...

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, April 19, 2017 9:51 AM
To: xcat-***@lists.sourceforge.net; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

And one more thing about Confluent:
Is it expected behaviour that when I run “makeconfluent” / “makeconfluent -l” (with the confluent service running) to regenerate nodes or add new nodes, confluent shuts down?
For now I have written a wrapper for that procedure (makeconfluent -d for unneeded nodes, makeconfluent nodelist for new nodes).


On 19 April 2017 at 14:41:30, Jarrod Johnson (***@lenovo.com) wrote:
Ok, also were those login/logouts always there, or only after that ‘try to suicide every 90 minutes’ experiment?

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, April 19, 2017 9:38 AM
To: xcat-***@lists.sourceforge.net; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

A bit of follow-up:
The experiment with nodehealth + echo > /dev/console + rcons didn’t hang the console... maybe it needs more time. I’ll leave it running inside a tmux session for a bit longer.


On 19 April 2017 at 14:07:58, banuchka (***@gmail.com) wrote:
Thanks Jarrod, I already have a few “plugins” for old Sun servers without SOL, so it isn’t a big problem to create another one.
I really appreciate your help.
One more thing: I’m trying to fix all the BaudRates on the servers, because as far as I can see there are at least 3 places with that setting on the DRAC (I’m not sure this is a problem, but it’s not good practice to read and write at different speeds).
I’ll try your advice as well and let you know.


On 19 April 2017 at 13:59:59, Jarrod Johnson (***@lenovo.com) wrote:
I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,
etc).

In case you have a question, here's one example:
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE


As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node* commands,
imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.

Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guessing based
on comparative analysis.

-----Original Message-----
From: banuchka <***@gmail.com>
To: xCAT Users Mailing list <xcat-***@lists.sourceforge.net>, Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100

Hi,

I’m trying to use a plugin for confluent with a simple “ipmitool sol
activate” (placed here
/opt/confluent/lib/python/confluent/plugins/console/). It is a last
attempt to understand what’s going on here.
The FW upgrade didn’t help me overall.
With the current pyghmi setup I see lots of “log on/log off” messages
in the BMC’s logs, which doesn’t happen when I’m using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a big push in the coming weeks; it will have at
least an ‘events’ log along with a lot more function.
Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).
Let me know if the firmware exploration works out. That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way. The
‘works with ipmitool’ though has me scratching my head.
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
The last idea doesn’t work for me. The idea as such works great, by the
way: confluent does disconnect/connect after a constant time. But
for now it is 100% correct to say: it is a problem with the iDRAC FW.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it’s something like what I have here. So I did the
upgrade on a few hosts and will give them plenty of time to show me results.
Thanks for your answers, help and time... it is a very interesting quest :)
- Interesting ambitions
- Python VS Perl, that’s good
- I think log files (not just trace, stderr, stdout) and
documentation (source on GitHub is the best doc I know, but...) are
things that I would like to have in Confluent
Very interested in the outcome. And thank you for working through
it. Also interested what you have liked, would like, and have
disliked about confluent.
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Thank you Jarrod, i’ll try to add patch and let you know after. Hope
90 minutes is enough, yes.
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
I’m loath to do a workaround, but in console.py (find /usr -name
console.py), it might be interesting to see how a change like the following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel
If it would pan out, should cause the console session to disconnect
itself roughly every 90 minutes and trigger reconnect (is 90 minutes
short enough in your case?) Would require a service confluent
restart to see if it had the desired effect.
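A rough sketch of the idea behind that patch, for anyone following along. Note that the diff as archived lost the second half of the `if` condition, so the comparison against a threshold below is an assumption, as are the class name, the `max_keepalives` parameter and its value. It is a standalone illustration, not pyghmi code:

```python
class KeepaliveReconnectGuard:
    """Counts successful keepalive checks and asks for a reconnect
    once a limit is reached (hypothetical helper, not part of pyghmi)."""

    def __init__(self, max_keepalives):
        # max_keepalives is an assumed knob; the thread does not show
        # the value used in the real patch.
        self.max_keepalives = max_keepalives
        self.keepalivecount = 0

    def on_activate(self):
        # Mirrors the patch: the counter is reset when SOL is (re)activated.
        self.keepalivecount = 0

    def on_keepalive(self, current_owner, my_session_id):
        """Return True if the SOL session should be treated as deactivated."""
        if (current_owner != my_session_id
                or self.keepalivecount >= self.max_keepalives):
            return True
        self.keepalivecount += 1
        return False
```

With a keepalive check every couple of minutes, a limit in the tens would force a reconnect on the order of the 90 minutes mentioned above; the exact numbers are not shown in the thread.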
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted). I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 115.2
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
and nothing happened
in the console’s log
—
[04/14 12:49:12 console disconnected][04/14 12:49:29 console
connected][04/14 13:01:02 console disconnected][04/14 13:01:02
console connected][04/14 13:03:54 console disconnected][04/14
13:04:15 console connected][04/14 13:38:37 console connected][04/14
15:31:47 console disconnected][04/14 15:36:24 console
connected][04/14 15:42:08 connection by xcat_console]
---
ipmitool sol set volatile-bit-rate 115.2 1
To change the volatile bit rate to match the non-volatile bit rate
and see if the corruption goes away.
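To spot this mismatch across many BMCs without reading the output by eye, something like the following could work. It is a minimal sketch that only parses text in the format of the `ipmitool sol info` output shown in this thread; the function names are made up for illustration, and how the text is obtained (e.g. via subprocess) is left out:

```python
def parse_sol_info(text):
    """Turn 'Key : value' lines of `ipmitool sol info` output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Skip informational lines such as "Info: SOL parameter ...".
        if line.startswith("Info:") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def bit_rate_mismatch(text):
    """Return (volatile, non_volatile) if the two rates differ, else None."""
    fields = parse_sol_info(text)
    volatile = fields["Volatile Bit Rate (kbps)"]
    non_volatile = fields["Non-Volatile Bit Rate (kbps)"]
    if volatile != non_volatile:
        return volatile, non_volatile
    return None
```

A non-None result would flag the node as a candidate for `ipmitool sol set volatile-bit-rate`.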
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
115200
idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF
that is strange, right
Hmm, what’s the baud rate the console is actually running at? Odd to
see the volatile and non volatile bit rates not be the same.
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
And to be clear, the corruption only starts after a long period of
time of being continuously connected?
Yes, that is correct
I might be interested in seeing ipmitool sol info 1 output against a
system while it is working versus showing corrupted info.
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Yes, reopening causes it to work again, without any garbage... so it looks
like a normal console :)
Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal
console* before...
So reopen causes it to work again, and before, it’s not *hung*, but
erratic with garbage characters and occasional blips of sanity?
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Reopen console did the trick as well...
‘ctrl-e, then c, then o’ to reconnect.
Was conserver ondemand or full logging?
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting console”?
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)
You’re absolutely right: with ipmitool and conserver on the same
servers we had no such troubles.
So the console starts showing garbage? Restarting the console causes
the garbage to go away?
You said that ipmitool with a certain configuration did not trigger this?
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
I’m out of ideas, let me show you all i see.
MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS
(more complex log below)
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80


13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
---
MONITORING_TEST dbb54 1492160401
ᅵᅵ
Porᅵ
—
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 204)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF],
proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
---
Porï¿œlo]0;console: dbb54 [13:25]
---
If you ctrl-e, c, o, does it restore the console after the time?
Can you tell that it goes after exactly 24hours on the dot?
When console hung, does ‘ipmitool sol activate’ say ‘session already
active’?
Yes,
# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Info: SOL payload already active on another session
Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?
[04/13 15:17:21 console connected]

... many of our own messages
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
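For counting connects and disconnects in a log like this, a small parser can help. This is a sketch that assumes the bracketed event format seen in the excerpts in this thread; the timestamps carry no year, so the year is passed in as an assumption, and the function name is hypothetical:

```python
import re
from datetime import datetime

# Matches e.g. "[04/14 09:13:38 console disconnected]".
EVENT = re.compile(r"\[(\d\d/\d\d \d\d:\d\d:\d\d) ([^\]]+)\]")


def parse_console_events(text, year=2017):
    """Return a list of (datetime, description) for bracketed log events."""
    # Events may be wrapped across lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    events = []
    for stamp, what in EVENT.findall(flat):
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        events.append((when, what))
    return events
```

Subtracting consecutive timestamps then shows how long each connected or disconnected period lasted.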
Pyghmi will do keepalive as well, and if that’s the problem, it
should be much shorter than 24 hours. In fact, it should be checking
if the SOL payload is active and owned by confluent specifically
every couple of minutes.
yes, thats correct
Sent: Friday, April 14, 2017 5:55 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
My last reply was incorrect. The problems are still here. I’m trying to find
something useful between confluent and pyghmi...
A confluent restart resolves the hangs and reopens all connections.
I don’t think restarting confluent 1 or 2 times every 24h is the best option.
--
banuchka
It is a Dell-related problem, not 100% but...

Confluent from current master is doing things well :)
Thanks for the pretty nice tool “confluentdbutil”.
Looks like that problem has come up before... The fix was to use ipmitool with
keepalive (the one from the xcat repos).
Here pyghmi is used, maybe that’s the reason?
Hi,
I’m trying to migrate completely from conserver to confluent, but I’ve run into strange behaviour.
Some of my consoles hang after ~24h, so there are no new messages in their logs or in rcons.
I send timestamped messages from the OS to /dev/console every 30-60 min
and check for them for monitoring purposes (console availability
monitoring).
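A check along these lines could be automated roughly as follows. This sketch assumes the messages look like the `MONITORING_TEST <node> <timestamp>` lines quoted elsewhere in this thread, that the last field is a Unix epoch time, and that two hours of silence means trouble; the function names and the threshold are illustrative only:

```python
import re

# Assumed heartbeat format: "MONITORING_TEST dbb54 1492160401".
HEARTBEAT = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text):
    """Return (node, epoch_seconds) of the newest heartbeat, or None."""
    matches = HEARTBEAT.findall(log_text)
    if not matches:
        return None
    node, stamp = matches[-1]
    return node, int(stamp)


def is_stale(log_text, now, max_age_seconds=2 * 3600):
    """True if no heartbeat was logged within max_age_seconds of `now`."""
    heartbeat = last_heartbeat(log_text)
    return heartbeat is None or now - heartbeat[1] > max_age_seconds
```

It could be fed the contents of /var/log/confluent/consoles/<nodename> together with `time.time()` as `now`.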
I can open rcons and hit enter, and after a few secs the console wakes
up (strange). I didn’t see this happen with conserver, or maybe I’m
wrong...
- as far as I can see, most of the consoles that hang are
Dell iDRAC. It doesn’t matter whether RacSerial or IPMISerial is in
use.
- racreset hard/ipmitool bmc reset didn’t help
- hitting enter in the console wakes it up (for example, with expect I can send
\r\n\f, but it looks bad)
- I didn’t try to clean confluent's conf and restart it. Not sure it would help.
- HP consoles work well, same ipmi
- a few consoles with a custom plugin work well too
So maybe my question is not about confluent, but if any of you know
about similar problems, please share! ;)
--
banuchka
banuchka
2017-04-19 13:56:12 UTC
Permalink
Bad news :)

On 19 April 2017 at 14:55:45, Jarrod Johnson (***@lenovo.com) wrote:

Confluent shouldn’t shut down or even restart


 

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, April 19, 2017 9:51 AM
To: xcat-***@lists.sourceforge.net; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

And one more thing about Confluent:

Is it expected behaviour that when I run “makeconfluent” / “makeconfluent -l” (with the confluent service running) to regenerate nodes or add new nodes, confluent shuts down?

So for now I made a wrapper for that procedure (makeconfluent -d for unneeded nodes, makeconfluent nodelist for new nodes).

 

On 19 April 2017 at 14:41:30, Jarrod Johnson (***@lenovo.com) wrote:

Ok, also were those login/logouts always there, or only after that ‘try to suicide every 90 minutes’ experiment?

 

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, April 19, 2017 9:38 AM
To: xcat-***@lists.sourceforge.net; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Bit follow up:

The experiment with nodehealth + echo > /dev/console + rcons didn’t hang the console... maybe it needs more time. I’ll leave it running inside a tmux session for a bit longer.

 

On 19 April 2017 at 14:07:58, banuchka (***@gmail.com) wrote:

Thanks Jarrod, I already have a few “plugins” for old Sun servers without SOL, so it isn’t a big problem to create another one.

I really appreciate your help.

One more thing: I’m trying to fix the BaudRates on all servers, because as far as I can see there are at least 3 places with that setting on the DRAC (I’m not sure this is the problem, but it’s not good practice to read and write at different speeds).

I’ll try your advice as well and let you know.

 

On 19 April 2017 at 13:59:59, Jarrod Johnson (***@lenovo.com) wrote:

I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,
etc).

In case you have a question, here's one example:
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE


As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node* commands,
imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.

Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guessing based
on comparative analysis.

-----Original Message-----
From: banuchka <***@gmail.com>
To: xCAT Users Mailing list <xcat-***@lists.sourceforge.net>, Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100

Hi,

I’m trying to use a plugin for confluent with a simple “ipmitool sol
activate” (placed here
/opt/confluent/lib/python/confluent/plugins/console/). It is a last
attempt to understand what’s going on here.
The FW upgrade didn’t help me overall.
With the current pyghmi setup I see lots of “log on/log off” messages
in the BMC’s logs, which doesn’t happen when I’m using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a big push in the coming weeks; it will have at
least an ‘events’ log along with a lot more function.
 
Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).
 
Let me know if the firmware exploration works out.  That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way.  The
‘works with ipmitool’ though has me scratching my head.
 
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
The last idea doesn’t work for me. The idea as such works great, by the
way: confluent does disconnect/connect after a constant time. But
for now it is 100% correct to say: it is a problem with the iDRAC FW.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it’s something like what I have here. So I did the
upgrade on a few hosts and will give them plenty of time to show me results.
Thanks for your answers, help and time... it is a very interesting quest :)

- Interesting ambitions
- Python VS Perl, that’s good
- I think log files (not just trace, stderr, stdout) and
documentation (source on GitHub is the best doc I know, but...) are
things that I would like to have in Confluent
 
Very interested in the outcome.  And thank you for working through
it.  Also interested what you have liked, would like, and have
disliked about confluent.
 
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Thank you Jarrod, i’ll try to add patch and let you know after. Hope
90 minutes is enough, yes.
 
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
 
I’m loath to do a workaround, but in console.py (find /usr -name
console.py), it might be interesting to see how a change like the following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or 
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel
 
If it would pan out, should cause the console session to disconnect
itself roughly every 90 minutes and trigger reconnect (is 90 minutes
short enough in your case?)  Would require a service confluent
restart to see if it had the desired effect.
 
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
 
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted).  I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
 
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
 
and nothing happened
 
in the console’s log
—
[04/14 12:49:12 console disconnected][04/14 12:49:29 console
connected][04/14 13:01:02 console disconnected][04/14 13:01:02
console connected][04/14 13:03:54 console disconnected][04/14
13:04:15 console connected][04/14 13:38:37 console connected][04/14
15:31:47 console disconnected][04/14 15:36:24 console
connected][04/14 15:42:08 connection by xcat_console]
---
 
ipmitool sol set volatile-bit-rate 115.2 1
 
 
To change the volatile bit rate to match the non-volatile bit rate
and see if the corruption goes away.
 
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
115200
 
idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF
 
that is strange, right
 
Hmm, what’s the baud rate the console is actually running at?  Odd to
see the volatile and non volatile bit rates not be the same.
 
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
 
 
And to be clear, the corruption only starts after a long period of
time of being continuously connected?
Yes, that is correct
 
I might be interested in seeing ipmitool sol info 1 output against a
system while it is working versus showing corrupted info.
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Yes, reopening causes it to work again, without any garbage... so it looks
like a normal console :)
Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal
console* before...
 
So reopen causes it to work again, and before, it’s not *hung*, but
erratic with garbage characters and occasional blips of sanity?
 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Reopen console did the trick as well...
 
‘ctrl-e, then c, then o’ to reconnect.
 
Was conserver ondemand or full logging?
 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting console”?
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same
servers we had no such troubles.
So the console starts showing garbage?  Restarting the console causes
the garbage to go away?
 
You said that ipmitool with a certain configuration did not trigger this?
 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
I’m out of ideas, let me show you all i see.
 
 
MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS
(more complex log below)
 
 
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 


 
13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
MONITORING_TEST dbb54 1492160401
 
ᅵᅵ
  Porᅵ
—
 
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
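
A capture like the one above is easier to reason about once the gaps between BMC replies are computed. Below is a minimal Python sketch, assuming verbose tcpdump text in the layout quoted in this thread (a timestamped header line, optionally wrapped, followed by an indented flow line). The names `parse_capture` and `longest_gap` are hypothetical helpers, not part of confluent or pyghmi.

```python
import re
from datetime import datetime, timedelta

# Header line: "13:23:42.342886 IP (tos 0x0, ..."
HEADER = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")
# Indented flow line: "    src > dst: [udp sum ok] UDP, length 64"
FLOW = re.compile(r"^\s+(\S+) > (\S+): .*UDP, length (\d+)")


def parse_capture(text):
    """Return (timestamp, source, destination, udp_length) tuples."""
    packets = []
    pending = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            pending = datetime.strptime(header.group(1), "%H:%M:%S.%f")
            continue
        flow = FLOW.match(line)
        if flow and pending is not None:
            packets.append(
                (pending, flow.group(1), flow.group(2), int(flow.group(3)))
            )
            pending = None
    return packets


def longest_gap(packets, source_suffix=".623"):
    """Longest silence between consecutive packets sent from UDP port 623."""
    times = [ts for ts, src, _dst, _length in packets
             if src.endswith(source_suffix)]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return max(gaps, default=timedelta(0))


SAMPLE = """\
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF],
proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
"""

if __name__ == "__main__":
    print(longest_gap(parse_capture(SAMPLE)))
```

The timestamps carry no date, so this only works within a single day of capture; a longer capture would need tcpdump's date-bearing timestamp format.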
 
---
  Porᅵlo]0;console: dbb54 [13:25]
 
 
---
 
If you ctrl-e, c, o, does it restore the console after the time?
 
Can you tell that it goes after exactly 24hours on the dot?
 
When console hung, does ‘ipmitool sol activate’ say ‘session already
active’?
Yes, 
# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Info: SOL payload already active on another session
 
Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?
[04/13 15:17:21 console connected]

… many our own messages
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
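
The availability check described in this thread (a timestamped message written to /dev/console, then looked up in the console log) could be automated along these lines. This is only a sketch: the `MONITORING_TEST <node> <epoch>` layout is taken from the log excerpt above, while the two-hour threshold and the helper names `last_heartbeat` and `is_stale` are assumptions chosen for illustration.

```python
import re

# Matches lines such as "^MMONITORING_TEST dbb54 1492160401 | ..."
HEARTBEAT = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text, node):
    """Return the newest epoch timestamp logged for `node`, or None."""
    stamps = [int(epoch) for name, epoch in HEARTBEAT.findall(log_text)
              if name == node]
    return max(stamps, default=None)


def is_stale(log_text, node, now, max_age_seconds=2 * 3600):
    """True when no heartbeat for `node` is newer than `max_age_seconds`.

    The default of two hours is an assumed margin over the 30-60 minute
    send interval mentioned in the thread.
    """
    newest = last_heartbeat(log_text, node)
    return newest is None or now - newest > max_age_seconds


LOG = """\
[04/13 15:17:21 console connected]
MONITORING_TEST dbb54 1492156801
MONITORING_TEST dbb54 1492160401
[04/14 09:05:13 console connected]
"""

if __name__ == "__main__":
    import time
    print(is_stale(LOG, "dbb54", now=int(time.time())))
```

In practice `log_text` would be read from the node's file under /var/log/confluent/consoles/, and a stale result would be the cue to reopen the console.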
 
Pyghmi will do keepalive as well, and if that’s the problem, it
should be much shorter than 24 hours.  In fact, it should be checking
if the SOL payload is active and owned by confluent specifically
every couple of minutes.
yes, thats correct
 
Sent: Friday, April 14, 2017 5:55 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
My last reply was incorrect. Problems still here. Im trying to find
something usefull inbetween confluent/pyghmi...
Confluent restart solves hangs/reopen all connections.
I think it isnt the best option to restart confluent 1 or 2 times in 24h.
-- 
banuchka
It is Dell’s related problem, not 100% but…

Confluent from current master is doing things well :) 
Thanks for pretty nice tool “confluentdbutil".
 
Looks like that problem was before… The fix was to use ipmitool with
keepalive(one from xcat repos).
Here pyghmi is used maybe that the reason?
 
Hi, 
 
Im trying to completely migrate from conserver to confluent, but catch strange behaviour.
Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.
I send messages with timestamp from OS >/dev/console every 30-60min
and take a look on them for monitoring purposes(consoles availability
monitoring).
I can open rcons and hit enter, after few secs console is waking
up(strange). I didnt see it happen with conserver or maybe im
wrong...
- as i can see the bigest part of consoles with hangs behaviour are
Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in
use.
- racreset hard/ipmitool bmc reset didnt do the things
- hit enter to console wake it up(for example with expect i can send
\r\n\f, but it looks bad)
- i didnt try to clean confluent's conf and restart it. Not sure it may help.
- HP consoles works well, same ipmi
- few consoles with custom pluging works good as well
 
So maybe my question is not about confluent, but if some of you have
some knowledge about same problems please share it! ;)
 
-- 
banuchka
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
xCAT-user mailing list
https://lists.sourceforge.net/lists/listinfo/xcat-user
 
 
 
Jarrod Johnson
2017-04-19 13:59:02 UTC
Permalink
Anything in the /var/log/confluent neighborhood when that restart
happens? It doesn't happen if given a noderange?

-----Original Message-----
From: banuchka <***@gmail.com>
To: xcat-***@lists.sourceforge.net <xcat-***@lists.sourceforge.net>,
Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
Date: Wed, 19 Apr 2017 14:56:12 +0100

Bad news :)
Confluent shouldn’t shut down or even restart…
 
Sent: Wednesday, April 19, 2017 9:51 AM
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Is it expected behaviour that when I run “makeconfluentcfg” /
“makeconfluentcfg -l” (with the confluent service running) to regenerate
nodes or add new nodes, confluent shuts down…?
So for now I wrote a wrapper for that procedure (makeconfluentcfg -d for
unneeded nodes, makeconfluentcfg nodelist for new nodes).
 
Ok, also were those login/logouts always there, or only after that
‘try to suicide every 90 minutes’ experiment?
 
Sent: Wednesday, April 19, 2017 9:38 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
The experiment with nodehealth + echo > /dev/console + rcons didn’t hang
the console… maybe it needs more time. I’ll leave it running inside a
tmux session for a while longer.
 
Thanks Jarrod, I already have few “plugins” for old Sun servers
without SOL so it isn’t a big problem to create another one.
I really appreciate your help.
As one more thing I’m trying to fix all BaudRates on servers, because
as i can see on DRAC there are minimum 3 places with that setting(Im
not sure this is a problem, but it’s not a good practice to read and
write on different speed).
I’ll try your advice as well and let you know.
 
I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,
etc).
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
 
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE
As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node*
commands, imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.
Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guessing based
on comparative analysis.
-----Original Message-----
 J
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100
Hi,
I’m trying a confluent plugin that simply runs “ipmitool sol
activate” (placed in
/opt/confluent/lib/python/confluent/plugins/console/). It is my last
attempt to understand what’s going on here.
The FW upgrade didn’t help overall.
With the current pyghmi setup I see lots of “log on/log off” messages
in the BMC’s logs, which doesn’t happen when I’m using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a big push in the coming weeks; it will have at
least an ‘events’ log along with a lot more function.
 
Then some more fleshed out documentation (beyond the preliminary
stuff on hpc.lenovo.com).
 
Let me know if the firmware exploration works out.  That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way.  The
‘works with ipmitool’ though has me scratching my head.
 
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Last idea doesn’t work for me. So by the way idea as is is working
great – confluent does disconnect/connect after time in constant. But
for now it is 100% correct to say – it is a problem with IDRAC fw.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe its something like i have here. So did the
upgrade on few hosts and give them plenty of time to show me results.
Thanks for your answers, help and time… it is very interesting quest :)
 
- Interesting ambitions 
- Python VS Perl, thats good
- I think log files (not just trace, stderr, stdout) and
documentation (source on Github is the best doc I know, but…) are
things that I would like to be in Confluent
 
Very interested in the outcome.  And thank you for working through
it.  Also interested what you have liked, would like, and have
disliked about confluent.
 
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Thank you Jarrod, i’ll try to add patch and let you know after. Hope
90 minutes is enough, yes.
 
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
 
I’m loath to do a workaround, but in console.py (find /usr -name
console.py), it might be interesting to see how a change like this
behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or 
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel
 
If it would pan out, should cause the console session to disconnect
itself roughly every 90 minutes and trigger reconnect (is 90 minutes
short enough in your case?)  Would require a service confluent
restart to see if it had the desired effect.
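
The idea behind the quoted diff, a keepalive counter that forces a periodic disconnect so the caller reconnects, can be sketched in isolation as below. This is not the pyghmi code: the second half of the `if currowner[0] != ...` condition is cut off in the quoted diff, so the count-based limit here is a guess, and `SolWatchdog`, `max_keepalives` and `should_deactivate` are hypothetical names. Only the owner-id decoding mirrors the `struct` calls shown in the diff.

```python
import struct


class SolWatchdog:
    """Decides after each keepalive whether the SOL session should drop."""

    def __init__(self, session_id, max_keepalives=90):
        self.session_id = session_id
        # Assumed threshold; the real value is not visible in the diff.
        self.max_keepalives = max_keepalives
        self.keepalivecount = 0

    def on_activate(self):
        """Reset the counter whenever the SOL payload is (re)activated."""
        self.keepalivecount = 0

    def should_deactivate(self, response_data):
        """Return True if the session is owned elsewhere or has aged out."""
        # Same decoding as the diff: first four bytes, little-endian.
        (owner,) = struct.unpack("<I", struct.pack("4B", *response_data[:4]))
        if owner != self.session_id:
            return True
        self.keepalivecount += 1
        return self.keepalivecount > self.max_keepalives


if __name__ == "__main__":
    watchdog = SolWatchdog(session_id=0x01020304, max_keepalives=2)
    ours = [0x04, 0x03, 0x02, 0x01]
    print([watchdog.should_deactivate(ours) for _ in range(3)])
```

With one keepalive per minute, a limit of 90 would give the "roughly every 90 minutes" reconnect described above, but the actual keepalive interval is not stated in the thread.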
 
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
 
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted).  I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
 
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
 
and nothing happened
 
in the console’s log

[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
 
If you do have any in corrupted state, would be interested to see
ipmitool sol set volatile-bit-rate 115.2 1
 
 
To change the volatile bit rate to match the non-volatile bit rate
and see if the corruption goes away.
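
Checking many BMCs for this mismatch by eye is tedious, so the comparison could be scripted. The sketch below parses text in the `ipmitool sol info` layout shown in this thread and reports whether the volatile and non-volatile bit rates agree; `parse_sol_info` and `bit_rates_match` are hypothetical helper names, and the sample is a shortened copy of the output quoted here.

```python
def parse_sol_info(text):
    """Map each 'Name : value' line of `ipmitool sol info` output to a dict."""
    fields = {}
    for line in text.splitlines():
        name, separator, value = line.partition(" : ")
        if separator:
            fields[name.strip()] = value.strip()
    return fields


def bit_rates_match(fields):
    """True only when both bit rates are present and equal."""
    volatile = fields.get("Volatile Bit Rate (kbps)")
    non_volatile = fields.get("Non-Volatile Bit Rate (kbps)")
    return volatile is not None and volatile == non_volatile


SAMPLE = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
"""

if __name__ == "__main__":
    print(bit_rates_match(parse_sol_info(SAMPLE)))  # False
```

The "Info:" line is skipped because it has no " : " separator with surrounding spaces. Feeding the function real output would mean capturing the command's stdout per node, which is left out here.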
 
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
115200
 
idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF
 
that is strange, right
 
Hmm, what’s the baud rate the console is actually running at?  Odd to
see the volatile and non volatile bit rates not be the same.
 
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
 
 
And to be clear, the corruption only starts after a long period of
time of being continuously connected?
Yes, that is correct
 
I might be interested in seeing ipmitool sol info 1 output against a
system while it is working versus showing corrupted info.
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Yes, reopen causes it to work again,  without any garbage… so looks
like normal console :)
Hit <enter> causes at first garbage output(�� Por�lo) and *normal
console* before...
 
So reopen causes it to work again, and before, it’s not *hung*, but
erratic with garbage characters and occasional blips of sanity?
 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Reopen console did the trick as well...
 
‘ctrl-e, then c, then o’ to reconnect.
 
Was conserver ondemand or full logging?
 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Console starts showing garbage after <enter> inside rcons.
What do you mean when said “restarting console”?
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)
 
You’re absolutely right with ipmitool and conserver with the same
servers we were out of such troubles.
So the console starts showing garbage?  Restarting the console causes
the garbage to go away?
 
You said that ipmitool with a certain configuration did not trigger this?
 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
I’m out of ideas, let me show you all i see.
 
 
MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS
(more complex log below)
 
 
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 

 
13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
MONITORING_TEST dbb54 1492160401
 
��
  Por�

 
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
  Por�lo]0;console: dbb54 [13:25]
 
 
---
 
If you ctrl-e, c, o, does it restore the console after the time?
 
Can you tell that it goes after exactly 24hours on the dot?
 
When console hung, does ‘ipmitool sol activate’ say ‘session already
active’?
Yes, 
# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Info: SOL payload already active on another session
 
Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?
[04/13 15:17:21 console connected]
… many our own messages
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
 
Pyghmi will do keepalive as well, and if that’s the problem, it
should be much shorter than 24 hours.  In fact, it should be checking
if the SOL payload is active and owned by confluent specifically
every couple of minutes.
yes, thats correct
 
Sent: Friday, April 14, 2017 5:55 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
My last reply was incorrect. Problems still here. Im trying to find
something usefull inbetween confluent/pyghmi...
Confluent restart solves hangs/reopen all connections.
I think it isnt the best option to restart confluent 1 or 2 times in
24h.
-- 
banuchka
It is Dell’s related problem, not 100% but…
Confluent from current master is doing things well :) 
Thanks for pretty nice tool “confluentdbutil".
 
Looks like that problem was before… The fix was to use ipmitool with
keepalive(one from xcat repos).
Here pyghmi is used maybe that the reason?
 
Hi, 
 
Im trying to completely migrate from conserver to confluent, but
catch strange behaviour.
Some of my consoles hangs ~after 24, so no any new messages in their
logs or in rcons.
I send messages with timestamp from OS >/dev/console every 30-60min
and take a look on them for monitoring purposes(consoles availability
monitoring).
I can open rcons and hit enter, after few secs console is waking
up(strange). I didnt see it happen with conserver or maybe im wrong...
- as i can see the bigest part of consoles with hangs behaviour are
Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in
use.
- racreset hard/ipmitool bmc reset didnt do the things
- hit enter to console wake it up(for example with expect i can send
\r\n\f, but it looks bad)
- i didnt try to clean confluent's conf and restart it. Not sure it may help.
- HP consoles works well, same ipmi
- few consoles with custom pluging works good as well
 
So maybe my question is not about confluent, but if some of you have
some knowledge about same problems please share it! ;)
 
-- 
banuchka
banuchka
2017-04-19 14:29:57 UTC
Permalink
All OK with a noderange. With conserver I regenerated the node list with “makeconservercfg -l” at some point in time, because it is simpler to track changes.

Here is the output when I tried to do the same with “makeconfluentcfg -l”:


==> /var/log/confluent/stderr <==
Apr 19 14:22:29   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Traceback (most recent call last):
Apr 19 14:22:29   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File "/opt/confluent/bin/confluent", line 35, in <module>
Apr 19 14:22:29   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): Traceback (most recent call last):
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     timer()
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     cb(*args, **kw)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     result = function(*args, **kwargs)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/opt/confluent/lib/python/confluent/shellmodule.py", line 52, in relaydata
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     3600 + (random.random() * 120))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 80, in select
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     listeners.append(hub.add(hub.READ, k, on_read, on_error, lambda x: None))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 49, in add
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     listener = BaseHub.add(self, evtype, fileno, cb, tb, mac)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 177, in add
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     evtype, fileno, evtype, cb, bucket[fileno]))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): RuntimeError: Second simultaneous read on fileno 8 detected.  Unless you really know what you're doing, make sure that only one greenthread can read any particular socket.  Consider using a pools.Pool. If you do know what you're doing and want to disable this error, call eventlet.debug.hub_prevent_multiple_readers(False) - MY THREAD=<function on_read at 0x7fb45c5f0de8>; THAT THREAD=FdListener('read', 8, <built-in method switch of GreenThread object at 0x7fb3fd6efd70>, <built-in method throw of GreenThread object at 0x7fb3fd6efd70>)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): Traceback (most recent call last):
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     timer()
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     cb(*args, **kw)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     result = function(*args, **kwargs)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/opt/confluent/lib/python/confluent/shellmodule.py", line 52, in relaydata
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     3600 + (random.random() * 120))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 80, in select
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     listeners.append(hub.add(hub.READ, k, on_read, on_error, lambda x: None))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 49, in add
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     listener = BaseHub.add(self, evtype, fileno, cb, tb, mac)
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 177, in add
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     evtype, fileno, evtype, cb, bucket[fileno]))
Apr 19 14:22:30   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): RuntimeError: Second simultaneous read on fileno 31 detected.  Unless you really know what you're doing, make sure that only one greenthread can read any particular socket.  Consider using a pools.Pool. If you do know what you're doing and want to disable this error, call eventlet.debug.hub_prevent_multiple_readers(False) - MY THREAD=<function on_read at 0x7fb3fdfe4aa0>; THAT THREAD=FdListener('read', 31, <function on_read at 0x7fb41c771938>, <function on_error at 0x7fb3fe56b0c8>)


==> /var/log/confluent/trace <==
Apr 19 14:22:29 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/consoleserver.py", line 220, in _connect_backend
    self._console.connect(self.get_console_output)
  File "/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line 237, in connect
    iohandler=self.handle_data)
  File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/console.py", line 62, in __init__
    onlogon=self._got_session)
  File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 413, in __new__
    for res in socket.getaddrinfo(bmc, port, 0, socket.SOCK_DGRAM):
  File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line 485, in getaddrinfo
    qname, addrs = _getaddrinfo_lookup(host, family, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line 449, in _getaddrinfo_lookup
    answer = resolve(host, qfamily, False)
  File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line 396, in resolve
    return resolver.query(name, rdtype, raise_on_no_answer=raises)
  File "/usr/lib/python2.7/site-packages/eventlet/support/greendns.py", line 356, in query
    raise result[1]
TypeError: <lambda>() takes exactly 1 argument (0 given)

==> /var/log/confluent/stderr <==
Apr 19 14:22:33   File "/usr/lib64/python2.7/atexit.py", line 29, in _run_exitfuncs
    print >> sys.stderr, "Error in atexit._run_exitfuncs:": Error in atexit._run_exitfuncs:
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): Traceback (most recent call last):
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     func(*targs, **kargs)
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line 39, in exithandler
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     console.session.iothread.join()
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 74, in join
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     Session._cleanup()
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):   File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 322, in _cleanup
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator):     for sesskey in cls.bmc_handlers:
Apr 19 14:22:33   File "/usr/lib64/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator): RuntimeError: dictionary changed size during iteration
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Error in sys.exitfunc:
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Traceback (most recent call last):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File "/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): func(*targs, **kargs)
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File "/opt/confluent/lib/python/confluent/plugins/hardwaremanagement/ipmi.py", line 39, in exithandler
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): console.session.iothread.join()
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 74, in join
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): Session._cleanup()
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):   File "/usr/lib/python2.7/site-packages/pyghmi/ipmi/private/session.py", line 322, in _cleanup
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data):
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): for sesskey in cls.bmc_handlers:
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): RuntimeError
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): :
Apr 19 14:22:33   File "/opt/confluent/lib/python/confluent/log.py", line 702, in write
    self.log(traceback.format_stack(limit=2)[0][:-1] + ": " + data): dictionary changed size during iteration
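
For reference, the last error in the stderr log above ("dictionary changed size during iteration") is the generic error Python raises when a dict is modified while it is being iterated. The sketch below reproduces it with a hypothetical `handlers` dict; it is not pyghmi's actual `_cleanup` code, and iterating over a snapshot of the keys is only one common way to avoid the error, not a confirmed fix for this case.

```python
def cleanup_unsafe(handlers):
    """Delete entries while iterating the dict itself (raises RuntimeError)."""
    for key in handlers:
        del handlers[key]


def cleanup_safe(handlers):
    """Iterate over a snapshot of the keys so the dict may shrink safely."""
    for key in list(handlers):
        handlers.pop(key, None)


# Hypothetical stand-in for a registry of BMC sessions.
handlers = {"bmc1": object(), "bmc2": object()}

try:
    cleanup_unsafe(dict(handlers))  # work on a copy so 'handlers' stays intact
    unsafe_error = None
except RuntimeError as exc:
    unsafe_error = str(exc)

cleanup_safe(handlers)
```
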


On 19 April 2017 at 15:01:27, Jarrod Johnson (***@lenovo.com) wrote:

Anything in the /var/log/confluent neighborhood when that restart
happens? It doesn't happen if given a noderange?

-----Original Message-----
From: banuchka <***@gmail.com>
To: xcat-***@lists.sourceforge.net <xcat-***@lists.sourceforge.net>,
Jarrod Johnson <***@lenovo.com>
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
Date: Wed, 19 Apr 2017 14:56:12 +0100

Bad news :)
Post by Jarrod Johnson
Confluent shouldn’t shut down or even restart

 
Sent: Wednesday, April 19, 2017 9:51 AM
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Is it expected behaviour that when I run “makeconfluent” / “makeconfluent
-l” (with the confluent service running) to regenerate nodes or add new
nodes, confluent shuts down?
For now I wrote a wrapper for that procedure (makeconfluent -d for
unneeded nodes, makeconfluent nodelist for new nodes).
 
Ok, also were those login/logouts always there, or only after that
‘try to suicide every 90 minutes’ experiment?
 
Sent: Wednesday, April 19, 2017 9:38 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
The experiment with nodehealth + echo > /dev/console + rcons didn’t hang
the console; maybe it needs more time. I’ll leave it running inside a
tmux session for a longer time.
 
Thanks Jarrod, I already have a few “plugins” for old Sun servers
without SOL, so it isn’t a big problem to create another one.
I really appreciate your help.
One more thing: I’m trying to fix all BaudRates on the servers, because
as far as I can see there are at least 3 places with that setting on
DRAC (I’m not sure this is a problem, but it’s not good practice to
read and write at different speeds).
I’ll try your advice as well and let you know.
 
I appreciate all the patience and help, let me know if you had a
request about making a shell plugin. The interface is not exactly
fleshed out ('CONFLUENT_NODE' is the only variable that makes it). If
the approach helps, I can accelerate a syntax for a shell module to
request more variables from the configuration (e.g.
CONFLUENT_HARDWAREMANAGEMENT_MANAGER SECRET_HARDWARMANAGEMENTUSER,
etc).
# cat /opt/confluent.backup/lib/python/confluent/plugins/console/xcatkvm.sh
 
#!/bin/bash
exec /opt/xcat/share/xcat/cons/kvm $CONFLUENT_NODE
As an aside, would you be able to do one more experiment? Start
confluent up, verify console is working, then run nodehealth a few
times against the node and see if it triggers the bad state?
Especially if you have some cron job that involves some node* commands,
imitate that. I was trying to think about things that would be
different between ipmitool and pyghmi, and the one thing that occurs to
me is that in pyghmi we try to multiplex commands and serial over the
same session to limit session consumption. In ipmitool, it's just SOL
(apart from an occasional 'get device id' for keepalive), so I'm
wondering if some timing or large volume of ipmi commands on a session
with active sol session could mess up their BMC SOL session.
Unfortunately, I don't have the resources to help chase this since I
can't reproduce it on our equipment, so all I can do is guessing based
on comparative analysis.
-----Original Message-----
 J
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
Date: Wed, 19 Apr 2017 11:32:58 +0100
Hi,
I’m trying to use a plugin for confluent with a simple “ipmitool sol
activate” (placed here
/opt/confluent/lib/python/confluent/plugins/console/). It is the last
attempt to understand what’s going on here.
The FW upgrade didn’t help me overall.
With the current pyghmi setup I see lots of “log on/log off” messages
in the BMC’s logs, which doesn’t happen when I’m using ipmitool.
I’m out of ideas right now...
Post by Jarrod Johnson
Yeah, there will be a bit push in the coming weeks it will have at
least an ‘events’ log along with a lot more function.
 
Then some more fleshed out documentation (beyond the preliminary
stuff on hpc.lenovo.com).
 
Let me know if the firmware exploration works out.  That particular
change line suggests firmware upgrades, but it is possible they could
have some high BMC cpu usage that could manifest in such a way.  The
‘works with ipmitool’ though has me scratching my head.
 
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
The last idea didn’t work for me. By the way, the idea itself works
great: confluent does disconnect/connect after the constant time. But
for now it is 100% correct to say that it is a problem with the iDRAC fw.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it’s something like what I have here. So I did
the upgrade on a few hosts and will give them plenty of time to show me
results.
Thanks for your answers, help and time; it is a very interesting
quest :)
 
- Interesting ambitions 
- Python VS Perl, that’s good
- I think log files (not just trace, stderr, stdout) and
documentation (source on GitHub is the best doc I know, but...) are
things that I would like to see in Confluent
 
Very interested in the outcome.  And thank you for working through
it.  Also interested what you have liked, would like, and have
disliked about confluent.
 
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Thank you Jarrod, I’ll try to add the patch and let you know afterwards.
I hope 90 minutes is enough, yes.
 
Hmm, this is going to be very difficult to root cause (I only have
Lenovo equipment as one might expect).
 
I’m loath to do a workaround, but in console.py (find /usr -name
console.py), it might be interesting to see how a change like the
following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
+        if currowner[0] != self.ipmi_session.sessionid or 
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel
 
If it would pan out, should cause the console session to disconnect
itself roughly every 90 minutes and trigger reconnect (is 90 minutes
short enough in your case?)  Would require a service confluent
restart to see if it had the desired effect.
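
The idea behind the patch can be sketched in isolation as follows. This is a minimal, self-contained illustration and not pyghmi's code: the `ConsoleKeepalive` class, the `max_keepalives` threshold and the 2-minute keepalive interval are all assumptions made for the example (the thread only says the check runs "every couple of minutes"), and the real reconnect is triggered by confluent rather than by the session object itself.

```python
class ConsoleKeepalive:
    """Hypothetical stand-in for a SOL console session (not pyghmi's class)."""

    def __init__(self, max_keepalives):
        self.max_keepalives = max_keepalives  # assumed threshold
        self.keepalivecount = 0
        self.activated = False
        self.reconnects = 0

    def activate(self):
        # A fresh activation resets the counter, as in the patch above.
        self.activated = True
        self.keepalivecount = 0

    def keepalive(self, session_still_owned=True):
        """One periodic check: drop and re-establish the session when it is
        no longer ours or when it has been alive for too many checks."""
        if not self.activated:
            return
        if not session_still_owned or self.keepalivecount >= self.max_keepalives:
            self.activated = False
            self.reconnects += 1
            self.activate()  # simplified: reconnect immediately
            return
        self.keepalivecount += 1


def keepalives_for(minutes, interval_minutes):
    """Number of keepalive checks that fit into the given lifetime."""
    return minutes // interval_minutes


# Assumed numbers: a check every 2 minutes, forced reconnect after 90 minutes.
console = ConsoleKeepalive(keepalives_for(90, 2))
console.activate()
for _ in range(100):
    console.keepalive()
```

With these assumed numbers the session is recycled after every 45 successful checks, which corresponds to the "roughly every 90 minutes" mentioned above.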
 
Sorry I haven’t tested and can’t think of root cause, but going to
take some time off for the weekend.
 
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted).  I don’t
have the time at the moment to see if they do some other interesting
thing to avoid the behavior.
 
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console
 
and nothing happened
 
in the console’s log
—
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
 
If you do have any in corrupted state, would be interested to see
ipmitool sol set volatile-bit-rate 115.2 1
 
 
To change the volatile bit rate to match the non-volatile bit rate
and see if the corruption goes away.
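
To check many nodes for this mismatch, the `ipmitool sol info` output could be parsed roughly as below. This is only a sketch: the function names are made up for the example, the sample text is abbreviated from the output shown in this thread, and actually running ipmitool (e.g. via `subprocess`) is left out.

```python
def parse_sol_info(text):
    """Parse 'Key : value' lines of `ipmitool sol info` output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines without the ' : ' separator (e.g. 'Info: ...')
            info[key.strip()] = value.strip()
    return info


def bit_rates_match(info):
    """True when both bit rates are present and equal."""
    volatile = info.get("Volatile Bit Rate (kbps)")
    non_volatile = info.get("Non-Volatile Bit Rate (kbps)")
    return volatile is not None and volatile == non_volatile


# Abbreviated sample taken from the output quoted in this thread.
SAMPLE = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
"""

sample_info = parse_sol_info(SAMPLE)
fixed_info = parse_sol_info(SAMPLE.replace(": 38.4", ": 115.2"))
```

Here `bit_rates_match(sample_info)` is false for the mismatched output and true once the volatile rate has been set to 115.2.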
 
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
115200
 
idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF
 
that is strange, right
 
Hmm, what’s the baud rate the console is actually running at?  Odd to
see the volatile and non volatile bit rates not be the same.
 
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
 
 
And to be clear, the corruption only starts after a long period of
time of being continuously connected?
Yes, that is correct
 
I might be interested in seeing ipmitool sol info 1 output against a
system while it is working versus showing corrupted info.
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Yes, reopening causes it to work again, without any garbage, so it looks
like a normal console :)
Hitting <enter> at first causes garbage output (ᅵᅵ Porᅵlo) and *normal
console* before...
 
So reopen causes it to work again, and before, it’s not *hung*, but
erratic with garbage characters and occasional blips of sanity?
 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Reopen console did the trick as well...
 
‘ctrl-e, then c, then o’ to reconnect.
 
Was conserver ondemand or full logging?
 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
Console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting console”?
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)
 
You’re absolutely right: with ipmitool and conserver on the same
servers we had no such troubles.
So the console starts showing garbage?  Restarting the console causes
the garbage to go away?
 
You said that ipmitool with a certain configuration did not trigger this?
 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
I’m out of ideas, let me show you all i see.
 
 
MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS
(more complex log below)
 
 
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 


 
13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
MONITORING_TEST dbb54 1492160401
 
ᅵᅵ
  Porᅵ
—
 
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
 
---
  Porᅵlo]0;console: dbb54 [13:25]
 
 
---
 
If you ctrl-e, c, o, does it restore the console after the time?
 
Can you tell that it goes after exactly 24hours on the dot?
 
When console hung, does ‘ipmitool sol activate’ say ‘session already
active’?
Yes, 
# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate
Info: SOL payload already active on another session
 
Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?
[04/13 15:17:21 console connected]

 many our own messages
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
 
Pyghmi will do keepalive as well, and if that’s the problem, it
should be much shorter than 24 hours.  In fact, it should be checking
if the SOL payload is active and owned by confluent specifically
every couple of minutes.
yes, thats correct
 
Sent: Friday, April 14, 2017 5:55 AM
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.
 
My last reply was incorrect. The problems are still here. I’m trying to
find something useful in between confluent/pyghmi...
A confluent restart solves the hangs and reopens all connections.
I don’t think restarting confluent 1 or 2 times in 24h is the best
option.
-- 
banuchka
It is Dell’s related problem, not 100% but

Confluent from current master is doing things well :) 
Thanks for pretty nice tool “confluentdbutil".
 
Looks like that problem existed before... The fix was to use ipmitool
with keepalive (the one from the xcat repos).
Here pyghmi is used; maybe that’s the reason?
 
Hi, 
 
I'm trying to migrate completely from conserver to confluent, but I've
run into strange behaviour.
Some of my consoles hang after ~24h, so there are no new messages in
their logs or in rcons.
I send timestamped messages from the OS to /dev/console every 30-60 min
and check for them for monitoring purposes (console availability
monitoring).
I can open rcons and hit enter, and after a few seconds the console
wakes up (strange). I didn't see this happen with conserver, or maybe
I'm wrong...
- As far as I can see, most of the consoles that hang are Dell iDRAC.
It doesn't matter whether RacSerial or IPMISerial is in use.
- racreset hard / ipmitool bmc reset didn't help
- Hitting enter in the console wakes it up (for example, with expect I
can send \r\n\f, but that looks bad)
- I didn't try cleaning confluent's conf and restarting it. Not sure it would help.
- HP consoles work well, same ipmi
- A few consoles with a custom plugin work well too
 
So maybe my question is not about confluent, but if any of you know
about similar problems, please share! ;)
 
-- 
banuchka
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
xCAT-user mailing list
https://lists.sourceforge.net/lists/listinfo/xcat-user
 
 
 
banuchka
2017-05-02 15:39:27 UTC
Permalink
Hi,

A bit of a follow-up: when I run “makeconfluent -l” against the locally running instance, I see this error:

===
confluent[4253]: stderr :May 02 15:33:21   File "/usr/lib64/python2.7/threading.py", line 823, in __bootstrap_inner
    (self.name, _format_exc())): Exception in thread Thread-1935:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1408, in _sync_to_file
    dbf.close()
  File "/usr/lib64/python2.7/bsddb/__init__.py", line 296, in close
    v = _DeadlockWrap(self.db.close)
  File "/usr/lib64/python2.7/bsddb/dbutils.py", line 68, in DeadlockWrap
    return function(*_args, **_kwargs)
DBError: (9, 'Bad file descriptor -- /etc/confluent/cfg/nodes: Bad file descriptor')
===

On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:

Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

 

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

 

Let me know if the firmware exploration works out.  That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way.  The ‘works with ipmitool’ though has me scratching my head.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The last idea doesn’t fix it for me. The patch itself works as intended, though: confluent disconnects and reconnects after the fixed interval. But for now it seems correct to say that it is a problem with the iDRAC firmware.

From the release notes for the latest firmware:

===

- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.

===

I’m not sure, but maybe it is something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show results.

Thanks for your answers, help and time... it is a very interesting quest :)

 

Bit more about Confluent:

- Interesting ambitions 

- Python VS Perl, thats good

- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent

 

On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:

Very interested in the outcome.  And thank you for working through it.  Also interested what you have liked, would like, and have disliked about confluent.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Thank you Jarrod, I’ll try applying the patch and let you know afterwards. Yes, I hope 90 minutes is enough.

 

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

 

If it pans out, it should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?).  It would require a service confluent restart to see if it had the desired effect.
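
As a rough check on the 90-minute figure, here is a small standalone model of the counter logic in the diff. It is a sketch, not pyghmi code: the class `KeepaliveWatchdog`, the helper `checks_until_reconnect`, and the 30-second keepalive interval are assumptions made for illustration (the interval is simply what 90 minutes over 180 checks implies). Only the `> 180` threshold and the reset on activation are taken from the patch.

```python
# Toy model of the counter added in the diff; not pyghmi code.
KEEPALIVE_INTERVAL_S = 30   # assumption: inferred from 90 min / 180 checks
KEEPALIVE_LIMIT = 180       # threshold used in the diff


class KeepaliveWatchdog(object):
    """Hypothetical stand-in for the keepalivecount logic in the patch."""

    def __init__(self, limit=KEEPALIVE_LIMIT):
        self.limit = limit
        self.keepalivecount = 0

    def activate(self):
        # Mirrors the reset done when the SOL payload is (re)activated.
        self.keepalivecount = 0

    def keepalive(self, owner_ok=True):
        """Return True to keep the session, False to force a reconnect."""
        if not owner_ok or self.keepalivecount > self.limit:
            return False
        self.keepalivecount += 1
        return True


def checks_until_reconnect(limit=KEEPALIVE_LIMIT):
    """Count keepalive calls up to and including the one that fails."""
    dog = KeepaliveWatchdog(limit)
    dog.activate()
    calls = 1
    while dog.keepalive():
        calls += 1
    return calls


if __name__ == "__main__":
    calls = checks_until_reconnect()
    print("reconnect forced on check %d, after about %.1f minutes"
          % (calls, calls * KEEPALIVE_INTERVAL_S / 60.0))
```

With the strict `> 180` comparison the reconnect is forced on check 182, so under the assumed 30-second interval a session would last about 91 minutes, which is consistent with "roughly every 90 minutes".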

 

Sorry I haven’t tested this and can’t think of a root cause, but I’m going to take some time off for the weekend.

 

I would be curious whether the same ipmitool process is still running a day after a check (e.g. whether ipmitool is exiting and getting restarted).  I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

 

in the console’s log

---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
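
The bracketed entries in the console log above can be split into individual events with a short script. This is a minimal sketch that assumes every entry has the form `[MM/DD HH:MM:SS text]` as in the excerpt; the names `parse_events` and `unmatched_connects` are made up for illustration, and whether a repeated "console connected" is actually significant is not established in this thread.

```python
import re

# Matches entries such as "[04/14 12:49:12 console disconnected]".
EVENT_RE = re.compile(r"\[(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) ([^\]]+)\]")

# The events quoted in the message above, joined the way they were posted.
SAMPLE = (
    "[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected]"
    "[04/14 13:01:02 console disconnected][04/14 13:01:02 console connected]"
    "[04/14 13:03:54 console disconnected][04/14 13:04:15 console connected]"
    "[04/14 13:38:37 console connected][04/14 15:31:47 console disconnected]"
    "[04/14 15:36:24 console connected][04/14 15:42:08 connection by xcat_console]"
)


def parse_events(text):
    """Return (date, time, event) tuples in the order they appear."""
    return EVENT_RE.findall(text)


def unmatched_connects(events):
    """Return (date, time) of each 'console connected' that follows another
    'console connected' with no 'console disconnected' in between."""
    suspicious = []
    connected = False
    for date, clock, event in events:
        if event == "console connected":
            if connected:
                suspicious.append((date, clock))
            connected = True
        elif event == "console disconnected":
            connected = False
    return suspicious


if __name__ == "__main__":
    for date, clock, event in parse_events(SAMPLE):
        print(date, clock, event)
    print("connects without a prior disconnect:",
          unmatched_connects(parse_events(SAMPLE)))
```

On the excerpt this flags the 13:38:37 entry, which is logged as a connect without a preceding disconnect.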

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial

[Key=iDRAC.Embedded.1#IPMISerial.1]

BaudRate=115200

ChanPrivLimit=4

ConnectionMode=Terminal

DeleteControl=Disabled

EchoControl=Enabled

FlowControl=RTS/CTS

HandshakeControl=Enabled

InputNewLineSeq=1

LineEdit=Enabled

NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623

Working:
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
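
Both listings above report a volatile bit rate of 38.4 kbps against a non-volatile rate of 115.2 kbps. A minimal sketch for checking this across many nodes follows, assuming the `Key : value` layout shown in the `ipmitool sol info` output above; the function names `parse_sol_info` and `bit_rates_differ` are hypothetical, and how the text is collected from each BMC is left out.

```python
# Shortened copy of the listing above, used as sample input.
SAMPLE = """\
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
"""


def parse_sol_info(output):
    """Parse 'Key : value' lines of `ipmitool sol info` output into a dict."""
    info = {}
    for line in output.splitlines():
        # Skip the informational notice and anything without a separator.
        if line.startswith("Info:") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.strip():  # drops prompts such as "Password:"
            info[key.strip()] = value.strip()
    return info


def bit_rates_differ(info):
    """True if the volatile and non-volatile bit rates are not the same."""
    return (info.get("Volatile Bit Rate (kbps)")
            != info.get("Non-Volatile Bit Rate (kbps)"))


if __name__ == "__main__":
    parsed = parse_sol_info(SAMPLE)
    print(parsed["Volatile Bit Rate (kbps)"],
          parsed["Non-Volatile Bit Rate (kbps)"],
          "mismatch" if bit_rates_differ(parsed) else "match")
```

Since the mismatch appears in the working and the corrupted case alike, a check like this would only help rule the setting in or out, not identify hung consoles.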


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Yes, reopening makes it work again, without any garbage... so it looks like a normal console :)

Hitting <enter> first produces garbage output (ᅵᅵ Porᅵlo), and only then a *normal console*...

 

On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:

So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Reopen console did the trick as well...

 

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The console starts showing garbage after <enter> inside rcons.

What do you mean by “restarting the console”?

The console resumes working after:

- <enter> inside rcons/confetty

- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such trouble.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I’m out of ideas, let me show you all i see.

 

Inside rcons i see:

 

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

 

tcpdump(keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

---

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
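
To see how regularly the console server talks to the BMC, the two-line tcpdump records can be paired up and the gaps between packets from one sender computed. This is only a sketch under the assumption that the capture has the verbose two-line layout shown in this message; `parse_packets` and `gaps_from` are illustrative names, and whether these packets really are keepalives is the poster's guess, not something the capture proves.

```python
import re
from datetime import datetime

HEADER_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")
FLOW_RE = re.compile(r"^\s*(\S+) > (\S+): .*UDP, length (\d+)")

# First four records of the idle-console capture quoted earlier.
SAMPLE = """\
13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
"""


def parse_packets(text):
    """Pair each timestamp line with the address line that follows it."""
    packets = []
    stamp = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            stamp = datetime.strptime(header.group(1), "%H:%M:%S.%f")
            continue
        flow = FLOW_RE.match(line)
        if flow and stamp is not None:
            packets.append((stamp, flow.group(1), flow.group(2),
                            int(flow.group(3))))
            stamp = None
    return packets


def gaps_from(packets, source):
    """Seconds between consecutive packets sent by `source` (addr.port)."""
    times = [p[0] for p in packets if p[1] == source]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_from(parse_packets(SAMPLE), "10.10.114.30.36790"))
```

For the sample the single gap is about 27 seconds. The timestamps carry no date, so a capture that crosses midnight would need extra handling.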

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


... many of our own messages ...

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]
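
The availability check described in this thread (timestamped messages written to /dev/console and then looked for in the console log) could be automated along these lines. This is a sketch only: it assumes the heartbeat format `MONITORING_TEST <node> <epoch>` seen in the log excerpt, and the two-hour threshold and the names `last_heartbeat` and `is_stale` are assumptions chosen for illustration.

```python
import re
import time

# Heartbeat format seen in the log: "MONITORING_TEST <node> <epoch seconds>".
HEARTBEAT_RE = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text, node):
    """Return the newest epoch timestamp logged for `node`, or None."""
    stamps = [int(ts) for name, ts in HEARTBEAT_RE.findall(log_text)
              if name == node]
    return max(stamps) if stamps else None


def is_stale(log_text, node, now=None, max_age_s=2 * 3600):
    """True if no heartbeat for `node` arrived within `max_age_s` seconds.

    max_age_s defaults to two hours, an assumed margin over the
    30-60 minute send interval mentioned in the thread.
    """
    now = time.time() if now is None else now
    newest = last_heartbeat(log_text, node)
    return newest is None or (now - newest) > max_age_s


if __name__ == "__main__":
    log = ("[04/13 15:17:21 console connected]\n"
           "^MMONITORING_TEST dbb54 1492160401\n"
           "[04/14 09:05:13 console connected]\n")
    print(last_heartbeat(log, "dbb54"))
    print(is_stale(log, "dbb54", now=1492160401 + 3 * 3600))
```

In practice the log text would be read from /var/log/confluent/consoles/<nodename>, the path mentioned above.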


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. The problem is still here. I’m trying to find something useful in between confluent and pyghmi...

Restarting confluent clears the hangs, since it reopens all connections.

I don’t think restarting confluent once or twice every 24h is the best option.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It looks like a Dell-related problem, though I’m not 100% sure...

Confluent from current master is doing things well :)

Thanks for the pretty nice tool “confluentdbutil”.

 

On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:

Looks like this problem has been seen before. The fix was to use ipmitool with keepalive (the one from the xcat repos).

Here pyghmi is used; maybe that is the reason?

 

On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:

Hi,

 

I’m trying to migrate completely from conserver to confluent, but I’ve run into strange behaviour.

Some of my consoles hang after ~24h, so no new messages appear in their logs or in rcons.

I send timestamped messages from the OS to /dev/console every 30-60 min and check for them for monitoring purposes (console availability monitoring).

I can open rcons and hit enter, and after a few seconds the console wakes up (strange). I didn’t see this happen with conserver, or maybe I’m wrong...

Some details:

- As far as I can see, most of the consoles that hang are Dell iDRAC. It doesn’t matter whether RacSerial or IPMISerial is in use.
- racreset hard / ipmitool bmc reset didn’t help
- Hitting enter in the console wakes it up (for example, with expect I can send \r\n\f, but that looks bad)
- I didn’t try cleaning confluent’s conf and restarting it. Not sure it would help.
- HP consoles work well with the same IPMI
- A few consoles with a custom plugin work well too

So maybe my question is not about confluent, but if any of you know about similar problems, please share! ;)

 

--
banuchka



 

 

 

Jarrod Johnson
2017-05-02 15:49:09 UTC
Permalink
Is there anything special about /etc/confluent/cfg?
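
One way to gather facts that might bear on this question is to inspect the directory from Python. This is a hedged sketch, not a confluent tool: `describe_path` is a hypothetical helper, and which properties count as "special" (symlink, separate mount, ownership, free space) is a guess at what could matter for the DBError above. It relies on `os.statvfs`, so it is Unix-only.

```python
import os


def describe_path(path):
    """Collect basic facts about a directory that might make it unusual."""
    st = os.stat(path)
    vfs = os.statvfs(path)
    return {
        "path": path,
        "realpath": os.path.realpath(path),
        "is_symlink": os.path.islink(path),
        "is_mount": os.path.ismount(path),
        "mode": oct(st.st_mode & 0o7777),
        "uid": st.st_uid,
        "gid": st.st_gid,
        # Space available to unprivileged users on the backing filesystem.
        "free_bytes": vfs.f_bavail * vfs.f_frsize,
    }


if __name__ == "__main__":
    target = "/etc/confluent/cfg"  # directory named in the traceback
    if os.path.exists(target):
        for key, value in sorted(describe_path(target).items()):
            print("%s: %s" % (key, value))
    else:
        print("%s does not exist on this machine" % target)
```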

From: banuchka [mailto:***@gmail.com]
Sent: Tuesday, May 02, 2017 11:39 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Hi,

A bit of a follow-up: when I run “makeconfluent -l” against the locally running instance, I see this error:

===
confluent[4253]: stderr :May 02 15:33:21   File "/usr/lib64/python2.7/threading.py", line 823, in __bootstrap_inner
    (self.name, _format_exc())): Exception in thread Thread-1935:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1408, in _sync_to_file
    dbf.close()
  File "/usr/lib64/python2.7/bsddb/__init__.py", line 296, in close
    v = _DeadlockWrap(self.db.close)
  File "/usr/lib64/python2.7/bsddb/dbutils.py", line 68, in DeadlockWrap
    return function(*_args, **_kwargs)
DBError: (9, 'Bad file descriptor -- /etc/confluent/cfg/nodes: Bad file descriptor')
===


On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:
Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

Let me know if the firmware exploration works out. That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way. The ‘works with ipmitool’ though has me scratching my head.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The last idea doesn’t fix it for me. The patch itself works as intended, though: confluent disconnects and reconnects after the fixed interval. But for now it seems correct to say that it is a problem with the iDRAC firmware.
From the release notes for the latest firmware:
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it is something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show results.
Thanks for your answers, help and time... it is a very interesting quest :)

Bit more about Confluent:
- Interesting ambitions
- Python VS Perl, thats good
- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent


On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:
Very interested in the outcome. And thank you for working through it. Also interested what you have liked, would like, and have disliked about confluent.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Thank you Jarrod, I’ll try applying the patch and let you know afterwards. Yes, I hope 90 minutes is enough.


On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:
Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

If it pans out, it should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?). It would require a service confluent restart to see if it had the desired effect.

Sorry I haven’t tested this and can’t think of a root cause, but I’m going to take some time off for the weekend.

I would be curious whether the same ipmitool process is still running a day after a check (e.g. whether ipmitool is exiting and getting restarted). I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 115.2
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

and nothing happened

in the console’s log
---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---


On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:
If you do have any in corrupted state, would be interested to see what happens if you do:
ipmitool sol set volatile-bit-rate 115.2 1


To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

115200

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

that is strange, right


On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:
Hmm, what’s the baud rate the console is actually running at? Odd to see the volatile and non volatile bit rates not be the same.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.




On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:
And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623

Working:
# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Yes, reopening makes it work again, without any garbage... so it looks like a normal console :)
Hitting <enter> first produces garbage output (ᅵᅵ Porᅵlo), and only then a *normal console*...


On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Reopen console did the trick as well...


On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:
‘ctrl-e, then c, then o’ to reconnect.

Was conserver ondemand or full logging?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting console”?
The console continues working after:
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

I’m out of ideas, let me show you all i see.

Inside rcons i see:

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

tcpdump(keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
Porᅵ
---

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

and Magic, rcons:
---
Porï¿œlo]0;console: dbb54 [13:25]


dbb54 login:
---


On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:
If you ctrl-e, c, o, does it restore the console after the time?

Can you tell that it goes after exactly 24hours on the dot?

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


... many of our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -d @1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours. In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct
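
For readers unfamiliar with what such an ownership check involves, here is a minimal Python sketch modelled on the `struct` unpacking visible in the pyghmi `console.py` diff quoted in this thread. The function names `payload_owner` and `still_ours` are hypothetical and not part of pyghmi; the sketch assumes only that the first four bytes of the response data carry the owning session ID in little-endian order.

```python
import struct


def payload_owner(response_data):
    """Decode the owning session ID from the first four response bytes.

    Assumption: the ID is a little-endian unsigned 32-bit integer, as the
    "<I" format in the quoted pyghmi diff suggests.
    """
    return struct.unpack("<I", struct.pack("4B", *response_data[:4]))[0]


def still_ours(response_data, our_session_id):
    """Return True if the SOL payload still belongs to our session."""
    return payload_owner(response_data) == our_session_id
```

If `still_ours` returns False, the payload was deactivated or taken over by another session, which is the case the keepalive is meant to catch.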


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

My last reply was incorrect. The problems are still here. I’m trying to find something useful between confluent and pyghmi...
A confluent restart clears the hangs and reopens all connections.
I don’t think restarting confluent once or twice every 24h is the best option.
--
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:
It is a Dell-related problem, not 100% but...

Confluent from current master is doing things well :)
Thanks for the pretty nice tool “confluentdbutil”.


On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:
Looks like this problem was seen before... The fix was to use ipmitool with keepalive (the one from the xcat repos).
Here pyghmi is used; maybe that is the reason?


On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:
Hi,

I'm trying to migrate completely from conserver to confluent, but I've hit some strange behaviour.
Some of my consoles hang after ~24h, so no new messages appear in their logs or in rcons.
I send timestamped messages from the OS to /dev/console every 30-60 min and watch for them for monitoring purposes (console availability monitoring).
I can open rcons and hit enter, and after a few seconds the console wakes up (strange). I didn't see this happen with conserver, or maybe I'm wrong...
Some details:
- as far as I can see, most of the consoles that hang are Dell iDRAC. It doesn't matter whether RacSerial or IPMISerial is in use.
- racreset hard/ipmitool bmc reset didn't help
- hitting enter in the console wakes it up (for example, with expect I can send \r\n\f, but that looks bad)
- I didn't try cleaning confluent's conf and restarting it. Not sure it would help.
- HP consoles work well, same ipmi
- a few consoles with a custom plugin work well too

So maybe my question is not about confluent, but if some of you have some knowledge about same problems please share it! ;)
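
The availability check described in this message could be sketched roughly as follows in Python. The `MONITORING_TEST <node> <epoch>` line format is taken from the log excerpts in this thread; the function names and the staleness threshold are hypothetical and only illustrate the idea.

```python
import re

# Matches heartbeat lines such as "MONITORING_TEST dbb54 1492160401".
HEARTBEAT = re.compile(r"MONITORING_TEST\s+(\S+)\s+(\d+)")


def last_heartbeat(log_text, node):
    """Return the newest epoch timestamp logged for `node`, or None."""
    latest = None
    for name, stamp in HEARTBEAT.findall(log_text):
        if name == node:
            value = int(stamp)
            if latest is None or value > latest:
                latest = value
    return latest


def is_stale(latest, now, max_age_seconds):
    """A console counts as stale if no heartbeat was seen or it is too old."""
    return latest is None or now - latest > max_age_seconds
```

In practice `log_text` would be read from the per-node console log (`/var/log/confluent/consoles/<nodename>` in this thread), and `max_age_seconds` would be chosen somewhat larger than the 30-60 minute send interval.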
--
banuchka
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
xCAT-user mailing list
xCAT-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
banuchka
2017-05-02 16:17:06 UTC
Permalink
There isn’t... It happens only on nodes with 1k+ hosts... When I’m using it on servers with about 50-100 servers, everything works fine.

On 2 May 2017 at 16:58:11, Jarrod Johnson (***@lenovo.com) wrote:

Is there anything special about /etc/confluent/cfg?

 

From: banuchka [mailto:***@gmail.com]
Sent: Tuesday, May 02, 2017 11:39 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Hi,

 

A bit of a follow-up: when I try “makeconfluent -l” on a locally running instance, I see this error:

 

===
confluent[4253]: stderr :May 02 15:33:21   File "/usr/lib64/python2.7/threading.py", line 823, in __bootstrap_inner
    (self.name, _format_exc())): Exception in thread Thread-1935:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1408, in _sync_to_file
    dbf.close()
  File "/usr/lib64/python2.7/bsddb/__init__.py", line 296, in close
    v = _DeadlockWrap(self.db.close)
  File "/usr/lib64/python2.7/bsddb/dbutils.py", line 68, in DeadlockWrap
    return function(*_args, **_kwargs)
DBError: (9, 'Bad file descriptor -- /etc/confluent/cfg/nodes: Bad file descriptor')
===

 

On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:

Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

 

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

 

Let me know if the firmware exploration works out.  That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way.  The ‘works with ipmitool’ though has me scratching my head.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The last idea doesn’t solve it for me. The idea itself works great, by the way: confluent does disconnect/connect after the constant interval. But for now it is 100% correct to say that it is a problem with the iDRAC firmware.

From the release notes for the latest fw:

===

- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.

===

I’m not sure, but maybe it is something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show me results.

Thanks for your answers, help and time... it is a very interesting quest :)

 

Bit more about Confluent:

- Interesting ambitions 

- Python VS Perl, thats good

- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent

 

On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:

Very interested in the outcome.  And thank you for working through it.  Also interested what you have liked, would like, and have disliked about confluent.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Thank you Jarrod, i’ll try to add patch and let you know after. Hope 90 minutes is enough, yes.

 

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see the effect of a change like the following:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

 

If it would pan out, should cause the console session to disconnect itself roughly every 90 minutes and trigger reconnect (is 90 minutes short enough in your case?)  Would require a service confluent restart to see if it had the desired effect.
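
To make the "roughly every 90 minutes" figure concrete, here is a small Python sketch that replays only the counter logic of the patch. The 30-second keepalive interval is an assumption inferred from 180 checks corresponding to about 90 minutes; it is not stated in the thread, and the function names are hypothetical. The session-ownership part of the condition is left out.

```python
def checks_until_forced_reconnect(limit=180):
    """Replay the patched counter: return which check trips the limit."""
    keepalivecount = 0
    checks = 0
    while True:
        checks += 1
        if keepalivecount > limit:
            # In the patch this is where 'SOL deactivated' is reported.
            return checks
        keepalivecount += 1


def minutes_until_forced_reconnect(limit=180, interval_seconds=30.0):
    """Convert the number of checks into minutes (interval is assumed)."""
    return checks_until_forced_reconnect(limit) * interval_seconds / 60.0
```

Under these assumptions the limit trips on check 182, about 91 minutes after activation. A shorter reconnect period would mean lowering the threshold of 180 accordingly.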

 

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

 

I would be curious if the same ipmitool is running a day later than a check (e.g. if ipmitool is exiting and getting restarted).  I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

 

in the console’s log

---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.
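
For checking many BMCs at once, the comparison could be scripted. The following Python sketch parses text in the `ipmitool sol info` layout shown in this thread and reports whether the two bit rates differ; the function names are hypothetical, and how the text is obtained (for example by running ipmitool per node) is left to the reader.

```python
def parse_sol_info(text):
    """Turn 'Key : value' lines of `ipmitool sol info` output into a dict."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Skip the password prompt, the 'Info:' notice and lines without a value.
        if not sep or not value or key == "Info":
            continue
        params[key] = value
    return params


def bit_rates_differ(params):
    """True when both bit rates are reported and they are not the same."""
    volatile = params.get("Volatile Bit Rate (kbps)")
    non_volatile = params.get("Non-Volatile Bit Rate (kbps)")
    return (volatile is not None and non_volatile is not None
            and volatile != non_volatile)
```

With the output quoted in this thread, `bit_rates_differ` would flag the 38.4 versus 115.2 mismatch.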

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial

[Key=iDRAC.Embedded.1#IPMISerial.1]

BaudRate=115200

ChanPrivLimit=4

ConnectionMode=Terminal

DeleteControl=Disabled

EchoControl=Enabled

FlowControl=RTS/CTS

HandshakeControl=Enabled

InputNewLineSeq=1

LineEdit=Enabled

NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

 

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Yes, reopen causes it to work again,  without any garbage
 so looks like normal console :)

Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal console* before...

 

On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:

So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Reopen console did the trick as well...

 

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Console starts showing garbage after <enter> inside rcons.

What do you mean when said “restarting console”?

Console continue its work after:

- <enter> inside rcons/confetty

- bmc reset (console disconnected/console connected)

 

You’re absolutely right with ipmitool and conserver with the same servers we were out of such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I’m out of ideas, let me show you all i see.

 

Inside rcons i see:

 

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

 

tcpdump(keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 




 

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

—

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176

13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


 many our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. Problems still here. Im trying to find something usefull inbetween confluent/pyghmi...

Confluent restart solves hangs/reopen all connections.

I think it isnt the best option to restart confluent 1 or 2 times in 24h.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It is Dell’s related problem, not 100% but


Confluent from current master is doing things well :) 

Thanks for pretty nice tool “confluentdbutil".

 

On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:

Looks like that problem was before
 The fix was to use ipmitool with keepalive(one from xcat repos).

Here pyghmi is used maybe that the reason?

 

On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:

Hi,

 

Im trying to completely migrate from conserver to confluent, but catch strange behaviour.

Some of my consoles hangs ~after 24, so no any new messages in their logs or in rcons.

I send messages with timestamp from OS >/dev/console every 30-60min and take a look on them for monitoring purposes(consoles availability monitoring).

I can open rcons and hit enter, after few secs console is waking up(strange). I didnt see it happen with conserver or maybe im wrong...

Some details:

- as i can see the bigest part of consoles with hangs behaviour are Dell idrac. Doesnt matter which type of RacSerial or IPMISerial is in use.

- racreset hard/ipmitool bmc reset didnt do the things

- hit enter to console wake it up(for example with expect i can send \r\n\f, but it looks bad)

- i didnt try to clean confluent's conf and restart it. Not sure it may help.

- HP consoles works well, same ipmi

- few consoles with custom pluging works good as well

 

So maybe my question is not about confluent, but if some of you have some knowledge about same problems please share it! ;)

 

--
banuchka

Jarrod Johnson
2017-05-02 17:17:56 UTC
Permalink
Hmm, what’s ulimit look like? Wondering about ulimit -n
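
As a rough illustration of why `ulimit -n` might matter here, the following Python sketch reads the open-file limit of the current process and estimates whether a given number of consoles could exhaust it. The one-descriptor-per-console figure and the headroom of 64 are hypothetical placeholders, not measurements of confluent, and the thread does not confirm that the descriptor limit is the cause.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_looks_tight(console_count, soft_limit, per_console=1, headroom=64):
    """Guess whether `console_count` consoles could run into `soft_limit`.

    per_console and headroom are illustrative assumptions only.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return console_count * per_console + headroom > soft_limit
```

With a soft limit of 1024, a common default on Linux, `limit_looks_tight(1000, 1024)` is True while `limit_looks_tight(100, 1024)` is False, which would at least be consistent with the failure appearing only at 1k+ hosts.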


From: banuchka [mailto:***@gmail.com]
Sent: Tuesday, May 02, 2017 12:17 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

There isn’t... It happens only on nodes with 1k+ hosts... When I’m using it on servers with about 50-100 servers, everything works fine.


On 2 May 2017 at 16:58:11, Jarrod Johnson (***@lenovo.com) wrote:
Is there anything special about /etc/confluent/cfg?

From: banuchka [mailto:***@gmail.com]
Sent: Tuesday, May 02, 2017 11:39 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Hi,

A bit of a follow-up: when I try “makeconfluent -l” on a locally running instance, I see this error:

===
confluent[4253]: stderr :May 02 15:33:21   File "/usr/lib64/python2.7/threading.py", line 823, in __bootstrap_inner
    (self.name, _format_exc())): Exception in thread Thread-1935:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1408, in _sync_to_file
    dbf.close()
  File "/usr/lib64/python2.7/bsddb/__init__.py", line 296, in close
    v = _DeadlockWrap(self.db.close)
  File "/usr/lib64/python2.7/bsddb/dbutils.py", line 68, in DeadlockWrap
    return function(*_args, **_kwargs)
DBError: (9, 'Bad file descriptor -- /etc/confluent/cfg/nodes: Bad file descriptor')
===


On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

Let me know if the firmware exploration works out. That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way. The ‘works with ipmitool’ though has me scratching my head.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The last idea doesn’t work for me. That said, the idea itself works as intended: confluent does disconnect/connect after the constant time. But for now it is 100% correct to say that it is a problem with the iDRAC firmware.
From the release notes for the latest firmware:
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it’s something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show me results.
Thanks for your answers, help and time... it is a very interesting quest :)

A bit more about Confluent:
- Interesting ambitions
- Python vs. Perl, that’s good
- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent


On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Very interested in the outcome. And thank you for working through it. Also interested in what you have liked, would like, and have disliked about confluent.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Thank you Jarrod, I’ll try adding the patch and let you know afterwards. And yes, I hope 90 minutes is enough.


On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

If it pans out, it should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?). It would require a service confluent restart to see if it had the desired effect.
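
The counting logic in the patch can be tried out in isolation. The following is only a sketch of the idea, not pyghmi code: the class and function names are made up for illustration, and the 30-second keepalive interval is an assumption inferred from "180 keepalives is roughly 90 minutes".

```python
class KeepaliveCounter(object):
    """Hypothetical stand-in for the keepalivecount logic in the patch."""

    def __init__(self, max_keepalives=180):
        self.max_keepalives = max_keepalives
        self.count = 0

    def reset(self):
        # Corresponds to resetting the counter when SOL is (re)activated.
        self.count = 0

    def should_disconnect(self, owner_id, session_id):
        # Same shape as the patched condition: the payload is owned by
        # another session, OR too many keepalives have been counted.
        if owner_id != session_id or self.count > self.max_keepalives:
            return True
        self.count += 1
        return False


def checks_until_disconnect(counter, owner_id=1, session_id=1):
    """Number of keepalive checks until the counter forces a disconnect."""
    checks = 0
    while True:
        checks += 1
        if counter.should_disconnect(owner_id, session_id):
            return checks


def minutes_until_disconnect(max_keepalives, interval_seconds):
    # interval_seconds is an assumption; the thread only says
    # "roughly every 90 minutes".
    checks = checks_until_disconnect(KeepaliveCounter(max_keepalives))
    return checks * interval_seconds / 60.0
```

With a threshold of 180 and an assumed 30-second interval, the forced disconnect happens on the 182nd check, i.e. after about 91 minutes, which matches the "roughly every 90 minutes" estimate.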

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

I would be curious whether the same ipmitool process is still running a day after a check (e.g. whether ipmitool is exiting and getting restarted). I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 115.2
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

and nothing happened

in the console’s log:
---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---
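
Bracketed events like the ones in this console log can be pulled out with a small script. This is a sketch that assumes the `[MM/DD HH:MM:SS event]` layout shown above; the function names are hypothetical and not part of confluent. It also flags a "console connected" event that follows another one with no "console disconnected" in between, as happens at 13:04:15 and 13:38:37.

```python
import re

# Matches entries such as "[04/14 12:49:12 console disconnected]".
EVENT_RE = re.compile(r"\[(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ([^\]]+)\]")

# Excerpt copied from the console log quoted in this thread.
SAMPLE = (
    "[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected]"
    "[04/14 13:01:02 console disconnected][04/14 13:01:02 console connected]"
    "[04/14 13:03:54 console disconnected][04/14 13:04:15 console connected]"
    "[04/14 13:38:37 console connected][04/14 15:31:47 console disconnected]"
    "[04/14 15:36:24 console connected][04/14 15:42:08 connection by xcat_console]"
)


def parse_events(text):
    """Return a list of (timestamp, event) tuples found in the log text."""
    return EVENT_RE.findall(text)


def connects_without_disconnect(events):
    """Return (first, second) timestamp pairs of back-to-back connects."""
    pairs = []
    last_connect = None
    for stamp, what in events:
        if what == "console connected":
            if last_connect is not None:
                pairs.append((last_connect, stamp))
            last_connect = stamp
        elif what == "console disconnected":
            last_connect = None
        # Other events (e.g. client connections) do not change the state.
    return pairs
```

Calling `connects_without_disconnect(parse_events(SAMPLE))` reports the single pair of consecutive connects in the excerpt.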


On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you do have any in a corrupted state, I would be interested to see what happens if you run:
ipmitool sol set volatile-bit-rate 115.2 1

This changes the volatile bit rate to match the non-volatile bit rate, to see if the corruption goes away.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

115200

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

That is strange, right?


On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, what’s the baud rate the console is actually running at? Odd to see the volatile and non-volatile bit rates not be the same.
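
Checking many BMCs for this kind of mismatch by eye is tedious. The following sketch parses saved `ipmitool sol info` output and compares the two bit rates. It assumes the "Key : value" layout shown in this thread; the function names are hypothetical, and it does not invoke ipmitool itself.

```python
def parse_sol_info(text):
    """Parse 'Key : value' lines from `ipmitool sol info` output into a dict."""
    info = {}
    for line in text.splitlines():
        # Skip the "Info: ..." notice and anything that is not a key/value row.
        if line.startswith("Info:") or " : " not in line:
            continue
        key, value = line.split(" : ", 1)
        info[key.strip()] = value.strip()
    return info


def bit_rates_match(info):
    """True if the volatile and non-volatile SOL bit rates are equal."""
    return info["Volatile Bit Rate (kbps)"] == info["Non-Volatile Bit Rate (kbps)"]


# Shortened sample in the format quoted in this thread.
SAMPLE_SOL_INFO = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Retry Count                     : 7
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
"""
```

For the sample, `bit_rates_match(parse_sol_info(SAMPLE_SOL_INFO))` is `False`, since 38.4 and 115.2 differ.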

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.




On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

Corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Yes, reopening causes it to work again, without any garbage... so it looks like a normal console :)
Hitting <enter> first causes garbage output (ᅵᅵ Porᅵlo) and then a *normal console* as before...


On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Reopening the console did the trick as well...


On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
‘ctrl-e, then c, then o’ to reconnect.

Was conserver ondemand or full logging?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting console”?
The console continues working after:
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

I’m out of ideas, so let me show you everything I see.

Inside rcons I see:

MONITORING_TEST dbb54 1492160401 <= last message I’ve sent from the OS (more complete log below)

tcpdump(keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80




13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
Porᅵ
---

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

And then, like magic, rcons shows:
---
Porï¿œlo]0;console: dbb54 [13:25]


dbb54 login:
---


On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you ctrl-e, c, o, does it restore the console after the time?

Can you tell that it goes after exactly 24hours on the dot?

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


... many of our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours. In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

Yes, that’s correct


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net<mailto:xcat-***@lists.sourceforge.net>
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

My last reply was incorrect. The problems are still here. I’m trying to find something useful between confluent and pyghmi...
A confluent restart solves the hangs by reopening all connections.
I don’t think restarting confluent 1 or 2 times every 24h is the best option.
--
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
It is a Dell-related problem, not 100% but...

Confluent from current master is doing things well :)
Thanks for the pretty nice tool “confluentdbutil”.


On 13 April 2017 at 11:30:14, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Looks like this problem has occurred before... The fix was to use ipmitool with keepalive (the one from the xcat repos).
Here pyghmi is used; maybe that is the reason?


On 13 April 2017 at 08:22:28, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Hi,

I’m trying to migrate completely from conserver to confluent, but I’ve run into strange behaviour.
Some of my consoles hang after ~24 hours, so no new messages appear in their logs or in rcons.
I send timestamped messages from the OS to /dev/console every 30-60 minutes and check for them for monitoring purposes (console availability monitoring).
I can open rcons and hit enter, and after a few seconds the console wakes up (strange). I didn’t see this happen with conserver, or maybe I’m wrong...
Some details:
- As far as I can see, most of the consoles that hang are Dell iDRAC. It doesn’t matter whether RacSerial or IPMISerial is in use.
- racreset hard / ipmitool bmc reset didn’t help
- Hitting enter in the console wakes it up (for example, with expect I can send \r\n\f, but that looks bad)
- I didn’t try cleaning confluent’s config and restarting it. Not sure it would help.
- HP consoles work well with the same IPMI
- A few consoles with a custom plugin work well too

So maybe my question is not about confluent, but if any of you know about similar problems, please share! ;)
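
The availability check described in this message (timestamped markers written to /dev/console, then looked up in the console log) could be automated along these lines. This is a sketch only: it assumes the "MONITORING_TEST <node> <epoch>" marker format that appears in the log excerpts in this thread, and the function names and the one-hour threshold are illustrative choices.

```python
import re

# Marker format as seen in the thread: "MONITORING_TEST dbb54 1492160401".
MARKER_RE = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def last_heartbeat(log_text):
    """Return (node, epoch) for the newest marker in the log, or None."""
    matches = MARKER_RE.findall(log_text)
    if not matches:
        return None
    node, epoch = max(matches, key=lambda m: int(m[1]))
    return node, int(epoch)


def is_stale(log_text, now, max_age_seconds=3600):
    """True if no marker was logged within max_age_seconds before `now`.

    The default of one hour is an assumption based on markers being sent
    every 30-60 minutes.
    """
    beat = last_heartbeat(log_text)
    return beat is None or now - beat[1] > max_age_seconds
```

A caller would read `/var/log/confluent/consoles/<nodename>`, pass its contents together with `time.time()` to `is_stale`, and alert when it returns `True`.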
--
banuchka
xCAT-user mailing list
xCAT-***@lists.sourceforge.net<mailto:xCAT-***@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/xcat-user
banuchka
2017-05-02 17:42:02 UTC
Permalink
$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 257578
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
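
Note that `ulimit -a` reports the limits of the interactive shell, which can differ from the limits the running confluent daemon inherited. On Linux, the daemon's own values are visible under /proc. The following is a sketch with hypothetical function names, assuming the usual /proc/<pid>/limits layout; pid 4253 is the one from the log line quoted below and is only used as an example.

```python
import os


def parse_open_files_limit(limits_text):
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits text."""
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            soft, hard = line[len(prefix):].split()[:2]
            return soft, hard
    raise ValueError("no 'Max open files' line found")


def count_open_fds(pid):
    """Count file descriptors currently held by a process (Linux /proc only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def report(pid):
    """Collect the open-descriptor count and limits for one process."""
    with open("/proc/%d/limits" % pid) as f:
        soft, hard = parse_open_files_limit(f.read())
    return {"pid": pid, "open": count_open_fds(pid), "soft": soft, "hard": hard}
```

Running `report(4253)` as root on the confluent host would show whether the daemon is anywhere near its descriptor limit when managing 1k+ consoles.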

On 2 May 2017 at 18:26:42, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what does ulimit look like? Wondering about ulimit -n


 

From: banuchka [mailto:***@gmail.com]
Sent: Tuesday, May 02, 2017 12:17 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

there isn’t
 it happen only on nodes with 1k+ hosts
 when im using it on servers with at about 50-100 servers all done fine

 

On 2 May 2017 at 16:58:11, Jarrod Johnson (***@lenovo.com) wrote:

Is there anything special about /etc/confluent/cfg?

 

From: banuchka [mailto:***@gmail.com]
Sent: Tuesday, May 02, 2017 11:39 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Hi,

 

bit follow up. when i’m trying “makeconfluent -l” on local running instance i see error:

 

===

confluent[4253]: stderr :May 02 15:33:21   File "/usr/lib64/python2.7/threading.py", line 823, in __bootstrap_inner

    (self.name, _format_exc())): Exception in thread Thread-1935:

Traceback (most recent call last):

  File "/usr/lib64/python2.7/threading.py", line 810, in __bootstrap_inner

    self.run()

  File "/usr/lib64/python2.7/threading.py", line 763, in run

    self.__target(*self.__args, **self.__kwargs)

  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file

    return cls._sync_to_file()

  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file

    return cls._sync_to_file()

  File "/opt/confluent/lib/python/confluent/conficg/configmanager.py", line 1408, in _sync_to_file

    dbf.close()

  File "/usr/lib64/python2.7/bsddb/__init__.py", line 296, in close

    v = _DeadlockWrap(self.db.close)

  File "/usr/lib64/python2.7/bsddb/dbutils.py", line 68, in DeadlockWrap

    return function(*_args, **_kwargs)

DBError: (9, 'Bad file descriptor -- /etc/confluent/cfg/nodes: Bad file descriptor')

===

 

On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:

Yeah, there will be a bit push in the coming weeks it will have at least an ‘events’ log along with a lot more function.

 

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

 

Let me know if the firmware exploration works out.  That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way.  The ‘works with ipmitool’ though has me scratching my head.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Last idea doesn’t work for me. So by the way idea as is is working great – confluent does disconnect/connect after time in constant. But for now it is 100% correct to say – it is a problem with IDRAC fw.

from release notes for last fw:

===

- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.

===

I’m not sure, but maybe its something like i have here. So did the upgrade on few hosts and give them plenty of time to show me results.

Thanks for your answers, help and time
 it is very interesting quest :)

 

Bit more about Confluent:

- Interesting ambitions 

- Python VS Perl, thats good

- I think log files(not just trace, stderr, stdout) and documentation(source on Github is the best doc o know, but
) are things that i would like to be in Confluent

 

On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:

Very interested in the outcome.  And thank you for working through it.  Also interested what you have liked, would like, and have disliked about confluent.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Thank you Jarrod, i’ll try to add patch and let you know after. Hope 90 minutes is enough, yes.

 

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I’m loathe to do a workaround, but in console.py (find /usr –name console.py) , might be interesting to see how a change like the following:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py

index 95e8551..a5f6062 100644

--- a/pyghmi/ipmi/console.py

+++ b/pyghmi/ipmi/console.py

@@ -42,6 +42,7 @@ class Console(object):

     def __init__(self, bmc, userid, password,

                  iohandler, port=623,

                  force=False, kg=None):

+        self.keepalivecount = 0

         self.keepaliveid = None

         self.connected = False

         self.broken = False

@@ -70,6 +71,7 @@ class Console(object):

         if 'error' in response:

             self._print_error(response['error'])

             return

+        self.keepalivecount = 0

         #Send activate sol payload directive

         #netfn= 6 (application)

         #command = 0x48 (activate payload)

@@ -150,11 +152,12 @@ class Console(object):

             return

         currowner = struct.unpack(

             "<I", struct.pack('4B', *response['data'][:4]))

-        if currowner[0] != self.ipmi_session.sessionid:

+        if currowner[0] != self.ipmi_session.sessionid or  self.keepalivecount > 180:

             # the session is deactivated or active for something else

             self.activated = False

             self._print_error('SOL deactivated')

             return

+        self.keepalivecount += 1

         # ok, still here, that means session is alive, but another

         # common issue is firmware messing with mux on reboot

         # this would be a nice thing to check, but the serial channel

 

If it would pan out, should cause the console session to disconnect itself roughly every 90 minutes and trigger reconnect (is 90 minutes short enough in your case?)  Would require a service confluent restart to see if it had the desired effect.

 

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

 

I would be curious if the same ipmitool is running a day later than a check (e.g. if ipmitool is exiting and getting restarted).  I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1

cloud53.ulan:/home/banuchka # ipmitool sol info 1

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 115.2

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

 

in the console’s log

—

[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected][04/14 13:01:02 console disconnected][04/14 13:01:02 console connected][04/14 13:03:54 console disconnected][04/14 13:04:15 console connected][04/14 13:38:37 console connected][04/14 15:31:47 console disconnected][04/14 15:36:24 console connected][04/14 15:42:08 connection by xcat_console]

---

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial

[Key=iDRAC.Embedded.1#IPMISerial.1]

BaudRate=115200

ChanPrivLimit=4

ConnectionMode=Terminal

DeleteControl=Disabled

EchoControl=Enabled

FlowControl=RTS/CTS

HandshakeControl=Enabled

InputNewLineSeq=1

LineEdit=Enabled

NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

 

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Yes, reopen causes it to work again,  without any garbage
 so looks like normal console :)

Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal console* before...

 

On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:

So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Reopen console did the trick as well...

 

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The console starts showing garbage after <enter> inside rcons.

What do you mean by “restarting console”?

The console resumes working after:

- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such trouble.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I’m out of ideas, so let me show you everything I see.

 

Inside rcons i see:

 

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

 

tcpdump(keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

---

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


... many of our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from the OS / # date -d @1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

Yes, that’s correct


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. The problem is still here. I’m trying to find something useful between confluent and pyghmi...

Restarting confluent resolves the hangs and reopens all connections.

I don’t think restarting confluent once or twice every 24h is the best option.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It is a Dell-related problem, not 100% sure, but... Confluent from current master is doing things well :)

Thanks for the pretty nice tool “confluentdbutil”.

 

On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:

Looks like that problem existed before... The fix was to use ipmitool with keepalive (the one from the xcat repos).

Here pyghmi is used; maybe that is the reason?

 

On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:

Hi,

 

I’m trying to migrate completely from conserver to confluent, but I’ve run into strange behaviour.

Some of my consoles hang after ~24h, so no new messages appear in their logs or in rcons.

I send timestamped messages from the OS to /dev/console every 30-60 min and check them for monitoring purposes (console availability monitoring).

I can open rcons and hit enter, and after a few seconds the console wakes up (strange). I didn’t see this happen with conserver, or maybe I’m wrong...

Some details:

- as far as I can see, most of the consoles that hang are Dell iDRAC. It doesn’t matter whether RacSerial or IPMISerial is in use.
- racreset hard / ipmitool bmc reset didn’t help
- hitting enter in the console wakes it up (for example, with expect I can send \r\n\f, but that looks bad)
- I didn’t try cleaning confluent's conf and restarting it. Not sure it would help.
- HP consoles work well, same ipmi
- a few consoles with a custom plugin work well too

So maybe my question is not about confluent, but if any of you know about similar problems, please share! ;)
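
The availability check described in this message (timestamped markers written to /dev/console, then looked up in the console log) could be automated roughly as follows. This is only a sketch: the marker format is taken from the MONITORING_TEST lines shown in this thread, while the function names and the two-hour staleness threshold are assumptions made for illustration.

```python
import re

# Marker format as it appears in the console log in this thread:
#   MONITORING_TEST <nodename> <epoch seconds>
MARKER = re.compile(r'MONITORING_TEST (\S+) (\d+)')


def last_marker_epoch(log_text):
    """Return the newest epoch found in MONITORING_TEST lines, or None."""
    epochs = [int(m.group(2)) for m in MARKER.finditer(log_text)]
    return max(epochs) if epochs else None


def is_stale(log_text, now_epoch, max_age_seconds=2 * 3600):
    """True if no marker was logged within max_age_seconds (assumed threshold)."""
    last = last_marker_epoch(log_text)
    return last is None or now_epoch - last > max_age_seconds
```

In practice log_text would be read from /var/log/confluent/consoles/<nodename> and now_epoch taken from time.time(); a console flagged as stale is a candidate for a reconnect.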

 

--
banuchka

banuchka
2017-05-02 17:43:45 UTC
Permalink
and for the confluent process:

$ cat /proc/24468/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1048576              1048576              processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       257578               257578               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
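
To compare the limit above against what the process actually uses, something like the following sketch could help. It assumes a Linux /proc filesystem; the helper names are hypothetical, and counting entries in /proc/<pid>/fd needs permission to read that directory.

```python
import os


def parse_soft_limit(limits_text, name='Max open files'):
    """Extract the soft limit for `name` from /proc/<pid>/limits text.

    Returns None when the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith(name):
            fields = line[len(name):].split()
            return None if fields[0] == 'unlimited' else int(fields[0])
    raise KeyError(name)


def open_fd_count(pid):
    """Count the file descriptors currently open in process `pid` (Linux only)."""
    return len(os.listdir('/proc/%d/fd' % pid))


def fd_headroom(pid):
    """Remaining descriptors before the soft limit is hit, or None if unlimited."""
    with open('/proc/%d/limits' % pid) as limits_file:
        soft = parse_soft_limit(limits_file.read())
    return None if soft is None else soft - open_fd_count(pid)
```

For example, fd_headroom(24468) would report how far the confluent process shown above is from its 1048576 soft limit.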

On 2 May 2017 at 18:42:02, banuchka (***@gmail.com) wrote:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 257578
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

On 2 May 2017 at 18:26:42, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what does ulimit look like?  Wondering about ulimit -n


 

From: banuchka [mailto:***@gmail.com]
Sent: Tuesday, May 02, 2017 12:17 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

There isn’t... It happens only on nodes with 1k+ hosts... When I’m using it on servers with about 50-100 servers, everything works fine.

 

On 2 May 2017 at 16:58:11, Jarrod Johnson (***@lenovo.com) wrote:

Is there anything special about /etc/confluent/cfg?

 

From: banuchka [mailto:***@gmail.com]
Sent: Tuesday, May 02, 2017 11:39 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Hi,

 

A bit of follow-up: when I try “makeconfluent -l” on a locally running instance I see this error:

 

===

confluent[4253]: stderr :May 02 15:33:21   File "/usr/lib64/python2.7/threading.py", line 823, in __bootstrap_inner
    (self.name, _format_exc())): Exception in thread Thread-1935:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1417, in _sync_to_file
    return cls._sync_to_file()
  File "/opt/confluent/lib/python/confluent/config/configmanager.py", line 1408, in _sync_to_file
    dbf.close()
  File "/usr/lib64/python2.7/bsddb/__init__.py", line 296, in close
    v = _DeadlockWrap(self.db.close)
  File "/usr/lib64/python2.7/bsddb/dbutils.py", line 68, in DeadlockWrap
    return function(*_args, **_kwargs)
DBError: (9, 'Bad file descriptor -- /etc/confluent/cfg/nodes: Bad file descriptor')

===

 

On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:

Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

 

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

 

Let me know if the firmware exploration works out.  That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way.  The ‘works with ipmitool’ though has me scratching my head.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The last idea doesn’t work for me. By the way, the idea as such works great: confluent does disconnect/connect after the constant time. But for now it is 100% correct to say that it is a problem with the iDRAC firmware.

from release notes for last fw:

===

- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.

===

I’m not sure, but maybe it’s something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show me results.

Thanks for your answers, help and time... it is a very interesting quest :)

 

Bit more about Confluent:

- Interesting ambitions 

- Python VS Perl, thats good

- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent

 

On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:

Very interested in the outcome.  And thank you for working through it.  Also interested what you have liked, would like, and have disliked about confluent.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Thank you Jarrod, i’ll try to add patch and let you know after. Hope 90 minutes is enough, yes.

 

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see what a change like the following does:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

 

If it would pan out, should cause the console session to disconnect itself roughly every 90 minutes and trigger reconnect (is 90 minutes short enough in your case?)  Would require a service confluent restart to see if it had the desired effect.
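
The counting logic of that patch can be looked at in isolation. The sketch below is not pyghmi code: KeepaliveGuard and checks_until_reconnect are hypothetical names, and the 30-second keepalive interval in the comment is an assumption inferred from "180 checks is roughly 90 minutes".

```python
class KeepaliveGuard(object):
    """Counts successful keepalive checks and asks for a reconnect after a limit."""

    def __init__(self, max_checks=180):
        self.max_checks = max_checks
        self.count = 0

    def reset(self):
        # Called whenever a new SOL session is (re)activated.
        self.count = 0

    def should_reconnect(self, current_owner, my_session_id):
        # Reconnect if the payload belongs to another session,
        # or if this session has already passed max_checks keepalives.
        if current_owner != my_session_id or self.count > self.max_checks:
            return True
        self.count += 1
        return False


def checks_until_reconnect(max_checks=180):
    """Number of keepalive checks that pass before a reconnect is forced."""
    guard = KeepaliveGuard(max_checks)
    passed = 0
    while not guard.should_reconnect(1, 1):
        passed += 1
    return passed


# With an assumed 30-second keepalive interval:
# checks_until_reconnect(180) * 30 / 60.0 gives the minutes until a forced reconnect.
```

Because the comparison is "greater than", one more check passes than the configured limit before the reconnect triggers, which does not matter at this granularity.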

 

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

 

I would be curious if the same ipmitool is running a day later than a check (e.g. if ipmitool is exiting and getting restarted).  I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623

cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623

cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

 

in the console’s log

---

[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]

---

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.
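
A quick way to spot this mismatch across many nodes is to parse the "Key : value" lines that ipmitool sol info prints and compare the two rates. The sketch below assumes the output format shown in this thread; parse_sol_info and bit_rates_match are hypothetical helper names, and the text would come from running ipmitool yourself.

```python
def parse_sol_info(text):
    """Turn the 'Key : value' lines of `ipmitool sol info` output into a dict."""
    info = {}
    for line in text.splitlines():
        # Skip blank lines and the leading "Info: ..." notice.
        if ':' not in line or line.startswith('Info:'):
            continue
        key, _, value = line.partition(':')
        info[key.strip()] = value.strip()
    return info


def bit_rates_match(info):
    """True when the volatile and non-volatile SOL bit rates are equal."""
    volatile = info.get('Volatile Bit Rate (kbps)')
    non_volatile = info.get('Non-Volatile Bit Rate (kbps)')
    return volatile is not None and volatile == non_volatile


SAMPLE = """Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Port                    : 623
"""

if __name__ == '__main__':
    print(bit_rates_match(parse_sol_info(SAMPLE)))
```

The rates are compared as strings on purpose, since the values are printed in a fixed format and no arithmetic is needed.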

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623

 

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623


 

banuchka
2017-05-03 18:09:57 UTC
Permalink
Hi,

one more strange thing about confluent:

May  3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26 console connected]
May  3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06 console disconnected]
May  3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30 console connected]
May  3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06 console disconnected]
May  3 13:21:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:21:06 console connected]
May  3 13:22:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:22:03 console disconnected]
May  3 13:26:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:26:00 console connected]
May  3 13:32:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:32:03 console disconnected]
May  3 13:33:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:33:15 console connected]
May  3 14:22:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:22:00 console disconnected]
May  3 14:23:11 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:23:09 console connected]
May  3 14:32:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:32:00 console disconnected]
May  3 14:39:44 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:39:42 console connected]
May  3 14:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:05 console disconnected]
May  3 14:52:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:15 console connected]
May  3 15:02:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:02:13 console disconnected]
May  3 15:06:40 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:06:38 console connected]
May  3 15:12:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:12:15 console disconnected]
May  3 15:15:30 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:15:28 console connected]
May  3 15:22:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:22:15 console disconnected]
May  3 15:30:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:30:26 console connected]
May  3 15:32:21 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:32:19 console disconnected]
May  3 15:36:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:36:40 console connected]
May  3 15:41:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:41:57 console disconnected]
May  3 15:45:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:45:15 console connected]
May  3 15:51:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:51:57 console disconnected]
May  3 15:57:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:57:03 console connected]
May  3 17:22:12 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:22:10 console disconnected]
May  3 17:26:38 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:26:36 console connected]
May  3 17:32:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:32:13 console disconnected]
May  3 17:41:26 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:24 console connected]
May  3 17:42:01 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:59 console disconnected]
May  3 17:49:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:49:30 console connected]
May  3 17:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:05 console disconnected]
May  3 17:52:42 xcat-sn1.mlan confluent[4102]: audit :May 03 17:52:40 {"operation": "start", "allowed": true, "target": "/nodes/unreg25/console/session", "user": "xcat_console"}
May  3 17:52:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:40 connection by xcat_console]
May  3 17:52:45 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:43 console disconnected]
May  3 17:56:09 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:56:07 console connected]

This is not a Dell BMC.

I think I've written about this behaviour here before. Anyway, the times are so random that it doesn't look like a timeout issue somewhere.

I need some advice before rolling back :) Thanks
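
To judge whether the connect/disconnect times really are random, the syslog lines above can be reduced to per-session durations. This is only a sketch: the regular expression is written against the line format shown in this message, and the function name and the `year` default are assumptions (the log lines themselves carry no year).

```python
import re
from datetime import datetime

# Matches the bracketed event at the end of each confluent syslog line.
EVENT = re.compile(
    r"\[(\d\d/\d\d \d\d:\d\d:\d\d) console (connected|disconnected)\]")


def session_lengths(lines, year=2017):
    """Return how many seconds each console session stayed connected."""
    lengths = []
    connected_at = None
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue  # audit lines and user connections are ignored
        stamp = datetime.strptime(
            "%d/%s" % (year, match.group(1)), "%Y/%m/%d %H:%M:%S")
        if match.group(2) == "connected":
            connected_at = stamp
        elif connected_at is not None:
            lengths.append(int((stamp - connected_at).total_seconds()))
            connected_at = None
    return lengths
```

Feeding it the full log and printing the result (or the raw disconnect timestamps) gives a quick view of whether the sessions end after a fixed interval or at unrelated moments.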

On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:

Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

 

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

 

Let me know if the firmware exploration works out.  That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way.  The ‘works with ipmitool’ though has me scratching my head.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The last idea doesn't solve it for me. The patch itself works as intended: confluent disconnects and reconnects after the constant interval. But for now it is 100% correct to say that it is a problem with the iDRAC firmware.

From the release notes for the latest firmware:

===

- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.

===

I'm not sure, but maybe it is something like what I have here. So I upgraded a few hosts and will give them plenty of time to show me results.

Thanks for your answers, help and time... it is a very interesting quest :)

Bit more about Confluent:

- Interesting ambitions

- Python vs. Perl, that's good

- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent

 

On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:

Very interested in the outcome.  And thank you for working through it.  Also interested what you have liked, would like, and have disliked about confluent.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Thank you Jarrod, I'll try applying the patch and let you know afterwards. I hope 90 minutes is enough, yes.

 

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I'm loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see what a change like the following does:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

 

If it pans out, this should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?). It would require a 'service confluent restart' to see whether it has the desired effect.
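
The "roughly every 90 minutes" figure follows from the threshold of 180 in the patch if the keepalive check runs about every 30 seconds. That interval is an assumption here, inferred from the 90-minute estimate rather than read from pyghmi. The standalone sketch below uses a hypothetical `KeepaliveGuard` class (not pyghmi code) to mimic the counter logic and work out the resulting period.

```python
KEEPALIVE_INTERVAL = 30   # seconds between checks; assumed, not taken from pyghmi
KEEPALIVE_LIMIT = 180     # threshold used in the patch


class KeepaliveGuard(object):
    """Hypothetical stand-in for the counter the patch adds to Console."""

    def __init__(self):
        self.keepalivecount = 0
        self.activated = True

    def on_keepalive(self, owner_matches=True):
        """Mirror the patched check: deactivate on owner change or limit."""
        if not owner_matches or self.keepalivecount > KEEPALIVE_LIMIT:
            self.activated = False
            return
        self.keepalivecount += 1


def seconds_until_forced_reconnect():
    """Count checks until the guard deactivates and convert to seconds."""
    guard = KeepaliveGuard()
    checks = 0
    while guard.activated:
        guard.on_keepalive()
        checks += 1
    return checks * KEEPALIVE_INTERVAL
```

Under these assumptions the guard trips on the 182nd check, that is after 5460 seconds or 91 minutes, which is consistent with the estimate given above.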

 

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

 

I would be curious if the same ipmitool is running a day later than a check (e.g. if ipmitool is exiting and getting restarted).  I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

in the console's log:

---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.
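
Comparing the two bit rates across many nodes by eye gets tedious, so the `ipmitool sol info` output can be parsed instead. The sketch below relies only on the `Key : value` layout and the two field names visible in the output quoted in this thread; the function names are hypothetical, and running ipmitool itself is left out.

```python
def parse_sol_info(text):
    """Turn the 'Key : value' lines of `ipmitool sol info` output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skips lines such as "Password:" or the "Info:" notice
            info[key.strip()] = value.strip()
    return info


def bit_rates_match(info):
    """True only if both bit rates are present and equal."""
    volatile = info.get("Volatile Bit Rate (kbps)")
    non_volatile = info.get("Non-Volatile Bit Rate (kbps)")
    return volatile is not None and volatile == non_volatile
```

A node where `bit_rates_match` returns False is a candidate for the `ipmitool sol set volatile-bit-rate` command suggested above.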

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

Corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Yes, reopening makes it work again, without any garbage... so it looks like a normal console :)

Hitting <enter> at first produces garbage output (ᅵᅵ Porᅵlo) and then the *normal console* as before...

 

On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:

So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Reopening the console did the trick as well...

 

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The console starts showing garbage after <enter> inside rcons.

What do you mean by “restarting the console”?

The console continues working after:

- <enter> inside rcons/confetty

- a BMC reset (console disconnected/console connected)

You're absolutely right: with ipmitool and conserver on the same servers we had no such trouble.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I'm out of ideas, so let me show you everything I see.

Inside rcons I see:

MONITORING_TEST dbb54 1492160401 <= the last message I sent from the OS (a more complete log is below)

tcpdump (keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

---

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


... many of our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

Yes, that's correct.
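
The ownership check mentioned here appears in the patch context quoted elsewhere in this thread as a `struct` round-trip: the first four response bytes are packed and then reinterpreted as one little-endian 32-bit value, which is compared with the session id. A small standalone illustration, using made-up byte values and hypothetical helper names rather than real BMC responses:

```python
import struct


def owner_session_id(data):
    """Decode the first four bytes (little-endian) into a session id."""
    return struct.unpack("<I", struct.pack("4B", *data[:4]))[0]


def still_owner(data, my_session_id):
    """True if the decoded owner matches our own session id."""
    return owner_session_id(data) == my_session_id
```

For example, the bytes `0x78, 0x56, 0x34, 0x12` decode to `0x12345678`, because the least significant byte comes first.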


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. The problems are still here. I'm trying to find something useful between confluent and pyghmi...

Restarting confluent clears the hangs (it reopens all connections).

I don't think restarting confluent once or twice every 24 hours is the best option.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It is a Dell-related problem, not 100% but...

Confluent from current master is doing things well :)

Thanks for the pretty nice tool "confluentdbutil".

 

On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:

Looks like this problem has come up before... The fix was to use ipmitool with keepalive (the one from the xCAT repos).

Here pyghmi is used; maybe that is the reason?

 


--
banuchka

Jarrod Johnson
2017-05-03 18:12:39 UTC
Permalink
Hmm, and there isn’t anything like conserver or another confluent trying to run at the same time to the same node?

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, May 03, 2017 2:10 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Hi,

one more strange thing about confluent:

May 3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26 console connected]
May 3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06 console disconnected]
May 3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30 console connected]
May 3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06 console disconnected]
May 3 13:21:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:21:06 console connected]
May 3 13:22:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:22:03 console disconnected]
May 3 13:26:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:26:00 console connected]
May 3 13:32:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:32:03 console disconnected]
May 3 13:33:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:33:15 console connected]
May 3 14:22:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:22:00 console disconnected]
May 3 14:23:11 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:23:09 console connected]
May 3 14:32:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:32:00 console disconnected]
May 3 14:39:44 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:39:42 console connected]
May 3 14:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:05 console disconnected]
May 3 14:52:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:15 console connected]
May 3 15:02:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:02:13 console disconnected]
May 3 15:06:40 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:06:38 console connected]
May 3 15:12:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:12:15 console disconnected]
May 3 15:15:30 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:15:28 console connected]
May 3 15:22:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:22:15 console disconnected]
May 3 15:30:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:30:26 console connected]
May 3 15:32:21 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:32:19 console disconnected]
May 3 15:36:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:36:40 console connected]
May 3 15:41:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:41:57 console disconnected]
May 3 15:45:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:45:15 console connected]
May 3 15:51:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:51:57 console disconnected]
May 3 15:57:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:57:03 console connected]
May 3 17:22:12 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:22:10 console disconnected]
May 3 17:26:38 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:26:36 console connected]
May 3 17:32:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:32:13 console disconnected]
May 3 17:41:26 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:24 console connected]
May 3 17:42:01 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:59 console disconnected]
May 3 17:49:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:49:30 console connected]
May 3 17:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:05 console disconnected]
May 3 17:52:42 xcat-sn1.mlan confluent[4102]: audit :May 03 17:52:40 {"operation": "start", "allowed": true, "target": "/nodes/unreg25/console/session", "user": "xcat_console"}
May 3 17:52:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:40 connection by xcat_console]
May 3 17:52:45 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:43 console disconnected]
May 3 17:56:09 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:56:07 console connected]

it isn’t Dell BMC


I think i’ve wrote about that behaviour here before, anyway. Times here are so random doesn’t look like a timeout issue in some place.

Need an advice before rolling back :) Thanks


On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Yeah, there will be a bit push in the coming weeks it will have at least an ‘events’ log along with a lot more function.

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

Let me know if the firmware exploration works out. That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way. The ‘works with ipmitool’ though has me scratching my head.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Last idea doesn’t work for me. So by the way idea as is is working great – confluent does disconnect/connect after time in constant. But for now it is 100% correct to say – it is a problem with IDRAC fw.
from release notes for last fw:
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe its something like i have here. So did the upgrade on few hosts and give them plenty of time to show me results.
Thanks for your answers, help and time
 it is very interesting quest :)

Bit more about Confluent:
- Interesting ambitions
- Python VS Perl, thats good
- I think log files(not just trace, stderr, stdout) and documentation(source on Github is the best doc o know, but
) are things that i would like to be in Confluent


On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Very interested in the outcome. And thank you for working through it. Also interested what you have liked, would like, and have disliked about confluent.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Thank you Jarrod, i’ll try to add patch and let you know after. Hope 90 minutes is enough, yes.


On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

I’m loathe to do a workaround, but in console.py (find /usr –name console.py) , might be interesting to see how a change like the following:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
def __init__(self, bmc, userid, password,
iohandler, port=623,
force=False, kg=None):
+ self.keepalivecount = 0
self.keepaliveid = None
self.connected = False
self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
if 'error' in response:
self._print_error(response['error'])
return
+ self.keepalivecount = 0
#Send activate sol payload directive
#netfn= 6 (application)
#command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
return
currowner = struct.unpack(
"<I", struct.pack('4B', *response['data'][:4]))
- if currowner[0] != self.ipmi_session.sessionid:
+ if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
# the session is deactivated or active for something else
self.activated = False
self._print_error('SOL deactivated')
return
+ self.keepalivecount += 1
# ok, still here, that means session is alive, but another
# common issue is firmware messing with mux on reboot
# this would be a nice thing to check, but the serial channel

If it would pan out, should cause the console session to disconnect itself roughly every 90 minutes and trigger reconnect (is 90 minutes short enough in your case?) Would require a service confluent restart to see if it had the desired effect.

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

I would be curious if the same ipmitool is running a day later than a check (e.g. if ipmitool is exiting and getting restarted). I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 115.2
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

and nothing happened

in the console’s log
—
[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected][04/14 13:01:02 console disconnected][04/14 13:01:02 console connected][04/14 13:03:54 console disconnected][04/14 13:04:15 console connected][04/14 13:38:37 console connected][04/14 15:31:47 console disconnected][04/14 15:36:24 console connected][04/14 15:42:08 connection by xcat_console]
---


On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you do have any in corrupted state, would be interested to see what happens if you do:
ipmitool sol set volatile-bit-rate 115.2 1


To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

115200

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

that is strange, right


On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, what’s the baud rate the console is actually running at? Odd to see the volatile and non volatile bit rates not be the same.
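A quick way to spot this kind of mismatch across many nodes is to parse the text of `ipmitool sol info` and compare the two fields. The following is a minimal sketch; `parse_sol_info` and `bit_rates_match` are hypothetical helper names, and the parser assumes the "key : value" layout shown in this thread.

```python
def parse_sol_info(text):
    """Turn 'key : value' lines of `ipmitool sol info` output into a dict."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skips lines such as "Info: ..." or "Password:"
            params[key.strip()] = value.strip()
    return params


def bit_rates_match(params):
    """True only if both bit rates are present and equal."""
    volatile = params.get("Volatile Bit Rate (kbps)")
    non_volatile = params.get("Non-Volatile Bit Rate (kbps)")
    return volatile is not None and volatile == non_volatile


if __name__ == "__main__":
    sample = (
        "Set in progress : set-complete\n"
        "Volatile Bit Rate (kbps) : 38.4\n"
        "Non-Volatile Bit Rate (kbps) : 115.2\n"
    )
    print(bit_rates_match(parse_sol_info(sample)))
```

Note that in this thread the same 38.4/115.2 mismatch was reported on both the working and the corrupted console, so a check like this is a diagnostic aid rather than a confirmed cause.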

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.




On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623



Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Yes, reopen causes it to work again, without any garbage... so it looks like a normal console :)
Hitting <enter> at first causes garbage output (ᅵᅵ Porᅵlo) and then a *normal console* as before...


On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Reopen console did the trick as well...


On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
‘ctrl-e, then c, then o’ to reconnect.

Was conserver ondemand or full logging?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting console”?
The console resumes working after:
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

I’m out of ideas, so let me show you everything I see.

Inside rcons i see:

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

tcpdump(keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80




13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
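To check whether the traffic above really is a periodic keepalive, the gaps between packets sent by the console server can be computed from the capture text. This is a rough sketch: `client_send_times` and `gaps_seconds` are hypothetical helpers, and the parser assumes the two-line `tcpdump -v` layout shown here (a timestamped header line followed by a "src > dst" line).

```python
import re
from datetime import datetime

HEADER = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")
FLOW = re.compile(r"^\s*(\S+) > (\S+): ")


def client_send_times(capture, client_prefix):
    """Timestamps of packets whose source address starts with client_prefix."""
    times = []
    pending = None
    for line in capture.splitlines():
        header = HEADER.match(line)
        if header:
            pending = datetime.strptime(header.group(1), "%H:%M:%S.%f")
            continue
        flow = FLOW.match(line)
        if flow and pending is not None:
            if flow.group(1).startswith(client_prefix):
                times.append(pending)
            pending = None
    return times


def gaps_seconds(times):
    """Seconds between consecutive timestamps (no midnight wrap handling)."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    capture = (
        "13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)\n"
        "    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64\n"
        "13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)\n"
        "    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64\n"
    )
    print(gaps_seconds(client_send_times(capture, "10.10.114.30.")))
```

For the two request packets shown above the gap comes out at about 27 seconds.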

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
Porᅵ
—

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

and Magic, rcons:
---
Porï¿œlo]0;console: dbb54 [13:25]


dbb54 login:
---


On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you ctrl-e, c, o, does it restore the console after the time?

Can you tell that it goes after exactly 24hours on the dot?

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]
... many of our own messages ...
^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)
^M
[04/14 09:05:13 console connected]
[04/14 09:11:59 console connected]
[04/14 09:13:38 console disconnected]
[04/14 09:14:54 console connected]
[04/14 10:15:13 connection by xcat_console]
[04/14 10:15:14 disconnection by xcat_console]
[04/14 13:14:30 connection by xcat_console]
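The MONITORING_TEST lines in the log carry a Unix timestamp, so a monitoring script can decide whether a console has gone quiet by decoding the last one. Below is a minimal sketch; `parse_monitoring_line` and `is_stale` are hypothetical helpers, and the "MONITORING_TEST <node> <epoch>" layout is taken from the log excerpt above.

```python
import re
from datetime import datetime, timedelta, timezone

MARKER = re.compile(r"MONITORING_TEST (\S+) (\d+)")


def parse_monitoring_line(line):
    """Return (node, UTC datetime) from a MONITORING_TEST line, or None."""
    match = MARKER.search(line)
    if not match:
        return None
    sent = datetime.fromtimestamp(int(match.group(2)), tz=timezone.utc)
    return match.group(1), sent


def is_stale(sent, now, max_age=timedelta(hours=1)):
    """True if the last marker is older than max_age."""
    return now - sent > max_age


if __name__ == "__main__":
    node, sent = parse_monitoring_line("MONITORING_TEST dbb54 1492160401")
    print(node, sent.isoformat())
    print(is_stale(sent, datetime.now(timezone.utc)))
```

The one-hour `max_age` default is an assumption chosen to match the 30-60 minute send interval mentioned in the thread.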


Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours. In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct
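The ownership check mentioned here compares the session ID reported by the BMC with confluent's own. The decoding step can be tried on its own, using the same `struct` expression that appears in the patch elsewhere in this thread. `payload_owner` is a hypothetical helper name, and the sample bytes are invented for illustration.

```python
import struct


def payload_owner(data):
    """Decode the first four response bytes as a little-endian session ID."""
    return struct.unpack("<I", struct.pack("4B", *data[:4]))[0]


if __name__ == "__main__":
    response_data = [0x78, 0x56, 0x34, 0x12, 0x00]  # made-up example bytes
    our_session_id = 0x12345678
    owner = payload_owner(response_data)
    print(hex(owner), "owned by us:", owner == our_session_id)
```

The same value can be obtained with `int.from_bytes(bytes(data[:4]), "little")`.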


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net<mailto:xcat-***@lists.sourceforge.net>
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

My last reply was incorrect. The problems are still here. I'm trying to find something useful between confluent and pyghmi...
A confluent restart clears the hangs (it reopens all connections).
I don't think restarting confluent once or twice every 24 h is the best option.
--
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
It is a Dell-related problem, not 100% but...

Confluent from current master is doing things well :)
Thanks for the pretty nice tool “confluentdbutil”.


On 13 April 2017 at 11:30:14, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Looks like this problem has come up before... The fix was to use ipmitool with keepalive (the one from the xcat repos).
Here pyghmi is used; maybe that is the reason?


On 13 April 2017 at 08:22:28, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Hi,

I'm trying to completely migrate from conserver to confluent, but I've run into strange behaviour.
Some of my consoles hang after ~24 h, so no new messages appear in their logs or in rcons.
I send messages with a timestamp from the OS (>/dev/console) every 30-60 min and check them for monitoring purposes (console availability monitoring).
I can open rcons and hit enter, and after a few seconds the console wakes up (strange). I didn't see this happen with conserver, or maybe I'm wrong...
Some details:
- as far as I can see, most of the consoles that hang are Dell iDRAC. It doesn't matter whether RacSerial or IPMISerial is in use.
- racreset hard/ipmitool bmc reset didn't help
- hitting enter in the console wakes it up (for example, with expect I can send \r\n\f, but it looks bad)
- I didn't try to clean confluent's conf and restart it. Not sure it would help.
- HP consoles work well, same IPMI
- a few consoles with a custom plugin work fine as well

So maybe my question is not about confluent, but if any of you know about similar problems, please share! ;)
--
banuchka
banuchka
2017-05-03 18:38:53 UTC
No, there isn’t.
All attempts come from one IP (logs from iLO):
123030 Informational iLO 4 05/03/2017 18:16 05/03/2017 18:16 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123029 Informational iLO 4 05/03/2017 18:16 05/03/2017 18:16 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123028 Informational iLO 4 05/03/2017 18:05 05/03/2017 18:05 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123027 Informational iLO 4 05/03/2017 18:05 05/03/2017 18:05 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123022 Informational iLO 4 05/03/2017 17:54 05/03/2017 17:54 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123021 Informational iLO 4 05/03/2017 17:54 05/03/2017 17:54 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123020 Informational iLO 4 05/03/2017 17:49 05/03/2017 17:49 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123019 Informational iLO 4 05/03/2017 17:49 05/03/2017 17:49 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123016 Informational iLO 4 05/03/2017 17:40 05/03/2017 17:40 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123015 Informational iLO 4 05/03/2017 17:40 05/03/2017 17:40 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123010 Informational iLO 4 05/03/2017 17:26 05/03/2017 17:26 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123009 Informational iLO 4 05/03/2017 17:26 05/03/2017 17:26 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122994 Informational iLO 4 05/03/2017 15:56 05/03/2017 15:56 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122993 Informational iLO 4 05/03/2017 15:56 05/03/2017 15:56 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122990 Informational iLO 4 05/03/2017 15:44 05/03/2017 15:44 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122989 Informational iLO 4 05/03/2017 15:44 05/03/2017 15:44 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122988 Informational iLO 4 05/03/2017 15:36 05/03/2017 15:36 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122987 Informational iLO 4 05/03/2017 15:36 05/03/2017 15:36 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122986 Informational iLO 4 05/03/2017 15:30 05/03/2017 15:30 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122985 Informational iLO 4 05/03/2017 15:30 05/03/2017 15:30 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122980 Informational iLO 4 05/03/2017 15:15 05/03/2017 15:15 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122979 Informational iLO 4 05/03/2017 15:15 05/03/2017 15:15 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122974 Informational iLO 4 05/03/2017 15:06 05/03/2017 15:06 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122973 Informational iLO 4 05/03/2017 15:06 05/03/2017 15:06 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122970 Informational iLO 4 05/03/2017 14:52 05/03/2017 14:52 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).
122967 Informational iLO 4 05/03/2017 14:52 05/03/2017 14:52 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
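The claim that every login comes from a single address can be checked mechanically against an exported iLO event log. Here is a small sketch; `parse_ilo_events` and `source_ips` are hypothetical helpers, and the regular expression assumes exactly the line layout pasted above.

```python
import re

ENTRY = re.compile(
    r"^(?P<id>\d+)\s+\S+\s+iLO \d+\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2})\s+.*?"
    r"IPMI/RMCP (?P<action>login|logout)(?: by|:) (?P<user>\S+) - "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+)\((?P<host>[^)]*)\)"
)


def parse_ilo_events(log_text):
    """Return one dict per IPMI/RMCP login or logout line."""
    events = []
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def source_ips(events):
    """Set of client addresses seen in the events."""
    return {event["ip"] for event in events}


if __name__ == "__main__":
    log = (
        "123030 Informational iLO 4 05/03/2017 18:16 05/03/2017 18:16 1 "
        "IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).\n"
        "123029 Informational iLO 4 05/03/2017 18:16 05/03/2017 18:16 1 "
        "IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).\n"
    )
    print(source_ips(parse_ilo_events(log)))
```

A result with more than one address would point to a second client competing for the SOL session.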

On 3 May 2017 at 19:19:26, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, and there isn’t anything like conserver or another confluent trying to run at the same time to the same node?

 

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, May 03, 2017 2:10 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Hi,

 

one more strange thing about confluent:

 

May  3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26 console connected]
May  3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06 console disconnected]
May  3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30 console connected]
May  3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06 console disconnected]
May  3 13:21:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:21:06 console connected]
May  3 13:22:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:22:03 console disconnected]
May  3 13:26:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:26:00 console connected]
May  3 13:32:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:32:03 console disconnected]
May  3 13:33:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:33:15 console connected]
May  3 14:22:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:22:00 console disconnected]
May  3 14:23:11 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:23:09 console connected]
May  3 14:32:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:32:00 console disconnected]
May  3 14:39:44 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:39:42 console connected]
May  3 14:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:05 console disconnected]
May  3 14:52:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:15 console connected]
May  3 15:02:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:02:13 console disconnected]
May  3 15:06:40 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:06:38 console connected]
May  3 15:12:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:12:15 console disconnected]
May  3 15:15:30 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:15:28 console connected]
May  3 15:22:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:22:15 console disconnected]
May  3 15:30:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:30:26 console connected]
May  3 15:32:21 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:32:19 console disconnected]
May  3 15:36:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:36:40 console connected]
May  3 15:41:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:41:57 console disconnected]
May  3 15:45:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:45:15 console connected]
May  3 15:51:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:51:57 console disconnected]
May  3 15:57:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:57:03 console connected]
May  3 17:22:12 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:22:10 console disconnected]
May  3 17:26:38 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:26:36 console connected]
May  3 17:32:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:32:13 console disconnected]
May  3 17:41:26 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:24 console connected]
May  3 17:42:01 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:59 console disconnected]
May  3 17:49:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:49:30 console connected]
May  3 17:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:05 console disconnected]
May  3 17:52:42 xcat-sn1.mlan confluent[4102]: audit :May 03 17:52:40 {"operation": "start", "allowed": true, "target": "/nodes/unreg25/console/session", "user": "xcat_console"}
May  3 17:52:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:40 connection by xcat_console]
May  3 17:52:45 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:43 console disconnected]
May  3 17:56:09 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:56:07 console connected]
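To judge whether the flapping follows a pattern, it helps to turn the bracketed events into connected durations. The following is a minimal sketch; `connected_spans` is a hypothetical helper, it only looks at the "[MM/DD HH:MM:SS console connected|disconnected]" markers, and the year (2017) is an assumption taken from the message dates because the log itself carries none.

```python
import re
from datetime import datetime

EVENT = re.compile(
    r"\[(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) console (connected|disconnected)\]"
)


def connected_spans(log_text, year=2017):
    """Seconds each console session stayed connected.

    Pairs every 'console connected' event with the next
    'console disconnected' event; unmatched events are ignored.
    """
    spans = []
    started = None
    for date, clock, state in EVENT.findall(log_text):
        when = datetime.strptime(f"{year}/{date} {clock}", "%Y/%m/%d %H:%M:%S")
        if state == "connected":
            started = when
        elif started is not None:
            spans.append(int((when - started).total_seconds()))
            started = None
    return spans


if __name__ == "__main__":
    sample = (
        "[05/03 12:57:26 console connected]\n"
        "[05/03 13:02:06 console disconnected]\n"
        "[05/03 13:10:30 console connected]\n"
        "[05/03 13:12:06 console disconnected]\n"
    )
    print(connected_spans(sample))
```

Because the pattern matches anywhere in the text, the same function works on the syslog lines above and on the single-line console log format where events are concatenated.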

 

It isn’t a Dell BMC...

I think I’ve written about that behaviour here before, anyway. The times here are so random that it doesn’t look like a timeout issue somewhere.

I need some advice before rolling back :) Thanks

 

On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:

Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

 

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

 

Let me know if the firmware exploration works out.  That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way.  The ‘works with ipmitool’ though has me scratching my head.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The last idea doesn’t work for me. The idea as such works great, by the way: confluent does disconnect/connect after a constant time. But for now it is 100% correct to say that it is a problem with the iDRAC firmware.

from release notes for last fw:

===

- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.

===

I’m not sure, but maybe it is something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show me results.

Thanks for your answers, help and time... it is a very interesting quest :)

A bit more about Confluent:

- Interesting ambitions
- Python vs. Perl, that’s good
- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent

 

On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:

Very interested in the outcome.  And thank you for working through it.  Also interested what you have liked, would like, and have disliked about confluent.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Thank you Jarrod, I’ll try adding the patch and let you know afterwards. I hope 90 minutes is enough, yes.

 

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following works out:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

 

If it pans out, it should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?). It would require a service confluent restart to see if it had the desired effect.

 

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

 

I would be curious whether the same ipmitool process is still running a day after a check (e.g. whether ipmitool is exiting and getting restarted). I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 115.2
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

 

in the console’s log

—

[04/14 12:49:12 console disconnected][04/14 12:49:29 console connected][04/14 13:01:02 console disconnected][04/14 13:01:02 console connected][04/14 13:03:54 console disconnected][04/14 13:04:15 console connected][04/14 13:38:37 console connected][04/14 15:31:47 console disconnected][04/14 15:36:24 console connected][04/14 15:42:08 connection by xcat_console]

---

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623

 

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1
Password:
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Enabled                         : true
Force Encryption                : true
Force Authentication            : false
Privilege Level                 : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold        : 255
Retry Count                     : 7
Retry Interval (ms)             : 480
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Channel                 : 1 (0x01)
Payload Port                    : 623


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Yes, reopen causes it to work again, without any garbage... so it looks like a normal console :)

Hitting <enter> at first causes garbage output (ᅵᅵ Porᅵlo) and then a *normal console* as before...

 

On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:

So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Reopen console did the trick as well...

 

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The console starts showing garbage after <enter> inside rcons.

What do you mean by “restarting console”?

The console resumes working after:

- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I’m out of ideas, let me show you all i see.

 

Inside rcons i see:

 

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

 

tcpdump(keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 




 

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

—

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176

13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


... many of our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from the OS / # date -d @1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. The problems are still here. I’m trying to find something useful between confluent and pyghmi...

A confluent restart clears the hangs and reopens all connections.

I don’t think restarting confluent once or twice every 24 h is the best option.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It is a Dell-related problem, not 100% but...

Confluent from current master is doing things well :)

Thanks for the pretty nice tool “confluentdbutil”.

 

On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:

It looks like this problem has come up before... The fix was to use ipmitool with keepalive (the one from the xcat repos).

Here pyghmi is used; maybe that is the reason?

 

On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:

Hi,

 

I’m trying to migrate completely from conserver to confluent, but I’ve run into strange behaviour.

Some of my consoles hang after ~24 h: no new messages appear in their logs or in rcons.

I send timestamped messages from the OS to /dev/console every 30-60 min and check for them for monitoring purposes (console availability monitoring).

I can open rcons and hit enter, and after a few seconds the console wakes up (strange). I didn’t see this happen with conserver, or maybe I’m wrong...

Some details:

- as far as I can see, most of the consoles that hang are Dell iDRAC. It doesn’t matter whether RacSerial or IPMISerial is in use.

- racreset hard/ipmitool bmc reset didn’t help

- hitting enter in the console wakes it up (for example, with expect I can send \r\n\f, but that looks bad)

- I didn’t try cleaning confluent’s conf and restarting it. Not sure it would help.

- HP consoles work well, same ipmi

- a few consoles with a custom plugin work well too

 

So maybe my question is not about confluent, but if any of you know about similar problems, please share! ;)

 

--
banuchka

Jarrod Johnson
2017-05-03 19:13:00 UTC
Permalink
Oh missed this one... hmm


Here it looks like it gets logged out and immediately logs back in?

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, May 03, 2017 2:39 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

No, there aren’t.
All attempts come from one IP (logs from iLO):
123030 Informational iLO 4 05/03/2017 18:16 05/03/2017 18:16 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123029 Informational iLO 4 05/03/2017 18:16 05/03/2017 18:16 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123028 Informational iLO 4 05/03/2017 18:05 05/03/2017 18:05 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123027 Informational iLO 4 05/03/2017 18:05 05/03/2017 18:05 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123022 Informational iLO 4 05/03/2017 17:54 05/03/2017 17:54 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123021 Informational iLO 4 05/03/2017 17:54 05/03/2017 17:54 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123020 Informational iLO 4 05/03/2017 17:49 05/03/2017 17:49 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123019 Informational iLO 4 05/03/2017 17:49 05/03/2017 17:49 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123016 Informational iLO 4 05/03/2017 17:40 05/03/2017 17:40 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123015 Informational iLO 4 05/03/2017 17:40 05/03/2017 17:40 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

123010 Informational iLO 4 05/03/2017 17:26 05/03/2017 17:26 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123009 Informational iLO 4 05/03/2017 17:26 05/03/2017 17:26 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122994 Informational iLO 4 05/03/2017 15:56 05/03/2017 15:56 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122993 Informational iLO 4 05/03/2017 15:56 05/03/2017 15:56 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122990 Informational iLO 4 05/03/2017 15:44 05/03/2017 15:44 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122989 Informational iLO 4 05/03/2017 15:44 05/03/2017 15:44 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122988 Informational iLO 4 05/03/2017 15:36 05/03/2017 15:36 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122987 Informational iLO 4 05/03/2017 15:36 05/03/2017 15:36 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122986 Informational iLO 4 05/03/2017 15:30 05/03/2017 15:30 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122985 Informational iLO 4 05/03/2017 15:30 05/03/2017 15:30 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122980 Informational iLO 4 05/03/2017 15:15 05/03/2017 15:15 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122979 Informational iLO 4 05/03/2017 15:15 05/03/2017 15:15 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122974 Informational iLO 4 05/03/2017 15:06 05/03/2017 15:06 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
122973 Informational iLO 4 05/03/2017 15:06 05/03/2017 15:06 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

122970 Informational iLO 4 05/03/2017 14:52 05/03/2017 14:52 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).
122967 Informational iLO 4 05/03/2017 14:52 05/03/2017 14:52 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).


On 3 May 2017 at 19:19:26, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, and there isn’t anything like conserver or another confluent trying to run at the same time to the same node?

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, May 03, 2017 2:10 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Hi,

one more strange thing about confluent:

May 3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26 console connected]
May 3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06 console disconnected]
May 3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30 console connected]
May 3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06 console disconnected]
May 3 13:21:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:21:06 console connected]
May 3 13:22:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:22:03 console disconnected]
May 3 13:26:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:26:00 console connected]
May 3 13:32:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:32:03 console disconnected]
May 3 13:33:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:33:15 console connected]
May 3 14:22:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:22:00 console disconnected]
May 3 14:23:11 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:23:09 console connected]
May 3 14:32:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:32:00 console disconnected]
May 3 14:39:44 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:39:42 console connected]
May 3 14:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:05 console disconnected]
May 3 14:52:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:15 console connected]
May 3 15:02:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:02:13 console disconnected]
May 3 15:06:40 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:06:38 console connected]
May 3 15:12:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:12:15 console disconnected]
May 3 15:15:30 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:15:28 console connected]
May 3 15:22:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:22:15 console disconnected]
May 3 15:30:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:30:26 console connected]
May 3 15:32:21 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:32:19 console disconnected]
May 3 15:36:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:36:40 console connected]
May 3 15:41:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:41:57 console disconnected]
May 3 15:45:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:45:15 console connected]
May 3 15:51:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:51:57 console disconnected]
May 3 15:57:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:57:03 console connected]
May 3 17:22:12 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:22:10 console disconnected]
May 3 17:26:38 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:26:36 console connected]
May 3 17:32:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:32:13 console disconnected]
May 3 17:41:26 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:24 console connected]
May 3 17:42:01 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:59 console disconnected]
May 3 17:49:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:49:30 console connected]
May 3 17:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:05 console disconnected]
May 3 17:52:42 xcat-sn1.mlan confluent[4102]: audit :May 03 17:52:40 {"operation": "start", "allowed": true, "target": "/nodes/unreg25/console/session", "user": "xcat_console"}
May 3 17:52:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:40 connection by xcat_console]
May 3 17:52:45 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:43 console disconnected]
May 3 17:56:09 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:56:07 console connected]

It isn’t a Dell BMC.


I think I’ve written about this behaviour here before. Anyway, the times are so random that it doesn’t look like a timeout issue somewhere.

I need some advice before rolling back :) Thanks
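
To judge whether the reconnects follow a pattern, the connect/disconnect events above can be turned into session lengths. The following is a minimal sketch, not part of confluent: the function name `session_lengths` and the `year` parameter are assumptions made for illustration (the log lines carry no year).

```python
import re
from datetime import datetime

# Matches the bracketed confluent event, e.g. "[05/03 13:02:06 console disconnected]".
EVENT_RE = re.compile(r"\[(\d\d/\d\d \d\d:\d\d:\d\d) console (connected|disconnected)\]")


def session_lengths(lines, year=2017):
    """Return how long each connected period lasted, in seconds.

    `year` is an assumption: the log timestamps do not include one.
    """
    lengths = []
    connected_at = None
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue  # e.g. audit lines or "connection by xcat_console"
        stamp = datetime.strptime(f"{year}/{match.group(1)}", "%Y/%m/%d %H:%M:%S")
        if match.group(2) == "connected":
            connected_at = stamp
        elif connected_at is not None:
            lengths.append(int((stamp - connected_at).total_seconds()))
            connected_at = None
    return lengths


# First four events from the log excerpt above.
sample = [
    "May 3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26 console connected]",
    "May 3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06 console disconnected]",
    "May 3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30 console connected]",
    "May 3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06 console disconnected]",
]
print(session_lengths(sample))
```

The session lengths in the excerpt do vary widely, but the disconnect timestamps themselves all fall within about a minute of two minutes past a ten-minute mark (13:02, 13:12, 13:22, ...), which may be worth checking against anything that runs on a ten-minute schedule.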


On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

Let me know if the firmware exploration works out. That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way. The ‘works with ipmitool’ though has me scratching my head.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The last idea doesn’t work for me. The patch itself works as intended, though: confluent does disconnect/connect after the constant interval. But for now it is 100% correct to say that it is a problem with the iDRAC firmware.
From the release notes for the latest firmware:
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it’s something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show results.
Thanks for your answers, help and time... it is a very interesting quest :)

Bit more about Confluent:
- Interesting ambitions
- Python VS Perl, thats good
- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent


On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Very interested in the outcome. And thank you for working through it. Also interested what you have liked, would like, and have disliked about confluent.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Thank you Jarrod, I’ll try applying the patch and let you know afterwards. I hope 90 minutes is enough, yes.


On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see the effect of a change like the following:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

If it would pan out, should cause the console session to disconnect itself roughly every 90 minutes and trigger reconnect (is 90 minutes short enough in your case?) Would require a service confluent restart to see if it had the desired effect.
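
The counter logic of the patch can be tried in isolation. This is a minimal sketch, not pyghmi code: the class `SolWatchdog` is hypothetical, and the 30-second keepalive interval is an assumption inferred only from the estimate that 180 checks take roughly 90 minutes.

```python
KEEPALIVE_INTERVAL_S = 30  # assumption: implied by "180 checks is roughly 90 minutes"
MAX_KEEPALIVES = 180       # threshold used in the patch


class SolWatchdog:
    """Hypothetical stand-in for the counter the patch adds to Console."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.keepalivecount = 0
        self.activated = True

    def on_keepalive(self, current_owner):
        """Run one keepalive check; return False once the session is dropped."""
        if current_owner != self.session_id or self.keepalivecount > MAX_KEEPALIVES:
            # Same branch as the patch: treat the SOL payload as deactivated,
            # which is what triggers the reconnect.
            self.activated = False
            return False
        self.keepalivecount += 1
        return True


watchdog = SolWatchdog(session_id=0x1234)
checks = 0
while watchdog.on_keepalive(current_owner=0x1234):
    checks += 1

# Number of successful checks and the corresponding time in minutes.
print(checks, checks * KEEPALIVE_INTERVAL_S / 60)
```

Because the comparison is `> 180` rather than `>= 180`, the session survives 181 checks before the forced disconnect, so under the 30-second assumption the reconnect lands slightly after the 90-minute mark.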

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

I would be curious whether the same ipmitool process is still running when checked a day later (e.g. if ipmitool is exiting and getting restarted). I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count : 7
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 115.2
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
Payload Port : 623
cloud53.ulan:/home/banuchka # echo 123 > /dev/console

and nothing happened

in the console’s log
---
[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]
---


On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you do have any in corrupted state, would be interested to see what happens if you do:
ipmitool sol set volatile-bit-rate 115.2 1


To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.
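
A quick way to spot this mismatch across many nodes is to parse saved `ipmitool sol info` output. The sketch below only processes text that has already been captured; the function name `sol_bit_rates` is an assumption for illustration, and the field labels are taken from the output quoted in this thread.

```python
import re

# Field labels as they appear in the "ipmitool sol info" output in this thread.
RATE_RE = re.compile(r"\s*(Non-Volatile|Volatile) Bit Rate \(kbps\)\s*:\s*([\d.]+)")


def sol_bit_rates(sol_info_text):
    """Extract the volatile and non-volatile bit rates (kbps) from the text."""
    rates = {}
    for line in sol_info_text.splitlines():
        match = RATE_RE.match(line)
        if match:
            rates[match.group(1)] = float(match.group(2))
    return rates


sample = """\
Retry Interval (ms) : 480
Volatile Bit Rate (kbps) : 38.4
Non-Volatile Bit Rate (kbps) : 115.2
Payload Channel : 1 (0x01)
"""
rates = sol_bit_rates(sample)
mismatch = rates["Volatile"] != rates["Non-Volatile"]
print(rates, mismatch)
```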

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

115200

idracadm7 get iDRAC.IPMISerial
[Key=iDRAC.Embedded.1#IPMISerial.1]
BaudRate=115200
ChanPrivLimit=4
ConnectionMode=Terminal
DeleteControl=Disabled
EchoControl=Enabled
FlowControl=RTS/CTS
HandshakeControl=Enabled
InputNewLineSeq=1
LineEdit=Enabled
NewLineSeq=CR-LF

that is strange, right


On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
Hmm, what’s the baud rate the console is actually running at? Odd to see the volatile and non volatile bit rates not be the same.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.




On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress : set-complete

Enabled : true

Force Encryption : true

Force Authentication : false

Privilege Level : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold : 255

Retry Count : 7

Retry Interval (ms) : 480

Volatile Bit Rate (kbps) : 38.4

Non-Volatile Bit Rate (kbps) : 115.2

Payload Channel : 1 (0x01)

Payload Port : 623



Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress : set-complete

Enabled : true

Force Encryption : true

Force Authentication : false

Privilege Level : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold : 255

Retry Count : 7

Retry Interval (ms) : 480

Volatile Bit Rate (kbps) : 38.4

Non-Volatile Bit Rate (kbps) : 115.2

Payload Channel : 1 (0x01)

Payload Port : 623


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Yes, reopening makes it work again, without any garbage, so it looks like a normal console :)
Hitting <enter> first produces garbage output (ᅵᅵ Porᅵlo) and only then a *normal console*...


On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Reopen console did the trick as well...


On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
‘ctrl-e, then c, then o’ to reconnect.

Was conserver ondemand or full logging?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The console starts showing garbage after <enter> inside rcons.
What do you mean by “restarting the console”?
The console resumes working after:
- <enter> inside rcons/confetty
- bmc reset (console disconnected/console connected)

You’re absolutely right: with ipmitool and conserver on the same servers we had no such trouble.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
So the console starts showing garbage? Restarting the console causes the garbage to go away?

You said that ipmitool with a certain configuration did not trigger this?

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

I’m out of ideas, let me show you all i see.

Inside rcons i see:

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

tcpdump(keepalive?):

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80




13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

Hit <enter> in rcons:
---
MONITORING_TEST dbb54 1492160401

ᅵᅵ
Porᅵ
—

tcpdump:
13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80
13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64
13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176
13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)
10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64
13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)
10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

and Magic, rcons:
---
Porï¿œlo]0;console: dbb54 [13:25]


dbb54 login:
---


On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com<mailto:***@lenovo.com>) wrote:
If you ctrl-e, c, o, does it restore the console after the time?

Can you tell that it goes after exactly 24hours on the dot?

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes,

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


... many of our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from the OS / # date -d @1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours. In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net<mailto:xcat-***@lists.sourceforge.net>
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

My last reply was incorrect. The problems are still here. I’m trying to find something useful between confluent and pyghmi...
A confluent restart clears the hangs and reopens all connections.
I don’t think restarting confluent once or twice every 24 h is the best option.
--
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
It is a Dell-related problem, not 100% but...

Confluent from current master is doing things well :)
Thanks for the pretty nice tool “confluentdbutil”.


On 13 April 2017 at 11:30:14, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
It looks like this problem has come up before... The fix was to use ipmitool with keepalive (the one from the xcat repos).
Here pyghmi is used; maybe that is the reason?


On 13 April 2017 at 08:22:28, banuchka (***@gmail.com<mailto:***@gmail.com>) wrote:
Hi,

I’m trying to migrate completely from conserver to confluent, but I’ve run into strange behaviour.
Some of my consoles hang after ~24 h: no new messages appear in their logs or in rcons.
I send timestamped messages from the OS to /dev/console every 30-60 min and check for them for monitoring purposes (console availability monitoring).
I can open rcons and hit enter, and after a few seconds the console wakes up (strange). I didn’t see this happen with conserver, or maybe I’m wrong...
Some details:
- as far as I can see, most of the consoles that hang are Dell iDRAC. It doesn’t matter whether RacSerial or IPMISerial is in use.
- racreset hard/ipmitool bmc reset didn’t help
- hitting enter in the console wakes it up (for example, with expect I can send \r\n\f, but that looks bad)
- I didn’t try cleaning confluent’s conf and restarting it. Not sure it would help.
- HP consoles work well, same ipmi
- a few consoles with a custom plugin work well too

So maybe my question is not about confluent, but if any of you know about similar problems, please share! ;)
--
banuchka
banuchka
2017-05-03 19:38:53 UTC
Permalink
Looks and sounds strange, i know )

from conserver’s log:

May  3 19:02:12 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 19:02:10 console disconnected]
May  3 19:06:30 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 19:06:28 console connected]
May  3 19:12:18 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 19:12:16 console disconnected]
May  3 19:13:20 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 19:13:18 console connected]

from iLO:

123047 Informational iLO 4 05/03/2017 19:12 05/03/2017 19:12 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).
123046 Informational iLO 4 05/03/2017 19:12 05/03/2017 19:12 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123045 Informational iLO 4 05/03/2017 19:06 05/03/2017 19:06 1 IPMI/RMCP login by root - 10.10.114.30(xcat-sn1.mlan).
123044 Informational iLO 4 05/03/2017 19:06 05/03/2017 19:06 1 IPMI/RMCP logout: root - 10.10.114.30(xcat-sn1.mlan).

So from iLO's point of view it is a login and then a logout... completely strange, and it is not the truth, because now the console is working :)

On 3 May 2017 at 20:20:41, Jarrod Johnson (***@lenovo.com) wrote:

Oh missed this one... hmm

 
Here it looks like it gets logged out and immediately logs back in?
-- 
banuchka

banuchka
2017-05-03 19:01:41 UTC
Permalink
Tomorrow I'll try to add one (or 2, 3, 4) more instances of Confluent and move part of the servers there until I see the same behaviour on the new instance(s).

On 3 May 2017 at 19:19:26, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, and there isn’t anything like conserver or another confluent trying to run at the same time to the same node?

 

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, May 03, 2017 2:10 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Hi,

 

one more strange thing about confluent:

 

May  3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26 console connected]

May  3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06 console disconnected]

May  3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30 console connected]

May  3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06 console disconnected]

May  3 13:21:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:21:06 console connected]

May  3 13:22:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:22:03 console disconnected]

May  3 13:26:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:26:00 console connected]

May  3 13:32:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:32:03 console disconnected]

May  3 13:33:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:33:15 console connected]

May  3 14:22:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:22:00 console disconnected]

May  3 14:23:11 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:23:09 console connected]

May  3 14:32:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:32:00 console disconnected]

May  3 14:39:44 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:39:42 console connected]

May  3 14:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:05 console disconnected]

May  3 14:52:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:15 console connected]

May  3 15:02:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:02:13 console disconnected]

May  3 15:06:40 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:06:38 console connected]

May  3 15:12:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:12:15 console disconnected]

May  3 15:15:30 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:15:28 console connected]

May  3 15:22:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:22:15 console disconnected]

May  3 15:30:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:30:26 console connected]

May  3 15:32:21 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:32:19 console disconnected]

May  3 15:36:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:36:40 console connected]

May  3 15:41:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:41:57 console disconnected]

May  3 15:45:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:45:15 console connected]

May  3 15:51:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:51:57 console disconnected]

May  3 15:57:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:57:03 console connected]

May  3 17:22:12 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:22:10 console disconnected]

May  3 17:26:38 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:26:36 console connected]

May  3 17:32:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:32:13 console disconnected]

May  3 17:41:26 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:24 console connected]

May  3 17:42:01 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:59 console disconnected]

May  3 17:49:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:49:30 console connected]

May  3 17:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:05 console disconnected]

May  3 17:52:42 xcat-sn1.mlan confluent[4102]: audit :May 03 17:52:40 {"operation": "start", "allowed": true, "target": "/nodes/unreg25/console/session", "user": "xcat_console"}

May  3 17:52:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:40 connection by xcat_console]

May  3 17:52:45 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:43 console disconnected]

May  3 17:56:09 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:56:07 console connected]

 

it isn’t Dell BMC
  

 

I think I've written about this behaviour here before. Anyway, the times here look so random that it doesn't look like a timeout issue somewhere.
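
One way to check whether the disconnects are really random is to extract the event timestamps and look at how long each connection lasted. The following is a minimal sketch, assuming the syslog line format shown above; the function names and the hard-coded year are illustrative assumptions and not part of confluent.

```python
import re
from datetime import datetime

# Matches the bracketed confluent event, e.g. "[05/03 13:02:06 console disconnected]"
EVENT_RE = re.compile(
    r"\[(\d\d/\d\d \d\d:\d\d:\d\d) console (connected|disconnected)\]")


def parse_events(lines, year=2017):
    """Return a list of (timestamp, 'connected' | 'disconnected') tuples."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            stamp = datetime.strptime(
                "%d/%s" % (year, match.group(1)), "%Y/%m/%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def connected_spans(events):
    """Seconds between each 'connected' and the following 'disconnected'."""
    spans = []
    start = None
    for stamp, kind in events:
        if kind == "connected":
            start = stamp
        elif start is not None:
            spans.append((stamp - start).total_seconds())
            start = None
    return spans


SAMPLE = [
    "May  3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26 console connected]",
    "May  3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06 console disconnected]",
    "May  3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30 console connected]",
    "May  3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06 console disconnected]",
]
print(connected_spans(parse_events(SAMPLE)))
```

Feeding the whole log through parse_events and printing the minute of each disconnect would also show whether the drops cluster around particular times.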

 

I need some advice before rolling back :) Thanks

 

On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:

Yeah, there will be a big push in the coming weeks; it will have at least an 'events' log along with a lot more function.

 

Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).

 

Let me know if the firmware exploration works out.  That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way.  The ‘works with ipmitool’ though has me scratching my head.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

The last idea doesn't work for me. By the way, the idea as such works great: confluent does disconnect/connect after a constant time. But for now it is 100% correct to say that it is a problem with the iDRAC firmware.

from release notes for last fw:

===

- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.

===

I'm not sure, but maybe it's something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show results.

Thanks for your answers, help and time... it is a very interesting quest :)

 

A bit more about Confluent:

- Interesting ambitions

- Python vs Perl, that's good

- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent

 

On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:

Very interested in the outcome.  And thank you for working through it.  Also interested what you have liked, would like, and have disliked about confluent.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Thank you Jarrod, I'll try to apply the patch and let you know afterwards. I hope 90 minutes is enough, yes.

 

On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

 

I'm loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:

diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

 

If it pans out, it should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?). It would require a 'service confluent restart' to see whether it has the desired effect.
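
The "roughly every 90 minutes" figure follows from the counter limit of 180 in the patch. The following is a minimal standalone sketch of that counting logic, not pyghmi code: the class and function names are hypothetical, and the 30-second interval between keepalive checks is an assumption inferred from "180 checks is about 90 minutes", not a value confirmed in this thread.

```python
class ReconnectCounter(object):
    """Counts successful keepalive checks; asks for a reconnect past a limit."""

    def __init__(self, limit=180):
        self.limit = limit
        self.count = 0

    def reset(self):
        # Mirrors resetting the counter when the SOL payload is (re)activated.
        self.count = 0

    def check(self, owner_ok=True):
        """Return True if the session should be dropped and reconnected."""
        if not owner_ok or self.count > self.limit:
            return True
        self.count += 1
        return False


def checks_until_trip(limit=180):
    """Number of check() calls up to and including the one that trips."""
    counter = ReconnectCounter(limit)
    calls = 1
    while not counter.check():
        calls += 1
    return calls


def estimated_minutes(calls, interval_seconds=30):
    # interval_seconds is an assumed keepalive period, see the note above.
    return calls * interval_seconds / 60.0


print(checks_until_trip(), estimated_minutes(checks_until_trip()))
```

With a different keepalive period the forced reconnect interval scales linearly, so the limit would need adjusting to keep it near 90 minutes.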

 

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

 

I would be curious whether the same ipmitool process is still running a day later when you check (e.g. whether ipmitool is exiting and getting restarted).  I don't have the time at the moment to see if they do some other interesting thing to avoid the behavior.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

cloud53.ulan:/home/banuchka # ipmitool sol info 1

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

cloud53.ulan:/home/banuchka # ipmitool sol set volatile-bit-rate 115.2 1

cloud53.ulan:/home/banuchka # ipmitool sol info 1

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 115.2

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

cloud53.ulan:/home/banuchka # echo 123 > /dev/console

 

and nothing happened

 

in the console’s log

—

[04/14 12:49:12 console disconnected]
[04/14 12:49:29 console connected]
[04/14 13:01:02 console disconnected]
[04/14 13:01:02 console connected]
[04/14 13:03:54 console disconnected]
[04/14 13:04:15 console connected]
[04/14 13:38:37 console connected]
[04/14 15:31:47 console disconnected]
[04/14 15:36:24 console connected]
[04/14 15:42:08 connection by xcat_console]

---

 

On 14 April 2017 at 16:39:35, Jarrod Johnson (***@lenovo.com) wrote:

If you do have any in corrupted state, would be interested to see what happens if you do:

ipmitool sol set volatile-bit-rate 115.2 1

 

 

To change the volatile bit rate to match the non-volatile bit rate and see if the corruption goes away.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:36 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

115200

 

idracadm7 get iDRAC.IPMISerial

[Key=iDRAC.Embedded.1#IPMISerial.1]

BaudRate=115200

ChanPrivLimit=4

ConnectionMode=Terminal

DeleteControl=Disabled

EchoControl=Enabled

FlowControl=RTS/CTS

HandshakeControl=Enabled

InputNewLineSeq=1

LineEdit=Enabled

NewLineSeq=CR-LF

 

that is strange, right

 

On 14 April 2017 at 16:31:27, Jarrod Johnson (***@lenovo.com) wrote:

Hmm, what’s the baud rate the console is actually running at?  Odd to see the volatile and non volatile bit rates not be the same.

 

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:28 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

 

 

On 14 April 2017 at 16:15:16, Jarrod Johnson (***@lenovo.com) wrote:

And to be clear, the corruption only starts after a long period of time of being continuously connected?

Yes, that is correct


 

I might be interested in seeing ipmitool sol info 1 output against a system while it is working versus showing corrupted info.

corrupted:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623

 

Working:

# ipmitool -I lanplus -H cloud2manage -U root -a sol info 1

Password:

Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01

Set in progress                 : set-complete

Enabled                         : true

Force Encryption                : true

Force Authentication            : false

Privilege Level                 : ADMINISTRATOR

Character Accumulate Level (ms) : 50

Character Send Threshold        : 255

Retry Count                     : 7

Retry Interval (ms)             : 480

Volatile Bit Rate (kbps)        : 38.4

Non-Volatile Bit Rate (kbps)    : 115.2

Payload Channel                 : 1 (0x01)

Payload Port                    : 623
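
Comparing two such captures by eye is error-prone, so a small parser can help. The following is a minimal sketch, assuming the "key : value" layout of the ipmitool sol info output shown above; the function names are illustrative assumptions and not part of ipmitool or pyghmi.

```python
import re

# "Key   : value" lines; the "Info: ..." notice has no space before the colon
# and is therefore skipped.
FIELD_RE = re.compile(r"^(.*\S)\s+:\s+(.*)$")


def parse_sol_info(text):
    """Turn 'ipmitool sol info' output into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def diff_fields(first, second):
    """Return the sorted field names whose values differ between two captures."""
    return sorted(k for k in set(first) | set(second)
                  if first.get(k) != second.get(k))


def bit_rates_match(fields):
    return (fields["Volatile Bit Rate (kbps)"]
            == fields["Non-Volatile Bit Rate (kbps)"])


CAPTURE = """\
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress                 : set-complete
Volatile Bit Rate (kbps)        : 38.4
Non-Volatile Bit Rate (kbps)    : 115.2
Payload Port                    : 623
"""
fields = parse_sol_info(CAPTURE)
print(bit_rates_match(fields), diff_fields(fields, fields))
```

For the two captures above, diff_fields would report no differences, since the corrupted and working outputs are identical; the only oddity is the mismatch between the volatile and non-volatile bit rates.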


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:09 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Yes, reopen causes it to work again,  without any garbage
 so looks like normal console :)

Hit <enter> causes at first garbage output(ᅵᅵ Porᅵlo) and *normal console* before...

 

On 14 April 2017 at 16:02:09, Jarrod Johnson (***@lenovo.com) wrote:

So reopen causes it to work again, and before, it’s not *hung*, but erratic with garbage characters and occasional blips of sanity?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 11:00 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Reopen console did the trick as well...

 

On 14 April 2017 at 15:54:03, Jarrod Johnson (***@lenovo.com) wrote:

‘ctrl-e, then c, then o’ to reconnect.

 

Was conserver ondemand or full logging?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 10:52 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

Console starts showing garbage after <enter> inside rcons.

What do you mean when said “restarting console”?

Console continue its work after:

- <enter> inside rcons/confetty

- bmc reset (console disconnected/console connected)

 

You're absolutely right: with ipmitool and conserver on the same servers we had no such troubles.

On 14 April 2017 at 15:47:14, Jarrod Johnson (***@lenovo.com) wrote:

So the console starts showing garbage?  Restarting the console causes the garbage to go away?

 

You said that ipmitool with a certain configuration did not trigger this?

 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 9:29 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

I’m out of ideas, let me show you all i see.

 

Inside rcons i see:

 

MONITORING_TEST dbb54 1492160401 <= last message i’ve sent from OS (more complex log below)

 

tcpdump(keepalive?):

 

13:23:42.342886 IP (tos 0x0, ttl 64, id 16448, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:23:42.345504 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 




 

13:24:09.422491 IP (tos 0x0, ttl 64, id 17060, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:09.425045 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

Hit <enter> in rcons:

---

MONITORING_TEST dbb54 1492160401

 

ᅵᅵ

  Porᅵ

—

 

tcpdump:

13:24:35.727671 IP (tos 0x0, ttl 64, id 19582, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:35.731533 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.390367 IP (tos 0x0, ttl 64, id 20347, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:24:47.392799 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:24:47.408312 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:24:47.409797 IP (tos 0x0, ttl 64, id 20349, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.127774 IP (tos 0x0, ttl 64, id 21818, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:03.131561 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:27.269696 IP (tos 0x0, ttl 64, id 26284, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:27.272204 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

13:25:47.410313 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 64

13:25:47.413754 IP (tos 0x0, ttl 64, id 28210, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:48.709947 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 204)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 176

13:25:48.712033 IP (tos 0x0, ttl 64, id 28355, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.564080 IP (tos 0x0, ttl 64, id 29103, offset 0, flags [DF], proto UDP (17), length 92)

    10.10.114.30.36790 > 10.10.106.155.623: [udp sum ok] UDP, length 64

13:25:52.566810 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 108)

    10.10.106.155.623 > 10.10.114.30.36790: [udp sum ok] UDP, length 80

 

and Magic, rcons:

---

  Porᅵlo]0;console: dbb54 [13:25]

 

 

dbb54 login:

---

 

On 14 April 2017 at 12:42:03, Jarrod Johnson (***@lenovo.com) wrote:

If you ctrl-e, c, o, does it restore the console after the time?

 

Can you tell that it goes after exactly 24hours on the dot?

 

When console hung, does ‘ipmitool sol activate’ say ‘session already active’?

Yes, 

# ipmitool -I lanplus -H 10.10.106.155 -U root -a sol activate

Password:

Info: SOL payload already active on another session


 

Does /var/log/confluent/consoles/<nodename> have any interesting events crop up?

[04/13 15:17:21 console connected]


 many our own messages

^MMONITORING_TEST dbb54 1492160401 | <== This is the last message from OS/ # date -***@1492160401 (Fri Apr 14 09:00:01 UTC 2017)

^M

[04/14 09:05:13 console connected]

[04/14 09:11:59 console connected]

[04/14 09:13:38 console disconnected]

[04/14 09:14:54 console connected]

[04/14 10:15:13 connection by xcat_console]

[04/14 10:15:14 disconnection by xcat_console]

[04/14 13:14:30 connection by xcat_console]


 

Pyghmi will do keepalive as well, and if that’s the problem, it should be much shorter than 24 hours.  In fact, it should be checking if the SOL payload is active and owned by confluent specifically every couple of minutes.

yes, thats correct


 

From: banuchka [mailto:***@gmail.com] 
Sent: Friday, April 14, 2017 5:55 AM
To: xcat-***@lists.sourceforge.net
Subject: Re: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

 

My last reply was incorrect. The problems are still here. I'm trying to find something useful between confluent and pyghmi...

Restarting confluent clears the hangs and reopens all connections.

I don't think restarting confluent once or twice every 24h is the best option.

-- 
banuchka

On 13 April 2017 at 17:03:19, banuchka (***@gmail.com) wrote:

It is a Dell-related problem, not 100% sure, but...

Confluent from current master is doing things well :)

Thanks for the pretty nice tool "confluentdbutil".

 

On 13 April 2017 at 11:30:14, banuchka (***@gmail.com) wrote:

It looks like this problem has been seen before... The fix was to use ipmitool with keepalive (the one from the xcat repos).

pyghmi is used here; maybe that is the reason?

 

On 13 April 2017 at 08:22:28, banuchka (***@gmail.com) wrote:

Hi,

 

I'm trying to migrate completely from conserver to confluent, but I've run into strange behaviour.

Some of my consoles hang after ~24h, so no new messages appear in their logs or in rcons.

I send timestamped messages from the OS to /dev/console every 30-60 min and watch for them for monitoring purposes (console availability monitoring).

I can open rcons and hit enter, and after a few seconds the console wakes up (strange). I didn't see this happen with conserver, or maybe I'm wrong...

Some details:

- as far as I can see, most of the consoles that hang are Dell iDRAC. It doesn't matter whether RacSerial or IPMISerial is in use.
- racreset hard / ipmitool bmc reset didn't help
- hitting enter in the console wakes it up (for example, with expect I can send \r\n\f, but it looks bad)
- I didn't try to clean confluent's conf and restart it. Not sure it would help.
- HP consoles work well, with the same IPMI
- a few consoles with a custom plugin work well too

So maybe my question is not about confluent, but if any of you know about similar problems, please share! ;)

 

--
banuchka

banuchka
2017-05-03 19:13:58 UTC
Permalink
Jarrod, what do you think is the maximum stable number of servers (with full IPMI
logging) for one Confluent instance?
--
banuchka
Post by banuchka
Tomorrow I'll try to add one (or 2, 3, 4) more instances of Confluent and move
part of the servers there until I see the same behaviour on the new instance(s).
Hmm, and there isn’t anything like conserver or another confluent trying
to run at the same time to the same node?
*Sent:* Wednesday, May 03, 2017 2:10 PM
*To:* xCAT Users Mailing list; Jarrod Johnson
*Subject:* RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
Hi,
May 3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26
console connected]
May 3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06
console disconnected]
May 3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30
console connected]
May 3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06
console disconnected]
May 3 13:21:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:21:06
console connected]
May 3 13:22:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:22:03
console disconnected]
May 3 13:26:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:26:00
console connected]
May 3 13:32:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:32:03
console disconnected]
May 3 13:33:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:33:15
console connected]
May 3 14:22:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:22:00
console disconnected]
May 3 14:23:11 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:23:09
console connected]
May 3 14:32:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:32:00
console disconnected]
May 3 14:39:44 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:39:42
console connected]
May 3 14:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:05
console disconnected]
May 3 14:52:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:15
console connected]
May 3 15:02:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:02:13
console disconnected]
May 3 15:06:40 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:06:38
console connected]
May 3 15:12:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:12:15
console disconnected]
May 3 15:15:30 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:15:28
console connected]
May 3 15:22:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:22:15
console disconnected]
May 3 15:30:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:30:26
console connected]
May 3 15:32:21 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:32:19
console disconnected]
May 3 15:36:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:36:40
console connected]
May 3 15:41:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:41:57
console disconnected]
May 3 15:45:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:45:15
console connected]
May 3 15:51:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:51:57
console disconnected]
May 3 15:57:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:57:03
console connected]
May 3 17:22:12 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:22:10
console disconnected]
May 3 17:26:38 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:26:36
console connected]
May 3 17:32:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:32:13
console disconnected]
May 3 17:41:26 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:24
console connected]
May 3 17:42:01 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:59
console disconnected]
May 3 17:49:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:49:30
console connected]
May 3 17:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:05
console disconnected]
May 3 17:52:42 xcat-sn1.mlan confluent[4102]: audit :May 03 17:52:40
{"operation": "start", "allowed": true, "target": "/nodes/unreg25/console/session", "user": "xcat_console"}
May 3 17:52:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:40
connection by xcat_console]
May 3 17:52:45 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:43
console disconnected]
May 3 17:56:09 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:56:07
console connected]
it isn’t Dell BMC

I think I've written about this behaviour here before. Anyway, the times here
look so random that it doesn't look like a timeout issue somewhere.
I need some advice before rolling back :) Thanks
Yeah, there will be a big push in the coming weeks; it will have at least
an 'events' log along with a lot more function.
Then some more fleshed out documentation (beyond the preliminary stuff on hpc.lenovo.com).
Let me know if the firmware exploration works out. That particular change
line suggests firmware upgrades, but it is possible they could have some
high BMC cpu usage that could manifest in such a way. The ‘works with
ipmitool’ though has me scratching my head.
*Sent:* Friday, April 14, 2017 2:54 PM
*To:* xCAT Users Mailing list; Jarrod Johnson
*Subject:* RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
The last idea doesn't work for me. By the way, the idea as such works great:
confluent does disconnect/connect after a constant time. But for now it is
100% correct to say that it is a problem with the iDRAC firmware.
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I'm not sure, but maybe it's something like what I have here. So I did the upgrade
on a few hosts and will give them plenty of time to show results.
Thanks for your answers, help and time... it is a very interesting quest :)
- Interesting ambitions
- Python vs Perl, that's good
- I think log files (not just trace, stderr, stdout) and
documentation (the source on GitHub is the best doc I know, but...) are things
that I would like to see in Confluent
Very interested in the outcome. And thank you for working through it.
Also interested what you have liked, would like, and have disliked about
confluent.
*Sent:* Friday, April 14, 2017 12:01 PM
*To:* xCAT Users Mailing list; Jarrod Johnson
*Subject:* RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
Thank you Jarrod, i’ll try to add patch and let you know after. Hope 90
minutes is enough, yes.
Hmm, this is going to be very difficult to root cause (I only have Lenovo
equipment as one might expect).
I’m loathe to do a workaround, but in console.py (find /usr –name
*diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py*
*index 95e8551..a5f6062 100644*
*--- a/pyghmi/ipmi/console.py*
*+++ b/pyghmi/ipmi/console.py*
def __init__(self, bmc, userid, password,
iohandler, port=623,
+ self.keepalivecount = 0
self.keepaliveid = None
self.connected = False
self.broken = False
self._print_error(response['error'])
return
+ self.keepalivecount = 0
#Send activate sol payload directive
#netfn= 6 (application)
#command = 0x48 (activate payload)
return
currowner = struct.unpack(
"<I", struct.pack('4B', *response['data'][:4]))
+ if currowner[0] != self.ipmi_session.sessionid or
# the session is deactivated or active for something else
self.activated = False
self._print_error('SOL deactivated')
return
+ self.keepalivecount += 1
# ok, still here, that means session is alive, but another
# common issue is firmware messing with mux on reboot
# this would be a nice thing to check, but the serial channel
If it would pan out, should cause the console session to disconnect itself
roughly every 90 minutes and trigger reconnect (is 90 minutes short enough
in your case?) Would require a service confluent restart to see if it had
the desired effect.
Sorry I haven’t tested and can’t think of root cause, but going to take
some time off for the weekend.
I would be curious if the same ipmitool is running a day later than a
check (e.g. if ipmitool is exiting and getting restarted). I don’t have
the time at the moment to see if they do some other interesting thing to
avoid the behavior.
*Sent:* Friday, April 14, 2017 11:45 AM
*To:* xCAT Users Mailing list; Jarrod Johnson
*Subject:* RE: [xcat-user] Confluent as console server. Consoles hangs
~after 24h.
cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count
Jarrod Johnson
2017-05-03 19:15:33 UTC
Permalink
Frankly, I haven’t had the opportunity to test as high as I would like. So far I can only first hand vouch for 500 with console logging enabled, which is about where we felt comfortable with conserver in general.

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, May 03, 2017 3:14 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Jarrod, what do you think is the maximum stable number of servers (with full IPMI logging) for one Confluent instance?
--
banuchka

On 3 May 2017 at 20:01:41, banuchka (***@gmail.com) wrote:
Tomorrow I’ll try to add one (or 2, 3, 4) more instances of Confluent and move some of the servers there until I see the same behaviour on the new instance(s).


On 3 May 2017 at 19:19:26, Jarrod Johnson (***@lenovo.com) wrote:
Hmm, and there isn’t anything like conserver or another confluent trying to run at the same time to the same node?

From: banuchka [mailto:***@gmail.com]
Sent: Wednesday, May 03, 2017 2:10 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Hi,

one more strange thing about confluent:

May 3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26 console connected]
May 3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06 console disconnected]
May 3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30 console connected]
May 3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06 console disconnected]
May 3 13:21:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:21:06 console connected]
May 3 13:22:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:22:03 console disconnected]
May 3 13:26:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:26:00 console connected]
May 3 13:32:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:32:03 console disconnected]
May 3 13:33:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:33:15 console connected]
May 3 14:22:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:22:00 console disconnected]
May 3 14:23:11 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:23:09 console connected]
May 3 14:32:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:32:00 console disconnected]
May 3 14:39:44 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:39:42 console connected]
May 3 14:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:05 console disconnected]
May 3 14:52:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:52:15 console connected]
May 3 15:02:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:02:13 console disconnected]
May 3 15:06:40 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:06:38 console connected]
May 3 15:12:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:12:15 console disconnected]
May 3 15:15:30 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:15:28 console connected]
May 3 15:22:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:22:15 console disconnected]
May 3 15:30:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:30:26 console connected]
May 3 15:32:21 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:32:19 console disconnected]
May 3 15:36:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:36:40 console connected]
May 3 15:41:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:41:57 console disconnected]
May 3 15:45:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:45:15 console connected]
May 3 15:51:59 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:51:57 console disconnected]
May 3 15:57:05 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 15:57:03 console connected]
May 3 17:22:12 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:22:10 console disconnected]
May 3 17:26:38 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:26:36 console connected]
May 3 17:32:15 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:32:13 console disconnected]
May 3 17:41:26 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:24 console connected]
May 3 17:42:01 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:41:59 console disconnected]
May 3 17:49:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:49:30 console connected]
May 3 17:52:07 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:05 console disconnected]
May 3 17:52:42 xcat-sn1.mlan confluent[4102]: audit :May 03 17:52:40 {"operation": "start", "allowed": true, "target": "/nodes/unreg25/console/session", "user": "xcat_console"}
May 3 17:52:42 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:40 connection by xcat_console]
May 3 17:52:45 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:52:43 console disconnected]
May 3 17:56:09 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 17:56:07 console connected]

It isn’t a Dell BMC.

I think I’ve written about this behaviour here before. Anyway, the times are so random that it doesn’t look like a timeout issue somewhere.

I need some advice before rolling back :) Thanks
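
One way to check whether the disconnects follow a fixed interval is to measure how long each console session stayed connected. The following is only a minimal sketch, assuming the log format shown above; the function name `session_lengths`, the hard-coded year, and the three sample pairs (copied from the start of the log) are illustrative choices, not part of confluent.

```python
import re
from datetime import datetime

# Matches the bracketed confluent event, e.g. "[05/03 13:02:06 console disconnected]".
EVENT_RE = re.compile(
    r"\[(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) console (connected|disconnected)\]"
)


def session_lengths(lines, year=2017):
    """Return (connect_time, seconds_connected) for each connect/disconnect pair.

    The log carries no year, so one is assumed. Lines that are not
    connect/disconnect events (audit entries, "connection by ...") are skipped,
    as is a disconnect with no preceding connect.
    """
    sessions = []
    connected_at = None
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(f"{year}/{match.group(1)}", "%Y/%m/%d %H:%M:%S")
        if match.group(2) == "connected":
            connected_at = stamp
        elif connected_at is not None:
            seconds = int((stamp - connected_at).total_seconds())
            sessions.append((connected_at, seconds))
            connected_at = None
    return sessions


SAMPLE = [
    "May 3 12:57:28 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 12:57:26 console connected]",
    "May 3 13:02:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:02:06 console disconnected]",
    "May 3 13:10:32 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:10:30 console connected]",
    "May 3 13:12:08 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:12:06 console disconnected]",
    "May 3 13:33:17 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 13:33:15 console connected]",
    "May 3 14:22:02 xcat-sn1.mlan confluent[4102]: unreg25 :[05/03 14:22:00 console disconnected]",
]

for started, seconds in session_lengths(SAMPLE):
    print(started.strftime("%H:%M:%S"), seconds)
# 12:57:26 280
# 13:10:30 96
# 13:33:15 2925
```

Feeding the whole log through the same function gives the full list of session lengths, which is easier to compare against a suspected timeout than reading the timestamps by eye.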


On 14 April 2017 at 20:59:04, Jarrod Johnson (***@lenovo.com) wrote:
Yeah, there will be a big push in the coming weeks; it will have at least an ‘events’ log along with a lot more function.

Then some more fleshed-out documentation (beyond the preliminary stuff on hpc.lenovo.com).

Let me know if the firmware exploration works out. That particular change line suggests firmware upgrades, but it is possible they could have some high BMC cpu usage that could manifest in such a way. The ‘works with ipmitool’ though has me scratching my head.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 2:54 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

The last idea doesn’t work for me. The patch itself behaves as intended: confluent does disconnect and reconnect at a constant interval. But for now it is 100% correct to say that it is a problem with the iDRAC firmware.
From the release notes for the latest firmware:
===
- Fix for occasional iDRAC unresponsiveness caused by upgrades via Firmware RACADM or
have an active SOL or SSH sessions while firmware upgrade is in progress.
===
I’m not sure, but maybe it’s something like what I have here. So I did the upgrade on a few hosts and will give them plenty of time to show results.
Thanks for your answers, help and time. It is a very interesting quest :)

A bit more about Confluent:
- Interesting ambitions
- Python vs. Perl, that’s good
- I think log files (not just trace, stderr, stdout) and documentation (the source on GitHub is the best doc I know, but...) are things that I would like to see in Confluent


On 14 April 2017 at 19:27:20, Jarrod Johnson (***@lenovo.com) wrote:
Very interested in the outcome. And thank you for working through it. Also interested what you have liked, would like, and have disliked about confluent.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 12:01 PM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

Thank you Jarrod, I’ll try to apply the patch and let you know afterwards. I hope 90 minutes is short enough, yes.


On 14 April 2017 at 16:57:24, Jarrod Johnson (***@lenovo.com) wrote:
Hmm, this is going to be very difficult to root cause (I only have Lenovo equipment as one might expect).

I’m loath to do a workaround, but in console.py (find /usr -name console.py), it might be interesting to see how a change like the following behaves:
diff --git a/pyghmi/ipmi/console.py b/pyghmi/ipmi/console.py
index 95e8551..a5f6062 100644
--- a/pyghmi/ipmi/console.py
+++ b/pyghmi/ipmi/console.py
@@ -42,6 +42,7 @@ class Console(object):
     def __init__(self, bmc, userid, password,
                  iohandler, port=623,
                  force=False, kg=None):
+        self.keepalivecount = 0
         self.keepaliveid = None
         self.connected = False
         self.broken = False
@@ -70,6 +71,7 @@ class Console(object):
         if 'error' in response:
             self._print_error(response['error'])
             return
+        self.keepalivecount = 0
         #Send activate sol payload directive
         #netfn= 6 (application)
         #command = 0x48 (activate payload)
@@ -150,11 +152,12 @@ class Console(object):
             return
         currowner = struct.unpack(
             "<I", struct.pack('4B', *response['data'][:4]))
-        if currowner[0] != self.ipmi_session.sessionid:
+        if currowner[0] != self.ipmi_session.sessionid or self.keepalivecount > 180:
             # the session is deactivated or active for something else
             self.activated = False
             self._print_error('SOL deactivated')
             return
+        self.keepalivecount += 1
         # ok, still here, that means session is alive, but another
         # common issue is firmware messing with mux on reboot
         # this would be a nice thing to check, but the serial channel

If it pans out, it should cause the console session to disconnect itself roughly every 90 minutes and trigger a reconnect (is 90 minutes short enough in your case?). It would require a service confluent restart to see if it had the desired effect.
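
The arithmetic behind "roughly every 90 minutes" can be sketched in isolation. This is a standalone illustration of the counter logic in the patch, not pyghmi code: the class `KeepaliveCounter` and the helper are hypothetical, and the 30-second keepalive interval is an assumption inferred from the threshold of 180 and the quoted 90 minutes, since the thread does not state the interval.

```python
KEEPALIVE_LIMIT = 180  # threshold used in the patch above


class KeepaliveCounter:
    """Hypothetical stand-in for the counter the patch adds to Console."""

    def __init__(self, limit=KEEPALIVE_LIMIT):
        self.limit = limit
        self.count = 0

    def tick(self):
        """Process one keepalive check.

        Mirrors the patched condition: the check `count > limit` runs before
        the increment. Returns True when the session would be marked
        deactivated so that a reconnect is triggered.
        """
        if self.count > self.limit:
            self.count = 0  # the patch resets the counter on (re)activation
            return True
        self.count += 1
        return False


def minutes_until_forced_reconnect(interval_seconds, limit=KEEPALIVE_LIMIT):
    """Simulate keepalive checks until the counter forces a disconnect."""
    counter = KeepaliveCounter(limit)
    ticks = 0
    while True:
        ticks += 1
        if counter.tick():
            return ticks * interval_seconds / 60.0


# Assuming one keepalive check every 30 seconds (not stated in the thread):
print(minutes_until_forced_reconnect(30))  # 91.0
```

Because the comparison is `> 180` and happens before the increment, the forced disconnect lands on the 182nd check, which is why the result is 91 rather than exactly 90 minutes under the assumed interval.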

Sorry I haven’t tested and can’t think of root cause, but going to take some time off for the weekend.

I would be curious whether the same ipmitool process is still running a day after a check (e.g. whether ipmitool is exiting and getting restarted). I don’t have the time at the moment to see if they do some other interesting thing to avoid the behavior.

From: banuchka [mailto:***@gmail.com]
Sent: Friday, April 14, 2017 11:45 AM
To: xCAT Users Mailing list; Jarrod Johnson
Subject: RE: [xcat-user] Confluent as console server. Consoles hangs ~after 24h.

cloud53.ulan:/home/banuchka # ipmitool sol info 1
Info: SOL parameter 'Payload Channel (7)' not supported - defaulting to 0x01
Set in progress : set-complete
Enabled : true
Force Encryption : true
Force Authentication : false
Privilege Level : ADMINISTRATOR
Character Accumulate Level (ms) : 50
Character Send Threshold : 255
Retry Count
banuchka
2017-05-03 19:34:19 UTC
Permalink
OK, I’ll try to split the nodes across instances, per vendor for example, and let you know about any results. For now I’m using it for 1k+ servers :)

On 3 May 2017 at 20:22:13, Jarrod Johnson (***@lenovo.com) wrote:

Frankly, I haven’t had the opportunity to test as high as I would like.  So far I can only first hand vouch for 500 with console logging enabled, which is about where we felt comfortable with conserver in general.
-- 
banuchka
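
The per-vendor split discussed above could be planned along these lines. This is a rough sketch only: the inventory, node names, and the `plan_instances` helper are all hypothetical, and the limit of 500 is simply the figure vouched for earlier in the thread, not a documented Confluent limit.

```python
from collections import defaultdict

MAX_NODES_PER_INSTANCE = 500  # figure quoted in the thread, not a hard limit


def plan_instances(node_vendors, limit=MAX_NODES_PER_INSTANCE):
    """Group nodes by vendor, then split each group into chunks of at most `limit`.

    `node_vendors` maps node name -> vendor string. Returns a list of
    (vendor, [nodes]) tuples, one per planned Confluent instance.
    """
    by_vendor = defaultdict(list)
    for node, vendor in sorted(node_vendors.items()):
        by_vendor[vendor].append(node)

    plan = []
    for vendor in sorted(by_vendor):
        nodes = by_vendor[vendor]
        for start in range(0, len(nodes), limit):
            plan.append((vendor, nodes[start:start + limit]))
    return plan


# Hypothetical inventory of 1100 nodes from two vendors.
inventory = {f"dell{i:04d}": "dell" for i in range(700)}
inventory.update({f"hp{i:04d}": "hp" for i in range(400)})

for vendor, nodes in plan_instances(inventory):
    print(vendor, len(nodes))
# dell 500
# dell 200
# hp 400
```

Keeping each instance single-vendor also makes it easier to tell whether the disconnects follow the BMC type or the instance load.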