
FS#1845 — FS#5820 — rbx-s1/rbx-s2 ace

Attached to Project— Network
Incident
Whole Network
CLOSED
100%
We have an incident on the ACE of rbx-s1. We are looking for the origin of the problem.
Date:  Tuesday, 27 September 2011, 15:48PM
Reason for closing:  Done
Comment by OVH - Sunday, 25 September 2011, 01:40AM

We will probably be obliged to reboot.
We are preparing the card.


Comment by OVH - Sunday, 25 September 2011, 01:41AM

Card is being restarted:

20w1d: SP: The PC in slot 2 is shutting down. Please wait ...
20w1d: SP: PC shutdown completed for module 2
Sep 25 00:07:45 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset)

20w1d: Processor 0 of module in slot 2 cannot service session requests.

20w2d: Processor 0 of module in slot 2 cannot service session requests.

20w2d: Processor 0 of module in slot 2 cannot service session requests.

20w2d: Processor 0 of module in slot 2 cannot service session requests.

Sep 25 00:13:03 GMT: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimal Diagnostics...
Sep 25 00:13:14 GMT: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
Sep 25 00:13:18 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online


Comment by OVH - Sunday, 25 September 2011, 01:41AM

It's done, the card is up again.


Comment by OVH - Sunday, 25 September 2011, 04:01AM

The slave card (s2 ace), which had taken over the load from s1, crashed.

Sep 25 01:38:28 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 01:38:29 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)
Sep 25 01:38:30 GMT: %DIAG-SP-3-TEST_FAIL: Module 2: TestAsicSync{ID=3} has failed. Error code = 0x76 (DIAG_QUERY_HYPERION_SYNC_ERROR)

The card is back up, reporting the same boot reason as the original crash:
last boot reason: SB Wdog uspace big loadavg


Comment by OVH - Sunday, 25 September 2011, 04:01AM

Sep 25 02:03:20 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 02:03:20 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)
Sep 25 02:08:52 GMT: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimal Diagnostics...
Sep 25 02:09:05 GMT: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
Sep 25 02:09:08 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online

The card is up with the reboot message:
last boot reason: SB Wdog uspace big loadavg
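
To read the length of the reload off the log above, the timestamps can be compared directly. The following is a minimal Python sketch, not a tool we use: the function names `parse` and `reload_seconds` are hypothetical, the year is assumed to be 2011 (the syslog lines do not carry it), and the window is assumed to run from the C6KPWR "power set off" line to the OIR "Card inserted" line.

```python
import re
from datetime import datetime

# Syslog excerpt copied from the comment above.
LOG = """\
Sep 25 02:03:20 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 02:03:20 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)
Sep 25 02:08:52 GMT: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimal Diagnostics...
Sep 25 02:09:05 GMT: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
Sep 25 02:09:08 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online
"""

# Captures the timestamp and the %FACILITY-SP-SEVERITY-MNEMONIC tag.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) GMT: %([A-Z0-9_]+-SP-\d-[A-Z_]+):")


def parse(log, year=2011):
    """Return a list of (timestamp, tag) tuples; the year is an assumption."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group(2)))
    return events


def reload_seconds(events):
    """Seconds between the first power-off and the first card-inserted event."""
    off = next(ts for ts, tag in events if tag.endswith("DISABLED"))
    on = next(ts for ts, tag in events if tag.endswith("INSCARD"))
    return (on - off).total_seconds()


print(reload_seconds(parse(LOG)))
```

For this excerpt the module was off for a little under six minutes.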


Comment by OVH - Sunday, 25 September 2011, 04:17AM

Reading the error message, it seems to mean that a client (uspace) generated a heavy load (big loadavg), and that the watchdog (ft, fault tolerance) therefore triggered the switch from the master card to the slave card. In other words, when in doubt, the card switches to the slave because it has decided that the master is not healthy.
We have no idea whether this interpretation is correct. We will see what the TAC answers.

We changed the ft values from

heartbeat interval 300
heartbeat count 20

to

heartbeat interval 1000
heartbeat count 50

We'll see if it's more stable this way.
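
To make the effect of this change concrete, here is a rough sketch in Python. It assumes that the heartbeat interval is expressed in milliseconds and that the peer is declared down after `count` consecutive missed heartbeats, so the detection window is roughly interval x count. The function name is hypothetical and the real ACE behaviour may differ in detail.

```python
def detection_window_seconds(interval_ms, count):
    """Approximate time without heartbeats before the peer is declared down.

    Assumption: interval is in milliseconds and the failure is declared
    after `count` consecutive missed heartbeats.
    """
    return interval_ms * count / 1000


old = detection_window_seconds(300, 20)    # previous ft values
new = detection_window_seconds(1000, 50)   # new ft values

print(f"old window: {old} s, new window: {new} s")
```

Under these assumptions the window goes from about 6 seconds to about 50 seconds, which should make a failover on a short load spike less likely, at the cost of a slower reaction to a real failure.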


Comment by OVH - Sunday, 25 September 2011, 04:19AM

And why do we have the problem only at night? Do we have a nag customer?

s2/ace is master:

rbx-s2-ace/Admin# sh proc cpu

CPU utilization for five seconds: 68%; one minute: 66%; five minutes: 63%

s1/ace is currently slave:

rbx-s1-ace/Admin# sh proc cpu

CPU utilization for five seconds: 31%; one minute: 34%; five minutes: 33%


Comment by OVH - Sunday, 25 September 2011, 04:22AM

If the situation does not stabilize, we will limit administration access to the ACE to 4 simultaneous connections. Some customers open 50 or 100 connections (!?), and they are probably causing the problem.


Comment by OVH - Sunday, 25 September 2011, 04:23AM

We have applied it to the contexts of some customers.


Comment by OVH - Sunday, 25 September 2011, 18:13PM

rbx-s2-ace/Admin# sh proc cpu

CPU utilization for five seconds: 3%; one minute: 4%; five minutes: 5%
rbx-s1-ace/Admin# sh proc cpu

CPU utilization for five seconds: 10%; one minute: 12%; five minutes: 13%

It's much better.
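
For anyone who wants to compare such readings automatically, here is a small Python sketch that parses the `sh proc cpu` summary line and computes the drop on rbx-s2-ace between the 04:19 and 18:13 readings. The function name `parse_cpu` and the dictionary keys are hypothetical; only the summary line format shown in this ticket is assumed.

```python
import re

# Matches the summary line printed by "sh proc cpu"; \s* tolerates a wrapped line.
CPU_RE = re.compile(
    r"five seconds: (\d+)%; one minute: (\d+)%; five minutes:\s*(\d+)%"
)


def parse_cpu(output):
    """Extract the three CPU utilization percentages as integers."""
    m = CPU_RE.search(output)
    if m is None:
        raise ValueError("no CPU utilization line found")
    five_sec, one_min, five_min = (int(g) for g in m.groups())
    return {"5s": five_sec, "1m": one_min, "5m": five_min}


# rbx-s2-ace readings copied from this ticket (04:19 and 18:13).
before = parse_cpu("CPU utilization for five seconds: 68%; one minute: 66%; five minutes: 63%")
after = parse_cpu("CPU utilization for five seconds: 3%; one minute: 4%; five minutes: 5%")

# Drop in percentage points for each averaging window.
drop = {k: before[k] - after[k] for k in before}
print(drop)
```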