OVHcloud Network Status

FS#5820 — rbx-s1/rbx-s2 ace
Incident Report for Network & Infrastructure
Resolved
We have an incident on the ACE of rbx-s1. We are looking for the origin of the problem.

Update(s):

Date: 2011-09-25 16:13:28 UTC
rbx-s2-ace/Admin# sh proc cpu

CPU utilization for five seconds: 3%; one minute: 4%; five minutes: 5%
rbx-s1-ace/Admin# sh proc cpu

CPU utilization for five seconds: 10%; one minute: 12%; five minutes: 13%

It's much better.

Date: 2011-09-25 02:23:02 UTC
We have applied the 4-connection limit to the contexts of some customers.

Date: 2011-09-25 02:22:00 UTC
If the situation does not stabilise, we will limit ACE administration to 4 simultaneous connections. Some customers open 50 or 100 connections (!?), and they are probably causing the problem.

Date: 2011-09-25 02:19:37 UTC
And why do we have the problem only at night? Do we have a nag customer?

s2/ace is master:

rbx-s2-ace/Admin# sh proc cpu

CPU utilization for five seconds: 68%; one minute: 66%; five minutes: 63%

s1/ace is currently slave:

rbx-s1-ace/Admin# sh proc cpu

CPU utilization for five seconds: 31%; one minute: 34%; five minutes: 33%
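
To compare the two cards at a glance, the `sh proc cpu` lines above can be parsed with a small script. This is only a sketch: the regular expression is written against the output format quoted in this report, and the names `CPU_LINE` and `parse_cpu` are hypothetical helpers, not part of any ACE tooling.

```python
import re

# Matches the three averages in one `sh proc cpu` line as quoted in this report.
CPU_LINE = re.compile(
    r"five seconds:\s*(\d+)%;\s*one minute:\s*(\d+)%;\s*five minutes:\s*(\d+)%"
)


def parse_cpu(line):
    """Return (five_sec, one_min, five_min) percentages, or None if no match."""
    match = CPU_LINE.search(line)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


# Lines copied from the update above.
readings = {
    "rbx-s2-ace": "CPU utilization for five seconds: 68%; one minute: 66%; five minutes: 63%",
    "rbx-s1-ace": "CPU utilization for five seconds: 31%; one minute: 34%; five minutes: 33%",
}

parsed = {card: parse_cpu(line) for card, line in readings.items()}
for card, (five_sec, one_min, five_min) in parsed.items():
    print(f"{card}: 5s={five_sec}% 1m={one_min}% 5m={five_min}%")
```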

Date: 2011-09-25 02:17:35 UTC
If we read the error message, it means that because of a client (uspace) there is a heavy load (big loadavg), and therefore the watchdog (ft, fault tolerance) triggers the switch from the master card to the slave card. In other words: "when in doubt, I switch to the slave card because I have decided that the master is not healthy."
We have no idea whether this is true. We will see what the TAC answers.

We changed the ft values from

heartbeat interval 300
heartbeat count 20

to

heartbeat interval 1000
heartbeat count 50

We'll see if it's more stable this way.
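
To get a feel for what this change means, here is a rough calculation. It assumes that the heartbeat interval is expressed in milliseconds and that the peer is declared down after `count` consecutive missed heartbeats, so the detection window is interval × count. Both points are assumptions to be checked against the ACE documentation or with the TAC, and `detection_window_seconds` is a hypothetical helper.

```python
def detection_window_seconds(interval_ms, count):
    """Time without heartbeats before the peer is declared down.

    Assumed model: window = heartbeat interval (ms) * heartbeat count.
    """
    return interval_ms * count / 1000


old = detection_window_seconds(300, 20)    # previous ft values
new = detection_window_seconds(1000, 50)   # new ft values
print(f"old window: {old:.0f} s, new window: {new:.0f} s")
```

Under these assumptions the window grows from about 6 seconds to about 50 seconds, which gives a heavily loaded master more time to answer before a switchover is triggered, at the cost of slower failover when the master really is down.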

Date: 2011-09-25 02:01:54 UTC
Sep 25 02:03:20 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 02:03:20 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)
Sep 25 02:08:52 GMT: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimal Diagnostics...
Sep 25 02:09:05 GMT: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
Sep 25 02:09:08 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online

The card is up with the reboot message:
last boot reason: SB Wdog uspace big loadavg
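
The syslog lines above also give an estimate of how long module 2 was offline during this reload. The sketch below measures the time between the PWRCYCLE message and the INSCARD message. The year is not present in the log lines and is taken from the date of this update (an assumption), and `LOG_LINE` and `parse_line` are hypothetical helpers.

```python
import re
from datetime import datetime

# "Sep 25 02:03:20 GMT: %OIR-SP-3-PWRCYCLE: ..." -> timestamp and mnemonic
LOG_LINE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) GMT: %([^:]+):")


def parse_line(line, year=2011):
    """Return (datetime, mnemonic) for a syslog line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    stamp, mnemonic = match.groups()
    # The year is an assumption: the log lines themselves do not carry one.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, mnemonic


log = [
    "Sep 25 02:03:20 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'",
    "Sep 25 02:03:20 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)",
    "Sep 25 02:08:52 GMT: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimal Diagnostics...",
    "Sep 25 02:09:05 GMT: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics",
    "Sep 25 02:09:08 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online",
]

events = [event for event in map(parse_line, log) if event is not None]
went_off = next(when for when, name in events if name.endswith("PWRCYCLE"))
came_back = next(when for when, name in events if name.endswith("INSCARD"))
downtime = came_back - went_off
print(f"module 2 offline for {int(downtime.total_seconds())} s")
```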

Date: 2011-09-25 02:01:06 UTC
The slave card, s2 ace, which had taken over the load of s1, has crashed.

Sep 25 01:38:28 GMT: %OIR-SP-3-PWRCYCLE: Card in module 2, is being power-cycled 'off (Reset - Module Reloaded During Download)'
Sep 25 01:38:29 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset - Module Reloaded During Download)
Sep 25 01:38:30 GMT: %DIAG-SP-3-TEST_FAIL: Module 2: TestAsicSync{ID=3} has failed. Error code = 0x76 (DIAG_QUERY_HYPERION_SYNC_ERROR)

The card is back with the original message of the crash:
last boot reason: SB Wdog uspace big loadavg

Date: 2011-09-24 23:41:30 UTC
It's done, the card is up again.

Date: 2011-09-24 23:41:04 UTC
Card is being restarted:

20w1d: SP: The PC in slot 2 is shutting down. Please wait ...
20w1d: SP: PC shutdown completed for module 2
Sep 25 00:07:45 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset)

20w1d: Processor 0 of module in slot 2 cannot service session requests.

20w2d: Processor 0 of module in slot 2 cannot service session requests.

20w2d: Processor 0 of module in slot 2 cannot service session requests.

20w2d: Processor 0 of module in slot 2 cannot service session requests.

Sep 25 00:13:03 GMT: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimal Diagnostics...
Sep 25 00:13:14 GMT: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
Sep 25 00:13:18 GMT: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online

Date: 2011-09-24 23:40:45 UTC
We will probably be obliged to reboot.
We are preparing the card.
Posted Sep 24, 2011 - 23:39 UTC