OVHcloud Network Status

FS#5651 — rbx-31
Incident Report for Network & Infrastructure
Resolved
There is a problem with the chassis. A card appears to be defective and is blocking the chassis.


Update(s):

Date: 2011-08-04 08:52:31 UTC
The intervention did not go as planned because of a version problem on the new card. We were forced to cold-reboot the chassis once more. The chassis is now stable, but we suspect another card is the source of the problem, so as a precaution we are replacing card 3.



Date: 2011-08-04 08:47:34 UTC
We are inserting a new card in slot #2.


Date: 2011-08-03 13:35:09 UTC
We are currently running on a single sup in slot 1. Apparently at least one of the spare cards inserted yesterday was defective. We are retesting all the cards in the lab and are planning an intervention tonight to insert a new sup card in slot 2. We may also replace cards 3 and 4 as a preventive measure.

Date: 2011-08-03 13:19:56 UTC
We are restarting on a new card in slot 1, with a single sup. We are loading the configuration from the backup again.


Date: 2011-08-03 13:15:36 UTC
Card #1 is not restarting:

Local Test Mode encounters Minor hardware problem in Module # 1
Supervisor module 1 encontered CRITICAL failure: 0x1e - EARL_FAILURE L3_FAILURE RWENGINE_FAILURE L2_FAILURE
Failed Module Bringup Process
Use 'show test 1' to see results of tests.
Use 'reset 1' to reset the module.

We are trying to restart the chassis without cards 3 and 4, which are the last elements in common with the previous setup.


Date: 2011-08-03 13:14:05 UTC
card1:
*** Bus Timeout NMI ***
PC = 0x80b808c8, SP = 0x87fff110 frame = 0xa0005ea8

*** Unknown External Interrupt ***
Stacked Cause = 0x800, Stacked Status Reg = 0x2441fc03
Current Cause IP[7..0] = 0x8, Current SREG IP[7..0] = 0xfc


Date: 2011-08-03 13:13:53 UTC
card1:
PC = 0xbfc0a6f4, Cause = 0x4c00, Status Reg = 0x2441fc03
(the same line is printed 29 times in a row)

Date: 2011-08-03 13:13:35 UTC
Another crash. We are cold-restarting the chassis. Cards 3 and 4 have still not been changed.


Date: 2011-08-03 11:54:21 UTC
Card #2 took over while #1 was rebooting. A card other than the sups is probably the cause of the problems encountered since last night. We are going to replace card #5.



Date: 2011-08-03 11:52:31 UTC
Aug 3 13:15:38 rbx-31-c1.routers.ovh.net 2011 Aug 03 11:15:17 %SYS-4-SUPERVISOR_ERR:Forwarding engine IP checksum error counter = 6
Aug 3 13:15:35 rbx-31-c1.routers.ovh.net 2011 Aug 03 11:15:14 %SYS-5-MOD_OK:Module 16(WS-F6K-MSFC,SAD040604MY) is online
Aug 3 13:15:34 rbx-31-c1.routers.ovh.net 2011 Aug 03 11:15:13 %SYS-3-MOD_PORTINTFINSYNC:Port Interface in sync for Module 2
Aug 3 13:15:34 rbx-31-c1.routers.ovh.net 2011 Aug 03 11:15:12 %SYS-5-MOD_OK:Module 5(WS-X6408A-GBIC,SAD05030JDD) is online
Aug 3 13:15:32 rbx-31-m2.routers.ovh.net 58: Aug 3 13:15:13 GMT: %SCP-5-ONLINE: Module online (supervisor switchover)
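
For readers who want to triage messages like the ones above, here is a minimal sketch in Python that pulls the facility, severity and mnemonic out of each line. It assumes the usual "%FACILITY-SEVERITY-MNEMONIC: text" layout and the standard syslog severity numbering (0 to 7); the names MSG_RE, SEVERITY_NAMES and parse_message are made up for this illustration and are not part of any OVH or Cisco tooling.

```python
import re

# Assumed layout: "%FACILITY-SEVERITY-MNEMONIC: text" somewhere in the line.
MSG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+):\s*(?P<text>.*)"
)

# Standard syslog severity levels (lower number = more severe).
SEVERITY_NAMES = {
    0: "emergency", 1: "alert", 2: "critical", 3: "error",
    4: "warning", 5: "notice", 6: "informational", 7: "debug",
}


def parse_message(line):
    """Return a dict with facility, severity, mnemonic and text, or None."""
    match = MSG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["severity"] = int(fields["severity"])
    return fields


# The five lines from the update above, copied verbatim.
LINES = [
    "Aug 3 13:15:38 rbx-31-c1.routers.ovh.net 2011 Aug 03 11:15:17 %SYS-4-SUPERVISOR_ERR:Forwarding engine IP checksum error counter = 6",
    "Aug 3 13:15:35 rbx-31-c1.routers.ovh.net 2011 Aug 03 11:15:14 %SYS-5-MOD_OK:Module 16(WS-F6K-MSFC,SAD040604MY) is online",
    "Aug 3 13:15:34 rbx-31-c1.routers.ovh.net 2011 Aug 03 11:15:13 %SYS-3-MOD_PORTINTFINSYNC:Port Interface in sync for Module 2",
    "Aug 3 13:15:34 rbx-31-c1.routers.ovh.net 2011 Aug 03 11:15:12 %SYS-5-MOD_OK:Module 5(WS-X6408A-GBIC,SAD05030JDD) is online",
    "Aug 3 13:15:32 rbx-31-m2.routers.ovh.net 58: Aug 3 13:15:13 GMT: %SCP-5-ONLINE: Module online (supervisor switchover)",
]

if __name__ == "__main__":
    for line in LINES:
        msg = parse_message(line)
        if msg is not None:
            level = SEVERITY_NAMES[msg["severity"]]
            print(f'{msg["facility"]:>4} {level:<8} {msg["mnemonic"]}: {msg["text"]}')
```

Sorting or filtering on the severity field makes the checksum warning and the supervisor switchover stand out from the routine "module online" notices.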

Date: 2011-08-03 11:52:24 UTC
Card #1 has just crashed again.

Date: 2011-08-03 11:51:44 UTC
It is going to be alright tonight :)


Date: 2011-08-03 11:51:11 UTC
We need to find the origin of the problem. We are testing different cards in the chassis.
http://yfrog.com/kl9ambvj

Then we will test the router's cards in another chassis.
http://yfrog.com/kiwgtqrj

We put a new card in #2, but the card does not power on. We change the slot's power in the chassis and it powers on:
All right, it is the chassis. Here we go, we will replace it: we take the cards out of the chassis, remove the chassis from the rack from the back,
set the cards aside, slide the chassis back in from the back, then reinsert the cards.
http://yfrog.com/kepqdsyj

It is all green, it is working. All that is left is to load the backup configuration.
http://yfrog.com/gzk7nftsj

It will go in the bin along with 2 burnt-out cards.
http://yfrog.com/kexkduxj



Date: 2011-08-03 11:41:10 UTC
m2 is set.

Date: 2011-08-02 20:57:46 UTC
The chassis and 2 sups are burnt out. We have replaced them all and had to completely reset the router. The service is up on #1. We are finishing with #2.

That was intense ...

Date: 2011-08-02 20:55:42 UTC
We grabbed a spare chassis and replaced the faulty one with it.

Date: 2011-08-02 20:54:47 UTC
EOBC channel failure on #2.

Date: 2011-08-02 20:54:33 UTC
#1 is continuing to boot.

#2 is booting too.

Date: 2011-08-02 20:53:43 UTC
Removing #4 blocked the boot, so we suspect it is the origin of the problem.



Date: 2011-08-02 20:52:39 UTC
#2 is dead.

We are putting #1 in. We are removing the other cards. We are first trying to boot #1 and check whether it works.


Date: 2011-08-02 20:49:41 UTC
The boot is in progress.

We are taking out #1 and putting in #2.


Date: 2011-08-02 20:48:48 UTC
In the meantime, we are preparing a spare for card #2.
At least one card is defective.


Date: 2011-08-02 20:48:04 UTC
Uptime is 1051 days, 15 hours, 22 minutes

Date: 2011-08-02 20:47:56 UTC
Card #2 is out. The other cards are no longer detected.
We are doing a hardware reboot.
Posted Aug 02, 2011 - 20:47 UTC