OVHcloud Network Status

FS#5592 — rbx-g1/g2
Incident Report for Network & Infrastructure
Resolved
We have a problem on the ASR9000

Jul 6 12:58:05 rbx-g1-a9.fr.eu 5919: LC/0/0/CPU0:Jul 6 10:57:46 UTC: fib_mgr[161]: %ROUTING-FIB-4-RSRC_LOW : CEF running low on DATA_TYPE_TABLE_SET resource memory. CEF will now begin resource constrained forwarding. Only route deletes will be handled in this state, which may result in mismatch between RIB/CEF. Traffic loss on certain prefixes can be expected. The CEF will automatically resume normal operation, once the resource utilization returns to normal level
Jul 6 12:57:42 rbx-g2-a9.fr.eu 15654: LC/0/3/CPU0:Jul 6 10:57:23 UTC: fib_mgr[161]: %PLATFORM-PLAT_FIB-6-INFO : PD FIB object LEAF OOR state changed to GREEN
Jul 6 12:57:42 rbx-g2-a9.fr.eu 15655: LC/0/3/CPU0:Jul 6 10:57:23 UTC: fib_mgr[161]: %ROUTING-FIB-6-RSRC_OK : CEF resource state has returned to normal. CEF has exited resource constrained operation and normal forwarding has been restored
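
The messages above show line cards entering and leaving resource-constrained forwarding. As a rough illustration only, the following sketch pulls those transitions out of collected syslog lines. The regular expression, the function name cef_events and the shortened sample lines are assumptions made for this example, not a tool used during the incident.

```python
import re

# Assumed layout, taken from the log lines above:
#   <date> <host> <seq>: <node>:<date> UTC: <process>: %<FACILITY>-<severity>-<MNEMONIC> : <text>
MSG = re.compile(
    r"(?P<host>\S+) \d+: (?P<node>LC/\d+/\d+/CPU\d+):.*?"
    r"%(?P<facility>[A-Z_]+(?:-[A-Z_]+)*)-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+)"
)

# Sample lines copied from the incident log; the long message text of the
# first line is shortened here for readability.
SAMPLE = [
    "Jul 6 12:58:05 rbx-g1-a9.fr.eu 5919: LC/0/0/CPU0:Jul 6 10:57:46 UTC: "
    "fib_mgr[161]: %ROUTING-FIB-4-RSRC_LOW : CEF running low on "
    "DATA_TYPE_TABLE_SET resource memory.",
    "Jul 6 12:57:42 rbx-g2-a9.fr.eu 15654: LC/0/3/CPU0:Jul 6 10:57:23 UTC: "
    "fib_mgr[161]: %PLATFORM-PLAT_FIB-6-INFO : PD FIB object LEAF OOR state "
    "changed to GREEN",
    "Jul 6 12:57:42 rbx-g2-a9.fr.eu 15655: LC/0/3/CPU0:Jul 6 10:57:23 UTC: "
    "fib_mgr[161]: %ROUTING-FIB-6-RSRC_OK : CEF resource state has returned "
    "to normal.",
]


def cef_events(lines):
    """Return (host, line card, mnemonic) for CEF resource state changes."""
    events = []
    for line in lines:
        match = MSG.search(line)
        # Keep only the two mnemonics that mark entering/leaving the
        # resource-constrained state; other messages are ignored.
        if match and match.group("mnemonic") in ("RSRC_LOW", "RSRC_OK"):
            events.append(
                (match.group("host"), match.group("node"), match.group("mnemonic"))
            )
    return events


for host, node, mnemonic in cef_events(SAMPLE):
    print(host, node, mnemonic)
```

On the sample, this reports RSRC_LOW for LC/0/0/CPU0 on rbx-g1-a9 and RSRC_OK for LC/0/3/CPU0 on rbx-g2-a9, which is all the excerpt contains.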


Update(s):

Date: 2011-07-07 12:50:25 UTC
We received the spare card from Cisco at 4:00 am.
http://yfrog.com/z/kejb0uj

The old card is still in the router.
First of all, we disconnect the optical fibres.
http://yfrog.com/z/kg4rknnj

Done. The card is ready to be pulled out.
http://yfrog.com/z/kl2d5jj

Ready to go? Go... The card is out.
http://yfrog.com/z/kl1aslhj

We verify the logs and everything is OK
http://yfrog.com/z/kj1kfij

We put down the old card and unpack the new one
http://yfrog.com/z/kh47sqj

The card is ready to be inserted
http://yfrog.com/z/kiz82vtj

The card is inserted and it boots
http://yfrog.com/z/kjh2dvj

We verify the logs: the boot goes well
http://yfrog.com/z/kl42ttj

We re-connect the optical fibres.
http://yfrog.com/z/h7iialhxj

We verify the logs: everything is OK
http://yfrog.com/z/khd74nj

We verify the weathermap and the traffic flow to Paris and Frankfurt: everything is OK
http://weathermap.ovh.net/backbone

The old card is re-packed and will be sent to Cisco.

We thank the Cisco team for their follow-up throughout the night. The internal bug was fixed at 1:00 am.

Date: 2011-07-06 23:22:06 UTC
http://status.ovh.co.uk/?do=details&id=1154

[...]
We will replace card #6 of g1 with card #4 of g2, whose ports are unused or carry little traffic.
[...]

That is why it does not match Cisco's databases.

Date: 2011-07-06 23:19:06 UTC
Apparently the card is not in Cisco's databases.
This is probably because we have already had two broken cards, and the databases were not updated after the previous RMA.

Date: 2011-07-06 23:15:26 UTC
Well, that settles it: Cisco's databases have not been updated with the recently signed contract, so we will not have the card within 2 hours.

Date: 2011-07-06 23:10:55 UTC
Traffic has been put back and everything is running correctly.

The initial problem is fixed.

Now we need to replace the card. The RMA is in progress.

Date: 2011-07-06 23:07:17 UTC
Cisco asked us to restart the card to see whether it is definitely dead.

RP/0/RSP0/CPU0:rbx-g1-a9(admin)#reload location 0/4/CPU0
Wed Jul 6 19:37:06.607 UTC

Preparing system for backup. This may take a few minutes especially for large configurations.
[Done]
Proceed with reload? [confirm]

Date: 2011-07-06 23:01:44 UTC
We have started the card replacement with Cisco under the T+2H hardware support contract. This means that, in case of a hardware problem on any component of the router, Cisco supplies a replacement for the failed part in less than 2 hours.

We checked the ports that are down and we do not expect any impact on traffic, even without the card. All ports are lined and the links should not saturate.

We have just put the router back into routing.

Now we will check saturation of the links.

Date: 2011-07-06 22:57:46 UTC
Card 0/4 has died.

Date: 2011-07-06 22:56:43 UTC
g1 is up.
We will check it now.

Date: 2011-07-06 22:56:16 UTC
RP/0/RSP0/CPU0:rbx-g1-a9(admin)#reload location all
Wed Jul 6 19:13:11.504 UTC

Preparing system for backup. This may take a few minutes especially for large configurations.
Status report: node0_RSP0_CPU0: START TO BACKUP
Status report: node0_RSP0_CPU0: BACKUP HAS COMPLETED SUCCESSFULLY
[Done]
Proceed with reload? [confirm]RP/0/RSP0/CPU0::This node received reload command. Reloading in 5 secs

Restart in progress.

Date: 2011-07-06 22:55:51 UTC
g1 is out of the loop; everything is routed through g2.
We are ready to restart.

Date: 2011-07-06 22:54:25 UTC
We will now take g1 out of routing.

Date: 2011-07-06 22:53:26 UTC
g2 is OK.
We have put it back into routing; it is in the loop.

Date: 2011-07-06 22:52:21 UTC
g2 is UP.

We are checking it.

Date: 2011-07-06 22:51:55 UTC
RP/0/RSP1/CPU0:rbx-g2-a9(admin)#reload location all
Wed Jul 6 18:58:42.597 UTC

Preparing system for backup. This may take a few minutes especially for large configurations.
Status report: node0_RSP1_CPU0: START TO BACKUP
Status report: node0_RSP1_CPU0: BACKUP HAS COMPLETED SUCCESSFULLY
[Done]
Proceed with reload? [confirm]RP/0/RSP1/CPU0::This node received reload command. Reloading in 5 secs

Date: 2011-07-06 22:51:35 UTC
All routing currently goes through g1.
We are ready for g2.

Date: 2011-07-06 22:50:39 UTC
RP/0/RSP1/CPU0:rbx-g2-a9(admin-config)#hw-module profile scale l3xl
Wed Jul 6 18:50:16.520 UTC
In order to activate this new memory resource profile, you must manually reboot the system.

We have to restart the router.

Date: 2011-07-06 22:50:10 UTC
Traffic to the new IPs in the network is going around in a loop.
We are waiting for Cisco.

6 th2-1-6k.fr.eu (213.186.32.181) 55.409 ms * 50.620 ms
7 th1-1-6k.fr.eu (213.186.32.165) 58.132 ms * 50.333 ms
8 rbx-g2-a9.fr.eu (91.121.131.141) 55.075 ms 53.812 ms 54.613 ms
9 gsw-2-6k.fr.eu (91.121.131.214) 77.756 ms * *
10 rbx-g1-a9.fr.eu (91.121.131.33) 57.627 ms 57.028 ms 57.390 ms
11 gsw-2-6k.fr.eu (91.121.131.38) 263.777 ms
gsw-2-6k.fr.eu (91.121.131.34) 205.179 ms
gsw-2-6k.fr.eu (213.251.128.106) 209.499 ms
12 rbx-g1-a9.fr.eu (91.121.131.33) 62.124 ms 59.690 ms 62.422 ms
13 gsw-2-6k.fr.eu (91.121.131.38) 62.392 ms *
gsw-2-6k.fr.eu (213.251.128.106) 61.387 ms
14 rbx-g1-a9.fr.eu (91.121.131.33) 65.804 ms 65.402 ms 65.773 ms
15 gsw-2-6k.fr.eu (91.121.131.38) 65.205 ms *
gsw-2-6k.fr.eu (213.251.128.106) 64.206 ms
16 rbx-g1-a9.fr.eu (91.121.131.33) 69.591 ms 67.366 ms 68.669 ms
17 * * gsw-2-6k.fr.eu (213.251.128.106) 220.553 ms
18 rbx-g1-a9.fr.eu (91.121.131.33) 71.096 ms 73.312 ms 71.266 ms
19 gsw-2-6k.fr.eu (91.121.131.38) 70.817 ms
gsw-2-6k.fr.eu (91.121.131.34) 70.360 ms
gsw-2-6k.fr.eu (213.251.128.106) 71.530 ms
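
The trace above bounces between rbx-g1-a9 and gsw-2-6k from hop 9 onwards. The sketch below shows one simple way to spot that pattern automatically: group the answers by hop number and report every router name that answers at more than one hop. The function names and the parsing rules are assumptions for this example; a repeated name is only a hint of a loop, not proof, since a router may legitimately answer from several interfaces.

```python
import re

# A hop line starts with the hop number; continuation lines (further
# answers for the same hop) start directly with the router name.
HOP = re.compile(r"^(\d+)\s+(.*)$")
HOST = re.compile(r"([\w.-]+) \((\d+\.\d+\.\d+\.\d+)\)")

# Copied from the traceroute output above.
TRACE = """
6 th2-1-6k.fr.eu (213.186.32.181) 55.409 ms * 50.620 ms
7 th1-1-6k.fr.eu (213.186.32.165) 58.132 ms * 50.333 ms
8 rbx-g2-a9.fr.eu (91.121.131.141) 55.075 ms 53.812 ms 54.613 ms
9 gsw-2-6k.fr.eu (91.121.131.214) 77.756 ms * *
10 rbx-g1-a9.fr.eu (91.121.131.33) 57.627 ms 57.028 ms 57.390 ms
11 gsw-2-6k.fr.eu (91.121.131.38) 263.777 ms
gsw-2-6k.fr.eu (91.121.131.34) 205.179 ms
gsw-2-6k.fr.eu (213.251.128.106) 209.499 ms
12 rbx-g1-a9.fr.eu (91.121.131.33) 62.124 ms 59.690 ms 62.422 ms
13 gsw-2-6k.fr.eu (91.121.131.38) 62.392 ms *
gsw-2-6k.fr.eu (213.251.128.106) 61.387 ms
14 rbx-g1-a9.fr.eu (91.121.131.33) 65.804 ms 65.402 ms 65.773 ms
15 gsw-2-6k.fr.eu (91.121.131.38) 65.205 ms *
gsw-2-6k.fr.eu (213.251.128.106) 64.206 ms
16 rbx-g1-a9.fr.eu (91.121.131.33) 69.591 ms 67.366 ms 68.669 ms
17 * * gsw-2-6k.fr.eu (213.251.128.106) 220.553 ms
18 rbx-g1-a9.fr.eu (91.121.131.33) 71.096 ms 73.312 ms 71.266 ms
19 gsw-2-6k.fr.eu (91.121.131.38) 70.817 ms
gsw-2-6k.fr.eu (91.121.131.34) 70.360 ms
gsw-2-6k.fr.eu (213.251.128.106) 71.530 ms
"""


def hops_by_ttl(text):
    """Map each hop number to the set of router names that answered."""
    hops = {}
    ttl = None
    for line in text.splitlines():
        match = HOP.match(line)
        if match:
            ttl, line = int(match.group(1)), match.group(2)
            hops[ttl] = set()
        if ttl is not None:
            hops[ttl].update(name for name, _ip in HOST.findall(line))
    return hops


def repeated_routers(hops):
    """Return {router name: [hop numbers]} for names seen at several hops."""
    seen = {}
    for ttl, names in hops.items():
        for name in names:
            seen.setdefault(name, []).append(ttl)
    return {name: ttls for name, ttls in seen.items() if len(ttls) > 1}


loops = repeated_routers(hops_by_ttl(TRACE))
for name, ttls in sorted(loops.items()):
    print(name, ttls)
```

For this trace, gsw-2-6k.fr.eu is reported at hops 9, 11, 13, 15, 17 and 19 and rbx-g1-a9.fr.eu at hops 10, 12, 14, 16 and 18, while rbx-g2-a9.fr.eu appears only once.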

Date: 2011-07-06 22:48:04 UTC
The new IPs are not being registered.

We are in contact with Cisco TAC to fix the problem.

Date: 2011-07-06 14:44:06 UTC
RP/0/RSP1/CPU0:rbx-g2-a9#sh cef resource detail location 0/0/cpu0
Wed Jul 6 12:35:19.098 UTC
CEF resource availability summary state: YELLOW
CEF will drop route updates
No. of times HW caused oor: 26
CEF entered oor at : Jul 6 12:30:33.573
CEF came out of oor at : Jul 6 12:29:48.370
ipv4 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
ipv6 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
mpls shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
common shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
DATA_TYPE_TABLE_SET hardware resource: YELLOW
DATA_TYPE_TABLE hardware resource: YELLOW
DATA_TYPE_IDB hardware resource: YELLOW
DATA_TYPE_IDB_EXT hardware resource: YELLOW
DATA_TYPE_LEAF hardware resource: YELLOW
DATA_TYPE_LOADINFO hardware resource: YELLOW
DATA_TYPE_PATH_LIST hardware resource: YELLOW
DATA_TYPE_NHINFO hardware resource: YELLOW
DATA_TYPE_LABEL_INFO hardware resource: YELLOW
DATA_TYPE_FRR_NHINFO hardware resource: YELLOW
DATA_TYPE_ECD hardware resource: YELLOW
DATA_TYPE_RECURSIVE_NH hardware resource: YELLOW
DATA_TYPE_TUNNEL_ENDPOINT hardware resource: YELLOW
DATA_TYPE_LOCAL_TUNNEL_INTF hardware resource: YELLOW
DATA_TYPE_ECD_TRACKER hardware resource: YELLOW
DATA_TYPE_ECD_V2 hardware resource: YELLOW
DATA_TYPE_ATTRIBUTE hardware resource: YELLOW
DATA_TYPE_LSPA hardware resource: YELLOW
DATA_TYPE_LDI_LW hardware resource: YELLOW
DATA_TYPE_LDSH_ARRAY hardware resource: YELLOW
DATA_TYPE_TE_TUN_INFO hardware resource: YELLOW
DATA_TYPE_DUMMY hardware resource: YELLOW
DATA_TYPE_IDB_VRF_LCL_CEF hardware resource: YELLOW
DATA_TYPE_TABLE_UNRESOLVED hardware resource: YELLOW
DATA_TYPE_MOL hardware resource: YELLOW
DATA_TYPE_MPI hardware resource: YELLOW
DATA_TYPE_SUBS_INFO hardware resource: YELLOW
DATA_TYPE_GRE_TUNNEL_INFO hardware resource: YELLOW
RP/0/RSP1/CPU0:rbx-g2-a9#
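
In the output above, the shared memory pools are GREEN while every hardware resource is YELLOW. To make that contrast easier to see, here is a small sketch that separates the two kinds of lines. The regular expressions, the function name parse_cef_resources and the shortened sample are assumptions for this example; they are based only on the layout of the output shown here.

```python
import re

HW_STATE = re.compile(r"^(DATA_TYPE_\w+) hardware resource: (\w+)$")
SHARED_MEM = re.compile(
    r"CurrMode (\w+), CurrAvail (\d+) bytes, MaxAvail (\d+) bytes"
)

# A shortened excerpt of the "sh cef resource detail" output above.
SAMPLE = """\
CEF resource availability summary state: YELLOW
ipv4 shared memory resource:
CurrMode GREEN, CurrAvail 866398208 bytes, MaxAvail 984129536 bytes
DATA_TYPE_TABLE_SET hardware resource: YELLOW
DATA_TYPE_LEAF hardware resource: YELLOW
DATA_TYPE_NHINFO hardware resource: YELLOW
"""


def parse_cef_resources(text):
    """Return ({hardware resource: state}, [(mode, available fraction)])."""
    hardware, shared = {}, []
    for line in text.splitlines():
        line = line.strip()
        hw = HW_STATE.match(line)
        if hw:
            hardware[hw.group(1)] = hw.group(2)
            continue
        mem = SHARED_MEM.search(line)
        if mem:
            avail, maximum = int(mem.group(2)), int(mem.group(3))
            shared.append((mem.group(1), avail / maximum))
    return hardware, shared


hardware, shared = parse_cef_resources(SAMPLE)
not_green = sorted(name for name, state in hardware.items() if state != "GREEN")

print("hardware resources not GREEN:", not_green)
for mode, fraction in shared:
    print("shared memory:", mode, f"{fraction:.0%} available")
```

On the excerpt, the shared memory pool is GREEN with roughly 88% still available, and the three hardware resources are all reported as not GREEN.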

Date: 2011-07-06 14:43:30 UTC
RP/0/RSP1/CPU0:rbx-g2-a9# show bgp nexthops statistics
Wed Jul 6 12:34:19.284 UTC
Total Nexthop Processing
Time Spent: 871.632 secs

Maximum Nexthop Processing
Received: 6w3d
Bestpaths Deleted: 0
Bestpaths Changed: 144079
Time Spent: 2.918 secs

Last Notification Processing
Received: 1d14h
Time Spent: 0.021 secs

Gateway Address Family: IPv4 Unicast
Table ID: 0xe0000000
Nexthop Count: 147
Critical Trigger Delay: 3000msec
Non-critical Trigger Delay: 10000msec

Nexthop Version: 1, RIB version: 1

Total Critical Notifications Received: 119
Total Non-critical Notifications Received: 11570
Bestpaths Deleted After Last Walk: 0
Bestpaths Changed After Last Walk: 1961
Nexthop register:
Sync calls: 426747, last sync call: 00:15:14
Async calls: 1697, last async call: 14w6d
Nexthop unregister:
Async calls: 426603, last async call: 00:14:38
Nexthop batch finish:
Calls: 947770, last finish call: 00:14:37
Nexthop flush timer:
Times started: 853358, last time flush timer started: 00:14:38
RIB update: 0 rib update runs, last update: 00:00:00
0 prefixes installed, 0 modified, 0 removed

RP/0/RSP1/CPU0:rbx-g2-a9#show controller np struct 6 summary location 0/0/cpu0
Wed Jul 6 12:34:29.161 UTC

Node: 0/0/CPU0:
----------------------------------------------------------------
NP: 0 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 1 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 2 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 3 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 4 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 5 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 6 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0

NP: 7 Struct 6: R_LDI
1685 of 65536 entries in use (1685 reserved)
Buddy allocator information:
Block Size : 1 2 4 8 16 32
Free Blocks: 288 57 8 1 1 1981
Used Blocks: 1673 0 3 0 0 0
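
The buddy allocator figures above can be cross-checked with a little arithmetic. The sketch below assumes that a block of size N holds N entries; the output does not say so explicitly, but the used blocks then add up to the reported 1685 entries. The variable and function names are chosen for this example only.

```python
# Figures copied from the "show controller np struct 6 summary" output
# above (identical for NP 0 to NP 7).
BLOCK_SIZES = [1, 2, 4, 8, 16, 32]
FREE_BLOCKS = [288, 57, 8, 1, 1, 1981]
USED_BLOCKS = [1673, 0, 3, 0, 0, 0]
TABLE_SIZE = 65536  # "1685 of 65536 entries in use"


def entries(block_counts):
    """Total entries, assuming a block of size N holds N entries."""
    return sum(size * count for size, count in zip(BLOCK_SIZES, block_counts))


used = entries(USED_BLOCKS)  # 1673*1 + 3*4
free = entries(FREE_BLOCKS)  # 288*1 + 57*2 + 8*4 + 1*8 + 1*16 + 1981*32
utilization = used / TABLE_SIZE

print("used entries:", used)
print("free entries:", free)
print(f"utilization: {utilization:.2%}")
```

Under that assumption, the used blocks give 1685 entries and the free blocks 63850, one entry short of the 65536 total when added together. Either way, struct 6 (R_LDI) is under 3% used on every NP, so this particular table does not look like the exhausted resource.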

Date: 2011-07-06 14:42:31 UTC
We have added next-hop-self on IPv6.

Same result.

We have just opened a TAC case with Cisco.

Date: 2011-07-06 14:41:19 UTC
The problem resembles this one
http://status.ovh.net/?do=details&id=752
but it is not quite the same.
Posted Jul 06, 2011 - 14:38 UTC