NASA Administrator Remembers Mission Control Pioneer Chris Kraft

Space

Chris Kraft, NASA’s first Flight Director, died yesterday, two days after the 50th anniversary of the first man on the moon.

Once comparing his complex work as a flight director to a conductor’s, Kraft said, ‘The conductor can’t play all the instruments–he may not even be able to play any one of them. But, he knows when the first violin should be playing, and he knows when the trumpets should be loud or soft, and when the drummer should be drumming. He mixes all this up and out comes music. That’s what we do here.’

It’s almost as if he waited for the anniversary. The world remembers the names of the men who set foot on the moon, but Chris Kraft was another hugely important figure, managing the hundreds of highly skilled engineers and scientists without whom the moon landing would not have been possible.

We’re fast approaching a time when there will be nobody left who was involved in Project Apollo. It will never be forgotten, but it will be sad when it is no longer remembered first-hand.

Self-Host Your Static Assets

Development, Web

Harry Roberts, writing at CSS Wizardry:

One of the quickest wins—and one of the first things I recommend my clients do—to make websites faster can at first seem counter-intuitive: you should self-host all of your static assets, forgoing others’ CDNs/infrastructure. In this short and hopefully very straightforward post, I want to outline the disadvantages of hosting your static assets ‘off-site’, and the overwhelming benefits of hosting them on your own origin.

I’m a little late to this (the post was written back in May), but it’s an interesting counter-argument to the common practice of serving third-party resources from a provider’s CDN. The post goes into a lot more detail, but the short version is: if you can, host them yourself.
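As a purely hypothetical example of what that can look like in practice (jQuery and the paths here are just placeholders, not anything from Harry’s post), pulling a copy of a library onto your own origin is usually a one-liner:

{% highlight bash %}
# Hypothetical example: instead of hot-linking a library from a third-party CDN,
# keep a copy alongside the rest of your static assets and serve it from your own origin.
mkdir -p assets/js
curl -o assets/js/jquery-3.4.1.min.js https://code.jquery.com/jquery-3.4.1.min.js
# ...then point your <script> tag at /assets/js/jquery-3.4.1.min.js instead of the CDN URL.
{% endhighlight %}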

The Two-Napkin Protocol

Networking, BGP

An interesting piece of history I didn’t previously know.

It was 1989. Kirk Lougheed of Cisco and Yakov Rekhter of IBM were having lunch in a meeting hall cafeteria at an Internet Engineering Task Force (IETF) conference.

They wrote a new routing protocol that became RFC (Request for Comment) 1105, the Border Gateway Protocol (BGP), known to many as the “Two Napkin Protocol” — in reference to the napkins they used to capture their thoughts.

The post is worth a read just to see the photos of the napkins. I’ve never really thought before about how RFCs come to be. I’d always assumed they were the result of clever people in offices, not really thought up on the back of a napkin over drinks!

Also, as it’s 2019… happy 30th birthday, BGP (and the World Wide Web).

The Infrastructure Mess Causing Countless Internet Outages

Networking, BGP

Roland Dobbins from Netscout Arbor, quoted in this Wired article:

“Nonspecialists kind of view the internet as this high-tech, gleaming thing like the bridge of the starship Enterprise. It’s not like that at all. It’s more like an 18th-century Royal Navy frigate. There’s a lot of running around and screaming and shouting and pulling on ropes to try to get things going in the right direction.”

It’s amazing how fragile some of the technologies powering something the world takes for granted can be; BGP is a great example of that. The Internet is everywhere, and being connected is increasingly necessary just to go about our lives.

Regarding the choice of headline, however, I don’t think the word “mess” is fair. That does a disservice to the hundreds of very talented people who design, implement, and maintain the infrastructure that underpins our connected world.

Junos: Confirm a commit cleanly

Networking, Junos

For years, I have loved the fact that Junos allows you to perform a commit confirmed to apply the configuration with an automatic rollback in a certain number of minutes.

I have always believed that the only way to confirm the commit (i.e. stop the automatic rollback) was to commit again. This creates two commits in the commit history, one containing the actual config diff, and an empty one purely used to stop the rollback. I’ve always thought that this creates a somewhat messy commit history, and confuses the use of show | compare rollback:

{% highlight bash %}
[edit]
ben@device> run show system commit
0   2018-07-27 08:44:26 BST by ben via cli
1   2018-07-27 08:44:07 BST by ben via cli
    commit confirmed, rollback in 5mins
2   2018-07-23 10:04:29 BST by ben via cli
3   2018-07-23 10:03:58 BST by ben via cli
    commit confirmed, rollback in 2mins

[edit]
ben@device> show | compare rollback 1

[edit]
ben@device> # Huh, it’s empty?! I’m sure I did some work…

[edit]
ben@device> show | compare rollback 2
[edit system]
- host-name old-device
+ host-name device

[edit]
ben@device> # Oh, there it is…
{% endhighlight %}

However, today I learnt that a commit check is enough to stop the rollback, and doesn’t create an empty commit! My commit histories are now much cleaner, and it’s a lot easier to work out what you’re actually looking at in show | compare rollback output.
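For anyone who hasn’t used it, the workflow now looks something like this (the five-minute timer is just an example, and the command output is omitted):

{% highlight bash %}
[edit]
ben@device> commit confirmed 5    # apply the config; automatic rollback in 5 minutes

# ...verify the change has not broken anything, then, within those 5 minutes:

[edit]
ben@device> commit check          # rollback cancelled, and no empty commit created
{% endhighlight %}

The commit history afterwards only contains the commits that actually changed something: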

{% highlight bash %}
[edit]
ben@device> run show system commit
1   2018-07-27 08:44:07 BST by ben via cli
    commit confirmed, rollback in 5mins
3   2018-07-23 10:03:58 BST by ben via cli
    commit confirmed, rollback in 2mins

[edit]
ben@device> show | compare rollback 1
[edit system]
- host-name old-device
+ host-name device
{% endhighlight %}

Much better!

Lessons Learnt from an Outage

Networking

I caused an outage last week.

Not intentionally, and not a large outage by any stretch of the imagination. But it had various knock-on consequences that led to a lengthy application recovery time, long after full network connectivity was restored.

We were replacing one of the two core MPLS routers at one of the three primary sites in our network. Upgrading the hardware to a newer model required costing the existing node out, powering it off, physically replacing it with the new hardware, and bringing the node back online.

(For those of you interested in the details, we were upgrading a Juniper MX480 to an MX10003 - all logical connectivity was staying the same, but some links were being upgraded from 10Gbps to either 40Gbps or 100Gbps.)

In preparation for the upgrade, I had taken the running config from the existing node, adapted what was necessary for the newer hardware (mainly interface name changes, etc) and preloaded it onto the new node waiting in the lab. So far so good.

Unfortunately, the running config at the time I took it from the live node was exactly that - live. At that moment the node wasn’t in its costed-out state, meaning that the minute we started repatching links into the new hardware, they were coming back up live.
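For context, ‘costing out’ a node just means telling the network to stop using it for transit, so traffic drains away before you touch the hardware. As a rough illustration of what that can look like (assuming an IS-IS based core, which may not match our actual setup):

{% highlight bash %}
# Illustrative only: set the IS-IS overload bit so that other routers stop
# sending transit traffic through this node before it is powered off.
set protocols isis overload
commit comment "cost out node ahead of hardware replacement"
{% endhighlight %}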

These core nodes provide connectivity between our three primary sites and their three data centres. As the links came up live, the new node started drawing traffic from the local data centre fabric before it had any connectivity to the other sites (or even to the other node in the same site). This meant that for a period of around four minutes, between one sixth and one half of all egress traffic from the local data centre was being blackholed and dropped.

This caused connectivity issues between our Ceph clusters, which then struggled to make sense of the situation and filled up the only remaining link between this site and the others with traffic. (I don’t know the full details of why it did this, or what that traffic was - I’m a network engineer, and not responsible for the rest of the infrastructure. But remember I said we were upgrading some links from 10Gbps to 100Gbps? This is partly why.) This exacerbated the connectivity problems (for other applications in addition to Ceph), pinning the links at line rate long after we had costed out the new node as originally intended, and leading to the long recovery time for the application layer.

So what did I learn from this outage? That’s the most important question when something like this happens - particularly if it was your fault. This is what I learnt.

Don’t become complacent

We had done this exact procedure for the other node in this site just days before, and all had gone smoothly (I had remembered to cost out the new node as I was preparing its config, instead of running with the live config from the existing node). Even if you’ve done an identical change before, always triple check what you’re doing if it has the potential to cause an impact on other parts of the business. Ideally, ask a colleague if they can spot anything you may have forgotten.

Unconditional summarisation isn’t always a good thing

Even though the links weren’t costed out, if all the core did was pass on routes learnt from the other sites, the new node wouldn’t have started drawing traffic until it was able to route it.

Instead, our core currently advertises three aggregate routes to its clients - the three private RFC 1918 ranges. These aggregates are active if just a single contributing route is present in the routing table. In our case, the core node’s loopback is a perfectly valid contributing route to the 10.0.0.0/8 aggregate, so the aggregate is advertised even when every other link on the box is down, drawing (and dropping) traffic.

Under normal circumstances, we’d obviously expect the core nodes to have full reachability to the rest of the core, and this wouldn’t be an issue. However, certain failure scenarios can cause a node to become isolated (not just configuration mistakes like this one!) and blackhole traffic. Advertising more specific routes actually learnt from other peers, or imposing some conditions on the generation of the aggregate routes, would help limit this.
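As a purely illustrative sketch (the prefixes and policy name are made up, and this isn’t our actual config), Junos lets you attach a policy to an aggregate so that only routes you genuinely expect from the fabric can contribute to it, rather than the node’s own loopback:

{% highlight bash %}
# Illustrative sketch: only routes from the (hypothetical) data centre fabric range
# may contribute to the 10.0.0.0/8 aggregate - a lone loopback no longer brings it up.
set policy-options policy-statement AGG-10-CONTRIBUTORS term fabric-routes from route-filter 10.20.0.0/16 orlonger
set policy-options policy-statement AGG-10-CONTRIBUTORS term fabric-routes then accept
set policy-options policy-statement AGG-10-CONTRIBUTORS then reject
set routing-options aggregate route 10.0.0.0/8 policy AGG-10-CONTRIBUTORS
{% endhighlight %}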

Core-facing links first!

The outage was caused by the fact that the links to the data centre fabric were connected before the core-facing links. If they had been done the other way around, and core connectivity as part of the MPLS domain had been confirmed before connecting any client-facing links, the issue would have been avoided.

Core-facing links are more important than client-facing - most clients will have redundancy via other nodes, and even if they don’t, until your core node has reachability to the rest of the core, it’s useless to its clients anyway.

Don’t neglect CoS

While not directly related to the outage itself, Ceph filling the links with non-essential traffic (compared to, say, production web or database traffic) led to outages for other applications that could no longer communicate. Some quality of service markings and traffic shaping or policing would have gone a long way towards mitigating the impact, by restricting non-business-critical traffic to a subset of the link. This matters less (but is still useful) on the upgraded 100Gbps connections, but our single 10Gbps link couldn’t cope.
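A rough sketch of the sort of thing I mean (the interface names, prefixes and percentages are all made up - a real CoS design needs far more thought than this):

{% highlight bash %}
# Illustrative sketch: classify storage replication traffic into its own forwarding
# class and cap it at a fraction of the core link so it cannot starve everything else.
set class-of-service forwarding-classes class BULK queue-num 1
set class-of-service schedulers BULK-SCHED transmit-rate percent 20
set class-of-service schedulers BULK-SCHED priority low
set class-of-service scheduler-maps CORE-LINKS forwarding-class BULK scheduler BULK-SCHED
set class-of-service interfaces xe-0/0/0 scheduler-map CORE-LINKS

# Match the replication traffic (hypothetical source range) and mark it as BULK.
set firewall family inet filter CLASSIFY-BULK term storage from source-address 10.20.30.0/24
set firewall family inet filter CLASSIFY-BULK term storage then forwarding-class BULK
set firewall family inet filter CLASSIFY-BULK term storage then accept
set firewall family inet filter CLASSIFY-BULK term default then accept
set interfaces xe-0/0/1 unit 0 family inet filter input CLASSIFY-BULK
{% endhighlight %}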