NIC Card Soup Gives LAX A Tummy-Ache

Systems integrators, especially government contractors, are following the story closely. News about what went wrong at LAX has been vague, but ChannelWeb has learned some specific details about last weekend's outage on a local area network (LAN) operated by U.S. Customs and Border Protection (CBP) to process international travelers at LAX's Tom Bradley International Terminal.

Meanwhile, Customs plans to accelerate network upgrades at the country's top ports of entry and is bumping up a previously scheduled upgrade for LAX from 2008 to the next 90 days, said a CBP spokesperson.

"We're looking at the top 104 locations, with the major locations at the top of the list, like LA, Atlanta, NY, Miami. We're going to replace workstations, routers and switches and install Ku-band satellite backup for communications," said Ken Ritchhart, acting assistant commissioner of the Office of Information Technology at CBP in Washington D.C.

"It'll take us between six and nine months. LAX is at the top of the list for upgrading. Hopefully, we'll get the first portion done in the next 60 days, and about 90 days for the cabling."

Last Saturday's outage started with a malfunctioning network interface card (NIC) on a single workstation on the CBP's local network, according to the agency. The card suffered a "partial failure" at about 12:50 p.m. PST and began streaming a continuous flood of packets onto the network, affecting other networked NICs and causing a "data storm" that took the entire LAN offline, according to Ritchhart. Later, a switch on the network also failed, compounding the problem, he said.
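
How a single streaming card can smother a shared network is easier to see with a toy model. The short Python sketch below is purely illustrative and assumes a simplified slotted, shared-medium LAN rather than CBP's actual topology: one node that transmits in every slot collides with every healthy frame, so useful throughput for everyone else collapses to zero.

    import random

    SLOTS = 10_000   # time slots to simulate
    NODES = 80       # healthy workstations, roughly the LAN size Ritchhart describes below
    LOAD = 0.01      # each healthy node tries to send in about 1% of slots

    for jabbering in (False, True):
        delivered = 0
        for _ in range(SLOTS):
            healthy = sum(random.random() < LOAD for _ in range(NODES))
            senders = healthy + (1 if jabbering else 0)  # the failed NIC sends every slot
            if senders == 1 and healthy == 1:
                delivered += 1  # exactly one healthy sender: the frame gets through
        print(f"jabbering NIC present: {jabbering}; healthy frames delivered: {delivered}")

On a fully switched network, by contrast, a storm like this would normally be confined to a single port, which is why several sources quoted below find a LAN-wide failure so puzzling.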

Asked to name the manufacturer of the NIC card, Ritchhart said he didn't have that information immediately available. Dozens of companies make network interface cards.

Unable to access the national law enforcement databases used to screen passengers, and bound by a zero-tolerance policy against processing travelers without running their details through those filters, CBP's Los Angeles field office shut down Customs at LAX. It took more than 11 hours for the agency to get its network back up and running, leaving thousands of passengers stranded in the terminal and in airplanes on the tarmac.

So what went wrong?

Speculation on the blogs has centered on various aspects of the network, with some guessing at an outdated topology like Token Ring, ancient infrastructure or nonexistent backup systems. None of those assumptions is precisely accurate, said Ritchhart.

The CBP runs a TCP/IP Ethernet LAN of some 80 to 100 nodes at LAX, Ritchhart said, with servers and workstations that are four years old, routers and switches that are six years old, and wiring that is 20 years old. That's not as ancient as some have speculated, but all of that infrastructure will be upgraded in the next three months, he said.

Meanwhile, there is a backup system for loss of connectivity to the Washington, D.C.-based databases (or their backups in Mississippi), but it also operates over the local network, and there is no redundancy in place for a LAN failure like the one that happened last weekend.

"We switched over to the backup laptop capabilities, but we couldn't handle nearly the volume of passengers that we had. In case we lose connectivity to D.C., we have a local copy of the database, but in this case we couldn't access the local copy because the local network was down," Ritchhart said.

Misdiagnosis of the problem cost hours as support staff searched for the wrong fix. When the network went down, the first outside support call went to Sprint, CBP's service provider.

"Customs thought it was their routers," said a Sprint spokesperson, who described a timeline in which Sprint tested the lines remotely, then sent a Sprint technician on site to run more tests, and finally concluded after some six hours that the transmission infrastructure was sound and it must be a LAN issue. According to the spokesperson, a Sprint engineer identified the problem with the single desktop and its rogue NIC card.

"It was a senior-level engineer, and doing it remotely, he was able to help out and I.D. what the specific problem was," she said.

Several sources contacted for this story guessed that human failings were a bigger part of the outage than technological ones.

"As with most IT meltdowns, this situation has 'management systems failure' written all over it," wrote tech consultant Michael Krigsman on his blog.

"Regardless of whether it was a router, switch or NIC card, the cause was a routine breakdown in commonplace, low-cost equipment. Personally, I vote for gross incompetence," Krigsman, the CEO of Brookline, Mass.-based software and consulting company Asuret, told ChannelWeb.

Ritchhart was not so blunt, but he said CBP would be looking at IT staffing at ports of entry in addition to equipment and infrastructure.

"We have a root cause analysis going through it. We're looking at all the different things. We take it very seriously. We're looking at staffing procedures as well as remote diagnostics. Were still doing the analysis, and we've set a tiger team working on it. They'll put out a report next week," he said.

Customs will also upgrade uninterruptible power supplies (UPS) at locations that need them, Ritchhart added. On Sunday, CBP's LAN suffered a second, much shorter outage due to a power supply problem.

Some sources were baffled that a single NIC card could have caused so much trouble without being quickly isolated and taken off the network. Ritchhart characterized the cause of the outage as a low-probability event, but pledged to improve diagnostic capabilities at both the human and technological levels to prevent such a head-scratching incident from happening again.

"Usually when a card like that malfunctions it shuts down. In this case, it went crazy and brought down the network. We initially thought it was a circuit outage, and that was incorrect. We're putting in better diagnostic capabilities. Our goal is to make sure it never happens again," he said.

Dave Casey of Westron Communications has trouble understanding how a malfunctioning NIC card could take down a whole network.

"You can have a NIC card that streams, that locks and keeps transferring data. The article I read said it started with one card and then spread to other cards. That doesn't sound kosher to me. It'll kind of dominate its little section of the network, but I don't see where it would take down the entire LAN. At least, I haven't seen that in this century," he said.

"The ethernet switches that are sold today, you can isolate issues down to the port level, lock that port out, and you can automate it with scripts," added Casey, a principal at the Dallas, Tex.-based networking and voice VAR.

ROI Networks CEO Jeff Hiebert thinks the low-probability characterization by Ritchhart could be the key to understanding what happened.

"There must have been a kind of anomaly between what happened between that NIC card and the LAN," said Heibert, whose San Juan Capistrano, Calif.-based company also specializes in networking and voice.

Often, trying to fix a network problem can make it worse, said Zeus Kerravala, a networking analyst at the Yankee Group.

"We've done research that shows that when you troubleshoot a problem, 90 percent of the time it compounds the problem. And if your network's not well managed, that can easily drive up the hours offline," he said.

He said a lone NIC card could take an older network down.

"If you look at the historical net management systems, they alert you that there is a problem, but not what the problem is. The newer networks are a lot more intelligent. With a lot of the older networks, the ones that work off shared infrastructure, you open yourself up to this kind of thing. So one NIC card can flood the whole network and bring it to its knees," he said.

Bad, or badly configured, systems management software could have played a part as well, said a spokesperson for software vendor LANDesk.

"If they have the correct systems management software coupled with the right policies, there should be an automated fail-over capability and that could have prevented this kind of outage," he said.

For his part, the CBP's Ritchhart said the onus is on his agency to get the right technology and staff in place at LAX and elsewhere to get passengers through Customs efficiently, while securing the borders at the highest level possible.

"It's a balancing act between preserving security and getting those people on and off the airplanes," he said.