Author Topic: The BA IT collapse  (Read 14862 times)

Jaded

  • The Codfather
  • Formerly known as Jaded
Re: The BA IT collapse
« Reply #50 on: 29 May, 2017, 08:57:26 am »
Plus as others have said when they no longer value those who built their legacy, didn't encourage and promote knowledge sharing, and lay them off; then they are left with inexperienced staff who are ok when operations are working as normal but have no experience with dealing with a DR situation.

Applies well beyond IT, too.
Someone was recently telling me about how BR/Railtrack lost the knowledge of where a lot of its signalling cables were. Supposedly it had never had a full inventory because local teams repaired, replaced etc. & had never logged everything centrally. Come privatisation, a lot of those people were laid off - & either weren't asked or didn't want to say where everything was. Paper records may or may not have existed, but where? Knowledge may have been only in heads that were no longer employed.

On Friday I heard something similar about the water management system for a stretch of canal. The people who maintained it had all been got rid of, & went quietly. Years later, an enthusiastic young new bloke met one of the old guys at some heritage thing, & the old bloke took a liking to him - & they went for a walk round & talk through. The oldster remembered it all, & enjoyed describing it.

I think you'll find it wasn't just cable runs they lost. As a result of something I did for Railtrack I was offered a place on an Asset Day in 1999. This was a senior person and a handful of other managers going round the area looking at assets. They were ostensibly checking that assets (like bridges, tunnels, etc) had been counted properly. They didn't know how many they had, so had a programme of counting them. It was great fun: we saw a modern signal box operating in Newport and a really, really old one operating (and got to pull the levers), and then we went into the Severn Tunnel, seeing the controlled raging water and standing on the trackside as 125s dashed past. A few months later they killed a load of people with their asset inadequacies.
It is simpler than it looks.

Re: The BA IT collapse
« Reply #51 on: 29 May, 2017, 10:36:04 am »
Quote
BA blames a power outage, but a corporate IT expert said it should not have caused "even a flicker of the lights" in the data-centre.
Even if the power could not be restored, the airline's Disaster Recovery Plan should have whirred into action. But that will have depended in part on veteran staff with knowledge of the complex patchwork of systems built up over the years.
Many of those people may have left when much of the IT operation was outsourced to India.
One theory of the IT expert, who does not wish to be named, is that when the power came back on the systems were unusable because the data was unsynchronised.
In other words the airline was suddenly faced with a mass of conflicting records of passengers, aircraft and baggage movements - all the complex logistics of modern air travel.

From university days I remember that the first airline control system was SABRE (Semi-Automated Business Research Environment), now between 50 and 60 years old and still in use.  It is American and its history is one of cut-throat competition.  BA's system was called Travicom; it was launched in 1976 and evolved into the current system, Galileo, 12 years later.

What's been added to these venerable systems is the security element, or pre-screening of passengers.  The first add-on was CAPPS in the late 1990s, which simply checked names against an FBI database.  After 9/11 highlighted its shortcomings it was replaced by CAPPS II, which checked everything about everyone everywhere and reportedly nominated Edward Kennedy as a security risk in 2004.  It was replaced in turn by 'Secure Flight', a pre-screening system commissioned by the US Department of Homeland Security, but this didn't kick in till 2010.  At this point everything gets very tricky.  For a start there are at least two lists to be used, one with a mere 47k names, the other with 2.5m identities believed to cover 1.8m people.  With a paranoid POTUS, it's easy to imagine systems developers and maintainers tearing their hair out.

And how this security aspect ties in with outsourcing is anyone's guess.
Move Faster and Bake Things

Jaded

  • The Codfather
  • Formerly known as Jaded
Re: The BA IT collapse
« Reply #52 on: 29 May, 2017, 08:46:28 pm »
My understanding of outsourcing is that if the contracted company says its employees are vetted to the required standards, then you are covered...
It is simpler than it looks.

Chris S

Re: The BA IT collapse
« Reply #53 on: 29 May, 2017, 09:27:37 pm »
My understanding of outsourcing is that if the contracted company says its employees are vetted to the required standards, then you are covered...

Really? Wow.

"Can you keep a secret?"
"Yes"
"Oh well, you're in then!"

Jaded

  • The Codfather
  • Formerly known as Jaded
Re: The BA IT collapse
« Reply #54 on: 29 May, 2017, 10:03:08 pm »
Exactly.

Unless the contracting company wishes to go and vet all the employees for skills and clearance themselves. Goodness, they might even be interested in the way those staff are managed. In which case, what's the point of outsourcing?
It is simpler than it looks.

Re: The BA IT collapse
« Reply #55 on: 30 May, 2017, 07:57:38 pm »
Caught a bit of it being discussed with an airline IT expert on Radio 5 Live this evening.

Seems it was the integration between BA's systems and the Amadeus system that handles the boarding side of bookings/luggage-tracking/etc.

It was definitely on the BA side as Amadeus is used by many different airlines at those airports and no other airlines had a problem with that during the same period.

So, whatever BA had written to integrate their own systems and Amadeus goes wonky, the backup/HA/DR system flails-over in exactly the same way and they're left up shit creek until engineers can wrestle something back to life.

If it was power surge related then:-
a) Why wasn't the system behind some level of power conditioning?
b) Surely the primary and secondary systems aren't in the same DC/location?
c) They obviously don't have a playbook for recovering from hardware failure of both systems...
d) I'm assuming they had a secondary system for this integration! Surely it wasn't a SPOF!?!
"Yes please" said Squirrel "biscuits are our favourite things."

Re: The BA IT collapse
« Reply #56 on: 01 June, 2017, 05:30:08 pm »
'They' are still clinging to the power surge explanation.

I'm baffled by this because here in rural France we are told to expect power surges for various reasons.  The one that took out my own computer was a lightning strike a few years ago.

However, my electrician tells me that four hundred euros would buy me a blast-proof surge protector, very unlike the mickey mouse jobs from PC World*.

Being parsimonious I declined his offer and just unplug when I hear thunder.  But why didn't BA have such a device in place or if they did why didn't it work?  Was it a cost-saving PC World job?  If it was a third party installed set up isn't there insurance cover?

* I had one; it didn't save my laptop and unfortunately I'd lost the receipt so could not claim.
Move Faster and Bake Things

Kim

  • Timelord
    • Fediverse
Re: The BA IT collapse
« Reply #57 on: 01 June, 2017, 05:41:12 pm »
A datacentre UPS ought to protect against surges (up to and including nearby lightning strikes) on the mains supply, or normal forms of generator misbehaviour.

It's always possible that the UPS itself (or equipment downstream of it) failed in a way that caused a less impressive 'surge' (repeated inrush current in a short period of time, spikes from inductive A/C loads, or maybe just really out-of-spec power) and damaged a whole load of critical kit.  Maybe some sheepish engineer accidentally swapped a phase and a neutral and is keeping the now legendary low profile, or something.  In nearly all forms of human endeavour there's potential for everything to go massively wrong as long as more than one bad thing happens at the same time.

Obviously things ought to be sufficiently redundant they should be able to continue with the total loss of a single site, but we've already covered cost-cutting and the risks of subtle failure modes that are hard to predict and test for.

Re: The BA IT collapse
« Reply #58 on: 02 June, 2017, 04:48:16 pm »
This Times article points to human error

"A power supply unit at the centre of last weekend’s British Airways fiasco was in perfect working order but was deliberately shut down in a catastrophic blunder, The Times has learnt.

An investigation into the incident, which disrupted the travel plans of 75,000 passengers, is likely to centre on human error rather than any equipment failure at BA, it emerged."

https://www.thetimes.co.uk/article/ba-power-fiasco-blamed-on-staff-blunder-tbfhxwsw2

Not sure I'm convinced

Edit - A bit more detail from the Guardian

https://www.theguardian.com/business/2017/jun/02/ba-shutdown-caused-by-contractor-who-switched-off-power-reports-claim
“There is no point in using the word 'impossible' to describe something that has clearly happened.”
― Douglas Adams

Ben T

Re: The BA IT collapse
« Reply #59 on: 03 June, 2017, 12:33:05 am »
A datacentre UPS ought to protect against surges (up to and including nearby lightning strikes) on the mains supply, or normal forms of generator misbehaviour.

It's always possible that the UPS itself (or equipment downstream of it) failed in a way that caused a less impressive 'surge' (repeated inrush current in a short period of time, spikes from inductive A/C loads, or maybe just really out-of-spec power) and damaged a whole load of critical kit.  Maybe some sheepish engineer accidentally swapped a phase and a neutral and is keeping the now legendary low profile, or something.  In nearly all forms of human endeavour there's potential for everything to go massively wrong as long as more than one bad thing happens at the same time.

Obviously things ought to be sufficiently redundant they should be able to continue with the total loss of a single site, but we've already covered cost-cutting and the risks of subtle failure modes that are hard to predict and test for.

You tend to think with the rise of "the cloud" that all big companies use good data centres, but it's amazing how many don't. I once went on a visit to an Azure data centre and its UPS is... impressive, to say the least. 8 rooms each containing not just a generator but a full-on ship engine (yes, they're massive). They have to be started up every week just to be tested, and it's quite a big job just starting them up; you don't just turn a key. Another room contains nothing but weird batteries, where the acid is visible and sloshing about in open trays. Loads of control rooms containing nothing but switches, some of them in the 6 figures of volts. The whole thing's more about power than computer equipment - and that's the only basis on which it's charged to (corporate, i.e. wholesale) customers.

Re: The BA IT collapse
« Reply #60 on: 03 June, 2017, 07:56:48 am »
Quote
The whole thing's more about power than computer equipment

That seems reasonable.  However what you describe doesn't suggest that it's easy to lose power, and a bit of further reading about battery rooms confirms that it is a highly developed area of power supply with a long history.  Some of the earliest battery rooms go back to the 19th century, and there was of course extensive deployment on board even the early U-boats.  In the '70s we monitored the radio traffic for the port of Harwich.  Even we had a battery room in case of power cuts!

In fact the more they claim it was a power failure, the less credible it becomes.

Meanwhile on Ryanair's Twitter page: [image not preserved]

Move Faster and Bake Things

Thor

  • Super-sonnicus idioticus
Re: The BA IT collapse
« Reply #61 on: 03 June, 2017, 06:21:54 pm »
This Times article points to human error

"A power supply unit at the centre of last weekend’s British Airways fiasco was in perfect working order but was deliberately shut down in a catastrophic blunder, The Times has learnt.

An investigation into the incident, which disrupted the travel plans of 75,000 passengers, is likely to centre on human error rather than any equipment failure at BA, it emerged."

https://www.thetimes.co.uk/article/ba-power-fiasco-blamed-on-staff-blunder-tbfhxwsw2

Not sure I'm convinced

Edit - A bit more detail from the Guardian

https://www.theguardian.com/business/2017/jun/02/ba-shutdown-caused-by-contractor-who-switched-off-power-reports-claim

A few years ago, I was working for a Large Financial Institution in which, on a weekday, an electrical contractor drilling a hole in a wall managed to trigger the kill switch for the entire Data Centre.  1200 servers simultaneously experienced an unscheduled outage.  Recovering from that was fun for all of us...
It was a day like any other in Ireland, only it wasn't raining

Re: The BA IT collapse
« Reply #62 on: 04 June, 2017, 01:30:39 am »
Ryanair's apparent hubris seems to be based on their own operation.  They claim to have systems operating in triplicate, sharing changes to data.  Each system can operate independently if the other two are knocked out.  I guess there might be a point at which they converge, the weakest link, maybe simply the user interface or web presence?  Or maybe even that can be triplicated!  I'm now worried their aircraft have only two engines.
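For anyone who wants to put rough numbers on the "weakest link" point, here is a minimal sketch. It assumes the three systems fail independently of each other and sit behind one shared, unreplicated component (say, the web front end). The function names and the availability figures are made up for illustration; they are not Ryanair's or BA's real numbers.

```python
def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies is up."""
    return 1.0 - (1.0 - replica_availability) ** replicas


def end_to_end_availability(replica_availability: float, replicas: int,
                            shared_availability: float) -> float:
    """Replicated back end in series with one shared, unreplicated component."""
    return shared_availability * parallel_availability(replica_availability, replicas)


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the shape of the result.
    back_end = parallel_availability(0.99, 3)
    whole = end_to_end_availability(0.99, 3, 0.999)
    print(f"three replicas alone:        {back_end:.6f}")
    print(f"behind one shared front end: {whole:.6f}")
```

Under those assumptions the overall figure can never be better than the shared component on its own, however many copies sit behind it. And if the copies do not fail independently (same power feed, same bad data replicated to all three), the first function is too optimistic.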
Move Faster and Bake Things

Re: The BA IT collapse
« Reply #63 on: 04 June, 2017, 06:59:55 am »
Now that the outage has been resolved BA should now start reimbursing and compensating passengers.

Vince

  • Can't climb; won't climb
Re: The BA IT collapse
« Reply #64 on: 04 June, 2017, 09:57:02 am »
If the issue was at an integration point to a third party it can be troublesome to have duplication there. In my experience of doing B2B integration some parties weren't the most practical, wanting to use IP addresses instead of URLs, having one integration point for both test and live systems etc, all making hot switch over... challenging.
(Note - I'm not a network person so the above may contain traces of inexactitude.)
216km from Marsh Gibbon

mattc

  • n.b. have grown beard since photo taken
    • Didcot Audaxes
Re: The BA IT collapse
« Reply #65 on: 04 June, 2017, 05:30:48 pm »
Only just found this thread - I've been jet-lagged/audax-lagged/BA-lagged all week!

Took us 3 days to get back from Nice on Saturday. This was my first massive air-delay, and although it was a goody, I'm grateful that we didn't go through the piles-of-bodies-in-a-departure-lounge scenario. I guess BA wanted to avoid that at all cost due to the horrible PR.

ANYWAY ... partner (who wasn't on the trip!) rather "helpfully" found this Dr Seuss version of the BA statement to "cheer me up":

Statement from British Airways

The plane is not leaving
We’re sorry to say.
And check-in is heaving;
It will be all day!
Your plane to Barbados
Is stuck in Mumbai.
Your bag’s in Honduras.
The crew’s in Dubai.
“We’re sorry! We’re sorry!”
What else can we say?
There’s no need to worry,
It’s just a four-day delay.
The reason was clear,
A technical crash.
There’s nothing to fear,
Not a question of cash.
“Cutbacks caused the disaster”.
Ignore those who whine
Our IT is fine
Based in west Maharashtra.
We’ll soon get you flying
There’s no reason to fret.
Oh please stop your crying.
You could try EasyJet!

 :D



My one rant - probably ill-informed: we were ready to fly, on the plane. Surely they could sort us out back at Heathrow? Heathrow air-traffic-control could get the plane down, they don't care about BA's tickets/luggage/passengers, we'd all been security-checked. So why couldn't we fly?

It won't help me, but I'd like to hear an answer!
Has never ridden RAAM
---------
No.11  Because of the great host of those who dislike the least appearance of "swank " when they travel the roads and lanes. - From Kuklos' 39 Articles

Re: The BA IT collapse
« Reply #66 on: 04 June, 2017, 06:03:58 pm »
The thing is you just think it's so easy.  You don't realise what's involved:


Quote
"Transtellar Cruise Lines would like to apologize to passengers for the continuing delay to this flight. We are currently awaiting the loading of our complement of small lemon-soaked paper napkins for your comfort, refreshment and hygiene during the journey. Meanwhile we thank you for your patience. The cabin crew will shortly be serving coffee and biscuits again."

[...]

...he suddenly caught sight of a giant departure board still hanging, but by only one support, from the ceiling above him. It was covered with grime, but some of the figures were still discernible. [...] 'Nine hundred years...' he breathed to himself. That's how late this ship was.

[...]

In every seat sat a passenger, strapped into his or her seat. The passengers' hair was long and unkempt, their fingernails were long, the men wore beards. All of them were quite clearly alive - but sleeping.

[...]

'You're the autopilot?' said Zaphod
'Yes,' said the voice from the flight console.
'You're in charge of this ship?'
'Yes,' said the voice again, 'there has been a delay. Passengers are to be kept temporarily in suspended animation, for their comfort and convenience. Coffee and biscuits are served every year, after which passengers are returned to suspended animation for their continued comfort and convenience. Departure will take place when the flight stores are complete. We apologize for the delay.'

[...]

'Delay?' he cried. 'Have you seen the world outside this ship? It's a wasteland, a desert. Civilization's been and gone, man. There are no lemon-soaked paper napkins on the way from anywhere!'
'The statistical likelihood, ' continued the autopilot primly, 'is that other civilizations will arise. There will one day be lemon-soaked paper napkins. Till then there will be a short delay. Please return to your seat.'

You were lucky.
Move Faster and Bake Things

ElyDave

  • Royal and Ancient Polar Bear Society member 263583
Re: The BA IT collapse
« Reply #67 on: 05 June, 2017, 07:10:44 am »
Quote

You tend to think with the rise of "the cloud" that all big companies use good data centres, but it's amazing how many don't. I once went on a visit to an Azure data centre and its UPS is... impressive, to say the least. 8 rooms each containing not just a generator but a full-on ship engine (yes, they're massive). They have to be started up every week just to be tested, and it's quite a big job just starting them up; you don't just turn a key. Another room contains nothing but weird batteries, where the acid is visible and sloshing about in open trays. Loads of control rooms containing nothing but switches, some of them in the 6 figures of volts. The whole thing's more about power than computer equipment - and that's the only basis on which it's charged to (corporate, i.e. wholesale) customers.

which keeps me in one stream of work.  The installed capacities at those data centres are huge and, although the kit is rarely fired up other than for testing, they all fall under the EU Emissions Trading legislation, which affects those with combustion kit > 20 MWth input. Assuming 40% efficiency, that's effectively 8 MW of electrical generation-ish.
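The back-of-envelope sum above, written out as a small sketch. The 20 MWth threshold and the 40% efficiency are the figures from this post; the function name is just an illustrative choice, and the 40% is an assumption rather than a measured value for any particular generator.

```python
def electrical_output_mw(thermal_input_mw: float, efficiency: float) -> float:
    """Electrical output (MW) from a given thermal input (MWth) at a given efficiency."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the range (0, 1]")
    return thermal_input_mw * efficiency


if __name__ == "__main__":
    # 20 MWth threshold at an assumed 40% conversion efficiency.
    print(f"{electrical_output_mw(20.0, 0.40):.1f} MW electrical")
```

A less efficient generator set would cross the same thermal threshold at a lower electrical rating, so the 8 MW figure is only as good as the efficiency guess.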
“Procrastination is the thief of time, collar him.” –Charles Dickens

mattc

  • n.b. have grown beard since photo taken
    • Didcot Audaxes
Re: The BA IT collapse
« Reply #68 on: 05 June, 2017, 07:59:21 am »
The thing is you just think it's so easy.  You don't realise what's involved:


Quote
"Transtellar Cruise Lines would like to apologize to passengers for the continuing delay to this flight. We are currently awaiting the loading of our complement of small lemon-soaked paper napkins for your comfort, refreshment and hygiene during the journey. Meanwhile we thank you for your patience. The cabin crew will shortly be serving coffee and biscuits again."

[...]

...he suddenly caught sight of a giant departure board still hanging, but by only one support, from the ceiling above him. It was covered with grime, but some of the figures were still discernible. [...] 'Nine hundred years...' he breathed to himself. That's how late this ship was.

[...]

In every seat sat a passenger, strapped into his or her seat. The passengers' hair was long and unkempt, their fingernails were long, the men wore beards. All of them were quite clearly alive - but sleeping.

[...]

'You're the autopilot?' said Zaphod
'Yes,' said the voice from the flight console.
'You're in charge of this ship?'
'Yes,' said the voice again, 'there has been a delay. Passengers are to be kept temporarily in suspended animation, for their comfort and convenience. Coffee and biscuits are served every year, after which passengers are returned to suspended animation for their continued comfort and convenience. Departure will take place when the flight stores are complete. We apologize for the delay.'

[...]

'Delay?' he cried. 'Have you seen the world outside this ship? It's a wasteland, a desert. Civilization's been and gone, man. There are no lemon-soaked paper napkins on the way from anywhere!'
'The statistical likelihood, ' continued the autopilot primly, 'is that other civilizations will arise. There will one day be lemon-soaked paper napkins. Till then there will be a short delay. Please return to your seat.'

You were lucky.
Good old Douglas. He wrote that nearly 40 years ago, and it damn near just came true for me!
Has never ridden RAAM
---------
No.11  Because of the great host of those who dislike the least appearance of "swank " when they travel the roads and lanes. - From Kuklos' 39 Articles

Ben T

Re: The BA IT collapse
« Reply #69 on: 05 June, 2017, 08:14:24 pm »
Quote
The whole thing's more about power than computer equipment

That seems reasonable.  However what you describe doesn't suggest that it's easy to lose power.



Yeah, but this is an Azure DC. I'm suggesting BA are a lot more cheapskate than to fork out for Azure, or a similar standard, and thus their UPS is far less advanced, which is why this has happened to them.

Phil W

Re: The BA IT collapse
« Reply #70 on: 05 June, 2017, 08:20:32 pm »
Sounds like a whole series of bad decisions were made at BA. Even if they had crappy power at one DC their backup systems should have been in an entirely separate physical location with no chance of a power surge or otherwise taking everything out.

Ben T

Re: The BA IT collapse
« Reply #71 on: 05 June, 2017, 10:20:37 pm »
How do you know they've got more than one dc?

Phil W

Re: The BA IT collapse
« Reply #72 on: 06 June, 2017, 12:08:34 am »
How do you know they've got more than one dc?

Because it's been standard practice for a company of their size for decades, hence the "should". If they didn't, they have made even bigger blunders and bad decisions over the years than you'd have thought possible.

Re: The BA IT collapse
« Reply #73 on: 06 June, 2017, 01:45:27 am »
How do you know they've got more than one dc?

Quote
the BBC's transport correspondent, Richard Westcott, has spoken to IT experts who are sceptical that a power surge could wreak such havoc on the data centres.
BA has two data centres about a kilometre apart. There are question marks over whether a power surge could hit both. Also, there should be fail-safes in place, our correspondent said.

http://www.bbc.com/news/business-40159202

Quote
All the big installations have back-up power. If the mains fails, a UPS (uninterruptable power supply) kicks in. It's basically a big battery that keeps things ticking over until the power comes back on, or a diesel generator is fired up.
This UPS is meant to take the hit from any "surge", so the servers don't have to.
All the big servers and large routers, I'm told, also have dual power supplies fed from different sources.

...up until a while ago, British Airways' IT systems had a variety of safety nets in place to protect them from big dumps of uncontrolled power, and to get things back on their feet quickly if there was any problem. I'm assuming those safety nets are still there, so why did they fail? And did human error play a part in all this?

...If BA wants to repair its reputation, its owner IAG needs to convince the public that making hundreds of IT staff redundant last year did not leave them woefully short of experts who could have fixed the meltdown sooner. And that it won't happen again - at least not on this epic scale.

http://www.bbc.com/news/business-40117381

Given the direction the company is taking, a move to compete at the budget end of the market, I suspect Cruz is CEO because he is an axeman not a quality controller.  He's not there to win a popularity contest; he probably has a skin like a rhino.
Move Faster and Bake Things

Re: The BA IT collapse
« Reply #74 on: 06 June, 2017, 08:01:27 am »
Here's an interesting piece of informed comment on the cause; apparently 200 linked systems in the critical path.....

https://www.theregister.co.uk/2017/06/05/british_airways_critical_path_analysis/

....making the failure a "when" rather than "if"