Author Topic: The BA IT collapse  (Read 14925 times)

mattc

  • n.b. have grown beard since photo taken
    • Didcot Audaxes
Re: The BA IT collapse
« Reply #75 on: 06 June, 2017, 08:53:50 am »
...<snip>


http://www.bbc.com/news/business-40117381

Given the direction the company is taking, a move to compete at the budget end of the market, I suspect Cruz is CEO because he is an axeman not a quality controller.  He's not there to win a popularity contest; he probably has a skin like a rhino.

Cruz has already supervised a similar meltdown at a Spanish (?) airline. His career seemed to survive that OK.

And his hero is Ryanair boss Michael O’Leary ...
Has never ridden RAAM
---------
No.11  Because of the great host of those who dislike the least appearance of "swank" when they travel the roads and lanes. - From Kuklos' 39 Articles

Re: The BA IT collapse
« Reply #76 on: 06 June, 2017, 09:30:28 am »
As a change manager I was the one in the team who had an understanding of the workflow system in our organisation.  It wasn't a huge understanding but I knew the manager whose baby it was and had taken a keen interest in the meetings where he had explained it.

Consequently I knew the components in the system that every task had to reference and get 'permission' from and thus any changes to these components were very high risk.  My colleagues in the team had not been given such training and came to me if in any doubt.  On several occasions I would not sign off changes because I did not accept the testing that had been done.  Sometimes this meant standing up to managers much more senior to me and telling them No!   It all worked pretty well really and the number of user hours lost to IT outages was steadily reducing.

Then it was decided that our work was to be outsourced to Wipro in India.  Previously we'd been told our jobs were safe, then there was a big meeting at which we were told that no, they weren't: who wants to be made redundant, tick here.  Having little confidence in my future with the company I didn't hesitate to step forward and made my exit.

Through my role I knew others in the company whose job it was to investigate, record and report IT outages.  In the months following the outsourcing the company IT systems had very many problems, at least one making the news.  The IT director was dismissed (they'd also constructively sacked the guy who designed the crucial system and his vital side-kick whom everyone had thought could never be sacked).   User hours lost to IT outages soared and business users were very fed up.  As far as I can recall there were no total power failures during this period.

In a nutshell the management had overseen the dismissal of staff who'd had a key understanding of the systems and handed them over to a company which had many clients and had only a second hand knowledge based on what they were told by staff whose roles they were taking over. 

It's easy to imagine that scenario at BA.  I doubt if the 'executive' have a scooby doo about how their systems work!
Move Faster and Bake Things

Re: The BA IT collapse
« Reply #77 on: 06 June, 2017, 09:41:27 am »
From my perspective, Wipro were significantly worse than useless.
Rust never sleeps

ian

Re: The BA IT collapse
« Reply #78 on: 06 June, 2017, 09:56:29 am »
Well, BA made hundreds of their experienced and knowledgeable IT staff redundant and outsourced much of their work. But apparently that's nothing to do with it.

Possibly mutating neutrinos then.

Jaded

  • The Codfather
  • Formerly known as Jaded
Re: The BA IT collapse
« Reply #79 on: 06 June, 2017, 01:42:54 pm »
Hah! After Brexit they won't be allowed in. Mutating or not.
It is simpler than it looks.

Re: The BA IT collapse
« Reply #80 on: 08 June, 2017, 08:33:26 am »
Does British Airways' explanation stack up?

Quote
Willie Walsh who runs the airline's parent company, has offered a little more detail about why their computer system crash-landed last week.
Put simply, an "engineer" cut the data centre's power, messed up the reboot and fried the circuits, he has said.

His explanation has raised eyebrows amongst former British Airways IT workers I've spoken to.

The simple explanation for Walsh's explanation is that he doesn't know much about the computer systems which underpin his company's operations.

Top businessmen who suffer the defect of not understanding IT are dinosaurs who don't realise when it's time they became extinct.
Move Faster and Bake Things

David Martin

  • Thats Dr Oi You thankyouverymuch
Re: The BA IT collapse
« Reply #81 on: 08 June, 2017, 08:39:48 am »
Does British Airways' explanation stack up?

Quote
Willie Walsh who runs the airline's parent company, has offered a little more detail about why their computer system crash-landed last week.
Put simply, an "engineer" cut the data centre's power, messed up the reboot and fried the circuits, he has said.

His explanation has raised eyebrows amongst former British Airways IT workers I've spoken to.

The simple explanation for Walsh's explanation is that he doesn't know much about the computer systems which underpin his company's operations.

Top businessmen who suffer the defect of not understanding IT are dinosaurs who don't realise when it's time they became extinct.
I disagree, but a top businessman who does not seek out and use the best advice on mission critical systems? I don't think he has to understand them himself but he has to understand the importance and balance the business risk.
"By creating we think. By living we learn" - Patrick Geddes

Re: The BA IT collapse
« Reply #82 on: 08 June, 2017, 08:55:33 am »
I find the explanation lacking in some credibility.  It's surely inconceivable that the engineer can simply kill the whole system as well as the fail over, backup, redundancy, (call it what you will) just like that. 

BA don't want to unduly concern the markets and investors so they've been playing down what happened.   I'm going to stick my neck out and guess that a scheduled fail over test failed and their recovery routines failed too.

mattc

  • n.b. have grown beard since photo taken
    • Didcot Audaxes
Re: The BA IT collapse
« Reply #83 on: 08 June, 2017, 08:57:13 am »
That sounds like a lie that he thinks no one will bother to delve into - he's trying to blame a minor human error; make it sound like "just one of those things".
Has never ridden RAAM
---------
No.11  Because of the great host of those who dislike the least appearance of "swank" when they travel the roads and lanes. - From Kuklos' 39 Articles

Re: The BA IT collapse
« Reply #84 on: 08 June, 2017, 09:18:09 am »
I find the explanation lacking in some credibility.  It's surely inconceivable that the engineer can simply kill the whole system as well as the fail over, backup, redundancy, (call it what you will) just like that. 

I don't think the other system/DC was taken down by power. From what I've read it's more likely that the power was cut in one DC (they run the two DCs as active:active rather than primary:backup or DR style) after the UPS system was disabled, and then power restored in a panic (possibly botched, e.g. turned on, subsequent surge not protected by the disabled UPS, some machines go down again, data corruption, etc). When enough of these machines were back online they started to sync partially corrupted data between the two DCs and now you're in a world of trouble.

Having worked in DCs where there are strict procedures for doing things I can see exactly how this kind of thing can happen, because there's rarely any proper enforcement of those strict procedures. A bunch of stuff written down in a playbook is useless if people ignore it. A "needs two people to do something" rule is nigh on useless if both people are complacent. Mistakes will always happen. It's the same everywhere, look at "never events" in the NHS for example.

What BA lacked is a procedure/plan (that they test frequently) for dealing with corruption like this. You can bet they'll be working towards a plan (and then the regular testing of the plan will eventually succumb to the same complacency that caused the problem in the first place).
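
To make the sync problem above concrete, here is a minimal sketch in Python of one common safeguard: checking each record against a checksum stored at write time before copying it to the other site. It is an illustration only, not a description of how BA's systems work. The names `Record`, `make_record`, `is_intact` and `sync` are all hypothetical, and each "site" is just an in-memory dictionary.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    payload: bytes
    checksum: str  # SHA-256 hex digest computed when the record was written


def make_record(key: str, payload: bytes) -> Record:
    """Create a record and store the checksum of its payload alongside it."""
    return Record(key, payload, hashlib.sha256(payload).hexdigest())


def is_intact(record: Record) -> bool:
    """True if the payload still matches the checksum taken at write time."""
    return hashlib.sha256(record.payload).hexdigest() == record.checksum


def sync(source: dict[str, Record], target: dict[str, Record]) -> list[str]:
    """Copy intact records from source to target.

    Records that fail the checksum are not copied; their keys are
    returned so that someone can investigate them.
    """
    held_back = []
    for key, record in source.items():
        if is_intact(record):
            target[key] = record
        else:
            held_back.append(key)
    return held_back
```

With a check like this, a record damaged by an unclean power-up at one site is held back instead of overwriting the good copy at the other. It only catches corruption that happens after the checksum was taken, so it complements a tested recovery plan and does not replace one.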
"Yes please" said Squirrel "biscuits are our favourite things."

Re: The BA IT collapse
« Reply #85 on: 08 June, 2017, 09:20:39 am »
Let's apply logic.

A) Assume he told the truth. That means they only have one data centre with no DR, no fallback. Woah, that's some major failing there.

B) Assume he is outright lying. His job is to protect the value of the company, so his motivation to lie would be to protect his job or to protect the value of the company (or both). So the real reason for the failure would be something that would be very damaging if it came out. More damaging than implying (A).

Are there any other conclusions we can definitely derive from his statement, that are anything other than speculation?

[edit] I suppose we can just decide that he didn't have a clue wtf he was talking about so he was just spouting utter bullshit.
Marmite slave

Re: The BA IT collapse
« Reply #86 on: 08 June, 2017, 09:25:29 am »
C) He's paraphrasing and maybe putting a bit of spin on what was reported to him. So he thinks he is telling the truth.

It's the difference between truth and fact. He may completely believe what he is saying is true, but it's not what actually happened.

You can't expect accurate technical descriptions of the root cause of the problem from someone non-technical, nor can you expect that from a senior person in the company as a media release because it will almost certainly be misunderstood by the media and misreported.
"Yes please" said Squirrel "biscuits are our favourite things."

Re: The BA IT collapse
« Reply #87 on: 08 June, 2017, 11:13:40 am »
Does British Airways' explanation stack up?

Quote
Willie Walsh who runs the airline's parent company, has offered a little more detail about why their computer system crash-landed last week.
Put simply, an "engineer" cut the data centre's power, messed up the reboot and fried the circuits, he has said.

His explanation has raised eyebrows amongst former British Airways IT workers I've spoken to.

The simple explanation for Walsh's explanation is that he doesn't know much about the computer systems which underpin his company's operations.

Top businessmen who suffer the defect of not understanding IT are dinosaurs who don't realise when it's time they became extinct.
I disagree, but a top businessman who does not seek out and use the best advice on mission critical systems? I don't think he has to understand them himself but he has to understand the importance and balance the business risk.

I meant in the sense of perhaps knowing (or caring about) the difference between a UPS and USP rather than being a coder.   
Move Faster and Bake Things

ian

Re: The BA IT collapse
« Reply #88 on: 08 June, 2017, 12:21:01 pm »
I don't think it reasonable that executive leadership necessarily know the technical details but they should understand the need for the expertise of those who do understand the technical details. That seems to be the modern failing: executives flit between disparate companies and industries with little actual comprehension of how those businesses deliver their products and services. They live in a world circumscribed by spreadsheets, margins, and EBITAs, and where critical expertise is measured in headcount and salary expenditure. And failure simply means collecting the money and moving on to another executive position or a comfy retirement package. Management layer stacking also ensures senior management levels are very effectively insulated from the actual business. Even if they wanted to know, they're unlikely to find out.

Re: The BA IT collapse
« Reply #89 on: 08 June, 2017, 07:49:52 pm »
I find the explanation lacking in some credibility.  It's surely inconceivable that the engineer can simply kill the whole system as well as the fail over, backup, redundancy, (call it what you will) just like that. 

I don't think the other system/DC was taken down by power. From what I've read it's more likely that the power was cut in one DC (they run the two DCs as active:active rather than primary:backup or DR style) after the UPS system was disabled, and then power restored in a panic (possibly botched, e.g. turned on, subsequent surge not protected by the disabled UPS, some machines go down again, data corruption, etc). When enough of these machines were back online they started to sync partially corrupted data between the two DCs and now you're in a world of trouble.

Having worked in DCs where there are strict procedures for doing things I can see exactly how this kind of thing can happen, because there's rarely any proper enforcement of those strict procedures. A bunch of stuff written down in a playbook is useless if people ignore it. A "needs two people to do something" rule is nigh on useless if both people are complacent. Mistakes will always happen. It's the same everywhere, look at "never events" in the NHS for example.

What BA lacked is a procedure/plan (that they test frequently) for dealing with corruption like this. You can bet they'll be working towards a plan (and then the regular testing of the plan will eventually succumb to the same complacency that caused the problem in the first place).
Of the acronyms I understood BA, NHS, UPS, but couldn't work out what DC or DR represented.

Kim

  • Timelord
    • Fediverse
Re: The BA IT collapse
« Reply #90 on: 08 June, 2017, 07:50:25 pm »
Data Centre and Disaster Recovery

Re: The BA IT collapse
« Reply #91 on: 08 June, 2017, 07:51:12 pm »
Thanks Kim.

Re: The BA IT collapse
« Reply #92 on: 19 September, 2017, 02:41:00 pm »
Ryanair crew screw up

Bizarrely the CEO appears unable to find a way of blaming the computer system or power supply:

Quote
He said the airline did not have an overall shortage of pilots, but said they had "messed up" the rosters for September and October.
"This is our mess-up. When we make a mess in Ryanair we come out with our hands up," he said.
Move Faster and Bake Things

Re: The BA IT collapse
« Reply #93 on: 19 September, 2017, 02:56:06 pm »
Recruitment and retention seem to have a role to play.

http://www.cityam.com/272219/norwegian-has-been-quietly-poaching-raft-ryanair-pilots

Ryanair seems to employ at least some of its pilots on something that resembles a zero hours contract:

https://www.irishtimes.com/business/transport-and-tourism/german-inquiry-into-ryanair-pilot-work-status-extended-1.2418995


Morat

  • I tried to HTFU but something went ping :(
Re: The BA IT collapse
« Reply #94 on: 22 September, 2017, 12:52:06 pm »
Ryanair crew screw up

Bizarrely the CEO appears unable to find a way of blaming the computer system or power supply:

Quote
He said the airline did not have an overall shortage of pilots, but said they had "messed up" the rosters for September and October.
"This is our mess-up. When we make a mess in Ryanair we come out with our hands up," he said.

Definitely HR this time :)
Everyone's favourite windbreak