Yet Another Cycling Forum

General Category => The Knowledge => Ctrl-Alt-Del => Topic started by: Asterix, the former Gaul. on 27 May, 2017, 05:38:02 pm

Title: The BA IT collapse
Post by: Asterix, the former Gaul. on 27 May, 2017, 05:38:02 pm
http://www.bbc.com/news/uk-40069865

Quote
British Airways has cancelled all flights from Heathrow and Gatwick because of global computer problems.

It apologised for the "global system outage" and said there was no evidence of a cyber attack.

Mick Rix, GMB's national officer for aviation said: "This could have all been avoided.
"BA in 2016 made hundreds of dedicated and loyal IT staff redundant and outsourced the work to India... many viewed the company's actions as just plain greedy."

My employer took the same course of outsourcing in 2005.  It was not successful because the companies doing the work were also working for other clients and did not give priority to one client.  I understand that much of the work has returned to be done by UK employees.

Whatever the cause of this particular incident, I'd suggest that such critical operations should be maintained by dedicated staff working for the company.  Thus a high degree of urgency could be assured.
Title: Re: The BA IT collapse
Post by: Kim on 27 May, 2017, 05:40:27 pm
Cynically, it seems like a good excuse to ground all your planes in a hurry without causing a mass panic...
Title: Re: The BA IT collapse
Post by: Basil on 27 May, 2017, 05:48:15 pm
Cynically, it seems like a good excuse to ground all your planes in a hurry without causing a mass panic...

Blimey.  I hadn't thought of that.
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 27 May, 2017, 06:00:25 pm
Cynically, it seems like a good excuse to ground all your planes in a hurry without causing a mass panic...

That's the scary version :o

If so, when will it be safe to explain?
Title: Re: The BA IT collapse
Post by: ElyDave on 27 May, 2017, 06:26:41 pm
But for what reason?

If there was some kind of generalised threat, why only BA?
Title: Re: The BA IT collapse
Post by: Polar Bear on 27 May, 2017, 06:40:34 pm
Call to radio or tv station:

Hello, this is <codename>

There is a device on a British Airways plane.

Terrorists use code names and code words to get across their warnings.   I recall this from the days when the IRA was active.
Title: Re: The BA IT collapse
Post by: DaveReading on 27 May, 2017, 06:47:35 pm
Call to radio or tv station:

Hello, this is <codename>

There is a device on a British Airways plane.

If that was really the case then (a) passengers would not have been kept on board flights at the terminal, as has happened to many people this afternoon, and (b) inbound BA flights would not be landing at Heathrow but would be diverted, probably to Stansted.

There seems little doubt it is an IT issue, as reported.
Title: Re: The BA IT collapse
Post by: Veloman on 27 May, 2017, 06:55:38 pm
Call to radio or tv station:

Hello, this is <codename>

There is a device on a British Airways plane.

Terrorists use code names and code words to get across their warnings.   I recall this from the days when the IRA was active.

And those days have gone, as the aim these days is to kill and achieve the maximum casualty rate, unlike the IRA, who were keen to disrupt infrastructure and cause general damage to buildings.  Followers of IS do not use code words, and the asymmetric warfare they pursue would not benefit from code words and warnings.

That said, a hoax phone call stating a threat could cause the closure or withdrawal of facilities. It would be a big decision on someone's part to ignore such a threat.
Title: Re: The BA IT collapse
Post by: Veloman on 27 May, 2017, 06:57:16 pm
Cross-posted with DaveR - I totally agree with his comments.
Title: Re: The BA IT collapse
Post by: PaulF on 27 May, 2017, 06:57:32 pm
Call to radio or tv station:

Hello, this is <codename>

There is a device on a British Airways plane.

Terrorists use code names and code words to get across their warnings.   I recall this from the days when the IRA was active.

Think that was just the IRA, the current crop of terrorists don't seem to be doing the same. Don't think it's a good idea to generalise about terrorists. Or any other group for that matter.
Title: Re: The BA IT collapse
Post by: Kim on 27 May, 2017, 07:00:00 pm
...Or it could just be an infrastructure clusterfuck.  It is a bank holiday weekend, after all.
Title: Re: The BA IT collapse
Post by: Hot Flatus on 27 May, 2017, 07:02:34 pm
You should lay off the weed, Kim.

(https://assets.mubi.com/images/film/215/image-w856.jpg?1445915549)
Title: Re: The BA IT collapse
Post by: Polar Bear on 27 May, 2017, 07:14:45 pm
Call to radio or tv station:

Hello, this is <codename>

There is a device on a British Airways plane.

Terrorists use code names and code words to get across their warnings.   I recall this from the days when the IRA was active.

Think that was just the IRA, the current crop of terrorists don't seem to be doing the same. Don't think it's a good idea to generalise about terrorists. Or any other group for that matter.

I'm not generalising, just offering a possible explanation about behaviour.   We don't know and we're all hypothesising.   Probably more likely imo that BA has been hacked to be honest.
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 27 May, 2017, 08:02:35 pm
Quote
"We believe the root cause was a power supply issue."

My employer had backup generators in the basement.  Power supply issues just made the lights flicker.

What is more they had agreements with other major computer 'owners' that in the event of a crash there would be a switch-over to allow continuity. Resources were not isolated but pooled through people like IBM.

My last job was change management, i.e. to ensure that no changes were made unless they had been tested through three stages with a proper test plan approved by users.  Once our IT department 'bought into it' it worked very well.  When they decided to outsource I volunteered for redundancy, quite happily, as did many others.  I understand that once outsourced the system did not work well and the IT director was replaced.
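A minimal sketch of that kind of three-stage gate, for the curious - the stage names and record fields here are illustrative assumptions, not any real company's process:

Code:
# Sketch of a change-approval gate: a change only proceeds if every test stage
# has both passed and had its test plan approved by users.
REQUIRED_STAGES = ("unit", "system_integration", "user_acceptance")

def change_may_proceed(change: dict) -> bool:
    """Return True only if every required stage passed with a user-approved test plan."""
    for stage in REQUIRED_STAGES:
        result = change.get("test_results", {}).get(stage)
        if not result or not result.get("passed") or not result.get("plan_approved_by_users"):
            return False
    return True

# Example: a change missing user sign-off on the acceptance stage is rejected.
change = {
    "id": "CHG-1234",
    "test_results": {
        "unit": {"passed": True, "plan_approved_by_users": True},
        "system_integration": {"passed": True, "plan_approved_by_users": True},
        "user_acceptance": {"passed": True, "plan_approved_by_users": False},
    },
}
assert change_may_proceed(change) is False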

The simple explanation for BA's plight is that a similar thing has happened - they failed to understand the downside of outsourcing.
Title: Re: The BA IT collapse
Post by: Ham on 27 May, 2017, 09:40:17 pm
Put simply, it mostly comes down to poor risk management.  Or good risk management, if the overall cost of this episode is less than it would have cost to prevent, which I doubt. When you have an attitude to risk based around fallacious data and empirical assessment of isolated risks, combined with the innate obduracy of inanimate objects, this sort of event is inevitable. The only question is which organisation it happens to; this time it was BA. Every major organisation that I've had dealings with for IT systems (and that's a metric fuckload) uses cost as one of the most significant factors in its decision-making process; it was not always this way.

TL;DR: Shit happens.
Title: Re: The BA IT collapse
Post by: Polar Bear on 27 May, 2017, 09:50:55 pm
I was absolutely gobsmacked when I arrived at a large retailer's HQ to discover that they would be out of business in three days if their computers failed.  They had no backups, no disaster recovery plans, no failover sites, nothing.  When I left 3 1/2 years later, I left as a result of the IT director* rejecting the proposals to implement the basics.   

When I say left, I was made redundant. 

*  He is a Chelski fan.  I wonder if I still have his number ...
Title: Re: The BA IT collapse
Post by: Thor on 27 May, 2017, 10:49:23 pm
The current explanation - that a power failure can knock out one or more critical, global systems for hours - defies credibility. That such systems have no redundancy, failover options, Disaster Recovery procedures, is not credible.

Something's up - and the cover story isn't very convincing.
Title: Re: The BA IT collapse
Post by: Kim on 27 May, 2017, 10:52:40 pm
Maybe someone opted to save costs by installing the redundant systems in adjacent racks?
Title: Re: The BA IT collapse
Post by: Vince on 28 May, 2017, 01:50:24 am
Odd that BA have a system that is specific to flights in and out of Heathrow and Gatwick.
Title: Re: The BA IT collapse
Post by: Ham on 28 May, 2017, 07:51:06 am
The current explanation - that a power failure can knock out one or more critical, global systems for hours - defies credibility. That such systems have no effective redundancy, failover options, Disaster Recovery procedures, is not credible.


Afraid to tell you, you are wrong with that assumption. I've done a little FTFY to be a bit more specific.

I work for %Megacorp, one of the main providers of services to global organisations. My role is intimately involved with understanding any issues and proposing the solution, so it would be unethical and inappropriate for me to comment directly, especially given the speculation about where and how the failure occurred. But really, it's no surprise.

One little anecdote about another major org that put in their own dual site HA for some critical systems. Turns out that when one site goes out, it takes out the other. It's been that way for three years and going to be for another one, at least.

Very few organisations understand the difference between HA and DR, even at the highest levels. It's all down to money at the start and end of the day.
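For the avoidance of doubt, roughly what I mean by the two - the numbers below are invented purely for illustration, real targets vary wildly:

Code:
# HA masks component failures; DR rebuilds the service after losing a whole site.
high_availability = {
    "goal": "mask component failures so the service keeps running",
    "mechanism": "redundant nodes in or near the same site, automatic failover",
    "typical_rto_seconds": 30,        # recovery time objective: seconds to minutes
    "typical_rpo_seconds": 0,         # recovery point objective: no committed data lost
}

disaster_recovery = {
    "goal": "rebuild the service after the loss of an entire site",
    "mechanism": "second data centre, restored from replicas or backups, usually invoked by people",
    "typical_rto_seconds": 4 * 3600,  # hours, sometimes days
    "typical_rpo_seconds": 15 * 60,   # data since the last replication or backup may be lost
}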
Title: Re: The BA IT collapse
Post by: pcolbeck on 28 May, 2017, 07:52:03 am
The current explanation - that a power failure can knock out one or more critical, global systems for hours - defies credibility. That such systems have no redundancy, failover options, Disaster Recovery procedures, is not credible.

Something's up - and the cover story isn't very convincing.

Yup. I am currently designing the network for an airport, and you have two data centres, with critical services having a backup server in the second data centre. Even within a data centre you have redundant switches in separate racks, with everything dual-homed to two switches. Each switch has two PSUs (at least) as well, and you would normally feed them from separate supplies. The core and distribution switches for the network that goes out to the terminals and connects the baggage handling, check-in, wireless etc. would also be redundant. Only the access switches would be a single point of failure, and losing one would only take down the stuff directly attached to it.

Mind you, I have seen an airport network that was so badly designed that an L2 issue like a broadcast storm could have taken the whole thing down.
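To make the dual-data-centre idea concrete, here is a toy sketch of "every critical service has an instance in each DC and clients use whichever answers a health check" - the hostnames and ports are invented for illustration:

Code:
# Toy client-side failover: probe the primary-DC endpoint, then the secondary.
import socket

SERVICE_ENDPOINTS = {
    "check_in": [("checkin.dc1.example.internal", 8443), ("checkin.dc2.example.internal", 8443)],
    "baggage":  [("baggage.dc1.example.internal", 9000), ("baggage.dc2.example.internal", 9000)],
}

def first_healthy(endpoints, timeout=2.0):
    """Return the first endpoint that accepts a TCP connection, or None if all are down."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue
    return None

if __name__ == "__main__":
    for service, endpoints in SERVICE_ENDPOINTS.items():
        target = first_healthy(endpoints)
        print(service, "->", target if target else "NO HEALTHY ENDPOINT - raise an incident")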
Title: Re: The BA IT collapse
Post by: Ham on 28 May, 2017, 08:12:45 am


Yup. I am currently designing the network .....

And there you have at least part of it.

Systems are not just the sum of their components. You can have great components but, if they aren't designed, installed or maintained properly, you have a pile of useless. FTR, my role is the integrator of the various stuffs, like Network, DR, Server, Service, etc. etc.

Go on, another anecdote (dating back about 15 years). Another major corp with highly redundant, highly secure systems lost complete contact with their, erm, contact centre. I happened to be there and got involved. It turned out that when the dual redundant inbound circuit terminations were installed, they had not only been plugged into the same dual socket, but a wall socket had been chosen instead of a cabinet socket. And someone had plugged in a kettle. That was all it needed.
Title: Re: The BA IT collapse
Post by: Mr Larrington on 28 May, 2017, 08:27:01 am
The current explanation - that a power failure can knock out one or more critical, global systems for hours - defies credibility. That such systems have no redundancy, failover options, Disaster Recovery procedures, is not credible.

Something's up - and the cover story isn't very convincing.

I used to work for an arm of a BigCo in which we had a UPS to keep things going until the backup gennies kicked in.  One day a lightning strike took out half the town's electricity, ours included.  The UPS did its job admirably, the gennies started, all happy.  Then the gennies stopped and the UPS batteries went flat.  Someone had failed to ensure that the gennies had fuel tanks containing diesel rather than fumes...
Title: Re: The BA IT collapse
Post by: Pickled Onion on 28 May, 2017, 09:30:59 am
Go on, another anecdote (dating back about 15 years).

If we're doing anecdotes, I was at a major financial when 9/11 happened. That was when they found out it was a bad idea to have their secondary systems in Manhattan as a backup to the primary, in Manhattan  :facepalm: Learning from this, they moved the secondary to New Jersey. A few years later Hurricane Sandy hit and took them both down.
Title: Re: The BA IT collapse
Post by: Ham on 28 May, 2017, 10:27:34 am
Was that the corp with split systems across the two towers, or a different one?
Title: Re: The BA IT collapse
Post by: T42 on 28 May, 2017, 10:57:35 am
My employer took the same course of outsourcing in 2005.  It was not successful because the companies doing the work were also working for other clients and did not give priority to one client.  I understand that much of the work has returned to be done by UK employees.

Heh. As Y2K approached a chum and I set up a large order-processing/production system for a manufacturing jeweller in Germany. After a couple of years we were suddenly shown the door, as a palace revolution had swung opinion in favour of taking standard software from a big software house so as to ensure continuity if one of us bit the dirt.  Six months later we heard that the standard SW had required so many adaptations that it had already cost twice as much as our system and was far from finished, and everything was being done by just two blokes. Another six months and one of them had got fed up and left.  We never got the client back, though.
Title: Re: The BA IT collapse
Post by: TheLurker on 28 May, 2017, 11:42:37 am
Let's face it.   Every single one of us who works or who has worked in IT knows that, basically, the whole thing is held together with the electronic equivalents of spit and baler twine by people running desperately hard to stand still in an effort to keep up with the latest hare-brained scheme cooked up by the PHBs in cahoots with *shiny stuff* vendors and sooner or later it all goes wrong.  The trick is to be somewhere else and or not reliant on the failed system when it does.  :)
Title: Re: The BA IT collapse
Post by: Phil W on 28 May, 2017, 01:19:01 pm
We used to call Disaster Recovery (DR) testing Computer Aided Overtime (CAO).  It was meant to mean Continuous Application Operation but our term was far better. We learnt that we could restore the backups and get our system (one of many) fully up and running and verified in about 12 hours.  We also learnt that it could only process about 1/12 of the capacity it needed, as the backup setup was rather weedy compared to the primary.

I also found out when on call (for everything) that one site was run off a mezzanine floor in the other head office.  At night a cleaner unplugged a cabinet powering the core system disks of the mainframe.  I only found this out when I asked what was down and they said everything. Oh, what joy the next hours were as we worked through priority systems, handing off to the next guy on the on-call rota; we came very close to invoking full DR.  IT systems can fail in interesting ways that aren't always thought of.  In theory you should be able to just remove all power from everything and recover from it, but it doesn't seem to work like that in reality, and you get partial failures that are often actually harder to recover from (and take longer, as you try to recover in primary before invoking full DR).

Of course this was back in the early 90s; money has been invested since and DR has moved on.

So if BA is still on a daily/weekly backup DR setup with an alternate cold-standby data centre, doesn't have alternate offices for operations and support to deploy to and operate from as "war rooms", and hasn't tested DR properly in a long while, then I can well understand why they might not have their systems back so quickly if the entire power supply goes at that primary DC.

I love the way the BBC have reported it as a global IT system, not realising it's plural - systems, as in probably hundreds of interconnected legacy and modern ones (which themselves have many moving parts) joined together with spaghetti - and it is not as simple as turning your home PC off and on.

Plus, as others have said, when they no longer value those who built their legacy systems, don't encourage and promote knowledge sharing, and lay them off, then they are left with inexperienced staff who are OK when operations are working as normal but have no experience of dealing with a DR situation.
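Back of an envelope, the 1/12-capacity point above is the killer - even a "successful" restore isn't a recovery if the DR site can't keep up. Numbers below are roughly the anecdote's, not anyone's real figures:

Code:
# If incoming work arrives at normal production rate but the DR site only processes
# a fraction of it, the backlog grows for every hour you stay on DR.
restore_hours = 12            # time to restore backups and verify the system
dr_capacity_fraction = 1/12   # DR throughput relative to production
normal_load_fraction = 1.0    # incoming work, as a fraction of production capacity

backlog_growth_per_hour = normal_load_fraction - dr_capacity_fraction
print(f"While on DR, backlog grows by {backlog_growth_per_hour:.2f} hours of work per hour.")
# At 1/12 capacity the backlog grows by ~0.92 production-hours every hour, so the
# 12-hour restore still leaves you falling further behind rather than recovering.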

Title: Re: The BA IT collapse
Post by: T42 on 28 May, 2017, 01:49:30 pm
Plus, as others have said, when they no longer value those who built their legacy systems, don't encourage and promote knowledge sharing, and lay them off, then they are left with inexperienced staff who are OK when operations are working as normal but have no experience of dealing with a DR situation.

Applies well beyond IT, too.
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 28 May, 2017, 04:20:40 pm
The ownership of BA is interesting.  It's only British in a loose kind of way, being part of International Airlines Group and run in partnership with Iberia, the Spanish 'flag carrier', in a merger of apparent equals. However, BA's profits were badly hit by Iberia's exposure to the downturn of Spain's economy after 2008.  I guess its profits are about to take another hit, which is good news only for its rivals.
Title: Re: The BA IT collapse
Post by: ElyDave on 28 May, 2017, 04:46:56 pm
Seems it doesn't matter if it's IT or O&G, the seven P's still apply. One of my first questions when talking about their shiny incident control room is "what happens if this room is unavailable, the power fails, the internet goes down, the people don't turn up, etc.?"

Sometimes they have thought about it.

It's when you have something like Deepwater Horizon, or the Elgin G4 incident, that you find out how well it really works over weeks and months rather than hours.
Title: Re: The BA IT collapse
Post by: mrcharly-YHT on 28 May, 2017, 04:52:48 pm
I had a contract working on a DR system for a BigCorpBank. We decided to do a test of the fallback DR - they had a backup site for trading, if the main site went down (think big bomb), then they could use the other site to close down trades.

So, one weekend we shut down the main site, literally pulled the plug, and started to bring up the backup site. All going well until we started to bring up the payment system, the single connection that is required for processing the financial transactions. This is a live connection, never shut off and activated by a swipe card and a long code.

The bank's card and code combination didn't work. Nobody had ever, ever, ever verified the card and code.

Let's think about this. We've turned off the bank's main connection. The backup one won't connect. In 24 hours the trading floors open. There is nobody we can contact to do anything about this card and code, because it is the weekend and they don't work weekends. Without this connection up, the BigCorp can't trade.

We Are Fucked.
Title: Re: The BA IT collapse
Post by: pdm on 28 May, 2017, 05:11:33 pm


Yup. I am currently designing the network .....

And there you have, at least part, of it.

Systems are not just the sum of its components. You can have a great system but, if they aren't designed, installed or maintained properly you have a pile of useless. FTR, my role is the integrator of the various stuffs, like Network, DR, Server, Service, etc etc

Go on, another anecdote (dating back about 15 years). Another major corp with highly redundant highly secure systems lost complete contact with their, erm, contact centre. I happened to be there and got involved. Turned out that the dual redundant inbound circuit terminations when installed had not only been plugged in to the same dual socket, but a wall socket instead of a cabinet socket had been chosen. And someone had plugged in a kettle. That was all it needed.

You don't have to go back 15 years for this sort of idiocy!
Just in the last couple of weeks, a major critical UK infrastructure IT system failed and was down for many hours because the primary site failed. The backup site could not come online because it would not operate without the primary site online..... This apparently turned out to be part of the (flawed) original design....  :facepalm:
Title: Re: The BA IT collapse
Post by: rr on 28 May, 2017, 05:17:17 pm
Seems it doesn't matter if it's IT or O&G, the seven P's still apply. One of my first questions when talking about their shiny incident control room is "what happens if this room is unavailable, the power fails, the internet goes down, the people don't turn up, etc.?"

Sometimes they have thought about it.

It's when you have something like Deepwater Horizon, or the Elgin G4 incident, that you find out how well it really works over weeks and months rather than hours.
There was also the recent incident where the platform Inventory was released when the backup batteries ran down.


Title: Re: The BA IT collapse
Post by: ElyDave on 28 May, 2017, 05:18:34 pm
One of the UK's major utilities had satellite data links back to a mega control centre. One winter, it snowed heavily and the dish filled up. The backup was a phone line. The phone line had never been used. BT had noticed the lack of use and disconnected it.
Title: Re: The BA IT collapse
Post by: Bledlow on 28 May, 2017, 05:19:59 pm
Quote
"We believe the root cause was a power supply issue."

My employer had backup generators in the basement.  Power supply issues just made the lights flicker....

The simple explanation for BA's plight is that ... they failed to understand the downside of outsourcing.
When I worked for a global megacorp it had backup generators, & for major IT systems a backup data centre geographically separate from the main one, i.e. far enough away for it to be unlikely that a single disaster would affect both. Critical systems were behind locked doors & power supplies to 'em were protected so nobody could knock 'em down by casually pulling out a plug.

Never found out if it'd all work, though. I recall a power cut causing the lights in our office to flicker & calls of "has anything failed?", all answered by "X (Y, Z, etc.) is working" but that was just development systems.

We had issues with outsourcing, though. I recall a mega-worldcorp external supplier's big show of how their replacement billing system, carefully customised for us after exhaustive & expensive studies by hordes of supposedly world-leading consultants & analysts, would replace some, & interface with other, existing systems. A bunch of us stood looking in puzzlement at a big diagram supposedly showing everything until one of us voiced what we were all thinking, i.e. "where's so-and-so?", so-and-so being a system without which (or a replacement) we wouldn't produce any bills.  :facepalm: Many months of work, walking past us to their desks every day, but it had never occurred to them to ask us (the billing team) what we did & how it fitted in.

I also recall how when it was decided to replace the last part of an old system running on old hardware which wasn't doing very much any more, but the little it was doing was essential, lavishly illustrated external proposals galore were put forward by external suppliers, all of which involved great expense - & none of which, as far as those of us working on it could see, took into account quite how little a replacement needed to do. They were all far too complicated, & involved duplication of processing that was done in other systems, & holding the same data in multiple databases. One consisted of emulating the old hardware (which getting away from was one of the main benefits of replacement, since it eliminated the need for separate copies of data, expensive licences for an old mainframe OS & other old software which was being milked before it went out of use, etc.) on new hardware & porting everything across.

It was quite a struggle to get an internal replacement to even get looked at. There was no allocation of time to investigate one until a  low-level manager sneaked in some cover for the best person to look at an internal replacement so he could draw up a proper proposal. It was adopted (a no-brainer once it was actually compared with the external offers), implemented by existing staff & IIRC one contractor (only needed because one person had been made redundant to meet a headcount reduction target  :facepalm: - another person complained that the contractor was asking him questions that he used to ask the redundant person, i.e. me) faster than any of the external proposals & saved the company a fortune. As soon as it went live there were redundancies in the team that had done it, including the bloke who'd drawn up the proposal & thus saved the firm a few million. The perfect reward,eh?
Title: Re: The BA IT collapse
Post by: zigzag on 28 May, 2017, 05:20:19 pm
i hope they are back to normal before my flight on tuesday..
Title: Re: The BA IT collapse
Post by: Bledlow on 28 May, 2017, 05:35:27 pm
One of the UK's major utilities had satellite data links back to a mega control centre. One winter, it snowed heavily and the dish filled up. The backup was a phone line. The phone line had never been used. BT had noticed the lack of use and disconnected it.
Ex-employer had a few hundred grand's worth of electronic bits & pieces in a warehouse. Deliberately bought just before they went out of production, as spares for hardware that was still in use & scheduled for gradual replacement over several years. The replacement cost of the bits & pieces would have been several million, since it'd mean greatly accelerating the replacement schedule for the old hardware - buying replacements early, hiring contractors to do all the work early, etc.

Bloke responsible for it had a fight to stop it being scrapped one day. Warehouse management had it logged as for disposal because according to their criteria the turnover was too low to make it worth allocating space to. He told me he almost didn't find out until too late.
Title: Re: The BA IT collapse
Post by: Bledlow on 28 May, 2017, 05:49:03 pm
Plus, as others have said, when they no longer value those who built their legacy systems, don't encourage and promote knowledge sharing, and lay them off, then they are left with inexperienced staff who are OK when operations are working as normal but have no experience of dealing with a DR situation.

Applies well beyond IT, too.
Someone was recently telling me about how BR/Railtrack lost the knowledge of where a lot of its signalling cables were. Supposedly it had never had a full inventory because local teams repaired, replaced etc. & had never logged everything centrally. Come privatisation, a lot of those people were laid off - & either weren't asked or didn't want to say where everything was. Paper records may or may not have existed, but where? Knowledge may have been only in heads that were no longer employed.

On Friday I heard something similar about the water management system for a stretch of canal. The people who maintained it had all been got rid of, & went quietly. Years later, an enthusiastic young new bloke met one of the old guys at some heritage thing, & the old bloke took a liking to him - & they went for a walk round & talk through. The oldster remembered it all, & enjoyed describing it.
Title: Re: The BA IT collapse
Post by: Kim on 28 May, 2017, 06:23:07 pm
One of the UK's major utilities had satellite data links back to a mega control centre. One winter, it snowed heavily and the dish filled up. The backup was a phone line. The phone line had never been used. BT had noticed the lack of use and disconnected it.

Until recently, this would happen to AAISP customers who have broadband on a line, but no voice service.  An engineer poking around a cabinet in search of a spare line would find the line with no dialtone and steal it.  So they now play a recorded message with a bit of dialtone (to keep the test equipment happy) and a "do not steal this line" message from RevK.

I believe broadcasters have the same problem with dedicated lines between sites (presumably there's a lot less of that than there used to be).  So will play some music (or other convenient audio signal) down them when not in use.

Rather sensibly, the cold war early warning system was piggybacked on the lines distributing the speaking clock signal between exchanges to avoid this problem.  Someone would notice if the speaking clock wasn't working.  I believe some air raid sirens were also connected via out-of-band signalling on normal customers' phone lines, for the same reason.
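The modern equivalent is to put a deliberate heartbeat on any "silent" but critical path, so that its loss is noticed and nobody assumes the line is unused. A minimal sketch - the host and port are placeholders, not anyone's real kit:

Code:
# Send a small datagram down the backup path at a fixed interval; failures are reported
# (in real life they would page someone).
import socket
import time

def heartbeat(host="standby-link.example.internal", port=7, interval=60):
    """Emit a keepalive on the backup path forever; report any send failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        try:
            sock.sendto(b"still-in-use", (host, port))
        except OSError as exc:
            print("backup path heartbeat failed:", exc)
        time.sleep(interval)

# heartbeat()  # would run until interrupted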
Title: Re: The BA IT collapse
Post by: Mr Larrington on 28 May, 2017, 07:33:07 pm
Professor Larrington had the tremendous foresight to elect to use Lufthansa to go to Dresden today :thumbsup:

Meanwhile, O BBC, I know it's the Sunday evening of a Bank Holibob weekend but wheeling out a so-called "expert" who doesn't know what "UPS" stands for is Not Helpful.
Title: Re: The BA IT collapse
Post by: hellymedic on 28 May, 2017, 07:47:00 pm
UPS? See 'Sh!te Courier' thread.

 ;) ;) ;D
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 28 May, 2017, 07:54:29 pm
Quote
Rather sensibly, the cold war early warning system was piggybacked on the lines distributing the speaking clock signal between exchanges to avoid this problem.  Someone would notice if the speaking clock wasn't working.

Are you sure this wasn't simply so that people would know when their four minutes was up?
Title: Re: The BA IT collapse
Post by: Thor on 28 May, 2017, 08:03:30 pm

Meanwhile, O BBC, I know it's the Sunday evening of a Bank Holibob weekend but wheeling out a so-called "expert" who doesn't know what "UPS" stands for is Not Helpful.
;D Yes, I saw that piece. Is it Universal, or Unlimited?  ;D
Title: Re: The BA IT collapse
Post by: Kim on 28 May, 2017, 08:07:35 pm
Unpossible
Title: Re: The BA IT collapse
Post by: Phil W on 28 May, 2017, 08:31:51 pm
They had UPS as in Useless
Title: Re: The BA IT collapse
Post by: Kim on 28 May, 2017, 08:55:03 pm
Or perhaps Unexploded...
Title: Re: The BA IT collapse
Post by: Phil W on 28 May, 2017, 08:57:08 pm
Maybe the global BA system was danger UXB?
Title: Re: The BA IT collapse
Post by: Bledlow on 28 May, 2017, 09:24:35 pm
One of the UK's major utilities had satellite data links back to a mega control centre. One winter, it snowed heavily and the dish filled up. The backup was a phone line. The phone line had never been used. BT had noticed the lack of use and disconnected it.

Until recently, this would happen to AAISP customers who have broadband on a line, but no voice service.  An engineer poking around a cabinet in search of a spare line would find the line with no dialtone and steal it.
Somehow this failed to happen to the BT line Vodafone paid for to my house for me to use for working from home. Several years after I'd last used it (VF had switched me to using my own - faster - broadband, then got rid of me), I was contacted by someone telling me that they thought I should be paying for it. It was still live.  :facepalm:

I explained the situation & they said "We've found that you're right. Sorry to have bothered you".
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 29 May, 2017, 07:47:33 am
It just demonstrates that our increasing reliance on IT has not been matched by success in making systems crash-proof or hack-proof.

The recent cyber attack that hit the NHS and many other systems globally shows that whilst technology gets investment in the flashy go-faster bits the more mundane foundations of the edifice aren't re-engineered to cope. We end up with the walls of Jericho.

Meanwhile shareholders wait nervously for trading tomorrow..
(http://justfunfacts.com/wp-content/uploads/2016/02/meerkats-2.jpg)
Title: Re: The BA IT collapse
Post by: Jaded on 29 May, 2017, 08:57:26 am
Plus, as others have said, when they no longer value those who built their legacy systems, don't encourage and promote knowledge sharing, and lay them off, then they are left with inexperienced staff who are OK when operations are working as normal but have no experience of dealing with a DR situation.

Applies well beyond IT, too.
Someone was recently telling me about how BR/Railtrack lost the knowledge of where a lot of its signalling cables were. Supposedly it had never had a full inventory because local teams repaired, replaced etc. & had never logged everything centrally. Come privatisation, a lot of those people were laid off - & either weren't asked or didn't want to say where everything was. Paper records may or may not have existed, but where? Knowledge may have been only in heads that were no longer employed.

On Friday I heard something similar about the water management system for a stretch of canal. The people who maintained it had all been got rid of, & went quietly. Years later, an enthusiastic young new bloke met one of the old guys at some heritage thing, & the old bloke took a liking to him - & they went for a walk round & talk through. The oldster remembered it all, & enjoyed describing it.

I think you'll find it wasn't just cable runs they lost. As a result of something I did for Railtrack I was offered a place on an Asset Day in 1999. This was a senior person and a handful of other managers going round the area looking at assets. They were ostensibly checking that assets (like bridges, tunnels, etc) had been counted properly. They didn't know how many they had, so had a program of counting them. It was great fun, we saw a modern signal box operating in Newport, a really really old one operating (and got to pull the levers) and then we went in the Severn Tunnel, seeing the controlled raging water and standing on the trackside as 125s dashed past. A few months later they killed a load of people with their asset inadequacies.
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 29 May, 2017, 10:36:04 am
Quote
BA blames a power outage, but a corporate IT expert said it should not have caused "even a flicker of the lights" in the data-centre.
Even if the power could not be restored, the airline's Disaster Recovery Plan should have whirred into action. But that will have depended in part on veteran staff with knowledge of the complex patchwork of systems built up over the years.
Many of those people may have left when much of the IT operation was outsourced to India.
One theory of the IT expert, who does not wish to be named, is that when the power came back on the systems were unusable because the data was unsynchronised.
In other words the airline was suddenly faced with a mass of conflicting records of passengers, aircraft and baggage movements - all the complex logistics of modern air travel.
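That unsynchronised-data theory is essentially a reconciliation problem: two copies of the same records that disagree, with no authoritative winner. A minimal sketch of what "conflicting records" means in practice - the field names and values are invented for illustration:

Code:
# Two replicas of the same operational data, out of step after an uncontrolled restart.
primary = {
    "BA0123": {"status": "boarded",   "bags_loaded": 96},
    "BA0456": {"status": "cancelled", "bags_loaded": 0},
}
secondary = {
    "BA0123": {"status": "boarding",  "bags_loaded": 71},   # replica lagged before the outage
    "BA0456": {"status": "scheduled", "bags_loaded": 12},
}

conflicts = {
    flight: (primary[flight], secondary.get(flight))
    for flight in primary
    if primary[flight] != secondary.get(flight)
}
for flight, (a, b) in conflicts.items():
    print(f"{flight}: primary={a} secondary={b} -> needs manual or rule-based reconciliation")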

From university days I remember that the first airline control system was SABRE (Semi-Automated Business Research Environment), now between 50 and 60 years old and still in use.  It is American and its history is one of cut-throat competition.  BA's system was called Travicom, launched in 1976, which evolved into the current system, Galileo, 12 years later.

What's been added to these venerable systems is the security element, or pre-screening of passengers.  The first add-on was CAPPS in the late 1990s, which simply checked names against an FBI database.  After 9/11 highlighted its shortcomings it was replaced by CAPPS II, which checked everything about everyone everywhere and reportedly nominated Edward Kennedy as a security risk in 2004. That was replaced by the US Department of Homeland Security-commissioned 'Secure Flight' pre-screening system, but this didn't kick in till 2010.  At this point everything gets very tricky...  For a start there are at least two lists to be used, one with a mere 47k names, the other with 2.5m identities believed to cover 1.8m people.  With a paranoid POTUS, it's easy to imagine systems developers and maintainers tearing their hair out.

And how this security aspect ties in with outsourcing is anyone's guess.
Title: Re: The BA IT collapse
Post by: Jaded on 29 May, 2017, 08:46:28 pm
My understanding of outsourcing is that if the contracted company says its employees are vetted to the required standards, then you are covered...
Title: Re: The BA IT collapse
Post by: Chris S on 29 May, 2017, 09:27:37 pm
My understanding of outsourcing is that if the contracted company says its employees are vetted to the required standards, then you are covered...

Really? Wow.

"Can you keep a secret?"
"Yes"
"Oh well, you're in then!"
Title: Re: The BA IT collapse
Post by: Jaded on 29 May, 2017, 10:03:08 pm
Exactly.

Unless the contracting company wishes to go and vet all the employees for skills and clearance themselves. Goodness, they might even be interested in the way those staff are managed. In which case, what's the point of outsourcing?
Title: Re: The BA IT collapse
Post by: Greenbank on 30 May, 2017, 07:57:38 pm
Caught a bit of it being discussed with an airline IT expert on Radio 5 Live this evening.

Seems it was the integration between BA's systems and the Amadeus system that handles the boarding side of bookings/luggage-tracking/etc.

It was definitely on the BA side as Amadeus is used by many different airlines at those airports and no other airlines had a problem with that during the same period.

So, whatever BA had written to integrate their own systems with Amadeus goes wonky, the backup/HA/DR system flails over in exactly the same way, and they're left up shit creek until engineers can wrestle something back to life.

If it was power surge related then:-
a) Why wasn't the system behind some level of power conditioning?
b) Surely the primary and secondary systems aren't in the same DC/location?
c) They obviously don't have a playbook for recovering from hardware failure of both systems...
d) I'm assuming they had a secondary system for this integration! Surely it wasn't a SPOF!?!
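On the "flails over in exactly the same way" point: if the primary and the "backup" run identical integration code, a bad input or bad state takes out both, and the failover buys you nothing. A toy illustration - the message format and parsing here are invented, nothing to do with the real BA/Amadeus interface:

Code:
# Both sites run the same naive adapter, so the same poison input kills both.
def integration_adapter(message: str) -> dict:
    """Pretend glue code: naive parsing with no defensive handling."""
    flight, passengers = message.split(",")        # blows up on malformed input
    return {"flight": flight, "passengers": int(passengers)}

poison_message = "BA0123;184"   # wrong delimiter, e.g. from a half-restored upstream system

for site in ("primary", "secondary"):
    try:
        integration_adapter(poison_message)
    except ValueError as exc:
        print(f"{site} adapter failed identically: {exc}")
# Redundant hardware does not help when both copies share the same logic and the same bad data.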
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 01 June, 2017, 05:30:08 pm
'They' are still clinging to the power surge explanation.

I'm baffled by this because here in rural France we are told to expect power surges for various reasons.  The one that took out my own computer was a lightning strike a few years ago.

However, my electrician tells me that four hundred euros would buy me a blast-proof surge protector, very unlike the mickey mouse jobs from PC World*.

Being parsimonious I declined his offer and just unplug when I hear thunder.  But why didn't BA have such a device in place or if they did why didn't it work?  Was it a cost-saving PC World job?  If it was a third party installed set up isn't there insurance cover?

* I had one; it didn't save my laptop and unfortunately I'd lost the receipt so could not claim.
Title: Re: The BA IT collapse
Post by: Kim on 01 June, 2017, 05:41:12 pm
A datacentre UPS ought to protect against surges (up to and including nearby lightning strikes) on the mains supply, or normal forms of generator misbehaviour.

It's always possible that the UPS itself (or equipment downstream of it) failed in a way that caused a less impressive 'surge' (repeated inrush current in a short period of time, spikes from inductive A/C loads, or maybe just really out-of-spec power) and damaged a whole load of critical kit.  Maybe some sheepish engineer accidentally swapped a phase and a neutral and is keeping the now legendary low profile, or something.  In nearly all forms of human endeavour there's potential for everything to go massively wrong as long as more than one bad thing happens at the same time.

Obviously things ought to be sufficiently redundant that they can continue with the total loss of a single site, but we've already covered cost-cutting and the risks of subtle failure modes that are hard to predict and test for.
Title: Re: The BA IT collapse
Post by: PeteB99 on 02 June, 2017, 04:48:16 pm
This Times article points to human error

"A power supply unit at the centre of last weekend’s British Airways fiasco was in perfect working order but was deliberately shut down in a catastrophic blunder, The Times has learnt.

An investigation into the incident, which disrupted the travel plans of 75,000 passengers, is likely to centre on human error rather than any equipment failure at BA, it emerged."

https://www.thetimes.co.uk/article/ba-power-fiasco-blamed-on-staff-blunder-tbfhxwsw2 (https://www.thetimes.co.uk/article/ba-power-fiasco-blamed-on-staff-blunder-tbfhxwsw2)

Not sure I'm convinced

Edit - A bit more detail from the Guardian

https://www.theguardian.com/business/2017/jun/02/ba-shutdown-caused-by-contractor-who-switched-off-power-reports-claim (https://www.theguardian.com/business/2017/jun/02/ba-shutdown-caused-by-contractor-who-switched-off-power-reports-claim)
Title: Re: The BA IT collapse
Post by: Ben T on 03 June, 2017, 12:33:05 am
A datacentre UPS ought to protect against surges (up to and including nearby lightning strikes) on the mains supply, or normal forms of generator misbehaviour.

It's always possible that the UPS itself (or equipment downstream of it) failed in a way that caused a less impressive 'surge' (repeated inrush current in a short period of time, spikes from inductive A/C loads, or maybe just really out-of-spec power) and damaged a whole load of critical kit.  Maybe some sheepish engineer accidentally swapped a phase and a neutral and is keeping the now legendary low profile, or something.  In nearly all forms of human endeavour there's potential for everything to go massively wrong as long as more than one bad thing happens at the same time.

Obviously things ought to be sufficiently redundant that they can continue with the total loss of a single site, but we've already covered cost-cutting and the risks of subtle failure modes that are hard to predict and test for.

You tend to think, with the rise of "the cloud", that all big companies use good data centres, but it's amazing how many don't. I once went on a visit to an Azure data centre and its UPS is ... impressive, to say the least. Eight rooms, each containing not just a generator but a full-on ship engine (yes, they're massive). They have to be started up every week just to be tested, and it's quite a big job just starting them up; you don't just turn a key. Another room contains nothing but weird batteries, where the acid is visible and sloshing about in open trays. Loads of control rooms containing nothing but switches, some of them in the six figures of volts. The whole thing's more about power than computer equipment - and that's the only basis on which it's charged to (corporate, i.e. wholesale) customers.
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 03 June, 2017, 07:56:48 am
Quote
The whole thing's more about power than computer equipment

That seems reasonable.  However, what you describe doesn't suggest that it's easy to lose power, and a bit of further reading about battery rooms confirms that it is a highly developed area of power supply with a long history.  Some of the earliest battery rooms go back into the 19th century, and there was of course extensive deployment on board even the early U-boats.  In the 70s we monitored the radio traffic for the port of Harwich.  Even we had a battery room in case of power cuts!

In fact the more they claim it was a power failure, the less credible it becomes.

Meanwhile on Ryanair's twitter page:
(https://pbs.twimg.com/media/DA6eyiYXUAEoRek.jpg)
Title: Re: The BA IT collapse
Post by: Thor on 03 June, 2017, 06:21:54 pm
This Times article points to human error

"A power supply unit at the centre of last weekend’s British Airways fiasco was in perfect working order but was deliberately shut down in a catastrophic blunder, The Times has learnt.

An investigation into the incident, which disrupted the travel plans of 75,000 passengers, is likely to centre on human error rather than any equipment failure at BA, it emerged."

https://www.thetimes.co.uk/article/ba-power-fiasco-blamed-on-staff-blunder-tbfhxwsw2 (https://www.thetimes.co.uk/article/ba-power-fiasco-blamed-on-staff-blunder-tbfhxwsw2)

Not sure I'm convinced

Edit - A bit more detail from the Guardian

https://www.theguardian.com/business/2017/jun/02/ba-shutdown-caused-by-contractor-who-switched-off-power-reports-claim (https://www.theguardian.com/business/2017/jun/02/ba-shutdown-caused-by-contractor-who-switched-off-power-reports-claim)

A few years ago, I was working for a Large Financial Institution in which, on a weekday, an electrical contractor drilling a hole in a wall managed to trigger the kill switch for the entire Data Centre.  1200 servers simultaneously experienced an unscheduled outage.  Recovering from that was fun for all of us...
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 04 June, 2017, 01:30:39 am
Ryanair's apparent hubris seems to be based on their own operation.  They claim to have systems operating in triplicate, sharing changes to data.  Each system can operate independently if the other two are knocked out.  I guess there might be a point at which they converge - the weakest link - maybe simply the user interface or web presence?   Or maybe even that can be triplicated!  I'm now worried their aircraft have only two engines...
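For what it's worth, a hand-wavy sketch of the shape of that "triplicate" idea - three replicas accept work, changes are shared, and any one can carry on alone. This is just the general pattern, not anyone's real architecture:

Code:
# Three replicas; a write succeeds as long as at least one replica takes it.
class Replica:
    def __init__(self, name):
        self.name, self.data, self.alive = name, {}, True

    def apply(self, key, value):
        if self.alive:
            self.data[key] = value

replicas = [Replica("dc-a"), Replica("dc-b"), Replica("dc-c")]

def write(key, value):
    """Best-effort replication across all live replicas."""
    applied = 0
    for r in replicas:
        r.apply(key, value)
        applied += r.alive
    if applied == 0:
        raise RuntimeError("all replicas down")
    return applied

replicas[0].alive = replicas[1].alive = False   # knock out two of the three
print("write accepted by", write("booking:123", "CONFIRMED"), "replica(s)")
# The catch noted above: anything the three converge on (a single web front end,
# a single integration point) is still a single point of failure.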
Title: Re: The BA IT collapse
Post by: De Sisti on 04 June, 2017, 06:59:55 am
Now that the outage has been resolved, BA should start reimbursing and compensating passengers.
Title: Re: The BA IT collapse
Post by: Vince on 04 June, 2017, 09:57:02 am
If the issue was at an integration point with a third party, it can be troublesome to have duplication there. In my experience of doing B2B integration, some parties weren't the most practical - wanting to use IP addresses instead of URLs, having one integration point for both test and live systems, etc. - all making hot switch-over... challenging.
(Note - I'm not a network person so the above may contain traces of inexactitude.)
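A small sketch of the difference that hurts: if the partner pins your IP address, moving to a standby site means they must reconfigure; if they call a hostname you control, you can repoint it yourself. Names and addresses below are made up:

Code:
# Two hypothetical partner configurations: one pins an IP and mixes test with live,
# the other uses DNS names and separates environments.
partner_config_bad = {
    "endpoint": "203.0.113.10",            # hard-coded IP of the primary site
    "test_and_live_share_endpoint": True,  # one integration point for both environments
}

partner_config_better = {
    "endpoint": "b2b.example-airline.com", # DNS name: repoint the record to fail over
    "test_endpoint": "b2b-test.example-airline.com",
    "test_and_live_share_endpoint": False,
}

def can_hot_switch(config) -> bool:
    """Crude test: switch-over without partner involvement needs a name, not a pinned IP."""
    endpoint = config["endpoint"]
    looks_like_ip = endpoint.replace(".", "").isdigit()
    return not looks_like_ip and not config["test_and_live_share_endpoint"]

print(can_hot_switch(partner_config_bad), can_hot_switch(partner_config_better))  # False True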
Title: Re: The BA IT collapse
Post by: mattc on 04 June, 2017, 05:30:48 pm
Only just found this thread - I've been jet-lagged/audax-lagged/BA-lagged all week!

Took us 3 days to get back from Nice on Saturday. This was my first massive air-delay, and although it was a goody, I'm grateful that we didn't go through the piles-of-bodies-in-a-departure-lounge scenario. I guess BA wanted to avoid that at all cost due to the horrible PR.

ANYWAY ... partner (who wasn't on the trip!) rather "helpfully" found this Dr Seuss version of the BA statement to "cheer me up":

Statement from British Airways

The plane is not leaving
We’re sorry to say.
And check-in is heaving;
It will be all day!
Your plane to Barbados
Is stuck in Mumbai.
Your bag’s in Honduras.
The crew’s in Dubai.
“We’re sorry! We’re sorry!”
What else can we say?
There’s no need to worry,
It’s just a four-day delay.
The reason was clear,
A technical crash.
There’s nothing to fear,
Not a question of cash.
“Cutbacks caused the disaster”.
Ignore those who whine
Our IT is fine
Based in west Maharashtra.
We’ll soon get you flying
There’s no reason to fret.
Oh please stop your crying.
You could try EasyJet!

 :D



My one rant - probably ill-informed: we were ready to fly, on the plane. Surely they could sort us out back at Heathrow? Heathrow air-traffic-control could get the plane down, they don't care about BA's tickets/luggage/passengers, we'd all been security-checked. So why couldn't we fly?

It won't help me, but I'd like to hear an answer!
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 04 June, 2017, 06:03:58 pm
The thing is you just think it's so easy.  You don't realise what's involved:


Quote
"Transtellar Cruise Lines would like to apologize to passengers for the continuing delay to this flight. We are currently awaiting the loading of our complement of small lemon-soaked paper napkins for your comfort, refreshment and hygiene during the journey. Meanwhile we thank you for your patience. The cabin crew will shortly be serving coffee and biscuits again."

[...]

...he suddenly caught sight of a giant departure board still hanging, but by only one support, from the ceiling above him. It was covered with grime, but some of the figures were still discernible. [...] 'Nine hundred years...' he breathed to himself. That's how late this ship was.

[...]

In every seat sat a passenger, strapped into his or her seat. The passengers' hair was long and unkempt, their fingernails were long, the men wore beards. All of them were quite clearly alive - but sleeping.

[...]

'You're the autopilot?' said Zaphod
'Yes,' said the voice from the flight console.
'You're in charge of this ship?'
'Yes,' said the voice again, 'there has been a delay. Passengers are to be kept temporarily in suspended animation, for their comfort and convenience. Coffee and biscuits are served every year, after which passengers are returned to suspended animation for their continued comfort and convenience. Departure will take place when the flight stores are complete. We apologize for the delay.'

[...]

'Delay?' he cried. 'Have you seen the world outside this ship? It's a wasteland, a desert. Civilization's been and gone, man. There are no lemon-soaked paper napkins on the way from anywhere!'
'The statistical likelihood, ' continued the autopilot primly, 'is that other civilizations will arise. There will one day be lemon-soaked paper napkins. Till then there will be a short delay. Please return to your seat.'

You were lucky.
Title: Re: The BA IT collapse
Post by: ElyDave on 05 June, 2017, 07:10:44 am
Quote

You tend to think, with the rise of "the cloud", that all big companies use good data centres, but it's amazing how many don't. I once went on a visit to an Azure data centre and its UPS is ... impressive, to say the least. Eight rooms, each containing not just a generator but a full-on ship engine (yes, they're massive). They have to be started up every week just to be tested, and it's quite a big job just starting them up; you don't just turn a key. Another room contains nothing but weird batteries, where the acid is visible and sloshing about in open trays. Loads of control rooms containing nothing but switches, some of them in the six figures of volts. The whole thing's more about power than computer equipment - and that's the only basis on which it's charged to (corporate, i.e. wholesale) customers.

which keeps me in one stream of work.  The installed capacities on those data centres are huge; although rarely fired up other than for testing, they all fall under the EU Emissions Trading legislation, which affects those with combustion kit > 20 MWth input. So, assuming 40% efficiency, that's effectively 8 MW of electrical generation-ish.
Title: Re: The BA IT collapse
Post by: mattc on 05 June, 2017, 07:59:21 am
The thing is you just think it's so easy.  You don't realise what's involved:


Quote
"Transtellar Cruise Lines would like to apologize to passengers for the continuing delay to this flight. We are currently awaiting the loading of our complement of small lemon-soaked paper napkins for your comfort, refreshment and hygiene during the journey. Meanwhile we thank you for your patience. The cabin crew will shortly be serving coffee and biscuits again."

[...]

...he suddenly caught sight of a giant departure board still hanging, but by only one support, from the ceiling above him. It was covered with grime, but some of the figures were still discernible. [...] 'Nine hundred years...' he breathed to himself. That's how late this ship was.

[...]

In every seat sat a passenger, strapped into his or her seat. The passengers' hair was long and unkempt, their fingernails were long, the men wore beards. All of them were quite clearly alive - but sleeping.

[...]

'You're the autopilot?' said Zaphod
'Yes,' said the voice from the flight console.
'You're in charge of this ship?'
'Yes,' said the voice again, 'there has been a delay. Passengers are to be kept temporarily in suspended animation, for their comfort and convenience. Coffee and biscuits are served every year, after which passengers are returned to suspended animation for their continued comfort and convenience. Departure will take place when the flight stores are complete. We apologize for the delay.'

[...]

'Delay?' he cried. 'Have you seen the world outside this ship? It's a wasteland, a desert. Civilization's been and gone, man. There are no lemon-soaked paper napkins on the way from anywhere!'
'The statistical likelihood, ' continued the autopilot primly, 'is that other civilizations will arise. There will one day be lemon-soaked paper napkins. Till then there will be a short delay. Please return to your seat.'

You were lucky.
Good old Douglas. He wrote that nearly 40 years ago, and it damn near just came true for me!
Title: Re: The BA IT collapse
Post by: Ben T on 05 June, 2017, 08:14:24 pm
Quote
The whole thing's more about power than computer equipment

That seems reasonable.  However what you describe doesn't suggest that it's easy to lose power .



Yeah, but this is an Azure DC. I'm suggesting BA are a lot more cheapskate than to fork out for Azure, or a similar standard, and thus their UPS is far less advanced, which is why this has happened to them.
Title: Re: The BA IT collapse
Post by: Phil W on 05 June, 2017, 08:20:32 pm
Sounds like a whole series of bad decisions were made at BA. Even if they had crappy power at one DC their backup systems should have been in an entirely separate physical location with no chance of a power surge or otherwise taking everything out.
Title: Re: The BA IT collapse
Post by: Ben T on 05 June, 2017, 10:20:37 pm
How do you know they've got more than one dc?
Title: Re: The BA IT collapse
Post by: Phil W on 06 June, 2017, 12:08:34 am
How do you know they've got more than one dc?

Because it's been standard practice for a company of their size for decades, hence the should. If they didn't they have made even bigger blunders and bad decisions over the years than you'd have thought possible.
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 06 June, 2017, 01:45:27 am
How do you know they've got more than one dc?

Quote
the BBC's transport correspondent, Richard Westcott, has spoken to IT experts who are sceptical that a power surge could wreak such havoc on the data centres.
BA has two data centres about a kilometre apart. There are question marks over whether a power surge could hit both. Also, there should be fail-safes in place, our correspondent said.

http://www.bbc.com/news/business-40159202

Quote
All the big installations have back-up power. If the mains fails, a UPS (uninterruptable power supply) kicks in. It's basically a big battery that keeps things ticking over until the power comes back on, or a diesel generator is fired up.
This UPS is meant to take the hit from any "surge", so the servers don't have to.
All the big servers and large routers, I'm told, also have dual power supplies fed from different sources.

...up until a while ago, British Airways' IT systems had a variety of safety nets in place to protect them from big dumps of uncontrolled power, and to get things back on their feet quickly if there was any problem. I'm assuming those safety nets are still there, so why did they fail? And did human error play a part in all this?

...If BA wants to repair its reputation, its owner IAG needs to convince the public that making hundreds of IT staff redundant last year did not leave them woefully short of experts who could have fixed the meltdown sooner. And that it won't happen again - at least not on this epic scale.

http://www.bbc.com/news/business-40117381

Given the direction the company is taking, a move to compete at the budget end of the market, I suspect Cruz is CEO because he is an axeman not a quality controller.  He's not there to win a popularity contest; he probably has a skin like a rhino.
Title: Re: The BA IT collapse
Post by: Ham on 06 June, 2017, 08:01:27 am
Here's an interesting piece of informed comment on the cause; apparently 200 linked systems in the critical path.....

https://www.theregister.co.uk/2017/06/05/british_airways_critical_path_analysis/

....making the failure a "when" rather than "if"
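The arithmetic behind "when, not if" is brutal if you assume every one of those ~200 systems has to be up and failures are independent. The 99.9% per-system figure below is an illustrative assumption, not a number from the article:

Code:
# Serial availability: multiply the per-system figure across the whole critical path.
n_systems = 200
per_system_availability = 0.999        # "three nines" each, assumed for illustration

chain_availability = per_system_availability ** n_systems
downtime_hours_per_year = (1 - chain_availability) * 365 * 24

print(f"end-to-end availability: {chain_availability:.1%}")            # ~81.9%
print(f"expected downtime: ~{downtime_hours_per_year:.0f} hours/year")  # ~1,600 hours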

Title: Re: The BA IT collapse
Post by: mattc on 06 June, 2017, 08:53:50 am
...<snip>


http://www.bbc.com/news/business-40117381

Given the direction the company is taking, a move to compete at the budget end of the market, I suspect Cruz is CEO because he is an axeman not a quality controller.  He's not there to win a popularity contest; he probably has a skin like a rhino.

Cruz has already supervised a similar meltdown at a Spanish (?) airline. His career seemed to survive that OK.

And his hero is Ryanair boss Michael O’Leary ...
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 06 June, 2017, 09:30:28 am
As a change manager I was the one in the team who had an understanding of the workflow system in our organisation.  It wasn't a huge understanding but I knew the manager whose baby it was and had taken a keen interest in the meetings where he had explained it.

Consequently I knew the components in the system that every task had to reference and get 'permission' from, and thus any changes to these components were very high risk.  My colleagues in the team had not been given such training and came to me if in any doubt.  On several occasions I would not sign off changes because I did not accept the testing that had been done.  Sometimes this meant standing up to managers much more senior to me and telling them no!  It all worked pretty well really, and the number of user hours lost to IT outages was steadily reducing.

Then it was decided that our work was to be outsourced to Wipro in India.  Previously we'd been told our jobs were safe; then there was a big meeting at which we were told no, they weren't: anyone who wants to be made redundant, tick here.  Having little confidence in my future with the company, I didn't hesitate to step forward and made my exit.

Through my role I knew others in the company whose job it was to investigate, record and report IT outages.  In the months following the outsourcing the company's IT systems had very many problems, at least one of which made the news.  The IT director was dismissed (they'd also constructively sacked the guy who designed the crucial system, and his vital sidekick, who everyone had thought could never be sacked).  User hours lost to IT outages soared and business users were very fed up.  As far as I can recall there were no total power failures during this period.

In a nutshell, the management had overseen the dismissal of staff who had a key understanding of the systems, and had handed those systems over to a company which had many clients and only second-hand knowledge, based on what they were told by the staff whose roles they were taking over.

It's easy to imagine that scenario at BA.  I doubt if the 'executive' have a scooby about how their systems work!
Title: Re: The BA IT collapse
Post by: hatler on 06 June, 2017, 09:41:27 am
From my perspective, Wipro were significantly worse than useless.
Title: Re: The BA IT collapse
Post by: ian on 06 June, 2017, 09:56:29 am
Well, BA made hundreds of their experienced and knowledgeable IT staff redundant and outsourced much of their work. But apparently that's nothing to do with it.

Possibly mutating neutrinos then.
Title: Re: The BA IT collapse
Post by: Jaded on 06 June, 2017, 01:42:54 pm
Hah! After Brexit they won't be allowed in, mutating or not.
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 08 June, 2017, 08:33:26 am
Does British Airways' explanation stack up? (http://www.bbc.com/news/business-40186929)

Quote
Willie Walsh, who runs the airline's parent company, has offered a little more detail about why their computer system crash-landed last week.
Put simply, an "engineer" cut the data centre's power, messed up the reboot and fried the circuits, he has said.

His explanation has raised eyebrows amongst former British Airways IT workers I've spoken to.

The simple explanation for Walsh's explanation is that he doesn't know much about the computer systems which underpin his company's operations.

Top businessmen who suffer the defect of not understanding IT are dinosaurs who don't realise when it's time to become extinct.
Title: Re: The BA IT collapse
Post by: David Martin on 08 June, 2017, 08:39:48 am
Does British Airways' explanation stack up? (http://www.bbc.com/news/business-40186929)

Quote
Willie Walsh, who runs the airline's parent company, has offered a little more detail about why their computer system crash-landed last week.
Put simply, an "engineer" cut the data centre's power, messed up the reboot and fried the circuits, he has said.

His explanation has raised eyebrows amongst former British Airways IT workers I've spoken to.

The simple explanation for Walsh's explanation is that he doesn't know much about the computer systems which underpin his company's operations.

Top businessmen who suffer the defect of not understanding IT are dinosaurs who don't realise when it's time to become extinct.
I disagree, but a top businessman who does not seek out and use the best advice on mission-critical systems? I don't think he has to understand them himself, but he has to understand their importance and balance the business risk.
Title: Re: The BA IT collapse
Post by: Polar Bear on 08 June, 2017, 08:55:33 am
I find the explanation somewhat lacking in credibility.  It's surely inconceivable that the engineer can simply kill the whole system as well as the failover, backup, redundancy (call it what you will) just like that.

BA don't want to unduly concern the markets and investors, so they've been playing down what happened.   I'm going to stick my neck out and guess that a scheduled failover test failed and their recovery routines failed too.
Title: Re: The BA IT collapse
Post by: mattc on 08 June, 2017, 08:57:13 am
That sounds like a lie that he thinks no one will bother to delve into: he's trying to blame a minor human error and make it sound like "just one of those things".
Title: Re: The BA IT collapse
Post by: Greenbank on 08 June, 2017, 09:18:09 am
I find the explanation somewhat lacking in credibility.  It's surely inconceivable that the engineer can simply kill the whole system as well as the failover, backup, redundancy (call it what you will) just like that.

I don't think the other system/DC was taken down by power. From what I've read it's more likely that the power was cut in one DC (they run the two DCs as active:active rather than primary:backup or DR style) after the UPS system was disabled, and then power was restored in a panic (possibly botched, e.g. turned back on, the subsequent surge not protected by the disabled UPS, some machines go down again, data corruption, etc). When enough of these machines were back online they started to sync partially corrupted data between the two DCs, and now you're in a world of trouble.

Having worked in DCs where there are strict procedures for doing things, I can see exactly how this kind of thing can happen, because there's rarely any proper enforcement of those strict procedures. A bunch of stuff written down in a playbook is useless if people ignore it. A "needs two people to do something" rule is nigh on useless if both people are complacent. Mistakes will always happen. It's the same everywhere; look at "never events" in the NHS, for example.

What BA lacked was a procedure/plan (that they test frequently) for dealing with corruption like this. You can bet they'll be working towards one now (and then the regular testing of the plan will eventually succumb to the same complacency that caused the problem in the first place).
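
For anyone who wants the sync-the-corruption failure mode spelt out, here's a hedged toy sketch in Python. It's entirely my own assumptions; the record names and the "latest version wins" rule are illustrative, not BA's actual replication design. Naive active:active replication copies a corrupted record to the healthy DC just as happily as a good one, whereas even a simple checksum guard would quarantine it.

Code:
# Toy model of replicating corruption between two active:active sites
# (my own assumptions; not BA's actual replication stack).

import hashlib

def checksum(payload: str) -> str:
    return hashlib.sha256(payload.encode()).hexdigest()

def record(payload: str, version: int):
    return (payload, checksum(payload), version)

# Each site stores record_id -> (payload, checksum, version).
site_a = {"PNR123": record("LHR->JFK seat 12A", 1)}
site_b = {"PNR123": record("LHR->JFK seat 12A", 1)}

# Botched power-up at site A: payload garbled mid-write, checksum now stale,
# but the version counter ticked over, so "latest wins" prefers it.
good_digest = site_a["PNR123"][1]
site_a["PNR123"] = ("LHR->J## seat ??", good_digest, 2)

def replicate(src, dst, verify=False):
    for key, (payload, digest, version) in src.items():
        if key in dst and dst[key][2] >= version:
            continue                      # destination already up to date
        if verify and checksum(payload) != digest:
            print(f"quarantining corrupt record {key}")
            continue                      # refuse to spread the damage
        dst[key] = (payload, digest, version)

replicate(site_a, site_b)                 # naive: corruption now in both DCs
print(site_b["PNR123"][0])                # LHR->J## seat ??

site_b = {"PNR123": record("LHR->JFK seat 12A", 1)}
replicate(site_a, site_b, verify=True)    # guarded: corrupt write rejected
print(site_b["PNR123"][0])                # LHR->JFK seat 12A

Replication isn't a backup: it copies bad data as faithfully as good, which is why a separately tested recovery plan matters.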
Title: Re: The BA IT collapse
Post by: mrcharly-YHT on 08 June, 2017, 09:20:39 am
Let's apply logic.

A) Assume he told the truth. That means they only have one data centre, with no DR and no fallback. Whoa, that's some major failing there.

B) Assume he is outright lying. His job is to protect the value of the company, so his motivation to lie would be to protect his job or to protect the value of the company (or both). The real reason for the failure would then be something that would be very damaging if it came out; more damaging than implying (A).

Are there any other conclusions we can definitely derive from his statement that are anything other than speculation?

[edit] I suppose we can just decide that he didn't have a clue wtf he was talking about, so he was just spouting utter bullshit.
Title: Re: The BA IT collapse
Post by: Greenbank on 08 June, 2017, 09:25:29 am
C) He's paraphrasing and maybe putting a bit of spin on what was reported to him. So he thinks he is telling the truth.

It's the difference between truth and fact. He may completely believe what he is saying is true, but it's not what actually happened.

You can't expect accurate technical descriptions of the root cause of the problem from someone non-technical, nor can you expect that from a senior person in the company as a media release, because it will almost certainly be misunderstood by the media and misreported.
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 08 June, 2017, 11:13:40 am
Does British Airways' explanation stack up? (http://www.bbc.com/news/business-40186929)

Quote
Willie Walsh, who runs the airline's parent company, has offered a little more detail about why their computer system crash-landed last week.
Put simply, an "engineer" cut the data centre's power, messed up the reboot and fried the circuits, he has said.

His explanation has raised eyebrows amongst former British Airways IT workers I've spoken to.

The simple explanation for Walsh's explanation is that he doesn't know much about the computer systems which underpin his company's operations.

Top businessmen who suffer the defect of not understanding IT are dinosaurs who don't realise when it's time to become extinct.
I disagree, but a top businessman who does not seek out and use the best advice on mission-critical systems? I don't think he has to understand them himself, but he has to understand their importance and balance the business risk.

I meant in the sense of perhaps knowing (or caring about) the difference between a UPS and a USP, rather than being a coder.
Title: Re: The BA IT collapse
Post by: ian on 08 June, 2017, 12:21:01 pm
I don't think it reasonable that executive leadership necessarily know the technical details, but they should understand the need for the expertise of those who do. That seems to be the modern failing: executives flit between disparate companies and industries with little actual comprehension of how those businesses deliver their products and services. They live in a world circumscribed by spreadsheets, margins and EBITDAs, where critical expertise is measured in headcount and salary expenditure. And failure simply means collecting the money and moving on to another executive position or a comfy retirement package. Management layer stacking also ensures senior management levels are very effectively insulated from the actual business. Even if they wanted to know, they're unlikely to find out.
Title: Re: The BA IT collapse
Post by: De Sisti on 08 June, 2017, 07:49:52 pm
I find the explanation somewhat lacking in credibility.  It's surely inconceivable that the engineer can simply kill the whole system as well as the failover, backup, redundancy (call it what you will) just like that.

I don't think the other system/DC was taken down by power. From what I've read it's more likely that the power was cut in one DC (they run the two DCs as active:active rather than primary:backup or DR style) after the UPS system was disabled, and then power was restored in a panic (possibly botched, e.g. turned back on, the subsequent surge not protected by the disabled UPS, some machines go down again, data corruption, etc). When enough of these machines were back online they started to sync partially corrupted data between the two DCs, and now you're in a world of trouble.

Having worked in DCs where there are strict procedures for doing things, I can see exactly how this kind of thing can happen, because there's rarely any proper enforcement of those strict procedures. A bunch of stuff written down in a playbook is useless if people ignore it. A "needs two people to do something" rule is nigh on useless if both people are complacent. Mistakes will always happen. It's the same everywhere; look at "never events" in the NHS, for example.

What BA lacked was a procedure/plan (that they test frequently) for dealing with corruption like this. You can bet they'll be working towards one now (and then the regular testing of the plan will eventually succumb to the same complacency that caused the problem in the first place).
Of the acronyms, I understood BA, NHS and UPS, but couldn't work out what DC or DR stood for.
Title: Re: The BA IT collapse
Post by: Kim on 08 June, 2017, 07:50:25 pm
Data Centre and Disaster Recovery
Title: Re: The BA IT collapse
Post by: De Sisti on 08 June, 2017, 07:51:12 pm
Thanks Kim.
Title: Re: The BA IT collapse
Post by: Asterix, the former Gaul. on 19 September, 2017, 02:41:00 pm
Ryanair crew screw up (http://www.bbc.com/news/business-41311603)

Bizarrely the CEO appears unable to find a way of blaming the computer system or power supply:

Quote
He said the airline did not have an overall shortage of pilots, but said they had "messed up" the rosters for September and October.
"This is our mess-up. When we make a mess in Ryanair we come out with our hands up," he said.
Title: Re: The BA IT collapse
Post by: Sergeant Pluck on 19 September, 2017, 02:56:06 pm
Recruitment and retention seem to have a role to play.

http://www.cityam.com/272219/norwegian-has-been-quietly-poaching-raft-ryanair-pilots

Ryanair seems to employ at least some of its pilots on something that resembles a zero hours contract:

https://www.irishtimes.com/business/transport-and-tourism/german-inquiry-into-ryanair-pilot-work-status-extended-1.2418995

Title: Re: The BA IT collapse
Post by: Morat on 22 September, 2017, 12:52:06 pm
Ryanair crew screw up (http://www.bbc.com/news/business-41311603)

Bizarrely the CEO appears unable to find a way of blaming the computer system or power supply:

Quote
He said the airline did not have an overall shortage of pilots, but said they had "messed up" the rosters for September and October.
"This is our mess-up. When we make a mess in Ryanair we come out with our hands up," he said.

Definitely HR this time :)