Author Topic: The BA IT collapse  (Read 10782 times)

T42

  • Old fool in a hurry
Re: The BA IT collapse
« Reply #25 on: May 28, 2017, 10:57:35 am »
My employer took the same course of outsourcing in 2005.  It was not successful, because the companies doing the work were also working for other clients and gave priority to none of them.  I understand that much of the work has since returned to UK employees.

Heh. As Y2K approached a chum and I set up a large order-processing/production system for a manufacturing jeweller in Germany. After a couple of years we were suddenly shown the door, as a palace revolution had swung opinion in favour of taking standard software from a big software house so as to ensure continuity if one of us bit the dirt.  Six months later we heard that the standard SW had required so many adaptations that it had already cost twice as much as our system and was far from finished, and everything was being done by just two blokes. Another six months and one of them had got fed up and left.  We never got the client back, though.
I've dusted all those old bottles and set them up straight.

TheLurker

  • Goes well with magnolia.
Re: The BA IT collapse
« Reply #26 on: May 28, 2017, 11:42:37 am »
Let's face it.   Every single one of us who works or has worked in IT knows that, basically, the whole thing is held together with the electronic equivalents of spit and baler twine by people running desperately hard to stand still in an effort to keep up with the latest hare-brained scheme cooked up by the PHBs in cahoots with *shiny stuff* vendors, and sooner or later it all goes wrong.  The trick is to be somewhere else and/or not reliant on the failed system when it does.  :)
The most beautiful journeys are made under our own power - Φίλοι του Ποδήλατου (Friends of the Bicycle)

Phil W

Re: The BA IT collapse
« Reply #27 on: May 28, 2017, 01:19:01 pm »
We used to call Disaster Recovery (DR) testing Computer Aided Overtime (CAO).  It was meant to mean Continuous Application Operation, but our term was far better. We learnt that we could restore the backups and get our system (one of many) fully up and running and verified in about 12 hours.  We also learnt that it could only handle about 1/12 of the capacity it needed, as the backup setup was rather weedy compared to the primary.

I also found out when on call (for everything) that one site was run off a mezzanine floor in the other head office.  At night a cleaner unplugged a cabinet powering the core system disks of the mainframe.  I only found this out when I asked what was down and they said everything. Oh what joy the next hours were as we worked through priority systems, handing off to the next guy on the on-call rota, and we came very close to invoking full DR.  IT systems can fail in interesting ways that aren't always thought of.  In theory you should be able to remove all power from everything and recover from it, but it doesn't seem to work like that in reality; you get partial failures that are often actually harder to recover from (and take longer, as you try to recover in the primary before invoking DR).

Of course this was back in the early 90s; money has been invested since and DR has moved on.

So if BA is still on a daily/weekly backup DR setup with an alternate cold-standby data centre, doesn't have alternate offices for operations and support to deploy to and operate from as "war rooms", and hasn't tested DR properly in a long while, then I can well understand why they might not get their systems back so quickly if the entire power supply goes at the primary DC.

I love the way the BBC have reported it as a global IT system, not realising it's plural: systems, as in probably hundreds of interconnected legacy and modern ones (which themselves have many moving parts) joined together with spaghetti. It is not as simple as turning your home PC off and on.

Plus, as others have said: when they no longer value those who built their legacy systems, don't encourage and promote knowledge sharing, and lay them off, they are left with inexperienced staff who are OK when operations are running normally but have no experience of dealing with a DR situation.


T42

  • Old fool in a hurry
Re: The BA IT collapse
« Reply #28 on: May 28, 2017, 01:49:30 pm »
Plus, as others have said: when they no longer value those who built their legacy systems, don't encourage and promote knowledge sharing, and lay them off, they are left with inexperienced staff who are OK when operations are running normally but have no experience of dealing with a DR situation.

Applies well beyond IT, too.
I've dusted all those old bottles and set them up straight.

Re: The BA IT collapse
« Reply #29 on: May 28, 2017, 04:20:40 pm »
The ownership of BA is interesting.  It's only British in a loose kind of way, being part of International Airlines Group and run in partnership with Iberia, the Spanish 'flagship', in a merger of apparent equals. However, BA's profits were badly hit by Iberia's exposure to the downturn of Spain's economy after 2008.  I guess its profits are about to take another hit, which is good news only for its rivals.
Sic transit and all that..

ElyDave

  • Royal and Ancient Polar Bear Society member 263583
Re: The BA IT collapse
« Reply #30 on: May 28, 2017, 04:46:56 pm »
Seems it doesn't matter if it's IT or O&G, the seven P's still apply. One of my first questions when talking about their shiny incident control room is "what happens if this room is unavailable, the power fails, the internet goes down, the people don't turn up, etc.?"

Sometimes they have thought about it.

It's when you have something like Deepwater Horizon, or the Elgin G4 incident, that you find out how well it really works over weeks and months rather than hours.
“Procrastination is the thief of time, collar him.” –Charles Dickens

Re: The BA IT collapse
« Reply #31 on: May 28, 2017, 04:52:48 pm »
I had a contract working on a DR system for a BigCorpBank. We decided to do a test of the fallback DR - they had a backup site for trading, if the main site went down (think big bomb), then they could use the other site to close down trades.

So, one weekend we shut down the main site, literally pulled the plug, and started to bring up the backup site. All going well until we started to bring up the payment system, the single connection that is required for processing the financial transactions. This is a live connection, never shut off and activated by a swipe card and a long code.

The bank's card and code combination didn't work. Nobody had ever, ever, ever verified the card and code.

Let's think about this. We've turned off the bank's main connection. The backup one won't connect. In 24 hours the trading floors open. There is nobody we can contact to do anything about this card and code, because it is the weekend and they don't work weekends. Without this connection up, BigCorp can't trade.

We Are Fucked.
Marmite slave

pdm

  • Sheffield hills? Nah... Just potholes.
Re: The BA IT collapse
« Reply #32 on: May 28, 2017, 05:11:33 pm »


Yup. I am currently designing the network .....

And there you have at least part of it.

Systems are not just the sum of their components. You can have great components but, if they aren't designed, installed or maintained properly, you have a pile of useless. FTR, my role is integrator of the various stuffs: Network, DR, Server, Service, etc. etc.

Go on, another anecdote (dating back about 15 years). Another major corp with highly redundant, highly secure systems lost all contact with their, erm, contact centre. I happened to be there and got involved. It turned out that the dual redundant inbound circuit terminations had, when installed, not only been plugged into the same dual socket, but into a wall socket instead of a cabinet socket. And someone had plugged in a kettle. That was all it took.

You don't have to go back 15 years for this sort of idiocy!
Just in the last couple of weeks, a major critical UK infrastructure IT system failed and was down for many hours because the primary site failed. The backup site could not come online because it would not operate without the primary site online..... This apparently turned out to be part of the (flawed) original design....  :facepalm:
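For what it's worth, the flaw in that design is avoidable if the standby's promotion decision depends only on state it holds locally, never on the primary being reachable. A minimal sketch in Python, with all names (Standby, probe, tick) hypothetical rather than from any real product:

```python
class Standby:
    """Sketch of a standby node that can promote itself when the primary
    disappears.  Crucially, promotion depends only on local state (a
    missed-heartbeat counter) -- unlike the flawed design above, nothing
    here requires the primary to be online for the backup to start."""

    def __init__(self, probe, max_missed=3):
        self.probe = probe            # callable: True if primary looks healthy
        self.max_missed = max_missed  # missed beats tolerated before failover
        self.missed = 0
        self.role = "standby"

    def tick(self):
        """Run once per heartbeat interval."""
        try:
            healthy = self.probe()
        except Exception:
            healthy = False           # an unreachable primary counts as a miss
        self.missed = 0 if healthy else self.missed + 1
        if self.role == "standby" and self.missed >= self.max_missed:
            self.role = "primary"     # promote using local state alone
        return self.role
```

A real failover design also needs split-brain protection (fencing, or a quorum of witnesses) so both sites can't end up primary at once; the point of the sketch is only that the standby's start-up path must not depend on the thing whose failure it exists to cover.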

rr

Re: The BA IT collapse
« Reply #33 on: May 28, 2017, 05:17:17 pm »
Seems it doesn't matter if it's IT or O&G, the seven P's still apply. One of my first questions when talking about their shiny incident control room is "what happens if this room is unavailable, the power fails, the internet goes down, the people don't turn up, etc.?"

Sometimes they have thought about it.

It's when you have something like Deepwater Horizon, or the Elgin G4 incident, that you find out how well it really works over weeks and months rather than hours.
There was also the recent incident where the platform Inventory was released when the backup batteries ran down.



ElyDave

  • Royal and Ancient Polar Bear Society member 263583
Re: The BA IT collapse
« Reply #34 on: May 28, 2017, 05:18:34 pm »
One of the UK's major utilities had satellite links of data back to a mega control centre. One winter, it snowed heavily, dish filled up. Backup was a phone line. Phone line had never been used. BT had noticed lack of use and disconnected it.
“Procrastination is the thief of time, collar him.” –Charles Dickens

Re: The BA IT collapse
« Reply #35 on: May 28, 2017, 05:19:59 pm »
Quote
We believe the root cause was a power supply issue."

My employer had back-up generators in the basement.  Power supply issues just made the lights flicker....

The simple explanation for BA's plight is that ... they failed to understand the downside of outsourcing.
When I worked for a global megacorp it had back up generators, & for major IT systems a back up data centre geographically separate from the main one, i.e. far enough away for it to be unlikely that a single disaster would affect both. Critical systems were behind locked doors & power supplies to 'em were protected so nobody could knock 'em down by casually pulling out a plug.

Never found out if it'd all work, though. I recall a power cut causing the lights in our office to flicker & calls of "has anything failed?", all answered by "X (Y, Z, etc.) is working" but that was just development systems.

We had issues with outsourcing, though. I recall a mega-worldcorp external supplier's big show of how their replacement billing system, carefully customised for us after exhaustive & expensive studies by hordes of supposedly world-leading consultants & analysts, would replace some, & interface with other, existing systems. A bunch of us stood looking in puzzlement at a big diagram supposedly showing everything until one of us voiced what we were all thinking, i.e. "where's so-and-so?", so-and-so being a system without which (or a replacement) we wouldn't produce any bills.  :facepalm: Many months of work, walking past us to their desks every day, but it had never occurred to them to ask us (the billing team) what we did & how it fitted in.

I also recall when it was decided to replace the last part of an old system running on old hardware which wasn't doing very much any more, but the little it was doing was essential. Lavishly illustrated proposals galore were put forward by external suppliers, all of which involved great expense, & none of which, as far as those of us working on it could see, took into account quite how little a replacement needed to do. They were all far too complicated, involved duplication of processing that was done in other systems, & meant holding the same data in multiple databases. One consisted of emulating the old hardware on new hardware & porting everything across, even though getting away from that hardware was one of the main benefits of replacement, since it eliminated the need for separate copies of data, expensive licences for an old mainframe OS & other old software which was being milked before it went out of use, etc.

It was quite a struggle to get an internal replacement even looked at. There was no allocation of time to investigate one until a low-level manager sneaked in some cover for the best person to look at an internal replacement so he could draw up a proper proposal. It was adopted (a no-brainer once it was actually compared with the external offers), implemented by existing staff & IIRC one contractor (only needed because one person had been made redundant to meet a headcount reduction target  :facepalm: - another person complained that the contractor was asking him questions that he used to ask the redundant person, i.e. me), faster than any of the external proposals, & it saved the company a fortune. As soon as it went live there were redundancies in the team that had done it, including the bloke who'd drawn up the proposal & thus saved the firm a few million. The perfect reward, eh?
"A woman on a bicycle has all the world before her where to choose; she can go where she will, no man hindering." The Type-Writer Girl, 1897

zigzag

  • unfuckwithable
Re: The BA IT collapse
« Reply #36 on: May 28, 2017, 05:20:19 pm »
I hope they are back to normal before my flight on Tuesday..

Re: The BA IT collapse
« Reply #37 on: May 28, 2017, 05:35:27 pm »
One of the UK's major utilities had satellite links of data back to a mega control centre. One winter, it snowed heavily, dish filled up. Backup was a phone line. Phone line had never been used. BT had noticed lack of use and disconnected it.
Ex-employer had a few hundred grand's worth of electronic bits & pieces in a warehouse, deliberately bought just before they went out of production as spares for hardware that was still in use & scheduled for gradual replacement over several years. The replacement cost of those bits & pieces: several million, since it'd mean greatly accelerating the replacement schedule for the old hardware - buying replacements early, hiring contractors to do all the work early, etc.

Bloke responsible for it had a fight to stop it being scrapped one day. Warehouse management had it logged for disposal because, according to their criteria, the turnover was too low to make it worth allocating space to. He told me he almost didn't find out until it was too late.
"A woman on a bicycle has all the world before her where to choose; she can go where she will, no man hindering." The Type-Writer Girl, 1897

Re: The BA IT collapse
« Reply #38 on: May 28, 2017, 05:49:03 pm »
Plus, as others have said: when they no longer value those who built their legacy systems, don't encourage and promote knowledge sharing, and lay them off, they are left with inexperienced staff who are OK when operations are running normally but have no experience of dealing with a DR situation.

Applies well beyond IT, too.
Someone was recently telling me about how BR/Railtrack lost the knowledge of where a lot of its signalling cables were. Supposedly it had never had a full inventory because local teams repaired, replaced etc. & had never logged everything centrally. Come privatisation, a lot of those people were laid off - & either weren't asked or didn't want to say where everything was. Paper records may or may not have existed, but where? Knowledge may have been only in heads that were no longer employed.

On Friday I heard something similar about the water management system for a stretch of canal. The people who maintained it had all been got rid of, & went quietly. Years later, an enthusiastic young new bloke met one of the old guys at some heritage thing, & the old bloke took a liking to him - & they went for a walk round & talk through. The oldster remembered it all, & enjoyed describing it.
"A woman on a bicycle has all the world before her where to choose; she can go where she will, no man hindering." The Type-Writer Girl, 1897

Kim

  • Timelord
Re: The BA IT collapse
« Reply #39 on: May 28, 2017, 06:23:07 pm »
One of the UK's major utilities had satellite links of data back to a mega control centre. One winter, it snowed heavily, dish filled up. Backup was a phone line. Phone line had never been used. BT had noticed lack of use and disconnected it.

Until recently, this would happen to AAISP customers who have broadband on a line but no voice service.  An engineer poking around a cabinet in search of a spare line would find the line with no dialtone and steal it.  So they now play a recorded message with a bit of dialtone (to keep the test equipment happy) and a "do not steal this line" message from RevK.

I believe broadcasters have the same problem with dedicated lines between sites (presumably there's a lot less of that than there used to be), so they play some music (or another convenient audio signal) down them when not in use.

Rather sensibly, the cold war early warning system was piggybacked on the lines distributing the speaking clock signal between exchanges to avoid this problem.  Someone would notice if the speaking clock wasn't working.  I believe some air raid sirens were also connected via out-of-band signalling on normal customers' phone lines, for the same reason.
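The common thread in these stories is that an unexercised backup path fails silently. A hypothetical sketch of the remedy, a scheduled keepalive that raises the alarm on the day the line dies rather than the day it's needed (send_probe and the alarm sink are illustrative assumptions, not any real API):

```python
def check_backup_path(send_probe, alarm):
    """Exercise the standby path so that silent failure (a disconnected
    line, a 'borrowed' pair, a flat battery) is noticed immediately.
    send_probe() is a hypothetical callable that pushes a test signal
    down the backup link and returns truthy if the far end echoes it;
    alarm() is whatever raises a ticket or pages someone."""
    try:
        ok = bool(send_probe())
    except Exception:
        ok = False                # a probe that blows up is a dead path
    if not ok:
        alarm("backup path failed keepalive probe")
    return ok

# Run this on a schedule (cron, a timer thread, etc.).  The traffic
# itself is half the point: like tone on an idle broadcast circuit, it
# marks the path as alive and in use, so nobody quietly disconnects it.
```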
Careful, Kim. Your sarcasm's showing...

Mr Larrington

  • A bit ov a lyv wyr by slof standirds
  • Custard Wallah
    • Mr Larrington's Automatic Diary
Re: The BA IT collapse
« Reply #40 on: May 28, 2017, 07:33:07 pm »
Professor Larrington had the tremendous foresight to elect to use Lufthansa to go to Dresden today :thumbsup:

Meanwhile, O BBC, I know it's the Sunday evening of a Bank Holibob weekend but wheeling out a so-called "expert" who doesn't know what "UPS" stands for is Not Helpful.
External Transparent Wall Inspection Operative & Mayor of Mortagne-au-Perche
Satisfying the Bloodlust of the Masses in Peacetime

hellymedic

  • Just do it!
Re: The BA IT collapse
« Reply #41 on: May 28, 2017, 07:47:00 pm »
UPS? See 'Sh!te Courier' thread.

 ;) ;) ;D

Re: The BA IT collapse
« Reply #42 on: May 28, 2017, 07:54:29 pm »
Quote
Rather sensibly, the cold war early warning system was piggybacked on the lines distributing the speaking clock signal between exchanges to avoid this problem.  Someone would notice if the speaking clock wasn't working.

Are you sure this wasn't simply so that people would know when their four minutes was up?
Sic transit and all that..

Thor

  • Super-sonnicus idioticus
Re: The BA IT collapse
« Reply #43 on: May 28, 2017, 08:03:30 pm »

Meanwhile, O BBC, I know it's the Sunday evening of a Bank Holibob weekend but wheeling out a so-called "expert" who doesn't know what "UPS" stands for is Not Helpful.
;D Yes, I saw that piece. Is it Universal, or Unlimited?  ;D
It was a day like any other in Ireland, only it wasn't raining

Kim

  • Timelord
Re: The BA IT collapse
« Reply #44 on: May 28, 2017, 08:07:35 pm »
Unpossible
Careful, Kim. Your sarcasm's showing...

Phil W

Re: The BA IT collapse
« Reply #45 on: May 28, 2017, 08:31:51 pm »
They had UPS as in Useless

Kim

  • Timelord
Re: The BA IT collapse
« Reply #46 on: May 28, 2017, 08:55:03 pm »
Or perhaps Unexploded...
Careful, Kim. Your sarcasm's showing...

Phil W

Re: The BA IT collapse
« Reply #47 on: May 28, 2017, 08:57:08 pm »
Maybe the global BA system was danger UXB?

Re: The BA IT collapse
« Reply #48 on: May 28, 2017, 09:24:35 pm »
One of the UK's major utilities had satellite links of data back to a mega control centre. One winter, it snowed heavily, dish filled up. Backup was a phone line. Phone line had never been used. BT had noticed lack of use and disconnected it.

Until recently, this would happen to AAISP customers who have broadband on a line but no voice service.  An engineer poking around a cabinet in search of a spare line would find the line with no dialtone and steal it. 
Somehow this failed to happen to the BT line Vodafone paid for to my house for me to use for working from home. Several years after I'd last used it (VF had switched me to using my own - faster - broadband, then got rid of me), I was contacted by someone telling me that they thought I should be paying for it. It was still live.  :facepalm:

I explained the situation & they said "We've found that you're right. Sorry to have bothered you".
"A woman on a bicycle has all the world before her where to choose; she can go where she will, no man hindering." The Type-Writer Girl, 1897

Re: The BA IT collapse
« Reply #49 on: May 29, 2017, 07:47:33 am »
It just demonstrates that our increasing reliance on IT has not been matched by success in making systems crash-proof or hack-proof.

The recent cyber attack that hit the NHS and many other systems globally shows that whilst the flashy go-faster bits of technology get investment, the more mundane foundations of the edifice aren't re-engineered to cope. We end up with the walls of Jericho.

Meanwhile shareholders wait nervously for trading tomorrow..

Sic transit and all that..