What broke the bank

British bank TSB was stuck in the aftermath of an ugly divorce. Though it had been two years since the financial institution had split from Lloyds Banking Group, with which it had originally merged, TSB was still symbiotically tied to its former partner through a hastily set-up clone of the Lloyds Banking Group IT system. Worse, TSB was paying alimony: licensing fees equivalent, at the time of this writing, to $127 million per year.

No one likes paying money to their ex, so on April 30, at 6pm, TSB enacted a months-in-the-making plan to change that: migrating billions of customer records for its 5.4 million customers to the IT systems of Spanish company Banco Sabadell, which had bought TSB for £1.7 billion ($2.2 billion) in March.

Banco Sabadell chairman Josep Oliu had announced the plan two weeks before Christmas, at a celebratory company meeting of more than a thousand people in Barcelona’s Palau de Congressos de Catalunya, a cavernous, modern conference hall in the city’s financial district. Crucial to the migration would be a new version of a system developed by Banco Sabadell in the year 2017, Proteo, which had been rechristened Proteo4UK specifically for the TSB migration project.

More than 2,000 years of person power had gone into Proteo4UK, Banco Sabadell chief executive Jaime Guardiola Romojaro boasted to the Barcelona crowd. “The integration of Proteo4UK is an unprecedented project in Europe, a project in which more than 1,000 professionals have participated,” he continued. “It would offer a significant boost to our growth in the United Kingdom.”

TSB chose that April date for the migration because it fell on a quiet Sunday evening in mid-spring. The bank’s existing IT systems had been offline for most of the weekend as the Proteo4UK project took place, and as customer records were shifted from one system to another. Flipping the switch to recommence public access to bank accounts late on a Sunday evening would allow the bank a slow, smooth entry back into service.

But while Oliu and Guardiola Romojaro were buoyant at the pre-Christmas company meeting, those at TSB who were actually working on the migration were nervous. The project was meant to take months, but it had been running behind schedule and over budget. After all, shifting an entire company’s records from one system to another is no mean feat.

They were right to be nervous.

Twenty minutes after TSB reopened access to accounts, believing that the migration had gone smoothly, it received the first reports of issues. People’s life savings were suddenly missing from their accounts. Tiny purchases had been incorrectly recorded as costing thousands. Some people logged on and were presented not with their own bank accounts but with those of completely different customers.

At 9pm, TSB officials notified the UK’s financial regulator, the Financial Conduct Authority (FCA), that something had gone wrong. But the FCA had already taken notice: TSB had screwed up massively, and consumers were up in arms. (In the 21st century, no unhappy customer is ever very far from Twitter.) The FCA, as well as the Prudential Regulation Authority (PRA), another UK financial regulator, came calling later that same night. When they managed to get TSB officials on a conference call just after midnight, by then Monday morning, they had one question: What is going on?

Though it would take some time to understand, we now know that 1.3 billion customer records were corrupted in the migration. And as the bank’s IT systems took weeks to recover, millions of people struggled to access their money. More than a year on from TSB’s weekend from hell, experts think they’ve identified the root cause: a lack of rigorous testing.

Bank IT systems have become more complex as customer needs, and expectations, have increased. Sixty years ago, we would have been happy to visit a local branch of our bank during operating hours to deposit money we had in hand or to withdraw it over the counter with the help of a teller. The amount of money in our account directly correlated with the physical cash and coins we handed over. Our account ledger could be tracked using pen and paper, and any sort of computerized system was beyond customers’ reach. Bank employees put traditional card and paper-fed data into giant machines that would tabulate totals at the end of a day’s or week’s trading.

Then the world’s first automated teller machine (ATM) was installed outside a bank in north London. It changed everything about banking, and required a significant shift in the way that banks interfaced with their consumers. Convenience became the watchword, and this principle positioned customers closer than ever to the systems that kept banks running behind the scenes.

“The IT systems a long time ago were pretty much only used by bank employees, and they could pretty much continue running the bank doing only paper things over the counter,” explains Guy Warren, chief executive of ITRS Group, a supplier of technology to 190 banks worldwide. “It wasn’t really until ATMs and then online banking came in that the general public were accessing the bank’s IT systems directly.”

ATMs were just the beginning. Soon, people were able to avoid queues altogether by transferring funds over the phone. This required specialized cards inserted into hardware that could decipher the dual-tone multifrequency (DTMF) signals, which would translate a customer pressing “1” into a command to withdraw money, and “2” into an order to deposit funds.

Internet and mobile banking have brought the customer ever closer to the main systems that keep banks running. Though separate setups, all these systems have to interface with one another and with the core mainframe, triggering balance transactions, updating cash transfers, and so on.

The typical high street retail bank runs its core banking system on a mainframe computer, says BLMS Consulting’s Brian Lancaster, who spent years working at IBM and several more years overseeing the technical departments responsible for the IT systems of HSBC, and who now consults for banks and building societies (community-run lenders accountable to their customers) across the UK. “That’s probably the most resilient platform you can base that core banking system on,” he says, “and it’s probably the most scalable.” The core customer database sits on that mainframe, along with various sets of IT infrastructure, including lots of servers, which provide an application interface to the mainframe and allow internet access.

Few customers likely think about the complexity of the data movement that occurs when they log into their online bank account just to load and refresh their information. Logging on will transmit that data through a set of servers; when you make a transaction, the system will duplicate that data on the backend infrastructure, which then does the hard work–shifting cash around from one account to another to pay bills, make repayments, and continue subscriptions.

Now multiply that process by several billion. Today, most adults around the world have a bank account, according to data compiled by the World Bank with the help of the Bill and Melinda Gates Foundation. Each of these individuals has to pay bills; some make mortgage repayments; many more have a Netflix or Youku Tudou subscription. And they’re not all in the same bank.

A single bank’s numerous internal IT systems (mobile banking, ATMs, and more) don’t just have to interface with each other. They also have to interface with banks in Bolivia, Guatemala, or Brazil. A Chinese ATM has to be able to spew out money if prompted by a credit card issued in the United States. Money has always been global. But it’s never been so complicated.

“The number of ways you can touch a bank’s IT systems has increased,” says Warren, the ITRS Group executive. And those systems rarely age out of use. New ones, however, continue to come in.

“If you take all the platforms that touch all the different customer bases, and think of all the hours they need to be available, it’s inevitable that you have a problem,” Warren explains. Success is measured by “how good your systems are at repairing themselves, and how good you are at handling a significant outage.”

TSB’s systems weren’t great at repairing themselves. The bank’s team struggled with handling a significant outage, too. But what really broke TSB’s IT systems was their complexity. According to a report compiled for TSB by IBM in the early days of the crisis, “a combination of new applications, advanced use of microservices, combined with use of active-active data centers, have resulted in compounded risk in production.”

Some banks, like HSBC, are global in scale and therefore have highly complex, interconnected systems that are regularly tested, migrated, and updated. “At somewhere like HSBC, that sort of thing is happening all the time,” says former HSBC IT leader Lancaster. He sees HSBC as a model for how other banks should run their IT systems: by dedicating staff and taking their time. “You dot all the i’s, cross all the t’s, and recognize that [it still] needs a significant amount of planning and testing,” Lancaster says.

With a smaller bank, especially one without extensive migration experience, getting it right is that much more of a challenge.

“The TSB migration was a complex one,” Lancaster says. “I’m not sure they’d got their heads around that level of complexity. I got a very strong impression they hadn’t worked out exactly how to test it.”

Speaking to a UK parliamentary inquiry about the issue weeks after the outage, Andrew Bailey, chief executive of the FCA, confirmed that suspicion. Bad code likely set off TSB’s initial problems, but the interconnected systems of the global financial network meant that its errors were perpetuated and irreversible. The bank kept seeing unexpected errors elsewhere in its IT architecture. Customers received messages that were gibberish or unrelated to their issues.

“To me, that denotes the absence of robust regression testing, because these banking systems are connecting to a lot of outside systems, such as payment systems and messaging systems,” Bailey told members of Parliament. “These things that crop up, when you put a fix in, that you weren’t expecting, get you back to the question of testing.”

Others agreed. IBM experts who were brought in to analyze what had gone wrong didn’t couch their criticism of the bank one bit. They said that they “would expect world-class design rigor, test discipline, comprehensive operational proving, cut-over trial runs, and operational support set-up.” What they found was something different: “IBM has not seen evidence of the application of a rigorous set of go-live criteria to prove production readiness.”

TSB had walked into a minefield, and the bank seemingly had no idea. “There’s a lot of complexity behind the technology being used, and that complexity manifests itself in various ways,” explains Ryan Rubin, an IT expert who has previously worked for EY, and who is now the managing director of Cyberian Defense, a consultancy helping big firms manage cyber risk. “It could lead to downtime and exacerbated events, like we’ve seen.”

Warren explains that UK banks will often aim for a target of “four 9s” availability, meaning that their services are accessible to the public 99.99 percent of the time. In practice, this means that an IT system required to be available every single hour of the day, as online banking is, could be offline for about 53 minutes per year. “Three 9s,” or 99.9 percent availability, doesn’t sound all that different, but it’s equivalent to more than eight hours of downtime a year. “For a [British] bank, four 9s is fine, three 9s is not,” says Warren, who recalls that the first software project he ever advised on was a six 9s project: a control system for a nuclear power station.
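
The arithmetic behind these availability targets is simple enough to check directly. As a minimal sketch (the helper name `downtime_minutes_per_year` is hypothetical, and a 365-day year is assumed), each additional 9 cuts the permitted downtime by a factor of ten:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of `nines` 9s.

    Hypothetical helper: 3 nines means 99.9% availability, 4 means 99.99%.
    """
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    unavailable_fraction = 10.0 ** -nines  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Print the downtime budget for the three targets Warren mentions.
    for nines in (3, 4, 6):
        minutes = downtime_minutes_per_year(nines)
        print(f"{nines} nines: {minutes:.2f} minutes per year")
```

Under these assumptions, four 9s allows a little under 53 minutes of downtime a year, three 9s roughly 8.8 hours, and six 9s only about half a minute.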

Every time a company makes a change to its IT infrastructure, it runs the risk of something going wrong. Reducing the number of changes can help avoid issues, while changes that are required need rigorous testing, something IBM highlighted as absent in the TSB outage.

Shujun Li, who teaches cybersecurity at the University of Kent and who consults for large organizations (including one large bank and a number of insurers), says that every upgrade and patch comes down to risk management, particularly when dealing with hundreds of millions of dollars’ worth of customers’ funds. “You need to have a procedure making sure the risks are managed properly,” he says. “You [also need to] know, if it goes wrong, how much it will cost in terms of money and reputation.”

Careful planning could mitigate the risks of such downtime in a way that TSB didn’t seem to factor in. “Failures will continue to happen, but the cost of applying resilience and having redundancy has come down,” Rubin says. Storage costs have fallen as network providers and cloud solutions have risen. “These things are all there, which can help the banks to manage their risk and fail gracefully when disaster strikes.”

Still, securing backup plans in the event of disaster may be too costly for some institutions. Warren believes that some banks have become too cautious with their spending in how they approach IT resiliency. “You can’t do this on a budget,” he explains. “This is a financial service: Either it’s available, or it isn’t. They should’ve spent more.”

Perhaps the easiest way to avoid outages is simply to make fewer changes. Yet, as Lancaster says, “every bank, every building society, every company is pushed by the business to build more and more good stuff for the customers and good stuff for the business.” He observes, “There’s a drive to get more and more new systems and functionality in so you can be more competitive.” At the same time, companies, particularly financial ones, have a duty of care to their customers, keeping their savings safe and maintaining the satisfactory operation of existing services. “The dilemma is how much effort do you put into keeping things running when you have a huge pressure from the business to introduce new stuff,” Lancaster says.

Reported technology outages in the financial services sector in the UK increased in the period up to 2018, according to data published by the FCA. Far and away, the most common root cause of outages is a failure in change management. Banks in particular require constant uptime and near-instantaneous transaction reporting. Customers get worried if their cash is floating about in the ether, and become near riotous if you separate them from their money.

A matter of months after TSB’s great outage, the UK’s financial regulators and the Bank of England issued a discussion paper on operational resilience. “The paper is trying to say to the financial organizations: Have you tipped this balance too far in bringing stuff in and not looking after the systems you have on the floor today?” Lancaster explains.

The paper also suggests a potential change to regulation: making individuals within a company responsible for what goes wrong within that company’s IT systems.

“When you personally are liable, and you can be made bankrupt or sent to prison, it changes [the situation] a great deal, including the amount of attention paid to it,” Warren says. “You treat it very seriously, because it’s your personal wealth and your personal liberty at stake.”

Since TSB, Rubin says, “there’s definitely more scrutiny. Senior managers can’t afford to ignore or not invest enough in their technology estates. The landscape has changed now in terms of fines and regulatory expectation.”

But regardless of what lessons have been learned from TSB, significant outages will still occur. They’re inevitable.

“I don’t think it can ever go away,” Warren says. Instead, people have to decide:“What’s an acceptable level of availability, and therefore outages?”
