What is Chaos Engineering?

Internet DNA Podcast

The Simian army is coming to a server near you, with chaos monkeys, doctor monkeys, chaos apes, janitor monkeys and the perfect Facebook storm. Is chaos engineering the best way to test a system's robustness, and what happens when a monkey actually does a good job? We look at the Twitter hack, the NHS virus and engineering through elasticity, not scale, and, well, best listen in to the pitter-patter of little monkey feet.

 

Transcription

(this transcription is written by robots… so don’t be surprised!)

 Hello and welcome to this week's episode of Internet DNA with me, Abi. This week, we're going to discuss chaos engineering, which I'm quite drawn to, mainly because it's a bit of an oxymoron: engineering is not by its nature chaotic, otherwise it wouldn't be engineered. But Dan, maybe explain chaos engineering and why it's called chaos engineering.

 Okay. So it's really coming about with the advent of cloud, because what you're looking for with chaos engineering is resilience. What you do is you build a system, and then you set off chaos engines to try and take the system down, to see if you can. Netflix were the first really famous people to do this. They have a thing, I think, called the chaos army.

 Simian army.

 I knew it had something to do with monkeys.

 It started off with a program called the chaos monkey, which I looked up. He's got a really cool logo and I love the chaos monkey. He does a bit remind me of the coke Badger, but that's another story. But the chaos monkey, you can imagine him in a data center going around and just ripping out cables and just hitting things and throwing things in the most random way, and things are going down.

And that's what it's all about. It's just random outages, isn't it? Ones that you don't know are going to happen, and then you've got to fix it.

 Well, not that you've got to fix it; you've got to engineer it so that it doesn't take the system down. So it's about resilience. If you think about Netflix, they're not going to know all the things that might happen to their network to take it down.

So they have to build that network in a way that it doesn't matter if things get taken down: they either restart somewhere else or they can be temporarily circumvented. This happens a lot. A lot of companies do it with really basic chaos engineering, which is synthetic load testing.

So you say, right, imagine we got a spike of 10 times our normal traffic. What would actually happen? That way you can test the resilience. And because of cloud computing, everything's elastic anyway. It's not like the old days, where you'd have to build lots of computers or lots of servers in a server room to cope with a never-going-to-happen event. Now it's elastic compute, and you can just say, right, scale until the point where it breaks. That's why it's a much more accepted way of dealing with things nowadays: we now use the sort of infrastructure that is critically designed to rebuild itself, repair itself.
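As a rough illustration of "scale until the point where it breaks", here is a minimal Python sketch. Nothing in it comes from the episode: `handle_load` is a hypothetical stand-in for pointing a real load generator at a real system, and the numbers are invented.

```python
def find_breaking_point(handle_load, baseline_rps, max_multiplier=20):
    """Ramp synthetic traffic in whole multiples of the baseline.

    handle_load is a hypothetical callable that returns True when the
    system coped with the given requests-per-second and False otherwise.
    Returns the first multiplier at which the system failed, or None if it
    survived every step up to max_multiplier.
    """
    for multiplier in range(1, max_multiplier + 1):
        if not handle_load(baseline_rps * multiplier):
            return multiplier
    return None


# Toy stand-in for a real system: it copes with up to 4,500 requests/second.
def toy_system(rps):
    return rps <= 4500


print(find_breaking_point(toy_system, baseline_rps=500))  # prints 10
```

With a baseline of 500 requests per second, the toy system survives nine times normal traffic and fails at ten, which is the kind of answer a synthetic load test is meant to give you.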

 When I first started learning about cloud architecture, it was: start thinking about servers just as things that are disposable. One crashes, you just start another one. You don't try and fix that one. You just delete it and then start another one somewhere else. It all goes back to this idea of having all of your infrastructure and everything in code, so that you can just go, oh, that server's died, just start a new one, bang, with all the original settings. Because it's all written in code, you can just spin up servers in seconds, over and over again, exactly the same.

 So it's not about breaking things and mending them. It's about building in the redundancy, so that if it looks like something is going to break, something else is triggered to stop it ever becoming a problem.

 Exactly. That's why you run them. So if you were to run chaos engineering over your network, it would be to detect where you are weak. When we took this down, did that take the network down? If it did, then we have to build systems around it in order to make it resilient.

So that next time we run that test, it's not going to go down. Now, the problem is, if you run tests that everybody knows about, everyone just engineers for that thing. So if you said, what we're going to do is synthetically hit the servers randomly during the week with four times the traffic, people would build it literally to 4.1 times the traffic.

That's what they would do. So the reason for having a chaos engine is that no one knows what it's going to attack. It is literally gleefully leaping around, switching things off and overloading things, getting in the way of things and just trying to break stuff, in a way that's not completely random, obviously, but unpredictable, so that you can't just engineer for the test.
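To make the "no one knows what it's going to attack" idea concrete, here is a minimal Python sketch of an unannounced chaos run. It is an illustration only, not Netflix's chaos monkey: the fleet names, the kill probability and the `min_healthy` threshold are all assumptions made up for the example.

```python
import random


def pick_victims(instances, kill_probability, rng):
    """Return the instances an unannounced chaos run would switch off.

    Each instance is chosen independently with kill_probability, so nobody
    can know in advance which ones will be hit.
    """
    return [name for name in instances if rng.random() < kill_probability]


def survives(instances, victims, min_healthy):
    """The service counts as resilient if enough instances are left running."""
    return len(instances) - len(victims) >= min_healthy


fleet = ["web-1", "web-2", "web-3", "web-4", "web-5", "web-6"]
rng = random.Random()  # pass random.Random(seed) to replay a specific run
victims = pick_victims(fleet, kill_probability=0.3, rng=rng)
print("switched off:", victims)
print("still serving:", survives(fleet, victims, min_healthy=2))
```

Because the victims change on every run, the only way to pass consistently is to build a system that copes with losing any of them, rather than engineering for one known test.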

 Yeah. Facebook has something called Facebook storm. It's a bit like the monkey. You can feel it come through and just destroy things as it goes, willy-nilly, leaving some bits and not leaving other bits. But they do have quite a laugh with it. Netflix came up with chaos engineering in 2010, 2011, and now it's open source, so anyone can use it. They have things like gremlins and chaos apes and doctor monkeys and byte monkeys and janitor monkeys. So this whole Simian army is full of all sorts of different things, a whole monkey toolkit to run over your system constantly to make sure that, basically, you have more uptime. But we've talked about Netflix and Facebook. What sort of businesses would take this on?

 If you think about it, any business, really, that has to operate a network at scale and has to be always on. If you take, let's say, BBC News, they must get attacked all the time. And so for them, having these things running around trying to switch stuff off either demonstrates the resilience of their system or shows them where they need to be more resilient, where the weaknesses are.

And that's the point. Social media is another really obvious one, where you're always on. You've always got users interacting all the time. In pure-play publishing, you can mitigate a lot of it with just caching. So you take the server down. That's fine, because you're only ever really reading off the cache anyway, and it will just go, well, I can't update it, so you're just going to see what this page used to look like the last time I looked it up.
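The caching behaviour described here, serving what the page used to look like when the server cannot be reached, can be sketched in a few lines of Python. This is an assumption-laden toy: `fetch_from_origin` is a hypothetical callable, and a real cache would also need expiry and size limits.

```python
class StaleOnErrorCache:
    """Serve fresh pages when the origin is up, the last good copy when it is not."""

    def __init__(self, fetch_from_origin):
        # fetch_from_origin is a hypothetical callable: path -> page text.
        # It is expected to raise ConnectionError when the server is down.
        self._fetch = fetch_from_origin
        self._last_good = {}

    def get(self, path):
        try:
            page = self._fetch(path)
        except ConnectionError:
            if path in self._last_good:
                return self._last_good[path]  # what the page used to look like
            raise  # never seen this page, so there is nothing to fall back on
        self._last_good[path] = page
        return page


# Toy origin server that can be switched off to simulate an outage.
origin_state = {"up": True}


def toy_origin(path):
    if not origin_state["up"]:
        raise ConnectionError("origin is down")
    return "contents of " + path


cache = StaleOnErrorCache(toy_origin)
cache.get("/news")            # fetched from the origin and remembered
origin_state["up"] = False    # the server is taken down
print(cache.get("/news"))     # still answers, from the last good copy
```

Readers of a page that has been seen before never notice the outage; only pages the cache has never fetched fail.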

 Social media, it's not really the end of the world. I know some people might think it's the end of the world, but it's not the end of the world if it comes down.

 Well Facebook would think it was the end of their world.

 Twitter would think it was the end of the world. And then a lot of people would as well.

 I think it's the end of the world. It's if you try to watch a film on Netflix and it's constantly buffering because they've got network issues. That degrades your experience of Netflix. So it's in their interest to make sure that happens as little as possible. Now, obviously, if you're on a three K internet,

 yeah. I was going to say they need to come and watch a movie at my house.

 But they'll come after the 5th of August and they'll be fine.

 But where I was getting to was national security, healthcare, finance, things that really are life-threatening. Starting to use these systems on those is where the importance is going to lie.

And I know that chaos engineering as well now is moving into security, which of course is so important, because you have hackers. I mean, the most recent one, they got into Twitter, didn't they, and hijacked important people's accounts. So having the chaos engineering, having these monkeys looking at security loopholes and backdoors, is going to be, again, a whole step up in the security of the internet, using this way of...

 As the AI arrives, the engines improve, so it will get better and better and better. It's not just the pure engineering. They've got to actually learn. If you tell a monkey its job is to try and take down the network, it needs to learn better and better: what tools did I use last time to take it down?

When did I get my last success? The thing about security, national health, all of these things, you're always running a risk with chaos engineering that you actually do take yourself down. So you've got to be slightly aware of what you're doing. I mean, that was why it was seen as such an amazing thing when Netflix did it, because you're actively endangering your network, which is so counterintuitive to what you think you would want to do.

And that's why it was revolutionary when it happened. So you have to be slightly careful that you don't let loose the Simian army and it does its job and takes you down, especially if you're the NHS.

 Or something governmental, who really pride themselves on being very secure, and suddenly their Simian army has just ditched them.

 Well, I mean, if you think, the last time the NHS had a big break was the, I'm going to forget the virus, let's call it WannaCry, but it was one of those viruses that locked huge parts of the NHS out of their computers. If you can imagine, it would be great to have a Simian army literally just go and trawl through people's computers and see: can it infect that computer? Can it allow it to happen? See if it can download attachments, or emails and stuff, because that's how you can then build something that stops people from doing the things that break their computers. It's an important step forward. I wouldn't run it, for example, at my company, because the risk-reward is not enough for me.

 It's like a lot of these things. It needs to be a very big system for this to work, because actually what you're putting on top of it is pretty severe. So you've got to have a massive array of clouds for it not to take the whole lot down. We're talking hundreds of thousands of servers and things, not just tens or hundreds.

 Also, don't forget, they aren't real servers anymore. They are what's called virtualized servers. So you're spinning up and spinning down and spinning up and spinning down servers. As the Simian army takes one server down, the whole point is that another one just spins up somewhere else.

That's the whole idea. And actually, if you get deeply into cloud engineering, that's exactly what you design for: it doesn't matter if any part of the system goes down, it can just be spun up somewhere else. But it's still quite young, and quite a technical thing to build these things. It's not like Lego.

 Which is the way you like to build. But it is educating and creating a new breed of engineers that will build in this way as standard. As opposed to building something that you think will last, you are building something that already has the built-in resilience that these monkeys are showing or exposing. So you're building in a different way.

 Yes. You're building with this audience in mind, because you're no longer trapped by "I have to buy another eight servers". No, you don't. You just spin them up and down and you only pay for what you use. So it doesn't really matter.

So you stop having this problem of: in order to do this, we have to spend vast amounts of money. You can just do it on the fly. And this is how almost all big systems are now built. Any VPS people that you now rent a VPS from, you're not renting an actual server from them. You'll be renting a spun-up disk with something installed on it, and then you can work on that.

And if it crashed, they'd just spin up another one exactly the same. You would probably never even notice that that server went down. It's very counterintuitive.

 How does it have all my information? If they spin up another disk, how does it have all my files?

 Because they separate things out like your drives and your database.

So let's say your web server was to crash. Just spin up another web server. If you've got a database, you've probably got a read database and a write database. It doesn't matter which one you take down; you just duplicate it with another.

 So when you say spin up, it's duplicating very fast.

 Duplicating is not copying an old one. The server is literally spun up from a piece of code that says: right, install this, drop that, make these all the settings, set it up like this, attach this drive, that database and this thing to it, go. And it just builds out of code. And this is why it's really interesting, because suddenly all your server configs can be in a piece of code.

And so you can spin up exactly the same server over and over and over and over again without duplicating the old one, 'cause the old one might be infected. You don't want to duplicate that.
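A toy Python sketch of building a server from a piece of code rather than copying an old one might look like this. Every name in it (`ServerSpec`, `spin_up`, the image, package and drive names) is hypothetical; real infrastructure-as-code tools have their own formats, which this does not try to imitate.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ServerSpec:
    """Everything needed to build a server, kept as data (all names hypothetical)."""
    image: str
    packages: tuple
    settings: tuple      # (key, value) pairs
    attach_drive: str
    database: str


_ids = count(1)


def spin_up(spec):
    """Build a brand-new server description from the spec, never from an old server."""
    return {
        "id": f"server-{next(_ids)}",
        "image": spec.image,
        "packages": list(spec.packages),
        "settings": dict(spec.settings),
        "drive": spec.attach_drive,
        "database": spec.database,
    }


web = ServerSpec(
    image="base-linux",
    packages=("nginx",),
    settings=(("workers", 4),),
    attach_drive="shared-files",
    database="main-db",
)

first = spin_up(web)
replacement = spin_up(web)   # e.g. after the first one crashed
```

The replacement gets a new identity but is otherwise identical, and because it is built from the spec, nothing that went wrong on the old server is carried over.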

 How does it have your files on it?

 Because you have cloud storage. So you're going to take some servers down? Fine. It's replicated everywhere.

 So this is the memory ones, not the storage ones. What if a storage one goes down? Is that duplicated somewhere else?

 Yeah, they're usually what's called sharded. So there's only a little bit of your files or database in any one place, but there are at least two copies of that little piece, spread around, and so we can just rebuild.
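The sharding idea, a small piece of the data in any one place with at least two copies of each piece, can be sketched in Python as follows. The placement rule (hash the key, then take neighbouring nodes) is an assumption chosen for simplicity, not how any particular cloud storage product does it.

```python
import hashlib


def place_shard(key, nodes, copies=2):
    """Pick `copies` distinct nodes to hold the shard that `key` belongs to."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Walk the node list from the hashed start so the copies land on different nodes.
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]


def readable_after_failure(key, nodes, failed_node, copies=2):
    """A shard is still readable if at least one of its copies is on a live node."""
    return any(n != failed_node for n in place_shard(key, nodes, copies))


storage_nodes = ["store-a", "store-b", "store-c", "store-d"]
print(place_shard("holiday-photos.zip", storage_nodes))
print(readable_after_failure("holiday-photos.zip", storage_nodes, "store-a"))  # prints True
```

Since the two copies always sit on different nodes, losing any single storage node leaves one copy to rebuild from.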

 Something makes me laugh that is along these lines. I have a friend, and he's got a bit of a conundrum. The piece of work he's most proud of will only happen in the event of the end of the world or the internet coming down. So although he wants people to see his work, he doesn't want them to see his work, because it means the internet's come down and up will pop his work.

I think this was probably pre chaos engineering, but if the internet goes down, there is a backup internet that comes up and I don't know what it says or what it does, but he'll be very proud.

 Excellent. So that's chaos engineering, which is basically rigorously testing your system all the time to make sure that it is resilient. It's actually a very interesting concept.

 It is. I mean, along with chaos theory, I assume that they run in similar veins.

 Chaos theory is slightly different. Chaos theory is that small things have, or can have, large repercussions. Whereas chaos engineering is more about: you're not doing set tests that people then design to beat the test.

You're randomizing the test completely. So you don't know what you're engineering for and therefore you have to engineer through elasticity rather than through scale. And that's quite an important distinction. You're basically building resilience by making everything easy to replicate very, very quickly and spin back up again, rather than trying to make it not fall over in the first place.

You say, fine, you can take that server down. It doesn't matter. I'm just going to rebuild another one somewhere else. Before, it would have been all about: we've got to protect this, it can't go down, and this server here is the backup server. That kind of stuff is disappearing slowly.

 Which is a lot scarier. If you were in a position where this server cannot go down, or else, that's terrifying. If you're in a position where it doesn't really matter, because then this will happen and this will happen, you're much more protected and safe anyway, because you can't be attacked.

 Yeah. Okay. Well, you can be attacked obviously, but you have to be attacked in a really particular way,

 but it's not so catastrophic

 Exactly. As soon as you mitigate the problem, you're back up online straight away.

And most of the time you wouldn't even notice that you were down. You go through your cloud warning logs and you go, oh, look, that server spun up 15 times in a day. That's quite interesting. It's not the same server, but you will see that a server cluster keeps going down and coming back up again, and that's usually because you've got a problem.

 I'm not sure if I'd go, Ooh, isn't that interesting. Some people would find it fascinating. Others just may not.

 But if you're in the job of checking, whether your servers were secure, if you suddenly notice that a particular type of server was going down more regularly than normal, it would be an indication that you should probably have a look at why that was happening. I'm not saying your general public would be going, Ooh, that's an interesting fact. That's not what I'm saying.
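Spotting a cluster that keeps going down and coming back up is easy to sketch in Python. The log format here, one cluster name per restart, and the threshold of 10 are invented for illustration; real cloud logs would need parsing first.

```python
from collections import Counter


def flapping_clusters(restart_events, threshold):
    """Return clusters with at least `threshold` restarts, busiest first.

    restart_events is a list of cluster names, one entry per restart seen
    in the day's logs (a made-up format for illustration).
    """
    counts = Counter(restart_events)
    return [(name, n) for name, n in counts.most_common() if n >= threshold]


events = ["web"] * 15 + ["db"] * 1 + ["cache"] * 2
print(flapping_clusters(events, threshold=10))  # prints [('web', 15)]
```

A cluster that restarts 15 times in a day stands out immediately, which is the cue to go and look at why.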

 Well, okay. Okay. Good. All right. Well we better leave it there for this week, but look forward to speaking to you.

 I'll speak to you next week. Have fun. See you in seven days.

 


Dan & Abi work, talk & dream in tech. If you would like to discuss any speaking opportunity contact us.