Ep. 108: Peeling Back the Layers of Agile Problem Management

SUMMARY

In this episode, we delve into problem-solving in Agile organizations, focusing on scaling methodologies. We discuss strategies like ITIL, Lean, and A3 for root cause analysis and address problem management in microservices using predictive tools. We also highlight the Safety 1 and Safety 2 models in complex systems. Our aim is to offer a systematic approach to reduce failures with effective contracts, smaller batches, and quick releases.

Description

Imagine facing colossal challenges and complex problems and not knowing where or how to start solving them. In this episode, we wrestle with the complexities of problem-solving within Agile organizations, particularly when adopting new methodologies at scale. We dissect various problem management strategies, from the well-established ITIL to the Lean and A3 methods, providing you with a toolkit for identifying root causes and understanding the factors contributing to your problems.

We delve into the intricacies of problem management within a microservices environment. We discuss the power of predictive problem management and the tools you can leverage to solve problems. We also explore the realm of complex systems and the strategies that can help identify probable causes of problems, specifically focusing on the Safety 1 and Safety 2 models. All this is aimed at equipping you and your teams with a systematic approach to problem-solving, reducing the chances of catastrophic failure through proper service contracts, smaller batches, and fast releases.

This week's takeaways:

  • Problems are ubiquitous: Problems exist everywhere, but organizations often lack a formal approach to addressing them.
  • Take the time to understand problems: Instead of reacting to problems instinctively, organizations should create a space for thoughtful, data-driven problem-solving.
  • Establish a common understanding of problems: A common agreement and language within the organization about what constitutes a problem is crucial.

As we navigate these complex topics, we ensure that you come away with a deeper understanding and practical solutions you can implement in your organization. Tune in! Be sure to listen to Definitely Maybe Agile on your favorite platform, and remember to subscribe. For additional resources and to join the conversation, contact us at feedback@definitelymaybeagile.com with your thoughts, questions, or suggestions for future episodes. Stay tuned for more exciting content!

Transcript

Peter: 0:05

Welcome to Definitely Maybe Agile, the podcast where Peter Maddison and David Sharrock discuss the complexities of adopting new ways of working at scale. Hello and welcome to another exciting episode of Definitely Maybe Agile with your hosts, Peter Maddison and David Sharrock. How are you today, Dave?

Dave: 0:21

Very good. Well, I think the last couple of times we've got together to have these conversations, it feels to me like the preparation should have been recorded, because we actually get into the meat really quickly. We have some interesting things. I think by the time we get into a recorded conversation it maybe is different, but let's see. So at the moment, I'm just preparing a workshop on root cause analysis and the difficulty organizations have, of course, in solving problems so they don't come back. I think that's maybe part of our conversation today: how do we solve problems? Because we raise a lot of conversations about things that can go wrong, but we've not really talked about identifying problems or thinking about how to solve them.

Peter: 1:06

Yeah, and it's interesting because last week I was on a panel about problem management, so this is kind of fresh in my mind from some of the topics that we talked about there. It was an even longer conversation about all the different ways in which organizations have approached this. The concepts from problem management, and how organizations look at, identify, manage and solve problems, have been around for a long time. ITIL introduces the concept and brings it in as part of how incidents become problems, and this language is almost part of the vernacular in the technology space. And yet I've also found that a lot of the newer organizations I've worked with have been scared off from looking at things like ITSM and ITIL and that way of managing things. It's just not a part of their understanding of, hey, this is how you go about managing problems, having a practice to do this, and putting the right pieces into place so that we understand what it takes to manage a problem.

Dave: 2:14

You know, when you're describing that, what I find really interesting is, I think if you go and talk to people, everybody everywhere is surrounded by problems. In fact, many of us, especially in the technology space but not just there, probably identify ourselves as fantastic problem solvers. But what I find interesting, as you've raised the whole idea of ITSM, IT service management, and specifically problem management within it, is that it's one of the few places where there's a formal process or profession, if you like, a group that tackles problems. I mean, the other places I'd look at are things like healthcare, where we'll come on to what the differences can be when we're talking about complexity in a minute. It's odd that there are so many of us solving problems day to day and yet there isn't really much in the way of how we formally understand problems and how to solve them.

Peter: 3:12

Yeah, and there's a variety of mechanisms for this. A lot of our conversation before we started recording today was around the topic of root cause, which can be quite incendiary across all sorts of different spaces. There's the question of whether there is such a thing as root cause. The argument is that root cause is not real because you can never truly get to it, at least in a complex system: there are so many possible causes of something that there'll always be something that caused the thing that's the actual problem. So how far do you dig? At what point do you stop digging? There's no true way of getting to that root cause. But I think some of that can be alleviated by at least having some agreement at the outset as to what we mean when we say root cause.

Dave: 4:01

Yeah, I mean this is part of the conversation that we're having. Within the sort of technology spaces that you and I spend a lot of time working in, problems are rife, everybody's always discussing them, and what I find peculiar is there are very few conversations about how to solve them. You've mentioned a couple, like Lean. There are a number of, I mean, many different things: A3, five whys, root cause analysis and things like this.

Peter: 4:30

Fishbone diagrams.

Dave: 4:32

Exactly, Ishikawa and the rest of those things. All of those provide tools to tackle problems, and then outside of that you've got things like ITIL and problem management, but beyond those there's very little brought up in the technology space. Right, we don't bring that. It's almost like, if you throw a ball at somebody with a bat, they're going to swing the bat at it every time, and we're always just swinging bats at balls rather than sitting down and asking: is there some other issue or underlying factor that we have to go and understand?

Peter: 5:04

And the types of problems we're describing there are very much in the service management space. We're talking about, hey, I've got a service which is delivering some kind of value to a customer. That service performs in a certain way, and I need to understand its performance characteristics. One of the aspects that starts to cause problems in organizations is how they've shifted what their IT landscape looks like. I think microservices, for example, are quite a lot to blame here, because the way I always describe this is that the complexity has to exist somewhere. Before, you had this monolith where all of the complexity of the interaction between your business processes existed in a single place, so operationally it was kind of a black box that wasn't that hard to run because you could at least understand where everything was. But when you took that and broke it apart into 1,000 different microservices, you took that complexity of the interactions and spread it out across 1,000 microservices. So now the complexity of managing and identifying where the problems are is operational, and one of the problems I've seen with organizations is that they do this without strengthening their ability to operationally manage that landscape. That includes problem management: understanding and looking for where the root causes, the probable causes, of problems are in that environment.

Dave: 6:28

So let me just say, welcome to the real world. Finally, IT service management is getting out of the idea that the problem is on this box somewhere and we're just going to go find the problem on the box, and into the world of: the problem is out there, how do we go and figure it out? Because, as you were starting to describe problems as being an ITSM type of space, I actually disagree with that. I think a lot of the problems that we're trying to discuss here are business-related problems: customer and market problems, problems of how do I reach our customers and get the right message across so that when they're in a buying position they're coming to us. And I think these are the problems. The sort of ITSM space you describe is quite interesting because that one is moving from a, well, self-contained environment, as you described it, where you know the problem's here somewhere, into one where it is no longer self-contained. There are now thousands of microservices, and there are the ones you own, and then there are services from other people that you're drawing from, and everything else. So all of a sudden it is now distributed, and that's a lot closer to the sort of challenges businesses face in marketing and approaching customers. It's much more, you know, one to many. There are individuals who each have their own relationship with your service, your product and how to use it.

Peter: 7:42

So there's an interesting piece here, because it's the problem with the English language: there are never enough variants of a particular word, so a lot of things get overloaded, of course. When you start to talk about the types of problems we solve on the business side, in the product space, a lot of similar capabilities come to mind, because in this day and age you're nearly always, not always but nearly always, delivering those services via a technical platform in some way, or technology is involved in it in some way. So that technology can provide you with information about how that service is being delivered, and that provides an input into your problem-solving processes. I guess what we're getting at here is that within a well-functioning technology organization, there is an understanding of how to approach problems and how to manage them. In fact, people will have a title like problem manager, and they're the person who looks at that and runs problem exercises. But that doesn't necessarily translate over into the business side, and the tooling that we use is potentially just as appropriate there. In fact, I've run sessions with product organizations using exactly the same tooling to identify where those problems are and to work out: hey, this is where we should focus, this is the business area, this is where the underlying causes are in the system.

Dave: 9:02

So again, if we just come back and peel things back a little bit or pull things, pull out and view things, there are at least a couple, probably more than that. I'll explain in just a second. But we've got problem management in the technology space and I just have to point out, I would argue, that not everybody in technology understands tools and approaches that you're discussing around problem management.

Peter: 9:26

Oh, I think I said that already.

Dave: 9:28

A very small section of technology specialists understand that, because it's sort of hidden away. Just as an example, anybody doing development rarely dabbles in that problem management space.

Peter: 9:41

I think I said it was a well-run organization.

Dave: 9:42

Good.

Peter: 9:44

One that's had to work with me on occasion.

Dave: 9:47

But if I draw back as well, I mean, we also talked about Lean, and Lean has a ton of tools for problem identification and resolution. And from a business context, since I've piled in on the business approach, business schools have a bunch of tools as well. There are tools out there. What I find, and I think this is where we started our conversation, is that a lot of the time problem resolution is too reactive and sort of, you know, shoot from the hip, intuitive, and we don't take the time to pause and think about what different tools, approaches and methodologies we might use to solve a problem.

Peter: 10:23

So, interestingly enough, advanced problem management in an ITSM sense is predictive. You look for problems before they occur. You look for indicators that a problem is going to occur and you do it based off learning you have of how the environment has behaved in the past and use that to inform how it might behave in the future, and also looking for like how are things changing. And so by having the right information you can start to look at. We see that there's potentially a problem that's going to occur here. How can we mitigate that risk? And that's where problem management goes beyond that reactive state and goes into one where you're actually being proactive about looking for where's the next problem going to come from.
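
As an illustration of the predictive approach Peter describes here, the following is a minimal sketch, assuming a single numeric service metric (for example, response time in milliseconds) and a simple rule that compares recent readings with the historical baseline. The function name `early_warning`, the sample numbers and the threshold of three standard deviations are all hypothetical choices for the example, not something discussed in the episode.

```python
from statistics import mean, stdev


def early_warning(history, recent, threshold=3.0):
    """Return True when recent readings drift far from the historical baseline.

    history:   past readings of a metric, assumed to reflect normal behavior
    recent:    the latest readings of the same metric
    threshold: how many standard deviations of drift count as a warning
               (an assumed value; tune it for the service in question)
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is worth a look.
        return mean(recent) != baseline
    drift = (mean(recent) - baseline) / spread
    return abs(drift) >= threshold


# Hypothetical response times in milliseconds.
normal_week = [100, 102, 98, 101, 99, 100, 103, 97]
print(early_warning(normal_week, [101, 99, 100]))   # steady readings
print(early_warning(normal_week, [108, 110, 109]))  # readings creeping upward
```

A rule like this only flags that something has changed; deciding whether it is a problem, and who needs to look at it, is still the job of the people doing problem management.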

Dave: 11:04

I think I would argue that any approach to problem management has a predictive element to it. Lean does, and many of the businesses that we've worked with, when we've looked at their metrics and how they track how things are going, are all looking for some sort of early indicator of where problems might be occurring. So I think that's very true; maybe it's a characteristic of mature problem management systems or approaches. I still kind of come back to, you know, in these conversations we're always trying to get to the pragmatic and actionable: what can we take away and think about? And what's really struck me, in trying to tackle some of these questions and answer them for some of our clients, is that the first starting point, if you like, is to stop thinking that we're amazing at multitasking and that you can just throw problems at a panel or at any specialist you have in your organization, get a quick email coming back saying this is the problem, go fix it, and think we're fixing problems.

Peter: 12:04

We're not.

Dave: 12:04

We're dealing with symptoms and we're curing symptoms. We're not curing the underlying cause.

Peter: 12:10

And I think that realization is that we have to get out of our System 1, reactive, heuristic-driven mindset and take the time to get into a reflective, more thoughtful, inquisitive and curious space to go identify the problem, whether as an individual or as a group. Yes, and the role of the problem manager is to work out who the people in that group need to be. Then they own the problem, bring those people together and create the space to do that exploration: to go look and see what the underlying problems are here, how do we define this, what could we put into place to mitigate this, how do we start to resolve it, what are the underlying pieces, using some of the tools you were describing earlier.

Dave: 12:51

So it's almost like, you know, and I totally agree with you on this, we need to identify the problem and put some boundaries around it: where's the data that shows it's happening, shows it's not happening, whatever it might be. And it's the thoughtful gathering of observations, whether it's data or information or just the people coming together and somehow reinforcing the problem space, so that you understand what the undesirable behavior and outcomes are, what the desirable behavior and outcomes are, and how we might shift from one to the other.

Peter: 13:23

Yes, right, and so that piece and that exploration and holding that space is really that key role, and some people will do a better job of that than others.

Dave: 13:33

Well, it's interesting, because I think we're actually not allowing ourselves to get into that space. If I look at your calendar and my calendar, I'm pretty sure we're bouncing backwards and forwards. Certainly in the organizations I work with, when you look at the people who have the experience and the knowledge, who should be instrumental in solving problems, you often find they're exactly the people who don't have the time. And you kind of need downtime. You need to move from System 1 thinking, that idea of reactive, heuristic-based thinking, to System 2. There's a curve. We've got to move from one state to the other. It's not a calendar entry: now I'm going to be System 2 thinking. It's, I have to reflect, or go for a walk and just clear my mind, or whatever it might be.

Peter: 14:20

Yeah, I agree. We were talking about this a few episodes ago, when we were talking about ONA, organizational network analysis, and thinking about the impact of organizational structures.

Dave: 14:29

And when you find the person that everybody goes to to get anything done, he's also probably your best problem solver, because he knows the most about everything around it, but he's also the busiest person in the organization. So part of the question becomes: how do we recognize that we need a deeper dive, and actually take it? I really like problem management for this, the ITSM world of problem management, because they put the foundations in place around some telemetry about what's working, what's not, and where you begin to see hot spots within your systems, so you can say: the next time we discuss any problem management, we've got to focus over here. That sort of thing is often just not in place in a lot of the less ITSM worlds that we look at.

Peter: 15:18

Yes, I agree. What areas haven't we touched on so far?

Dave: 15:22

Well, I just wanted to bring things back a little bit around complexity and intuition and recognizing how important it is that we don't rely on the quick, intuitive or heuristic responses to certain problems.

Peter: 15:37

In a complex space.

Dave: 15:38

As you said right at the outset, root cause isn't really a root cause, because it's one cause of many. It's probably the most relevant cause right now. But if we go and hold that thing down in a complex system, the system learns, it moves around it. All of a sudden I go, this must be the root cause. I fix it in some way and it just finds a way around, it comes back at us from somewhere else.

Peter: 16:03

I like the term probable cause, which Donna Knapp introduced me to, because I think that's a better way of describing it. In a complex system, we're looking for probable cause. What's the thing that's most likely to be causing this problem? The current cause, right? The current cause, yes. And so then we've got an idea of where to look. I think there's Sidney Dekker's whole piece, and it's not just him, but there's the Safety 1, Safety 2 movement, the whole understanding that when we're looking for problems, we don't look only at the obvious piece. The classic example is the airplanes coming back in World War II: all the airplanes coming back had holes in the wings, so they put more armor on the wings, until somebody pointed out that you actually want to put more armor on the body, because those are the planes that didn't come back. So it's this understanding of where to look, how to understand it and how to reinforce the right parts. And this is where the whole Safety 2 model comes in: if you just look at where the problems are and just try to fix the problems in a complex environment, you're playing whack-a-mole. You're always going to be trying to hit the problem, but you're never going to hit them all, because there's always going to be another one.

Dave: 17:13

Or you're ignoring where the catastrophic problems actually are. You're solving somewhat cosmetic problems at the expense of the critical ones.

Peter: 17:26

Versus working out why the system works and reinforcing that: what makes it go right more often? What can I do to make the system go right more often, which will reduce the number of errors that you have? That is the Safety 2 model of, like, reinforce the positive.

Dave: 17:40

Yeah, and so I love that story about the bullet holes in the plane, because I think it really highlights the problem with intuitive, reactive, heuristic-type thinking: there's a hole here, we need to fix it, which is that immediate, responsive type of thing. In complex systems, intuition and heuristic, reactive behavior don't get you where you think they will. And what I find interesting is that the teams and groups that are really good at working in truly complex environments use their training and their experience to avoid jumping to the conclusions that heuristic, reactive thinking is going to get us. If you think about pilots, for example, or surgeons and healthcare professionals, their training is about how not to jump to a conclusion and instead how to be formulaic in the way they address what they're seeing, so that they're really being honest with themselves about what they're seeing instead of taking the shortcuts.

Peter: 18:45

Yeah, I think that is a very good point too. When teaching or training in the ITSM space, we often teach the same thing. You teach people to have a method by which you break down a problem step by step. This is how you go through it: be formulaic, understand it, look at the logs. I don't know, even in this day and age, the number of times the question is still whether you looked at the logs. At least we have log aggregators which are presenting this back to you now, so it's almost hard not to look at the logs, but still people manage not to.

Dave: 19:21

Well, but I think this is the. There's still an expectation, I think, on, say, delivery teams, where they need pace right. You've got to keep going, you've got to hit some sort of speed or capacity expectation.

Peter: 19:34

You made a commitment.

Dave: 19:35

You've got to go and do it, so there isn't the space for somebody to go: something smells wrong here, we need to do something differently. So we've got to draw out that there's a difference there. In ITSM it's a dedicated part of the organization, a team that's set aside to do these things, and maybe that's the approach. Maybe that's something to consider more broadly than just within the space of ITSM, because the environments are certainly becoming more complex. You've made a compelling argument around microservices and how that's changing the headache of where the complexity is, and I think that is a great metaphor for what's happening right across your market, for customers, for how customers interact with you, or for how you build your product and get it to market. All of those places are increasingly complex and increasingly out of your control.

Peter: 20:29

Now there's a topic for next time about how you go about addressing some of this with smaller batches and fast releases, less cognitive load on the individual teams that are doing the releases, and proper service contracts between those different component pieces, so that they can release at any time safely, and so that if there is a failure you can move back to a working service and understand what that means. It's about how you actually do that, how you code and develop a system that behaves in that manner, and how you do this at scale across all of those pieces to start to reduce the chances of those catastrophic failures. In other words, and this is directly in that space, you're reinforcing the good. You're reinforcing how to do this well so that you're less likely to run into these issues in the first place. Maybe that's a deeper topic for another day. So what would you sum up as your sort of three main takeaways for people?

Dave: 21:23

Yeah, I'm going to go with a couple and let you go with a couple, because, unusually, I think this conversation was a little bit like two different perspectives mashing together. In many cases we're seeing things through a different lens, and it really came out in this conversation. So the first key takeaway would be that problems are everywhere. We all think we're seeing problems all the time, but we're not solving them the way we could. There are tools out there, frameworks and methods and approaches in many spaces, that we're just not leveraging. So I think part of it is understanding that we've got to raise our heads enough to realize that the problems aren't going to go away. They keep coming at us, and some formal way of addressing that, of recognizing it, of instrumenting it so we can see what's going on, and then actually having a formal problem approach, would be immensely powerful in a lot of different spaces. That's the first key takeaway. The second one, as a result of that, is recognizing that somehow, somewhere, there's a need to take the time to understand the problem rather than just deal with it in a reactive way. So taking that time, formally shifting into a sort of System 2, you know, more informed, data-driven, thoughtful place to solve that problem.

Peter: 22:43

Yeah, and I think I agree with both of those points. What I'd add is having that common understanding as to what a problem is for us: creating that common agreement and language so that everybody's on the same page when we start to throw these terms around, so that you're all aligned as to what it is we're looking at. This is what I mean when I say this. That can help a lot when you start to dig into these problems and understand where they might be coming from and how we're going to look to address them. I think having a think about some of the concepts that we brought up in the last 20 or so minutes is quite important as well, to understand the way in which you are looking at and addressing problems in your environment. Are you playing whack-a-mole with these problems, or are you actually looking systemically at the system and starting to understand how to improve it so that you get better outcomes, which will ultimately result in a better resolution of your problems? So are you actually starting to look at things in that manner? With that, as always, thank you, Dave. It was a great conversation. If anybody wants to subscribe, they can hit that subscribe button and they'll get a new episode every so often, hopefully every week, and if you have any feedback, you can reach us at feedback@definitelymaybeagile.com. Thanks, Dave. Until next time, Peter. Thanks. You've been listening to Definitely Maybe Agile, the podcast where your hosts, Peter Maddison and David Sharrock, focus on the art and science of digital, agile and DevOps at scale.
