Nerd Journey: Career Advice for the Technology Professional
John White | Nick Korte
A Love for Troubleshooting: Skill Development through Documentation with David Klee (1/2)
Can writing documentation beef up your troubleshooting skills? This week in episode 315 David Klee returns to explore the connection between effective troubleshooting and documentation. We’ll discuss appropriate levels of detail for documentation and explore it as a skill building exercise. Listen closely to hear why good documentation can make all the difference in a regulatory compliance audit as well as in emergency situations. Also, we’ll talk through some interview questions you can ask to determine the value of good documentation within an organization.

Original Recording Date: 01-20-2025

Topics – An Exploration of Troubleshooting, Pre-requisites for Effective Troubleshooting, What Should Be Documented, Forms of Documentation and Emergency Preparedness, Interview Questions and Employer Perceptions

2:32 – An Exploration of Troubleshooting

David Klee is a returning guest and the owner and chief architect at Heraflux Technologies. If you missed the previous discussions with David, you can find them below:

Episode 119 – Tinkering into Specialty with David Klee (1/2)
Episode 120 – A Time to Build with David Klee (2/2)
Episode 309 – The Consulting Life: Managing Travel and Becoming a Better Communicator with David Klee (1/2)
Episode 310 – Finding a Better Way: Contracting, Independence, and a Consultant’s Reputation with David Klee (2/2)

David approached us about an idea for another topic to explore. After many years in the industry (11 of them as a business owner), David began to think about patterns he has seen and what has made him and many others successful.

“What has actually made this work? And it’s the art and the science and the luck of troubleshooting…. What makes some of the best technologists arguably some of the best troubleshooters in the world, and then how do you apply that to life? …There’s a lot more than just knowing a technical feature or two or being able to Google faster than the person next to you. I have a lot of fun with this topic.” – David Klee, framing our discussion

Philosophically, David believes troubleshooting is as much an art as it is a science. There is a foundation one needs to be a good troubleshooter, and David tells us this stems from our childhood curiosity about why things do what they do. David tells the story of learning to use a screwdriver at age 5, taking the family’s VCR apart, and successfully putting it back together again (which may or may not have landed him in trouble).

Over time, some people have a constant need to know why something is what it is and why it works the way it does. David sees this in some people but not all people.

“When you look at those that are truly great at an industry…they want to know why, and they don’t stop until they know why.” – David Klee

David mentions the Dunning-Kruger Effect, which speaks to breaking up the things we know and don’t know into 4 quadrants:

Unknown unknowns are the things that get people into trouble because they think they know these but do not.
Known unknowns – David considers this area enlightenment in IT and a way to know where the boundaries are.
“Unknown knowns are the things that I consider you a master at a technology or a topic of anything because what you know becomes so integrated into your frame of reference and your being that you don’t know that you know it. You just do it. And, when you hit that point of a mastery of something…you may not be able to explain how you do it or you may not be able to tell somebody the steps to do it. But it’s just muscle memory. It’s just go. You do, and it works…. The truly good educators are the ones that can actually take what they know and dial it to the level of the people that they’re talking to. Some experts cannot do that, but they are so good at what they do. Others can. It’s fascinating…. It’s the unknown unknowns that gets people into trouble. It’s the unknown knowns that really separates people.” – David Klee
We did not mention known knowns, but it would be the final quadrant.

John says it’s the idea that you can master a skill or process but not have mastery of teaching or explaining that skill or process. Doing and teaching could overlap, but they do not always overlap.

David comes from a family of teachers, actually. His parents were traveling road musicians who fell into education, but they have always continued some sort of musical pursuit on their own.

“It’s neat…to be able to explain to somebody how something works and why. I love it.” – David Klee

When Nick thinks about troubleshooting, he thinks about both high-pressure and low-pressure situations when we’re trying to figure out why something is not doing what it’s supposed to do. David says we’re trying to determine why there is an unexpected outcome and what we need to do to get to the expected outcome.

“It’s a formal methodology or informal methodology for understanding why something does not have an expected outcome and working through the process that is an iterative process – either elimination or identification. And you end up with essentially identification, review, remediate, rinse and repeat until you get the desired outcome. That’s about as formal of a definition as I can give you.” – David Klee

John thinks this may disguise the art in the troubleshooting process of knowing what issues may be more likely than others. People might discover something is not working and change 10 things. If something then starts working again, how do we know which change (or combination of changes) actually resolved the problem? We are far less likely to undo the changes once something begins working again.

John mentions being good at troubleshooting in areas in which he has lost the fear of something going wrong.
While John feels comfortable troubleshooting computer systems and software, he’s not good at troubleshooting car problems due to limited knowledge and a feeling of high stakes. Someone with a better knowledge of cars may perceive the stakes to be far lower when making a recommendation for fixing problems.

David says it depends on what you are troubleshooting. There is a risk qualification element that needs to be considered with the process used in troubleshooting. David shares the example of troubleshooting a payment processing system with a group of folks who didn’t know what they didn’t know. The process they had developed for troubleshooting ran the risk of preventing payment processing for the entire company. David describes determining the need to speak to the group of people who built the system in order to troubleshoot the system safely.

11:43 – Pre-requisites for Effective Troubleshooting

Nick mentions we highlighted a pre-requisite for troubleshooting being knowledge of the systems we’re troubleshooting. What is the correlation between how good a troubleshooter one can be and how well one knows the systems involved in troubleshooting?

If we know our systems well, we know what is and is not possible within a given set of constraints. One example is knowing the ramifications of changing different database settings.

“You know what’s going to happen because you know the platform and you know your environment, and you know how they come together. If you know this stuff you can resolve these issues a whole lot quicker.” – David Klee

Knowledge of the platform and environment would mean we know the systems which interact with the one we are troubleshooting, the impact of the outage, the right person to call for help, and the questions we need to ask them.

It can be much harder when you inherit a system someone else built and there is no documentation on why it was set up the way it was or how other systems communicate with it. Likely you also don’t know what types of changes have been made to it over time (whether they were band-aid type fixes or some other kind).

John mentions we’re highlighting domain knowledge of a system and its specific failure modes combined with what has happened in the past to diagnose and fix those things. A resilient system should have these things documented.

David says the flip side of this is being someone coming in from the outside who has never seen this machine before. Think about the scenario in which you are asked to troubleshoot a system which people with all the domain knowledge can’t fix. As a consultant he runs into this pretty regularly. It can be challenging, but David says it keeps him sharp.

Someone troubleshooting a system like this has to keep track of what’s already been done, what should have been done, and what questions need to be asked to extract domain knowledge from others when information hasn’t been documented. One must also know the platform well enough to successfully understand a system’s current state (which might be different than what people tell you).

“Perception of a system’s state might be entirely different than the reality of the system’s state. That’s a hard, hard art to master right there.” – David Klee

John says someone who doesn’t know a system may have a better chance of doing effective diagnosis. The person who knows a system well is going to make assumptions that someone who doesn’t know the system would likely not make (e.g., that the database is running great).

David stresses the importance of quantifying performance when we’re troubleshooting. Preconceived notions about an environment might lead to subjective explanations. When he walks into an environment to troubleshoot a problem, David wants to look at the raw data. This data can help provide the true nature of a system’s state and perhaps prevent finger pointing between teams.

“Show me the data. Show me why you think this. And most of the time, people cannot produce that data.” – David Klee

Even trend information on how past issues of a specific kind were resolved counts as data and may provi
