The Performance Tour – Season One
On January 27, 2020, I left my home in Florida to go around the United States and talk about performance engineering. It was warm, so I am not sure why I chose January to go darting off like that. I wasn’t selling anything. I didn’t want anything from anyone. I just wanted to raise awareness about software performance within IT. For the general masses, performance engineering is not the most exciting topic, so I wanted to make it fun and interesting, even to people who would not normally care about it. I decided to make an online show and put some silliness into the videos to keep people’s attention.
Talk To Me
As I went from city to city and talked to people, I began to notice patterns in the conversation. People were experiencing the same problems across multiple industries, in various company sizes, and across a wide variety of roles and responsibilities. I brought up my concerns at meetup groups and events that I hosted, and noticed that people were actually interested in talking to me about this stuff. Many had questions about performance, because they did not generally understand it.
My observations below are from notes I made while driving 5,500 miles, holding events in 12 major US cities, covering 2 conferences in Vegas over multiple weeks, talking to countless people at countless companies, and spending 35 days on the road away from home. This includes dressing like Elvis in Las Vegas.
What I Learned
There is a huge vacuum when it comes to education about performance engineering. I can’t find a school, a class, or many chapters in IT-related books that focus on developing performant software. I know there have to be some, but they are few and far between. A lack of education means people learn from trial and error. This is not the fastest way, although some individuals may argue it is the best way. There is generally more information published concerning poor performance at the system and network levels; however, performance still isn’t always a core focus when implementing these systems. We have a lot of work to do in this area. Who is going to pass down the knowledge to the next generation of developers?
Big and small companies are struggling to shift left with performance, that is, to introduce it earlier in the lifecycle and catch issues sooner rather than later. This is actually the fallout of the first issue – lack of education. There has been such an emphasis on functionality and testing functionality within a CI/CD pipeline that when performance is brought up, it is like a big monkey wrench just got thrown into the machine to mess things up. It can be as simple as wrapping timers around existing tests, but for some reason this is a hard concept for some companies to comprehend.
The Site Reliability Engineer (SRE) job description is being “redefined” by some companies. In many companies that claim to be DevOps, it is more like they have a team of level two support personnel that just make more money and deal with more stress on the front lines. A majority of their time is spent in triage, and it is not 50% in development and 50% in operations as originally designed by Google. Using APM tools like Dynatrace, SREs are stuck in war room situations trying to fix performance and outage issues that have slipped through the cracks from development and testing. See the first two items above (the lack of education and the struggle to shift left) as to why this is the case. This is putting them under more pressure and making the job more stressful than it needs to be.
Agile development was intended to break down releases into smaller pieces so teams would have time to address functionality, performance, security, and other factors AS IT IS BEING DEVELOPED. This is not happening. Companies are still not taking the time to do this. The focus seems to be on getting the product features out (the functionality). Customers are promised features that are four releases out and a date is set for that release. Then these companies tell the world they are Agile. Many of them cannot react to customer feedback and change course midstream, even though this is the central idea behind Agile.
Because individual scrum teams are fully responsible for the software from development to deployment, there is an expectation that everyone (or at least SOMEONE) within each team understands performance and can ensure that performance is validated. This is NOT the case many times, and this is why we still see major performance-related issues within Agile development. The same can be said for the SDET and SRE roles. There is an assumption that these roles understand performance, when many times they only understand functionality and technically how things should work. When a performance issue is due to something outside their own experience, they are stumped. It goes back to basic education. At this point, you should start recognizing a theme.
The automated business processes used in a performance test do not match what the end user is doing, because developers and testers are not looking at the application from the end user’s perspective. There is rarely any end user monitoring or real user monitoring data being provided at all, and when it is available, it does not get to the people who need to see it.
While many companies are claiming that they ARE DevOps or moving towards DevOps, they struggle to actually define DevOps and there are very different definitions depending on who you talk to. Everyone is struggling with the pace of change and all of the options out there.
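To make the earlier point about wrapping timers around existing tests a little more concrete, here is a minimal sketch in Python. The decorator name timed, the half-second budget, and the lookup_order function are all hypothetical, made up for illustration. The idea is only that a functional test already in the pipeline can also fail when it runs longer than an agreed time budget.

```python
import functools
import time


def timed(budget_seconds):
    """Wrap an existing test so it also fails when it exceeds a time budget."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = test_func(*args, **kwargs)  # functional assertions run as before
            elapsed = time.perf_counter() - start
            assert elapsed <= budget_seconds, (
                f"{test_func.__name__} took {elapsed:.3f}s, "
                f"budget is {budget_seconds:.3f}s"
            )
            return result
        return wrapper
    return decorator


# Hypothetical stand-in for real application code under test.
def lookup_order(order_id):
    return {"id": order_id, "status": "shipped"}


# An existing functional test, unchanged except for the added decorator.
@timed(budget_seconds=0.5)
def test_lookup_order():
    order = lookup_order(42)
    assert order["status"] == "shipped"
```

A single wall-clock measurement like this is noisy on shared build agents, so a budget of this kind is probably best treated as a coarse early warning rather than a replacement for a proper load test.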
The SEAL is the Best DEAL
In general, I think education distributed internally to teams inside of a company is the key. It should be shared early and often. My idea of the Software Engineer Across the Lifecycle (SEAL) is one way to approach this. The SEAL can act as an internal consultant to the scrum teams, the testing teams, the teams that support operations (DevOps), or anyone who needs guidance on how to optimize the software in their part of the lifecycle. I don’t think you need a lot of them, but there needs to be at least one who can cover the entire lifecycle with this specific focus.
Are We Going Backwards?
With all of the online blogs and vendors who focus on either front end web performance or back end systems performance, it seems crazy to me that we may have actually regressed in terms of performance engineering in IT. However, this is the conclusion I walked away with after the performance tour. I know how negative that sounds. I hope I am wrong. It’s amazing that we are still seeing outages every Black Friday and Cyber Monday from Fortune 500 companies in the retail sector. We continue to hear the words, “I’m sorry. Our systems are slow today. Can you hold on for a moment?” This should not be the conversation in 2020, but here we are.
Pay It Forward
I feel a responsibility to share what I have learned, so I’ve tried to make teaching others part of my career through blogs, videos, and training classes. I’m only one person, so there need to be more people willing to develop the heart of a teacher, or we are going to keep making the same mistakes over and over. What else can I do – what else can WE do – to make things better?