Improving Observability and Testing In Production

Observability, Monitoring, Testing in Production, SRE - what do these ideas mean and how are they related to one another? Testing in Production is an important aspect of doing a good job of software development. After all, there are some things that we can only learn from the working of our systems in production. This doesn’t mean though that we should only test in production, or that we should be careless with the software that we release. So what is Testing in Production and how does it relate to our overall approach?

In this episode, Dave Farley, author of "Continuous Delivery" and “Modern Software Engineering”, explores what Testing in Production is, how to do it, and how it relates to working more experimentally.

-------------------------------------------------------------------------------------

Also from Dave:

🎓 CD TRAINING COURSES
If you want to learn Continuous Delivery and DevOps skills, check out Dave Farley's courses

📧 Get a FREE guide "How to Organise Software Teams" by Dave Farley when you join our CD MAIL LIST 📧

_____________________________________________________

📖 Further Reading:

_____________________________________________________

📚 BOOKS:

📖 "Continuous Delivery Pipelines" by Dave Farley

📖 Dave’s NEW BOOK "Modern Software Engineering" is available here

NOTE: If you click on one of the Amazon Affiliate links and buy the book, Continuous Delivery Ltd. will get a small fee for the recommendation with NO increase in cost to you.

-------------------------------------------------------------------------------------

CHANNEL SPONSORS:

Comments:

Another excellent talk. Thank you. Two observations on very different aspects of it. The first is that in many cases the reliability of the system may be dependent on factors way outside its control. I have seen a small but essential online system, which handled enormous transaction rates at 8.30 each morning, suddenly jam and have to be cancelled and restarted. This happened for three consecutive days till the cause was detected. A large batch application with huge data sorts always ran in the small hours. To make it go faster, the overall system policies were adjusted to allow it to take all the main and expanded storage it wanted. Which it did. Which is why the small online system had been pushed out to disk page data sets without anyone foreseeing this consequence. Because the shortening of the big batch suite had been reported as a triumph to senior management, I was asked to write the nastiest program I have ever coded. It simply went through the whole address space of the online system, looking at the first byte of each 4K block of memory - not to see what it said, but simply to get it back into main storage - by running this after the batch and before 8.30 every day. Yuk. I'd call this "the side-swipe effect".

My other point - at the opposite end of the spectrum - is simply this. How often do we see any sort of invitation to give feedback to a commercial website on the usability of their site? The Health Service Covid test reporting site did so, and was a model of clarity as a result. I am now registered visually impaired, and see countless websites which are badly designed for people like me. In some cases I have stopped shopping with them and gone to another supplier as the only response available to me. My book Good Code is Not Enough has a whole chapter dedicated to web page design for those losing their sight, which a normally sighted friend who edited and proof-read it said applied as much to him as it did to me. None of these sites ever solicits feedback. Have you tried this in the cases you describe where you make continuous small changes to systems?

johnwade

Great video as always! :D

I really like the terms 'uninformed guesses' vs 'informed guesses'. For me these terms explain a lot about what kind of "game" we are in here.

gewusst-vim

To look at application-level monitoring and analysis (the MA of MAPE-K) as some sort of test might be a good idea, to better communicate that this is not just nice to have but can be genuinely valuable. Besides that: I have been using monitoring data to do performance evaluation (which usually also includes machine parameters), like throughput and response time, but also other qualities, like resilience and robustness. In addition, the information can be used to recover the runtime architecture and compare it to design-time models. Based on entry-level events, you can test which user behaviors are more common and favored by the system. This might no longer fit the testing metaphor, but it provides valuable information to steer development. Maybe people have to perform many operations to get to a result, and so many of them do not finish their process. Maybe there are too many delays. And such issues can appear out of the blue when load changes or behaviors change. (Great, inspiring video as always.)
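A rough sketch of the kind of analysis described above: deriving response-time percentiles and a funnel completion rate from monitoring events. The event format, step names, and numbers are hypothetical, not taken from any real system.

```python
from statistics import quantiles

# Hypothetical telemetry: one record per observed user action, with its latency.
events = [
    {"user": "u1", "step": "search",      "response_ms": 120},
    {"user": "u1", "step": "add_to_cart", "response_ms": 95},
    {"user": "u1", "step": "checkout",    "response_ms": 340},
    {"user": "u2", "step": "search",      "response_ms": 180},
    {"user": "u2", "step": "add_to_cart", "response_ms": 110},
    {"user": "u3", "step": "search",      "response_ms": 2500},  # a slow outlier
]

# Response-time percentiles across all observed actions.
latencies = sorted(e["response_ms"] for e in events)
pct = quantiles(latencies, n=100, method="inclusive")
print(f"p50={pct[49]:.0f}ms  p95={pct[94]:.0f}ms")

# Funnel completion: of the users who started the flow, how many reached the end?
started = {e["user"] for e in events if e["step"] == "search"}
completed = {e["user"] for e in events if e["step"] == "checkout"}
print(f"completion rate: {len(completed & started) / len(started):.0%}")
```

The same event stream, grouped per deployment or per cohort, is what lets a change in behaviour show up as a change in these numbers.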

reinerjung

If we only followed the data from telemetry we would increasingly narrow what we think people love. Nonetheless, if we want to try something wild and really out there, testing it as soon as we can in production is a great way to find out whether it's good or not. Another wonderful bit of content, cheers! 👍🏼

sibzilla

Love you, thanks for fixing your audio. We can hear you now!
Do you have a video on API load and performance testing? Or a video on preparing your integrations for prod load, considering all factors like dependency contention?
Looking to figure out how to simulate the contention we see in prod when everything is going on, whereas in UAT the other integrated apps are not running under any load… I may have answered my own question typing this far, lol

SuperSeoExperts

Man alive Dave, you're sounding more like a scientist than an engineer here.

Mozartenhimer

The biggest challenge is actually getting full-stack observability. We have been trying to tackle this by making the integration simple: we released an open-source project for the auto-instrumentation of Go back in May. We just published another open-source project, Odigos, an observability control plane. It allows you to get logs, metrics and traces from your applications within minutes. It's agnostic, it's free, and it requires no code changes. We love observability and would love to get feedback from all.

arirecht

Testing end-to-end in a dedicated environment is a good way to prevent changes from causing damage to real users and production data, but I often see problems where it limits the ability of developers to fix bugs in malformed organizations which are unavoidably doomed to produce malformed systems. A bug is reported, but developers are unable to find data that reproduces it in the testing environment, and the teams that could help say "don't care, not our bug, not our problem". Testing like this requires a well-formed organization committed to solving problems, even if the team that can help solve a problem is not the one that caused it.

My favourite flavour of limited "testing in production" is:
* Common data stores; data can be marked as test data so that it doesn't impact real users (see the sketch after this list).
* Changes to server-side systems are tested in a separate environment, generating and modifying test data only, unless the data doesn't impact other users anyway (for example, personal data on test accounts that is not visible to other users).
* Test builds of UIs run against production systems for acceptance tests but generate and modify test data only. Optionally it should be possible to run them against the testing environment as well.
This too requires discipline and commitment from all teams, to make all critical data and systems capable of supporting this partial "testing in production".
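A minimal sketch of the test-data marking described in the first bullet above, assuming a hypothetical `is_test` field and made-up record types:

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    user_id: str
    total: float
    is_test: bool = False   # set by test traffic, e.g. a test build of the UI

def visible_orders(orders, include_test_data=False):
    """Anything serving real users filters test records out by default."""
    return [o for o in orders if include_test_data or not o.is_test]

orders = [
    Order("o-1", "alice", 42.0),
    Order("o-2", "qa-bot", 1.0, is_test=True),
]

assert [o.order_id for o in visible_orders(orders)] == ["o-1"]    # real users never see test data
assert len(visible_orders(orders, include_test_data=True)) == 2   # test runs can see everything
```

Filtering defaults to excluding test records, so a query that forgets the flag hides test data from real users rather than leaking it to them.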

Ultimately, both ways have serious consequences if not used correctly:
Bad support for testing in a separate environment means some problems are left unresolved because no one is able to reproduce them.
Bad support for "testing in production" could break things for real users.

If both are done correctly, I think my fake version of "testing in production" yields better long-term results, eliminating the strain of maintaining test data stores and the risk of omission that comes with them, while also avoiding damage to real users and real production data. Also, testing with the real production infrastructure gives us better confidence that everything will work fine once it's exposed to real users.

elfnecromancer

Awesome! Great talk. Thank you so much for your work!

arakovskiy

Was that a recommendation to have environments other than production for acceptance testing? I thought people would shy away from this because of the complexity involved in maintaining these environments. We just dropped ours, because we can easily replace them with mocks, and I still think that's a good idea. All the other things mentioned we will definitely do.
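For illustration only, a tiny sketch of the "replace the environment with mocks" idea mentioned above; the `checkout` function and the gateway it calls are hypothetical stand-ins, not anything from the video:

```python
from unittest.mock import Mock

def checkout(gateway, amount):
    """Code under test: its only external dependency is the payment gateway."""
    result = gateway.charge(amount)
    return "confirmed" if result["status"] == "ok" else "failed"

def test_checkout_confirms_when_gateway_accepts():
    gateway = Mock()                                 # stands in for the external system
    gateway.charge.return_value = {"status": "ok"}   # the behaviour we assume it has
    assert checkout(gateway, 42) == "confirmed"
    gateway.charge.assert_called_once_with(42)       # and the call we expect to make

test_checkout_confirms_when_gateway_accepts()
```

As the next comment argues, the assumptions baked into mocks like these are exactly what still needs checking against production.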

florianfanderl

I was surprised none of this was mentioned in the Modern Software Engineering book, because I think observability and experiments in the real world are a very important part of working on modern systems. Especially in enterprises, people tend to focus on all the people with desires and opinions they see every day - the architects, managers, UX designers, etc. - and forget about the people whose desires and opinions actually matter: the users.
However, I don't understand (or disagree with) some things in this video.
First of all, A/B testing is not that hard, and the part about how your randomly selected cohorts might be skewed is just wrong: if you have a large enough group and you actually assign randomly, there is an extremely small chance that the cohorts are fundamentally different, and that chance is easily calculated with basic statistics software.
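A small sketch of that point, using only the standard library: deterministic assignment of users to cohorts, plus the two-proportion test that "basic statistics software" would run on the observed conversion counts. The counts below are made up.

```python
import hashlib
from math import erf, sqrt

def cohort(user_id: str) -> str:
    """Stable 50/50 split: the same user always lands in the same cohort."""
    return "A" if hashlib.sha256(user_id.encode()).digest()[0] % 2 == 0 else "B"

def two_proportion_p_value(conversions_a, n_a, conversions_b, n_b):
    """Two-sided p-value for the observed difference between two conversion rates."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * P(Z > |z|)

print(cohort("user-123"))                                    # "A" or "B", always the same answer
print(two_proportion_p_value(conversions_a=480, n_a=10_000,
                             conversions_b=540, n_b=10_000))  # ~0.05 with these made-up numbers
```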
Second, I always thought the delivery pipeline of my software should tell me if *my software* is releasable. If I do not isolate my software from unreliable external systems, my pipeline becomes flaky: it will tell me all the time that my software is not releasable, but running the pipeline again will show that the problem was actually outside of my system. Isolating means making assumptions about how those external systems behave in reality, and the first place to actually test those assumptions is production. To manage the risk I would perform a canary deployment, and slowly ramp up until I have enough confidence that my assumptions are not completely unrealistic. The problem with an acceptance environment is that no matter how hard you try to make it production-like in terms of infrastructure and software, it is very difficult (and expensive) to create realistic usage, not just of your own system but also of the coupled external systems, and this can and will influence the behaviour your system cares about!
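And a rough sketch of the canary ramp-up the comment describes; `set_traffic_split`, `error_rate`, and `rollback` are placeholders for whatever your router and telemetry actually expose, not a real API:

```python
import time

RAMP_STEPS = [1, 5, 25, 50, 100]   # percent of traffic sent to the new version
ERROR_BUDGET = 0.01                # abort if the canary's error rate exceeds 1%
SOAK_SECONDS = 600                 # how long to watch each step before ramping further

def deploy_canary(set_traffic_split, error_rate, rollback):
    """Ramp up gradually; real traffic is what tests our assumptions about external systems."""
    for percent in RAMP_STEPS:
        set_traffic_split(canary_percent=percent)
        time.sleep(SOAK_SECONDS)          # let real usage exercise the new version
        if error_rate() > ERROR_BUDGET:
            rollback()                    # our assumptions were wrong; back out
            return False
    return True                           # full rollout: production agreed with us
```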

utilist

I think you're using the term data-driven when you're really talking about data-informed.

kobac