Kubernetes Events Are Broken (If You Are Building a Developer Portal)

Are you building a developer portal and relying on Kubernetes events to monitor and manage your cluster? You might want to rethink your strategy. In this video, we delve into the pitfalls and challenges of using Kubernetes events in developer portal contexts. Learn about the limitations of Kubernetes events and how they get in the way of day-2 operations in Internal Developer Platforms (IDPs).

#Kubernetes #DeveloperPortal #InternalDeveloperPlatform

Comments

How do you propagate events to parent resources?

DevOpsToolkit

Event propagation would probably give rise to scalability issues. Just imagine an application running on hundreds of nodes, with thousands of pods, and all the events propagating up to, for example, one Deployment object. How many write operations per second would we push to etcd just to keep the event log for this Deployment up to date? I understand the frustration with regards to this, but I believe it's actually more or less correctly implemented. If you want to dig deeper into events, then I believe the way Crossplane does it - by digging - is the correct way. If I remember correctly, EndpointSlices were introduced to (amongst other things) reduce the need to update Service objects on all nodes whenever pods got started or terminated, which made it difficult to scale k8s to thousands of nodes. I believe propagating events would put orders of magnitude more pressure on sync. It's probably possible to create some sort of controller which listens for all events, annotates them with contextual information (cluster, pod, namespace, metadata, whatnot) and pushes them off to a designated system, which could map them to parent resources and be eventually consistent. That way it won't degrade the k8s control plane performance that much.

droogielamer
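
Not from the video, but a minimal sketch of the forwarding-controller idea from the comment above, assuming Python and the official kubernetes client: it watches all events, enriches them with context, and hands them to a placeholder sink (the ship function stands in for Redis, Kafka, a webhook, or whatever system would do the parent-resource mapping).

```python
# Sketch: watch all cluster events, enrich them with context, and hand them
# to an external sink that can map them to parent resources asynchronously.
# Assumes a reachable kubeconfig; ship() is a placeholder for the real sink.
from kubernetes import client, config, watch

config.load_kube_config()
core = client.CoreV1Api()

def ship(enriched: dict) -> None:
    # Placeholder: push to Redis, Kafka, a webhook, or a developer portal API.
    print(enriched)

for item in watch.Watch().stream(core.list_event_for_all_namespaces):
    event = item["object"]
    ship({
        "type": event.type,                      # "Normal" or "Warning"
        "reason": event.reason,
        "message": event.message,
        "namespace": event.metadata.namespace,
        "involved_kind": event.involved_object.kind,
        "involved_name": event.involved_object.name,
        "involved_uid": event.involved_object.uid,
    })
```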

I can see two problems: aggregating events from child resources, and event filtering (by urgency).
The first problem can be solved relatively easily on the client side by adding a "recursive walk" mode (searching for resources with an ownerReference to the requested one, then for resources with references to the ones found, et cetera) for events as well as for statuses. It could perhaps also be done on the server side via some API extension, by creating an "aggregated" resource whose events and statuses are populated from a specific subset of resources (aggregated ClusterRoles do something similar for RBAC rules).
The second problem is event filtering, and it looks harder to solve, as it requires reconsidering the usage of the "type" property of the Event object. For now, Kubernetes supports two standard event types: "Normal" and "Warning". Most syslog implementations, for example, support eight levels of urgency (emergency, alert, critical, error, warning, notice, info, debug). It looks like Kubernetes events need something similar.

sergeyp
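
A rough client-side sketch of the "recursive walk" idea from the comment above, limited to the built-in Deployment → ReplicaSet → Pod chain and assuming Python with the official kubernetes client; the namespace and Deployment name are hypothetical, and a fully generic walk over arbitrary kinds would need the dynamic client.

```python
# Sketch: client-side "recursive walk" for one Deployment - follow
# ownerReferences down to ReplicaSets and Pods, then collect the events of
# every object found. Namespace and Deployment name are hypothetical.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

namespace, name = "default", "my-app"

deployment = apps.read_namespaced_deployment(name, namespace)
replica_sets = [
    rs for rs in apps.list_namespaced_replica_set(namespace).items
    if any(o.uid == deployment.metadata.uid
           for o in (rs.metadata.owner_references or []))
]
rs_uids = {rs.metadata.uid for rs in replica_sets}
pods = [
    p for p in core.list_namespaced_pod(namespace).items
    if any(o.uid in rs_uids for o in (p.metadata.owner_references or []))
]

# One event query per object in the ownership tree.
for obj in [deployment, *replica_sets, *pods]:
    selector = f"involvedObject.uid={obj.metadata.uid}"
    for e in core.list_namespaced_event(namespace, field_selector=selector).items:
        print(type(obj).__name__, obj.metadata.name, e.type, e.reason, e.message)
```

Something along these lines could back the proposed "recursive walk" flag on the client, at the cost of one list call per level of the ownership tree.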

As a k8s "implementor" I'd say I have been bitten by that at some point. I believe k8s has some dark corners where it is kind of hard to understand what's going on when things don't go as expected. The design is not consistent... on one hand you have built-in resources like Deployments, ReplicaSets, Namespaces, etc., and on the other, CRDs. Maybe the answer would be a bare-bones k8s without built-in resources, and then install Deployments, ReplicaSets, and Namespaces as CRDs with their corresponding controllers? We could even have simpler models where there is no need for a ReplicaSet, for example. That would be great. But I believe there must be some side effects, like chicken-and-egg problems or deadlocks... who knows. These are really complex problems to solve, and normally very smart people are behind them.
Finally, after working many years in a combination of big companies and small startups... I have serious doubts that developers can deal with infrastructure and vice versa. It is too much for a single role. Each time I land in a company that decided to go with "everybody does everything" in a "devops" way... it is far from ideal, to say the least. Infrastructure people (SRE/DevOps/sysadmins, you name it) can make a huge impact on a team's efficiency, performance, and maintainability.

juanbreinlinger

My workaround, which I've not built yet because I'm still at the start of that journey, is to also generate all the monitoring for all the things and have the monitoring show that a lower-level thing failed. I suggest you create a Kubernetes enhancement proposal. 🙂

autohmae

Great video Victor, feeling it every day.

tobiaskasser

We have implemented something along these lines for our IDP platform using Argo notifications (with a hard-refresh annotation, although that seems to have performance implications on Argo itself), pushing the notifications to a Redis instance and querying the resource state from there, which allows a custom IDP plugin to consume the events from Redis.

neomotsumi
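
A hedged sketch of the consuming side of the setup described above: a small Python process that reads resource-state notifications from Redis for an IDP plugin to use. The channel name and message format are assumptions, and the producing side (Argo notifications pushing into Redis) is configured separately.

```python
# Sketch of the consuming side: an IDP plugin process reading resource-state
# notifications that something (e.g. Argo notifications) published to Redis.
# The channel name and message format are assumptions, not Argo's defaults.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
pubsub = r.pubsub()
pubsub.subscribe("resource-state")               # hypothetical channel

for message in pubsub.listen():
    if message["type"] != "message":
        continue
    state = json.loads(message["data"])
    # e.g. {"app": "orders", "sync": "Synced", "health": "Degraded"}
    print(state)
```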

Even as someone who understands all that, it's a pain having to traverse all the sub-resources to find out whether there is a problem and where it is.

randomcontrol

@DevOpsToolkit Thank you. This is exactly the problem we face every day; another point for our platform roadmap, which is partly collected from your videos :)

MrEvgheniDev

I don't have a solution. I didn't even know what was bothering me before you spelled it all out.

fanemanelistu

So the propagation should be done by the operators: when an event related to a resource they created happens, they apply their knowledge of their resource to create a new event (if necessary) and write it with the related-resource field set to the resource they are responding to. So the batch operator watches the Jobs that it creates, and when one fails or does something the operator feels the need to report on, it creates the new event. What I think may be the break in the system (I agree it could be better) is that the cronjob-controller doesn't populate the related field with a reference to the Job that completed or failed. The information is in the message field, which makes it human-readable but not machine-readable. Which is kind of funny, because too many times error messages are machine-readable and not human-readable. Basically, I'm saying the events are not broken (all the tools are there), but the controllers are (they are not using the tools).

SeanChitwood
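
To make that concrete, here is a hedged sketch (Python, official kubernetes client) of the kind of Event a controller could emit on the parent CronJob, with the related field pointing at the failed child Job; all object names and the failure reason are made up.

```python
# Sketch: an Event emitted on the parent CronJob whose `related` field points
# at the child Job that failed, keeping the link machine-readable instead of
# burying it in the message. All names and the failure reason are made up.
# (CoreV1Event is named V1Event in older releases of the Python client.)
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

now = datetime.now(timezone.utc)
event = client.CoreV1Event(
    metadata=client.V1ObjectMeta(generate_name="nightly-backup."),
    involved_object=client.V1ObjectReference(
        api_version="batch/v1", kind="CronJob",
        name="nightly-backup", namespace="default"),
    related=client.V1ObjectReference(
        api_version="batch/v1", kind="Job",
        name="nightly-backup-28391", namespace="default"),
    reason="JobFailed",
    message="Child Job nightly-backup-28391 failed (BackoffLimitExceeded)",
    type="Warning",
    count=1,
    first_timestamp=now,
    last_timestamp=now,
    source=client.V1EventSource(component="example-controller"),
)
core.create_namespaced_event("default", event)
```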

I have a different opinion. Maybe the problem is not that the people working with it are not experts; they are just not familiar with it, and that, for me, is the real problem. There are many ways to solve the issue and we can build as many abstractions as we want, but in the end we need to learn how to drive a car before driving it. It's the same situation here. The problems you describe are not for "experts to solve"; they are the bare minimum needed to drive the car on our roads. Experts are busy configuring and managing everything, as well as fixing the really hard problems, not some deployment that a developer got wrong because they don't even know how to shift gears, or won't recognize a sign on the road because they never learned it. Nevertheless, there is work to be done at the logging level, and on that we both agree. Simply not for the same reasons.

pirolas

New subscriber. Unrelated, but are your courses already out? The site you link to doesn't show any.

abhishekpareek

In our (not yet Crossplane-based) operators we just expose CR-related events. As expected, but that's easy, since you execute the operations in the scope of the managed resources. That might be the case with all other tools as well; I can't see an automatic abstraction translating a low-level k8s event into a high-level CR event. But if your platform (i.e. Crossplane) supports Functions and Pipelines, the tool should be able to emit human-written events on the actual CR it is operating on, based on the state information the function has, shouldn't it? BTW, I never understood why so many operators do not expose CR-specific events.

holgerwinkelmann

Maybe "kubectl describe" can benefit from a knob that acts similarly to "crossplane trace"?

Alex-linb

Regarding specific Crossplane troubleshooting, I find Komoplane very useful (by the way, quite surprising that Upbound does not provide a similar tool ;))

meyogi

In my experience it is a bad idea to give developers that much freedom. If they are not good architects, they will make a mess. So the architects and DevOps engineers should define how the software will be delivered, and the developers must implement it.

oleksandrsimonov

Isn't this a problem with "microservice" architecture? Everything is so decoupled that only an observability tool can put everything together and present one unified, meaningful thing. There's so much of this with cloud computing as well. I guess in Argo CD you'd have a perfect visualization of the problems, no? Exactly what you want: a parent created a child, which created another child, which failed?

ffelegal

In theory the end developer should never have to understand why the claim deployment fails. Validation of what is possible through the claim interface should be done before applying it, and if it fails, it is the platform team's job to troubleshoot and add validations to the claim that properly inform the developer of what is wrong when they try to apply the claim with the same issue. Imagine the same with APIs: why should the end user receive all the errors of a chain of API calls that happen beyond the API interface they know? End users should not even have access to the events of the issues beyond their claim. There is also the security angle: exposing exception stacks in API calls is a known vulnerability and should never be done. The same logic can apply to k8s CRD interfaces.

hugolopes

Even though the developer has access to it, I don't think it will be helpful, because the developer can't do anything about it. That is the primary reason we abstract it: we want to hide the low-level detail. Just popping the low-level event up to the developer will not help because, in the end, the developer will ask the tools (developer portal) team to fix it.
Instead, I think it is better for the tools (developer portal) team to create their own monitoring; for example, if some resource provisioning is taking too long, it alerts the tools team so they can investigate right away.

kurniadwin
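
A minimal sketch of the monitoring idea from the comment above, assuming Python, the official kubernetes client, and a hypothetical claim CRD (the group, version, and plural are placeholders): flag resources that are still not Ready a while after creation, so the platform team gets alerted instead of the developer.

```python
# Sketch: flag claims that are still not Ready some time after creation so the
# platform team gets alerted instead of the developer. The group, version, and
# plural below are hypothetical placeholders for a real claim CRD.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

THRESHOLD = timedelta(minutes=15)
claims = custom.list_namespaced_custom_object(
    group="example.org", version="v1alpha1",
    namespace="default", plural="appclaims")

now = datetime.now(timezone.utc)
for item in claims.get("items", []):
    created = datetime.fromisoformat(
        item["metadata"]["creationTimestamp"].replace("Z", "+00:00"))
    conditions = item.get("status", {}).get("conditions", [])
    ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                for c in conditions)
    if not ready and now - created > THRESHOLD:
        # Here the platform team would page itself, not the developer.
        print(f"ALERT: {item['metadata']['name']} not Ready after {now - created}")
```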