The Absurdity of Error Handling: Finding a Purpose for Errors in Safety-Critical SYCL - Erik Tomusk

---

The Absurdity of Error Handling: Finding a Purpose for Errors in Safety-Critical SYCL - Erik Tomusk - CppCon 2023

C++ is hard. Error handling is hard. Safety-critical software is very hard. Combine the three, and you get just one of the exciting problems faced by the SYCL SC working group.

SYCL is one of the most widely supported abstraction layers for programming GPUs and other hardware accelerators using ISO C++. As of March 2023, the Khronos Group has a working group tasked with specifying SYCL SC, a variant of SYCL that is compatible with safety-critical systems. One of the key features of a safety-critical system is that its behavior must be well understood not just in normal operation, but also in the presence of faults. This raises some difficult technical questions, such as, "How do I implement deterministic error handling?" but also some more philosophical ones, like, "What does an error actually mean, and is the error even theoretically actionable?"

Much of the information on C++ error handling in safety-critical contexts focuses on RTTI and the pitfalls of stack unwinding. Although these are important considerations, I will argue that a far greater problem is a lack of agreement on what *safety* even means. This talk will focus on how *safety* in a safety-critical context differs from *safety* from a programming language design perspective. While the talk is inspired by the pain-points of C++ error handling in safety-critical contexts, the conclusions are relevant to C++ software in general. The talk will challenge the audience to rethink the situations that can be considered erroneous and to carefully consider the expected behavior of their software in the presence of errors.

I am a member of the SYCL SC working group, but this talk will contain my own opinions.
---

Erik Tomusk

Erik Tomusk is the Senior Safety Architect at Codeplay Software, where he is working to bring functional safety to the SYCL API. In a previous role, he spent a few years writing C++ and CMake for Codeplay's OpenCL runtime.

Before joining Codeplay, Erik researched CPU architectures at the University of Edinburgh, and even managed to secure a Ph.D.
---


#cppcon #cppprogramming #cpp
Comments

I was surprised that network connection errors were never brought up as an example where error handling makes sense, since that's the most common case I can think of where the error handling/recovery strategy should be fairly clear-cut for most applications, i.e. shut down the client component that tried to connect, boot the user back to their previous state with an error message, and continue with the rest of the application as normal.

Another example would be trying to read a missing file from the filesystem, like if the user of a text editor tries to open a file from the "recent files" list that they had deleted before starting the application. In that case, just show an error message, remove the filename from the list, and continue as normal.

In these examples, and many more like them, I can't imagine a better API design than the traditional form of error reporting where you return something like an exception (or error code or whatever) and let the surrounding context (like the "recent files" list) deal with resetting the state as appropriate, even if just by stack unwinding, running the destructors of everything that was created up until the point of the error in reverse order.

It definitely makes sense to think about the cases where this doesn't apply as clearly, like the out-of-bounds error that was mentioned (which should probably be caught by an assert during testing rather than throwing an exception), but at least in my experience, the main source of errors isn't necessarily applications entering an "inconsistent state" (since bugs like that aren't reported as errors in the first place), and it's usually pretty rare that there's no blank state to return the application to if something goes wrong in well-behaved code, like the main menu of a GUI or the main connection broker loop of a server. And if all else fails, you can almost always at least write an autosave file, log some data and exit, which is often better for the user than just aborting the process immediately. That might not apply as much to some safety critical applications, but in general I'd say that regular old error handling still has a lot of practical value to many developers, way beyond the theoretical abstract machine that language designers are concerned with, and it would be a mistake to replace standard error reporting with something like terminating on the spot as the default strategy just because it seems easier to reason about.

SonicFan
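The "recent files" scenario in the comment above can be sketched roughly as follows. This is only an illustration under assumptions: `read_file` and `open_recent` are hypothetical names, not part of any real editor or of the talk, and the "error message" is simply returned as a string. The low-level function reports the failure with an exception, and the surrounding context decides how to reset its own state.

```cpp
#include <algorithm>
#include <cassert>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file, or throw if it cannot be opened.
std::string read_file(const std::string& path) {
    // An empty path is treated as a programming error (caught in testing),
    // not as a runtime condition to recover from.
    assert(!path.empty());
    std::ifstream in(path);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical "recent files" handler. On failure it drops the stale entry,
// stores a message for the user, and lets the application carry on.
// `path` is taken by value because it may alias an element of `recent`.
bool open_recent(std::vector<std::string>& recent, std::string path,
                 std::string& contents, std::string& message) {
    try {
        contents = read_file(path);
        return true;
    } catch (const std::runtime_error& e) {
        message = e.what();
        recent.erase(std::remove(recent.begin(), recent.end(), path),
                     recent.end());
        return false;
    }
}
```

Calling `open_recent(recent, recent[0], contents, message)` with a path that no longer exists leaves the rest of the list untouched and returns `false`, so the caller can show `message` and continue as normal.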

I need error handling mostly for error analysis and rarely for error recovery. This is very critical for my daily work in medical engineering in order to make sure the application is as bug-free as possible. We are using a fail-fast approach for non-safety-critical development.

StefaNoneD

15:40 I can think of plenty of examples:
- A user is playing a game from an external drive, and that drive is unplugged by the user's pet. Using error handling you can pause the game and inform the user; they can plug the drive back in and the game can recover. Trying to access files from an unplugged drive is definitely an error and definitely recoverable. Similar situations are possible with external GPUs: unplugging the GPU the game is using shouldn't result in a loss of game save file progress.
- A robotic arm's camera gets disconnected accidentally due to shifting load, the system halts and requests user intervention, the camera is reconnected, and the system recovers after user confirmation. Allowing the load to disconnect the camera is definitely an error and definitely recoverable. There are all kinds of variations of this where a safety-critical system needs to stop moving and require user attention.
- A self-flying airplane suffers engine damage due to failing to avoid a bird; it must make an emergency landing and be serviced before it can fly again. Failing to avoid the bird is definitely an error, and being able to make an emergency landing is an important way of handling that error and (eventually) recovering from it. I'm sure there are plenty of other cases where a safety-critical system needs to do something more complex than simply stop moving in order to properly handle and recover from certain kinds of errors.

N....

I've ended up mostly using an "exploding kitty" style handling of situations that might fail. Any function that may fail returns a special optional that will "explode" (abort & write an error message) if you just use its value without checking or "defusing" it first. Makes the code flow very straightforward, and any unexpected situation is localized (you don't end up getting 100+ stack traces).

brynyard
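One way the "exploding" optional described above might look is sketched below. The commenter's actual implementation is not shown, so the class name `Exploding`, its interface, and the `parse_digit` example are all assumptions made for illustration. For simplicity the sketch requires `T` to be default-constructible.

```cpp
#include <cstdio>
#include <cstdlib>
#include <utility>

// Hypothetical "exploding" optional: value() aborts with a message unless
// has_value() was called first and reported success.
template <typename T>
class Exploding {
public:
    Exploding() = default;  // failure state, holds no usable value
    explicit Exploding(T v) : value_(std::move(v)), ok_(true) {}

    // "Defuses" the object: records that the caller looked at the status.
    bool has_value() const {
        checked_ = true;
        return ok_;
    }

    // Aborts if the status was never checked or if the operation failed.
    const T& value() const {
        if (!checked_ || !ok_) {
            std::fputs("Exploding: value used without a successful check\n",
                       stderr);
            std::abort();
        }
        return value_;
    }

private:
    T value_{};
    bool ok_ = false;
    mutable bool checked_ = false;
};

// Hypothetical fallible function: converts one decimal digit character.
Exploding<int> parse_digit(char c) {
    if (c < '0' || c > '9') return Exploding<int>{};
    return Exploding<int>{c - '0'};
}
```

A caller writes `auto d = parse_digit(ch); if (d.has_value()) use(d.value());`, where `use` stands for whatever consumes the digit. Skipping the check turns a silent misuse into an immediate, localized abort at the point of use.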

I was expecting the reveal to be that the "something else" that we need in the language is often just a variant (or an optional or an expected). That would fulfil a lot of the stated requirements: deterministic (at least as far as the types in the variant are); flexible (i.e. not "a monolithic chainsaw"); and able to communicate different types of information differently.

AGeekTragedy
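The variant idea in the comment above could be sketched like this in C++17. The error types, the `load` function, and its hard-coded outcomes are hypothetical stand-ins chosen for illustration. The set of possible outcomes is closed and visible in the return type, and each error carries its own kind of information.

```cpp
#include <string>
#include <variant>

// Hypothetical error types; each carries different information.
struct NotFound { std::string path; };
struct PermissionDenied { std::string path; int required_level; };

// Either the payload or one of a closed set of errors.
using LoadResult = std::variant<std::string, NotFound, PermissionDenied>;

// Stand-in loader with hard-coded outcomes, purely for illustration.
LoadResult load(const std::string& path, int user_level) {
    if (path.empty()) return NotFound{path};
    if (user_level < 2) return PermissionDenied{path, 2};
    return std::string("contents of ") + path;
}

// Common helper for calling std::visit with a set of lambdas.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Every alternative must be handled, otherwise std::visit fails to compile.
std::string describe(const LoadResult& r) {
    return std::visit(overloaded{
        [](const std::string& s) { return "ok: " + s; },
        [](const NotFound& e) { return "not found: '" + e.path + "'"; },
        [](const PermissionDenied& e) {
            return "denied: need level " + std::to_string(e.required_level);
        }
    }, r);
}
```

Because `std::visit` requires a handler for every alternative, adding a new error type to `LoadResult` forces each call site like `describe` to be revisited, which is one sense in which this approach is deterministic.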

Excellent talk, one of the best I've seen. Thank you, Erik!

TheArech

I'm sure it's in your list, but file open errors would be a type that is recoverable. Generally it's due to the user mistyping a file name and you can simply request that they try again. As a compiler author, another would be malformed code fragments. Say for instance you're designing a language that uses semicolons as statement terminators. You might decide that when you can reason about whether what you've seen thus far is a complete statement that you only need to warn the user that they dropped a semicolon. If you can't reason about it, then by all means abort with an error, but that's frequently not the case.

anon_y_mousse

You don't really need full understanding of an error to handle it. Quite often, you only need to know some bounds for what the error could affect. Like, for instance, running a third-party plugin, getting an error from it and deactivating it while having confidence that any possible damage is localized to its inner workings. You don't really care what the error is in this case (other than for debugging), but you have a very clear handling strategy for it regardless.

terragame
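The plugin strategy described above can be sketched as follows. The `Plugin` struct and `run_plugins` function are hypothetical, and the sketch assumes that a plugin reports failure by throwing and that it has not corrupted any state shared with the host, which is exactly the "bounds" the comment relies on.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical plugin record. `run` stands for third-party code that may
// throw anything.
struct Plugin {
    std::string name;
    std::function<void()> run;
    bool active = true;
};

// Runs every active plugin once. A plugin that throws is deactivated without
// inspecting the error. Returns how many plugins were deactivated.
int run_plugins(std::vector<Plugin>& plugins) {
    int deactivated = 0;
    for (auto& p : plugins) {
        if (!p.active) continue;
        try {
            p.run();
        } catch (...) {
            // The cause is irrelevant to the handling strategy; it would
            // only matter for debugging.
            p.active = false;
            ++deactivated;
        }
    }
    return deactivated;
}
```

The `catch (...)` is deliberate here: the host does not need to understand the error, only to trust that its effects stay inside the plugin.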

"Undefined behavior isn't so bad as long as it does the same thing every time" (6:21) -- so that is, never? Even taking the system as a whole and as a black box, how can the functional safety practitioner sign off on the behavior of a system that is eventually led by undefined behavior? The crux is in the name: the behavior of the software is undefined, and therefore cannot be relied upon. It can't "[do] the same thing every time"; there are no guarantees or documentation, and it might change from one compiler to the next, or from one (abstract|concrete) machine state to the next.

I am not in safety-critical software, so maybe I'm missing something here, but I really don't understand the point of view. I also haven't watched the rest of the talk yet, so that may be addressed in a further section or in the questions.

SolarLiner

Maybe it wasn't said too explicitly, but something I understood is that anything that could be reasonably handled by std::expected is 'not really an error', which is fair, and I think something the C++ community is starting to understand. Errors like these -- file not found, connection lost and even resource exhaustion -- should be handled gracefully by any non-trivial application, and exceptions make grace difficult.

MatthewWalker

@15:55
In my current project, I work on a data transformation pipeline that processes incoming client datasets and transforms them into a canonical internal format. This internal format tries to establish some guarantees on a dataset, e.g. "the sum over all rows of table X needs to be zero". Just now, a new requirement popped up, where business wants to have data in that canonical format, but they don't care about some of the guarantees.
IMHO this is exactly the gray area between "not really an error" and "not recoverable". If you encounter an error, it depends on what you plan to do with your application state: Which are the guarantees you will rely on when continuing execution? Does the encountered error indicate a violation of those guarantees?

In other words, if you can guarantee that an encountered error will only compromise an isolated part of your application state and the error was caused by violations of guarantees you have given, then the error is both "a real error" and "recoverable".

kiffeeify

Nearly all the exception handling we have is in the database, in the form of either no data being found or too much data being found. The former is just a normal occurrence, as we simply try to get data where nothing is there (instead of first checking and then getting the data); the latter is an actual error and normally has no direct recovery (whatever caused that problem has to handle it, or stop doing anything).
There are a few exceptions in the normal code, but many of them deal with handling outside interactions like user input, and there are only very few real error scenarios there. Most are again just a "you entered the wrong thing - try again", and in terms of the code that is just normal control flow for an invalid input and not an error.

The last error I saw and had to fix was still the result of invalid user inputs, but at that time these were separate processes and somebody just provided bad but potentially valid data for multiple different systems: a couple of numbers that at one point get multiplied/divided and can lead to an overflow, so an actual software error. Sadly, preventing the error would have been a lot harder than just adding some code to check for the error (but even that was hard, as all checks for UB MUST also prevent any UB from happening at all or the checks will be removed by compilers).

ABaumstumpf
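The last point in the comment above, that an overflow check must itself avoid undefined behavior, can be illustrated with a minimal sketch. The function name `checked_mul` is an assumption; the technique shown is widening to a 64-bit type before multiplying, which is one of several possible approaches and not necessarily the one the commenter used.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Returns a * b, or std::nullopt if the product does not fit in int32_t.
// The multiplication is done in int64_t, where the product of two int32_t
// values always fits, so no signed overflow (UB) ever occurs.
std::optional<std::int32_t> checked_mul(std::int32_t a, std::int32_t b) {
    constexpr std::int64_t max = std::numeric_limits<std::int32_t>::max();
    constexpr std::int64_t min = std::numeric_limits<std::int32_t>::min();
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    if (wide > max || wide < min) {
        return std::nullopt;
    }
    return static_cast<std::int32_t>(wide);
}

// By contrast, a check written after the fact, such as
//     int32_t p = a * b; if (a != 0 && p / a != b) { /* overflow */ }
// has already performed the overflowing multiplication, so a compiler is
// allowed to assume the condition is never true and remove the check.
```

For example, `checked_mul(65536, 65536)` yields an empty optional, while `checked_mul(1000, 1000)` holds `1000000`.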

I don't understand how SYCL could ever have an SC variant, considering it's not directly implemented on anything but Intel; this is, in fact, one of the biggest downsides of SYCL. It's super weird that we see a SYCL safety-critical talk here.

snbvreal

Bjarne Stroustrup - Morgan Stanley

Ah yes. That is what Bjarne Stroustrup is most well known for. Nothing to do with being the creator of C++.

vrclckd-zzpv

It’s a bold claim that the C++ designers want to remove undefined behavior. It seems to me that they’ve proliferated it.

sjswitzer

I have written, designed and used both mission-critical and safety-critical systems and software… the worst development combination is lots of abstraction, object-oriented whatever, C++ and software engineers who don't understand the system hardware and system mission or use…

FredFred-wyjw