str/int: Controversial breaking change added to Python

New breaking changes were just introduced in Python.

The default behavior of int-to-string and string-to-int conversions has changed: there is now a limit on the number of digits allowed, and the default limit is 4300. The old behavior can be restored through a new function in the sys module, but why was this change introduced in the first place? In this video, we go over the change, the underlying algorithmic and security reasons behind it, and the viewpoints of community members both for and against it.
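As a rough illustration of the behavior described above, here is a minimal sketch. It assumes an interpreter that ships the change (it checks for `sys.set_int_max_str_digits` first), and the helper name `try_parse` is made up for this example.

```python
import sys

# sys.set_int_max_str_digits only exists on interpreters that include the change.
HAS_LIMIT = hasattr(sys, "set_int_max_str_digits")


def try_parse(digits: str) -> str:
    """Return 'ok' if int(digits) succeeds, else the ValueError message."""
    try:
        int(digits)
        return "ok"
    except ValueError as exc:
        return f"ValueError: {exc}"


if HAS_LIMIT:
    previous = sys.get_int_max_str_digits()
    sys.set_int_max_str_digits(4300)   # the default limit described above
    print(try_parse("9" * 4300))       # exactly at the limit: ok
    print(try_parse("9" * 4301))       # one digit over: ValueError
    sys.set_int_max_str_digits(0)      # 0 disables the limit
    print(try_parse("9" * 4301))       # ok again
    sys.set_int_max_str_digits(previous)  # restore whatever was configured
```

The same limit applies in the other direction, so `str()` on an int with more digits than the limit raises `ValueError` as well.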

Comments
Author

As someone who spends most of my time asking “how can users break my app?” I already do a lot of input validation and sanitization before the values ever hit the business logic so this change won’t really affect me. That said, I also don’t do anything that would require printing integers larger than 100 digits, let alone 4300. Within the context of my projects, I’m okay with the change but knowing that Python is used extensively for data analysis, I see this as more of a “quick and dirty” bandaid than a proper fix. I’d be more in favor of using a more efficient algorithm instead.

KasimAhmic
Author

In Linux kernel dev there is a saying: "WE DO NOT BREAK USERSPACE". That means that no matter how much better your change is for the kernel, if it breaks a userspace program that relied on the old behavior, then it can't be added. There once was a patch that fixed some weird behavior where in some specific circumstance the kernel would return -EINVAL, yes, a negative error code. The patch changed the value to something else, and this patch broke pulseaudio, which assumed that the kernel would return -EINVAL in some specific circumstance, and therefore was written in a way that when that suddenly didn't happen, the whole daemon would crash. One kernel dev argued this is a bug in pulseaudio, and it is, but Torvalds insisted that none of that matters because, and I quote, "WE DO NOT BREAK USERSPACE". So even though it fixed odd behavior in the kernel, it was classified as a bug and reverted. And it's a good thing, too, because there's a lot of legacy software with no maintainers that, if broken by a regressive kernel change, would never function again. We already deal with dependency hell when library changes aren't properly versioned, we don't need the component that the entire system is built on doing that, too.

Python could learn a thing or two from Linus. And you'd think they would have learned it after the switch from python2 to python3. If a CVE in the Linux kernel demanded a speedy solution and the options were to either take the easy route of breaking something in userspace and expecting userland software to eventually patch itself, or the more difficult route of developing a more efficient algorithm so that your text editor that hasn't seen an update in over 30 years won't start segfaulting, you already know which option they're gonna take.

Platforms and applications have very different responsibilities.

BradenBest
Author

I feel like they should've added the option to set a limit, but not set it by default. IMO, sanitizing user input should be an active choice by the developer, not an opt-out system (especially not the way it was done).

lior_haddad
Author

So previously an exploit could hang the server for ~5 seconds, and after the fix the same exploit executed on the same code will crash the process by throwing an exception. 🔥This is fine.🔥

sharkinahat
Author

"This operation might take a long time" is not a security issue, it's a refactoring target. Changing core language functionality in a patch release without widespread advance communication is just flat wrong.

Also, it's 2022. If you're actually still using unsanitized input from web forms in your application, the onus is on you to fix it (or get hacked.)

TheJimNicholson
Author

One of the first things I learned in my basic intro comp sci class was "always assume the user will find a way to break your code." If I ever get user input, the very first thing I do is figure out what restrictions I need to place on the user. Passing unsanitized user input is just bad programming. This is not a language problem - this is a usage problem.
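A minimal sketch of that kind of restriction, for illustration only: the function name `parse_user_int` and the 100-digit cap are assumptions made up for this example, not anything from Python itself. The length check runs before `int()` is called, so an oversized string is rejected without any conversion work.

```python
MAX_DIGITS = 100  # hypothetical application-level cap, not a Python default


def parse_user_int(text: str, max_digits: int = MAX_DIGITS) -> int:
    """Parse a decimal integer from untrusted text, rejecting oversized input."""
    cleaned = text.strip()
    # Separate an optional leading sign from the digits.
    body = cleaned[1:] if cleaned.startswith(("+", "-")) else cleaned
    if not (body.isascii() and body.isdigit()):
        raise ValueError("expected a decimal integer")
    if len(body) > max_digits:
        raise ValueError(f"integer has more than {max_digits} digits")
    return int(cleaned)


print(parse_user_int(" 42 "))   # 42
print(parse_user_int("-7"))     # -7
```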

jeanfecteau
Author

I just tested on my machine: formatting a 10-million-digit number takes 6 minutes with Python and just a few hundred milliseconds with Java (using BigDecimal). Why can't they just make their reference implementation at least competitive with other languages' standard libraries?

Sadiinso
Author

I agree with the viewpoint that it should be up to developers to handle sanitized/unsanitized inputs, rather than making it the default for all.

BJTangerine
Author

I was already grossed out by Python's max recursion limit. But this is just plain ugly.

TheCarmacon
Author

Operators/admins of vulnerable applications still have to update their Python version to 'fix' the vulnerability. So why implement an arbitrary default limit? Rather publish the new Python version with no limit as default, and include instructions on how to set your own site-wide or app-wide default.

malteplath
Author

I'm curious where the default 4,300 digit limit comes from - that seems like a very arbitrary number.

CollinHeist
Author

Adding it as an option is a great thing for those that it affects. There could then have been an open discussion about whether it should be a default in 3.11, where unnecessary breaking changes would be expected. Using a minor version like this just seems wrong.

Josh-uinq
Author

I definitely think it should be up to the APIs to sanitize their input. As a student, I like Python because I know I can just deal with arbitrarily large numbers with unparalleled simplicity. This simplicity is a big reason why Python is so popular. Adding such quirks only pushes people away.

adivp
Author

Thank you for the convenient coverage!

When it comes to security vulnerabilities, one thing to take into account is severity:
How easy it is to exploit the vulnerability, and if the vulnerability is exploited, how much damage will it cause?

A DoS attack in most cases isn't that severe, because:

A) It will not compromise user or company data
B) It is relatively easy to detect
C) It can be mitigated without altering the code

The problem with this particular "fix", is that it will break code that is not vulnerable to DoS attacks because it is not a direct web service, but rather some simulation, rendering, analysis tool, etc. while not mitigating any real critical issues in vulnerable code.

So it really should not have been rushed.
It isn't just about transparency, but about rollout - if this was some arbitrary code execution vulnerability, privilege escalation, or data exfiltration thing, I could understand the urgency, but a minor DoS possibility?

lvmlvm
Author

I think it would have been worth mentioning that this can also be configured from outside the actual Python code by setting the environment variable "PYTHONINTMAXSTRDIGITS", so there is no need to wait for updated versions of libraries or programs which may break because of that change.
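A small sketch of how that could look, assuming the program is launched as a child process. The helper `run_child` and the one-line child program are made up for this illustration; only the variable name PYTHONINTMAXSTRDIGITS comes from the comment above.

```python
import os
import subprocess
import sys

# Child program: parse a 5000-digit string and print its length again.
CHILD_CODE = "print(len(str(int('9' * 5000))))"


def run_child(limit: str) -> subprocess.CompletedProcess:
    """Run CHILD_CODE in a fresh interpreter with PYTHONINTMAXSTRDIGITS set."""
    env = dict(os.environ, PYTHONINTMAXSTRDIGITS=limit)
    return subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # "0" disables the limit, so the child prints 5000.
    print(run_child("0").stdout.strip())
    # With "4300" the child fails on interpreters that enforce the limit.
    print(run_child("4300").returncode)
```

The child's source is untouched here; only its environment changes, which is the point of configuring the limit this way.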

JohannPetrak
Author

God damn it. I know a lot of scientists who use python for quick and dirty (or even medium scale) numerical work; I've even had python jobs running on thousands of cores on a HPC myself in my younger and more naive days. One of the key reasons scientists use python is because many of them are *not* programmers and don't want to have to think like one. The language might be slow to run, but it's so much quicker to get non-developers to actually produce a result with it that it's worth it. In research, after all, you might only ever need to run a specific script once, better to have it take 3 days longer to run than train someone for 2 months to learn Fortran. [Yes, in physics the choice is often between python and Fortran, but that's a whole other rant].

This kind of change goes completely against that advantage. Is it hard to revert to the old behaviour? No. But it forces those who don't want to have to think about computer science issues (which is precisely why they are using python in the first place) to deal with it.

QuantumHistorian
Author

I'm confused why they don't use a faster conversion alg too, but it's probably still a security issue even with subquadratic conversion

ajbiffl
Author

Hmm. You know what is the same as a denial of service attack? Module owners making breaking changes without warning users. Both are going to cause broken code or machines. I doubt this affects me, and I don't have a problem with the change, but I would have preferred a warning.

eengle
Author

The addition of the limiter is fine, IMO. The problem is that they implemented the change with a default setting that was functionally different to the previous default of it not existing. In my opinion, the correct course of action would have been to implement the change, document the capability, and leave the default limit as 0.
This kind of release should _never_ change default behavior.
If I were integrating for Python, I'd suggest a bugfix that changes the default value of sys.set_int_max_str_digits to 0.
As it stands, I would consider this to be a bug in Python. A bug with a workaround, but still a bug.
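For reference, a hedged sketch of the workaround: a context manager that sets the limit to 0 (no limit) around one block and then restores the previous value. The name `int_str_limit` is made up for this example, and the setting it touches applies to the whole interpreter rather than to one thread.

```python
import sys
from contextlib import contextmanager


@contextmanager
def int_str_limit(limit: int):
    """Temporarily set the int/str digit limit; a no-op where the API is absent."""
    if not hasattr(sys, "set_int_max_str_digits"):
        yield
        return
    previous = sys.get_int_max_str_digits()
    sys.set_int_max_str_digits(limit)
    try:
        yield
    finally:
        # Restore the previous limit even if the block raised.
        sys.set_int_max_str_digits(previous)


with int_str_limit(0):
    digits = str(10 ** 5000)   # would exceed the default limit of 4300 digits
print(len(digits))             # 5001
```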

ssholum
Author

I think most security vulnerabilities should be fixed even retroactively, *but* that this is plausibly an exception to that. DoS can be bad but it’s not ACE, and people were using this behavior, and the actual issue is with the unsanitized inputs.

Probably the best compromise would have been adding the limit but leaving the default at 0 for old versions.

danielrhouck