It's very strange that ASCII/UTF-8 verification functions just return a boolean. It would be logical to at least return the position of the last valid byte; it literally does not cost anything.
Daniel_Zhu_af
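A minimal sketch of the idea in the comment above, assuming a hypothetical helper named `utf8_valid_prefix` (not from the stream or any library). Instead of a boolean it returns how many leading bytes are well-formed UTF-8, which is the offset of the first bad byte rather than the index of the last good one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the number of leading bytes of buf[0..len)
   that form complete, well-formed UTF-8 sequences. A return value equal
   to len means the whole buffer is valid. */
size_t utf8_valid_prefix(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        size_t n;          /* total bytes in this sequence */
        uint32_t cp, min;  /* decoded value and smallest value allowed for n */
        if (b < 0x80)                { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { n = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; min = 0x10000; }
        else return i;               /* stray continuation or invalid lead byte */
        if (len - i < n) return i;   /* truncated sequence */
        for (size_t k = 1; k < n; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xC0) != 0x80) return i;   /* not a 10xxxxxx byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return i;                      /* overlong encoding */
        if (cp > 0x10FFFF) return i;                 /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return i;  /* UTF-16 surrogate */
        i += n;
    }
    return i;
}
```

A caller that only wants the boolean can still compare the result with `len`, so nothing is lost compared to a plain yes/no check.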
3:18 The vape meme is pretty funny tho lol
RustIsWinning
You are the man tsoding! So smart! Keep it up my brother.
mattcarpenter
Brooo 1:35:20 got me.. "Suffering from success" 🤣🤣
devjcarrillo
I believe your implementation for checking a valid UTF-8 sequence is longer than it needs to be. Mine was literally 4 functions: calculate the expected length in bytes from the first byte, calculate the expected length for an int32 code point, read a code point from a buffer and return its int32 value or -1 on error, and write a code point to the buffer and return the amount of bytes written, or 0 if the buffer was too small (which is easy to check because we know how many bytes each code point takes).
The initial int32 value for the code point is the first byte ANDed with 0xFF >> expected length, which we know just by looking at the byte itself. I also realized I can compress that by inverting all the bits and checking backwards: the value must be less than 16, thus returning 4, or less than 32 and return 3, or less than 64 and return 2, or return 0 for an error. ASCII just returns 1.
For each following byte we multiply the code point by 64 (basically shifting it up by 6 bits), then check that the current byte has the top bits '10' (the first byte is already masked) and return an error if not, then just add the byte value ANDed with 0x3F. Writing is almost the same, but backwards. The whole utf8.h file is 74 lines of C code, and that's with unrelated shit inside and conversions from XML Unicode escapes on top of that.
rogo
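The approach described in the comment above could look roughly like this. It is a sketch under assumptions, not the commenter's actual utf8.h: the names `utf8_len_from_lead` and `utf8_read` are made up, and the explicit rejection of 0xF8..0xFF is an extra check that the comment does not mention (without it the "less than 16" test would accept those bytes as 4-byte leads).

```c
#include <stddef.h>
#include <stdint.h>

/* Expected sequence length from the lead byte, 0 on error.
   Inverting the byte turns its run of leading 1s into leading 0s, so the
   length falls out of plain magnitude comparisons. */
int utf8_len_from_lead(uint8_t lead)
{
    uint8_t inv = (uint8_t)~lead;
    if (inv >= 0x80) return 1;  /* 0xxxxxxx: ASCII */
    if (inv < 8)  return 0;     /* 11111xxx: never valid (extra check) */
    if (inv < 16) return 4;     /* 11110xxx */
    if (inv < 32) return 3;     /* 1110xxxx */
    if (inv < 64) return 2;     /* 110xxxxx */
    return 0;                   /* 10xxxxxx: a continuation byte cannot lead */
}

/* Reads one code point from buf and returns it, or -1 on malformed input.
   Overlong and range checks are deliberately left to the caller. */
int32_t utf8_read(const uint8_t *buf, size_t len)
{
    if (len == 0) return -1;
    int n = utf8_len_from_lead(buf[0]);
    if (n == 0 || (size_t)n > len) return -1;
    int32_t cp = buf[0] & (0xFF >> n);  /* mask off the length prefix */
    for (int i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) return -1;  /* top bits must be '10' */
        cp = cp * 64 + (buf[i] & 0x3F);          /* shift up 6 bits, add payload */
    }
    return cp;
}
```

The mask `0xFF >> n` keeps one bit more than strictly necessary for multi-byte leads, but that bit is the 0 that terminates the length prefix, so the result is the same.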
1:00:18 My brain is too smol for bit operations. I need excessive visual aid to not get repeatedly confused, with a ton of comments and some helper functions.
I once naively thought I'd roll my own Unicode lib, as icu4c is just so big. That was a bottomless pit of misery. Initially I just wanted to sort "correctly". After a long while I got very lost in the spec and gave up.
alh-xjgt
56:40 If you don’t care about surrogates, you only need to check the first two bytes of the 4-byte case; the others can’t overflow.
berndeckenfels
I can't necessarily speak to the intent at 1:08:00, but the C3 site says that contracts enforce both runtime *and* compile-time constraints, so it might be signaling the compiler to check lengths at comptime, if possible. Otherwise, you're depending on LLVM to (hopefully) do that for you in an optimization pass.
j-wenning
You need to use the volkswagen npm package
mrcrafter_y
Wouldn't it be sufficient to check that the first UTF-8 byte is <= 0xF4 and the second is <= 0x8F, since the UTF-8 representation of the largest valid Unicode code point, U+10FFFF, is 0xF4 0x8F 0xBF 0xBF?
sebschrader
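A small sketch of the upper-bound check discussed in the two comments above, with a hypothetical name `utf8_four_byte_in_range`. One assumption worth spelling out: the limit of 0x8F on the second byte only applies when the lead byte is exactly 0xF4, because leads 0xF0 to 0xF3 can take any continuation byte without exceeding U+10FFFF. The check covers the upper bound only; overlong forms (0xF0 followed by a byte below 0x90) are a separate test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a 4-byte sequence that starts with
   (lead, second) cannot decode to a value above U+10FFFF.
   The third and fourth bytes cannot push the value out of range,
   so they are not needed here. */
bool utf8_four_byte_in_range(uint8_t lead, uint8_t second)
{
    if (lead < 0xF0 || lead > 0xF4) return false;  /* not a usable 4-byte lead */
    if ((second & 0xC0) != 0x80) return false;     /* not a continuation byte */
    if (lead == 0xF4) return second <= 0x8F;       /* 0xF4 0x90.. is > U+10FFFF */
    return true;
}
```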
It took me 4 days to watch this video. Literally started it the day of and had to keep refreshing the page so YouTube would work when I came back to play more. It's not even one of your longest, just life got in the way. I've looked at both C2 and C3, and I don't think either is particularly good. There may be some elements here and there to copy, but overall I think they're both pale imitations of C++ trying to simplify back to C.
I'm sure plenty will disagree, if not with every point at least some, but I think if a language is going to proclaim itself better than C, it should 1) have arrays and strings as first-class objects in the language and default to a _view type of some kind 2) all strings should by default be f-strings if the compiler can resolve them at compile time and we should try our damnedest to avoid printf style functions as the most prevalent means of printing messages anywhere while retaining the ability to use it if truly necessary 3) have operator and function overloading, as well as UDLs 4) have RAII with defer being an option, but one that's not encouraged 5) an import system that isn't annoying, of which Rust and Python are closest but still kind of suck and most importantly 6) stop deleting useful features just because they're not the recommended method of doing things, such as goto or include.
I also obsess over the names I give types, variables and functions. I tend towards the Whatever_This_Case_Is_Called for types and this_case_for_functions, but I prefer giving variables shorter and less annoying names. I don't want to type 20 characters at a time to reference a variable, and I don't want to constantly tab complete to get every variable out. I'm generally only verbose when writing prose, not code.
I abhor exceptions in my own code, preferring the style of returning an error code and taking all arguments that require modification as references or pointers. I was thinking about how I'd handle that for my own language, and I'm thinking that there should be a compiler or linker flag that allows a stub to be generated which would capture all exceptions and translate them into a simple error code. That way you could disable exceptions with one fell swoop and check for an error like C's errno. I'm sure that'll leave a few people aghast, but I hate exceptions that much. I feel like most error code in "modern" languages tends to be that of explicit denial. The programmer denies that an error can happen, usually with a ! or a ? attached to the function call, and either ignores it anyhow, or the program "crashes" with a language specific stub generated message. It's the equivalent of wrapping the entirety of main with a try/catch block and just saying whatever. I'd prefer that if the programmer is going to ignore errors anyway that we don't have a bunch of random !'s and/or ?'s all over the place.
anon_y_mousse
Very interesting series, learned a lot!
PyguK
Hi Tsoding, a fellow minion here. Would love to see an OS and macros setup stream. As a habitual watcher, I would love to put in place the better coding habits that I have not developed over my time learning on the web. As a fellow recreational programmer, I would really appreciate a good standard method to follow when programming. Take care.
diegocaumont
I don't have your skills or level of understanding, but I also have my YouTube watch history turned off. This way I don't watch random videos.
mehmeh
It's not just Windows that uses UTF-16. JavaScript also has UTF-16 problems: JavaScript indexes strings by UTF-16 code units (but iterates by code points), and the strings can contain unpaired surrogates (and thus fail to round trip to UTF-8). I personally have dealt with multiple bugs where JavaScript code (including the TypeScript compiler) mishandled strings due to UTF-16 issues. While I love making fun of Windows for its UTF-16 problems (I dealt with UTF-16 issues there as well), I can't miss this chance to also make fun of JavaScript.
---..
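To make the surrogate problem from the comment above concrete, here is a sketch in C (kept in C to match the rest of the thread, not JavaScript). The function name `utf16_read` and its interface are assumptions for illustration. It decodes one code point from UTF-16 code units and reports an unpaired surrogate as an error, which is exactly the case that has no valid UTF-8 encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from s[0..len).
   Stores the number of code units consumed in *used.
   Returns the code point, or -1 for empty input or an unpaired surrogate. */
int32_t utf16_read(const uint16_t *s, size_t len, size_t *used)
{
    *used = 0;
    if (len == 0) return -1;
    uint16_t hi = s[0];
    *used = 1;
    if (hi < 0xD800 || hi > 0xDFFF) return hi;  /* plain BMP code unit */
    if (hi >= 0xDC00) return -1;                /* low surrogate with no high before it */
    if (len < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
        return -1;                              /* high surrogate with no low after it */
    *used = 2;
    /* 10 payload bits from each half, offset by 0x10000 */
    return 0x10000 + (((int32_t)(hi - 0xD800) << 10) | (s[1] - 0xDC00));
}
```

Indexing by code unit means a string holding one emoji has length 2, and slicing between the two units leaves each half as an unpaired surrogate that this function would reject.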
As a German I approve of the Autobahn thing xD
cheebadigga
Dear Mr. Tsoding, would you like to do the ZverDVD unboxing video in 2024, what do you think?
kirsanov
Honestly, UTF-8 is easy to implement because you can just check for a mask of top bits '10', remember the length of the UTF-8 sequence and calculate what value you got. To combat overlong characters you just switch on the length of the received sequence and check that the resulting code point is no less than the minimum code point that requires that length. That's the entirety of UTF-8; it's stupid simple and beautiful. Just ignore any linkage to the Unicode code points and pass the int32 value to the higher level, which will actually check whether this is a valid glyph or not. 2000 years later we're gonna have more languages (emojis probably) anyway.
rogo
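The overlong check from the comment above, sketched as a switch on the sequence length. The name `utf8_is_shortest_form` is hypothetical; the minimum values are the smallest code points that need 2, 3 and 4 bytes in UTF-8.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if code point cp, decoded from a sequence
   of n bytes, could not have been encoded in fewer bytes. */
bool utf8_is_shortest_form(int32_t cp, int n)
{
    switch (n) {
    case 1:  return cp >= 0x0 && cp <= 0x7F;
    case 2:  return cp >= 0x80;     /* below this, 1 byte would do */
    case 3:  return cp >= 0x800;    /* below this, 2 bytes would do */
    case 4:  return cp >= 0x10000;  /* below this, 3 bytes would do */
    default: return false;          /* no other lengths exist in UTF-8 */
    }
}
```

For example, the two bytes 0xC0 0x80 decode to 0 with length 2, which fails the `cp >= 0x80` test and is rejected as an overlong NUL.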
Guys, does anyone know what Tsoding thinks about Ladybird development, and whether there's a chance he'll take a look at its code?