Unsigned int considered harmful for Java (2014) (nayuki.io)
28 points by networked on March 6, 2023 | 57 comments


This post has an introduction that is noteworthy for its disrespect towards proponents of unsigned integers.

"Oftentimes it’s a novice coming from C/C++ or C# who has worked with unsigned types before, or one who wants to guarantee no negative numbers in certain situations."

This is a neutral observation first and foremost. By itself there is nothing wrong with it. I would expect newcomers to a language to miss features from the language they came from. Although it should be noted that this statement is only a gut feeling without proper statistics to back it up. What irks me is what comes next:

"Said novice C/C++/C# programmer typically does not fully understand the semantics and ramifications of unsigned integers in C/C++/C# to begin with."

Not only does he presume that the newcomer does not fully understand the semantics and ramifications of unsigned values in Java; he also asserts that this is the case for the language they came from. How do you expect a reasonable debate to continue from here? Every disagreement is settled by simply stating that the proponent doesn't understand the problem.

Now what if they understand the semantics and ramifications and still want that feature? Does a world exist for the author in which this is the case?

Moreover, the article is pretentious as well. The semantics of unsigned integers in Java haven't been decided yet, so you either have to argue against each proposed model or argue in general. Here, a model with implicit type conversion between unsigned and signed values is argued against. Other possibilities are not considered. At this point I am inclined to just step away, as there is no information about either the semantics being argued against (you have to work it out by reading the article) or whether it is the only model proposed. There is no insight to be had except that the author doesn't want unsigned integers and is content with how it works as is (with an asterisk for bytes).

I do believe that the people in favour of the addition have to make a case for it. But I also thought that we were beyond arguments of the "you're holding it wrong" kind. As silly as it may sound, an argument like this makes me consider the other side immediately. And I say that as someone who never felt the need for unsigned integers in Java.


The author makes it clear from the first paragraph that they're not trying to argue with the strong case for unsigned integers; they'd rather argue against the novice's case.

I'd say that in this contest between the author and somebody described as literally not understanding the implications of what they are asking for, the author might have eked out a victory. Now they can move on to the winner's bracket and argue against an intermediate-level engineer.


The title itself is incredibly aggravating. "Considered harmful" is the most passively arrogant phrase with the exception of Dijkstra's original paper, and it really irks me when authors use it.


Dijkstra's chosen title was "A Case against the GO TO Statement".

Finally a short story for the record. In 1968, the Communications of the ACM published a text of mine under the title "The goto statement considered harmful", which in later years would be most frequently referenced, regrettably, however, often by authors who had seen no more of it than its title, which became a cornerstone of my fame by becoming a template: we would see all sorts of articles under the title "X considered harmful" for almost any X, including one titled "Dijkstra considered harmful". But what had happened? I had submitted a paper under the title "A case against the goto statement", which, in order to speed up its publication, the editor had changed into a "letter to the Editor", and in the process he had given it a new title of his own invention! The editor was Niklaus Wirth.

https://www.cs.utexas.edu/users/EWD/transcriptions/EWD13xx/E...


TIL. So, it's always been and will forever be arrogant in all cases. Good to know.


> Dijkstra's original paper

That was an editorial title. The original title was: "A Case against the GO TO Statement"


The people that advance these notions generally don't understand how computer arithmetic works and are lost in a scholarly fantasy world where they think integers solely exist on an unbounded number line. Anything that violates this predicate must be wrong because it forces them to be careful about edge cases.


When I was a new C++ programmer, I thought I was being safe by using unsigned int parameters instead of signed ints. "Oh, this will prevent negative values from being passed. I guess the compiler or magical elf inside my computer will prevent it." I was a little emotionally damaged to learn that ((unsigned int) -1) resulted in a ginormous value. So yeah, it's definitely a thing for newbies to ascribe auto-magical properties to the keyword unsigned. And I have heard this story repeated to me by good developers many times. We laugh about it together -- "Me too!"
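
(The same foot-gun can be reproduced in Java terms, for anyone who wants to see the bit pattern at work -- a minimal sketch, runnable in jshell:)

    // -1 has the bit pattern 0xFFFFFFFF; reading those bits as unsigned gives the "ginormous value"
    System.out.println(Integer.toUnsignedLong(-1));    // 4294967295
    System.out.println(Integer.toUnsignedString(-1));  // "4294967295"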


What percentage of C and C++ programmers do you think really understand all the rules regarding integer types in C? Namely: https://www.nayuki.io/page/summary-of-c-cpp-integer-rules , https://en.cppreference.com/w/cpp/language/types , https://en.cppreference.com/w/c/language/conversion#Integer_... , https://en.cppreference.com/w/c/language/conversion#Usual_ar...

Never mind the fact that the vast majority of C/C++ programmers don't appreciate that signed overflow in their languages is undefined behavior and can result in anything from correct behavior to completely unpredictable garbage. It's easy to argue for adding unsigned to Java because "I want this feature"; it's harder to argue against it because "I don't want others to use this feature / I don't like how this feature interacts with existing features".

Another example of not understanding types: did you know that Rust's usize can be as small as 16 bits? And that even on a 32-bit system, objects are limited to 2 GiB in size, not 4 GiB? https://stackoverflow.com/questions/32324794/maximum-size-of...

> The semantics of unsigned integers in java hasn't been decided yet

I think they have been decided. Java SE 8 added functions to treat bits as unsigned and perform operations (e.g. Integer.divideUnsigned(), Long.toUnsignedString()). They might have also stated that it was a non-goal to add unsigned primitive integer types to the language.

For what it's worth, I am aware that the lack of unsigned integers in Java has been a very contentious topic for decades, and the loudest voices are definitely on the pro-unsigned side. My favorite example is this thread: https://stackoverflow.com/questions/430346/why-doesnt-java-s...


> What percentage of C and C++ programmers do you think really understand all the rules regarding integer types in C?

Who cares? You could ask the same question about operator precedence, and the answer to both is similar: if it's unclear to you or those who might read your code, just be more explicit than strictly necessary. Handle or assert cases of possible overflow. And by all means, you don't have to mimic the C/C++ implicit casting rules. If something's a potential pitfall, just make it an explicit cast (like `long` to `int` already is).
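
For instance, Java already makes the lossy direction explicit, and you can opt into a checked narrowing when overflow matters (a small sketch, runnable in jshell):

    long big = 5_000_000_000L;
    int lossy = (int) big;              // explicit narrowing cast: silently wraps to 705032704
    int checked = Math.toIntExact(big); // throws ArithmeticException instead of wrapping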

The status quo has some huge pitfalls of its own. `uint64` is out there, whether certain Java folks like it or not. It's there in protobuf, it's there in your databases, it's there in your FFI. This leaves people who have to interoperate with things from Java with the choice of either using `long` and hoping beyond hope your users read the docs, or using BigInteger and taking the accompanying performance hit, plus more correctness problems if your users try to operate on the value and take it outside its range.
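
To make the two options concrete (a minimal sketch, assuming the uint64 field has already been read into a plain long):

    import java.math.BigInteger;

    public class Uint64Interop {
        public static void main(String[] args) {
            long rawBits = -1L;  // bit pattern of 2^64 - 1, as it arrives from protobuf/DB/FFI

            // Option 1: keep it in a long and remember (and document) that it is "really" unsigned.
            System.out.println(Long.toUnsignedString(rawBits));     // 18446744073709551615
            System.out.println(Long.compareUnsigned(rawBits, 1L));  // positive: 2^64 - 1 > 1

            // Option 2: widen to BigInteger and take the allocation/boxing hit.
            BigInteger wide = new BigInteger(Long.toUnsignedString(rawBits));
            System.out.println(wide.add(BigInteger.ONE));           // 18446744073709551616
        }
    }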

The Java version of unsigned integers could completely disallow casting to and from signed integers and they would still be a very useful addition to the language.

Instead what's going to happen is we're going to get JEP 401 and you'll have a hundred different uint64 primitive classes that are exactly the same as each other except that they have different types. That or we'll get lucky and they'll add one to the standard library, at which point everybody who confidently asserted that adding unsigned integers to Java was a bad idea will have to rationalize to themselves that it doesn't really constitute adding them to the language because they're not in the bytecode, or they're not accessible without an import, or something.


Sigh. “You don’t need unsigned types, just these annoying workarounds. Why use unsigned types when you can cast and bit fiddle!”

The proof is in the use. If people create and use libraries that abstract for a primitive that every other compiled language has, then the primitive is missing from your language.


well, if only a minority uses those workarounds/libraries then maybe it's best to have it as a library and not in the language. That way the language can worry about other things instead of having to support twice as many integer types.

I also sometimes want to have unsigned integer types, mostly for data model correctness' sake (aka this value can't be negative anyway, let's use unsigned). But so far I have only really missed unsigned bytes in Java (which the author acknowledges). And even in other languages I rarely "need" unsigned types.


> if only a minority uses those workarounds/libraries

Only a minority uses those workarounds, because they are cumbersome workarounds. If they were given proper unsigned types, they would use them much more often. I find myself using unsigned types in 99% of cases in Rust. I don't even remember the last time I had to use a signed type.


The only time I've ever used a signed integer in Rust is to store (temporarily) an offset of some unsigned value!


How do you almost never run into negative numbers?


Describing the sizes of collections seems like a reasonably prolific example.


But if you do any sort of math on those sizes then you probably want them to be signed anyway, to avoid -1 rolling over to UINT_MAX and bugging out your system. Automatic signed-to-unsigned type coercion is extremely bug-prone, so it's better to avoid using unsigned integers entirely unless you never want to use them in math formulas. You avoid many more bugs by limiting collections to 2 billion than by mixing signed and unsigned integers. I have never had a bug related to signed collection sizes, but I have had a lot of bugs related to unsigned-to-signed casts.


When I write algorithmic code based on collections, the math is usually done on offsets, not sizes. And since negative offsets don't exist, situations where they could arise require explicit handling anyway.

As I work in bioinformatics, the in-memory collections tend to be pretty large. While the arithmetic is usually done with 64-bit integers, it often makes sense to store the numbers in 32 bits to save space. And since the length of a human genome is ~3 Gbp, that means unsigned 32-bit integers. Signed 32-bit integers are just bugs waiting to happen.

And sometimes the integers are stored in bit-packed arrays, where the width could be 29 bits, 33 bits, or something like that. Those are much easier and less error-prone with unsigned integers.


There is no automatic signed to unsigned coercion in Rust, so maybe that's why I don't have that problem. And I almost entirely use unsigned, even when doing math on sizes, so I never mix them. The only place one has to be careful is subtraction.


> Automatic signed to unsigned type coercion is extremely bug prone

When adding unsigned int as a language feature, don't we get to make up the rules? It seems like we can choose to make the rules not awful; we are not beholden to what C++ has done.


> But if you do any sort of math on those sizes then you probably want them to be signed anyway to avoid -1 rolling over to UINT_MAX and bugging out your system.

Both are shit and will create bugs if you let them through; it doesn't make much of a difference.

In fact there are languages which allow negative indices, in which going to -1 is a lot worse, because that's a valid index, just a nonsensical one.

> Automatic signed to unsigned type coercion is extremely bug prone, so better avoid using unsigned integers entirely unless you never want to use them in math formulas

Or you can just not have "automatic signed to unsigned type coercion" in the first place. Or any sort of automatic coercion for that matter.

> You avoid many more bugs by limiting collections to 2 billion than by mixing signed and unsigned integers.

Hell you'd avoid many more bugs by limiting collections to 32000 too!

> I have had a lot of bugs related to unsigned to signed casts.

I've had a lot of bugs related to signed to signed casts.

That doesn't say anything about signed values, that says something about casts.


Yeah, that is a strange comment. For me, the canonical use case for "all non-negative, but -1 for error" is index_of(collection, value). If they "never run into negative numbers", do they not see/use this pattern?
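
(In Java terms, a one-line sketch of the pattern:)

    int i = "hello".indexOf('z');   // -1 means "not found", which only a signed result can express
    if (i < 0) { /* handle the missing case */ }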


To me, the lack of unsigned integers in Java is the perfect example of how "simplicity" doesn't come from just taking things away. The theory is that unsigned types add complication, but the reality is that the complication exists regardless, and now you have to do workarounds.

I think the same thing about strongly typed languages that lack generics. Generics are obviously a complex feature but, if you don't have them, you're just spreading the complexity elsewhere.


I don’t agree with the argument. It shows one design for unsigned integers with confusing implicit conversions and concludes unsigned integers are bad. There is probably a different design for adding unsigned integers to Java that would be good.

I’m not saying a good design exists or would be worth the cost to implement. This post just does not explore the solution space much.


I'm sorry, but byte should not be a signed type.

Every time I work with byte in Java I have to cast it to an int with (b & 0xFF).
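
Concretely (a tiny sketch, runnable in jshell):

    byte b = (byte) 0xF0;        // the byte 0xF0 is stored as -16
    int asSigned = b;            // sign-extends to -16
    int asUnsigned = b & 0xFF;   // 240, the reading you almost always want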

Utter madness.


From the article:

> Bytes are a different story however. I argue that signed bytes make programming needlessly annoying, and that only unsigned bytes should have been implemented in the first place

> https://www.nayuki.io/page/javas-signed-byte-type-is-a-mista...


Makes that article even more schizophrenic.


I should read the article, doh!


It could be worse. It could be unspecified whether it is signed or not. But surely nobody would design a language like that.


Well, technically C and C++ do not have mystery-signed byte types ... because they don't have byte types, but char types, which may be wider than eight bits.

If someone designed a language this way today they would be called a master troll. At least you can require CHAR_BIT == 8 in practice for everything except weird embedded systems - both POSIX and Windows do just that.


Char is the byte type in C (and thus C++), both in theory and in practice. The problem is that it is a very overloaded type: as a character type it is used for strings, as an integral type it is used as a small integer, and, as it is blessed with special semantics (strict aliasing and sizeof(char) == 1), it is also the byte type.

Technically C++ now does indeed have a blessed byte type (std::byte). But it is not an integral type: it supports bitwise operations but not arithmetic ops, so signedness doesn't come into the picture.


> Char is the byte type in C (and thus C++), both in theory

Not an 8-bit byte which is the definition most people think of when talking about bytes and the only one that matters once you need to interact with other systems.

> and in practice.

Yes.

> Technically C++ now does indeed have a blessed byte type (std::byte). But it is not an integral type: it supports bitwise operations but not arithmetic ops, so signedness doesn't come into the picture.

Its signedness does matter for casts to integral types - int(std::byte(255)) == 255 whereas int(char(255)) is implementation-defined.


One of my favorites; guess the difference between the output of this code:

    byte a1 = 0x40;
    byte a2 = (byte) 0x80;
    a1 >>= 1;
    a2 >>= 1;
    System.out.printf("0x%02X\n", a1);
    System.out.printf("0x%02X\n", a2);

And that code (note the operator ">>>" instead of ">>"):

    byte a1 = 0x40;
    byte a2 = (byte) 0x80;
    a1 >>>= 1;
    a2 >>>= 1;
    System.out.printf("0x%02X\n", a1);
    System.out.printf("0x%02X\n", a2);

Congrats if you guessed right on your first try, because I certainly did not.


heh. if you know what is happening it's guessable.

my model which might not be correct is:

this `byte a2 = (byte) 0x80` is a cast from the integer 0x80 to a byte, and when there is a cast from integer to byte it just truncates the bits. when printing out this value it interprets it as two's complement, so its 'value' is -128. so it just takes 8 bits starting from the least significant bit, and you end up with the bits 1000_0000.

then when you do this `>>=` operator it promotes to an integer, does the right shift, then casts back to a byte. the promotion from byte to integer is done by sign extending the byte to 32 bits, so you get 25 1's followed by 7 0's. the shift is a signed shift so you get 26 1's followed by 6 0's. then the cast back to byte leaves you with 1100_0000, which is why you get the 'weird' result of 0xC0, which is larger than 0x80.

the unsigned shift works in a similar way, except the intermediate result is a 0 followed by 25 1's followed by 6 0's. but these values in the high bits have no effect after the truncation, which is why you get the same result.

you need to do `a2 & 0xFF` to clear out the sign bits outside the byte range before applying the shift, in order to do the unsigned-byte-to-int promotion correctly. but emulating unsigned types in java is super dangerous.
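
to make the masking point concrete (a small sketch, runnable in jshell):

    byte a2 = (byte) 0x80;
    byte surprising = (byte) (a2 >>> 1);          // sign-extends to 0xFFFFFF80 first, so the result is 0xC0
    byte expected = (byte) ((a2 & 0xFF) >>> 1);   // mask first, so the shift sees 0x00000080 and yields 0x40
    System.out.printf("0x%02X 0x%02X%n", surprising, expected);  // 0xC0 0x40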


I don’t know if this classifies as a party trick. If one is aware of the existence of “>>>”, I would presume one would solve this problem without guesswork.


So obvious it's not even fun, right? And yet... there is a twist!

I don't want to spoil the ending. I encourage people to run the code on their own after they made their guess to compare.


This seems to mostly be an argument that implicit conversions are harmful. I think that's generally agreed to be true, most newer programming languages require all type conversions to be explicit.


I write a lot of firmware in C. I find its implicit conversions to be broken and tediously annoying at this point. I think it's an impossible problem because you can't capture the semantics of what the programmer is trying to do without extra input.

I suspect part of the problem is that conversion between types is complex and annoying to implement in a compiled language. With the added bonus that it's only something the end users care about, not the compiler writer. Hence in the wild this stuff tends to be broken or inadequate. Like in C it's way broken. In Java it's way inadequate.


Thankfully, modern C and C++ compilers have -Wconversion or equivalent flags so you can require explicit conversions.


indeed - implicit conversions are the underlying cause of bugs in most of the issues i get tasked with fixing.

i think it would be sensible to add unsigned, and disallow implicit conversion (definitely at least narrowing conversions).

but java is a distant world to me.


I would think that "lossless" implicit conversions are ok. Right?

For example, implicit conversion from a 32-bit signed integer into a 64-bit signed integer.
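
A minimal example of the lossless direction (runnable in jshell):

    int narrow = -123;
    long wide = narrow;        // implicit widening conversion: always representable, no cast needed
    System.out.println(wide);  // -123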


My only beef with unsigned types in Java is when working with external systems that _actually have_ them and converting everything correctly. The 12 or so times in my career where I've had to write or implement protocols on that low of a level might have been a pain, but otherwise I never once have missed them.


Personally I think that protocol description languages should support uint63 and int53 natively.

Uint63 can be promoted to whatever is the largest native integer type on most systems, and int53 fits into a JavaScript number. It would be useful to have explicit checks for these values on the protocol level.
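
A rough sketch of what those checks could look like on the JVM side (the names and exact bounds are my reading of the proposal, not an established spec):

    static boolean fitsUint63(long v) {
        return v >= 0;  // top bit clear: the value survives promotion to any signed 64-bit type
    }

    static boolean fitsInt53(long v) {
        long max = (1L << 53) - 1;  // 9007199254740991, the largest JavaScript-safe integer
        return v >= -max && v <= max;
    }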


I mentioned above, but in Kotlin going from a packed C struct to JVM:

    Mapping for C type to ByteBuffer call:
        
    uint16_t ->  getShort().toUShort()
    uint32_t ->  getInt().toUInt()
     int32_t ->  getInt()
       
It's not the most ergonomic set of calls, but definitely limits the error prone nature of it all since if you try to pass an Int into a function that needs a UInt, you'll get a compiler error


> To operate on two uint32_t values, we need to widen them to Java long using val & 0xFFFFFFFFL. Unfortunately this technique does not allow us to operate on two uint64_t values because there is no wider signed type in Java

> I argue that signed bytes make programming needlessly annoying, and that only unsigned bytes should have been implemented in the first place: Java’s signed byte type is a mistake.

> None of these alternatives is as correct as using uint32_t in C/C++, but I don’t think this situation comes up often enough to matter.

Never mind that the supposed benefits are extremely slim, and mostly due to other weaknesses in Java


> > To operate on two uint32_t values, we need to widen them to Java long [...] Unfortunately this technique does not allow us to operate on two uint64_t

This was filed under the heading "Straightforward emulation". In the next section titled "Efficient emulation", I show that operating on two 64-bit values in unsigned mode is easily doable.
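
(Illustrating the point rather than quoting the article: addition, subtraction, and multiplication already give the right uint64 bit patterns in a plain long, and the JDK helpers cover comparison and division.)

    long a = 0xFFFFFFFFFFFFFFFEL;  // bit pattern of 2^64 - 2
    long b = 3L;
    System.out.println(Long.toUnsignedString(a + b));                      // 1, i.e. (2^64 - 2 + 3) mod 2^64
    System.out.println(Long.compareUnsigned(a, b) > 0);                    // true
    System.out.println(Long.toUnsignedString(Long.divideUnsigned(a, b)));  // 6148914691236517204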

> other weaknesses in Java

Please explain. Keep in mind that Java is a simpler C++, and consciously chose to remove features that would be confusing or dangerous (e.g. unsigned integers, pointer arithmetic, destructors, multiple inheritance, operator overloading).

> when a file/network format specifies an unsigned data field – how should we represent it in Java? [...] None of these alternatives is as correct as using uint32_t in C/C++

Types like uint32_t don't buy you as much functionality as you think. In the last few projects I worked on, the domain of allowed values was a weird range. For example, 1 <= QrCode.version <= 40 ( https://github.com/nayuki/QR-Code-generator/blob/2643e824eb1... ). For example, 1 <= Png.Ihdr.width < 2^31 ( https://github.com/nayuki/PNG-library/blob/536acb238f9c000e2... ). Neither Java int nor C uint32_t capture these constraints properly.
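
(For illustration, a hypothetical sketch of the kind of check such a constraint actually requires, regardless of whether the field is int or uint32_t:)

    final class QrVersion {  // hypothetical wrapper, not the actual library code
        final int value;
        QrVersion(int version) {
            if (version < 1 || version > 40)
                throw new IllegalArgumentException("version out of range: " + version);
            this.value = version;
        }
    }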


I don't follow this argument - the only thing I can think of that's MORE confusing than mixing signed + unsigned types is passing around values of the signed type that are "supposed to be interpreted as unsigned", like "Long.divideUnsigned" does.

https://docs.oracle.com/javase/8/docs/api/java/lang/Long.htm...

So in a program that uses this a lot, a reader can never tell reliably if a value typed "long" is signed or unsigned and the compiler can't catch a mistake like using "divideUnsigned" and then printing the result without using "toUnsignedString"...
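
A minimal sketch of the problem (runnable in jshell):

    long x = Long.parseUnsignedLong("18446744073709551614");  // 2^64 - 2, stored in a plain long
    long q = Long.divideUnsigned(x, 3);            // unsigned math on the bit pattern
    System.out.println(q);                         // 6148914691236517204 -- happens to look right
    System.out.println(x);                         // -2, because println reads the same long as signed
    System.out.println(Long.toUnsignedString(x));  // 18446744073709551614, the intended reading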


I generally agree with the author, and Google style guides generally discourage use of unsigned types (even in C++) for anything that isn't actually a bag of bits.

I don't know how many strange issues I've tracked down that amounted to "this protocol buffer has a uint32 field, and surprise now the value is negative in Java and oops there was a check that cared about that." At least five or six issues.

At least when it comes to serialization, enforce invariants above the serialization layer.

I would love an unsigned byte type in Java though. What a pain.


Kotlin solves this very nicely with Uxxx types, like UByte, UInt, UShort, etc.

Typing allows enforcing the boundaries where you go from signed to unsigned, and the bit fiddling is handled in the class

   UShort.toInt(): Int = data.toInt() and 0xFFFF
They're defined as inline classes/functions that have no runtime cost so it's equivalent to the "Straightforward emulation" from the article


Another good reason to switch to Kotlin. :)


Signed and unsigned should both go; there should only be wordN types (word8, word16, etc.) with two's complement semantics.

Then you can have explicit functions for the operation you want: signed_mul(), signed_less_than(), unsigned_greater_than(), add_assume_no_overflow(), etc. (add your favorite syntactic sugar/operator symbols for these).

Assembly is more explicit and clearer to understand in this regard.


This would be terrible, the point of the type system is to stop you making mistakes by accidentally treating one type as another


How would you make a mistake?

The only time signed vs unsigned matters is for comparisons and mul/div, and those have explicit names so you're never surprised that 0xFFFFFFFF is less than 0 when doing signed_less_than().


That's how Java works. See >>>, compareUnsigned, divideUnsigned, parseUnsignedInt, etc.
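
For example (runnable in jshell):

    int bits = 0xFFFFFFFF;                                      // -1 when read as signed
    System.out.println(bits < 0);                               // true: ordinary signed comparison
    System.out.println(Integer.compareUnsigned(bits, 0) > 0);   // true: same bits compared as unsigned
    System.out.println(Integer.toUnsignedString(bits));         // 4294967295
    System.out.println(bits >>> 1);                             // 2147483647: logical shift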


A lot of comments here say that Java currently lacks unsigned integer types. This isn’t quite right. Java’s char type is specified and works as an unsigned 16-bit integer.


Correct, but there are several things to consider.

Due to byte/short/char being promoted to int for any arithmetic, any time you do an operation like char+char→int, you'll still come into contact with signed types and need to continually cast back to char. This problem also occurs in C/C++, and you can typically find that uint16_t * uint16_t → signed int.
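
A small example of that promotion (runnable in jshell):

    char a = 0xFFFF;                 // max char value, 65535
    char b = 1;
    int sum = a + b;                 // char + char promotes to int: 65536, no wraparound
    char wrapped = (char) (a + b);   // explicit cast truncates back to 0
    System.out.println(sum + " " + (int) wrapped);  // 65536 0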

Using char to represent numbers (e.g. 16-bit channel values for pixels) instead of text is semantically incorrect. There's no problem if you're just working within Java, but once you get into comparing or porting code across languages, this is a code smell. Just as an example, the range for char in Rust is [0x000000, 0x10FFFF] (excluding the surrogate range), and this is checked at run time when converting to char; this would be seen as weird through the lens of C/C++/Java, where char is just a sequence of bits where all values are allowed.


> this is a code smell

Worry not - the only time i ever authored java code was when i was making the world's smallest (still) JVM. You have no reason to worry about bad smelling java code using char as uint16 from me :D



