
> NUL Characters Are Allowed In Text Strings

Any raw byte sequence is allowed in text strings.



Which seems logical to me.

One catch, though: some of SQLite's own functions are unaware of this and stop at the first NUL they encounter: https://www.sqlite.org/nulinstr.html
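A minimal sketch of that behaviour, assuming Python's standard `sqlite3` module (the table `t` and column `s` are made up for illustration). Per the linked page, `length()` only counts up to the first NUL, while casting to a BLOB reports the full byte count:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")  # hypothetical table for illustration

# Bind the value as a parameter rather than embedding a NUL in the SQL text.
conn.execute("INSERT INTO t (s) VALUES (?)", ("abc\x00def",))

stored, char_len, byte_len = conn.execute(
    "SELECT s, length(s), length(CAST(s AS BLOB)) FROM t"
).fetchone()

print(repr(stored))  # the full string comes back, NUL included
print(char_len)      # length() stops counting at the first NUL
print(byte_len)      # the BLOB cast counts every stored byte
```

So the value itself survives intact; it is only some of the functions operating on it that truncate.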


> Which seems logical to me.

I expect text to be text (i.e., Unicode, these days).


\u0000 (NUL) is a perfectly valid unicode character.

Though I can understand that people from a mostly C or C-like background, where NUL termination is the norm for strings, are less comfortable with that.


I think the objection was to "Any raw byte sequence is allowed in text strings.", not to NUL.


But aren't there extended ASCII encodings that have 256 8-bit characters?


There are other encodings, yes, where any byte string is valid within that encoding. The Unicode encodings (UTF-*) don't have that property, however.
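A quick way to see the difference, sketched in Python with Latin-1 standing in for the "extended ASCII" case (the variable names are just for illustration):

```python
all_bytes = bytes(range(256))

# Latin-1 (ISO-8859-1) maps every byte value to a code point, so decoding
# cannot fail and the round trip is lossless.
latin1_text = all_bytes.decode("latin-1")
latin1_round_trips = latin1_text.encode("latin-1") == all_bytes

# UTF-8 has no such property: the byte 0xFF never occurs in well-formed UTF-8.
try:
    b"a\xffa".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# NUL, by contrast, is an ordinary code point that encodes to a single byte.
nul_encoded = "\u0000".encode("utf-8")

print(latin1_round_trips, utf8_ok, nul_encoded)
```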

The SQLite docs say this about text values or columns, though it's a bit muddy which is which. (But it doesn't really matter.)

> TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).

Still, it's the best reference I have for "what is the set of values that a `text` value can have".

E.g.,

  sqlite> PRAGMA encoding = 'UTF-8';
  sqlite> CREATE TABLE test (a text);
  sqlite> INSERT INTO test (a) VALUES ('a' || X'FF' || 'a');
  sqlite> SELECT typeof(a), a from test;
  text|a�a
Here we have a table storing a "text" item, whose value appears to be the byte sequence `b"a\xffa"`¹. That's not valid UTF-8, nor valid in any other Unicode encoding. The replacement character here ("�") is the terminal saying "I can't display this".
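The same session can be reproduced from Python's standard `sqlite3` module, as a rough sketch. Setting `text_factory = bytes` asks the module to return TEXT values undecoded, which makes the stored bytes visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA encoding = 'UTF-8'")
conn.execute("CREATE TABLE test (a text)")
conn.execute("INSERT INTO test (a) VALUES ('a' || X'FF' || 'a')")

# typeof() reports the storage class SQLite assigned to the value.
(storage_class,) = conn.execute("SELECT typeof(a) FROM test").fetchone()

# Return TEXT values as raw bytes instead of decoding them to str.
conn.text_factory = bytes
(raw,) = conn.execute("SELECT a FROM test").fetchone()

# Check by hand whether the stored bytes are valid UTF-8.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(storage_class, raw, is_valid_utf8)
```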

Presumably for this reason, the Rust bindings I use have the Text value carry a `[u8]`, i.e., raw bytes. It's easy enough to turn that into a String. (But the conversion is fallible, and the example above would fail. In practice, that failure gets rolled into all the other "it's not what I expect" cases, like the value being an int. Still, it's regrettable for a language to have a text type that's more or less just a bytestring.)
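To illustrate the "rolled into one error" pattern, here is a sketch in Python rather than Rust. The helper `as_text` is hypothetical and is not the API of any actual binding; it just treats invalid UTF-8 the same way as any other unexpected value:

```python
def as_text(value):
    """Return value as str, or raise ValueError if it is not usable text.

    Hypothetical helper: expects the raw bytes of a TEXT value and reports
    every other case (int, None, invalid UTF-8, ...) as the same error type.
    """
    if not isinstance(value, bytes):
        raise ValueError(f"expected text, got {type(value).__name__}")
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("text value is not valid UTF-8") from exc


# Try a valid text value, the invalid one from the example, and an int.
for candidate in (b"hello", b"a\xffa", 42):
    try:
        print(repr(as_text(candidate)))
    except ValueError as err:
        print("rejected:", err)
```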

¹borrowing Rust or Python's syntax for byte strings.


That seems odd. I normally think of text strings as explicitly disallowing control characters, and would consider this a binary blob instead.



