
To think about the difference between serialization formats, here's an analogy I hope will help.

Protocol Buffers (and I think Thrift, and maybe Avro) are sort of like C or C++: you declare your types ahead of time, and then you take some binary payload and "cast" it (parse it, actually) into your predefined type. If those bytes weren't actually serialized as that type, you'll get garbage. On the plus side, the fact that you declared your types statically means that you get lots of useful compile-time checking and everything is really efficient. It's also nice because you can use the schema file (i.e., .proto files) to declare your schema formally and document everything.
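A rough Python sketch of the "cast" idea. It uses the standard struct module as a stand-in for a schema; this is not Protocol Buffers itself, and the format strings are just an illustration:

```python
import struct

# Writer serializes a 32-bit float; the "schema" here is just a format string.
wire = struct.pack("<f", 1.0)

# A reader with the matching schema gets the value back.
(ok,) = struct.unpack("<f", wire)

# A reader with the wrong schema (32-bit int) still "succeeds", but with garbage.
(garbage,) = struct.unpack("<i", wire)

print(ok, garbage)
```

Nothing in the four bytes says which interpretation was intended; that knowledge lives only in the schema.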

JSON and Ion are more like a Python/Javascript object/dict. Objects are just attribute-value bags. If you say it has field fooBar at runtime, now it does! When you parse, you don't have to know what message type you are expecting, because the key names are all encoded on the wire. On the downside, if you misspell a key name, nothing is going to warn you about it. And things aren't quite as efficient because the general representation has to be a hash map where every value is dynamically typed. On the plus side, you never have to worry about losing your schema file.
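A small Python sketch of the attribute-bag behavior, using the standard json module. The payload and the misspelled keys are made up for illustration:

```python
import json

# Hypothetical payload; "fooBar" is the field name used in the comment above.
payload = '{"fooBar": 1, "name": "example"}'

# No schema is needed to parse: the key names travel on the wire.
msg = json.loads(payload)

# A misspelled key silently becomes a brand-new field.
msg["fooBaz"] = 2
fields = sorted(msg)

# A lookup with the wrong capitalization just returns None; nothing warns you.
missing = msg.get("foobar")

print(fields, missing)
```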

I think this is a case where "strongly typed" isn't the clearest way to think about it. It's "statically typed" vs. "dynamically typed" that is the useful distinction.



That's a great analogy! However, I do think strongly typed vs. weakly typed has a role in thinking about this, just a different dimension than the one you're describing. Let's say we come across a JSON structure that looks like this:

  {"start": "2007-03-01"}
Is that a timestamp? Maybe! Does it support a time within the day? Perhaps I can write "2007-03-01T13:00:00" in ISO 8601 format if we're lucky. Can I supply a time zone? Who knows for sure? It's weakly typed data. The actual specification of that type of that field lives in a layer on top of JSON, if it's even specified at all. It might be "specified" only in terms of what the applications that handle it can parse and generate. I could drop that value into Excel and treat it as all sorts of different things.
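To make that concrete, here is a minimal Python sketch using the standard json and datetime modules. The "%Y-%m-%d" format is my assumption about what the producer meant; nothing in the JSON says so:

```python
import json
from datetime import datetime

doc = json.loads('{"start": "2007-03-01"}')
start = doc["start"]

# On the wire this is only a string; the parser has no idea it is a date.
wire_type = type(start).__name__

# The application layer has to guess a format (an assumption, not part of JSON).
as_date = datetime.strptime(start, "%Y-%m-%d")

# A producer that adds a time of day breaks a consumer that guessed date-only.
try:
    datetime.strptime("2007-03-01T13:00:00", "%Y-%m-%d")
    longer_form_ok = True
except ValueError:
    longer_form_ok = False

print(wire_type, as_date, longer_form_ok)
```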

Ion by comparison has a specific data type for timestamps defined in the spec [1]. The timestamp has a canonical representation in both text and binary form. For this reason, I know that "2007-02-23T20:14:33.Z" and "2007-02-23T12:14:33.079-08:00" are valid Ion timestamp text values. In this instance I would describe Ion as strongly typed and JSON as weakly typed. Or, as the Ion documentation puts it, "richly typed".

To make an analogy, weakly typed is the Excel cell that can store whatever value you put in it, or the PHP integer 1 which is considered equal to "1" (loose equality). Strongly typed is the relational database row with a column described precisely by the table schema. Weakly typed is the CSV file; strongly typed is the Ion document.

[1] http://amznlabs.github.io/ion-docs/spec.html


Ion has more data types than JSON, it's true. Ion has a timestamp type and JSON does not, so you could say it's "richer" if you want, but that just means "it has more types."

However I don't think it's accurate to say that the typing of Ion is any "stronger." Both Ion and JSON are fully dynamically typed, which means that types are attached to every value on the wire. It's just that without an actual timestamp type in JSON, you have to encode timestamp data into a more generic type.


The notions of "strong" and "weak" typing have never been particularly well-defined, but I think my usage is in line with their usual meaning: https://en.wikipedia.org/wiki/Strong_and_weak_typing

> Some programming languages make it easy to use a value of one type as if it were a value of another type. This is sometimes described as "weak typing".

Strong typing makes it difficult to use a value of one type as if it were another. In PHP, you can compare the integer value 1 to the string value "1" and the equality test returns boolean true. Conflating integer 1 and string "1" is weak typing. A data format that expresses the concept of the timestamp 1999-12-31T23:14:33.079-08:00 using the same fundamental type as the string "Party like it's 1999!" is what I would call weakly typed.

Ion does not make it easy to use a string as if it were a timestamp or vice versa. It has types like arbitrary precision decimals, or binary blobs, that can't easily be represented in a strongly-typed way in JSON. You can certainly invent a representation, like specifying strings as ISO 8601 for timestamps, or an array of numbers for binary -- actually, wait, how about a base64-encoded string instead? Where there's choice there's ambiguity. These concepts of "type" live in the application layer in JSON, instead of in the data layer like they do in Ion.
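A quick Python sketch of that ambiguity, using only the standard library. The field name "data" and the sample bytes are made up; both encodings are equally valid JSON, and the document itself gives a reader no way to tell which convention was used:

```python
import base64
import json

blob = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Convention A: an array of numbers.
as_array = json.dumps({"data": list(blob)})

# Convention B: a base64-encoded string.
as_b64 = json.dumps({"data": base64.b64encode(blob).decode("ascii")})

# After parsing, the reader sees a list or a str, never "bytes".
type_a = type(json.loads(as_array)["data"]).__name__
type_b = type(json.loads(as_b64)["data"]).__name__

print(as_array, as_b64, type_a, type_b)
```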

Note as well that stronger is my term. The Ion documentation says "richly-typed". Certainly Ion does not include every type in the world. Perhaps a future serialization framework might capture "length" with a unit of "meters", or provide a currency type with unit "dollars", and if that existed I'd call it stronger-(ly?)-typed or more richly typed than Ion. In that case, the data layer would prevent you from accidentally converting "3 inches" to "3 centimeters", since those would be different types. That would be stronger typing than an example where you simply have the integer 3, and it's the application's job to track which integers represent inches, and which represent centimeters. So perhaps "strong" and "weak" are not the best terms, so much as "stronger" and "weaker".


By your definition, any language with strings is weakly typed, since you can always interpret a string as being something else. Strongly/weakly typed has never been a particularly useful description (as the page you linked notes), and I think it's particularly unhelpful here.


> By your definition, any language with strings is weakly typed, since you can always interpret a string as being something else

No, I wouldn't say that's the case. For example, in PHP you can literally write:

  if (1 == "1") { ...
... and the condition evaluates to true. You can do similar things in Excel; Excel doesn't even really differentiate between those two values in the first place. (At least that's how it seems as a casual user.)

This is not the case in strongly typed programming languages that have strings such as C++ or Java. You can convert from one type to another, sure, by explicitly invoking a function like atoi() or Integer.toString(), but the conversion is deliberate and so it is strongly typed. A variable containing a string (java.lang.String) cannot be compared against one containing a timestamp (java.util.Date) by accident. An Ion timestamp is a timestamp and can't be conflated with a string, although it can be converted to one.
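For contrast with the PHP snippet, here is how the same comparison behaves in Python. Python is my own stand-in here (it isn't one of the languages named above), but it shows the "conversion must be deliberate" behavior:

```python
# Unlike PHP's loose equality, an int never equals a string.
loose_equal = (1 == "1")

# Mixing the two in arithmetic is an error rather than a silent coercion.
try:
    1 + "1"
    mixed_add = "allowed"
except TypeError:
    mixed_add = "TypeError"

# Conversion works, but only when you ask for it explicitly.
explicit = int("1") + 1

print(loose_equal, mixed_add, explicit)
```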

Edit: The set of types that are built in, in conjunction with how those types are expressed in programming languages (e.g. timestamp as java.util.Date, decimal as java.math.BigDecimal, blob as byte[]), is why I'd call Ion strongly typed or richly typed in comparison to JSON. Specifically, scalar values that frequently appear in common programs can be expressed with distinctly typed scalar values in Ion. I don't know if there's a good formal definition. You could probably define a preorder on programming languages or data formats based simply on the number of distinct scalar or composite types (so in that sense, yes, it's the fact that Ion has more). However it goes beyond that subjectively. Subjectively it's about how often you have to, in practice, convert from one type to another in common tasks. There is no clear way to represent an arbitrary-precision decimal in JSON, or a byte array, or a timestamp -- so you must "compress" those types down into a single JSON type like string-of-some-format or array-of-number; and several different scalar types must all map to that same JSON type, which creates the risk of conflating values of different logical types but the same physical JSON type with each other. There's no obvious or built-in way to reconstruct the original type with fidelity. There's no self-describing path back from "1999-12-31T23:14:33.079-08:00" and "DEADBEEFBASE64" back to those original types.

I subjectively call JSON weakly typed because its types are not adequate to uniquely store common scalar data types that I work with in programs that I write. I call Ion strongly typed because it typically can. I acknowledged earlier that a data format would be even more strongly typed if it was capable of representing not just the type "integer", but "integer length meters". Ion does not have this kind of type built in, though its annotations feature could be used to describe that a particular integer value represents a length in meters.


> You can't misuse any kind of Ion value that is a string as if it were a timestamp without performing an explicit conversion.

The same is true of JSON. There is no difference, except that Ion has a timestamp type and JSON does not.

If you disagree, please identify what characteristic of Ion's design makes it more strongly typed than JSON, other than the set of types that is built in.


You are choosing a definition of strong typing that supports your argument, but the argument is over the meaning of strong typing to begin with. It's not as if there's some universally accepted definition of strong typing. Like functional programming, functional purity, object oriented, etc.—none of these terms are universally defined.


The fact that "strong typing" has no universal definition is exactly why I think it's not useful.


I hate feeling like I'm nitpicking, but I don't think that's true. I think they do have a well-accepted definition, which appears in Wikipedia, in assorted articles online, and in computer science publications. Here are some examples of CS publications that describe a research contribution in terms of strong typing:

> Strong typing of object-oriented languages revisited. This paper is concerned with the relation between subtyping and subclassing and their influence on programming language design. [...] The type system of a language can be characterized as strong or weak and the type checking mechanism as static or dynamic. http://dl.acm.org/citation.cfm?id=97964

> GALILEO: a strongly-typed, interactive conceptual language. Galileo, a programming language for database applications, is presented. Galileo is a strongly-typed, interactive programming language designed specifically to support semantic data model features (classification, aggregation, and specialization), as well as the abstraction mechanisms of modern programming languages (types, abstract types, and modularization). http://dl.acm.org/citation.cfm?id=3859

> Design and implementation of an object-oriented strongly typed language for distributed applications. http://dl.acm.org/citation.cfm?id=99813

> Strongly typed heterogeneous collections. (Oleg Kiselyov et al.) http://dl.acm.org/citation.cfm?id=1017488

> Strongly typed genetic programming. Genetic programming is a powerful method for automatically generating computer programs via the process of natural selection [but] there is no way to restrict the programs it generates to those where the functions operate on appropriate data types. [When] programs manipulate multiple data types and contain functions designed to operate on particular data types, this can lead to unnecessarily large search times and/or unnecessarily poor generalization performance. Strongly typed genetic programming (STGP) is an enhanced version of genetic programming that enforces data-type constraints and whose use of generic functions and generic data types makes it more powerful than other approaches to type-constraint enforcement http://dl.acm.org/citation.cfm?id=1326695

The argument that the terms have no universal definition cannot be sound in light of their widespread use in computer science publications, even in the title and abstract. Perhaps what you mean to say is that the terms don't have a completely unambiguous or formal definition. That's probably true, but not all CS terms do. The words are contextual and exist on a spectrum, in the sense that a strongly-typed thing is typically in comparison to a more-weakly-typed thing [1]. However, the fact that they're widely used by CS researchers is why I think we should reject the argument that they don't have a universal definition or are not useful. CS researchers like Oleg Kiselyov use the term when describing their papers and characterizing their contributions.

[1] This is true for static and dynamic typing as well: they exist in degrees. Rust can verify type proofs that other languages can't regarding memory safety. Some languages can verify that integer indexes into an array won't go out of bounds. Thus it's not the case that a given language is either statically typed or dynamically typed; rather, each aspect of how it works can be characterized on a spectrum from statically verified to dynamically verified.


> I think they do have a well-accepted definition [...] [You] shouldn't confuse your dislike for them for the absence of a well-accepted definition that's widely used in computer science literature.

Just upthread, you said:

> The notions of "strong" and "weak" typing have never been particularly well-defined

And the Wikipedia article you cited (https://en.wikipedia.org/wiki/Strong_and_weak_typing) says:

> These terms do not have a precise definition

The Wikipedia article also says:

> A number of different language design decisions have been referred to as evidence of "strong" or "weak" typing. In fact, many of these are more accurately understood as the presence or absence of type safety, memory safety, static type-checking, or dynamic type-checking.

Also on Wikipedia (https://en.wikipedia.org/wiki/Type_system):

> Languages are often colloquially referred to as "strongly typed" or "weakly typed". In fact, there is no universally accepted definition of what these terms mean. In general, there are more precise terms to represent the differences between type systems that lead people to call them "strong" or "weak".

...which is exactly what I'm saying in this entire thread.

It's very strange to me how you really seem to want other people to be on board with your particular interpretation of what everybody (even you, 13 hours ago) agrees is not a very well-defined concept.


> This is true for static and dynamic typing as well: they exist in degrees. Rust can verify type proofs that other languages can't regarding memory safety. Some languages can verify that integer indexes into an array won't go out of bounds. Thus it's not the case that a given language is either statically typed or dynamically typed

Memory safety and static/dynamic typing are orthogonal. C is statically typed but memory unsafe. Rust is statically typed and memory safe (except in unsafe blocks). Lua is dynamically typed but memory safe.

I agree that it's possible to mix elements of static and dynamic typing in a single language. C++ is generally statically typed, but also supports dynamic_cast<>.

But generally speaking, static and dynamic typing have a very precise definition. Something that carries around type information at runtime is dynamically typed. Something that does type analysis at compile time so that the runtime doesn't need to carry type information is statically typed.


I generally agree, except the "type" of JSON numbers isn't well-defined with respect to precision and binary-vs-decimal floating point representation. An application that cares deeply about either aspect of numbers can't rely on JSON alone to ensure that the values are properly interpreted by all consumers.


That is a good point in that it is a very accurate reading of the JSON spec. In practice many (even most) JSON implementations don't give applications access to any precision beyond what an IEEE double can represent. So while you may take advantage of arbitrary precision in JSON and be fine according to the spec, your users will probably suffer data loss unless they are very picky about what JSON library they use. For example, JSON.parse() in JavaScript is out.
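A small Python illustration of the same problem. Python's json module also parses non-integer numbers into IEEE doubles by default, though it lets you opt in to Decimal via parse_float; the "price" field and its value are made up:

```python
import json
from decimal import Decimal

text = '{"price": 0.1234567890123456789012345}'

# Default behavior: the number becomes an IEEE double and digits are lost.
as_float = json.loads(text)["price"]

# Opting in to Decimal keeps every digit that was on the wire.
as_decimal = json.loads(text, parse_float=Decimal)["price"]

# What a double-only parser does to a large integer (2**53 + 1),
# simulated here with float().
collapsed = float(9007199254740993) == 9007199254740992.0

print(as_float, as_decimal, collapsed)
```

So the library choice, not the JSON text, decides whether the extra precision survives.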


It's more than just precision, it's making sure that the same value comes out that went in, and that things haven't been subtly altered via unintended conversions between decimal and binary floating-point representations. Obviously this is quite important when you've got both text and binary formats.

Some applications really need decimal values, and some really need IEEE floats. Ion can accurately and precisely denote both types of data, making it easier to ensure that the data is handled properly by both reader and writer.
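A short Python sketch of why the decimal-vs-binary distinction matters, using the standard decimal module. This only illustrates the two number models; it doesn't use Ion:

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 are not exactly representable.
binary_sum = 0.1 + 0.2
binary_matches = (binary_sum == 0.3)

# Decimal arithmetic on the same written values is exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_matches = (decimal_sum == Decimal("0.3"))

# An unintended trip through a binary float changes the decimal value.
roundtrip_changed = (Decimal(0.1) != Decimal("0.1"))

print(binary_matches, decimal_matches, roundtrip_changed)
```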


That's a good description, but I'd say that we have a strongly <-> weakly typed axis and a statically <-> dynamically typed axis here. Or I might actually prefer to name the first axis poorly <-> richly typed.

            poorly typed <-------------> richly typed
    dynamic CSV, INI          JSON          YAML, Ion
    static        Bencode, ASN.1      Protobuf
What I mean by "richly typed" is that you would never read a timestamp off the wire and not know that it's a timestamp. By comparison, with CSV or INI files, you just have strings everywhere. Formats on the richly typed side have separate and explicit types for binary blobs and text, for example.


Sure, I think your "poorly typed" vs. "richly typed" axis just refers to how many built-in types it has. It's true that CSV and INI only have one type (string). And it's true that when more types are built in, you have fewer cases where you have to just stuff your data into a specially-formatted string.


Yes, that's exactly what I said.


> It's "statically typed" vs. "dynamically typed" that is the useful distinction.

I officially propose to use the term "accidentally typed" or "eventually typed".


!!!! My understanding went up several orders of magnitude! Thank you!!



