That is kind of the point of this repo. If your API implementation is too expensive for Lambda, move the same container to run on EKS/ECS Fargate or EC2. The opposite holds as well - if you can save money by moving it from Fargate to Lambda, then you can use the same container.
Ion has a binary format but is not specifically about real-time streaming. It is a JSON replacement.
Ion originated 10+ years ago from the Amazon catalog team - the team that kept data about the hundreds of millions of items available on Amazon. Nearly every team in the company called the catalog to get information about items all the time - scanning the entire catalog, parts of the catalog, millions of individual item lookups every second, etc.
They did the math and some very large percentage of network traffic in Amazon Retail's data centers was catalog data. If that data, currently in XML or JSON format, were sent in a more compact format, it would save some ridiculous number of millions of dollars every year. So Ion was born and eventually open sourced.
The intent of the stored schema isn't really for self-description. A typical use case for Avro is data storage over long periods of time.
It is expected that the schema will evolve at some point during this time. Therefore you still need to specify a target schema to read the data into, which is allowed to be different from the stored schema. Avro then maps the stored data into the target schema by using the stored schema. Most Avro libraries expect you to get the target schema from a separate source before reading data.
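As a minimal sketch of what this looks like with the Avro Java API (the schema variables and byte source are hypothetical; only the reader construction is the point):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;

    public class AvroEvolutionSketch {
        // storedBytes were written with writerSchema; readerSchema is the
        // target schema obtained from a separate source, possibly newer.
        public static GenericRecord readWithEvolution(byte[] storedBytes,
                                                      Schema writerSchema,
                                                      Schema readerSchema) throws Exception {
            // Avro resolves the differences (added fields with defaults,
            // dropped fields, etc.) using both schemas.
            GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(writerSchema, readerSchema);
            return reader.read(null, DecoderFactory.get().binaryDecoder(storedBytes, null));
        }
    }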
That's not true. Avro at least existed at the time. However, they wanted something self-describing to replace JSON/XML usage. Avro is better suited as a data storage format rather than a transit-oriented format. Of course, both Ion and Avro can be used for either; Avro will give you better compression on disk, but Ion is less cumbersome since it doesn't require a schema.
Finally! I've had to live the JSON nightmare since I left Amazon.
Some of the benefits over JSON (see the short sketch after this list):
* Real date type
* Real binary type - no need to base64 encode
* Real decimal type - invaluable when working with currency
* Annotations - you can tag an Ion field in a map with an annotation that indicates, e.g., its encoding or compression ("csv", "snappy") or its serialized type ('com.example.Foo').
* Text and binary format
* Symbol tables - this is like automated jsonpack.
* It's self-describing - meaning, unlike Avro, you don't need the schema ahead of time to read or write the data.
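To make those points concrete, here is a minimal sketch using the ion-java DOM API (assuming the com.amazon.ion package coordinates of current ion-java releases; the field names and values are made up):

    import java.math.BigDecimal;
    import com.amazon.ion.IonStruct;
    import com.amazon.ion.IonSystem;
    import com.amazon.ion.Timestamp;
    import com.amazon.ion.system.IonSystemBuilder;

    public class IonSketch {
        public static void main(String[] args) {
            IonSystem ion = IonSystemBuilder.standard().build();
            IonStruct item = ion.newEmptyStruct();
            // Real decimal type: no binary floating-point rounding.
            item.put("price", ion.newDecimal(new BigDecimal("19.99")));
            // Real date/timestamp type.
            item.put("listedAt", ion.newTimestamp(Timestamp.valueOf("2015-04-01T12:00:00Z")));
            // Real binary type: raw bytes, no base64 wrapper needed in the binary encoding.
            item.put("thumbnail", ion.newBlob(new byte[] {1, 2, 3}));
            // Annotation tagging the struct with its serialized type.
            item.addTypeAnnotation("com.example.Foo");
            // toString() emits Ion text; a binary writer would produce the
            // compact form with a symbol table instead.
            System.out.println(item);
        }
    }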
Sounds a lot like Apple's property list format, which has almost everything you listed in common, except for annotations and symbol tables.
Its binary format was introduced in 2002!
Edit: Property lists only support integers up to 128 bits in size and double-precision floating point numbers. On top of those, Ion also supports infinite precision decimals.
Plists are nifty, but the text format's XML-based, which makes it too complex and too verbose to be a general-purpose alternative to something like JSON.
(plutil "supports" a json format, but it's not capable of expressing the complete feature set of the XML or binary formats.)
Like property lists, the binary format is TLV-encoded as well. Ion has a more compact binary representation for the same data, plus additional types and metadata. Also, IIRC, plist types are limited to 32-bit lengths for all data types. The binary Ion representation has no such restriction (though in practice sizes are often limited by the language implementation).
True, but my point was that there's enough talent at Amazon working on SDKs and other projects, and there are precedents where even more complex projects, such as JMESPath, have wide support [0].
I'm sure there are ion bindings for every language in common use at Amazon. But a huge percentage of Amazon code is Java, so presumably this one was the best maintained and documented.
I doubt it; when I was there, Ion was used by only a handful of Java teams doing backend work. It was also horribly documented and supported at the time (3.5 years ago).
I am still at Amazon and Ion is definitely the most widely used library around. It has some of the best-documented code, and some of the extensions that have been built on top of Ion are simply amazing.
Precision is the number of significant digits. Oracle guarantees the portability of numbers with precision ranging from 1 to 38.
Scale is the number of digits to the right (positive) or left (negative) of the decimal point. The scale can range from -84 to 127.
Worth noting this isn't specifically an Oracle thing; most financial systems need to be sure that they can store currency amounts accurately, and this convention is widely used to ensure that.
Typically, when dealing with currencies, scale is only used to represent units smaller than a whole unit of the currency, i.e. cents and pence. But there isn't anything that restricts it from being used to accommodate larger numbers with the use of negative scales.
<CcyNtry>
<CtryNm>UNITED STATES OF AMERICA (THE)</CtryNm>
<CcyNm>US Dollar</CcyNm>
<Ccy>USD</Ccy>
<CcyNbr>840</CcyNbr>
<CcyMnrUnts>2</CcyMnrUnts>
</CcyNtry>
The CcyMnrUnts property denotes 2 decimal places for the sub-unit of the dollar.
So for the above example of 99999.999 you would store an amount of 99999999 and a scale of 3.
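In Java terms, the same convention falls out of BigDecimal's (unscaled value, scale) representation - just an illustration of the arithmetic above:

    import java.math.BigDecimal;

    public class ScaleSketch {
        public static void main(String[] args) {
            BigDecimal amount = new BigDecimal("99999.999");
            System.out.println(amount.unscaledValue());    // 99999999
            System.out.println(amount.scale());            // 3
            // Reconstruct the value from (unscaled value, scale):
            System.out.println(BigDecimal.valueOf(99999999L, 3));              // 99999.999
            // Negative scale shifts the other way: unscaled 123 with scale -2 is 12300.
            System.out.println(BigDecimal.valueOf(123L, -2).toPlainString());  // 12300
        }
    }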
So that it can be an integer with an arbitrary length, or a float/double without precision problems. You can also let your own numeric classes do the parsing (which might even be able to handle complex types).
After all, everything in JSON is a string since it doesn't have a binary format and it shouldn't cause a huge overhead to do the parsing yourself (that might depend on the library, though).
I find that many financial technology companies opt to store currency as strings. The small overhead is typically well worth the freedom from floating-point errors.
Any text format is technically a string. Just because some numeric token has no double quotes around it doesn't mean it isn't a string (in that representation).
It's just that we can have some lexical rules that if that piece of text has the right kind of squiggly tail or whatever, it is always treated as a decimal float instead of every programmer working with it individually having to deal with an ad hoc string to decimal type conversion.
Until you need to deal with decimals instead of floats - then you are going to hate yourself, because you have to pull in some third-party library since the language treats every single number as a float (and floating-point errors are a lot more common than most people think, even when adding together simple numbers).
Sure, but most other languages have built-in support for decimal types. Java has BigDecimal, as does Ruby, Python has the decimal module, C# has System.Decimal, the list goes on.
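A tiny illustration of why this matters, comparing doubles with Java's BigDecimal parsed from strings:

    import java.math.BigDecimal;

    public class FloatVsDecimal {
        public static void main(String[] args) {
            // Binary floating point cannot represent 0.1 or 0.2 exactly.
            System.out.println(0.1 + 0.2);      // 0.30000000000000004
            // Parsing the same values as decimals keeps them exact,
            // which is what you want for currency stored as strings.
            BigDecimal a = new BigDecimal("0.10");
            BigDecimal b = new BigDecimal("0.20");
            System.out.println(a.add(b));       // 0.30
        }
    }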
JavaScript doesn't even have proper integers to guarantee this works correctly; it's a really sad state.
I just did some ownership percentage stuff where it's not uncommon to go 16 decimal places out...working with JavaScript on this was a pain. Never thought I'd care about that .00000000000001 difference hah...
Ion has functions to turn Ion into JSON which will, of course, lose information. Annotations are dropped, decimals turn into a JSON number type which may lose precision, etc.
To be honest, we weren't really sure what to pick. We just wanted to have something in place. It's not set in stone and we'll likely change things as we see different use cases.
What's interesting to note is that Dynamo/Cassandra usage was killed at both Amazon and Facebook (DynamoDB from AWS is actually not based on Dynamo tech except in terms of what not to do).
Damien Katz (one of my cofounders at Couchbase, so he has a dog in the fight) wrote this take-down of the Dynamo model, for folks wondering why:
The Dynamo system is a design that treats a network switch failure as having the same probability as a machine failure, and pays the cost with every single read. This is madness. Expensive madness.
> Network Partitions are Rare, Server Failures are Not
Network partitions happen all the time. Sure, the whole "a switch failed and that piece of the network isn't there anymore" doesn't happen a lot, but what does happen a lot is a slow or delayed connection, or a machine going offline for a few seconds.
Indeed. We have some external servers in a partner's datacenter which runs under VMware's vMotion. Every now and then vMotion will shuffle a VM to another physical server, causing the entire OS to freeze for several (often 20-30) seconds, and everything that is partition-sensitive, like RabbitMQ and Elasticsearch, throws a tantrum and keels over.
Even VMs on more statically allocated clouds like DigitalOcean and AWS will experience small, constant blips that affect your whole stack.
What annoys me in particular is that these blips affect everything. Every app needs to fail gracefully, be it a PostgreSQL client connection, a Memcached lookup or an S3 API call. The fact that such catch-and-retry boilerplate logic needs to be built into the application layer, and every layer within it, is still something I find rather insane. It leaks into the application logic in often rather insidious ways, or in ways that pollute your code with defenses. Everything has to be idempotent, which is easy enough for transactional database stuff, less easy for things like asynchronous queues that fire off emails. Erlang has already provided a solution to the problem, but I suspect we need OS-level support to avoid reinventing the wheel in every language and platform. /rant
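For what it's worth, the boilerplate in question usually ends up looking something like this generic sketch (not tied to any particular client library; the class and parameter names are made up):

    import java.util.concurrent.Callable;

    public final class Retry {
        // Naive catch-and-retry with exponential backoff. Only safe to wrap
        // around operations that are idempotent, which is exactly the
        // constraint that leaks into the application layer.
        public static <T> T withRetries(Callable<T> op, int maxAttempts, long initialBackoffMillis)
                throws Exception {
            long backoff = initialBackoffMillis;
            for (int attempt = 1; ; attempt++) {
                try {
                    return op.call();
                } catch (Exception transientFailure) {
                    if (attempt >= maxAttempts) {
                        throw transientFailure; // give up after the last attempt
                    }
                    Thread.sleep(backoff);
                    backoff *= 2; // exponential backoff between attempts
                }
            }
        }
    }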
Our customers tend to be the kind who need extreme performance, so they aren't spanning clusters across WANs. For well-tuned datacenters, rack awareness (putting the replicas in sane places) is more useful.
For WAN replication we have a cross-datacenter replication which works on an AP model.
To be clear, I was in no way making a judgement about CouchDB vs. Cassandra. I've only given Couch a cursory glance, so I wouldn't be qualified to make such a judgement.
I was simply trying to point out that while you may have a very good argument as to why Couch is better, the network partition argument is not sound, and you may want to look for a better argument to make.
I'm personally against single masters because they are SPOFs. With a master, at some point there needs to be a single arbiter of truth, and if that is unavailable, then the system is unavailable.
A nitpick, but an important one that I wish the Couchbase folks wouldn't let slip as often as they do: CouchDB has very different properties from Couchbase, and they should be considered entirely different database designs, regardless of the availability of a sync gateway for replication with a number of JSON stores.
Different database designs, but similar document model. In fact, Couchbase Sync Gateway is capable of syncing between Couchbase Server and Apache CouchDB. Also our iOS, Android, and .NET libraries can sync with CouchDB and PouchDB. Everything open source, of course. More info: http://developer.couchbase.com/mobile/
It's that these are fundamentally very different databases with different trade-offs. You can't just take one and adjust some API calls and expect things to work in a similar way. It only confuses people when it's quietly ignored and others assume that, since it wasn't pointed out as wrong, it must be the same thing.
I've had far too many conversations with people who use Couchbase and can't tell the difference, so I would say it's just general confusion. It's lax work on Couchbase's part, and a thorn in the side of the Apache CouchDB project, that there is no effort to help clarify the fact that they are indeed independent and now very different databases.
Exactly. Couchbase trades some availability during rebalance for more lightweight client interactions. As Damien argues in his post, this allows it to meet the same SLA with less hardware.
Really it doesn't matter, and your point only goes to show one of Cassandra's flaws. When either a network partition or a server failure happens, Cassandra starts reshuffling data amongst multiple hosts and filling network pipes. Contrast this with a setup where you have a static mapping of hosts to partitions and a leader per partition. Then you only need to (possibly) elect a new leader and carry on.
This is especially relevant when you need to do these things because of unexpected load increases or the loss of hosts in your cluster.
I realize your previous comment was more in reference to transient network partitions so my comment is out of place. But whether the mechanism is manual or automatic once a network partition is discovered the reshuffle begins.
It would be very uncommon to perform a token rebalance (or bring up replacement nodes) under a network partition, since those nodes are fine. The idea is to be tolerant to a network partition, which multi-master is, not to then attempt to patch up a whole new network while the partition is in place. That could easily bring down the entire cluster.
Data is only typically shuffled around when a replacement node is introduced to the cluster.
If you have multiple independent network partitions that isolate all of your RF nodes, then there is no database that could function safely in this scenario, and this has nothing to do with data shuffling.
While this is true, I'm not sure why that's a problem. The system can still function while the data is in transit from one node to another. As long as the right 2/3 of the machines are up, the cluster can function at 100%, and as long as the right 1/3 are up, it can still serve reads (assuming a quorum of 3).
You mean replication factor of 3? It doesn't work like that. If your replication factor is 3 and you're using quorum read/writes then as soon as two machines are down some of the reads and writes will fail. The more machines down the higher the probability of failure. That's why you have to start shuffling data around to maintain your availability which is a problem... (EDIT: assuming virtual nodes are used, otherwise it's a little different)
Like I said, it depends on how you lay out your data. Let's say you have three data centers, and you lay out your data such that there is one copy in each datacenter (this is how Netflix does it, for example).
You could then lose an entire datacenter (1/3 of the machines) and the cluster will just keep on running with no issues.
You could lose two datacenters (2/3s of the machines) and still serve reads as long as you're using READ ONE (which is what you should be doing most of the time).
If you read and write at ONE (which I think Netflix does), then this kind of works. Still, with virtual nodes, losing a single node in each DC leaves you with some portion of the keyspace inaccessible.
You're susceptible to total loss of data since at any given time there will be data that hasn't been replicated to another DC and you're OK with having inconsistent reads.
That works for some applications where the data isn't mission critical and (immediate) consistency doesn't matter but doesn't for many others. I'm not sure what exactly NetFlix puts in Cassandra but if e.g. it's used to record what people are watching then losing a few records or looking at a not fully consistent view of the data isn't a big deal...
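To make the arithmetic in this sub-thread explicit, here is a sketch using Cassandra's usual definition of QUORUM as floor(RF/2) + 1 (not tied to any driver):

    public class QuorumMath {
        public static void main(String[] args) {
            int rf = 3;                  // replication factor
            int quorum = rf / 2 + 1;     // QUORUM = floor(RF/2) + 1 = 2
            System.out.println("QUORUM ops tolerate " + (rf - quorum) + " replica(s) down"); // 1
            System.out.println("ONE reads tolerate " + (rf - 1) + " replica(s) down");       // 2
            // With two of three replicas down, quorum reads/writes for the
            // affected token ranges fail; reads at ONE can still succeed.
        }
    }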
If a machine is down (as in the machines themselves are dead) then you absolutely need to move the data onto another node to avoid losing it entirely. There is no way to avoid this, whatever your consistency model.
If there is a network partition, however, there is no need to move the data since it will eventually recover; moving it would likely make the situation worse. Cassandra never does this, and operators should never ask it to.
If you have severe enough network partitions to isolate all of your nodes from all of its peers, there is no database that can work safely, regardless of data shuffling or consistency model.
It can be hard to tell if the machine is down as in permanently dead, including the disk drives, or if it's temporarily down, unreachable or just busy. Obviously, if you want to maintain the same replication factor, you need to create new copies of any data that has temporarily dropped to a lower replica count. In theory, though, you could still serve reads as long as a single replica survives, and you could serve writes regardless.
You can kind of get this with Cassandra if you write with CL_ANY and read with CL_ONE but hinted handoffs don't work so great (better in 3.0?) and reading with ONE may get you stale data for a long while. It would be nice, and I don't think there's any theory that says it can't be done, if you could keep writing new data into the database at the right replication factor in the presence of a partition and you could also read that data. Obviously accessing data that is solely present on the other side of the partition isn't possible so not much can be done about that...
We don't even need to agree or disagree. There are numerous studies on how often network partitions happen. Aphyr & Bailis paper 'The Network Is Reliable' [1] has a very detailed overview of these studies.
I think that's a simplistic analysis. What you want really depends on the workload characteristics. If you want an immutable data store with high write performance but infrequent reads the Dynamo model works quite well. Writes can be fast with a small quorum (you don't need to persist to disk immediately as you're relying on multiple machines not going down) and read consistency is not usually a huge issue for analytics tasks (who really cares if you miss an event in your processing?) and if it is you can usually afford to wait for quorum reads.
The redundant work has a drastic improvement on tail latency. This is true in a healthy cluster, but even more true once even small failures or fluctuations occur. You might be able to tune your network and your software to _minimize_ observed partitions, but you are very unlikely to see a system free of fluctuations outside of hard real-time system construction, which none of the above qualify for by a very long shot.
It's a heavy price, sure, but in return you'll be able to round off the tail in most scenarios. This is a trade-off that many don't properly consider when they focus on mean performance: they miss part of the point of the Dynamo model, which helps provide better guarantees about how things perform in more cases, including highly tuned clusters with very few major failures.
Reads absolutely must go through the consensus system (whatever it may be) if you make any guarantees against stale reads. Yes it's expensive, but that's why there's no such thing as a magical distributed database that scales linearly while providing perfect consistency guarantees all the time.
Both Dynamo and Cassandra are written in Java. Are the replacements still Java?
At least Google's counterparts seem to be written in C++. Such as BigTable and GFS. Presumably also Spanner is C++.
In addition to C++ (especially considering its weaker safety/security record), Rust, Golang and Nim should be interesting alternative, safer implementation language choices. In contrast to idiomatic Java, those languages provide a significantly higher CPU cache hit rate for internal data structures, due to the absence of value boxing [1] and the ability to reliably reduce problems like false sharing [2].
1: http://www4.di.uminho.pt/~jls/pdp2013pub.pdf (These issues can be worked around in Java by abandoning object orientation and instead having one object with multiple arrays (SoA, structure of arrays). In other words, not List or Array etc. of Point-objects, but class Points { int[] x; int[] y; ... } that contains all points.)
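Roughly what that structure-of-arrays suggestion looks like in Java (a sketch; the class names are made up):

    // Array-of-structures: each Point is a separate heap object, so iterating
    // chases pointers and drags object headers into cache.
    class Point {
        int x, y;
    }

    // Structure-of-arrays: all x values sit contiguously in one int[], so a
    // scan over x touches densely packed cache lines.
    class Points {
        int[] x;
        int[] y;

        Points(int n) {
            x = new int[n];
            y = new int[n];
        }

        long sumX() {
            long sum = 0;
            for (int v : x) {
                sum += v;
            }
            return sum;
        }
    }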
In what way are Rust, Golang, or Nim safer than Java? Rust I understand provides some nice semantics for thread safety but these exist in Java as well.
And I don't understand why anybody should care about CPU cache hit rate. The bottleneck is always going to be in the I/O pipeline. And Java is faster than C++ and vice versa in various situations.
Rust, Golang and Nim are safer than C++. Sorry that I didn't express it clearly enough; I considered the safety issues to be rather obvious.
> Rust I understand provides some nice semantics for thread safety but these exist in Java as well.
How do you get Java to fail compiling if thread safety constraints are not met? I'd be interested to try it out!
One should definitely care about cache hit rate, because it significantly affects runtime performance. There are just 512 L1D cache lines per CPU core.
I/O is the bottleneck? It is becoming less so, one of the few areas where there's actually some nice progress happening. PCIe SSDs are up to 1.5 - 2 GB/s (=up to 20 Gbps). More and more servers have 10 Gbps or 40 Gbps networking.
Sure, Java is faster when it can use JIT to prune excessive if-jungle, aggressively simplify and inline and adapt to running CPU. But memory layout control is where Java is rather weak. The problem is getting only worse, because the gap between CPU and memory performance is only widening year by year. Memory bandwidth is increasing slowly and latency hasn't improved for a decade.
C++ is going to be always faster especially if specialized to certain machine and use case. C++ is also going to win by a large margin when there's auto-vectorizable code or heavy use of SIMD-intrinsics. 10x is not unusual, if the problem maps well to AVX2 instruction set.
That is a good thing. It does not mean that Dynamo and Cassandra are bad choices, but that companies and teams can move on to other solutions when their requirements change. In an open source model, just because you invented it does not mean you are stuck with it.
Not sure I could work at a one-product company where you have to eat your own dog food in all situations, even in situations it really is not suited for.
> What's interesting to note is that Dynamo/Cassandra usage was killed at both Amazon and Facebook
IMO, the only way that use, or former use, is interesting is in understanding why a specific company moved to, or away from, a given technology.
Specifically, what was the original use case that was the basis for original use?
Why did the company choose to change technology? Did the use case change? Did the technology fail to satisfy the original requirements? Did a new technology with substantially better capabilities emerge? And so on?
Simply stating that company X uses Y (or company X no longer uses Y) does not provide a lot of information that other companies can use as the basis for their decision.
It's not so simple. The main complaint I have heard is that programmers find it difficult to deal with eventual consistency exposed through vector clocks. It's only recently that CRDTs have been reasonably well known, and they solve this problem. Riak 2.0 includes a CRDT library but it might be too late.
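For readers unfamiliar with CRDTs, the simplest example is a grow-only counter: each replica increments its own slot and merges by taking per-replica maxima, so concurrent updates converge without the application resolving vector-clock siblings by hand. A generic sketch, not Riak's implementation:

    import java.util.HashMap;
    import java.util.Map;

    // Grow-only counter (G-Counter) CRDT: per-replica counts, merge by max.
    public class GCounter {
        private final Map<String, Long> counts = new HashMap<>();

        public void increment(String replicaId) {
            counts.merge(replicaId, 1L, Long::sum);
        }

        public long value() {
            return counts.values().stream().mapToLong(Long::longValue).sum();
        }

        // Merging is commutative, associative and idempotent, so replicas
        // can exchange state in any order and still converge.
        public void merge(GCounter other) {
            other.counts.forEach((replica, count) -> counts.merge(replica, count, Math::max));
        }
    }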
Note that Cassandra doesn't actually handle eventual consistency properly, and has weird corner cases as a result (e.g. its infamous "doomstones"). As an immutable data store it works very well, particularly when you have a high write load.
I found Riak's CRDT implementation disappointing. They require that you define them beforehand in a schema, which defeats much of the point of using a schemaless database in the first place.