If conflict resolution is timestamp based, it feels like you could very easily end up with inconsistent data. I update field Y based upon the value of field X I see. So does someone else, but based upon a different value for field X. If I can't wrap this in a transaction, then I could update fields based on bad/out-of-date data.
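To make that race concrete, here's a hypothetical sketch (made-up fields, a plain object standing in for the store):

    // Hypothetical lost-update race: two clients derive Y from
    // different values of X, and the second write silently wins.
    const store = { x: 1, y: 0 };   // stand-in for the shared database

    const seenByA = store.x;        // client A reads x = 1
    store.x = 5;                    // someone else changes x meanwhile
    const seenByB = store.x;        // client B reads x = 5

    store.y = seenByA + 1;          // A writes y = 2, derived from x = 1
    store.y = seenByB * 10;         // B writes y = 50; A's update is gone,
                                    // and nothing recorded that the two
                                    // writers disagreed about x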
> Clients keep a copy of data that they are interested in (not the whole data set, correct), and together they can reconstruct the entire data set. You can also run multiple server peers that backup the whole data set, or use a shard key to determine which subsets. If this isn't fault tolerant, would you mind giving me some examples of what is?
I mean, I guess if you accept being able to lose data as fault tolerant, then OK. Having multiple, dedicated, full-database backups is what every other distributed database does as well.
I mean, perhaps I'm being too harsh; there are many use cases, particularly with data of little value, that systems like this make easy to build. I just feel like so many people are working on datastores with loose guarantees about anything and hailing them as awesome, new things.
Heck, even Wave had a more refined system where you wouldn't just randomly lose data.
You're not Google. You're not Amazon. You don't need to embrace all of this looseness and lack of guarantees.
Thanks for the reply! I really appreciate the dialogue.
Yes, the first half of the talk reviews the problem, but the last half (starting with the boundary functions) explains the CRDT. If that wasn't helpful, there is also this article on the implementation (not written by me): https://github.com/amark/gun/wiki/Conflict-Resolution-with-G....
Timestamps are bad, yes (I mention this in the talk as well); GUN uses a hybrid vector/timestamp/lexical CRDT. Let's take your analogy of updating Y when someone else sees X: a perfect example of this is realtime document collaboration (gDocs, etc.). Even with GUN, you'd not want to have the collaborative paragraph as a value on a property in the node. Each value is treated as atomic, so if two people write at the same time, it would cause exactly what you are describing: they would overwrite each other. Instead, we can preserve the intent by running a distributed linked list (a DAG, actually) on gun, and this works quite nicely. See:
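Roughly, a minimal sketch of that linked-list shape, assuming gun's standard get/put/link API (the keys and per-fragment granularity are my own invention here):

    import Gun from 'gun';

    const gun = Gun();

    // Each fragment of text is its own atomic node, so concurrent edits
    // to different fragments merge instead of clobbering each other.
    const first = gun.get('frag-1').put({ text: 'Hello ' });
    const second = gun.get('frag-2').put({ text: 'world' });

    first.get('next').put(second);         // a link, not a copy: the
    gun.get('doc').get('head').put(first); // 'next' edges form the DAG

    // Readers walk head -> next -> next; two users editing frag-1 and
    // frag-2 at the same time never touch the same atomic value.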
If instead you don't want the results to merge, but indeed want them to be atomic (only one person or the other "wins"), then even with transactions one write is going to overwrite the other. Transactions don't help unless you have a journal to roll back from - and guess what, that is possible with gun too. Or alternatively, if you just want users to be informed of the conflict so that they can decide, it is trivial to store both updates and then present the user with the conflicting values to choose from, saving whichever they pick.
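For instance, a rough sketch (invented key names, not a built-in gun feature) of storing both values and letting the user promote one:

    import Gun from 'gun';

    const gun = Gun();
    const field = gun.get('doc').get('title');

    // Keep every proposal under its own key so nothing is overwritten.
    function propose(author: string, value: string) {
      gun.get('doc/title/proposals').get(author).put({ value, at: Date.now() });
    }

    // Once the user picks a winner, promote it to the real field.
    function resolve(chosen: string) {
      field.put(chosen);
    }

    propose('alice', 'GUN: a realtime graph database');
    propose('bob', 'gun.js overview');
    // The UI lists both proposals and calls resolve() with the pick.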
Does that make sense? Do you see/have any problems with those approaches? Thanks for your input so far.
Regarding clients: yes, absolutely, you should still run multiple dedicated full-database backups. I don't disagree with you on that point; in fact we make it easy and scalable (see our demo video of a prototype storage engine that did 100M+ records for $10/day, all costs included - servers, disk, and backup: https://youtu.be/x_WqBuEA7s8). The unique thing about GUN is that it is still capable of surviving complete data center outages, because the data is also backed up on the edge peers.
You can also reduce your bandwidth costs by having edge peers distribute data to their nearby peers, versus always having to pull from your data center - a.k.a. the BitTorrent model.
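Concretely, the multi-backup setup is just a peer list; a minimal sketch with hypothetical URLs, using gun's standard peers option:

    import Gun from 'gun';

    // Every server peer listed here can hold a full copy of the data set;
    // browsers that connect also cache whatever subset they read.
    const gun = Gun({
      peers: [
        'https://backup-1.example.com/gun', // hypothetical backup peers,
        'https://backup-2.example.com/gun', // each persisting everything
      ],
    });

    // Writes sync to every reachable peer; if the data center goes down,
    // edge peers that cached this data can still serve it.
    gun.get('status').put({ ok: true });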
Let's be clear here: there is a big difference between master-slave systems and data guarantees. Databases like Cassandra have better data availability guarantees because they are HA (and gun is the same), despite not being master-slave.
But at the end of the day, you are right: your not-losing-your-data guarantee is only as good as how many full replication backups you have. What I hope to have communicated to you is that gun makes it ridiculously easy to make full (and partial) replicas beyond traditional databases/datastores/whatever-you-call-thems, and I hope you do think that is awesome.
Just don't use us to balance bank account data, because we don't provide those types of guarantees - but we do provide the HA / AP / fault-tolerance ones. :)
> Timestamps are bad, yes (I mention this in the talk as well); GUN uses a hybrid vector/timestamp/lexical CRDT. Let's take your analogy of updating Y when someone else sees X: a perfect example of this is realtime document collaboration (gDocs, etc.). Even with GUN, you'd not want to have the collaborative paragraph as a value on a property in the node. Each value is treated as atomic, so if two people write at the same time, it would cause exactly what you are describing: they would overwrite each other. Instead, we can preserve the intent by running a distributed linked list (a DAG, actually) on gun, and this works quite nicely.
This is exactly my point. And gDocs is a perfect example of what _I_ mean. The Wave protocol (and operational transforms in general), or I guess they use a descendant of it now, was made for exactly these types of use cases, so that you don't lose data unexpectedly when editing values.
> Or alternatively, if you just want users to be informed of the conflict so that they can decide, it is trivial to store both updates and then present the user with the conflicting values to choose from, saving whichever they pick.
> Does that make sense? Do you see/have any problems with those approaches? Thanks for your input so far.
How is it trivial? Do you have array-value types? What's trivial about it?
And yes, I do have problems when the default conflict resolution method is to simply select one value and toss the other out. It's a bad default. If you have a default that causes data loss, it _will_ come back to bite you.
This is perhaps the most egregious thing and honestly makes me have 0 trust in your system.
> Let's be clear here: there is a big difference between master-slave systems and data guarantees. Databases like Cassandra have better data availability guarantees because they are HA (and gun is the same), despite not being master-slave.
I never mentioned master-slave. You can set up most SQL databases to be async- or sync-master-master.
> You can also reduce your bandwidth costs by having edge peers distribute data to their nearby peers, versus always having to pull from your data center - a.k.a. the BitTorrent model.
Maybe I'm not sure what you mean by edge here. Do you mean it in the CDN sense, or in the client sense? If you're expecting your clients to be part of your DR scheme, I'm sorry, but that's moronic. You can't guarantee client availability. You can't guarantee that all available clients will even have a copy of all the data. Sure, it's great that you're using the same protocol between servers, but that doesn't mean clients can be part of DR.
Maybe I'm just being silly, but that talk describes the problem. It doesn't really go into detail about the CRDT(s) you're using.
> Yes, good point, to prevent data from getting trashed you need to run auth on the system, more info here: https://github.com/amark/gun/wiki/auth
I wasn't talking about auth.