If conflict resolution is timestamp based, it feels like you could very easily end up with inconsistent data. I update field Y based upon the value of field X I see. So does someone else, but based upon a different value for field X. If I can't wrap this in a transaction, then I could update fields based on bad/out-of-date data.
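To make that race concrete, here's a hypothetical sketch (made-up fields, a plain object standing in for the store):

    // Hypothetical lost-update race: two clients derive Y from
    // different values of X, and the second write silently wins.
    const store = { x: 1, y: 0 };   // stand-in for the shared database

    const seenByA = store.x;        // client A reads x = 1
    store.x = 5;                    // someone else changes x meanwhile
    const seenByB = store.x;        // client B reads x = 5

    store.y = seenByA + 1;          // A writes y = 2, derived from x = 1
    store.y = seenByB * 10;         // B writes y = 50; A's update is gone,
                                    // and nothing recorded that the two
                                    // writers disagreed about x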
> Clients keep a copy of data that they are interested in (not the whole data set, correct), and together they can reconstruct the entire data set. You can also run multiple server peers that backup the whole data set, or use a shard key to determine which subsets. If this isn't fault tolerant, would you mind giving me some examples of what is?
I mean, I guess if you accept being able to lose data as fault tolerant, then OK. Having multiple, dedicated, full-database backups is what every other distributed database does as well.
I mean, perhaps I'm being too harsh; there are many use cases, particularly with data of little value, that systems like this make easy to build. I just feel like so many people are working on datastores with loose guarantees about anything and hailing them as awesome, new things.
Heck, even Wave had a more refined system where you wouldn't just randomly lose data.
You're not Google. You're not Amazon. You don't need to embrace all of this looseness and lack of guarantees.
Thanks for the reply! I really appreciate the dialogue.
Yes, the first half of the talk reviews the problem, but the last half (starting with the boundary functions) explains the CRDT. If that wasn't helpful, there is also this article on the implementation (not written by me): https://github.com/amark/gun/wiki/Conflict-Resolution-with-G....
Timestamps are bad, yes (I mention this in the talk as well); GUN uses a hybrid vector/timestamp/lexical CRDT. Let's take your analogy of updating Y when someone else sees X: a perfect example of this is realtime document collaboration (gDocs, etc.). Even with GUN, you'd not want to have the collaborative paragraph as a value on a property in the node. Each value is treated as atomic, so if two people write at the same time, it would cause exactly what you are describing: they would overwrite each other. Instead, we can preserve the intent by running a distributed linked list (a DAG, actually) on gun, and this works quite nicely. See:
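Roughly, a minimal sketch of that linked-list shape, assuming gun's standard get/put/link API (the keys and per-fragment granularity are my own invention here):

    import Gun from 'gun';

    const gun = Gun();

    // Each fragment of text is its own atomic node, so concurrent edits
    // to different fragments merge instead of clobbering each other.
    const first = gun.get('frag-1').put({ text: 'Hello ' });
    const second = gun.get('frag-2').put({ text: 'world' });

    first.get('next').put(second);         // a link, not a copy: the
    gun.get('doc').get('head').put(first); // 'next' edges form the DAG

    // Readers walk head -> next -> next; two users editing frag-1 and
    // frag-2 at the same time never touch the same atomic value.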
If instead you don't want the results to merge, but indeed want them to be atomic (only one person or the other "wins"), then even with transactions one write is going to overwrite the other. Transactions don't help unless you have a journal to roll back from - and guess what, that is possible with gun too. Or alternatively, if you just want users to be informed of the conflict so that they can decide, it is trivial to store both updates and then present the user with the conflicting values to choose from, saving whichever they pick.
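For instance, a rough sketch (invented key names, not a built-in gun feature) of storing both values and letting the user promote one:

    import Gun from 'gun';

    const gun = Gun();
    const field = gun.get('doc').get('title');

    // Keep every proposal under its own key so nothing is overwritten.
    function propose(author: string, value: string) {
      gun.get('doc/title/proposals').get(author).put({ value, at: Date.now() });
    }

    // Once the user picks a winner, promote it to the real field.
    function resolve(chosen: string) {
      field.put(chosen);
    }

    propose('alice', 'GUN: a realtime graph database');
    propose('bob', 'gun.js overview');
    // The UI lists both proposals and calls resolve() with the pick.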
Does that make sense? Do you see/have any problems with those approaches? Thanks for your input so far.
Regarding clients: yes, absolutely, you should still run multiple dedicated full-database backups. I don't disagree with you on that point; in fact we make it easy and scalable (see our demo video of a prototype storage engine that did 100M+ records for $10/day, all costs included - servers, disk, and backup: https://youtu.be/x_WqBuEA7s8). The unique thing about GUN is that it is still capable of surviving complete data center outages, because the data is also backed up on the edge peers.
You can also reduce your bandwidth costs by having edge peers distribute data to their nearby peers, versus always having to pull from your data center - a.k.a. the BitTorrent model.
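Concretely, the multi-backup setup is just a peer list; a minimal sketch with hypothetical URLs, using gun's standard peers option:

    import Gun from 'gun';

    // Every server peer listed here can hold a full copy of the data set;
    // browsers that connect also cache whatever subset they read.
    const gun = Gun({
      peers: [
        'https://backup-1.example.com/gun', // hypothetical backup peers,
        'https://backup-2.example.com/gun', // each persisting everything
      ],
    });

    // Writes sync to every reachable peer; if the data center goes down,
    // edge peers that cached this data can still serve it.
    gun.get('status').put({ ok: true });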
Let's be clear here: there is a big difference between master-slave systems and data guarantees. Databases like Cassandra have better data availability guarantees because they are HA (and gun is the same), despite not being master-slave.
But at the end of the day, you are right: your not-losing-your-data guarantee is only as good as how many full replication backups you have. What I hope to have communicated to you is that gun makes it ridiculously easy to make full (and partial) replicas beyond traditional databases/datastores/whatever-you-call-thems, and I hope you do think that is awesome.
Just don't use us to balance bank account data, because we don't provide those types of guarantees - but we do provide the HA / AP / fault-tolerance ones. :)
> Timestamps are bad, yes (I mention this in the talk as well); GUN uses a hybrid vector/timestamp/lexical CRDT. Let's take your analogy of updating Y when someone else sees X: a perfect example of this is realtime document collaboration (gDocs, etc.). Even with GUN, you'd not want to have the collaborative paragraph as a value on a property in the node. Each value is treated as atomic, so if two people write at the same time, it would cause exactly what you are describing: they would overwrite each other. Instead, we can preserve the intent by running a distributed linked list (a DAG, actually) on gun, and this works quite nicely.
This is exactly my point. And gDocs is a perfect example of what _I_ mean. The Wave protocol (and operational transforms in general), or I guess they use a descendant of it now, was made for exactly these types of use cases, so that you don't lose data unexpectedly when editing values.
> Or alternatively, if you just want users to be informed of the conflict so that they can decide, it is trivial to store both updates and then present the user with the conflicting values to choose from, saving whichever they pick.
> Does that make sense? Do you see/have any problems with those approaches? Thanks for your input so far.
How is it trivial? Do you have array-value types? What's trivial about it?
And yes, I do have problems when the default conflict resolution method is to simply select one value and toss the other out. It's a bad default. If you have a default that causes data loss, it _will_ come back to bite you.
This is perhaps the most egregious thing and honestly makes me have 0 trust in your system.
> Let's be clear here: there is a big difference between master-slave systems and data guarantees. Databases like Cassandra have better data availability guarantees because they are HA (and gun is the same), despite not being master-slave.
I never mentioned master-slave. You can set up most SQL databases to be async- or sync-master-master.
> You can also reduce your bandwidth costs by having edge peers distribute data to their nearby peers, versus always having to pull from your data center - a.k.a. the BitTorrent model.
Maybe I'm not sure what you mean by edge here. Do you mean it in the CDN sense, or in the client sense? If you're expecting your clients to be part of your DR scheme, I'm sorry, but that's moronic. You can't guarantee client availability. You can't guarantee that all available clients will even have a copy of all the data. Sure, it's great that you're using the same protocol between servers, but that doesn't mean clients can be part of DR.
Maybe I'm just being silly, but that talk describes the problem. It doesn't really go into detail about the CRDT(s) you're using.
> Yes, good point, to prevent data from getting trashed you need to run auth on the system, more info here: https://github.com/amark/gun/wiki/auth
I wasn't talking about auth.