
What do you think is the best way to do deep JSON comparisons? We work with 2GB JSONs all day, and it is super annoying how long they take to process.


Not parse them into a tree, to start with.

Use a streaming JSON parser, and compare them token by token unless/until they diverge, at which point you take whatever action is suitable to identify the delta.

Parsing it into a tree may be necessary if you want to do more complex comparisons (such as sorting child objects etc.), but even then, depending on your requirements, you may well be better off storing offsets into the file rather than materializing the whole structure.

https://github.com/lloyd/yajl is an example of a streaming JSON parser (caveat: I've not benchmarked it at all), but JSON is simple enough that you could write one specifically to handle two streams.
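
To make the token-by-token idea concrete, here's a minimal sketch in Rust with a deliberately crude, non-validating tokenizer -- not yajl, just the same idea hand-rolled. The file names are placeholders:

    use std::fs::File;
    use std::io::{BufReader, Bytes, Read};
    use std::iter::Peekable;

    // Coarse JSON tokenizer: punctuation, strings, and "everything else"
    // (numbers, true, false, null). Not a validating parser.
    struct Tokens<R: Read> {
        bytes: Peekable<Bytes<R>>,
    }

    impl<R: Read> Iterator for Tokens<R> {
        type Item = Vec<u8>;

        fn next(&mut self) -> Option<Vec<u8>> {
            // Skip insignificant whitespace between tokens.
            while matches!(self.bytes.peek(), Some(Ok(b)) if b" \t\r\n".contains(b)) {
                self.bytes.next();
            }
            let first = self.bytes.next()?.ok()?;
            let mut tok = vec![first];
            match first {
                b'{' | b'}' | b'[' | b']' | b':' | b',' => {}
                b'"' => {
                    // String: consume through the unescaped closing quote.
                    let mut escaped = false;
                    while let Some(Ok(c)) = self.bytes.next() {
                        tok.push(c);
                        if escaped { escaped = false; }
                        else if c == b'\\' { escaped = true; }
                        else if c == b'"' { break; }
                    }
                }
                _ => {
                    // Number or literal: consume until the next delimiter.
                    loop {
                        let c = match self.bytes.peek() {
                            Some(Ok(c)) => *c,
                            _ => break,
                        };
                        if b" \t\r\n{}[]:,\"".contains(&c) { break; }
                        tok.push(c);
                        self.bytes.next();
                    }
                }
            }
            Some(tok)
        }
    }

    fn main() -> std::io::Result<()> {
        let a = Tokens { bytes: BufReader::new(File::open("a.json")?).bytes().peekable() };
        let b = Tokens { bytes: BufReader::new(File::open("b.json")?).bytes().peekable() };
        match a.zip(b).position(|(ta, tb)| ta != tb) {
            Some(i) => println!("documents diverge at token #{}", i),
            None => println!("no divergence up to the end of the shorter token stream"),
        }
        Ok(())
    }

Note that zip stops at the end of the shorter stream, so a real tool would also want to report when one document simply ends early.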


I believe this comparison benchmark could be useful for you, and you can expand it further with more tests, although I got downvoted for sharing a link.

https://github.com/kostya/benchmarks/blob/master/README.md


That still parses into a tree.


Rust is absolutely wonderful for tasks like this. They don't hit any of the cases where Rust's ownership rules can make things tricky. And the serde library makes deserializing JSON a piece of cake.

You end up with code which looks pretty similar to the equivalent JavaScript or Python code, but performs much faster (10x, 100x or even 1000x faster).
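
For a taste, here's a minimal serde sketch -- the Record type and its fields are made up for illustration, and it assumes serde (with the derive feature) and serde_json as dependencies:

    use serde::Deserialize;

    // Hypothetical record shape, just for illustration.
    #[derive(Debug, Deserialize, PartialEq)]
    struct Record {
        id: u64,
        name: String,
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let a: Record = serde_json::from_str(r#"{"id": 1, "name": "foo"}"#)?;
        let b: Record = serde_json::from_str(r#"{"id": 1, "name": "bar"}"#)?;
        println!("equal: {}", a == b);
        Ok(())
    }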


There's also pikkr (https://github.com/pikkr/pikkr) if you need really really fast JSON parsing.


Start with a compiled language, I guess? I don't operate on anywhere near that scale, but json-rust reaches 400 MB/s for me.

It doesn't parallelize, and you'd need memory enough for the entire structure, but of course Rust doesn't have GC overhead. You could trivially parse both files in parallel, at least.
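
A sketch of the "parse both files in parallel" part -- using serde_json and scoped threads (Rust 1.63+) here rather than json-rust, with placeholder file names:

    use std::fs;
    use std::thread;

    fn parse(path: &str) -> serde_json::Value {
        // A real program would propagate errors instead of unwrapping.
        let text = fs::read_to_string(path).unwrap();
        serde_json::from_str(&text).unwrap()
    }

    fn main() {
        // Scoped threads let us parse both documents concurrently.
        let (a, b) = thread::scope(|s| {
            let ha = s.spawn(|| parse("a.json"));
            let hb = s.spawn(|| parse("b.json"));
            (ha.join().unwrap(), hb.join().unwrap())
        });
        println!("equal: {}", a == b);
    }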


(1) Try a language with fast allocations (C, C++, Rust, maybe Go or Java) -- anything except Python or Ruby

or

(2) Try using a streaming API (I don't know Ruby, but a quick Google search turned up https://github.com/dgraham/json-stream). Note that this method will require you to massively restructure your program -- you want to avoid having all of the data in memory at once.

The streaming API might work better with jq-based preprocessing -- for example, if you want to compare two unsorted sets, it may be faster to sort them with jq first, then compare line by line using the streaming API.
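
The same "canonicalize, then compare" trick can also be done in-process once the documents are parsed (which does mean holding them in memory, unlike the streaming route). A rough Rust sketch with serde_json, where sorting array elements by their serialized form is crude but deterministic:

    use serde_json::Value;

    // Recursively sort arrays so order-insensitive ("set") comparisons work.
    // Object keys need no sorting: serde_json's default Map is a BTreeMap,
    // which already keeps keys in sorted order.
    fn canonicalize(v: &mut Value) {
        match v {
            Value::Array(items) => {
                for item in items.iter_mut() {
                    canonicalize(item);
                }
                items.sort_by_key(|item| item.to_string());
            }
            Value::Object(map) => {
                for item in map.values_mut() {
                    canonicalize(item);
                }
            }
            _ => {}
        }
    }

    fn main() {
        let mut a: Value = serde_json::from_str(r#"[3, 1, 2]"#).unwrap();
        let mut b: Value = serde_json::from_str(r#"[2, 3, 1]"#).unwrap();
        canonicalize(&mut a);
        canonicalize(&mut b);
        println!("equal as sets: {}", a == b);
    }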


Python is fast at parsing JSON; Go had a hard time matching its parsing speed. Additionally, you have PyPy to help.



Python is fast at doing anything that doesn't involve running Python.

That's an important caveat. Python's C JSON parser library is super-fast, but if you want to use the data for anything but a simple equality check afterwards, it'll be slow as molasses.

Or you'll write a C extension for it...


nodejs comes to mind!



