Anyway, the proof is in the pudding. The relevant benchmarks are the impact on file size and load time.
Benchmark 1: file size
Splitting one of our sparse JSON data files into multiple chunks has no real impact on file size and, due to some uninteresting phenomena, actually helps when gzip is used. I'm also showing the benefit of compressing runs of repeated zeros ("sparse JSON") for our particular data set.
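To make the splitting and zero-compression concrete, here is a rough sketch. The chunking just cuts one big array into standalone JSON documents; the run-length encoding of zeros (the `{z: n}` marker) is an illustrative assumption, not our actual on-disk format:

```javascript
// Sketch: split a big array into N standalone JSON chunks.
// Each chunk is a valid JSON document, so workers can parse them independently.
function splitIntoChunks(arr, numChunks) {
  const chunkSize = Math.ceil(arr.length / numChunks);
  const chunks = [];
  for (let i = 0; i < arr.length; i += chunkSize) {
    chunks.push(JSON.stringify(arr.slice(i, i + chunkSize)));
  }
  return chunks;
}

// Sketch: replace runs of zeros with {z: runLength} markers ("sparse JSON").
// The marker format is a made-up stand-in for whatever encoding you pick.
function sparseEncode(arr) {
  const out = [];
  for (let i = 0; i < arr.length; ) {
    if (arr[i] === 0) {
      let run = 0;
      while (i < arr.length && arr[i] === 0) { run++; i++; }
      out.push({ z: run });
    } else {
      out.push(arr[i++]);
    }
  }
  return out;
}
```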
Benchmark 2: speedup
For the second benchmark, I'm testing loading the data set from a local hard drive using different formats. The first two columns are sequential data formats; the second primarily demonstrates the benefit of preprocessing for my particular application. The blue part of each column shows, in ms, how much time is spent reading the file from disk and parsing it.
The important part is comparing the blue part of the third column against the blue part of the second: using workers to parse JSON chunks is 2x faster than parsing the file in one sequential go.
For further parallel speedups, it is important to realize that this application hits Amdahl's Law. In particular, the remaining sequential work, like the black subcolumns, must be parallelized. The blue columns also contain sequential code due to the APIs I'm using. First, Safari does not have zero-copy message passing (transferable objects), so a sequential copy is made when communicating a parsed fragment back to the master. Second, I'm actually creating one big typed array, so even current web worker APIs do not help: I want to transfer the buffer into the worker, not get a new buffer out of one, which means I have to do a sequential copy even in Firefox and Chrome. Basically, we're seeing the web's poor support for message passing force a lot of slow, sequential copies.
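For reference, here is what the zero-copy path looks like where it exists. `packageForTransfer` is a hypothetical helper; the key point is that `postMessage` takes a second argument, the transfer list, which moves buffer ownership to the receiver instead of copying:

```javascript
// Sketch: package a parsed chunk for zero-copy postMessage.
// Returns [message, transferList]; pass both to postMessage.
function packageForTransfer(typedArray) {
  const buf = typedArray.buffer;
  return [buf, [buf]]; // second element is the transfer list
}

// In the worker (conceptually):
//   const parsed = new Float32Array(JSON.parse(jsonChunk));
//   const [msg, transfer] = packageForTransfer(parsed);
//   postMessage(msg, transfer); // moves ownership: no copy (Firefox/Chrome)
// Without transferable objects (Safari), the same call structured-clones
// the buffer instead: a sequential O(n) copy on the critical path.
```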
Finally, I only benchmarked the case of local file loads. Our application, ultimately, will be used online with a server providing the data on-the-fly. The optimizations I showed should still help even if the application is network bound: they also enable pipeline parallelism. By splitting the JSON, parsing one chunk is now done in parallel with downloading the next. Modern webpages alternate between waiting on the network and the CPU, which this solution elegantly addresses.
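A minimal sketch of that pipeline, assuming hypothetical async `download` and `parseInWorker` helpers (e.g., an XHR/fetch wrapper and a worker postMessage round-trip):

```javascript
// Sketch: pipeline parallelism over JSON chunks.
// Start downloading chunk i+1 before handing chunk i to a worker,
// so network and parsing overlap instead of alternating.
async function pipelineLoad(urls, download, parseInWorker) {
  const results = [];
  let pending = download(urls[0]);
  for (let i = 0; i < urls.length; i++) {
    const raw = await pending;
    // Kick off the next download before parsing the current chunk
    if (i + 1 < urls.length) pending = download(urls[i + 1]);
    results.push(await parseInWorker(raw));
  }
  return results;
}
```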
So there you have it -- parallel parsing, in practice, should be easy. I handled a constrained case: big JSON arrays. Even more general JSON should only take a day more of work: partition the JSON via work stealing -- I successfully did a similar thing for HTML in my thesis in a few hundred lines of C++.
If you'd like the code, it'll be part of our Superconductor (automatically GPU-accelerated web programming framework for big data visualization) release next month. Ping me if you'd be interested in working on parallel parsing of arbitrary JSON -- it should be awesome!
I more carefully measured the overheads and also did a couple more optimizations. The chart below shows the new results. It now also includes the message passing overheads in gray; ~90% of that would go away in Chrome with transferable objects, and the other 10% might be eliminated with ownership transfer of buffer views (sub-buffers), not just whole buffers.
Two big things are going on:
- The workers now perform decompression rather than the master. However, decompressing in the worker has a cost: each worker now returns an inflated result message. Safari does not have ownership transfer, so almost 100ms is spent copying the message back to the main thread rather than passing a pointer.
The overhead is still worth it -- the blue region shrank a lot, even if there is now a big gray region. (The gray region appears in both because the sequential job also runs in a worker in order to avoid pressuring the UI thread.) If/when Safari gets transferable objects, 90% of the gray bars will disappear.
- Faster reduction of worker results into the final GPU buffer. Typed arrays support a set method that helps with such reductions. It acts like a memcpy: result.set(chunk, len) copies typed array "chunk" into typed array "result" starting at offset "len" (measured in elements, not bytes). It only takes about 13ms. If the typed array spec were extended so that disjoint sub-buffers could be transferred across workers, the other 10% of the gray bars would disappear.
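The set-based reduction described above can be sketched like this (the chunk sizes and names are made up for illustration):

```javascript
// Sketch: reduce per-worker chunks into one final typed array.
// result.set(chunk, len) behaves like a memcpy at element offset len.
function reduceChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const result = new Float32Array(total);
  let len = 0;
  for (const chunk of chunks) {
    result.set(chunk, len); // copy chunk into result starting at offset len
    len += chunk.length;
  }
  return result;
}
```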