Wednesday, May 22, 2013

How We Fixed Temporal Locality of Work Stealing for the Parallel Browser

Broken Temporal Locality in Work Stealing
Red shading indicates a temporally-induced miss.


The end of my last post mentioned a big problem with work stealing: it has no temporal locality. For example, if a layout engine traverses Wikipedia's HTML nodes twice, 2/3rds of the nodes would move to different threads for the second traversal. To understand such a harsh critique and how we tackled it, you can try running the next version of the visualization:

var s = document.createElement("script"); 
s.type = 'text/javascript'; 
s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/workstealMulti.js';  
document.getElementsByTagName('head')[0].appendChild(s);
Run visualization by putting into your browser's JavaScript console

Open a website, go to the JavaScript console, and put in that script. Hit the run button on the top-left once to do a normal work stealing traversal, and once the traversal finishes, hit the button again to run another pass. The first pass uses colors to show which thread got which node. As described in the previous post, spatial and intra-traversal temporal locality are great. The second pass switches to blue/red coloring to show, on the second pass, if a thread got the same node (blue) or a different one (red).

What we see is that, near the beginning of the second pass, work stealing is great: lots of blue near the top of the page. As the traversal continues, however, slight deviations occur. These compound, so the longer the schedule, the more random the assignment.  By the end of the traversal, only 33% of the nodes were temporal hits. Not good! Our parallel layout engines do 3-9 passes, and often the data used in one pass is computed in the previous one, so that hurts.

For the parallel browser, we had a simple and effective solution: just record the old schedule and reuse it. This worked because it does several passes in succession of a big tree. Temporal locality isn't an issue for the first traversal, so we use work stealing to find a load balanced schedule. For subsequent traversals, we play the schedule forwards (top-down traversal) or backwards (bottom-up). We lose a bit of load balancing, but not much in practice (hurray big trees!), and gain a bit by lowering overheads (no dynamic queues etc., just p2p locking). Simple to implement and we've seen near-ideal scaling: 80-95% of ideal on 4-8 cores.

We did this work years ago, and since then, others have published nice generalizations.  Common elements of locality-aware work stealing schedulers are allowing programmers to annotate locality groups (e.g., mark a node as part of a subtree/tile) or using static/dynamic analysis to infer it,  and then using that information to pick the steal. I'm not aware of any that are smart enough to 'wait' as in our case, but it's a sensible next step in profile-guided systems. See Trishul Chilimbi's late 90's thesis work to get in the right mindset.

Sunday, May 5, 2013

Visualize Parallel Work Stealing in the Browser

I try to write at least one interesting program every week. This weekend, to spice up my dissertation talk, I built an animated visualization of how my parallel CSS engine partitions a webpage via work stealing.


Work stealing is a great parallel task scheduling algorithm that provides dynamic load balancing and spatial locality. The idea is that each thread has a local queue of ready-to-fire tasks that it loops through, and if it ever runs out of local tasks, it will steal tasks from another thread's queue. The visualization shows how this works for a parallel top-down tree traversal over a webpage's HTML tree. In this case, a task is a node, and executing a task is denoted by changing the background color of that node to match the thread's color. The node's children are then added to the local queue for firing, and the process repeats. Whenever one thread steals a task from another, the webpage node corresponding to the stolen task has its border highlighted by the color of the thieving thread.

To try it it out, open your favorite website, paste the following code into the JavaScript console, and hit the start button:

var s = document.createElement("script"); 
s.type = 'text/javascript'; 
s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/worksteal.js'; 
document.getElementsByTagName('head')[0].appendChild(s);

Alternatively, copy the following into your URL bar and hit start after it loads:

javascript:(function(){var s = document.createElement("script"); s.type = 'text/javascript'; s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/worksteal.js'; document.getElementsByTagName('head')[0].appendChild(s); }())

For this version, make sure you browser doesn't strip the "javascript:" at the front.

The animation shows several important things for parallelization. First, the same thread will run many tasks that are near each other, which indicates good spatial locality. The evidence of this occurring is that many regions of the webpage will share the same color. Second, steals are infrequent, which keeps overheads low and again highlights locality benefits. This shows up in the page as very few stolen nodes (colored borders). To measure it, the per-thread percentage counters at top show the ratio of stolen nodes to non-stolen ones, and on most webpages, I get a pleasingly low 5% miss rate.  Finally, you can see a problem with the algorithm: the steal rate spikes at the beginning and end of a tree traversal. I made a simple optimization to the initial traversal -- I do a short sequential BFS traversal rather than immediately start with parallel DFS -- but did nothing here for the end.

My layout engine actually uses a novel semi-static work stealer rather Cilk-style dynamic work stealing. The reason is that Cilk does not account for temporal locality, which is important in situations such as multiple traversals over a tree because they revisit the same nodes over time. With Cilk-style scheduling, which thread a node lands on can easily change across different traversals, which is a huge hit. Instead, my approach is to use work stealing once to partition the tree, and then reuse the precomputed schedule across different traversals: load balancing suffers a bit, but temporal locality goes up. 

Enjoy!

Friday, May 3, 2013

Monday, Monday, Monday! Dissertation Talk


About time.


---

Your message for list eecs-announce has been forwarded to editor(s):


[Dissertation Talk] Parallelizing the Browser: Synthesis and Optimization of Parallel Tree Traversals



Speaker: Leo Meyerovich
Advisor:  Rastislav Bodik
Date: Monday, May 13th, 2013
Time: 3:30pm
Room: 373 Soda Hall



ABSTRACT:
From low-power phones to speed-hungry data visualizations, web browsers need a performance boost. Parallelization is an attractive opportunity because commodity client devices already feature multicore, subword-SIMD, and GPU hardware. However, a typical webpage will not strongly benefit from modern hardware because browsers were only designed for sequential execution. We therefore need to redesign browsers to be parallel. This thesis focuses on a browser component that we found to be particularly challenging: the layout engine.

We address layout engine implementation by identifying its surprising connection with attribute grammars and then solving key ensuing challenges:

1. We show how layout engines, both for documents and data visualization, can often be functionally specified in our extended form of attribute grammars.

2. We introduce a synthesizer that automatically schedules an attribute grammar as a composition of parallel tree traversals. Notably, our synthesizer is fast, simple to extend, and finds schedules that assist aggressive code generation.

3. We make editing parallel code safe by introducing a simple programming construct for partial behavioral specification: schedule sketching.

4. We optimize tree traversals for SIMD, MIMD, and GPU architectures at tree load time through novel data representation and scheduling optimizations.

Put together, we generated a parallel CSS document layout engine that can mostly render complex sites such as Wikipedia. Furthermore, we ran data visualizations in the browser that support interacting with over 100,000 data points in real time.

Sunday, April 7, 2013

Parallel Parsing isn't Hard (Or, Parallel JSON via Web Workers!)

Generalized parallel parsing is hard. Many a Ph.D. has been burned by attempting parallel parsing. Guy Steele wrote papers about it, and you'll find custom ones for favorite modern technology X for X = JavaScript, HTML, XML, etc. However, in practice, practical parallel parsing does not have to be so hard because parallel parsing research generally focuses on extreme cases. For example, our group examined automatically generating parallel parsers and handling arbitrary HTML.

Actual parallel parsing is easy. Real applications provide significantly simplifying assumptions. Case in point: for our browser-based big data visualization framework, we needed to somehow load in big JSON arrays. Modern browsers provide a fast native function "JSON.parse :: String -> JSObj" that is fine for normal sized data sets, but parsing 9MB of sparse data takes over 1s, which requires a loading screen and eats your battery. The solution took ~2 days: on the server, split the data into multiple arrays, and on the client, use web workers to parse each array in parallel. Even though I'm controlling the parsing in JavaScript, the actual parsing is still being done natively by calling into JSON.parse for handling each fragment.

Anyways, the proof is in the pudding. The relevant benchmarks are impact on file size and load time.

Benchmark 1: file size




Splitting one of our sparse JSON data files into multiple chunks has no real hit to the file size, and due to some uninteresting  phenomena, actually helps when gzip is used. I'm also showing the benefit of compressing repeating zeros ("sparse json") for our particular data set.


Benchmark 2: speedup




For the second benchmark, I'm testing loading the data set from a local hard drive when using different formats. The first two columns are sequential data formats, and with the second column primarily demonstrating the benefit of doing preprocessing for my particular application. The blue part of each column shows, in ms, how much time is spent reading the file from disk and parsing it.

The important part is comparing the blue of the third column against the blue part of the second: using workers to parse JSON chunks is 2x faster than parsing the file in one sequential go.

For further parallel speedups, it is important to realize that this application hit Amdahl's Law. In particular, the sequential stuff, like the black subcolumns, must be parallelized. The blue columns also have sequential code due to APIs I'm using. First, Safari does not have zero-copy message passing (transferable objects), so a sequential copy is made in communicating a parsed fragment back to the master. Second, I'm actually creating a big typed array, so even current web worker APIs do not help: I want to the Buffer into the worker, not get a new buffer out of one, which means I have to do a sequential copy even in Firefox and Chrome. Basically, we're seeing the web's poor support of message  passing forcing a lot of slow, sequential copies.

Finally, I only benchmarked the case of local file loads. Our application, ultimately, will be used online with a server providing the data on-the-fly. The optimizations I showed should still help even if the application is network bound: they also enable pipeline parallelism. By splitting the JSON, parsing one chunk is now done in parallel with downloading the next. Modern webpages alternate between waiting on the network and the CPU, which this solution elegantly addresses.

Conclusion

So there you have it -- parallel parsing, in practice, should be easy. I did a constrained case of handling big JSON arrays. Even for more general JSON, it should only take a day more of work: partition the JSON via work stealing -- I successfully did a similar thing in my thesis for HTML in a few hundred lines of C++.

If you'd like the code, it'll be part of our Superconductor (automatically GPU-accelerated web programming framework for big data visualization) release next month. Ping me if you'd be interested in working on parallel parsing of arbitrary JSON -- it should be awesome!


UPDATE

I more carefully measured the overheads  and also did a couple more optimizations. The chart below shows the new results. It also now includes the message passing overheads in gray, where ~90% of it would go away in Chrome with transferable objects and the other 10% might be eliminated with ownership transfer of buffer views (sub-buffers), not just buffers.



Two big  things are going on:

  1. The workers now perform decompression rather than the master. However, decompressing in the worker has a cost:  each worker now returns an inflated result message. Safari does not have ownership transfer so almost 100ms is spent copying the message back to the main thread rather than passing a pointer.

    The overhead is still worth it -- the blue region shrunk a lot, even if there is now a big gray region. (The gray region is in both because the sequential job runs in a worker in order to avoid pressuring the UI thread.) If/when Safari gets transferrable objects, 90% of the gray bars will disappear.
  2. Faster reduction of worker results into the final GPU buffer. Typed arrays support method set that can also help such reductions. It acts like a memcpy, so result.set(chunk, len) copies array buffer "chunk" into array buffer "result" starting at byte offset "len" into "result." It only takes about 13ms. If the typed array spec was extended so that disjoint sub-buffers could be transferred across workers, the other 10% of the gray bars would disappear.







Saturday, January 26, 2013

Adoption-Oriented Languages

Language design decisions should not just increase adoption but also exploit adoption. For example, I just measured how popular languages can rely upon Stackoverflow as a code assistant. The chart below shows that increased language popularity on SourceForge (x axis) leads to faster answer times for questions asked on Stackoverflow (colored bins on the y axis):

Stackoverflow Time-to-Answer vs. Sourceforge.net Language Popularity. Raw data grabbed a stackoverflow query that modifies G. Winker's, and language rankings from our previous analysis.

Top 10 languages have questions answered within 5 hours over 80% of the time, but for languages worse than Top 50, that shrinks to only 50% of the time. In terms of a simple blocker at work, that's a big deal: you can reliably ask a question, go work on something else, and come back to the answer!


Of course, measuring response time vs. popularity is up to interpretation: it could be that as adoption increaess, people ask easier questions. However, I suspect the chart is actually undercounting the adoption benefit: the more questions are answered, the more people don't need to wait for answers because they can just search for old ones! As one measure, we see that a language's popularity relates to its answer corpus size:

 Stackoverflow Answer Corpus Size vs. Sourceforge.net Language Popularity. Same dataset as above. 

I had to switch to a log-scale axis for answers! For popular languages, search becomes a valuable tool because Stackoverflow alone provides a FAQ of 100,000 solutions. An enterprising individual might want to further examine the duplicate rate, which I suspect would provide an insightful indicator for the likelihood of a query already being answered by a previous response.

Open source library repos are another case of a feature that improves with adoption: the library of builtins is ever expanding. Likewise, one of my favorite research projects is cooperative bug isolation, which exploits the adoption of a program to help narrow down causes of bugs. (AFAICT, most software companies do some form of tracking error and performance dumps.) A/B testing for feature design is also increasingly common. As another example, in academia, code completion tools are also  increasingly incorporating corpus-based solutions. A corpus can provide more value than type and namespace based solutions because, for example, synthesized code completions will resemble what people would actually write.

I started sketching out the notion of adoption-oriented language design in a position paper last week. I struggled to think of features that couldn't be designed to strengthen with adoption. For example, verification, testing, security, program synthesis, and optimization could all be designed to automatically exploit adoption of either the language, the library, or a program.  What strikes me is how little adoption is exploited today. Outside of the cases, am I missing any that are commonly practiced, or should be?

Sunday, January 20, 2013

Superconductor Language in Action: Easy Big Data Viz in Your Browser

Exciting news: we'll be talking about Superconductor at the O'Reilly Fluent conference! This will be the first time we show it outside of small seminars, and will hopefully coincide with our open source release. This video proposing our session gives a feel for what Superconductor can do. The link skips right into the demo:


The basic idea is that you write a normalish webpage (HTML5, JS, JSON, CSS) for including and manipulating your big data viz, except:
  1. Instead of relying up on the browser to run it (parse, layout, render, & animate), you load in our GPU-accelerated JavaScript libraries for the heavy lifting. Behind the scenes, we automatically exploit workers, WebCL, WebGL, etc. 
  2. Instead of using JavaScript for fancy layouts, you load in a widget made with our declarative layout DSL (or use it to define your own). Our parallel program synthesizer automatically finds the parallelism in the widget specs and our code generators output GPU code based on it. 
We're finally ramping up on a paper with the end-to-end system details; this will be the first time we discuss running inside of a commercial browser and using GPUs. Papers have a ridiculous delay period, so of you'll be at SIGGRAPH (LA), PPOPP (Shenzen, China), or Intel's TechFest (Portland), you'll be able to catch a demo before then.

Wednesday, December 26, 2012

Reactive Objects and Arrays via Direct Proxies

Arrays are a crucial building block for programming. Consequently, an essential question for functional reactive programming is how to react to imperative array updates. Designing a clean interface gets bogged down in issues like identity, API consistency, and efficient incrementalization. I used a recent extension to JavaScript, Direct Proxies (Firefox-only),  to sketch a transparent dual-mode model (JSFiddle) for Flapjax. I made a reactive array interface that covers the entire Array interface and supports both instant and batched updates. A fun holiday break from Superconductor :)

Update: I generalized the code in two ways: it works with object instances and also allows choosing whether the prototype is Array or Behavior.

Transparency


The idea is to create a reactive array that is tied to an existing normal array:

var arr = ['init'],
var arrB = bindB(arr);

From then on, interactions with one array are automatically reflected from one to the other. (In particular,  arrB uses arr as its backing store and proxies interactions to it):

// assert(arr[0] == arrB[0]);
arr.push('a');  // assert(arr.length == fxArr.length);
arrB.push('b'); // assert(arr.length == fxArr.length);


By defining traps for JavaScript's meta-object proxy protocol, I guarantee that all of JavaScript's array methods are supported by the reactive array arrB. All in all, for normal JavaScript, it should not matter if you pass around arr or arrB: they act on the same underlying data.

(Full transparency might be wrong. I'll raise a semantic issue at the end about what if you *do* want to distinguish reactive arrays vs. flat ones using operators such as instanceof.In my implementation, I let you pick as an optional parameter to bindB.)

Functional Reactivity and Instant Reactions


Now the cool thing is that array mutations can be automatically reacted to using normal functional reactive programming constructs.

For example, we can declare a live counter of the array size:

insertDomB(arrB.liftB(function (av) { return av.length; })); 

The above snippet is the usual functional reactive code that is a joy to read. It displays the length of the array. Whenever arrB updates, the GUI will automatically update to show the correct length. 

The new ability here is that wrapped array mutations automatically trigger reactions:

arrB[30] = 20; //updates length counter on screen
arrB.push('hello'); //updates length counter on screen

I'm calling these instant reactions because each mutation registers an update event. In the above statements, two update events will be fired, one for each statement.

The original array and reactive array refer to the same data, but differ on how they interact with the event system. Updates to the original array do not trigger reactions:

arr.push('world');  //no screen update


Maintaining non-triggering behavior of the original array is important so that non-reactive code not ready for the extended behavior does not trigger massive DOM updates. In places where it is safe to do the switch, arrB can be swapped in.


Batched Reactions

A nice pattern falls out of the wrapped and unwrapped view of the array. When many updates to an array should happen, we can batch them for a single subsequent response:

function addSome(n, arrB) {
  //batched update
  for (var i = 0; i < n; i++) {
     arrB.last.push(i);
  } 
  //notification
  if (n < 0) 
     arrB[0] = arrB.last[0]; //updates length counter on screen
}
addSome(10, arrB);

The "arrB.last.push(i)" calls trigger no reactions, while the assignment to arrB[0] triggers an instant reaction. This pattern of silent array updates followed by an instant reaction on the reactive view is a batched reaction. It is largely an optimization, but if you create local invariants inside the body of addSome, you can also use the pattern for  (non-nested) transactional reactions.


Object Instances and Meta Protocols

In response to Shriram's comment, the technique here is a late-bound dual to his work in making legacy classes reactive. For Racket's class system, he showed how to make a reactive subclass: the case of arrays would act like "var myArrB = new ArrayB(1, 2, 3); assert(myArrB instanceof Array)". The dual problem examined in my post is where an instance of a class must be made reactive ("var myArr = [1,2,3]; ...; var myArrB = bindB(myArr)"). It is an important practical distinction because the developer writing "var myArrB = bindB(myArr)" might not have control over "var myArr = [1,2,3]" nor "...".

His connection is important for two reasons: it provides a principle for how bound arrays should act (as subclass instances) and, more generally, shows that the same approach can be used for making object instances reactive.

1. Design principle of subclassing. Given an array, his connection tells us that reactive version should act as a subclass of Array. As seen above, this works for methods like push and [] (get). However, the desired object protocol behavior is not obvious. For example, should "assert(myArrB instanceof Array)"  succeed? By the subclass reasoning, yes!

Unfortunately, setting myArrB to be an instance of Array breaks introspection within the Flapjax library. To add dependencies to an object, Flapjax checks whether it can register event listeners. For example, "liftB(function (v) { return v.length; }, myArrB)" triggers a check "arg2 instanceof Behavior": if myArrB is an Array and not a Behavior, Flapjax will not register listeners on it! Some examples still work, such as  "myArrB.liftB(function (v) { return v.length; })", because the check is elided. For now, I support two constructor options:

var xB = bindB(arr);
assert(xB instanceof Array); //indistinguishable from Array
xB.liftB(function (v) { return v.length; }); //only supports chained reaction


var yB = bindB(arr, true);
assert(xB instanceof Behavior); //distinguishable from Array
yB.liftB(function (v) { return v.length; }); //chained reactions
liftB(function (v) { return v.length; }, yB); //and as arguments



The non-default constructor is more readily distinguished from an Array, and Flapjax uses this ability to  out-of-the-box to to react to it. If code consuming "myArrB" does not need "myArrB instanceof Array", I suggest the second, non-default form.


2. Generalization to Objects. The motivation of this post was to handle array instances, but as Shriram implies, those are a special form of object instances. We get the simpler case of objects for free!

var o = {};
var p = bindB(o);

insertDomB(DIV("o has x: ", p.liftB(function (v) { return v.x ? true : false; })));
//display shows false

p.x = 1; // o.x == 1 and display updates to show true
delete p.x; //o.x == undefined and display updates to show false

The code snippet above shows creating a reactive view of an object. The original object is the backing store for the view, and edits to the reactive view automatically trigger DOM updates. As with arrays, when reactions occur can be controlled by choosing to edit the raw object or the view.


Conclusion


Via proxies, I was able to make imperative arrays play nicely along their entire API with our FRP library and support both instant and batched/transactional updates.  You can play with this in Firefox via my JSFiddle. There is more to do (for example, the direct proxy API precludes behavioral indices  like fxArr[secondsB]), but the flexibility of this approach is handy.


APPENDIX


function bindB(objasObj//Behavior prototype by default

    function close(fthat{
        return typeof(f== 'function' ?
            function({
                return f.apply(thatarguments);
            
            f;
    }

    var resE receiverE();
    var resB resE.startsWith(obj);

    var protocol {
        setfunction(_pv{
            obj[pv;
            resE.sendEvent(arr);
        },
        getfunction(_p{
            return == "toSource" 
                close(obj.toSourceobj
            resB.hasOwnProperty(p
                resB[p
            Behavior.prototype.hasOwnProperty(p
                close(Behavior.prototype[p]resB
            obj[p];
        }
    };

    return new Proxy(asObj obj resBprotocol);
}
        
    
var arr ['init'];
var fxArr bindB(arr);


Thursday, November 29, 2012

Google Tech Talk Video on Socio-PLT: Quantitative and Social Theories of Programming Language Adoption

My Google Tech Talk on the Socio-PLT work was recorded and posted. It starts with the quantitative stuff + questions on it, and then switches to the social theories: pick your poison. Enjoy!


Wednesday, November 28, 2012

Temp URLs for Flapjax documentation

The Flapjax server is a bit wonky right now, so if you are looking for online documentation, check out the following:
Of course, you can always check the git repo for the latest.

Hope that helps!

Thursday, November 22, 2012

What Women Want

Trending on Quora is the question, "Which programming language is the most female friendly?"I do not know (I can provide primary sources who will verify this), but based on our survey of 802 men and 154 women, I can answer which languages females prefer over males. To learn how to think about these questions, see my recent Google Tech Talk or the links below.

Gayle presciently suggests that "Python is also a no-no because, I mean, you just named a language after a snake. Ick! And, I mean, do I need to call out the Freudian imagery there?" Go Gayle -- there is indeed a 3% drop! However, despite Gayle's insight that "All women love shiny objects," that does not help Ruby. Our numbers show a 7% drop for Ruby, so Ruby's bad press might be representative.



Overall, we saw little difference among the top 10 languages known by males and females. The ordering is fairly stable, and overall knowledge is similar (see the pink trendline). A few language jumps stand out: C, PHP, C#, and Ruby drop for female knowledge, while  Excel rises.


If you're curious, check out our methodology, more of our statistics and raw data, and what a language designed for women might look like. Long story short, for the above statistics, we had good respondents: mostly educated, mostly employed, and biases because we polled students taking a free UC Berkeley online Ruby course on software as a service.