Wednesday, June 26, 2013

Programming Language Popularity Revisited: Power Law vs. Exponential

Understanding the 'tail' behavior of programming language popularity can tell practicing language designers whether to sink effort into a domain-specific language for a niche audience (long tail) or to focus on winner-takes-all general-purpose languages (short tail). I previously showed an analysis of languages in 200,000+ SourceForge projects from 2000-2010, which found popularity to follow an exponential curve. That means a short tail, with most language use on SourceForge concentrated in the big languages... or does it?

I cross-validated the result today with two other data sources: Ohloh stats of 500,000+ projects across different open source repositories, and our survey of 1,600+ Wired, Slashdot, etc. readers that asked about the primary language for their most recent project. Ohloh also shows exponential, short-tail behavior, but our survey does not. So what the heck happened?

Conflicting cross-validation. The red and green exponential curves drop off in popularity starting around the bottom half of the languages. In contrast, just as there seems to be a web page for everything, the blue power law line keeps going.
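For readers who want to try this kind of comparison themselves, here is a minimal sketch of one rough way to tell the two shapes apart. It is not the method behind the plots above, and the function names and synthetic data are my own assumptions. The idea: an exponential is a straight line when log(popularity) is plotted against rank, while a power law is a straight line when log(popularity) is plotted against log(rank), so we can fit both lines and compare R².

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared

def compare_tail_models(counts):
    """Compare exponential vs. power law fits on rank-ordered popularity counts."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    ranks = list(range(1, len(ranked) + 1))
    log_counts = [math.log(c) for c in ranked]
    log_ranks = [math.log(r) for r in ranks]
    # Exponential: log(count) is linear in rank.
    _, _, r2_exp = linear_fit(ranks, log_counts)
    # Power law: log(count) is linear in log(rank).
    _, _, r2_pow = linear_fit(log_ranks, log_counts)
    return {"exponential_r2": r2_exp, "power_law_r2": r2_pow}

# Synthetic data only (not the SourceForge, Ohloh, or survey numbers).
power_counts = [1000.0 / r for r in range(1, 51)]
exp_counts = [1000.0 * math.exp(-0.2 * r) for r in range(1, 51)]

print(compare_tail_models(power_counts))
print(compare_tail_models(exp_counts))
```

Comparing R² on log-transformed data is a quick eyeball check, about on par with spreadsheet trendlines; it is not a rigorous statistical test for power laws.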

My first thought was that the data sets were measuring different things. Maybe commercial behavior differs from open source, though the lines have blurred nowadays. It could have been that the SourceForge data was from 2-4 years before the others -- but the survey and Ohloh data were both recent and still in conflict. Or perhaps it's the double-dipping in Ohloh, which counts every project a language is used in rather than only those where it's the primary language. These differences are real, but I realized a simple but deadly methodological issue is at play.

SourceForge and Ohloh curate their list of languages! They only have about 100 languages -- if a language is not in their list, its use in a project will not be counted. The languages that would make up the long tail are not being counted! To get a taste, our survey reached orders of magnitude fewer people, yet it elicited 20% more languages. The survey measures what actual developers do, while my SourceForge and Ohloh data crunching was a meta-analysis restricted to what SourceForge/Ohloh chose to annotate. Other techniques, such as trawling Google search results as in langpop.com, are therefore similarly suspect for this style of curve fitting.
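To make the curation effect concrete, here is a toy simulation. Everything in it is an assumption for illustration: the Zipf-like 1/rank popularity, the 1,000 "true" languages, and the helper name `sample_language_counts`. Only the rough sizes (a curated list of about 100 languages, a large repository, a 1,600-person survey) echo the numbers above. It draws projects from the same underlying popularity distribution twice, once filtered through a curated list and once unfiltered.

```python
import random

def sample_language_counts(num_projects, num_languages, allowed=None, seed=0):
    """Draw one language per project and tally uses per language.

    Language ids double as popularity ranks (1 = most popular).
    If `allowed` is given, uses of languages outside it are dropped,
    mimicking a site that only tags languages from a curated list.
    """
    rng = random.Random(seed)
    languages = list(range(1, num_languages + 1))
    weights = [1.0 / rank for rank in languages]  # assumed Zipf-like popularity
    counts = {}
    for lang in rng.choices(languages, weights=weights, k=num_projects):
        if allowed is not None and lang not in allowed:
            continue  # no tag exists for this language, so the use goes uncounted
        counts[lang] = counts.get(lang, 0) + 1
    return counts

curated = set(range(1, 101))  # hypothetical curated list: the top 100 languages
repo = sample_language_counts(200_000, 1000, allowed=curated, seed=1)
survey = sample_language_counts(1_600, 1000, allowed=None, seed=2)

print("distinct languages seen by curated repository:", len(repo))
print("distinct languages seen by small open survey:", len(survey))
```

In this toy setup the curated source can never report more languages than are on its list, no matter how many projects it sees, while the far smaller open-ended sample still reaches into the tail. The exact counts depend entirely on the made-up parameters and say nothing about the real data sets.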

Going forward, I'd like to predict, for example, what an estimated tail of SourceForge and Ohloh language popularity would look like if we assume it's similar to the survey data. Furthermore, I'd like to test whether SourceForge and Ohloh actually follow power laws once the truncation is accounted for (e.g., is the data set completed with survey data a power law?). My stats skills bottom out at what's posted here, unfortunately -- if this sounds like your type of thing, I'd love to talk :)

Just for fun: the cumulative distributions with matching trendlines
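If you want to build a cumulative view like this from your own counts, a minimal sketch follows. The function name and the sample numbers are made up for illustration. It sorts languages by popularity and reports the running share of all uses covered by the top k languages.

```python
def cumulative_share(counts):
    """Return the running fraction of total uses covered by the top-k languages."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    running = 0
    shares = []
    for count in ranked:
        running += count
        shares.append(running / total)
    return shares

# Made-up counts for three languages: the top one covers half of all uses.
print(cumulative_share([30, 50, 20]))
```

A short tail shows up as a curve that hits nearly 100% after a handful of languages; a long tail keeps creeping upward as more languages are added.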
