I cross-validated the result today with two other data sources: Ohloh stats of 500,000+ projects across different open source repositories, and our survey of 1,600+ readers of Wired, Slashdot, etc. that asked about the primary language for their most recent project. Ohloh also shows exponential, short-tail behavior, but our survey does not. What the heck happened, right?
My first thought was that the data sets were measuring different things. Maybe commercial behavior is different from open source, but the lines have blurred nowadays. It could have been that the SourceForge data was from 2-4 years before the others -- but both the survey and Ohloh were recent, and they still conflict. Or, perhaps, it's the double-dipping in Ohloh from counting every project a language is used in, rather than counting a language only when it's the primary one. These differences are real, but I realized a simple but deadly methodological issue is at play.
SourceForge and Ohloh curate their list of languages! They only have about 100 languages -- if a language is not in their list, its use in a project will not be counted. The languages that would make up the long tail are not being counted! To get a taste, our survey covered orders of magnitude fewer people, yet it elicited 20% more languages. The survey measures what actual developers do, while my SourceForge and Ohloh data crunching was a meta-analysis restricted to what SourceForge/Ohloh chose to annotate. Other techniques, such as trawling Google search results as in langpop.com, are therefore similarly suspect for this style of curve fitting.
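To make the truncation concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `responses` list, the three-entry `curated` list, and the helper names `survey_counts` and `curated_counts` are my own assumptions, not real SourceForge, Ohloh, or survey data.

```python
from collections import Counter

def survey_counts(project_languages):
    """Count every language exactly as reported (what a free-form survey sees)."""
    return Counter(project_languages)

def curated_counts(project_languages, curated_list):
    """Count only languages on a fixed list (what a curated host would see)."""
    allowed = set(curated_list)
    return Counter(lang for lang in project_languages if lang in allowed)

# Synthetic data: three popular languages plus five rarely used ones.
responses = (
    ["Java"] * 50 + ["Python"] * 30 + ["C"] * 15
    + ["Factor", "Io", "Pure", "Oz", "Nemerle"]
)
curated = ["Java", "Python", "C"]  # hypothetical curated list

full = survey_counts(responses)
seen = curated_counts(responses, curated)

# Languages that exist in practice but vanish from the curated view.
missing = sorted(set(full) - set(seen))
print(len(full), len(seen), missing)
```

The five tail languages account for only a handful of projects, so dropping them barely changes the head of the curve, yet they are exactly the points that decide whether the tail looks long or short.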
Going forward, I'd like to predict, for example, what an estimated tail of SourceForge and Ohloh language popularity would look like if we assume it's similar to the survey data. Furthermore, I'd like to test whether SourceForge and Ohloh actually follow power laws once the truncation of the data set is accounted for (e.g., is the completion using survey data a power law?). My stats skills bottom out at what's posted here, unfortunately -- if this sounds like your type of thing, I'd love to talk :)
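As a starting point for that kind of test, here is a rough sketch of how one might compare the two candidate shapes on rank-ordered counts. It rests on assumptions: it uses ordinary least squares on log-transformed data, which is a crude diagnostic rather than a rigorous power-law test (maximum-likelihood fitting is generally preferred), and the function names and the synthetic `zipf_like` and `exp_like` series are invented for illustration.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

def compare_tail_models(counts):
    """Fit log(count) against log(rank) (power law) and against rank (exponential)."""
    ranked = sorted(counts, reverse=True)  # counts must all be positive
    ranks = list(range(1, len(ranked) + 1))
    log_counts = [math.log(c) for c in ranked]
    log_ranks = [math.log(r) for r in ranks]
    _, _, r2_power = linear_fit(log_ranks, log_counts)
    _, _, r2_exp = linear_fit(ranks, log_counts)
    return {"power_law_r2": r2_power, "exponential_r2": r2_exp}

# Synthetic series with known shapes, purely to sanity-check the comparison.
zipf_like = [1000.0 / r for r in range(1, 51)]
exp_like = [1000.0 * math.exp(-0.2 * r) for r in range(1, 51)]

print(compare_tail_models(zipf_like))
print(compare_tail_models(exp_like))
```

Running the same comparison on the curated counts alone and then on the counts completed with a survey-shaped tail would show how much the verdict depends on the roughly 100-language cutoff.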