[Update: we have published a more accurate and validated report, please have a look at it]
WebKit is a well-known free, open source software project that produces the core of several of the most popular web browsers. Several companies (and other actors) collaborate to build this component, which is key to many of them. The two main players in WebKit are Apple and Google, but it is less well known that many others participate actively as well. They are far behind the big players, but together they account for a sizable fraction of the total activity.
This post is the first of a series on different aspects of WebKit development, based on the analytics we at Bitergia are gathering about it. Our take is that WebKit is one of those projects massively used by the industry, and therefore worth studying with the aim of providing quantitative and objective data about it.
Specifically, this post focuses on the evolution of company activity in the WebKit source code management repository (currently Subversion, formerly CVS) since it was released as an open project back in 2005, and before that, when it was still an internal project at Apple (if you don’t know about it, have a look at the fascinating history of the project since its ancient origins in KDE). The analysis of this activity provides useful information to understand, for example, how strongly companies are betting on the project (in terms of contributions to it) and, probably more relevant, which companies hold some kind of “soft” control.
WebKit being an open community, clear policies and procedures have been established to avoid control by companies in the traditional, direct sense. But meritocracy, together with varying amounts of contributions and involvement in the community, allows some companies to become more central to the project. From this point of view, we were interested in developers with rights to directly modify the source code (committers), and in their activity (commits).
With respect to commit activity, Figure 1 shows a general overview of the project over its entire lifetime, with commits assigned to companies according to the affiliation of developers. Apple (close to 40% of commits), followed very closely by Google (about 38%), clearly leads the activity. There are also other actors with a relevant level of activity, such as Nokia, with 4.91% of the total commits, Igalia with 3.4%, Research in Motion with 3.12%, and the University of Szeged with 2.23%.
But a very different picture emerges if we focus on the latest activity, and consider only commits during 2012.
Figure 2 shows how during 2012 (up to October 24th) Google is by far the most active company, with almost 50% of all commits. Apple is now second, with about 19%, while previously mentioned players such as Nokia, RIM, Igalia or the University of Szeged are among the most active in 2012 as well.
From the differences between Figures 1 and 2, it seems clear that Google has been the major contributor for quite some time, certainly for longer than 2012. Indeed, the policy for granting committer status asks developers to “have submitted around 10-20 good patches, shown good judgment and understanding of project policies, and demonstrated good collaboration skills”, which means that a company cannot suddenly show a large increase in committers, because they must first go through a certain training and testing process. When did Google overtake Apple in terms of activity? Can we visualize this process? These two questions are answered by Figures 3 and 4 (below).
In Figure 3, we have isolated the activity (again, number of commits) per year of Google and Apple. It shows how Google reached its current level of activity after about three years of increasing activity, starting in 2009, and how this happened even though Apple maintained a stable level of activity since 2007-2008. It can be said, therefore, that Google’s activity came as a supplement to Apple’s, not as a substitute. In other words, Apple has been contributing steadily for more than five years, and for about the last three years Google has been adding its own contribution on top of that, one which is now close to double Apple’s, thus significantly helping to boost the project.
Figure 4 shows the activity of the next ten companies, with some different patterns. In addition to Apple and Google, which contribute almost 70% of the activity, there is a second group of companies and institutions which have been clearly increasing their participation during the last years: Nokia, Igalia, RIM and the University of Szeged. They account for about 17% of the activity, have been increasing their net activity in recent years, and currently land between 1,000 and 2,000 commits per year. Finally, there is a third group with yet more actors involved, with lower (but not less important!) activity. In that group we find names such as Collabora, Adobe, Nuanti, Openbossa, Samsung and Intel, all well below 500 commits per year.
All in all, this analysis is showing not only how Google is pushing WebKit with Apple, but also a glimpse of the structure of the community of companies participating in it. Behind these numbers, it is certain that a story of strategies, competition and collaboration between competitors could be written. But that’s another story: we’re only providing the numbers 😉
From a methodological point of view, this analysis is based on committing activity in the Subversion repository of the WebKit project. This means that authors (that is, developers who actually produced the changes, when they differ from committers) are (so far) not considered: we only take into account contributions by people who have the right to commit to the project repository.
In addition, we scrutinized commit-queue activity (commit-queue is the bot which actually commits changes to the source code in many cases, such as those that follow code review procedures). Of a total of about 10,000 commit-queue commits, we identified code reviewers in charge for about 8,500, and considered them as committers for those commits.
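The reviewer attribution step described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual Bitergia tooling: the function name, the exact regular expression, and the sample commit message are all our own assumptions, based only on the convention that WebKit commit messages carry a “Reviewed by …” line.

```python
import re

# WebKit commit messages conventionally include a line like
# "Reviewed by Jane Doe." when a change went through code review.
REVIEWED_BY = re.compile(r"^Reviewed by (.+?)\.?$", re.MULTILINE)

def effective_committer(committer, message):
    """For commit-queue commits, credit the reviewer found in the
    commit message; otherwise keep the recorded committer."""
    if committer == "commit-queue@webkit.org":
        match = REVIEWED_BY.search(message)
        if match:
            return match.group(1).strip()
    return committer

# Example with a made-up commit message:
msg = "Fix crash in render tree.\n\nReviewed by Jane Doe.\n\n* Source/..."
print(effective_committer("commit-queue@webkit.org", msg))  # Jane Doe
```

With a heuristic like this, the roughly 8,500 attributable commit-queue commits mentioned above would each be re-assigned from the bot to a human reviewer.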
We then sorted committers by number of commits and tried to find the institutions (companies or other) to which they are affiliated. We succeeded with a high degree of certainty for committers accounting for more than 95.6% of the commits. In fact, all committers with more than 600 commits were linked to an institution, and only 4 committers with more than 200 commits were not. In total, we identified affiliations for 387 committers (out of 439 identities found in the Subversion repository), corresponding to 29 institutions. We tracked only the current company: if some committers worked for a different company in the past, all their commits during that time would be wrongly assigned to the current one. Of course, we could have some errors in the assignment of affiliations, but the data is correct to the best of our knowledge.
The exact data we used for this analysis, and all the methodological details will be published when we release our upcoming report on WebKit. The data shown in this post could have (small) errors, which should not affect the general statements in it.
Did you filter test rebaselines and platform integration?
If not, you may just be measuring the background noise over the real patches.
This is based on finding a committer for *all* commits in Subversion, including assigning a committer (person) to commit_queue commits (as said in the last part of the post).
I agree that filtering out commits, and measuring other parameters (such as size or entropy of commits) provide complementary views of the project. The same can be said of considering activity in the ticketing system (such as which companies are closing or helping to close tickets), or how code reviewing is working.
Stay tuned for more posts…
[However, after our experience in other projects, those other views tend to show a similar image, at least for the big players]
This explains some of your surprising conclusions.
Due to the nature of how the repository is used, you need to do some smart filtering.
It would be nice to see the code you used to understand why your results are so different.
Benjamin, can I send you an email message, and we follow up by email?
Please do. I can explain some more by email.
Should we assume the 2012 dip in Google commits is due to the data not covering the entire year? Or has there been a genuine reduction? I would be fascinated to see the types of commits Google were making vs Apple, if that is possible. Maybe as a word cloud or a graph by keyword? 🙂
Yes, the data for 2012 is from Jan 1st to Oct 24th, so for 2012 about 20% of the year is still missing.
The idea of the tag cloud is nice. I have some software for that, I will give it a try. Thanks for the suggestion.
I wonder how you took into account the @webkit.org email addresses, because many committers from a given company use @webkit.org to commit on behalf of their company.
I tend to believe the conclusions are incorrect, as the analysis doesn’t take into account how the repository is used. Google is in fact a huge contributor, true, but they are the only ones running pixel tests on multiple platforms, and therefore they land many more rebaselines of their tests than other companies. It is work on WebKit for sure, but it is not development, so it gives a false indication of how the work is distributed. I believe that if the LayoutTest references were in a separate repo the numbers would be different.
Matching between developer and company was performed manually in the cases you mention. However, as stated in the methodology section, not all of the developers were assigned (those assigned cover about 95% of the total commits).
I agree with you that this is only one way of retrieving and visualizing the data. Indeed, one of our next posts is partially focused on the analysis you mention, broken down by port or by directory under the top-level “Source” directory.
For instance, another study divides commits into reviewed and non-reviewed ones, and looks at the evolution of authors vs reviewers.
Thank you very much for your comments, any suggestion is more than welcome!
Yes, we have a few “free” contributors indeed.
I’m happy you are actually taking another shot, removing the layout test maintenance from the equation.
I also believe that a study based on commits with “Reviewed by” is more valuable than the current study you posted. I strongly think (but you will tell me soon) that test maintenance biased the results a lot, not necessarily for Apple/Google but for the numbers of other companies.
Based on your other studies and their outcomes, maybe you should post an update to this one. Many websites are using your article to draw weird conclusions that make no sense :).
Thanks for the work anyway, it’s nice to see numbers like this.
Thank you again for your comments :). We are now polishing the dataset based on your suggestions!!
Our idea for an update of this analysis is to finally have an HTML report where you can play with the data, download the dataset, and create your own scripts or run the current ones, similar to what we did in the Folsom release analysis of the OpenStack project: http://bitergia.com/public/reports/openstack/2012_09_folsom/
A great idea – would love to have a way to play with this data!