1. Summary of proposal
To study the structure and roles of the community of the Apache Software Foundation.
Ramón: These bullet points don't reflect the questions proposed in the next section.
- Identify individuals in all the data sources
- Assign a "fingerprint" to each profile described in CommunityMetrics
- With the fingerprints, match individuals to each profile
This should be done over time, so that we can observe how the role of each individual changes.
Ross: What is the "community"? The ASF is an umbrella organisation for a number of projects. Are we studying individual projects, or the ASF as a whole? Personally, I (RossGardler) would find it far more interesting to study the ASF as a whole: how has the foundation's community grown over time in relation to the growth of the projects it houses?
Herraiz: The community is formed by the group of people interacting in the ASF through the mailing lists, the Subversion repository, etc. The study would take into account all the mailing lists of all the projects hosted in the ASF, as well as the lists of the ASF itself (at least, all the public lists). The study would be done at the project level and at the community level (the ASF as a whole).
Ramón: I think that going with the ASF as a whole is going to be too complicated. From a practical point of view, I'd say let's study one project. Can we get useful community metrics? If the answer is yes, then we apply the same methodology to all projects. Once we have that, we can think how to tackle the whole Foundation. One problem will be, e.g. how to deal with people who appear in different projects.
Ross: There is a mailalias.txt file in the ASF that (theoretically) defines all mail addresses used by committers in the ASF. Doesn't help with non-committers using multiple addresses though. I'm not sure if this is a public file or not, I'll check if necessary.
Using the developers and users mailing lists, it should be possible to identify the profiles described in CommunityMetrics.
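Before profiles can be matched, addresses that belong to the same person have to be merged, as the comments above point out. The following is only a minimal sketch of that step. The function names and the shape of the `aliases` mapping are hypothetical: the actual layout of mailalias.txt is not described on this page, so the mapping is assumed to have been loaded already into a plain dictionary.

```python
from collections import defaultdict


def normalise(address):
    """Strip whitespace and angle brackets, then lower-case the address."""
    return address.strip().strip("<>").lower()


def merge_identities(addresses, aliases):
    """Group e-mail addresses by person.

    `aliases` maps a known address to a person id (hypothetical format).
    An address with no known alias is treated as its own identity, which
    is exactly the non-committer case that remains unsolved.
    """
    alias_map = {normalise(addr): person for addr, person in aliases.items()}
    people = defaultdict(set)
    for raw in addresses:
        addr = normalise(raw)
        people[alias_map.get(addr, addr)].add(addr)
    return dict(people)
```

Any further heuristics (for instance matching on real names) would have to be layered on top of this, and would need manual checking.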
2. Questions to be answered
For one top-level project with subprojects, identify people in different roles according to the CommunityMetrics model
- Study temporal evolution of roles
Ramón: The proposal below has too many questions, and they are too generic; the research problem is effectively unspecified. As it is posed now, this is enough for an open research project of 1 year, not a 1 month study. I would hesitate to have more than 2 questions.
Does the ASF (or actually each project within the ASF) follow the model proposed by CommunityMetrics?
- Do those projects match the onion model?
- What is the distribution of profiles in the community?
- How do those distributions change over time?
- How do individuals migrate from one role to another?
- Is there any correlation with any other parameter of the project? How is the project affected by the distribution of roles?
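To make the migration question more concrete, one possible way of counting role changes is sketched below. It assumes that each individual has already been assigned one role per month; the role labels used in the example ("user", "contributor", "committer") are placeholders and not the actual CommunityMetrics profiles.

```python
from collections import Counter


def count_migrations(roles_by_month):
    """Count role changes between consecutive months.

    `roles_by_month` maps a person to the list of roles they held,
    one entry per month in chronological order (assumed input format).
    Returns a Counter keyed by (old_role, new_role).
    """
    transitions = Counter()
    for history in roles_by_month.values():
        for before, after in zip(history, history[1:]):
            if before != after:
                transitions[(before, after)] += 1
    return transitions


histories = {
    "a": ["user", "user", "contributor", "committer"],
    "b": ["user", "contributor", "contributor"],
}
migrations = count_migrations(histories)
```

Months in which a person keeps the same role are ignored here; counting them as well would give the raw material for a transition matrix.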
3. Other ideas to work on
Ramón: While the ideas below may be interesting, I think they stray from the study. In fact, they could constitute a new study. I'd rather have 1 study finished than 2 incomplete, so I'd forget about this completely for the moment. I'd like to delete this section if nobody disagrees.
I am interested in finding out how ideas spread in a community. The global trend of the community is determined by its common wisdom; that is, by the ideas about the community that people accept. How is that common wisdom generated, and how does it spread?
The methodology proposed is to take the text of the messages in the mailing list and to extract "keywords" or "ideas" from that text. There are several approaches to do this. The easiest one is just to filter out the most common words and to obtain a list of the 10 least common words. Those words would be the "keywords". Another approach could be to reuse the text matching methods used by the FOSSology project.
The evolution of keywords over time could give an idea of the evolution of the community. We could try some Social Network Analysis, generating a network of people connected by keywords. This analysis could be done on a monthly basis, and maybe we could try to add information about the profile of each individual (identified using the method described at the top of this page).
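As an illustration of the "easiest" approach, here is a rough sketch. It assumes word frequencies have been counted over the whole archive beforehand; the function names, the tokenisation rule and the `n_common` cut-off are choices made for this example and are not part of the proposal.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(message, corpus_counts, n=10, n_common=100):
    """Return up to `n` of the rarest words in `message`.

    Words among the `n_common` most frequent in the corpus are dropped
    first; the rest are ranked by corpus frequency (rarest first, ties
    broken alphabetically).
    """
    common = {word for word, _ in corpus_counts.most_common(n_common)}
    candidates = {w for w in tokenize(message) if w not in common}
    return sorted(candidates, key=lambda w: (corpus_counts[w], w))[:n]


corpus = Counter(tokenize("the build failed the build passed the patch for maven"))
keywords = extract_keywords("the maven build failed", corpus, n=2, n_common=2)
```

On real archives the rarest words are often typos or fragments of quoted code, so some minimum frequency threshold would probably be needed as well.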
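A network of people connected by keywords could be built along these lines. This is a sketch under the assumption that the keywords used by each person in a given month are already available as sets; the edge weight (the number of shared keywords) is one possible choice among many.

```python
from collections import defaultdict
from itertools import combinations


def keyword_network(keywords_by_person):
    """Build an undirected weighted network of people.

    `keywords_by_person` maps a person to the set of keywords they used
    (assumed input format). Two people are linked when they share at
    least one keyword; the weight is the number of shared keywords.
    Edges are keyed by (a, b) with a < b.
    """
    users = defaultdict(set)
    for person, keywords in keywords_by_person.items():
        for keyword in keywords:
            users[keyword].add(person)
    edges = defaultdict(int)
    for people in users.values():
        for a, b in combinations(sorted(people), 2):
            edges[(a, b)] += 1
    return dict(edges)


network = keyword_network({
    "alice": {"maven", "ant"},
    "bob": {"maven", "ant", "ivy"},
    "carol": {"ivy"},
})
```

Running this once per month would give the series of networks needed for the monthly analysis.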
This approach has some other applications as well. For instance, we could identify who first introduced a keyword in the project, and how keywords propagate depending on who introduced them.
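Finding who first introduced each keyword amounts to a single chronological pass over the messages. The sketch below assumes messages are available as (date, author, text) tuples with sortable dates, and takes the keyword extractor as a parameter; both are assumptions made for the example.

```python
def first_introducers(messages, keywords_of):
    """Map each keyword to the (author, date) of its first appearance.

    `messages` is an iterable of (date, author, text) tuples; dates must
    be sortable (ISO strings or datetime objects). `keywords_of` is any
    callable that returns the keywords of a message text.
    """
    first = {}
    for date, author, text in sorted(messages, key=lambda m: m[0]):
        for keyword in keywords_of(text):
            # setdefault keeps the earliest entry and ignores later ones
            first.setdefault(keyword, (author, date))
    return first


archive = [
    ("2005-03-01", "bob", "maven ant"),
    ("2005-01-15", "alice", "maven"),
]
introducers = first_introducers(archive, str.split)
```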
Another application would be as a summary of messages. This could be useful for newcomers. When someone wants to ask something on a mailing list, she has to carefully review previous messages in order not to ask about something that has been asked before, because otherwise people get annoyed. This "archaeology" of mailing list archives is an entry barrier for newcomers. If messages could be summarised using keywords, the digging process would be easier.
From another point of view, developers have to deal with large amounts of mail. Having keywords of the messages would help to identify which messages are important and which are not.
4. Data sources
Ramón: Pointers needed.
- Mailing list archives
- SVN repository log file
5. Work plan
List of tasks to be performed during the visit.
5.1. Profiles and Migration processes
Ramón: Unless Israel has already done this before, knows the tools by heart, and knows exactly what results he needs to get at each step, and how to use them in the following steps, I don't think there's any chance of getting this done within the proposed schedule. I think optimistic estimates should start at twice the time.
- Define profiles, and the fingerprint of each profile in mailing list archives and SVN log (to be done before the visit)
- Obtain data sources and process them with MLStats and CVSAnaly (2 days) Ramón: How is the output from MLStats and CVSAnaly going to be used to answer the research questions? What kind of output are we going to focus on?
- Identify individuals in the data sources using heuristics (1 day) Ramón: What kind of heuristics? I suppose this is something you have implemented in your software.
- Identify the profile of each individual each month during the lifetime of the project (2 days) Ramón: Identifying profiles is something completely new for Israel and us (as far as I know). Unless we define exactly what we are going to do, there's no way this is going to get done in 2 days.
- Define migration processes. How do the profiles change over time? (2 days) Ramón: Same comment as above. Besides, what method do you have in mind to compute the evolution of profiles?
- Distribution of profiles. Evolution of this distribution. Similarity using the Bhattacharya Metric. (3-4 days) Ramón: Why this metric? We don't even know what the results are going to look like, or whether we are going to get some kind of model.
- Discussion on results so far (1 day)
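For the similarity task, one possible reading is to compare the distribution of profiles in two months using the Bhattacharyya coefficient (the usual spelling of the measure that the work plan calls the Bhattacharya Metric). The sketch below uses the standard discrete form, BC = sum over roles of sqrt(p * q), with distance -ln(BC). Whether this discrete form is what the plan intends is an assumption, and the role labels are placeholders.

```python
import math
from collections import Counter


def distribution(roles):
    """Turn a list of role labels into a {role: probability} mapping."""
    counts = Counter(roles)
    total = sum(counts.values())
    return {role: count / total for role, count in counts.items()}


def bhattacharyya_coefficient(p, q):
    """Overlap of two discrete distributions: 1 if identical, 0 if disjoint."""
    return sum(math.sqrt(p.get(k, 0.0) * q.get(k, 0.0)) for k in set(p) | set(q))


def bhattacharyya_distance(p, q):
    """-ln(BC); infinite when the distributions do not overlap at all."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0 else -math.log(bc)


january = distribution(["user", "user", "user", "committer"])
february = distribution(["user", "user", "committer", "committer"])
similarity = bhattacharyya_coefficient(january, february)
```

A coefficient close to 1 between consecutive months would indicate a stable distribution of profiles; a drop would flag a month worth inspecting by hand.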
5.2. Spreading of ideas
Ramón: As above, this is another project and doesn't belong in this page.
- Obtain keywords from the databases (1-2 days)
- Review the FOSSology text matching tools (1 day)
- Evolution of the keywords over time (1-2 days)
- Discussion on results. Identify key-keywords (1 day)
- Search for the individuals who introduced each keyword. Study the evolution of the keywords depending on who introduced them. (1 day)
- Correlation between the "popularity" of a keyword and the role or profile of its original author (1 day)
- Social Network of keywords. Parameters of the network. Identify individuals with high importance in the network. (2 days)
- Evolutionary study of the above (2 days)
6. References
Kevin Crowston and James Howison. The social structure of free and open source software development. First Monday, 10(2), February 2005. http://www.firstmonday.dk/issues/issue10_2/crowston/.
Chris Jensen and Walter Scacchi. Modeling recruitment and role migration processes in OSSD projects. In Proceedings of 6th International Workshop on Software Process Simulation and Modeling, St. Louis, May 2005.
Comparing the Similarity of Statistical Shape Models Using the Bhattacharya Metric

