The Dynamic Nature of System Areas Affected by New Work

Author: Alex Yakyma


In this study we present research results showing how the areas of a system affected by new work shift over time. The question matters for understanding the software development process at large scale, because it provides empirical evidence of how much the scope of work shifts in a future period compared to the current one.

In this article we will see that such a shift is significant and that its extent has very stable statistical properties. A separate article will discuss the organizational impact of this study.

Method and Data

In this research we analyzed 140 open-source software projects built with different technologies and languages (including Java, Go, C#, JavaScript, Python, and C++). GitHub was the source for all repositories. The projects vary in size from a couple hundred to a couple hundred thousand files per repository. Only repositories with at least 1,000 commits were selected.

We will need a few notions to proceed further.

Assume that a software system S consists of files a, b, c, …:

S = {a, b, c, …}

For any period of time T, let’s define a vector with one component per file:

vT = (xa, xb, xc, …)


where xf stands for the number of commits to file f within time period T. For example, if our system consists of just three files (an illustration only; actual projects contain thousands of files) and the time period runs from Jan 1, 2017 through Feb 1, 2017, then the vector (15, 0, 7) means that the first file was changed 15 times within this period, the second file was not affected at all, and the third file received 7 commits.

We will call the vector vT the Effort Profile (for time period T), or EP for short. It shows exactly how the effort within a given timeframe impacts different areas of the system. Generally speaking, the shorter the timeframe, the more components of the EP will be equal to zero.
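As an illustration, an EP can be tallied from a list of per-file commit records. The following is a minimal sketch, not the tooling used in this study: the function name `effort_profile`, the (file, date) record layout, and the half-open period convention (start included, end excluded) are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def effort_profile(commits, files, start, end):
    """Return the EP as a tuple of commit counts, one per file in `files`.

    `commits` is an iterable of (file_name, commit_date) records, one record
    per file touched by a commit (an assumed layout). A record counts toward
    the period when start <= commit_date < end (an assumed convention).
    """
    counts = Counter(f for f, d in commits if start <= d < end)
    # Counter returns 0 for files with no commits in the period.
    return tuple(counts[f] for f in files)


# Hypothetical three-file system matching the (15, 0, 7) illustration.
files = ["a", "b", "c"]
commits = (
    [("a", date(2017, 1, 5))] * 15
    + [("c", date(2017, 1, 20))] * 7
    + [("b", date(2017, 3, 1))]  # outside the period, so not counted
)

ep = effort_profile(commits, files, date(2017, 1, 1), date(2017, 2, 1))
print(ep)  # (15, 0, 7)
```

Keeping the file order fixed across periods is what makes two EPs comparable component by component.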

A good way to think about the EP vector is as a “heat map” that describes how different areas of the system are affected during T. The higher a specific vector component, the “hotter” it shows on the “map”, and the more change that file endures during T. So, in our example above where the vector is (15, 0, 7), the “heat map” would look as follows:

Let T1 and T2 be two different, non-overlapping time periods. The Absolute EP Overlap for these two time periods will be defined as:

where xf and yf are the components of the EP vectors for T1 and T2 respectively and f iterates through all files in software system S.

Finally, the Relative EP Overlap is defined as follows:

Relative EP Overlap simply shows how “similar” the effort profiles for the two periods are. If there is no overlap at all, Lrel will be 0, assuming that there were at least some commits to the system. Conversely, if the EPs are exactly the same, Lrel will be equal to 1.
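The two boundary cases can be checked with a small sketch. Both formulas in it are assumptions chosen for illustration, not necessarily the definitions used in this study: the absolute overlap is taken as the sum over files of min(xf, yf), and the relative overlap divides that sum by the larger of the two periods' commit totals. Other normalizations would satisfy the same 0 and 1 cases.

```python
def absolute_overlap(ep1, ep2):
    """Assumed definition: shared commit volume, file by file."""
    return sum(min(x, y) for x, y in zip(ep1, ep2))


def relative_overlap(ep1, ep2):
    """Assumed definition: absolute overlap over the larger period total."""
    larger_total = max(sum(ep1), sum(ep2))
    if larger_total == 0:
        # Undefined when neither period has any commits.
        raise ValueError("at least one period must contain commits")
    return absolute_overlap(ep1, ep2) / larger_total


print(relative_overlap((15, 0, 7), (15, 0, 7)))  # 1.0 (identical profiles)
print(relative_overlap((15, 0, 7), (0, 4, 0)))   # 0.0 (no shared files)
print(round(relative_overlap((15, 0, 7), (5, 10, 7)), 2))  # 0.55
```

Under these assumed definitions the value always lies between 0 and 1, and it reaches 1 only when the two profiles are identical.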

For simplicity, the rest of the article refers to the Relative EP Overlap as simply EP Overlap, or just Overlap. These are always relative values, expressed as percentages.

In this research we studied the relative overlap across adjacent, equal-size periods of time. The period lengths were 14, 30, 90, 180, 365, and 720 days. For every project, all consecutive overlaps were considered, as far as the project's duration allowed. The first overlap was excluded, however, to avoid bias: the first time period typically includes a large initial commit of all available files, which distorts the results. So, for a project XYZ that was active for two years and a time period of 90 days, the initial time sequence would be Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8. The first quarter, Q1, would be skipped, and the average would be calculated over the overlaps of the pairs (Q2, Q3), (Q3, Q4), (Q4, Q5), (Q5, Q6), (Q6, Q7), (Q7, Q8).
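The pairing scheme can be sketched as follows. This is an illustrative sketch only: the helper names `adjacent_pairs` and `average_overlap` are hypothetical, and the overlap measure is passed in as a parameter rather than fixed to any particular formula.

```python
from statistics import mean


def adjacent_pairs(periods, skip_first=True):
    """Pair each period with the next one, optionally dropping the first."""
    usable = periods[1:] if skip_first else list(periods)
    return list(zip(usable, usable[1:]))


def average_overlap(profiles, overlap, skip_first=True):
    """Average overlap(p, q) over all adjacent pairs of effort profiles."""
    pairs = adjacent_pairs(profiles, skip_first)
    if not pairs:
        raise ValueError("need at least two usable periods")
    return mean(overlap(p, q) for p, q in pairs)


# Eight 90-day periods for a project active for two years.
quarters = [f"Q{i}" for i in range(1, 9)]
for pair in adjacent_pairs(quarters):
    print(pair)
# ('Q2', 'Q3')
# ('Q3', 'Q4')
# ('Q4', 'Q5')
# ('Q5', 'Q6')
# ('Q6', 'Q7')
# ('Q7', 'Q8')
```

Skipping Q1 removes only the pair (Q1, Q2); every later period still appears in the average, once as the earlier member of a pair and once as the later one, except for Q2 and Q8, which appear once each.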

Main Results

The main results are compiled in the table:

The results can be roughly interpreted as follows:

The Effort Profile Overlap across two consecutive periods constitutes approximately one third of the period’s scope.

Or, in more detail:

In the next timeframe, only 1/3 of the effort will overlap with the areas of the system affected in the current timeframe. The remaining 2/3 of the effort will affect other areas of the system.

This result is empirical evidence of the inherent instability in how business demand affects the underlying systems. It will be interpreted in business and organizational terms in other articles adapted for a business audience.

Finally, the estimated averages in the table above are a statistically significant (p < .05) approximation of the population mean with 2% accuracy. In other words, row 3, for example, represents statistically significant evidence that the actual value lies within the [33%, 35%] interval.


Ⓒ Org Mindset, LLC