By Alex Yakyma
In this article we explore one crucial component of software complexity – the interconnectedness of the key codebase entities (classes, in our case). This is a critical discriminator of complexity: the higher the number of such connections, the higher the variability in development activities, the higher the risks, and the less reliable the estimates. This is especially important as enterprises deal with ever larger software solutions to serve their growing business needs. Based on empirical evidence, we show that interconnectedness in a codebase has a heavy-tailed distribution, which implies that significant, impactful risks are not rare events. This also implies that the “volatility” of code grows with size, and we provide an approximate formula that describes this growth.
Method and Data
In this research we used a custom-built program to analyze static dependencies between classes in a codebase. Java codebases were used as the input. A total of 52 open-source projects served as data sets, with size (expressed in number of classes) varying from as few as 13 classes to as many as 25,485 classes in the solution. For every solution and every class within the solution, the program calculated the degree (in the graph-theoretic sense; see the references for more detail on the definition). The degree of class A was calculated as the cumulative number of static references class A has to other classes and other classes have to class A.
In the example in figure 1, class A has a degree of 4, class B a degree of 2, class C of 1, and class D of 4. A codebase was therefore treated as an undirected graph. The primary method was probabilistic analysis of the distribution of class degrees in a codebase.
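The degree calculation described above can be sketched in a few lines. This is only an illustration, not the analyzer used in this research: the function name `class_degrees`, the class names, and the reference list are hypothetical, and the sketch assumes that each distinct (source, target) reference counts once and that self-references are ignored.

```python
from collections import Counter

def class_degrees(references):
    """Return {class_name: degree} from (source, target) static references.

    Assumptions (not stated in the article): repeated references between the
    same ordered pair count once, and self-references are ignored.
    """
    unique_edges = {(src, dst) for src, dst in references if src != dst}
    degrees = Counter()
    for src, dst in unique_edges:
        degrees[src] += 1  # reference that src has to another class
        degrees[dst] += 1  # reference that another class has to dst
    return dict(degrees)

# Hypothetical references, not taken from any of the analyzed projects.
example_references = [
    ("Order", "Customer"),
    ("Order", "Item"),
    ("Invoice", "Order"),
    ("Customer", "Order"),
    ("Order", "Item"),  # duplicate, counted once
]
example_degrees = class_degrees(example_references)
print(example_degrees)
```

In this toy input, "Order" collects references in both directions, so its degree is the sum of its outgoing and incoming references.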
Test code was manually removed from every project prior to processing, to avoid any potential bias created by test classes.
Two main values were initially considered for each codebase: the mean and the standard deviation of class degrees. The following results were obtained (figure 2):
After considering the table, two additional data points were obtained:
- (1) The Pearson correlation between the mean degree and codebase size turned out to be 0.185, which is very low and essentially means no correlation.
- (2) The Pearson correlation between the standard deviation and codebase size was 0.687, which is quite high and implies a significant correlation between the two data series.
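For readers who want to reproduce this kind of comparison on their own data, the Pearson correlation can be computed as sketched below. The per-project numbers are hypothetical placeholders, not the values from figure 2, and the function name `pearson` is introduced here only for illustration.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical per-project summaries (NOT the article's data).
sizes = [10, 100, 1000, 10000]   # number of classes per project
mean_degrees = [8, 12, 7, 11]    # mean class degree per project
stdev_degrees = [9, 16, 28, 50]  # standard deviation of class degrees

print("r(size, mean)  =", round(pearson(sizes, mean_degrees), 3))
print("r(size, stdev) =", round(pearson(sizes, stdev_degrees), 3))
```

The coefficient always lies between -1 and 1; values near 0 indicate no linear relationship, and values near 1 indicate a strong positive one.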
In this article we focus primarily on immediate results, leaving economic interpretations to upcoming articles on the topic. For that reason, while the low correlation in point (1) has its own meaning, we will address it separately in the future; the rest of this article further explores the nature of the correlation in point (2).
It is critical to understand what a large standard deviation really implies. Consider an example. The “RxJava” project has a mean class degree of 12 (in other words, 12 dependencies per class on average). But given that the standard deviation is ~36, there will be a relatively large number of classes with a degree of 48 or more, that is, at least one standard deviation above the mean. Even if we imagined that the mean was 0, there would be many classes with degrees near 36, creating a significant ripple effect for software change. Also note that the vast majority of our 52 projects have a “skewed” distribution of degrees, which explains why the standard deviation is so often higher than the mean.
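A small numerical illustration may help show how a skewed distribution produces a standard deviation larger than the mean. The degrees below are made up for the purpose (they are not RxJava's), and the use of the population standard deviation is an assumption, since the article does not say which variant was used.

```python
import statistics

# Hypothetical degrees for a 100-class codebase: mostly small values,
# plus two heavily connected "hub" classes.
degrees = [2] * 90 + [5] * 8 + [150, 200]

average = statistics.fmean(degrees)
spread = statistics.pstdev(degrees)  # assumption: population standard deviation

# Classes lying more than one standard deviation above the mean.
hubs = [d for d in degrees if d > average + spread]

print(f"mean={average:.1f} stdev={spread:.1f} classes above mean+stdev={len(hubs)}")
```

Here just two hub classes are enough to push the standard deviation to several times the mean, even though the typical class has very few dependencies.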
As a result of simple iterative interpolation, it turns out that for a codebase containing N classes, N^0.25 is better correlated with the standard deviation than N itself. Namely, the Pearson correlation of N^0.25 and the standard deviation is 0.846 (as opposed to the 0.687 obtained for r(N, STDEV)). This is a very high number that suggests a very strong correlation. Using C·N^0.25 with a constant C to approximate the standard deviation, we obtain C = 5 as a close-to-optimum value in the least-squares sense. Therefore we can use the following approximate formula for the standard deviation (or, in a sense, the “volatility” of a codebase):
Standard Deviation = 5·N^0.25
The graph of this function is shown in figure 3.
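To make the formula concrete, the sketch below evaluates it for the smallest and largest codebases in the data set (13 and 25,485 classes) and shows one way a constant C could be obtained by least squares for a fixed exponent. The function names and the synthetic fitting data are hypothetical; the article does not describe the exact fitting procedure, so the closed-form fit of a line through the origin is an assumption.

```python
def predicted_stdev(n_classes, coefficient=5.0, exponent=0.25):
    """Approximate standard deviation of class degrees: C * N**exponent."""
    return coefficient * n_classes ** exponent

def fit_coefficient(sizes, stdevs, exponent=0.25):
    """Least-squares C for stdev ~ C * N**exponent (a line through the origin).

    Minimizing sum((y - C*x)**2) with x = N**exponent gives
    C = sum(x*y) / sum(x*x).
    """
    xs = [n ** exponent for n in sizes]
    return sum(x * y for x, y in zip(xs, stdevs)) / sum(x * x for x in xs)

# Formula evaluated at the extremes of the data set described in the article.
print(round(predicted_stdev(13), 2))     # roughly 9.5
print(round(predicted_stdev(25485), 2))  # roughly 63.2

# Synthetic data that follows the formula exactly (NOT the article's data).
synthetic_sizes = [16, 81, 256, 625]
synthetic_stdevs = [10, 15, 20, 25]
print(fit_coefficient(synthetic_sizes, synthetic_stdevs))
```

Under this formula, a codebase almost two thousand times larger has a standard deviation only about 6.7 times larger, which reflects the slow but unbounded growth of N^0.25.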
This function is monotonic and unbounded, so as the codebase grows, the standard deviation grows too; as N tends to infinity, so does the standard deviation. In a local sense, the latter implies that class degrees have a heavy-tailed distribution. Overall we conclude that:
In a large codebase, extreme events have significant probability.
Lastly, it is interesting to compare statements (1) and (2). The data suggests that (1) has a lower impact on code volatility than (2). Indeed, even with a relatively low mean value, a huge standard deviation introduces very significant levels of uncertainty.
The p-value for the Pearson correlation of N^0.25 and the standard deviation in our case (r = 0.846, sample size 52) is less than .000001, which means that a correlation this strong is very unlikely to have arisen by chance.
The research will continue with further exploration of the complexity of codebases. Understanding the System Under Development is crucial to managing the complexity of the development organization.
In developing the dependency analyzer program, JavaParser, a wonderful open-source library, was used (see link below). The author is very grateful to the group of enthusiasts who created such a useful toolset.
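The order of magnitude of this p-value can be checked with a short calculation. The sketch below does not reproduce the exact test used for the article (which is not specified); it assumes a two-sided test and uses the Fisher z-transformation with a normal approximation, alongside the usual t-statistic for a correlation coefficient. The function names are hypothetical.

```python
import math

def correlation_t_statistic(r, n):
    """t-statistic for testing a Pearson correlation against zero (n - 2 d.o.f.)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def approx_two_sided_p_value(r, n):
    """Approximate two-sided p-value via the Fisher z-transformation.

    Under the null hypothesis of zero correlation, atanh(r) * sqrt(n - 3)
    is approximately standard normal; erfc gives the two-sided tail area
    without losing precision for very small probabilities.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))

t_value = correlation_t_statistic(0.846, 52)
p_value = approx_two_sided_p_value(0.846, 52)
print(f"t = {t_value:.2f}, approximate p = {p_value:.2e}")
```

Both the t-statistic (above 11 with 50 degrees of freedom) and the approximate p-value are consistent with the claim that p is below .000001.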
- R. J. Trudeau, Introduction to Graph Theory, Dover Publications, 2015.
- E. S. Keeping, Introduction to Statistical Inference, Dover Publications, 2015.
- JavaParser. http://javaparser.org.
- E. Parzen, Stochastic Processes, Holden-Day, 1962.
Ⓒ Org Mindset, LLC