This post is based on an interesting paper about managing technical debt at Google.
Google's source code runs to hundreds of millions of lines of code (LOC) and is, for the most part, monolithic. One advantage of a monolithic code base is that it allows a uniform development style to be adopted across the company: there are written, language-specific style guides that engineers must follow, a single build system that builds code for all projects, and a single testing infrastructure that runs all unit tests. This lets tool writers run analyses over the whole code base, and lets everyone use the same code review tools and the same index to search the code. However, the monolithic style of the code base also makes it easy to introduce technical debt.
For example, imagine that someone introduces a low-level API, later realizes that the original design has problems, and wishes to replace it with a sounder version. By this point, however, hundreds of other projects depend on the original API, and replacing it is extremely challenging. Even simple changes, like renaming a class or moving it to another package, can take days or weeks in such a large code base. The paper describes efforts to control and manage the technical debt that has built up in one part of the code base (the build system), organized around the following principles:
- Automation: use automated techniques to analyze and fix the issues that contribute the most technical debt. Several teams also work on ways to make large-scale changes easier in the future.
- Make it easy to do the right thing: technical debt is often incurred unknowingly. If what leads people to incur debt is analyzed, measures to prevent it can be put in place.
- Make it hard to do the wrong thing: similar to the previous principle, but with a stronger emphasis on strict checks that stop developers from, for example, creating dependencies on code that is not ready for launch.
The focus is on technical debt that is domain independent and cuts across product boundaries. At Google this is called build debt, because it has accumulated in build specifications. The rest of the paper describes specific forms of this debt, why it is so difficult to manage, and some of the processes used to reduce it.
Google’s Build System Debt
The specifications for building software at Google live in BUILD files, which define modules of code. These BUILD files are primarily maintained by hand, which can be a nuisance for engineers. Through them, engineers specify the dependencies between different libraries and software components; over time, however, the specifications can diverge from the dependencies actually needed to build, test, and execute the software. This is where technical debt accumulates unless engineers are attentive and keep the source code and its declared dependencies aligned. Aside from this build dependency debt, abandoned targets can also become a problem. The most extreme cases, targets that have not been built successfully for months, are called zombie targets.
Google's original build system was entirely open: any target could depend on the internal details of any other project, which sometimes led to unwanted coupling between projects. This is another source of technical debt: when a lower-level project unintentionally exposes internal details and other projects come to rely on them, encapsulation is violated and future modifications become more costly. Because dependencies point in only one direction, the team being depended on would find out it had broken something only when a complaint arrived. At that point, the technical debt has to be repaid before either team can move on to something else.
Dependency debt causes two problems in the build system. First, over-declared dependencies (those the paper identifies as unnecessary) slow down the build and test systems, which do extra work on them. Second, under-declared direct dependencies make a project's build brittle: the target builds only because something it needs happens to appear in the transitive closure of its declared dependencies, so a change elsewhere in that closure can remove the unspecified dependency and break the build. The problem is felt both by the project with the under-declared dependency and by the project being depended on. The existence of under-declared dependencies also makes it risky to remove over-declared ones.
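The brittleness described above can be illustrated with a small Python sketch. This is not Google's build system: the target names (`app`, `net`, `strings`) and the helper functions are hypothetical, and a "build" is reduced to checking that everything a target's sources use is reachable through declared dependencies.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def transitive_closure(target: str, declared: Graph) -> Set[str]:
    """Return every target reachable from `target` via declared dependencies."""
    seen: Set[str] = set()
    stack = list(declared.get(target, set()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(declared.get(dep, set()))
    return seen


def builds(target: str, declared: Graph, used: Graph) -> bool:
    """Toy build check: everything the sources use must be reachable."""
    return used.get(target, set()) <= transitive_closure(target, declared)


# Hypothetical repository: `app` uses `strings` directly but declares only `net`.
declared: Graph = {"app": {"net"}, "net": {"strings"}, "strings": set()}
used: Graph = {"app": {"net", "strings"}, "net": {"strings"}, "strings": set()}

# The build works, but only because `strings` is reachable through `net`.
before = builds("app", declared, used)

# The owners of `net` refactor and drop their `strings` dependency.
declared["net"] = set()
used["net"] = set()

# `app` now fails, although nobody touched `app` or its BUILD file.
after = builds("app", declared, used)
```

Here `before` is `True` and `after` is `False`: a legitimate cleanup in `net` breaks `app`, which is why removing dependencies is risky while under-declared ones still exist.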
One possible, but problematic, solution is to hold a global refactoring day, on which engineers set aside their other duties to focus on fixing build rules. This has several drawbacks: it is not automated, and it does nothing to stop new debt from being taken on. Instead, Google has developed several tools that partially automate the process. The first step is to find all under-declared dependencies and add them to the existing BUILD files, and then to emit a warning whenever source code references a class from an indirect (transitive) dependency. A warning is also displayed whenever the build system detects a missing dependency in the future. These steps toward automation make it harder for engineers to do the wrong thing.
The next step is to find all over-declared dependencies: once under-declared dependencies are no longer permitted, unnecessary dependencies can be deleted without risk. Together, these steps keep dependency debt from forming in the future.
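The two-step cleanup can be sketched as a pair of set differences. This is a minimal illustration, not the tooling the paper describes: it assumes the set of targets each target actually uses is already known (in practice that would come from analyzing the source code), and all names are hypothetical.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def under_declared(declared: Graph, used: Graph) -> Graph:
    """Per target: dependencies the sources use but the BUILD file omits."""
    return {t: used.get(t, set()) - deps for t, deps in declared.items()}


def over_declared(declared: Graph, used: Graph) -> Graph:
    """Per target: dependencies the BUILD file lists but the sources never use."""
    return {t: deps - used.get(t, set()) for t, deps in declared.items()}


def repair(declared: Graph, used: Graph) -> Graph:
    """Apply the two steps in order: add missing deps, then drop unused ones."""
    # Step 1: make every direct dependency explicit across the whole repository.
    missing = under_declared(declared, used)
    step1 = {t: deps | missing[t] for t, deps in declared.items()}
    # Step 2: with nothing relying on transitive deps, removal is now safe.
    extra = over_declared(step1, used)
    return {t: deps - extra[t] for t, deps in step1.items()}


# Hypothetical repository state.
declared: Graph = {
    "app": {"net", "legacy_utils"},
    "net": {"strings"},
    "strings": set(),
    "legacy_utils": set(),
}
used: Graph = {
    "app": {"net", "strings"},
    "net": {"strings"},
    "strings": set(),
    "legacy_utils": set(),
}

missing = under_declared(declared, used)
unused = over_declared(declared, used)
repaired = repair(declared, used)
```

In this example `app` is missing `strings` and carries an unused `legacy_utils`; after `repair`, its declared dependencies match what it uses. The ordering matters: running step 2 first could remove a dependency that some other target was silently relying on.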
At Google a large percentage of files change every month, which creates opportunities for things to break. The most critical breakages are fixed immediately, but some code is forgotten, and the build targets that assemble it can be abandoned. Because of Google's monolithic code base, these dead targets impede efficiency: depending on existing source code is often helpful, but it loses its value when time has to be spent fixing forgotten code. To repay this debt, broken targets must first be categorized as either transient breakages or long-term zombies. An automated process records when each target was last built successfully and when a build was last attempted, and uses this to produce a list of zombie targets. Targets that still fail to build after 90 days can be deleted, given permission from the product's owner.
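A rough sketch of the classification step might look like the following. The record layout, the labels, and the example dates are assumptions made for illustration; only the two timestamps and the 90-day threshold come from the description above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

# Threshold from the text; treating exactly 90 days as a zombie is an assumption.
ZOMBIE_AGE = timedelta(days=90)


@dataclass(frozen=True)
class BuildHistory:
    target: str          # hypothetical target name
    last_success: date   # last time the target built successfully
    last_attempt: date   # last time a build was attempted


def classify(history: BuildHistory, today: date) -> str:
    """Label a target as healthy, a transient breakage, or a zombie."""
    if history.last_success >= history.last_attempt:
        return "healthy"  # the most recent attempt succeeded
    if today - history.last_success >= ZOMBIE_AGE:
        return "zombie"   # broken for 90 days or more
    return "transient"    # broken, but only recently


def zombie_report(histories: List[BuildHistory], today: date) -> List[str]:
    """List deletion candidates; deleting still needs the owner's permission."""
    return [h.target for h in histories if classify(h, today) == "zombie"]


today = date(2024, 6, 1)  # made-up dates for the example
histories = [
    BuildHistory("//shop:cart", date(2024, 5, 30), date(2024, 5, 30)),
    BuildHistory("//shop:search", date(2024, 5, 20), date(2024, 5, 31)),
    BuildHistory("//old:importer", date(2024, 1, 15), date(2024, 5, 31)),
]
candidates = zombie_report(histories, today)
```

Only `//old:importer` ends up in `candidates`. The report is deliberately just a list: the deletion itself stays a separate step gated on the owner's approval.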
To eliminate unwanted coupling between projects, where one project depends on the internals of another, Google switched the default visibility of targets from public to private. The intended effect was to let engineers decide up front whether they want other projects to depend on their code: when a target is private, permission from the target's owner is needed to use it. This makes undesired coupling harder to introduce.
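The effect of a private default can be sketched as a simple check run when one target tries to depend on another. The model below is an assumption for illustration, not the actual build system's semantics: a target with an empty visibility set is private to its own package, the owner grants access by adding packages to that set, and the `PUBLIC` marker is a made-up constant.

```python
from dataclasses import dataclass, field
from typing import Set

PUBLIC = "PUBLIC"  # hypothetical marker meaning "anyone may depend on this"


@dataclass
class Target:
    name: str
    package: str
    # Private by default: an empty set grants access to no other package.
    visibility: Set[str] = field(default_factory=set)


def may_depend_on(consumer: Target, provider: Target) -> bool:
    """Return True if `consumer` is allowed to list `provider` as a dependency."""
    if consumer.package == provider.package:
        return True  # targets in the same package can always see each other
    return PUBLIC in provider.visibility or consumer.package in provider.visibility


# Hypothetical targets: an internal helper and a client in another package.
helper = Target("parser_internal", package="storage")
client = Target("dashboard", package="frontend")

denied = may_depend_on(client, helper)   # private by default

# The owner of `helper` explicitly grants access to the frontend package.
helper.visibility.add("frontend")
allowed = may_depend_on(client, helper)
```

With the default flipped to private, `denied` is `False` until the owner opts in, after which `allowed` is `True`. Coupling then requires a deliberate decision by the owning team rather than happening by accident.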
The technical debt accumulated from dead flags (flags that no longer have a use) is difficult to evaluate, because the dead flags themselves cause little damage. The problem is that they often guard dead code, make refactoring more difficult, and make the code harder to read. Identifying and removing dead flags (only those confirmed to be dead) revealed a high ratio of deleted lines of code per dead flag: deleting 2,300 dead flags removed approximately 272,000 lines of code along with them, or roughly 118 lines per flag. The conclusion was that dead flags are the most basic level of dead code identification, and possibly the easiest to access.
Each of the above types of technical debt in Google's build system is highly revealing, not only of how Google works but also of how infectious and difficult to manage technical debt can be in any large organization. Several of the steps Google took to handle its mounting technical debt, such as identification, prioritization, and automation, also appear in a wide range of technical debt posts by other authors. The paper explains in detail just how difficult these steps can be to implement, and provides excellent data showing how helpful they are once in place.
To read the full paper go to: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/37755.pdf