Why are there so many hurdles to efficient SAM benchmarking?

Two opposite sides

When it comes to Software Analysis and Measurement (SAM) benchmarking, people's behavior generally falls into one of two categories:

  • "Let's compare anything and draw conclusions without giving any thought to relevance and applicability"
  • "There is always something that differs and nothing can ever be compared"

As is often the case, there is a sensible middle ground.

Benchmarking challenges

Some of the most common reasons for objecting to comparing SAM results are:

  • Applications use different technological stacks, with variable thresholds on the level of difference that matters, such as:
    • Object-oriented vs. procedural, implying that comparing object-oriented applications together is OK
    • JEE vs. .NET, implying that comparing object-oriented applications together is not OK if they are not of the same flavor
    • With or without Hibernate framework (or any other framework for that matter), implying that comparing applications using different frameworks is not OK
  • Measurement capability evolves with
    • New measure elements, to check for new risk-inducing patterns
    • Improved measure elements, to reduce false positives
  • Measurement process relies on some contextual information, such as
    • Target architecture
    • Vetted libraries
    • In-house naming norms

All of the above reasons make perfect sense, and one has to be well aware of such situations. However, none of them is grounds for dismissing comparison altogether.

Built-in clutch

A well-designed measurement model can help overcome these challenges. First, in a measurement model that follows a Goal-Question-Metric approach, the three levels act like a built-in clutch.

How so? The Metrics needed to answer each Question can differ from one technology stack to another, or evolve from one release of the measurement platform to another, without invalidating the results. The same holds for the number and nature of the Questions one has to ask to measure the level of achievement of each Goal.

Second, in a measurement model that uses a compliance ratio to consolidate and aggregate results (as opposed to a raw count of non-conformities), this statistical processing also acts as a built-in clutch.

From one technology stack to another

At the Question level, the number of contributing Metrics may differ because capabilities differ from one technology to another, and so do the sources of issues. As long as the Question is well answered, whether by fewer or by more Metrics, the difference is normal and acceptable.

At the Goal level, the number of contributing Questions may differ as well: the difference is inherent to the technology and is therefore normal and acceptable.
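
As a rough illustration of this clutch effect, the Python sketch below rolls per-Metric compliance ratios up to Questions and then to a Goal. The rule names, the figures, and the plain-average aggregation are assumptions made for this example; they are not the aggregation of any particular measurement platform.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Metric:
    name: str
    checked: int      # objects the rule was applied to
    violations: int   # objects that failed the rule

    def compliance(self) -> float:
        # Share of checked objects that pass; 1.0 when nothing was checked.
        if self.checked == 0:
            return 1.0
        return 1 - self.violations / self.checked


def question_score(metrics: List[Metric]) -> float:
    # Assumption: a Question is the plain average of its Metrics.
    return mean(m.compliance() for m in metrics)


def goal_score(questions: Dict[str, List[Metric]]) -> float:
    # Assumption: a Goal is the plain average of its Questions.
    return mean(question_score(ms) for ms in questions.values())


# Hypothetical COBOL application: one Question, one Metric.
cobol_changeability = {
    "Is algorithmic complexity under control?": [
        Metric("paragraph too long", checked=400, violations=40),
    ],
}

# Hypothetical JEE application: an extra Question with its own Metrics.
jee_changeability = {
    "Is algorithmic complexity under control?": [
        Metric("method too long", checked=1000, violations=100),
    ],
    "Is object-oriented complexity under control?": [
        Metric("inheritance too deep", checked=200, violations=20),
        Metric("excessive polymorphism", checked=200, violations=60),
    ],
}

print(round(goal_score(cobol_changeability), 3))
print(round(goal_score(jee_changeability), 3))
```

Both applications end up with a Goal score on the same 0-to-1 scale even though the JEE score is built from more Questions and Metrics. In this made-up data the JEE score comes out lower because of the excessive polymorphism, which is the kind of information the comparison is meant to surface.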

For example, there is no way to over-complicate COBOL code with coding and architectural practices related to object-oriented capabilities, while there are many ways to do so with JEE. This comparison might seem unfair to JEE, which is assessed against more rules and technical criteria than COBOL, but this is inherent to their respective capabilities, and a fair assessment of transferability or changeability risk must take this difference into account.

As this assessment result will guide a decision for resource allocation, it is critical to know that in this JEE component, there is not only some regular algorithmic or SQL or XYZ complexity, but there is an additional load of complexity due to excessive polymorphism or XYZ.

From one release to another

New releases of the measurement system are designed to deliver more accurate assessment results. However, added accuracy will impact results. Using a compliance ratio helps limit that impact in the cases where it ought to be limited.

  • If newly supported syntax or an extra capability leads to new objects being tested, and their quality level is similar to that of the objects covered previously, the ratio will be stable.
  • If newly supported syntax or an extra capability leads to new objects being tested, and their quality level is clearly worse than that of the objects covered previously, the ratio will go down, and rightly so: it makes quality issues visible.
  • The same reasoning applies to a new quality check: there is no impact if the quality level is similar, and there is an impact only if there is a new piece of information to be known.
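
The first two cases above can be checked with a few lines of arithmetic. The Python sketch below uses made-up figures (1,000 objects checked before the upgrade, 500 more after it) and a hypothetical helper named compliance_ratio.

```python
def compliance_ratio(checked: int, violations: int) -> float:
    """Share of checked objects that comply with a rule."""
    return 1 - violations / checked


# Previous release of the measurement platform (hypothetical figures):
# 1,000 objects checked, 100 of them in violation.
before = compliance_ratio(1000, 100)

# New release: newly supported syntax brings 500 more objects into scope.
# Case 1: the new objects have the same 10% violation rate.
similar_quality = compliance_ratio(1000 + 500, 100 + 50)
# Case 2: the new objects have a 40% violation rate.
worse_quality = compliance_ratio(1000 + 500, 100 + 200)

print(f"before:          {before:.2f}")
print(f"similar quality: {similar_quality:.2f}")
print(f"worse quality:   {worse_quality:.2f}")
```

With these figures the ratio stays at 0.90 in the first case and drops to 0.80 in the second, so the result only moves when the new release has uncovered something new.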

Why not rely on a raw count of violations?

With a raw count, any extension of the measurement system increases the number of violations reported, regardless of the quality level. Any new rule can lead to more violations, as it turns invisible quality issues into visible ones, even if the compliance ratio for the new rule is better than that of the rest of the pack.
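
To make the contrast concrete, here is a minimal Python sketch in which a new release adds a rule that the application follows better than the existing ones. The rule names, the figures, and the unweighted mean used to aggregate the ratios are all assumptions for illustration.

```python
from statistics import mean

# (checked objects, violations) per rule; all figures are hypothetical.
rules_before = {
    "rule A": (1000, 100),
    "rule B": (1000, 200),
}

# Same application, same code: the new release adds rule C,
# which the application follows better than rules A and B.
rules_after = dict(rules_before)
rules_after["rule C (new)"] = (1000, 50)


def raw_count(rules):
    # Total number of violations across all rules.
    return sum(violations for _, violations in rules.values())


def mean_compliance(rules):
    # Assumption: unweighted mean of the per-rule compliance ratios.
    return mean(1 - violations / checked for checked, violations in rules.values())


print(raw_count(rules_before), round(mean_compliance(rules_before), 3))
print(raw_count(rules_after), round(mean_compliance(rules_after), 3))
```

The raw count goes from 300 to 350 although the code has not changed, while the mean compliance ratio moves from 0.85 to about 0.883, reflecting that the application does comparatively well on the new rule.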

From one context to another

This area is perhaps the most delicate, because the ability to compare relies on the assumption that the contextual input supplied by people, while it differs from one application to another, is set in good faith.

To expect fairness may seem naïve, but a lot of the measurement process already relies on some amount of fairness. For example, what are the true boundaries of the application you’re measuring? Nowadays, with cross-app principles, the definition is blurred and it could be easy to omit part of the application to hide some facts from management.

One could also define a target Architecture that would work in their best interest, but that would mean entering into the system some flawed configuration data that anyone can review.

One could also declare libraries as vetted to prevent security-related vulnerabilities from being reported, thereby improving assessment results, but that too would mean entering flawed configuration data.

At the end of the day

Yes, there can be true differences between the way multiple applications are assessed.

However, this is no reason to dismiss benchmarking altogether. What matters is not to hide the differences and their impact.

As Dr. Bill Curtis stated during the CISQ (http://www.it-cisq.org/) Seminar in Berlin on June 19th, when explaining the key factors for conducting a clever productivity analysis: always inspect the data, investigate outliers and extreme values, and always question the results.

In other words, always use your brain.

Philippe-Emmanuel Douziech
Philippe-Emmanuel Douziech is a Principal Research Scientist at CAST Research Labs and the Head of the European Science Directorate at CISQ. He has worked in the software industry for more than 20 years and is skilled at assessing software risk and quality.