There is code duplication detection and code duplication detection


Many software solutions feature the detection of duplicated source code. Indeed, this is one cornerstone of software analysis and measurement.

It is easy to understand the value of dealing with duplicated code: avoiding the need to propagate bug fixes and changes to every copy of a faulty piece of code, promoting reuse, and avoiding an unnecessarily large code base (especially when maintenance outsourcing is billed by the line of code).

Now that everyone is convinced of the importance of such capabilities, let's dive deeper into how to do it. There are various solutions and not all are equal.

Can the difference be explained without looking at an algorithm or cryptic formulas? Let’s try.

I believe the major difference stands in how you consider your code:

  • You can approach your application source code as a collection of source files written in development languages that have rigid lexical and grammatical constructs.
  • Or you can approach your application source code as a book -- sometimes as large as an encyclopedia -- written in different languages.

If your approach is the former, you will surely end up with a technology-oriented solution, driven by detection rules, as you hope to rely on the rigid constructs. Such solutions range from simple computations of a numerical value associated with segments of code (such as a hash code) to more advanced capabilities that account for the fact that development languages are composed of statements that follow given syntaxes, not just words and verbs. Those statement syntaxes are more or less flexible (as human languages can be).
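To make the simplest end of that range concrete, here is a minimal sketch of hash-based detection. It assumes a fixed window of lines and SHA-1 as the hash; the function names and the window size are illustrative choices, not taken from any particular product.

```python
import hashlib
from collections import defaultdict


def normalize(line: str) -> str:
    # Collapse whitespace so purely cosmetic differences do not change the hash.
    return " ".join(line.split())


def find_exact_duplicates(source: str, window: int = 3) -> dict:
    """Return {hash: [start indexes]} for every window of lines seen more than once.

    Indexes refer to the list of non-blank, normalized lines.
    """
    lines = [normalize(l) for l in source.splitlines() if l.strip()]
    seen = defaultdict(list)
    for start in range(len(lines) - window + 1):
        chunk = "\n".join(lines[start:start + window])
        digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
        seen[digest].append(start)
    # Only windows whose hash occurs at least twice are reported as duplicates.
    return {h: starts for h, starts in seen.items() if len(starts) > 1}
```

With this kind of approach, two windows either share a hash or they do not: changing a single token in one copy is enough to make the match disappear.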

If your approach is the latter, you may end up with a flexible and adaptive solution inspired by information retrieval technologies -- the ones used to "understand" and "classify" the content of the Internet. This is called Natural Language Processing (NLP) and it drives search engines on the web.

Obviously, you don't have to type a whole segment of the page or document into a search engine to get the answer you’re looking for. But the engine still needs to return the pages and documents that are relevant to the search query you type in. It needs to put more importance on the discriminating words of the query and less on generic words (e.g., "how", "that"). And it needs to learn from the content that is available on the Internet.
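One common way to give generic words less weight is inverse document frequency (IDF), a standard information retrieval technique. The sketch below is an illustration under that assumption; it is not a description of how any particular search engine or CAST AIP computes its weights, and the function name is hypothetical.

```python
import math
from collections import Counter


def idf_weights(documents: list) -> dict:
    """Compute an inverse-document-frequency weight for every term.

    `documents` is a list of token lists. A term that appears in every
    document gets weight 0; the rarer the term, the higher its weight.
    """
    n = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document, however often it occurs there.
        doc_freq.update(set(tokens))
    return {term: math.log(n / df) for term, df in doc_freq.items()}


docs = [
    ["how", "to", "detect", "duplication"],
    ["how", "to", "parse", "code"],
    ["how", "to", "bill", "maintenance"],
]
weights = idf_weights(docs)
# "how" occurs in all three documents, so its weight is log(3/3) = 0.
# "duplication" occurs in only one, so its weight is log(3/1).
```

The weights are learned from the corpus itself, which is why more material tends to give a better picture of which words are discriminating.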

What is the relation with duplication? It increases your level of expectation in terms of similarity, and you’re left with a plagiarism detector that understands any language, an evolution of that language, or even a mix of languages. That is what CAST Application Intelligence Platform (AIP) does, and what you won’t find elsewhere.

But does it make a difference? You can start by asking: Did you see any improvements in recent years in terms of search engine result accuracy?

But, let us look at what it truly does:

  1. It will learn any new language or evolution of a language (such as new statements). CAST AIP offers the ability to add language support. Thanks to a Universal Analyzer, all these languages can be searched for duplicated code automatically.
  2. It will become more accurate with experience (that is, the volume of the code it processes) as there is more material to learn from.
  3. It will be able to handle huge volumes of code by design, as it has to work for Internet content classification.
  4. It will be more tolerant of code alterations between one segment and another (as natural languages are).
  5. It will be adaptable in terms of required similarity level, as opposed to hash code methods (more on that topic in the dedicated panel). Hash code methods look for code segments with the same hash code and then must weed out the non-duplicates by other means -- they cannot simply look for close hash code values.
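The adjustable similarity level in the last point can be sketched with a cosine similarity over token counts. This is a deliberately simplified stand-in for an NLP-based comparison (it ignores token order and uses no learned weights), and the function names and default threshold are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def token_counts(code: str) -> Counter:
    # Split into identifiers, numbers, and single punctuation characters.
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|\S", code))


def cosine_similarity(a: str, b: str) -> float:
    va, vb = token_counts(a), token_counts(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def is_duplicate(a: str, b: str, required_similarity: float = 0.9) -> bool:
    # The threshold is a tunable parameter, unlike an exact hash match.
    return cosine_similarity(a, b) >= required_similarity


first = "total = price * quantity + tax"
second = "total = price * quantity + fee"
# The two lines share 6 of their 7 tokens, so the similarity is about 0.86:
# not a duplicate at a 90 percent threshold, but a duplicate at 75 percent.
strict = is_duplicate(first, second, required_similarity=0.90)
relaxed = is_duplicate(first, second, required_similarity=0.75)
```

Because the comparison yields a score rather than a yes/no answer, the same mechanism can serve both a strict search for near-copies and a looser search for programs that share most of their code.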

It is worth looking under the hood to know what you will truly get with the code duplication detection solution you use, don't you think?

More on code detection adaptability

Why does code detection adaptability matter? As a first example, let us look at the following code snippets:

[Image: code snippets]

These snippets are part of a larger program, yet they illustrate the NLP capability: some significant alterations can take place, yet the NLP-based detection algorithm still catches them (required similarity of 90 percent and size greater than 10 lines of code). As a side note, the hash codes for these snippets are useless for understanding how different the snippets are: 1620161396, 16215447, and -209797344. In other words, traditional hash-code-based algorithms would not catch them.
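The point about hash values can be reproduced in a few lines. The sketch below assumes a signed 32-bit hash derived from SHA-1 and uses Python's `difflib` as a rough textual similarity measure; neither is claimed to be what produced the values quoted above, and the function names are hypothetical.

```python
import difflib
import hashlib


def signed_hash32(text: str) -> int:
    # Take the first 4 bytes of the SHA-1 digest as a signed 32-bit integer.
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big", signed=True)


def text_similarity(a: str, b: str) -> float:
    # Ratio in [0, 1] based on the matching character blocks of the two strings.
    return difflib.SequenceMatcher(None, a, b).ratio()


snippet_a = "total = price * quantity + tax"
snippet_b = "total = price * quantity + fee"

hash_a, hash_b = signed_hash32(snippet_a), signed_hash32(snippet_b)
similarity = text_similarity(snippet_a, snippet_b)
# The snippets differ only in their last word, so `similarity` is high,
# while the numeric gap between hash_a and hash_b carries no information
# about how close the snippets are.
```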

The question is therefore: would YOU want to be able to catch them as well? If so, be sure to pick the right code detection solution, as they are obviously not equal.

As another example, using different parameter values that are set up to catch larger programs which share most of their code (required similarity of 75 percent and size greater than 100 lines of code), one gets:

[Images: three code snippets]

The main difference comes down to only half of a single paragraph:

[Images: two code snippets]

Traditional code duplication detection will never catch these examples as duplicated code. However, they share enough to be good candidates for refactoring into shared code, and they illustrate the value one can get from clever NLP-based code duplication detection.

Philippe-Emmanuel Douziech
Philippe-Emmanuel Douziech is a Principal Research Scientist at CAST Research Labs and is the Head of the European Science Directorate at CISQ. He has worked in the software industry for more than 20 years and is skilled at assessing software risk and quality.