Software Composition Analysis: How Open Source Component Detection Works


How CAST Crawls Open Source and Feeds Our Database

The Software Composition Analysis (SCA) feature in CAST Highlight uses patented techniques to detect the presence and the origin of third-party components in source code. This information supports vulnerability identification and IP and license compliance, and understanding how the overall mechanics work helps you get the full value from SCA.

To build our Open Source knowledge base, we continuously crawl several platforms for source code (e.g. GitHub and GitLab), packages (e.g. npm, NuGet and PyPI) and binaries (e.g. Maven).

There are four primary steps to crawling:

1. Cloning a copy of an open source component, including all versions, revisions, commits and files (e.g. .java, .js, .jar, .dll, .xml and .md).

2. Starting from day one of the project (i.e. the very first version in chronological order), we compute a unique key for each file using the standard SHA-256 secure hash algorithm from NIST and associate it with the file’s creation timestamp. These keys are the fingerprints. If a given file is not modified across the project timeline, its fingerprint remains the same. As soon as the file is modified, whether by a new line of code, an instruction or a comment, a different fingerprint is computed and stored with the modification timestamp.

3. All component fingerprints are stored in a database along with the version details and component metadata, including the version number, release date, component name, license, origin platform URL, technology and project metrics.

4. Before inserting any fingerprint in the database, the system checks whether the same fingerprint is already present and linked to another component. If that’s the case, the system retains only the oldest fingerprint based on its timestamp, which is the commit date for a source code repository and the push date for a package archive.
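
The four steps above can be illustrated with a minimal Python sketch. It is not CAST Highlight’s implementation: the `Fingerprint` and `FingerprintStore` names, the in-memory dictionary standing in for the database, and the sample component names are all assumptions made for illustration. Only the use of SHA-256 and the "oldest fingerprint wins" rule come from the description above.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fingerprint:
    sha256: str      # hex digest of the file content
    component: str   # hypothetical component identifier
    version: str
    timestamp: int   # commit or push date as a Unix timestamp


def fingerprint(content: bytes, component: str, version: str, timestamp: int) -> Fingerprint:
    """Hash the file content with SHA-256 and attach its timestamp."""
    return Fingerprint(hashlib.sha256(content).hexdigest(), component, version, timestamp)


class FingerprintStore:
    """In-memory stand-in for the database: one entry per hash, oldest wins."""

    def __init__(self) -> None:
        self._by_hash: dict = {}

    def insert(self, fp: Fingerprint) -> Fingerprint:
        existing = self._by_hash.get(fp.sha256)
        # Keep the new fingerprint only if the hash is unknown or it is older.
        if existing is None or fp.timestamp < existing.timestamp:
            self._by_hash[fp.sha256] = fp
        return self._by_hash[fp.sha256]

    def origin_of(self, sha256: str) -> Optional[Fingerprint]:
        return self._by_hash.get(sha256)


# The same stylesheet appears in two hypothetical components.
stylesheet = b"body { margin: 0 }"
store = FingerprintStore()
store.insert(fingerprint(stylesheet, "fork-ui", "2.0", 1_600_000_000))
store.insert(fingerprint(stylesheet, "original-ui", "1.0", 1_400_000_000))
origin = store.origin_of(hashlib.sha256(stylesheet).hexdigest())
```

Whatever order the two components are crawled in, `origin` ends up pointing at `original-ui`, because its timestamp is the older one.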

Why It’s Important to Check Temporal Anteriority

When teams decide to use Open Source components within an application, they often don’t know much about the origin of the OSS, except perhaps whether it is a fork of another component. However, traceability of what composes software is key for business applications. This matters increasingly from a vulnerability standpoint (e.g. a malicious file somewhere in a repository can make the whole application vulnerable), and even more so when legal and license compliance concerns are raised.

For instance, if you use a UI JavaScript/CSS component under an MIT license that has copied several stylesheets and scripts directly from an older repository under GNU GPL 3.0, you will need to track this from a licensing perspective. Classic SCA tools often overlook these details, but CAST Highlight’s temporal analysis spots license inheritance immediately.
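
To make the license inheritance idea concrete, here is a hedged Python sketch of the kind of check temporal anteriority enables. The `Origin` record, the `inherited_licenses` function, the placeholder hashes and the license strings are hypothetical. The sketch assumes the oldest known origin of each file hash has already been resolved, and it does not describe how CAST Highlight actually performs the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    component: str   # component where the file was first seen
    license: str     # license of that component
    first_seen: int  # Unix timestamp of the oldest occurrence


def inherited_licenses(component, declared_license, file_hashes, origins):
    """Return (hash, origin) pairs for files first seen elsewhere under another license."""
    findings = []
    for file_hash in file_hashes:
        origin = origins.get(file_hash)
        if origin is None:
            continue  # file not known to the knowledge base
        if origin.component != component and origin.license != declared_license:
            findings.append((file_hash, origin))
    return findings


# Placeholder hashes: "h1" was first seen in an older GPL repository.
origins = {
    "h1": Origin("old-gpl-lib", "GPL-3.0", 1_300_000_000),
    "h2": Origin("ui-kit", "MIT", 1_500_000_000),
}
findings = inherited_licenses("ui-kit", "MIT", ["h1", "h2", "h3"], origins)
```

Here `findings` contains only the `h1` entry: the file ships inside the MIT-licensed `ui-kit` but was first seen in a GPL-licensed component.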

How CAST Highlight Scans and Understands Application Portfolios

Whether you use the command line or the local agent to scan source code, CAST Highlight computes SHA-256 fingerprints of files using the same algorithm as the Open Source crawlers described above. These fingerprints are stored in the CSV files produced by CAST Highlight’s analyzers. Because binary files (.jar, .dll, etc.) are not parsed in depth by Highlight, they are detected in parallel and fingerprinted in a separate CSV.
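
As a rough illustration of this scanning step, the sketch below walks a directory, hashes every file with SHA-256 and writes source and binary fingerprints to two separate CSV files. The column names, the `BINARY_SUFFIXES` set and the `write_fingerprints` function are assumptions; the actual CSV layout produced by CAST Highlight’s analyzers is not described here.

```python
import csv
import hashlib
from pathlib import Path

# Assumption: binaries are recognized by extension only.
BINARY_SUFFIXES = {".jar", ".dll"}


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large binaries do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_fingerprints(root: Path, source_csv: Path, binary_csv: Path) -> None:
    """Write one (path, sha256) row per file; binaries go to their own CSV.

    The output files should live outside `root`, otherwise a later scan
    would fingerprint them as well.
    """
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with source_csv.open("w", newline="") as src, binary_csv.open("w", newline="") as binf:
        writers = {False: csv.writer(src), True: csv.writer(binf)}
        for writer in writers.values():
            writer.writerow(["path", "sha256"])
        for path in files:
            is_binary = path.suffix.lower() in BINARY_SUFFIXES
            row = [path.relative_to(root).as_posix(), sha256_of(path)]
            writers[is_binary].writerow(row)
```

A call such as `write_fingerprints(Path("my-app"), Path("source.csv"), Path("binaries.csv"))` would produce the two files, where `my-app` is a placeholder for the scanned folder.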


Once the results have been uploaded into the CAST Highlight platform, fingerprints are checked against our proprietary SCA database – the largest in the industry – to determine matches across billions of fingerprints and millions of projects. This information is then aggregated at the component/version level within the CAST Highlight dashboards.
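
The matching and aggregation step can be pictured with the short Python sketch below. The dictionary standing in for the SCA database, the `aggregate_matches` function and the sample hashes, component names and versions are purely illustrative assumptions; the real platform works at a very different scale and its internals are not covered here.

```python
from collections import Counter


def aggregate_matches(scanned_hashes, knowledge_base):
    """Count matched files per (component, version) pair.

    `knowledge_base` maps a file hash to the (component, version) it
    belongs to; hashes with no entry are treated as unmatched files.
    """
    counts = Counter()
    for file_hash in scanned_hashes:
        match = knowledge_base.get(file_hash)
        if match is not None:
            counts[match] += 1
    return counts


# Placeholder hashes and component names.
knowledge_base = {
    "aaa": ("left-pad", "1.3.0"),
    "bbb": ("left-pad", "1.3.0"),
    "ccc": ("ui-kit", "2.0"),
}
summary = aggregate_matches(["aaa", "bbb", "ccc", "zzz"], knowledge_base)
```

In this example `summary` records two matched files for `left-pad` 1.3.0 and one for `ui-kit` 2.0, while the unknown hash `zzz` is ignored.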

Michael Muller, Product Owner, Cloud-Based Software Analytics & Benchmarking at CAST
Michael Muller is a 15-year veteran in the software quality and measurement space. His areas of expertise include code quality, technical debt assessment, software quality remediation strategy, and application portfolio management. Michael manages the Appmarq product and benchmark database and is part of the CAST Research Labs analysis team that generates the industry-renowned CRASH reports.