Software Composition Analysis: How Open Source Component Detection Works


How CAST Crawls Open Source and Feeds Our Database

The Software Composition Analysis (SCA) feature in CAST Highlight relies on patented techniques to detect the presence and origin of third-party components in source code. While this information is useful for vulnerability identification and for IP and license compliance, it is also important to understand how the overall mechanics work in order to get full value from SCA.

To build our open source knowledge base, we continuously crawl different platforms, whether for source code (e.g., GitHub and GitLab), packages (e.g., npm, NuGet, and PyPI), or binaries (e.g., Maven).

There are four primary steps to crawling:

1. Cloning a copy of an open source component, including all versions, revisions, commits, and files (e.g., .java, .js, .jar, .dll, .xml, .md, and others).

2. Starting from day one of the project (i.e., the very first version in chronological order), we compute a unique key for each file using the standard SHA-256 secure hash algorithm from NIST, and associate it with the file's creation timestamp. These keys are fingerprints. In other words, if a given file is not modified across the project timeline, its fingerprint remains the same. As soon as the file is modified – whether by a new line of code, an instruction, or a comment – a different fingerprint with the modification timestamp is computed.

3. All component fingerprints are stored in a database along with version details and component metadata, including the version number, release date, component name, license, origin platform URL, technology, and project metrics.

4. Before inserting any fingerprint into the database, the system checks whether the same fingerprint has been found previously and is linked to another component. If so, the system retains only the oldest fingerprint based on its timestamp: the commit date for a source code repository, or the push date for a package archive.
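Steps 2 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, not CAST's actual implementation; the in-memory dictionary and the `record` helper are hypothetical stand-ins for the real database.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Step 2: SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(content).hexdigest()

# Hypothetical in-memory "database": fingerprint -> (component, timestamp)
db = {}

def record(fp: str, component: str, timestamp: int) -> None:
    """Step 4: keep only the oldest occurrence of a fingerprint."""
    existing = db.get(fp)
    if existing is None or timestamp < existing[1]:
        db[fp] = (component, timestamp)

# The same stylesheet appears in two components; the older commit wins.
record(fingerprint(b"body { color: red; }"), "gpl-styles", 1262304000)
record(fingerprint(b"body { color: red; }"), "mit-ui-kit", 1420070400)
# db now attributes the fingerprint to "gpl-styles", the earlier component
```

Keeping only the oldest occurrence is what makes the anteriority check in the next section possible: the database always points back to where a file first appeared.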

Why It’s Important to Check Temporal Anteriority

When teams decide to use open source components within an application, they often don't know much about the origin of the OSS, except perhaps whether it is a fork of another component. However, traceability of what composes software is key for business applications. It matters from a vulnerability standpoint (e.g., a single malicious file somewhere in a repository can make the whole application vulnerable), and it is especially important when legal and license compliance concerns are raised.

For instance, when using a UI JavaScript/CSS component under an MIT license that has copy/pasted several stylesheets and scripts directly from an old repository under the GNU GPL 3.0, you will need to track this from a licensing perspective. Classic SCA tools often overlook these details, but CAST Highlight's temporal analysis spots license inheritance immediately.
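The anteriority check behind this example can be sketched as follows. The index contents (component names, licenses, timestamps) are hypothetical; the point is only that when one fingerprint matches several components, the earliest timestamp identifies the true origin and therefore the inherited license.

```python
import hashlib

def fp(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

# Hypothetical fingerprint index: digest -> list of (component, license, timestamp)
index = {
    fp(b"h1 { margin: 0 }"): [
        ("mit-ui-kit", "MIT", 1420070400),        # later copy/paste
        ("old-gpl-repo", "GPL-3.0", 1262304000),  # original commit
    ],
}

def origin(digest: str):
    """Anteriority check: the earliest occurrence is the true origin,
    so the file inherits that component's license."""
    matches = index.get(digest, [])
    return min(matches, key=lambda m: m[2]) if matches else None
```

Here a stylesheet shipped inside the MIT-licensed UI kit resolves to the older GPL-3.0 repository, flagging the license inheritance described above.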

How CAST Highlight Scans and Understands Application Portfolios

Whether using the command line or the local agent to scan source code, CAST Highlight computes SHA-256 fingerprints of files using the same algorithm as our open source crawlers. These fingerprints are stored in the CSV result files produced by CAST Highlight's analyzers. Because binary files (.jar, .dll, etc.) are not parsed in depth by Highlight, they are detected in parallel and fingerprinted into a separate CSV file.
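A local scan of this kind could be sketched as below. This is not the CAST Highlight agent itself; the extension list and CSV layout are illustrative assumptions, showing only the idea of fingerprinting a source tree while routing binaries to their own CSV.

```python
import csv
import hashlib
import os

BINARY_EXTS = {".jar", ".dll", ".exe", ".so"}  # illustrative list, not CAST's

def scan(root: str, src_csv: str, bin_csv: str) -> None:
    """Fingerprint every file under root; binaries go to a separate CSV."""
    with open(src_csv, "w", newline="") as s, open(bin_csv, "w", newline="") as b:
        src_writer, bin_writer = csv.writer(s), csv.writer(b)
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                is_binary = os.path.splitext(name)[1].lower() in BINARY_EXTS
                (bin_writer if is_binary else src_writer).writerow((path, digest))
```

Because only hashes leave the machine, the source code itself never needs to be uploaded — one reason fingerprint-based scanning works well for sensitive codebases.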


Once the results have been uploaded into the CAST Highlight platform, fingerprints are checked against our proprietary SCA database – the largest in the industry – to determine matches across billions of fingerprints and millions of projects. This information is then aggregated at the component/version level within the CAST Highlight dashboards.
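The final aggregation step can be pictured with a toy example. The match data below is invented; it only illustrates rolling per-file matches up to the component/version level, as the dashboards do.

```python
from collections import Counter

# Hypothetical match results: one (component, version) pair per matched fingerprint
matches = [
    ("jquery", "3.6.0"),
    ("jquery", "3.6.0"),
    ("log4j", "2.17.1"),
]

# Aggregate file-level matches to the component/version level
by_component_version = Counter(matches)
```

The more files of a given component version are matched, the stronger the evidence that this exact version is present in the application.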

Michael Muller, Product Owner, Cloud-Based Software Analytics & Benchmarking at CAST
Michael Muller is a 15-year veteran in the software quality and measurement space. His areas of expertise include code quality, technical debt assessment, software quality remediation strategy, and application portfolio management. Michael manages the Appmarq product and benchmark database and is part of the CAST Research Labs analysis team that generates the industry-renowned CRASH reports.