How CAST Crawls Open Source and Feeds Our Database
The Software Composition Analysis feature in CAST Highlight leverages exclusive patents to detect the presence and the origin of third-party components in source code. While this information is useful for vulnerability identification and IP and license compliance, it is also important to understand how the overall mechanics work to get the full value from SCA.
To build our Open Source knowledge base, we continuously crawl different platforms for source code (e.g., GitHub and GitLab), packages (e.g., npm, NuGet and PyPI) and binaries (e.g., Maven).
There are four primary steps to crawling:
1. Cloning a copy of an open source component, including all versions, revisions, commits and files (e.g., .java, .js, .jar, .dll, .xml, .md and others).
2. Starting from day one of the project (i.e., the very first version in chronological order), we compute a unique key for each file using the standard SHA-256 secure hash algorithm from NIST and associate it with the file's creation timestamp. This key is the file's fingerprint. If a given file is not modified across the project timeline, its fingerprint remains the same. As soon as the file is modified, whether by a new line of code, an instruction or a comment, a different fingerprint is computed and associated with the modification timestamp.
3. All component fingerprints are stored in a database along with the version details and component metadata, including the version number, release date, component name, license, origin platform URL, technology and project metrics.
4. Before inserting any fingerprint in the database, the system checks whether the same fingerprint is already in the database and linked to another component. If so, the system retains only the oldest fingerprint based on its timestamp, which is the commit date for a source code repository and the push date for a package archive.
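Steps 2 and 4 can be illustrated with a short Python sketch. It is not CAST's actual implementation: the Fingerprint record, its field names and the in-memory dictionary standing in for the database are all assumptions made for illustration. Only the use of SHA-256 and the oldest-timestamp rule come from the steps above.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical record; the field names are illustrative assumptions."""
    sha256: str           # hex digest of the file content
    component: str        # component the file was found in
    version: str          # version of that component
    first_seen: datetime  # commit date (repository) or push date (package)


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def record(db: dict, candidate: Fingerprint) -> None:
    """Insert a fingerprint, keeping only the oldest occurrence of each digest."""
    existing = db.get(candidate.sha256)
    if existing is None or candidate.first_seen < existing.first_seen:
        db[candidate.sha256] = candidate


# The same file content appears in two components; the older one is retained.
db = {}
content = b"public class Util {}\n"
record(db, Fingerprint(fingerprint(content), "fork", "2.0",
                       datetime(2020, 6, 1, tzinfo=timezone.utc)))
record(db, Fingerprint(fingerprint(content), "upstream", "1.0",
                       datetime(2015, 3, 1, tzinfo=timezone.utc)))
origin = db[fingerprint(content)].component  # "upstream"
```

Because the digest depends only on the file's bytes, any change to the content, even a single added comment, produces a different key, while an untouched file keeps the same key across every version.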
Why It’s Important to Check Temporal Anteriority
When teams decide to use Open Source components within an application, they often don't know much about the origin of the OSS, except perhaps that it is a fork of another component. However, traceability of what composes software is key for business applications. This awareness is increasingly important from a vulnerability standpoint (e.g., a malicious file somewhere in a repository can make the whole application vulnerable), and it matters even more when legal and license compliance concerns are raised.
How CAST Highlight Scans and Understands Application Portfolios
Whether you use the command line or the local agent to scan source code, CAST Highlight computes SHA-256 fingerprints of files using the same algorithm as the Open Source crawlers described above. These fingerprints are stored in the CSV files produced by CAST Highlight's analyzers. Because Highlight does not parse binary files (.jar, .dll, etc.) in depth, they are detected in parallel and fingerprinted in a separate CSV.
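As a rough illustration of this scan step, the Python sketch below walks a folder, fingerprints every file and keeps binaries apart from source files. The CSV column names, the list of binary extensions and the function names are assumptions for the example; they do not describe the actual format of CAST Highlight's result files.

```python
import csv
import hashlib
from pathlib import Path

# Assumption: extensions treated as binaries in this sketch.
BINARY_EXTENSIONS = {".jar", ".dll"}


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large binaries are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root: Path):
    """Return (source_rows, binary_rows), each a list of (relative path, digest)."""
    sources, binaries = [], []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        row = (path.relative_to(root).as_posix(), sha256_of_file(path))
        if path.suffix.lower() in BINARY_EXTENSIONS:
            binaries.append(row)
        else:
            sources.append(row)
    return sources, binaries


def write_csv(rows, destination: Path) -> None:
    """Write fingerprint rows to a CSV file with a hypothetical two-column layout."""
    with destination.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "sha256"])
        writer.writerows(rows)
```

A caller would run scan on the application folder and pass each list to write_csv with a different destination, which gives one CSV for source files and a separate one for binaries.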
Once the results have been uploaded into the CAST Highlight platform, fingerprints are checked against our proprietary SCA database, the largest in the industry, to find matches across billions of fingerprints and millions of projects. The matches are then aggregated at the component/version level within the CAST Highlight dashboards.
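The matching and aggregation described here can be pictured with the following Python sketch. The knowledge base is modeled as a plain dictionary from digest to a (component, version) pair, which is an assumption for illustration only; the real SCA database and its lookup logic are not described in this article.

```python
from collections import Counter


def match_components(scanned_digests, knowledge_base):
    """Count matched files per (component, version).

    scanned_digests: iterable of SHA-256 hex digests from a scan.
    knowledge_base: hypothetical dict mapping digest -> (component, version).
    """
    counts = Counter()
    for digest in scanned_digests:
        origin = knowledge_base.get(digest)
        if origin is not None:  # unknown digests are treated as in-house code
            counts[origin] += 1
    return counts


# Illustrative data only; the digests are placeholders, not real hashes.
knowledge_base = {
    "aaa": ("example-lib", "1.2.0"),
    "bbb": ("example-lib", "1.2.0"),
    "ccc": ("other-lib", "3.0.1"),
}
matches = match_components(["aaa", "bbb", "zzz"], knowledge_base)
```

In this toy run, two scanned files map to the same component version and one file has no match, so the aggregated result holds a single entry for that component version with a count of two.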