How to Configure a Keyword Scan for GDPR Compliance


What is GDPR

According to Wikipedia, the General Data Protection Regulation (GDPR) is a regulation in EU law on data protection and privacy for all individuals within the European Union and the European Economic Area. It also addresses the export of personal data outside the EU and EEA areas. GDPR aims primarily to give control to citizens and residents over their personal data and to simplify the regulatory environment for international business by unifying the regulation within the EU.

Concretely speaking, it means that organizations will have to know if their applications read, process or store Personally Identifiable Information (PII) data of EU-based users, in order to set up the appropriate actions to qualify the nature and purpose of data collection, modify the application behavior and ask users whether they consent to share their data.

Why You Need to Scan Code to Prove GDPR Compliance

Under GDPR, organizations must now know whether their applications process PII data. This is easy to determine when an application is connected to a central database that holds tables or columns like “first_name”, “email_address” or “social_security_number”, but as software grows increasingly complex, finding this information is not always so simple.

First, applications and databases don’t necessarily have a 1:1 relationship. You may have a few central databases that are accessed by hundreds of apps. GDPR is therefore not only about identifying the databases you have and putting them through the GDPR process; this verification also needs to be done at the application level.

Second, applications can manipulate PII data without any database. APIs, JSON, web services and microservices are the norm, meaning that a piece of source code can read, process and share data with other components without knowing anything about the database that initially stored it. For example, your HR department may well use a non-compliant script that finds relevant LinkedIn profiles by manipulating PII data, including names, locations and profile pictures.

Fortunately, developers love to write code they can easily read and maintain. Most of the time, they give their classes, methods and parameters descriptive names (e.g. getCustomerName, updateProfile($CreditCardNumber), etc.). As a result, it is possible to approximate (if not determine) whether an application processes PII data by scanning its source code and counting occurrences of PII-related keywords. Scanning code to search for patterns? That’s exactly where CAST Highlight comes into play.
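To make the idea of counting PII-related keywords concrete, here is a minimal Python sketch. It is not how CAST Highlight works internally; the group names, keyword variants and the function names `count_keywords` and `scan_source` are illustrative assumptions. It only shows how case sensitivity and full vs. partial word-matching change what gets counted.

```python
import re

# Hypothetical keyword groups (illustration only): group name -> keyword variants.
KEYWORD_GROUPS = {
    "customer_name": ["customername", "customer_name"],
    "credit_card": ["creditcardnumber", "credit_card"],
}


def count_keywords(text, keywords, case_sensitive=False, full_word=False):
    """Count occurrences of any keyword variant in a piece of source code."""
    flags = 0 if case_sensitive else re.IGNORECASE
    total = 0
    for keyword in keywords:
        pattern = re.escape(keyword)
        if full_word:
            # Match only when the keyword is not embedded in a longer identifier.
            pattern = r"\b" + pattern + r"\b"
        total += len(re.findall(pattern, text, flags))
    return total


def scan_source(text):
    """Return the occurrence count of each keyword group (partial, case-insensitive)."""
    return {name: count_keywords(text, kws) for name, kws in KEYWORD_GROUPS.items()}


sample = "function updateProfile($CreditCardNumber) { return getCustomerName(); }"
print(scan_source(sample))
```

With partial matching, `getCustomerName` counts as a hit for the customer name group; with `full_word=True` it would not, because the keyword is embedded in a longer identifier.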

How to Configure a Keyword Scan in CAST Highlight  

The Keyword Scan feature works with our command line and takes the path to your keyword configuration file (--keywordScan "path/to/your/file.xml"). This file tells the analyzers, in a structured way, what to search for during a code scan. Its structure is detailed below:

  • UserScan: the root node that contains the configuration.
  • keywordScan: the main node for a keyword topic. You can indicate a name and a version (e.g. name="GDPR" version="1.2"). You can have multiple topics in a single configuration file as you may want to search for GDPR-related keywords but also keywords for licenses, specific unauthorized functions and other regulation tags.
  • keywordGroup: the node that will search in code for a keyword or a set of similar keywords (e.g. “social security number”, “ssn”, “social security nbr”, etc.). For each keyword group, you can define a specific weight (for instance, in a GDPR context, a passport number will weigh more than a first name) and search options such as case sensitivity or full vs. partial word-matching.
  • keywordItem: one of the search elements. You can have multiple items for a given keyword group.
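The nesting described above can be sketched by generating a file with Python's standard `xml.etree.ElementTree` module. Only the four node names and the `name` and `version` attributes of keywordScan come from this article. Everything else is an assumption made for illustration: the `name` and `weight` attributes on keywordGroup, storing the keyword as the text of keywordItem, and the helper name `build_config`. Check the product documentation for the actual attribute names, including those for case sensitivity and word-matching.

```python
import xml.etree.ElementTree as ET


def build_config(topic, version, groups):
    """Build a UserScan > keywordScan > keywordGroup > keywordItem tree."""
    root = ET.Element("UserScan")
    scan = ET.SubElement(root, "keywordScan", {"name": topic, "version": version})
    for group in groups:
        # ASSUMPTION: "name" and "weight" attribute names are placeholders.
        node = ET.SubElement(
            scan, "keywordGroup",
            {"name": group["name"], "weight": str(group["weight"])},
        )
        for keyword in group["keywords"]:
            # ASSUMPTION: the keyword is stored as the element text.
            item = ET.SubElement(node, "keywordItem")
            item.text = keyword
    return root


groups = [
    {"name": "Social Security Number", "weight": 10,
     "keywords": ["social security number", "ssn", "social security nbr"]},
    {"name": "First Name", "weight": 1,
     "keywords": ["first_name", "firstname"]},
]

root = build_config("GDPR", "1.2", groups)
ET.indent(root)  # pretty-print; requires Python 3.9 or later
print(ET.tostring(root, encoding="unicode"))
```

Further topics (for example a separate "Passwords" topic) would be additional keywordScan nodes under the same UserScan root.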

Your final configuration file would look like this.

[Image: CAST Highlight keyword configuration file]

[Image: CAST Highlight command line]

When scanning an app with the feature active, the command line will produce one result CSV per keywordScan and per technology (e.g. Java-[date].KeywordScan.GDPR.csv, Java-[date].KeywordScan.Passwords.csv, Python-[date].KeywordScan.GDPR.csv, Python-[date].KeywordScan.Passwords.csv, etc.).

Each produced CSV will contain the list of scanned files as rows and the number of found keyword group occurrences as columns.
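As a rough sketch of how such a result file could be post-processed, the Python snippet below totals the occurrences per keyword group. The column names, the comma delimiter and the sample rows are assumptions made for illustration; only the general layout (files as rows, keyword groups as columns) comes from the description above.

```python
import csv
import io

# Hypothetical CSV content: one row per scanned file, one column per keyword group.
SAMPLE_CSV = """file,ssn,email
src/Customer.java,3,1
src/Util.java,0,0
src/Mailer.java,0,4
"""


def total_by_group(csv_text, file_column="file"):
    """Sum the occurrence counts of every keyword group column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    totals = {}
    for row in reader:
        for column, value in row.items():
            if column == file_column:
                continue  # skip the column holding the file path
            totals[column] = totals.get(column, 0) + int(value)
    return totals


print(total_by_group(SAMPLE_CSV))
```

For a real result file, the same function could be fed the file's contents after adjusting `file_column` and the delimiter to match the actual header.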

[Image: CAST Highlight Keyword Scan dashboard]

Explore the Results

In Highlight dashboards, Keyword Scan information can be visualized at the portfolio level from the menu entry “KEYWORD SCAN”. Using the weights and number of occurrences for a keyword group, the dashboard displays the aggregated scores by domain, application or keyword. Keyword scores are simple to understand: their purpose is to quickly show the relative volume of occurrences (weighted by keyword severity), the density and the file scope of a keyword set:

  • Score: number of occurrences * weight
  • Density: score / total files of the application
  • Impacted Files: number of files where keyword occurrences have been found
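The three formulas above can be worked through on a small example. In this Python sketch the file names, weights and counts are invented, and summing occurrences * weight across keyword groups to get a single application score is an assumption about how the aggregation is done.

```python
def keyword_metrics(occurrences_by_file, weights):
    """Compute score, density and impacted files for one application.

    occurrences_by_file: {file path: {keyword group: occurrence count}}
    weights: {keyword group: weight}
    """
    total_files = len(occurrences_by_file)
    # Score: number of occurrences * weight, summed over all files and groups.
    score = sum(
        count * weights[group]
        for counts in occurrences_by_file.values()
        for group, count in counts.items()
    )
    # Impacted files: files with at least one keyword occurrence.
    impacted_files = sum(
        1 for counts in occurrences_by_file.values()
        if any(count > 0 for count in counts.values())
    )
    # Density: score / total files of the application.
    density = score / total_files if total_files else 0.0
    return {"score": score, "density": density, "impacted_files": impacted_files}


# Hypothetical scan results for a four-file application.
occurrences = {
    "src/Customer.java": {"ssn": 3, "email": 1},
    "src/Mailer.java": {"ssn": 0, "email": 4},
    "src/Util.java": {"ssn": 0, "email": 0},
    "src/Main.java": {"ssn": 0, "email": 0},
}
weights = {"ssn": 10, "email": 2}

print(keyword_metrics(occurrences, weights))
# {'score': 40, 'density': 10.0, 'impacted_files': 2}
```

Here the three social security number hits contribute 30 of the 40 points, which is how a high weight makes a small number of sensitive occurrences stand out.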

Visually, you can easily spot applications that contain many occurrences and/or high-severity occurrences by looking at the scores on the horizontal axis: the right side of the chart corresponds to high scores. Depending on your use case, you can also switch the vertical axis to one of the main Highlight KPIs (Business Impact, FTEs, Lines of Code, Cloud readiness, etc.) and see how your application portfolio is distributed against that metric.

Take Concrete Actions

In a GDPR assessment context, it also makes sense to leverage one of the features we introduced in the last version: the capability to filter application results on survey answers.

Create a custom survey and simply ask your application owners the question “Does your application manipulate Personally Identifiable Information?”. Back in the dashboard, select the answer “No” and check whether the applications that are not supposed to manipulate PII data do in fact contain occurrences of GDPR-related keywords. You now have a solid list of application candidates to investigate for a GDPR registration.
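The same cross-check can be expressed in a few lines of Python for anyone working from exported data rather than the dashboard. The record layout, the field names `handles_pii` and `gdpr_score`, and the application names are hypothetical and do not reflect an actual Highlight export format.

```python
# Hypothetical per-application records combining a survey answer and a keyword score.
apps = [
    {"name": "billing", "handles_pii": "Yes", "gdpr_score": 120},
    {"name": "intranet-wiki", "handles_pii": "No", "gdpr_score": 0},
    {"name": "hr-sourcing", "handles_pii": "No", "gdpr_score": 45},
]


def investigation_candidates(apps):
    """Applications declared PII-free whose code still contains GDPR keywords."""
    flagged = [
        app for app in apps
        if app["handles_pii"] == "No" and app["gdpr_score"] > 0
    ]
    # Highest scores first, so the most suspicious applications are reviewed first.
    return sorted(flagged, key=lambda app: app["gdpr_score"], reverse=True)


for app in investigation_candidates(apps):
    print(app["name"], app["gdpr_score"])
```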

Filed in: Risk & Security
Michael Muller Product Owner Cloud-Based Software Analytics & Benchmarking at CAST
Michael Muller is a 15-year veteran in the software quality and measurement space. His areas of expertise include code quality, technical debt assessment, software quality remediation strategy, and application portfolio management. Michael manages the Appmarq product and benchmark database and is part of the CAST Research Labs analysis team that generates the industry-renowned CRASH reports.