The DLP indexer creates irreversible hash fingerprints of your critical data records and uploads them to Umbrella, into the template of the configured Exact Data Match (EDM) identifier.
Before generating the hash fingerprints, the DLP indexer validates that the submitted records and their values conform to the field types that are defined and supported in the Exact Data Match template.
Prerequisites
- Full admin access to the Umbrella dashboard. See Manage User Roles.
- JVM version 17+
- The machine where the data indexer is downloaded must be able to connect to the following endpoints:
- POST https://api.umbrella.com/auth/v2/token
- GET https://api.umbrella.com/policies/v2/edm/<edm_template-id>
- POST https://api.umbrella.com/policies/v2/edm/<edm_template-id>/data
Note: <edm_template-id> is the ID of the EDM identifier retrievable from the Umbrella UI. (See Step 7 in Create an Exact Data Match Identifier.)
- The DLP Indexer must be downloaded after the template for the EDM identifier is created. See Steps 1-6 in Create an Exact Data Match Identifier. After downloading the DLP indexer, move it from the downloads folder to a convenient location, such as the folder where the data records are stored.
- The API Key and Secret must be generated for the DLP indexer. See Step 8 in Create an Exact Data Match Identifier.
- The indexer supports files with up to 55 million records. The exact records limit is determined by the total number of columns and how many of those are of Alphanumeric type. The indexer will display the exact limit when attempting to load a file that exceeds it. If your dataset is larger than the limit, you need to split the records into multiple files. For errors received when indexing a large file, see Memory Tuning for DLP Exact Data Matching Indexer.
- The source data CSV file you index must meet the following requirements:
- The file name must not include space characters.
- A multi-term (multi-word) field can contain a maximum of 6 space-separated words.
- The data file must contain only 1-byte or 2-byte UTF-8 encoded characters.
- The first row of data must have between 1 and 50 fields and each row must have the same number of fields.
- The first row of data must specify the name of each field, and each value must be unique.
- Data in the second and ensuing rows must comply with the EDM field types and supported formats (see Exact Data Match Field Types).
- The field names in the sample data template must match the field names in the actual data source file. The field names must appear in the same order in both files.
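The file requirements above can be checked before the indexer is run. The following Python sketch is an unofficial pre-check and not part of the DLP indexer. The function name preflight_check and its messages are hypothetical, it interprets "1-byte or 2-byte" characters as code points up to U+07FF, and it does not validate EDM field types or compare the header against the sample data template. The indexer's own validation remains authoritative.

```python
import csv
from pathlib import Path

MAX_FIELDS = 50          # the first row must have between 1 and 50 fields
MAX_WORDS = 6            # limit for space-separated words in one field
MAX_CODE_POINT = 0x7FF   # largest code point that UTF-8 encodes in 2 bytes


def preflight_check(path: Path) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if " " in path.name:
        problems.append("file name contains a space")
    with path.open("r", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            problems.append("file is empty")
            return problems
        if not 1 <= len(header) <= MAX_FIELDS:
            problems.append(f"header has {len(header)} fields, expected 1 to {MAX_FIELDS}")
        if len(set(header)) != len(header):
            problems.append("header field names are not unique")
        # Data rows start at row 2; row 1 is the header.
        for number, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append(f"row {number}: {len(row)} fields, expected {len(header)}")
            for value in row:
                if len(value.split()) > MAX_WORDS:
                    problems.append(f"row {number}: field has more than {MAX_WORDS} words")
                if any(ord(char) > MAX_CODE_POINT for char in value):
                    problems.append(f"row {number}: character needs more than 2 bytes in UTF-8")
    return problems
```

Calling preflight_check(Path("records.csv")) and printing each returned message gives a quick list of rows to inspect in a text editor before indexing.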
Caution: Do not create, edit, or view the source data CSV file using Microsoft Excel, as this may corrupt the file. Use a text editor.
Note: If any value in the source file fails validation against the supported formats, the DLP indexer skips that record and continues indexing the remaining records. The indexer also skips records that have more fields than the template defines, empty rows, and records with an empty primary value. The indexer output lists the position of each skipped record in the file.
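For a dataset that exceeds the record limit described in the prerequisites, the records need to be split into multiple files. The sketch below shows one way to do that in Python, under stated assumptions: split_csv and the _partN file naming are hypothetical, every record sits on a single line (no line breaks inside quoted fields), and the header row is repeated in each part because the first row of a source file must name the fields. The value for max_records should come from the limit that the indexer reports.

```python
from pathlib import Path


def split_csv(source: Path, max_records: int) -> list[Path]:
    """Write source into numbered part files of at most max_records records each."""
    if max_records < 1:
        raise ValueError("max_records must be at least 1")
    parts = []
    out = None
    try:
        # newline="" keeps the original line endings untouched.
        with source.open("r", encoding="utf-8", newline="") as src:
            header = src.readline()
            for index, line in enumerate(src):
                if index % max_records == 0:
                    if out is not None:
                        out.close()
                    name = f"{source.stem}_part{len(parts) + 1}{source.suffix}"
                    part = source.with_name(name)
                    out = part.open("w", encoding="utf-8", newline="")
                    out.write(header)  # repeat the field names in every part
                    parts.append(part)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

The part names contain no spaces as long as the original file name has none, which keeps them within the file name requirement.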
Run the Initial Data Index
When you create a new EDM identifier, you need to run the DLP Indexer for the first time to upload the first set of data records. For the full procedure on creating an EDM identifier, see Create an Exact Data Match Identifier.
- Run the indexer in a terminal window with the following command:
java -jar <directory_path>/dlp-indexer.jar -i <directory_path>/<source_file>.csv -e <edm_template-id> -k <authKey> -s <authSecret>
where:
- <directory_path>/dlp-indexer.jar: the relative path to the location of the DLP indexer
- <directory_path>/<source_file>.csv: the relative path to the CSV file that contains the actual data records
- <edm_template-id>: the ID of the EDM identifier retrievable from the Umbrella UI. (See Step 7 in Create an Exact Data Match Identifier.)
- <authKey>: the API Key previously saved at Step 8d of Create an Exact Data Match Identifier
- <authSecret>: the API Secret previously saved at Step 8d of Create an Exact Data Match Identifier
The exact data matcher now has a status of Data Indexed.
Note: When the EDM has a status of Data Indexed, you can add the EDM to a data classification, but you cannot edit the field types, primary field selection, or matching condition.
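The indexer command in this section can also be launched from a script. The following is a minimal Python sketch, not an official wrapper: build_indexer_command and run_indexer are hypothetical helper names, and only the options shown in this section (-i, -e, -k, -s) are used. It saves the indexer output to a log file so that the reported positions of skipped records can be reviewed afterwards. The log file location is an assumption.

```python
import subprocess
from pathlib import Path


def build_indexer_command(jar: Path, source_file: Path, template_id: str,
                          auth_key: str, auth_secret: str) -> list[str]:
    """Assemble the java -jar invocation as an argument list (no shell quoting needed)."""
    return [
        "java", "-jar", str(jar),
        "-i", str(source_file),
        "-e", template_id,
        "-k", auth_key,
        "-s", auth_secret,
    ]


def run_indexer(command: list[str], log_file: Path) -> int:
    """Run the indexer, save its combined output to log_file, and return the exit code."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge error output into the same log
        text=True,
    )
    log_file.write_text(result.stdout, encoding="utf-8")
    return result.returncode
```

Passing the arguments as a list avoids shell interpretation of the key and secret. The template ID, key, and secret are supplied by the caller exactly as they would be typed on the command line.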
Update the Indexed Data Set Periodically
When your source CSV file is updated with new records, the existing EDM Data Identifier on your configured policy must be updated to reflect the new data fingerprints. This procedure lets you rerun the DLP indexer periodically to upload your updated source data to Umbrella without repeating the initial procedure. After you rerun the DLP indexer with the updated source file against the EDM ID of your EDM Data Identifier, the DLP policy configured with this EDM Data Identifier accounts for the most recent updates to your critical records.
- In a terminal window, set the API Key and Secret previously saved in Step 8d of the Create an Exact Data Match Identifier procedure as the values of the environment variables EDM_AUTH_KEY and EDM_AUTH_SECRET.
- Run the following command as part of a periodically executed script or as needed:
java -jar <directory_path>/dlp-indexer.jar -i <directory_path>/<source_file>.csv -e <edm_template-id> -k EDM_AUTH_KEY -s EDM_AUTH_SECRET
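A scheduled job can read the credentials from the environment rather than hard-coding them in a script. The Python sketch below assumes that the values stored in EDM_AUTH_KEY and EDM_AUTH_SECRET are what the -k and -s options should receive, as in the initial run. The function names credentials_from_env and update_command are hypothetical.

```python
import os


def credentials_from_env(environ=os.environ) -> tuple[str, str]:
    """Read the API key and secret, failing early if either variable is unset or empty."""
    names = ("EDM_AUTH_KEY", "EDM_AUTH_SECRET")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return environ["EDM_AUTH_KEY"], environ["EDM_AUTH_SECRET"]


def update_command(jar: str, source_file: str, template_id: str,
                   environ=os.environ) -> list[str]:
    """Build the rerun command with credentials taken from the environment."""
    key, secret = credentials_from_env(environ)
    return ["java", "-jar", jar, "-i", source_file, "-e", template_id,
            "-k", key, "-s", secret]
```

The returned list can be handed to subprocess.run from whatever scheduler runs the job. Failing early on a missing variable keeps an unattended run from calling the indexer with empty credentials.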
Troubleshooting
If the indexer returns an error message that reads, "Error: A JNI error has occurred, please check your installation and try again," check the following:
- Confirm that you have the latest version of the Java Development Kit installed.
- Confirm that your PATH system variable is set correctly:
- Check the location where Java is installed.
- For Windows, this is normally C:\Program Files\Java\jdk-<version-number>\bin
- For Linux, this is normally /usr/java/jdk-<version-number>/bin
- Set the PATH system variable so that it includes this location, following the instructions for your operating system.
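The Java checks above can be scripted. This Python sketch is a hypothetical helper, not part of the indexer: it looks for java on PATH and parses the major version from the output of java -version, which typically contains a string such as openjdk version "17.0.2" (or java version "1.8.0_301" for Java 8). The function names are assumptions, and the threshold of 17 is taken from the prerequisites.

```python
import re
import shutil
import subprocess

REQUIRED_MAJOR = 17  # from the prerequisites: JVM version 17+


def parse_java_major(version_output: str):
    """Extract the major version from java -version output, or None if not found."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Releases up to Java 8 report themselves as 1.x (for example 1.8.0_301).
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def check_java() -> str:
    """Describe whether a suitable java executable is available on PATH."""
    java = shutil.which("java")
    if java is None:
        return "java was not found on PATH"
    # java -version normally writes its report to stderr.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    if major is None:
        return f"could not determine the version of {java}"
    if major < REQUIRED_MAJOR:
        return f"{java} is Java {major}; version {REQUIRED_MAJOR} or later is required"
    return f"{java} is Java {major}"
```

Printing the result of check_java() shows which executable the PATH resolves to, which helps when several Java installations are present on the same machine.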
Note: If the data indexer fails to process the input file and returns a Base64-encoded error code, provide that code to Umbrella Support for help with troubleshooting.