Configuring Optical Character Recognition for Data Loss Prevention

The Optical Character Recognition (OCR) feature extends Data Loss Prevention in Cloudlock to text that appears inside images. It extracts text from standalone images in jpeg, gif, tiff, and png formats, and from images embedded in other files, such as Excel spreadsheets, PowerPoint presentations, Word documents, PDFs, and ZIP files. The extracted text is then evaluated against the configured set of policies. For example, if a policy is configured to identify credit card numbers and raise incidents, an image file that contains a credit card number is also identified as a violation, and the corresponding Remediation Action is enforced.
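As an illustration of the credit card example, the following Python sketch checks a string of already-extracted text for card-like numbers. It assumes the OCR step has produced the text; the digit pattern, the Luhn checksum filter, and the names CARD_PATTERN, luhn_valid, and find_card_numbers are hypothetical and do not represent Cloudlock's actual detection logic.

```python
import re
from typing import List

# Hypothetical pattern: 13 to 16 digits, optionally separated by single
# spaces or hyphens. A real policy may use a different expression.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(extracted_text: str) -> List[str]:
    """Return candidate card numbers found in text extracted from an image."""
    hits = []
    for match in CARD_PATTERN.finditer(extracted_text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

Calling find_card_numbers on the text extracted from an image returns the matching numbers with separators removed; an empty list means the text would not trigger this particular hypothetical rule.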

Limitations

File type is not supported for Webex Meetings.
For images embedded in files, only the first 15 embedded images are scanned.
Nested images are not scanned.
OCR output depends on the quality of the input image. Documents that are not ideal for OCR include:
- Images with significant noise that is similar in colour to the text.
- Images with dark text on a dark background.
- Light images with dot-matrix characters.
Only English is supported.

Procedure

Select a content-based policy configuration and navigate to the Content tab.
Check the Enable optical character scanning checkbox to turn on OCR for the policy.

Navigate to the Context tab.
Under File Type, select one of the following:

a. Select All File Types to scan all the file types for the regular expression configured in the policy.
b. Select Specific File Types to scan only specific file types.

If the Enable optical character scanning checkbox is checked in the Content tab, an Images option appears under Specific File Types > Scan file name and content in the Context tab. Selecting Images also enables optical character scanning for standalone images. The supported formats are jpeg, png, gif, and tiff.

Note:

If the Images checkbox is not checked, optical character scanning is applied only to images embedded in other documents.
If only file types such as Documents, PDFs, or Presentations are selected, any images embedded in those files are scanned along with the text. The extracted text is scanned for violations.
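The way the two tabs interact can be summarized in a short Python sketch. This is only a model of the behavior described in this procedure, not a Cloudlock API: the function name ocr_scope, its parameters, and the returned dictionary keys are hypothetical, and the treatment of All File Types (standalone images included) is an assumption.

```python
def ocr_scope(ocr_enabled: bool, file_type_mode: str, images_selected: bool = False) -> dict:
    """Model which kinds of images receive OCR for a policy.

    ocr_enabled     -- state of "Enable optical character scanning" (Content tab)
    file_type_mode  -- "all" for All File Types, "specific" for Specific File Types
    images_selected -- state of the Images checkbox under Specific File Types
    """
    if file_type_mode not in ("all", "specific"):
        raise ValueError("file_type_mode must be 'all' or 'specific'")

    if not ocr_enabled:
        # Without the Content tab checkbox, no image text is extracted.
        return {"embedded_images": False, "standalone_images": False}

    if file_type_mode == "all":
        # Assumption: scanning all file types also covers standalone images.
        return {"embedded_images": True, "standalone_images": True}

    # Specific File Types: embedded images in the selected document types are
    # always scanned; standalone images only when Images is checked.
    return {"embedded_images": True, "standalone_images": images_selected}
```

For example, ocr_scope(True, "specific", images_selected=False) models a policy that scans only images embedded in the selected document types.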
