The Smartker server-side OCR feature performs OCR (Optical Character Recognition) on PDF and TIFF documents in the Smartker library so that they can be indexed and searched. The OCR engine runs on the Smartker server, which uses a queue to process documents. Once OCR is complete, the document is saved as a new version containing a text layer, which allows it to be indexed and searched within the document management system.
The criteria for adding a document to the OCR processing queue are:
* The document must be in an «Electronic Document» format. Electronic records and offline documents will not be processed.
* Only PDF and TIF/TIFF documents are processed. TIFF images are converted to searchable PDF documents.
* Only the latest version of a document can be processed, because a new version is created once the document has been OCRed. The owner of the original document remains the owner of the new OCR version.
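The criteria above can be sketched as a single eligibility check. This is only an illustration: the `Document` class, its field names, and the `kind` values are hypothetical and do not reflect Smartker's actual data model or API.

```python
from dataclasses import dataclass

# Hypothetical model: field names and values are illustrative only,
# not Smartker's real schema.
@dataclass
class Document:
    kind: str            # "electronic_document", "electronic_record" or "offline_document"
    extension: str       # file extension, with or without a leading dot
    version: int         # version number of this document instance
    latest_version: int  # highest version number that exists for the document

# File types accepted by the OCR queue according to the criteria above.
OCR_EXTENSIONS = {"pdf", "tif", "tiff"}

def is_ocr_eligible(doc: Document) -> bool:
    """Return True only if all three queue criteria are met."""
    return (
        doc.kind == "electronic_document"
        and doc.extension.lower().lstrip(".") in OCR_EXTENSIONS
        and doc.version == doc.latest_version
    )

if __name__ == "__main__":
    scan = Document("electronic_document", ".TIFF", 3, 3)
    old_version = Document("electronic_document", "pdf", 1, 2)
    print(is_ocr_eligible(scan))         # latest version of a TIFF scan
    print(is_ocr_eligible(old_version))  # superseded version
```

Here the TIFF scan qualifies, while the superseded PDF version does not because it is no longer the latest version.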
OCR Engine
The resulting text layer depends on the quality of the document being OCRed. To ensure an accurate text layer, the quality of the documents must be reasonably high. Lower-quality scans are difficult to OCR accurately, so check the quality of these documents before processing. The OCR engine cannot detect whether an image is rotated, so make sure that your documents are readable from left to right and that the text is oriented horizontally.
Documents processed by the OCR engine can be compressed to save repository space. The image/PDF compression feature must be licensed from Smartker. Document compression consists of several techniques, and system administrators can decide which techniques to enable or disable to achieve the required level of document optimization. See Image/PDF compression options for settings.
Server-side OCR is an optional feature that is controlled in the Smartker license. To purchase server-side OCR, please contact sales@Smartker.com.
When a document goes through the server-side OCR process, a new version of the document is generated. This new version is not associated with any workflows that ran on the previous version and therefore loses its review and approval statuses. If those statuses need to be maintained between versions, the newly generated version must go through the workflow process again.
Server-side OCR can be time-consuming, so documents are added to a queue for processing. All new documents and new versions, whether added manually or via an automatic import mechanism (such as monitored folders or managed imports), are automatically added to the queue. Existing repository documents can be added to the queue manually.
A setting lets you give newly added documents or versions a higher priority in the queue, so that they are processed before any existing documents already in the queue. If the setting is not applied, documents are taken from the queue in the order they were added, regardless of priority.
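The effect of the priority setting on processing order can be illustrated with a small sketch. The `QueueItem` class and the `prioritize_new` flag are hypothetical names chosen for this example; they are not Smartker identifiers, and the real queue implementation may differ.

```python
from dataclasses import dataclass

# Hypothetical queue entry; names are illustrative only.
@dataclass
class QueueItem:
    name: str
    is_new: bool  # True for newly added documents or versions
    seq: int      # position in which the item was added to the queue

def processing_order(items, prioritize_new):
    """Return item names in the order they would be processed."""
    if prioritize_new:
        # New items first, then arrival order within each group.
        ordered = sorted(items, key=lambda i: (not i.is_new, i.seq))
    else:
        # Plain first-in, first-out order.
        ordered = sorted(items, key=lambda i: i.seq)
    return [i.name for i in ordered]

if __name__ == "__main__":
    queue = [
        QueueItem("existing-1", False, 0),
        QueueItem("existing-2", False, 1),
        QueueItem("new-1", True, 2),
    ]
    print(processing_order(queue, prioritize_new=True))
    # ['new-1', 'existing-1', 'existing-2']
    print(processing_order(queue, prioritize_new=False))
    # ['existing-1', 'existing-2', 'new-1']
```

With the flag on, the new document jumps ahead of the two existing ones; with it off, all three are processed strictly in arrival order.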
For the «Add existing documents to the OCR queue» option, the «OcrTotalOfExistingDocuments» configuration setting applies to the OCR queue. Smartker first processes any newer documents or versions, then turns to the rest of the queue. If the queue is large because it also contains a large number of existing documents, system performance can be affected.
The «OcrTotalOfExistingDocuments» setting can help reduce these effects. The default value is 1,000,000, but it can be adjusted in the web.config file located in C:\Program Files\Smartker Systems\Application Server\LibraryManager. When the queue contains a large number of documents, it is recommended that you:
* Run the operation after hours.
* Extend the WebServiceCallTimeoutSec setting for WebClient to avoid a client-side timeout. This is optional: the operation continues on the server side even if the client times out.
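Before changing the value, it can be useful to check what is currently configured. The sketch below reads a setting from web.config-style XML. It assumes the setting is stored as a standard `<appSettings>` key/value entry, which is a common .NET convention but is not confirmed here for Smartker; the sample XML is made up for illustration, so verify the layout of your own web.config first.

```python
import xml.etree.ElementTree as ET

# Made-up sample, ASSUMING the setting is a standard <appSettings> entry.
# The real Smartker web.config may be laid out differently.
SAMPLE_CONFIG = """<configuration>
  <appSettings>
    <add key="OcrTotalOfExistingDocuments" value="1000000" />
  </appSettings>
</configuration>"""

def read_app_setting(xml_text, key):
    """Return the value of an <appSettings> key, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for node in root.findall("./appSettings/add"):
        if node.get("key") == key:
            return node.get("value")
    return None

if __name__ == "__main__":
    print(read_app_setting(SAMPLE_CONFIG, "OcrTotalOfExistingDocuments"))
```

To inspect a real file, read its contents into a string and pass that to `read_app_setting` instead of `SAMPLE_CONFIG`. Back up web.config before editing it.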