Digitizing medical studies using Amazon Textract

James C. Malin, Chief Technology Officer, Diverse Programmers, LLC, is an AWS Certified Solutions Architect. He has held software and engineering roles in the public, private, and government sectors. Recently, he tackled an Amazon Textract project for a medical company.
A customer in the health care industry wanted to extract the text from tens of thousands of medical study results stored in PDF files. In the past, this customer had digitized these files manually. While effective, the manual approach is time-consuming, expensive, and error-prone. The customer wanted to tackle this problem with machine learning (ML) instead. The customer was also working under a tight deadline and needed the solution to be highly accurate in order to meet FDA regulations.
The technology that turns images into text is called optical character recognition (OCR). The customer wanted an OCR solution that could achieve at least 95% accuracy. To arrive at the best OCR solution, I first compared several open source libraries alongside Amazon Textract. Amazon Textract is an AWS service that automatically extracts text and data from scanned documents. It goes beyond simple OCR to also identify the contents of fields in forms and information stored in tables.
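To make a target such as 95% concrete, it helps to fix a metric. This article does not state how accuracy was measured, so the following is only a minimal sketch of one common choice, character-level accuracy based on edit distance against a manually verified transcription. The function names `edit_distance` and `character_accuracy` are hypothetical and chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly (assumed metric).

    reference  -- manually verified ground-truth text
    ocr_output -- text produced by the OCR engine
    """
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    # Clamp at zero: very noisy output can contain more errors than
    # the reference has characters.
    return max(0.0, 1.0 - errors / len(reference))
```

Under this metric, one wrong character in a 20-character reference gives an accuracy of 0.95. Averaging the score over a hand-checked sample of pages is one way to compare OCR engines on equal terms.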
I was already familiar with many of the available libraries. Even so, tuning these libraries to produce good results can be cumbersome and time-consuming. This was especially challenging here because the data structures contained within the documents were numerous and inconsistent, which would have required me to configure the open source libraries with several different sets of options.
After running a battery of tests, I determined that the best approach was to rely on Amazon Textract. I first converted each PDF document to a JPG image. I then enhanced the quality of each image before feeding it into Amazon Textract. Amazon Textract was valuable because of its ease of use and its ability to achieve a high level of accuracy with minimal adjustments to its engine.
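The last step of that pipeline, sending an image to Amazon Textract and reading back the text, can be sketched as follows. This is an illustration under stated assumptions, not the code used in the project: it assumes the synchronous `detect_document_text` call from the boto3 Textract client (for tables and form fields, `analyze_document` with `FeatureTypes` would be the alternative), and the helper names, the 95.0 confidence threshold, and the file path are hypothetical. The PDF-to-JPG conversion and image enhancement steps are omitted.

```python
from typing import Any, Dict, List, Tuple

Line = Tuple[str, float]  # (text, confidence on a 0-100 scale)


def extract_lines(response: Dict[str, Any]) -> List[Line]:
    """Keep only LINE blocks from a Textract response, in returned order."""
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def detect_lines(image_bytes: bytes, client: Any) -> List[Line]:
    """Send one JPEG or PNG image to Textract and return its text lines.

    client is expected to be a boto3 Textract client; it is passed in so
    the function can also be exercised with a stand-in object.
    """
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return extract_lines(response)


def flag_low_confidence(lines: List[Line], threshold: float = 95.0) -> List[str]:
    """Return lines whose confidence is below the (assumed) threshold,
    as candidates for manual review."""
    return [text for text, confidence in lines if confidence < threshold]


def run_on_file(path: str) -> List[Line]:
    """Example wiring; needs boto3, AWS credentials, and a configured region."""
    import boto3  # imported here so the helpers above work without it

    textract = boto3.client("textract")
    with open(path, "rb") as image_file:
        return detect_lines(image_file.read(), textract)
```

A call such as `run_on_file("study_page_001.jpg")` (a hypothetical file name) would return the detected lines with their confidence scores. Note that Textract's confidence score is the service's own estimate and is not the same as accuracy measured against a verified transcription.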
The results with Amazon Textract exceeded the customer's 95% target. The digitized medical studies showed consistently high accuracy across many different tables, layouts, symbols, and fonts. In the end, we achieved a 97% accuracy rate across all documentation. With the goals met, I handed the digitized data back to the customer so that they could complete their project.