AWS IQ is the fastest way to work with AWS Certified experts. Below is a case study of a customer’s experience working with an AWS IQ expert. To work with an AWS IQ expert visit AWS IQ.
AWS IQ CASE STUDY

Digitizing medical studies using Amazon Textract

Expert
James Malin – AWS IQ Expert

James C. Malin, Chief Technology Officer of Diverse Programmers, LLC, is an AWS Certified Solutions Architect. He has held software and engineering roles in the public, private, and government sectors. Recently, he tackled an Amazon Textract project for a medical company.

A customer in the health care industry wanted to extract the text from tens of thousands of medical study results stored in PDF files. In the past, this customer had digitized these files manually. While effective, the manual approach can be time-consuming, expensive, and error-prone. The customer wanted to tackle this problem with machine learning (ML) instead. The customer was also working under a tight deadline and needed the solution to have a high accuracy rate in order to meet FDA regulations.

The technology that turns images into text is called optical character recognition (OCR). The customer wanted an OCR solution that could achieve at least 95% accuracy. To arrive at the best OCR solution, I first compared several open source libraries alongside Amazon Textract. Amazon Textract is an AWS service that automatically extracts text and data from scanned documents. It goes beyond simple OCR to also identify the contents of fields in forms and information stored in tables.

I was already familiar with many of the available libraries. Even so, tuning these libraries to produce good results can be cumbersome and time-consuming. It would have been especially challenging here because the data structures within the documents were numerous and inconsistent, which would have required me to configure the open source libraries with several different sets of options.

After running a battery of tests, I determined that the best approach was to rely on Amazon Textract. I first converted each PDF document to a JPG image. I then enhanced the quality of the image before finally feeding it into Amazon Textract. Amazon Textract was valuable because of its ease of use and its ability to achieve a high level of accuracy with minimal adjustments to its engine.
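The three steps above (convert, enhance, extract) can be sketched in Python roughly as follows. This is an illustrative sketch, not the code used on the project: the case study does not name the conversion or enhancement tools, so the use of `pdf2image` and Pillow, the 300 DPI setting, the grayscale, autocontrast, and sharpen steps, the one-image-per-page rendering, and all function names are assumptions. The Textract call uses the synchronous `detect_document_text` operation from `boto3` and requires AWS credentials.

```python
import io


def pdf_to_images(pdf_path, dpi=300):
    """Render each page of a PDF as a PIL image.

    Assumption: pdf2image (and its poppler dependency) is installed.
    """
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi)


def enhance(image):
    """Apply simple clean-up: grayscale, stretch contrast, sharpen.

    Assumption: the actual enhancement steps are not described in the source.
    """
    from PIL import ImageFilter, ImageOps
    gray = ImageOps.grayscale(image)
    return ImageOps.autocontrast(gray).filter(ImageFilter.SHARPEN)


def to_jpeg_bytes(image, quality=95):
    """Encode a PIL image as JPEG bytes for upload."""
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return buffer.getvalue()


def extract_lines(response, min_confidence=0.0):
    """Pull the text of LINE blocks out of a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def digitize(pdf_path):
    """Return a list of pages, each a list of recognized text lines."""
    import boto3  # needs AWS credentials and a configured region
    client = boto3.client("textract")
    pages = []
    for image in pdf_to_images(pdf_path):
        response = client.detect_document_text(
            Document={"Bytes": to_jpeg_bytes(enhance(image))}
        )
        pages.append(extract_lines(response))
    return pages
```

Calling `digitize("study.pdf")` (a hypothetical file name) would return one list of text lines per page. For documents where table or form structure matters, Textract's `analyze_document` operation with `FeatureTypes=["TABLES", "FORMS"]` returns that structure in addition to the raw text.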

The results I produced with Amazon Textract were very successful. The digitized medical studies showed consistently high accuracy across many different tables, layouts, symbols, and fonts. In the end, we achieved a 97% accuracy rate across all documentation. With the goals met, I handed the digitized data back to the customer so that they could complete their project.
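The case study does not say how the accuracy rate was measured. One common way to score OCR output is character-level accuracy: compare the extracted text against a manually verified transcription and count the character edits needed to turn one into the other. The sketch below assumes that metric; the function names and the 0.95 default (taken from the customer's 95% target) are illustrative only.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference characters recovered correctly, between 0 and 1."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


def meets_target(reference: str, hypothesis: str, target: float = 0.95) -> bool:
    """Check OCR output against an accuracy target (assumed 95% here)."""
    return character_accuracy(reference, hypothesis) >= target
```

For example, if the verified text is `"dose 10 mg"` and the OCR output is `"dose 1O mg"` (the letter O in place of a zero), there is one error in ten characters, so the character accuracy is 0.9 and the 95% target is not met.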
