Amazon Textract is a fully managed machine learning service that automatically extracts text, handwriting, and structured data (tables, forms) from scanned documents. It goes beyond simple optical character recognition (OCR) by also identifying the context in which the data is presented, such as the relationship between fields in forms or the headers in tables.
Researchers at UCSF studying the JUUL Labs collection can benefit from Textract by automating the labor-intensive task of manually transcribing and organizing text from a vast array of documents. With Textract, you can quickly and accurately extract relevant data, allowing more time for analysis and interpretation.
In this workshop, we will:
Before starting, ensure you have:
We’ll be using the boto3 library to interact with AWS services, Pillow for image handling, and the Amazon Textract helper libraries. The cell below installs these with IPython magics in a Jupyter notebook, but you could also pip install these libraries through your terminal.
!pip install boto3
!pip install pillow
!python -m pip install -q amazon-textract-response-parser --upgrade
!python -m pip install -q amazon-textract-caller --upgrade
!python -m pip install -q amazon-textract-prettyprinter==0.0.16
!python -m pip install -q amazon-textract-textractor --upgrade
Now, if you are using IPython, restart the kernel so the newly installed libraries are available.
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
To interact with AWS, you need to set up your AWS credentials, which we’ll do through SageMaker. First, import all the necessary libraries.
import boto3
import botocore
import sagemaker
import pandas as pd
from IPython.display import Image, display, JSON
from textractcaller.t_call import call_textract, Textract_Features, call_textract_expense
from textractprettyprinter.t_pretty_print import convert_table_to_list
from trp import Document
import os
Then, set the necessary variables and create Textract and Comprehend clients. This retrieves information from the SageMaker session, so the exact code below won’t work outside SageMaker; in that case, you would need to set up your credentials through the AWS CLI.
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')
os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()
s3 = boto3.client('s3')
textract_client = boto3.client('textract', region_name=region)
comprehend_client = boto3.client('comprehend', region_name=region)
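If you’re not on SageMaker, boto3 will pick up credentials configured through the AWS CLI. A minimal sketch (the key values and region are placeholders; substitute your own):

```shell
# Store credentials locally so boto3 can find them automatically.
# These values are placeholders -- replace them with your own.
aws configure set aws_access_key_id YOUR_ACCESS_KEY_ID
aws configure set aws_secret_access_key YOUR_SECRET_ACCESS_KEY
aws configure set region us-west-2
```

With that in place, `boto3.client('textract')` and `boto3.client('comprehend')` work without any SageMaker-specific setup.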
Textract can extract raw text from various types of documents. This includes printed text as well as handwritten notes.
def extract_text_from_document(image_path):
    # Read the image and send its bytes to Textract's synchronous OCR API
    with open(image_path, 'rb') as document:
        response = textract_client.detect_document_text(Document={'Bytes': document.read()})
    # Collect every LINE block into a newline-separated string
    extracted_text = ''
    for item in response['Blocks']:
        if item['BlockType'] == 'LINE':
            extracted_text += item['Text'] + '\n'
    return extracted_text
# Example usage
image_path = 'path/to/your/document-image.png'
extracted_text = extract_text_from_document(image_path)
print(extracted_text)
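To see how the Blocks parsing works without calling AWS, here is a pared-down mock of a `detect_document_text` response (the sample text is invented, and real responses also carry PAGE/WORD geometry and confidence scores):

```python
# A simplified mock of a DetectDocumentText response -- structure only;
# real responses also include Geometry, Confidence, and Relationships.
mock_response = {
    'Blocks': [
        {'BlockType': 'PAGE', 'Id': 'p1'},
        {'BlockType': 'LINE', 'Id': 'l1', 'Text': 'JUUL Labs, Inc.'},
        {'BlockType': 'LINE', 'Id': 'l2', 'Text': 'Internal memo, 2018'},
        {'BlockType': 'WORD', 'Id': 'w1', 'Text': 'JUUL'},
    ]
}

# The same LINE-filtering logic used in extract_text_from_document
lines = [b['Text'] for b in mock_response['Blocks'] if b['BlockType'] == 'LINE']
extracted = '\n'.join(lines)
print(extracted)
```

Note that WORD blocks are skipped: each LINE already contains the full text of its words, so filtering on LINE avoids duplicates.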
Textract can also detect and extract structured data from tables and forms within documents.
def extract_tables_from_document(image_path):
    with open(image_path, 'rb') as document:
        response = textract_client.analyze_document(Document={'Bytes': document.read()},
                                                    FeatureTypes=['TABLES'])
    # Index blocks by ID so cell and word lookups are fast
    blocks_by_id = {block['Id']: block for block in response['Blocks']}
    tables = []
    for block in response['Blocks']:
        if block['BlockType'] == 'TABLE':
            table = []
            for relationship in block.get('Relationships', []):
                if relationship['Type'] != 'CHILD':
                    continue
                for cell_id in relationship['Ids']:
                    cell = blocks_by_id[cell_id]
                    # CELL blocks don't carry text directly; it lives in child WORD blocks
                    words = []
                    for cell_rel in cell.get('Relationships', []):
                        if cell_rel['Type'] == 'CHILD':
                            words += [blocks_by_id[wid]['Text'] for wid in cell_rel['Ids']
                                      if 'Text' in blocks_by_id[wid]]
                    table.append(' '.join(words))
            tables.append(table)
    return tables
# Example usage
tables = extract_tables_from_document(image_path)
print(tables)
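The flat cell lists above lose the table’s layout. CELL blocks also carry `RowIndex` and `ColumnIndex` fields, which let you regroup cells into rows; a sketch operating on already-extracted `(row_index, col_index, text)` tuples:

```python
from collections import defaultdict

def cells_to_rows(cells):
    """Group (row_index, col_index, text) tuples into ordered rows,
    mirroring the RowIndex/ColumnIndex fields on Textract CELL blocks."""
    rows = defaultdict(dict)
    for row_idx, col_idx, text in cells:
        rows[row_idx][col_idx] = text
    # Sort rows by row index, and cells within each row by column index
    return [[row[c] for c in sorted(row)] for _, row in sorted(rows.items())]

# Example: a 2x2 table whose cells arrive in arbitrary order
cells = [(2, 1, 'JUUL'), (1, 2, 'Count'), (1, 1, 'Brand'), (2, 2, '42')]
print(cells_to_rows(cells))
```

The same idea underlies `convert_table_to_list` from the pretty-printer library imported earlier, which does this grouping for you directly from a Textract response.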
Often, we won’t have an image file we can directly work with. IDL documents in particular are frequently multipage PDFs, which the synchronous Textract calls above can’t process directly. So, we’ll use the helper function below to convert a PDF into a series of PNGs. Keep in mind that this writes every PNG to disk, which uses more storage; we could instead keep the pages in memory as byte arrays, but that adds complexity.
This function converts one PDF to a series of PNGs and puts them in output_folder. dpi is an integer specifying the resolution of the images: a lower resolution saves space, but too low a resolution will degrade Textract’s accuracy. The function returns a list of image paths.
from pdf2image import convert_from_path

def convert_pdf_to_pngs(pdf_path, output_folder='output_images', dpi=300):
    # Make sure the output directory exists before writing pages into it
    os.makedirs(output_folder, exist_ok=True)
    images = convert_from_path(pdf_path, dpi=dpi)
    image_paths = []
    for i, image in enumerate(images):
        image_path = f'{output_folder}/page_{i+1}.png'
        image.save(image_path, 'PNG')
        image_paths.append(image_path)
    return image_paths
Now, we’ll use a helper function that extracts OCR from a list of PNGs.
def extract_text_from_png_list(image_paths):
    extracted_text = ''
    for image_path in image_paths:
        text = extract_text_from_document(image_path)
        extracted_text += text + '\n'
    return extracted_text
Once the text is extracted, it often needs cleaning and normalization to be useful for analysis.
def clean_extracted_text(text):
    # Example cleaning process: remove excessive whitespace and normalize case
    cleaned_text = text.replace('\n', ' ').strip().lower()
    return cleaned_text
# Example usage
cleaned_text = clean_extracted_text(extracted_text)
print(cleaned_text)
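OCR output also tends to contain runs of repeated whitespace and words hyphenated across line breaks. A slightly more thorough cleaner, sketched with the standard `re` module:

```python
import re

def clean_extracted_text_v2(text):
    # Rejoin words split across lines with a trailing hyphen, e.g. "nico-\ntine"
    text = re.sub(r'-\n\s*', '', text)
    # Collapse any remaining whitespace runs into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()

print(clean_extracted_text_v2('Nico-\ntine  levels\nwere  HIGH'))
```

Be aware that rejoining hyphenated words can also remove genuine hyphens that happen to fall at line ends, so check a sample of your documents before applying it broadly.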
For researchers focusing on specific themes, such as the impact of nicotine or vaping, it can be useful to analyze the frequency and context of certain keywords.
def analyze_text_for_keywords(text, keywords):
    results = {}
    for keyword in keywords:
        occurrences = text.count(keyword)
        results[keyword] = occurrences
    return results
# Example usage
keywords = ['nicotine', 'vaping', 'health']
analysis_results = analyze_text_for_keywords(cleaned_text, keywords)
print(analysis_results)
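Note that `str.count` matches substrings, so counting 'health' also counts every occurrence of 'healthy'. If you want whole-word counts instead, a word-boundary regex variant:

```python
import re

def analyze_text_for_keywords_whole_words(text, keywords):
    # \b anchors matches at word boundaries, so 'health' won't count 'healthy'
    return {kw: len(re.findall(rf'\b{re.escape(kw)}\b', text)) for kw in keywords}

sample = 'vaping and health; healthy habits reduce vaping'
print(analyze_text_for_keywords_whole_words(sample, ['vaping', 'health']))
```

Which behavior you want depends on the research question; substring matching can be useful when you deliberately want to catch inflected forms.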
Amazon Comprehend can be used alongside Textract to perform sentiment analysis on the extracted text. This can be useful for gauging the overall tone of documents.
def analyze_sentiment(text):
    response = comprehend_client.detect_sentiment(Text=text, LanguageCode='en')
    return response['Sentiment']
# Example usage
sentiment = analyze_sentiment(cleaned_text)
print(f"Sentiment of the document: {sentiment}")
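One caveat: `detect_sentiment` accepts at most 5,000 bytes of UTF-8 text per call, and long OCR output will exceed that. A sketch of one workaround (the chunk size and the idea of aggregating per-chunk labels afterwards are my choices, not Comprehend requirements):

```python
def chunk_text(text, max_bytes=4500):
    """Split text into pieces that each fit under Comprehend's 5,000-byte
    limit, breaking on whitespace so words stay intact."""
    chunks, current = [], ''
    for word in text.split():
        candidate = (current + ' ' + word).strip()
        if len(candidate.encode('utf-8')) > max_bytes and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# You would then call analyze_sentiment on each chunk and aggregate the
# labels, e.g. by taking the most common sentiment across chunks.
print(len(chunk_text('word ' * 2000)))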
Beyond sentiment, Comprehend can also recognize entities such as organizations, locations, and people mentioned in the text.
def analyze_entities(text):
    response = comprehend_client.detect_entities(Text=text, LanguageCode='en')
    entities = [(entity['Text'], entity['Type']) for entity in response['Entities']]
    return entities
# Example usage
entities = analyze_entities(cleaned_text)
print(f"Entities found: {entities}")
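Since `detect_entities` returns one tuple per mention, the same organization can appear many times in a long document. Counting mentions with `collections.Counter` gives a quick frequency view (the sample tuples below are invented for illustration):

```python
from collections import Counter

def summarize_entities(entities):
    """Count how often each (text, type) entity pair was mentioned."""
    return Counter(entities)

# Example with the (Text, Type) tuples analyze_entities returns
sample_entities = [('JUUL Labs', 'ORGANIZATION'), ('San Francisco', 'LOCATION'),
                   ('JUUL Labs', 'ORGANIZATION')]
print(summarize_entities(sample_entities).most_common(1))
```

`most_common(n)` then surfaces the most frequently mentioned entities, which is often a faster way to skim a document collection than reading raw entity lists.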
Honestly, this is one of my favorite features. We can extract data according to natural language queries we pass in. This is perfect for easily automating the extraction of simple fields, but it can also pull out more nuanced data. The function below takes a document path and a list of queries, and returns the answers.
from textractcaller import QueriesConfig, Query, call_textract
from textractcaller.t_call import Textract_Features
import trp.trp2 as t2

def query_based_extraction(document_path, queries):
    # Set up the queries
    textract_queries = [Query(text=query['text'], alias=query['alias'], pages=query['pages'])
                        for query in queries]
    # Set up the query config with the above queries
    queries_config = QueriesConfig(queries=textract_queries)
    # Read the document
    with open(document_path, 'rb') as document:
        image_bytes = bytearray(document.read())
    # Call Textract with the QUERIES feature enabled
    response = call_textract(input_document=image_bytes,
                             features=[Textract_Features.QUERIES],
                             queries_config=queries_config)
    doc_ev = t2.TDocumentSchema().load(response)
    # Extract and return query answers, keyed by alias
    entities = {}
    for page in doc_ev.pages:
        query_answers = doc_ev.get_query_answers(page=page)
        for answer in query_answers:
            entities[answer[1]] = answer[2]
    return entities
Here’s an example usage. The text is the natural language query, and the alias is simply a name we give to the result of the query.
# Define the queries
queries = [
    {"text": "What is the applicant's date of employment?", "alias": "EMPLOYMENT_DATE", "pages": ["1"]},
    {"text": "What is the probability of continued employment?", "alias": "CONTINUED_EMPLOYMENT_PROB", "pages": ["1"]}
]
# Path to your document
document_path = "./path/to/employment_document.png"
# Perform the query-based extraction
extracted_entities = query_based_extraction(document_path, queries)
# Display the results
print(extracted_entities)
Let’s combine everything we’ve learned so far to analyze a document from the JUUL Labs collection.
# Load and extract text
image_path = 'path/to/juul/document.png'
extracted_text = extract_text_from_document(image_path)
# Clean the text
cleaned_text = clean_extracted_text(extracted_text)
# Analyze keywords
keywords = ['nicotine', 'vaping', 'health']
keyword_analysis = analyze_text_for_keywords(cleaned_text, keywords)
# Analyze sentiment
sentiment = analyze_sentiment(cleaned_text)
# Analyze entities
entities = analyze_entities(cleaned_text)
# Display results
print("Extracted Text:\n", cleaned_text)
print("Keyword Analysis Results:\n", keyword_analysis)
print("Document Sentiment:\n", sentiment)
print("Entities Detected:\n", entities)
For large collections, such as the JUUL Labs dataset, you can batch process documents and store results for further analysis.
def process_documents_in_directory(directory_path):
    results = []
    for filename in os.listdir(directory_path):
        if filename.endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(directory_path, filename)
            text = extract_text_from_document(image_path)
            cleaned_text = clean_extracted_text(text)
            sentiment = analyze_sentiment(cleaned_text)
            # Reuses the `keywords` list defined earlier
            keyword_analysis = analyze_text_for_keywords(cleaned_text, keywords)
            entities = analyze_entities(cleaned_text)
            results.append({
                'filename': filename,
                'sentiment': sentiment,
                'keywords': keyword_analysis,
                'entities': entities
            })
    return results
# Example usage
directory_path = 'path/to/juul/collection'
document_analysis_results = process_documents_in_directory(directory_path)
for result in document_analysis_results:
    print(result)
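Since pandas is already imported, the per-document results can be tabulated and written out for later analysis. A minimal sketch (the helper name and CSV path are my own, and the sample record below is invented to mirror the shape `process_documents_in_directory` produces):

```python
import pandas as pd

def results_to_dataframe(results, csv_path=None):
    """Flatten the batch results into a DataFrame; optionally write a CSV."""
    df = pd.DataFrame(results)
    if csv_path:
        df.to_csv(csv_path, index=False)
    return df

# Example with a record shaped like process_documents_in_directory's output
sample_results = [{'filename': 'doc1.png', 'sentiment': 'NEUTRAL',
                   'keywords': {'nicotine': 3},
                   'entities': [('JUUL Labs', 'ORGANIZATION')]}]
df = results_to_dataframe(sample_results)
print(df[['filename', 'sentiment']])
```

From there you can sort or filter across the whole collection, e.g. pulling out every document whose sentiment is NEGATIVE.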
This only scratches the surface of what AWS tools like Textract and Comprehend can do. They also support tasks ranging from key-value pair extraction (e.g., pulling somebody’s birth date from a form) and signature detection to PHI and PII identification with Comprehend Medical (which would be of particular relevance to UCSF).
This workshop provided a brief introduction to using Amazon Textract and Comprehend for document processing and analysis. These are powerful tools for streamlining and enhancing research workflows, especially when dealing with large document collections like those the IDL holds.
By automating text extraction, cleaning, and basic analysis, important documents can quickly be identified and more time can be devoted to drawing meaningful insights from the data.