Processing medical documents using PyTesseract and Spacy

Introduction
We all love the convenience of simply clicking pictures of our assignments, results, lab reports, etc., and uploading them wherever needed, right? Now change perspective and think about it from an organization’s point of view. Your organization receives lakhs (hundreds of thousands) of images each month, and user data needs to be extracted from every one of them. How do you go about solving this?
In this blog I’ll give you a bird’s-eye view of how to implement a simple document reader. I will cover all the steps, from reading an image file (a lab report) to identifying the medical test names written in the report.
So let's begin!!!
Table of Contents:
- What is OCR?
- Setting Up Your Environment
- Installing Pytesseract and SciSpacy
- Performing OCR with Pytesseract
- Extracting Medical Test Names with SciSpacy
- Putting It All Together
- Conclusion
1. What is OCR?🤔
Optical Character Recognition (OCR) is a technology that converts different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. In the context of medical records, OCR can be incredibly useful for extracting information from handwritten or printed documents, including medical test names.
2. Setting Up Your Environment🐼
Before diving into OCR and medical entity recognition, you need to set up your Python environment. Make sure you have Python installed, preferably Python 3.x. You’ll also need to install the required libraries: Pytesseract and SciSpacy.
NOTE: Pytesseract is only a wrapper, so the Tesseract engine itself has to be installed separately. Instructions for Windows can be found here: https://linuxhint.com/install-tesseract-windows/
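Before moving on, it can help to check that the Tesseract binary is actually reachable from Python. The snippet below is a minimal sketch, not an official setup routine: find_tesseract and tesseract_version are helper names I made up for this post, and the default binary name "tesseract" is an assumption (on Windows you may need to pass the full path to tesseract.exe instead).

```python
import shutil
from typing import Optional, Sequence


def find_tesseract(candidates: Sequence[str] = ("tesseract",)) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def tesseract_version(binary_path: str) -> str:
    """Point pytesseract at the given binary and return the engine version."""
    # Imported here so find_tesseract() works even before pytesseract is installed.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = binary_path
    return str(pytesseract.get_tesseract_version())


binary = find_tesseract()
print(binary if binary else "Tesseract not found on PATH; install it first")
```

If a path is printed, calling tesseract_version(binary) once Pytesseract is installed should confirm that the wrapper and the engine can talk to each other.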
3. Installing Pytesseract and SciSpacy🙌
You can install Pytesseract and SciSpacy using pip:
pip install pytesseract
pip install scispacy
Additionally, you’ll need to install a language model for SciSpacy. You can choose from various models depending on your specific needs. For medical text, “en_core_sci_md” is a good choice:
Installation method 1:
# The link might break in the future; make sure to use the latest version
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_md-0.3.0.tar.gz
Installation method 2 (Recommended):
First, download the “en_core_sci_md” model from https://allenai.github.io/scispacy/
Then run the following command to install it, replacing the file name with the one you downloaded:
pip install en_core_sci_md-0.5.1.tar.gz
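Once the installs finish, a quick sanity check saves debugging time later. This is a small sketch that only uses the standard library; is_installed is a hypothetical helper, and it assumes the model was installed under the package name en_core_sci_md, which is how pip-installed spaCy models are normally exposed.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of everything this walkthrough relies on.
for name in ("pytesseract", "spacy", "scispacy", "en_core_sci_md"):
    status = "OK" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

Anything reported as MISSING needs to be installed before the later steps will run.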
4. Performing OCR with Pytesseract 💯
Pytesseract is a Python wrapper for Google’s Tesseract-OCR Engine. It allows you to extract text from images. Here’s a simple example of how to use it:
import pytesseract
# Perform OCR
text = pytesseract.image_to_string("path to your image.jpg")
# Print the extracted text
print(text)
Make sure the image containing the medical test names is saved in the same directory as your Python script, or provide the correct path to it.
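Raw OCR output tends to contain blank lines, uneven spacing and, in my experience, a trailing form-feed character. The sketch below shows one way to tidy the text before handing it to the NER step. clean_ocr_text is a hypothetical helper, and the sample string is made up for illustration rather than taken from a real report.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Collapse runs of whitespace and drop empty lines from OCR output."""
    lines = []
    for line in raw.splitlines():
        # Replace any run of spaces/tabs/control whitespace with one space.
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


# Made-up sample resembling what an OCR engine might return.
sample = "  Hemoglobin     13.5  g/dL \n\n\n  Blood   Glucose\t98 mg/dL  \n\x0c"
print(clean_ocr_text(sample))
```

Keeping one report line per text line is a deliberate choice here: lab reports are usually laid out row by row, so the line breaks carry useful structure.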
5. Extracting Medical Test Names with SciSpacy ✅
SciSpacy is a library built on top of Spacy, designed specifically for processing biomedical and clinical text. It can recognize medical entities such as test names. Here’s how to use it:
import spacy
# Load the SciSpacy model
nlp = spacy.load("en_core_sci_md")
# Process the text extracted from OCR
doc = nlp(text)
# Print every entity the model recognized
for ent in doc.ents:
    print(ent.text)
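As far as I can tell, the en_core_sci models tag every span with one generic label rather than a specific "test name" label, so you will likely want a filtering step of your own. Here is a minimal sketch that keeps only the entities found in a vocabulary of known tests. Both KNOWN_TESTS and pick_test_names are hypothetical, and the three test names are placeholders for whatever list your organization maintains.

```python
from typing import Iterable, List, Set

# Hypothetical vocabulary; replace it with your own list of lab tests.
KNOWN_TESTS: Set[str] = {"hemoglobin", "blood glucose", "platelet count"}


def pick_test_names(
    entity_texts: Iterable[str], vocabulary: Set[str] = KNOWN_TESTS
) -> List[str]:
    """Return the unique entity texts (normalized) that appear in the vocabulary."""
    seen = set()
    matches = []
    for text in entity_texts:
        # Lowercase and collapse whitespace so OCR spacing noise does not matter.
        key = " ".join(text.lower().split())
        if key in vocabulary and key not in seen:
            seen.add(key)
            matches.append(key)
    return matches


# With a real document you would pass: (ent.text for ent in doc.ents)
print(pick_test_names(["Hemoglobin", "patient", "Blood  Glucose", "hemoglobin"]))
```

Exact matching is the simplest possible rule. Fuzzy matching would tolerate OCR misspellings better, but that is a separate design decision.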
6. Putting It All Together😎
Now that you have learned how to perform OCR with Pytesseract and extract medical test names with SciSpacy, you can combine these steps to create a complete workflow. Here’s a basic example:
import pytesseract
import spacy
### Preprocessing layer ###
# You can perform rotation and skew correction,
# color-to-B/W conversion, etc. here
### OCR layer ###
# Perform OCR
text = pytesseract.image_to_string("path to your image.jpg")
### NER layer ###
# Load the SciSpacy model
nlp = spacy.load("en_core_sci_md")
# Process the text extracted from OCR
doc = nlp(text)
### Postprocessing layer ###
# Loop through recognized entities and print their text and label
# (canonical names would need an entity linker, which is not set up here)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
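If you need to run the same steps over many images, it may be worth wrapping them in a function so that the model is loaded only once. The sketch below is one possible arrangement, not the only one: extract_entities and build_default_pipeline are hypothetical names, and the OCR and NER steps are passed in as plain callables so that either can be swapped out or replaced by a stub in tests.

```python
from typing import Callable, Iterable, List, Tuple


def extract_entities(
    image_path: str,
    ocr: Callable[[str], str],
    find_entities: Callable[[str], Iterable[str]],
) -> List[str]:
    """Run OCR on one image, then return the non-empty entity strings found."""
    text = ocr(image_path)
    if not text.strip():
        # Nothing was read from the image, so skip the NER step.
        return []
    return [e.strip() for e in find_entities(text) if e.strip()]


def build_default_pipeline() -> Tuple[Callable[[str], str], Callable[[str], List[str]]]:
    """Create the OCR and NER callables used in this post (model loaded once)."""
    # Imported lazily so extract_entities() can be tested without these packages.
    import pytesseract
    import spacy

    nlp = spacy.load("en_core_sci_md")

    def find_entities(text: str) -> List[str]:
        return [ent.text for ent in nlp(text).ents]

    return pytesseract.image_to_string, find_entities


# Example usage (requires Tesseract, pytesseract and the model to be installed):
# ocr, find_entities = build_default_pipeline()
# print(extract_entities("path to your image.jpg", ocr, find_entities))
```

Call build_default_pipeline() once at startup and reuse the returned callables for every image, since loading the model is the slow part.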
7. Conclusion💯
In this blog, we’ve explored how to perform Optical Character Recognition (OCR) using Pytesseract and then extract medical test names from the OCR output using SciSpacy. This combination of tools can streamline the process of digitizing and structuring medical data, making it easier to manage and analyze. Whether you are a healthcare professional or a data scientist, mastering these techniques can enhance your ability to work with medical records and contribute to improved patient care and medical research.
Remember that the accuracy of OCR and entity recognition can vary depending on the quality and complexity of the documents you are working with. It’s essential to fine-tune your approach and handle specific challenges that may arise in your healthcare data extraction tasks.