Learn More

Try out the world’s first micro-repo!

Learn More

How we made our optical character recognition shine for code

How we made our optical character recognition shine for code

For the past year, I’ve been a member of a team of machine learning engineers at Pieces.app working to develop an optical character recognition (OCR) engine specifically for code.

Our OCR engine at Pieces.app uses Tesseract as its main OCR engine. Tesseract performs a layout analysis and then uses an LSTM trained on text-image pairs to predict the characters in the image. It is currently one of the best free OCR tools and supports over 100 languages, but is not particularly fine-tuned to code. To make it really shine for code snippets, we added specific pre and post-processing steps to our recognition pipeline.

Standardized inputs through image pre-processing

To best support software engineers when they want to transcribe code from images, we fine-tuned our pre-processing pipeline to screenshots of code in IDEs, terminals and online resources like YouTube videos and blog posts. Since programming environments can be in light or dark mode, both modes should yield good results. Additionally, we wanted to support images with gradients or noisy backgrounds, as might be found in YouTube programming tutorials or retro websites, as well as images with low resolution, for example from being compressed from uploading or sending a screenshot.

Dark and light mode input images

Since Tesseract works best on binarized, light-mode images, we needed to invert dark-mode images in pre-processing. To determine which images are in dark mode, our engine first median-blurs the image to remove outliers, and then calculates the average pixel brightness. If it is lower than a specific threshold, it is determined to be dark and thus inverted. To handle gradient and noisy backgrounds, we use a dilation-based approach. We generate a copy of the image and apply a dilation kernel and median blur on it. We then subtract this blurred copy from the original image to remove dark areas without disturbing the text on the image. For low-resolution images, we upsample the image depending on the input size using bicubic upsampling.

Inverted and binarized input image

Code requires layout formatting

On the text prediction of Tesseract, we perform a layout analysis and infer the indentation of the produced code. Tesseract, by default, does not indent any output, which can not only make code less readable, but even change its meaning in languages such as Python. To add indentation, we use the bounding boxes that Tesseract returns for every line of code. Using the width of the box and the number of characters found in it, we calculate the average width of a character in that line. We then use the starting coordinates of the box to calculate by how many spaces is it indented compared to the other code lines. After that, we use a simple heuristic to push the indentations into even numbers of spaces.

Transcribed code with and without inferred indentation

Evaluating our pipeline

To evaluate our modifications to the OCR pipeline, we use multiple sets of hand-crafted and generated datasets of image-text pairs. By running OCR on each image, we then calculate the Levenshtein distance between the predicted text and the ground truth. We treat each modification as a research hypothesis and then use experiments to validate them. For upsampling small images, for example, our research hypothesis was that super-resolution models like SRCNN (Super-Resolution Convolutional Neural Network) would boost OCR performance more than standard upsampling methods like nearest-neighbor interpolation or bicubic interpolation. To test this hypothesis, we ran the OCR pipeline multiple times on the same datasets, each time using a different upsampling method. While we found that nearest-neighbor upsampled images yield worse results, we did not find a significant difference between super-resolution based upsampling and bicubic upsampling for our pipeline. Given that super-resolution models need more storage space and have a higher latency than bicubic upsampling, we decided to go with bicubic upsampling for our pipeline.

Samples from generated text image pairs

Overall, code is a challenging objective for OCR, since it has to capture highly structured syntax and formatting, while allowing for unstructured variable names and code comments. We’re happy to provide one of the first OCR models fine-tuned to code and are continuing to improve the model to make it faster and more accurate, so you can get usable code from your screenshots and continue coding.

To test our model on your own code screenshots, visit codefromscreenshot.com or download the Pieces app.  If you’re a developer interested in our APIs, check out runtime.dev or email us at support@runtime.dev.

Interested in becoming a Pieces Content Partner?

Learn More

Get our latest blog posts and product updates by signing up for our monthly newsletter! 

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Table of Contents

Machine Learning

OCR

Programming

More from Pieces