Character Recognition Using OpenCV on DE1-SOC

Fred Kummer, Amardeep Manak, and Rubaiyeth Rafiea

Introduction

The goal of this project was to build a system on the DE1-SoC capable of recognizing letters and small words and then speaking them aloud using pre-recorded speech samples. The motivation came from people with vision problems that cannot be corrected, who can therefore struggle to read text on signs or labels.


The DE1-SoC is a system-on-chip that couples an FPGA with a dual-core ARM processor (the Hard Processor System, or HPS). It has numerous peripherals on both the FPGA and ARM sides, which makes it ideal for projects that require customizable hardware. This project uses a speaker, a VGA monitor, and a camera. The work was divided into three main parts: recording speech and playing it on the speaker; displaying a live video feed on the VGA monitor and capturing and saving frames from that feed; and word or character recognition using OpenCV. The character and word recognition was based on a technique known as "template matching", which compares template images against an input image to determine how well each template matches the content of the image. By making each letter into a template, the system could determine which letters were present.


Design

Design Overview

This project uses an NTSC camera to display a live video feed from the FPGA on a VGA monitor. The user initiates character detection by entering the number of characters to be detected using the keyboard. Once detection is initiated, one frame of the video feed is captured and stored as a bitmap on the HPS to be processed. The HPS uses OpenCV to determine which characters are present in the image. Once character matches are obtained, a speaker is used to read out the detected letters. If the letters form a meaningful word stored in a small dictionary on the HPS, that word is spoken instead of the individual letters. Sound is played by opening a file of samples corresponding to a specific letter or word and playing them through the audio FIFO. The block diagram below provides a high-level overview of the system.

System Block Diagram

Sound Recording and Playback

The first objective was to record sound on a PC. This was achieved by plugging a microphone into the PC and using the sound recorder application in Windows to record the letters of the alphabet and some chosen words. These were saved as .wma files.


Browsing through the ECE 5760 website, we found an existing project in which an audio file is read by Matlab and sent over UDP to the HPS, where it is received and stored in a common buffer. A second program then reads this buffer and loads the samples into the speaker FIFOs.


Without changing any of the code, we first attempted to simply replace the default audio file used by this project with our recorded file. When it was played on the speaker, only junk was heard. One issue we identified was that the audioread function in Matlab cannot read .wma files, so the recordings were converted to .wav files. This did not resolve the problem. We then realized that the audio had to be sampled at 8 kHz. Audacity, an external audio editor, was used to resample our recordings to 8 kHz, after which they played back correctly on the speaker.


Now that the project from the website was working, we wanted to change the code so that, instead of streaming the file from the PC, the audio file would be stored on the SD card and played from the Linux filesystem. Initially, normal C file I/O was used to read the audio file, but we quickly realized that a .wav file cannot simply be read byte-by-byte when the goal is to extract samples that can be fed to the speakers. A quick search online showed that libraries exist which provide APIs to extract the samples from an audio file, but once the samples are extracted they still have to be converted to fixed-point values, which requires additional code.


In the Matlab UDP program from the website, samples are extracted using the audioread function, stored in an array, and then sent via UDP. We added code so that, while sending the samples, the script also writes them to a text file.


This produced a text file full of audio samples, which was transferred to the SD card. To play it, the samples must be read from the text file and written to the common buffer. Since everything read from a text file is a string, each sample must first be converted to an integer, which is accomplished using the strtol function. The integer samples are written to the common buffer, and the audio buffer program is then run, which takes the samples from the buffer and puts them into the audio FIFO. With this approach we were able to play back the letters A through Z and some words.
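As a concrete illustration of this step, below is a minimal sketch of the text-to-buffer conversion, assuming one sample per line. The file name and in-memory buffer here are hypothetical stand-ins for the real shared buffer polled by bufferToFIFO.c:

```cpp
// Minimal sketch: read integer audio samples from a text file with strtol
// and load them into a buffer. File name and buffer size are hypothetical.
#include <cstdio>
#include <cstdlib>

int main() {
    FILE *fp = std::fopen("sample_A.txt", "r");   // hypothetical sample file
    if (!fp) { std::perror("fopen"); return 1; }

    const int MAX_SAMPLES = 65536;
    static int buffer[MAX_SAMPLES];               // stand-in for the common buffer
    int count = 0;
    char line[64];
    while (count < MAX_SAMPLES && std::fgets(line, sizeof(line), fp)) {
        char *end;
        long sample = std::strtol(line, &end, 10); // convert the text to an integer sample
        if (end != line) buffer[count++] = (int)sample;
    }
    std::fclose(fp);

    // buffer[0..count) would then be handed off to the audio FIFO program.
    std::printf("loaded %d samples into the buffer\n", count);
    return 0;
}
```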


Installation and Build of OpenCV

Installing OpenCV and successfully compiling OpenCV programs on the HPS proved to be much more challenging than first expected. The online documentation for installing OpenCV is quite detailed, but we could not find any specific examples where OpenCV had been installed on a DE1-SoC or a similar system. The first difficulty was simply transferring the install files to the HPS. Typically the files are downloaded directly onto a system, but the restrictions on network access when using the HPS made this impossible. The files were instead first placed onto a flash drive and transferred to the HPS from there.


Building OpenCV natively on the HPS from source proved to be a more difficult challenge. The build initially appeared to progress normally, though extremely slowly, but after several hours it failed. The problem was not immediately clear from the error message, but eventually it was determined that the HPS had run out of memory and the build had been terminated: the small amount of RAM available on the HPS could not handle such a complex build. To counter this problem, we created a swap file, a file on the filesystem that essentially serves as additional (though significantly slower) RAM. To meet the recommended memory requirements for building OpenCV, we aimed for a 1 GB swap file. The partition on the SD card holding the HPS filesystem was only 4 GB, so devoting 1 GB to the swap file would leave relatively little space for the rest of our project files. In order to have enough space, we had to expand the partition, and a larger 16 GB SD card was purchased to allow for the extra space. A GParted live disk was created on a flash drive, and GParted, a popular Linux application for managing disk partitions, was used to expand the partition. After expanding the partition, a 1 GB swap file was created, and with it the build completed successfully, taking about 8 hours on the HPS.


Video Display and Image Capture

Displaying video on the VGA monitor was relatively simple thanks to existing video display projects on the ECE 5760 webpage. The Qsys layout and Verilog were based on Bruce Land's project, "VGA display of video input using a bus_master to copy input image". This project used the Video_in_rgb_resampler, Video_in_clipper, Video_in_scaler, and Video_in_DMA modules to drive a VGA screen at 640x480 resolution. The video input was stored in on-chip SRAM. A bus master was used to read from the SRAM and write that data to the SDRAM from which the VGA display was refreshed. Video input was enabled by putting SW1 up. The top-level module contained a simple state machine for reading from the video input and writing a pixel out to the VGA memory, which is summarized in the diagram below. Note that this state machine would only run when SW0 was up, and that KEY0 had to be pressed initially to reset the state machine. The READ PIXEL state sets the x- and y-coordinates to read from the video input, with one pixel handled per pass through the state machine; a request to read the byte corresponding to the color of this input pixel is then issued. The READ ACK state waited for an acknowledgement of the successful read on the bus and then copied the color data. The VGA WRITE state wrote this color to the SDRAM used for the VGA in order to display the pixel from the input. The WRITE ACK state waited for acknowledgement of the write on the bus before returning to the READ PIXEL state to handle the next pixel.


VGA State Machine

The state machine above simply copied the video input to the VGA screen, but in order to process a frame of the video input for character recognition we needed to store an image in a format that OpenCV could handle. To do this we decided to capture and store one frame of the video input as a bitmap image. The bitmap format was chosen because its fairly simple layout made it the easiest to build manually from individual pixels. The color value of an individual pixel could easily be found on the HPS using the video_in_pixel function from Bruce Land's previously mentioned video input project, but determining how to combine these individual pixels into a properly formatted bitmap image proved more challenging. The bitmap file format is well-defined and freely available, but deciphering exactly how to format the headers was challenging. A breakdown of the specifics of the headers by Stefan Hetzl (see References) proved extremely helpful. The values used for the file and info headers of the 320x240 bitmap images captured for this project are summarized in the tables below, based on Hetzl's work.


BITMAPFILEHEADER

| Start Byte | Size (Bytes) | Name | Value Used | Purpose |
|---|---|---|---|---|
| 0 | 2 | bfType | "BM" | Set to "BM" to indicate that this is a bitmap file. |
| 2 | 4 | bfSize | 230454 | Size of the file: 3 bytes for each pixel in the 320x240 image (24-bit color), plus 54 bytes for the headers. |
| 6 | 2 | bfReserved1 | 0 | Must always be set to 0. |
| 8 | 2 | bfReserved2 | 0 | Must always be set to 0. |
| 10 | 4 | bfOffBits | 1078 | Offset from the beginning of the file to the bitmap data. |

BITMAPINFOHEADER

| Start Byte | Size (Bytes) | Name | Value Used | Purpose |
|---|---|---|---|---|
| 14 | 4 | biSize | 40 | Size of the INFOHEADER in bytes. |
| 18 | 4 | biWidth | 320 | Width of the image in pixels. |
| 22 | 4 | biHeight | 240 | Height of the image in pixels. |
| 26 | 2 | biPlanes | 1 | Number of color planes; must be set to 1. |
| 28 | 2 | biBitCount | 24 | Number of bits per pixel; 24 since 24-bit color was used. |
| 30 | 4 | biCompression | 0 | Compression type; 0 for no compression. |
| 34 | 4 | biSizeImage | 0 | Only used with compressed images; set to 0 if the image is uncompressed. |
| 38 | 4 | biXPelsPerMeter | 0 | Horizontal resolution; set to 0. |
| 42 | 4 | biYPelsPerMeter | 0 | Vertical resolution; set to 0. |
| 46 | 4 | biClrUsed | 0 | Number of colors used; 0 lets biBitCount determine the number. |
| 50 | 4 | biClrImportant | 0 | Important colors; 0 makes all colors important. |

After generating the proper headers, creating the bitmap image was fairly straightforward. The red, green, and blue components of each pixel were loaded into consecutive bytes of the image's color array. The headers, padding, and color array could then be written to the file using the fwrite function. One problem with this method was that the images came out upside down, because rows in a bitmap's color array are stored bottom-to-top. By processing and storing the image from the bottom to the top instead of the top to the bottom, the image was stored right-side up. An example of an image captured using this method can be seen below. A simple program for capturing a single image, still_cap_v2.c, can be found in the Code Appendix.

VGA Screenshot
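To make the capture process concrete, below is a minimal sketch of a bitmap writer under the same 320x240, 24-bit assumptions. The video_in_pixel function here is only a placeholder for the project's pixel-read function, and this sketch uses the common layout in which the pixel data immediately follows the 54-byte headers:

```cpp
// Minimal sketch: write a 320x240, 24-bit BMP from an RGB pixel source.
#include <cstdio>
#include <cstdint>

// Placeholder: replace with the project's function that reads a pixel as 0x00RRGGBB.
static uint32_t video_in_pixel(int x, int y) { return (uint32_t)((x ^ y) & 0xFF); }

static void put_u16(uint8_t *p, uint16_t v) { p[0] = (uint8_t)v; p[1] = (uint8_t)(v >> 8); }
static void put_u32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;         p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}

int main() {
    const int W = 320, H = 240;
    const uint32_t dataSize = W * H * 3;   // rows of 960 bytes need no row padding

    uint8_t hdr[54] = {0};
    hdr[0] = 'B'; hdr[1] = 'M';            // bfType
    put_u32(hdr + 2, 54 + dataSize);       // bfSize
    put_u32(hdr + 10, 54);                 // bfOffBits: data right after the headers
    put_u32(hdr + 14, 40);                 // biSize
    put_u32(hdr + 18, (uint32_t)W);        // biWidth
    put_u32(hdr + 22, (uint32_t)H);        // biHeight
    put_u16(hdr + 26, 1);                  // biPlanes
    put_u16(hdr + 28, 24);                 // biBitCount

    FILE *fp = std::fopen("capture.bmp", "wb");
    if (!fp) return 1;
    std::fwrite(hdr, 1, sizeof(hdr), fp);
    // Bitmap rows are stored bottom-to-top, so walk the frame from the last row up.
    for (int y = H - 1; y >= 0; --y) {
        for (int x = 0; x < W; ++x) {
            uint32_t rgb = video_in_pixel(x, y);
            uint8_t bgr[3] = { (uint8_t)rgb, (uint8_t)(rgb >> 8), (uint8_t)(rgb >> 16) };
            std::fwrite(bgr, 1, 3, fp);    // bitmap pixels are stored B, G, R
        }
    }
    std::fclose(fp);
    return 0;
}
```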

Detecting a Single Character

Detection was approached in stages, and the first stage was to successfully identify individual characters. This proved to be the most difficult stage of detection, with multi-letter and word detection coming naturally out of single character detection.


Two different approaches were considered for single character detection: training a cascade classifier for each letter, and template matching. A cascade classifier file specifies a series of features that should be present in an image if the desired object is in the image; the classifier file can then be run against an image to detect objects. Training a cascade classifier involves creating a training set of positive and negative samples for the object you wish to identify: positive samples contain the object and negative samples do not. OpenCV provides relatively easy access to tools for generating either Haar or LBP classifier files from a set of user-provided samples, but the tools require significant configuration to perform well for each application. Effective training also requires quite a large set of positive and negative images, ideally hundreds or even thousands of each. Generating this many images proved difficult and time-consuming, and actually running the tools to produce a cascade classifier was extremely time-consuming as well. A Raspberry Pi was used to run the classifier tools in order to speed up generation, but even on that system, generating a Haar classifier file with only 50 positive and 100 negative samples required 30 minutes. Classifier files generated from these small sample sets were tested, but the results were problematic: the classifier was able to detect the presence of letters, but it struggled to distinguish between them, often misidentifying one letter as another. Larger training sets were necessary to correctly identify the letters, but the training time increased rapidly with the number of samples, and training with the thousands of samples required for an accurate classifier was projected to take days or even weeks per letter. Taking that much time to train each letter was far too long, so the cascade classifier approach was abandoned.
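For illustration, the sketch below shows how a trained classifier file can be run against a captured image using OpenCV's CascadeClassifier. It assumes an OpenCV 3-style C++ API, and the classifier file name cascade_A.xml is hypothetical:

```cpp
// Minimal sketch: run a trained cascade classifier file against an image.
#include <opencv2/opencv.hpp>
#include <cstdio>
#include <vector>

int main() {
    cv::CascadeClassifier letterA;
    if (!letterA.load("cascade_A.xml")) return 1;   // hypothetical trained classifier for 'A'

    cv::Mat image = cv::imread("capture.bmp", cv::IMREAD_GRAYSCALE);
    if (image.empty()) return 1;

    std::vector<cv::Rect> hits;
    letterA.detectMultiScale(image, hits);          // each Rect is a candidate 'A'
    std::printf("found %zu candidate regions\n", hits.size());
    return 0;
}
```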


After abandoning the classifier approach, we focused on template matching. This is a simpler approach that compares existing template images against an input image to determine how closely each template matches each region of the image. Each template is slid across the entire image and, depending on the exact matching mode, the quality of the match at each location is calculated and recorded as a correlation coefficient.


At a high level, letters were detected by creating 26 templates, one per letter, and comparing each template to the entire input image. Every template was moved across the image, calculating the correlation coefficient at each location. The maximum correlation coefficient for each letter was recorded, and these maximums were compared. The letter with the greatest correlation coefficient was the best match to the input image, and so the system reported that letter as present.


The first step was to generate the 26 templates. Pictures of each letter were taken in the lab where the demonstration would take place, so that the conditions in the templates would match the conditions in the input image as closely as possible. Each captured image was cropped down to a template containing only the letter of interest, and each template was stored in a separate templates directory for later use. An example of a template can be seen in the image below.


Template for A

The templates could then be used to actually perform the template matching. This was made much simpler by leveraging existing OpenCV functions intended for template matching. Especially useful was the aptly named matchTemplate function, which generates a matrix of correlation coefficients for a given image, template, and matching method. These correlation coefficients could be parsed to determine the maximum correlation coefficient value and its location for each letter; comparing these maximums across letters allowed the best-matching letter to be determined.
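A minimal sketch of this search, assuming an OpenCV 3-style API and template files named A.bmp through Z.bmp in the templates directory (the exact file names are assumptions), might look like the following:

```cpp
// Minimal sketch: compare all 26 letter templates against the captured image
// and report the best-matching letter.
#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    cv::Mat image = cv::imread("capture.bmp", cv::IMREAD_GRAYSCALE);
    if (image.empty()) return 1;

    double bestVal = -1.0;
    char bestLetter = '?';
    cv::Point bestLoc;

    for (char c = 'A'; c <= 'Z'; ++c) {
        char path[64];
        std::snprintf(path, sizeof(path), "templates/%c.bmp", c);
        cv::Mat tmpl = cv::imread(path, cv::IMREAD_GRAYSCALE);
        if (tmpl.empty()) continue;

        cv::Mat result;  // one correlation coefficient per template position
        cv::matchTemplate(image, tmpl, result, cv::TM_CCOEFF_NORMED);

        double maxVal;
        cv::Point maxLoc;
        cv::minMaxLoc(result, nullptr, &maxVal, nullptr, &maxLoc);
        if (maxVal > bestVal) { bestVal = maxVal; bestLetter = c; bestLoc = maxLoc; }
    }
    std::printf("best match: %c at (%d, %d), score %.3f\n",
                bestLetter, bestLoc.x, bestLoc.y, bestVal);
    return 0;
}
```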


There were initial problems using this method, with about 25% of letters being misidentified. Eventually this problem was attributed to the matching method used. We were initially using the square-difference matching method. Switching to the more complex correlation-coefficient normalized method reduced misidentifications to only a fraction of a percent of all tests, though the more complex calculation increased the time needed to compute the correlation coefficients. The formulas used to calculate the correlation coefficient R at each location (x,y), using template T on image I, for both the square-difference and the correlation-coefficient normalized methods are reproduced below from the OpenCV template matching documentation. The first equation is the square-difference method and the second is the correlation-coefficient normalized method.


Square-Difference Matching Method:

$$R(x,y) = \sum_{x',y'} \left( T(x',y') - I(x+x',\, y+y') \right)^2$$

Correlation-Coefficient Normalized Matching Method:

$$R(x,y) = \frac{\sum_{x',y'} T'(x',y') \cdot I'(x+x',\, y+y')}{\sqrt{\sum_{x',y'} T'(x',y')^2 \cdot \sum_{x',y'} I'(x+x',\, y+y')^2}}$$

where $T'(x',y') = T(x',y') - \frac{1}{w \cdot h}\sum_{x'',y''} T(x'',y'')$ and $I'(x+x',y+y') = I(x+x',y+y') - \frac{1}{w \cdot h}\sum_{x'',y''} I(x+x'',y+y'')$, with $w$ and $h$ the template width and height.

Detecting Multiple Characters and Words

After achieving a high success rate with single character detection, the next step was reading multiple characters and words. An early problem was difficulty determining the number of letters present using template matching. It was initially believed that the number of letters could be determined by counting the correlation coefficients that exceeded a minimum threshold. Unfortunately, some letters tended to produce relatively high correlation coefficients not only for themselves but for similar letters as well. For example, an O would generate a high correlation coefficient for O, but also for Q. This could lead the system to identify more letters than were actually present. Template matching proved ill-suited to determining the number of letters, so the system instead begins by prompting the user for the number of letters present. The system then attempts to read that number of letters, identifying them in descending order of correlation coefficient.


This method still had problems, however, which were attributed to falsely high correlation coefficients caused by similarities between letters. As previously mentioned, having an O in the image would also generate a high correlation coefficient for Q due to the similarity of their shapes. When identifying multiple letters, this could lead to a similar letter being falsely identified after the original letter had been found. To address this issue, the input image was modified after the detection of each letter: a white rectangle was placed over the letter that was just identified, preventing that letter from generating any future false positives. The template matching was then run again on the newly modified image to identify the next letter. This method proved highly successful, allowing multiple letters to be read with a high success rate. An example of an image put through this process, after identifying all 3 letters in the picture, can be seen below.


Input Modified to Block Letters
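A minimal sketch of this masking step is given below, reusing a bestLoc and template from a matchTemplate pass like the one shown earlier:

```cpp
// Minimal sketch: paint a white rectangle over the region that was just
// matched so the same letter cannot be matched again on the next pass.
#include <opencv2/opencv.hpp>

void maskDetectedLetter(cv::Mat &image, cv::Point bestLoc, const cv::Mat &tmpl) {
    cv::rectangle(image,
                  bestLoc,                                            // top-left of the match
                  cv::Point(bestLoc.x + tmpl.cols, bestLoc.y + tmpl.rows),
                  cv::Scalar(255, 255, 255),                          // white, matching the paper
                  cv::FILLED);
    // The next call to cv::matchTemplate then runs on this modified image.
}
```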

For any combination of letters, each individual letter would be read off. A few select words were also added to the system so that the word itself could be read instead of the individual letters. To accomplish this, the final string of detected letters was compared to the list of recorded words before reading out the letters. If the string was found in the list, the recording for that word was played instead of the recordings for the individual letters. The words added to the system were: DOG, CAT, ECE, IS, and FUN.
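A minimal sketch of this dictionary check, using the five recorded words listed above, could look like the following (the playback calls themselves are represented only by comments):

```cpp
// Minimal sketch: decide between whole-word and letter-by-letter playback.
#include <string>

bool isRecordedWord(const std::string &detected) {
    static const std::string words[] = {"DOG", "CAT", "ECE", "IS", "FUN"};
    for (const std::string &w : words)
        if (detected == w) return true;   // play the whole-word recording
    return false;                         // otherwise play letter by letter
}
```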


Reading From Left to Right

The final addition to detection was determining the proper order of the letters. Having the letters in the correct left-to-right order is necessary to properly detect words. Ordering the letters was fairly simple: the identified character and match location from each iteration of character detection were stored in arrays, and once all of the letters were detected, the character array was re-ordered by the x-coordinate of the match location. This re-ordered string had the characters in the proper order and could then be played.
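A minimal sketch of this re-ordering, assuming each detection stored the match's x-coordinate alongside the identified character, is shown below:

```cpp
// Minimal sketch: order detected letters left to right by match x-coordinate.
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

std::string orderLetters(std::vector<std::pair<int, char>> hits) {  // (x, letter) pairs
    std::sort(hits.begin(), hits.end());   // pairs compare by x-coordinate first
    std::string word;
    for (const auto &h : hits) word += h.second;
    return word;                            // e.g. "DOG", ready for the dictionary check
}
```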



Testing

The project was first created and tested as separate modules, which were then integrated and tested together. First the sound playback module was created and tested: all of the letters were recorded and converted to files of samples, and playback of every letter was verified. The video input and image capture were developed and tested next. Numerous pictures were captured in the lab to ensure that the conversion to a bitmap would always succeed and that the captured images would be consistent in quality and appearance. Pictures of letters at various distances were also captured to test that the letters would be distinguishable within the image.


Testing then turned to letter detection. Black letters printed in a 500-point Calibri font were used for all testing. A large amount of time was spent testing the possible use of cascade classifiers for identification. Sample images of the letters A, B, and C were created and used to train the system, and these letters were then tested to ensure that they could be detected and differentiated. This proved unsuccessful, as the letters were commonly mistaken for one another by the system. This testing indicated that properly training a cascade classifier would be prohibitively difficult and time-consuming, which led to the switch to template matching.


For testing template matching, the same 500-point, black, Calibri letters were used. The letters were attached to a wooden board to keep them relatively straight and so that they did not need to be held by a team member. Template matching was initially tested outside of the lab using a Raspberry Pi, but testing quickly moved into the lab so that the conditions in the templates and during testing would better match the eventual demonstration. All 26 letters were tested for individual character recognition. Initial tests had only a 73% success rate at individual character recognition, which was considered unacceptably low; modifying the matching method raised this to nearly 100%. We also tested the system's ability to handle changes in the distance of the letters from the camera. Testing showed that changing the distance reduced the quality of the match, and moving a letter more than 1.5 to 2 feet from its starting position led to consistent detection failures.


A large number of 2- and 3-letter combinations were tested, though it was impossible to test every possible combination. The words that were specifically recorded were each tested multiple times, however. Testing with multiple-character combinations revealed some issues with letters being misidentified when they were near the edge of the screen, but keeping the combination centered in the captured image seemed to eliminate this problem.


Results

Overall, the project was successful. The system was able to recognize individual characters and multi-character combinations with a very high success rate and play the detected letters over the speakers. It was also able to recognize a number of pre-programmed words and read them out with the proper pronunciation instead of letter by letter. Images were successfully captured from the FPGA and stored on the HPS for analysis as intended, and OpenCV was successfully installed on the HPS as we had hoped. Identification was somewhat slow, however, requiring several seconds for a three-letter combination. Despite this, the system ultimately achieved its goal of character and word recognition and playback.

Below is a video which describes our project briefly and shows its functionality:

Conclusion

The project was successful, identifying all the letters of the alphabet and several words. However, there is still significant room for improvement. Though the project functioned properly, it performed more slowly than we had hoped, taking several seconds to identify three-letter words. All of the template matching is currently done on the HPS, which results in relatively slow performance; transferring some or all of the processing to parallel hardware on the FPGA could lead to significant performance improvements. Another, possibly simpler optimization would be to switch from floating-point to fixed-point arithmetic on the HPS. A further useful improvement would be to remove the need for the user to enter the number of letters to be identified. Though template matching does not seem promising for determining the number of characters present, a cascade classifier could possibly be used in conjunction with template matching to overcome this issue. The cascade classifiers tested in this project struggled to distinguish between letters, but they did show promise in detecting the presence of a letter. By first using a cascade classifier to count the letters and then using template matching to identify them, the system could possibly be improved; with sufficient time to train the classifier, this extension could be quite feasible. With even more time, the entire system could be moved to well-trained cascade classifiers alone, which could make the system more resilient to changes in scale and angle, assuming sufficiently diverse sample images are provided for training.

Code Appendix

All code can be found at the GitHub page for the project. To help sort through the files, the table below lists the most important finalized files (not including executables and text files) and their purposes.

| Filename | Purpose |
|---|---|
| withWords.cpp | Contains the template matching code for detecting multiple characters, as well as the function that opens the audio text file and transfers the samples to the common buffer. |
| bufferToFIFO.c | Detects whether any samples have been transferred to the buffer and puts those samples into the audio FIFO. |
| audioToBuffer.c | Transfers samples to the buffer to be played over audio. |
| DE1_SoC_Computer.v | Top-level Verilog module; contains the VGA state machine. |

Division of Work

Fred: Handled video input, created code for grabbing bitmap images, helped install OpenCV, worked on cascade classifiers and template matching.


Amardeep: Created all code for sound playback and recorded all sound files that were used. Worked on template matching.


Rubaiyeth: Helped install OpenCV, worked on template matching.

References and Acknowledgements

Template Matching With OpenCV

OpenCV Cascade Classifier Documentation

VGA display of video input using a bus_master to copy input image

The .bmp File Format

Thanks to Professor Land and all the course staff for all their help with the project and throughout this year; it's been a great class!

Appendix-A

The group approves this report for inclusion on the course website.


The group approves the video for inclusion on the course youtube channel.
