Tesseract OCR recognition optimization

I have been trying to optimize Tesseract OCR recognition.

But for some reason, even though the images are optimized, they sometimes do not produce the desired output.

Here for example are some of the results I have. So far the algorithm I have is:

  • Deskew the image using Tesseract.
  • Convert the image to grayscale using Tesseract.
  • Otsu’s Binarization.
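For reference, Otsu's method picks the gray level that maximizes the between-class variance of the grayscale histogram. The following is only a minimal pure-Java sketch of that idea, not the code I actually run (that uses OpenCV and is shown further down). The class name OtsuSketch and the method otsuThreshold are made up for illustration, and the input is assumed to be a 256-bin histogram of 8-bit gray values.

```java
final class OtsuSketch {

    /**
     * Returns a threshold t in [0, 255] that maximizes the between-class
     * variance, treating gray levels <= t as one class and > t as the other.
     * Assumes histogram has exactly 256 bins (hypothetical helper).
     */
    static int otsuThreshold(int[] histogram) {
        long total = 0;
        double sumAll = 0;
        for (int i = 0; i < 256; i++) {
            total += histogram[i];
            sumAll += (double) i * histogram[i];
        }

        long weightLow = 0;      // number of pixels with level <= t
        double sumLow = 0;       // sum of gray levels of those pixels
        double bestVariance = -1;
        int bestThreshold = 0;

        for (int t = 0; t < 256; t++) {
            weightLow += histogram[t];
            sumLow += (double) t * histogram[t];
            if (weightLow == 0) {
                continue;        // lower class still empty
            }
            long weightHigh = total - weightLow;
            if (weightHigh == 0) {
                break;           // upper class empty from here on
            }
            double meanLow = sumLow / weightLow;
            double meanHigh = (sumAll - sumLow) / weightHigh;
            double diff = meanLow - meanHigh;
            // Proportional to the between-class variance (constant factor dropped).
            double variance = (double) weightLow * weightHigh * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }
}
```

With a clearly bimodal histogram (dark text, light paper) the returned threshold falls between the two peaks; with uneven lighting a single global threshold like this can misclassify whole regions.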

Possible addons to the algorithm:

  • Increase size of the image.
  • Remove shadows and highlights.
  • Crop borders.
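Two of these addons (increasing the size and cropping the borders) could be sketched roughly as below. This is only an illustration using the standard java.awt classes rather than OpenCV. The class PreprocessSketch, its method names, the scale factor and the fixed margin are all assumptions, not something I have tuned or tested against Tesseract.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class PreprocessSketch {

    /** Scales the image by the given factor using bicubic interpolation (hypothetical helper). */
    static BufferedImage upscale(BufferedImage src, double factor) {
        int w = (int) Math.round(src.getWidth() * factor);
        int h = (int) Math.round(src.getHeight() * factor);
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }

    /** Removes a fixed margin (in pixels) from every side of the image (hypothetical helper). */
    static BufferedImage cropBorder(BufferedImage src, int margin) {
        int w = src.getWidth() - 2 * margin;
        int h = src.getHeight() - 2 * margin;
        if (margin < 0 || w <= 0 || h <= 0) {
            throw new IllegalArgumentException("margin does not fit the image");
        }
        // getSubimage returns a view that shares pixel data with src.
        return src.getSubimage(margin, margin, w, h);
    }
}
```

A fixed margin is the crudest possible border crop; detecting the actual page edge would need more work.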

I have noticed, though, that the OCR accuracy is OK but not great. If I edit the picture on my iPhone and all I do is remove the color, then I get very good results.

Book Column

Poetry

The thing is, to the eye the picture created by the algorithm above seems better than the ones coming from the iPhone, and still the results are worse. This is strange.

Any idea on how I can improve the algorithm?

This is the code I am using for the binarization:

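// Decodes the image as grayscale, blurs it to suppress noise, then applies Otsu's threshold.
public InputStream binarize(InputStream fileInputStream) throws Exception {
    // Load the OpenCV native library before any OpenCV call.
    OpenCV.loadShared();
    byte[] bytes = IOUtils.toByteArray(fileInputStream);
    // IMREAD_GRAYSCALE is the current name of the legacy CV_LOAD_IMAGE_GRAYSCALE flag.
    Mat gray = Imgcodecs.imdecode(new MatOfByte(bytes), Imgcodecs.IMREAD_GRAYSCALE);
    Mat grayWithGauss = new Mat();
    Mat finalResult = new Mat();

    // 5x5 Gaussian blur; a sigma of 0 lets OpenCV derive it from the kernel size.
    Imgproc.GaussianBlur(gray, grayWithGauss, new Size(5, 5), 0);
    // With THRESH_OTSU the threshold argument (0) is ignored and computed from the histogram.
    Imgproc.threshold(grayWithGauss, finalResult, 0, 255, Imgproc.THRESH_BINARY + Imgproc.THRESH_OTSU);

    return IOHelper.Mat2InputStream(finalResult);
}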

Thanks
