stay dry

This is a cross-post; the original post can be found here.

Tesseract is an optical character recognition (OCR) engine originally developed by Hewlett-Packard; in 2005 it was open-sourced under the Apache license. Its development is now supported by Google. Version 3.0 was released in September 2010; among other things, this version adds support for the Polish language.

The wiki on the Tesseract website is a bit messy, which is why I decided to describe my experience building and installing Tesseract 3.0. I was working on Ubuntu 10.10 Server Edition, deployed on a virtual machine created with Oracle VirtualBox.

First, I installed build-essential and autoconf (you may also need to install libtool):

sudo apt-get install build-essential
sudo apt-get install autoconf

The next step, according to the Tesseract wiki, is to install the dependencies:

sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlib1g-dev

Please note that the name of the zlib1g-dev package is misspelled in the wiki.

I downloaded the sources of Leptonica 1.6.7 from its Google Code website and then followed a rather standard build process (you may also try installing the libleptonica-dev package instead):

./configure
make
sudo make install
sudo ldconfig

The next step was to download tesseract-3.00.tar.gz from the Tesseract project website. Uncompress the archive, go to the tesseract-3.0 directory and invoke:

./runautoconf
./configure

After invoking ./configure, check config_auto.h to see whether the dependencies were recognized correctly by the ./configure script. The header file should contain a #define for each of HAVE_LIBLEPT, HAVE_LIBPNG, HAVE_LIBTIFF, HAVE_LIBJPEG and HAVE_ZLIB.
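The check of config_auto.h can also be scripted. The following is only a minimal sketch: the function name missing_config_macros is my own invention and not part of the Tesseract build system, and it assumes the macros appear as plain #define lines.

```shell
# Print every expected HAVE_* macro that is NOT defined in the given header.
# Hypothetical helper, not part of the Tesseract build system.
missing_config_macros() {
    header="$1"
    for macro in HAVE_LIBLEPT HAVE_LIBPNG HAVE_LIBTIFF HAVE_LIBJPEG HAVE_ZLIB; do
        # Match "#define MACRO" followed by whitespace or end of line, so that
        # commented-out "/* #undef MACRO */" lines do not count as defined.
        if ! grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+${macro}([[:space:]]|\$)" "$header"; then
            echo "$macro"
        fi
    done
}

# Usage (after ./configure); no output means all five macros were found:
# missing_config_macros config_auto.h
```

If the function prints a macro name, the corresponding library was most likely not detected and it is worth fixing that before running make.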

make
sudo make install
sudo ldconfig

Without ldconfig you might experience problems launching Tesseract.

Download the languages of your choice from the Tesseract website, uncompress them, and place them in your tessdata folder (by default /usr/local/share/tessdata).
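Before running the OCR it may be worth confirming that the language file really ended up in the tessdata folder. A small sketch, assuming the language data is stored as <lang>.traineddata; the helper name has_language is hypothetical:

```shell
# Succeed if <lang>.traineddata exists in the tessdata directory.
# The second argument is optional and defaults to the path mentioned above.
# has_language is a hypothetical helper name, not a Tesseract command.
has_language() {
    lang="$1"
    dir="${2:-/usr/local/share/tessdata}"
    [ -f "$dir/$lang.traineddata" ]
}

# Usage:
# has_language pol || echo "pol.traineddata is missing"
```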

Now run the OCR. Tesseract appends .txt to the output base name itself, so pass out rather than out.txt:

tesseract phototest.tiff out -l eng
more out.txt
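For more than one image, the same call can be wrapped in a loop. This is only a sketch under a few assumptions: the function name ocr_images and the TESSERACT override variable are my own, every file name is assumed to have an extension, and Tesseract is assumed to add .txt to the output base it is given.

```shell
# OCR every given image; the text for page.tiff is expected in page.txt.
# ocr_images and the TESSERACT variable are hypothetical names; TESSERACT
# only lets you point the loop at a different binary.
ocr_images() {
    lang="$1"
    shift
    for image in "$@"; do
        # Strip the extension to get the output base: page.tiff -> page.
        # Assumes each file name has an extension.
        base="${image%.*}"
        "${TESSERACT:-tesseract}" "$image" "$base" -l "$lang" || return 1
    done
}

# Usage:
# ocr_images eng phototest.tiff another.tiff
```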

Hope that this was helpful.

Update (19th of October 2011):
I was trying to compile revision 627 of Tesseract on my Ubuntu 11.04. After compiling Leptonica and invoking ./configure for the Tesseract source code, I was still getting a “leptonica library missing” error. Everything went smoothly after adding these two lines at the beginning of the configure file.

CPPFLAGS="-I/usr/local/include" 
LDFLAGS="-L/usr/local/lib"

This solution was found here – thank you.
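If you hit the same error, it may help to first confirm that Leptonica really was installed under /usr/local. A rough sketch: leptonica_installed is a hypothetical helper, and it only looks for the allheaders.h header and a liblept* library file below the given prefix.

```shell
# Succeed if a Leptonica header and library are found below the prefix.
# leptonica_installed is a hypothetical helper name. The exact include
# subdirectory is not assumed; the header is searched for by name.
leptonica_installed() {
    prefix="${1:-/usr/local}"
    header=$(find "$prefix/include" -name allheaders.h 2>/dev/null | head -n 1)
    library=$(find "$prefix/lib" -name 'liblept*' 2>/dev/null | head -n 1)
    [ -n "$header" ] && [ -n "$library" ]
}

# Usage:
# leptonica_installed /usr/local || echo "Leptonica not found under /usr/local"
```

If both files are present and ./configure still fails, the include and library paths are the likely culprit, which is what the two lines above address.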


Maciej

Is it text only or do they support page layouts, formatting, etc?
adudczak

According to the release notes of 3.0, there is a page layout analysis module (http://code.google.com/p/tesseract-ocr/wiki/ReleaseNotes).

Tesseract can output the hOCR format, which can hold formatting, confidence, bounding box and layout information. In fact, this is simple XHTML.

At the moment I am playing with Ocropus (http://code.google.com/p/ocropus/); you may also try that one.
derhecht

Debian 6 (Squeeze) also works in a similar way, but pay attention to http://code.google.com/p/tesseract-ocr/issues/detail?id=563&q=Debian%206 (as remarked here at the bottom).
Nicky Gurbani

“leptonica library missing” error

CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure
A. Dudczak

A week or so ago I had to compile Tesseract, and I must say that the howto on the project website is quite accurate now.

Tesseract 3.0 installation on Ubuntu 10.10 server
