This is a cross-post; the original post can be found here.
Tesseract is an optical character recognition (OCR) engine originally developed by Hewlett-Packard. In 2005 it was open-sourced under the Apache license, and its development is now supported by Google. Version 3.0 was released in September 2010; among other things, it adds support for the Polish language.
The wiki on the Tesseract website is a bit messy, which is why I decided to describe my experience building and installing Tesseract 3.0. I was working on Ubuntu 10.10 Server Edition, deployed on a virtual machine created with Oracle VirtualBox.
First, I installed build-essential and autoconf (you may also need to install libtool):
sudo apt-get install build-essential
sudo apt-get install autoconf
The next step, according to the Tesseract wiki, is to install the dependencies:
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlib1g-dev
Please note that the name of the zlib1g-dev package is misspelled in the wiki.
I downloaded the sources of Leptonica 1.6.7 from its Google Code website and then followed the rather standard build process (you may also try installing the libleptonica-dev package instead):
./configure
make
sudo make install
sudo ldconfig
The next step was to download tesseract-3.00.tar.gz from the Tesseract project website. Uncompress the archive, go to the tesseract-3.0 directory and invoke:
./runautoconf
./configure
After invoking ./configure, check config_auto.h to see whether the script recognized the dependencies correctly. The header file should contain a #define for each of HAVE_LIBLEPT, HAVE_LIBPNG, HAVE_LIBTIFF, HAVE_LIBJPEG and HAVE_ZLIB.
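Checking the header by eye is easy to get wrong, so here is a small sketch that automates it. The function name check_config_header is my own (hypothetical), and it assumes the usual autoconf convention in which a detected feature appears as an active "#define NAME" line while a missing one is left as a commented-out "#undef":

```shell
# Print every expected HAVE_* macro that is NOT defined in the given header.
# Returns 0 if all macros are defined, 1 if any are missing, 2 if the file is unreadable.
# check_config_header is a hypothetical helper name, not part of Tesseract.
check_config_header() {
    header="${1:-config_auto.h}"
    [ -r "$header" ] || { echo "cannot read $header" >&2; return 2; }
    missing=0
    for macro in HAVE_LIBLEPT HAVE_LIBPNG HAVE_LIBTIFF HAVE_LIBJPEG HAVE_ZLIB; do
        # Only an active "#define MACRO" line counts; "/* #undef MACRO */" does not match.
        if ! grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+${macro}([[:space:]]|\$)" "$header"; then
            echo "$macro"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it in the Tesseract source directory as check_config_header config_auto.h; no output means all five dependencies were picked up.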
make
sudo make install
sudo ldconfig
Without ldconfig, Tesseract may fail to start because the dynamic linker cannot find the newly installed shared libraries.
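If you want to confirm that the linker cache really knows about the new libraries, something like the following sketch may help. The helper name lib_in_cache is hypothetical; it reads a listing in the format printed by ldconfig -p from standard input and looks for a library whose name starts with the given prefix:

```shell
# Succeed if the linker cache listing on stdin contains PREFIX.so...
# lib_in_cache is a hypothetical helper name; the input is expected to be
# the output of "ldconfig -p" (one tab-indented library per line).
lib_in_cache() {
    grep -q "^[[:space:]]*$1\.so"
}
```

For example, ldconfig -p | lib_in_cache liblept && echo "Leptonica is registered" should print the message once sudo ldconfig has been run after installing Leptonica.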
Download the language data of your choice from the Tesseract website, uncompress it, and place it in your tessdata folder (by default /usr/local/share/tessdata).
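To see which language codes you can then pass to the -l option, you can list the tessdata folder. The sketch below assumes that each language is stored as a single CODE.traineddata file, which is the layout used by Tesseract 3.0; list_languages is a hypothetical helper name:

```shell
# Print the language codes available in a tessdata directory, one per line.
# list_languages is a hypothetical helper; the default path is the one
# mentioned above and may differ on your system.
list_languages() {
    dir="${1:-/usr/local/share/tessdata}"
    for f in "$dir"/*.traineddata; do
        [ -e "$f" ] || continue        # no matches: the glob stays literal, skip it
        basename "$f" .traineddata     # eng.traineddata -> eng
    done
}
```

Calling list_languages with no argument inspects the default folder; pass another path if you installed Tesseract under a different prefix.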
Now run the OCR. Note that the second argument is an output base name, to which Tesseract appends .txt itself:
tesseract phototest.tiff out -l eng
more out.txt
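For more than one image, a loop saves some typing. This is only a sketch: ocr_dir and the TESSERACT override variable are my own hypothetical names, and it assumes your images use the .tiff extension. Because Tesseract adds .txt to the output base name on its own, the base passed here is simply the image path without its extension:

```shell
# OCR every .tiff file in a directory; the text for NAME.tiff lands in NAME.txt.
# ocr_dir and TESSERACT are hypothetical names. Setting TESSERACT=echo gives
# a dry run that only prints the arguments that would be passed.
ocr_dir() {
    dir="$1"
    lang="${2:-eng}"
    for img in "$dir"/*.tiff; do
        [ -e "$img" ] || continue      # skip when the directory has no .tiff files
        # Second argument is the output base; Tesseract appends ".txt" to it.
        "${TESSERACT:-tesseract}" "$img" "${img%.tiff}" -l "$lang"
    done
}
```

For instance, ocr_dir ./scans pol would process every .tiff file in ./scans with the Polish language data.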
Hope that this was helpful.
Update (19th of October 2011):
I was trying to compile revision 627 of Tesseract on Ubuntu 11.04. After compiling Leptonica and invoking ./configure for the Tesseract source code, I was still getting a “leptonica library missing” error. Everything went smoothly after adding these two lines at the beginning of the configure file:
CPPFLAGS="-I/usr/local/include"
LDFLAGS="-L/usr/local/lib"
This solution was found here – thank you.
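An alternative I have not tested on that exact revision, but which is standard for autoconf-generated scripts, is to pass the same two variables through the environment instead of editing the configure file. The sketch below assumes Leptonica was installed under /usr/local (the default for make install); configure_with_local_leptonica, LEPT_PREFIX and CONFIGURE are hypothetical names introduced only for illustration:

```shell
# Run ./configure with include and library paths pointing at a local prefix.
# All names here are hypothetical: LEPT_PREFIX overrides the prefix and
# CONFIGURE overrides the script to run (handy for testing).
configure_with_local_leptonica() {
    prefix="${LEPT_PREFIX:-/usr/local}"
    # Assignments placed before a command are exported to that command only.
    CPPFLAGS="-I$prefix/include" LDFLAGS="-L$prefix/lib" "${CONFIGURE:-./configure}" "$@"
}
```

Without the wrapper, the equivalent one-liner in the Tesseract source directory is CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure.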