Tessnet2, a .NET 2.0 Open Source OCR assembly using the Tesseract engine
http://www.pixel-technology.com/freeware/tessnet2/
Keywords: Open source, OCR, Tesseract, .NET, DOTNET, C#, VB.NET, C++/CLI
Current version: 2.04.0, 02SEP09 (see version history)
The big picture
Tesseract is an open source OCR engine written in C++. Tessnet2 is a .NET assembly that exposes very simple methods to do OCR.
Tessnet2 is multithreaded. It uses the engine the same way Tesseract.exe does. Tessdll uses another method (no thresholding).
License
Tessnet2 is released under the Apache 2 license (like Tesseract), meaning you can use it however you want, including in commercial products. The full license information is in the source code.
Quick Tessnet2 usage
- Download the binary here and add a reference to the Tessnet2.dll assembly in your .NET project.
- Download the language data definition file here and put it in a tessdata directory. The tessdata directory and your exe must be in the same directory.
- Look at the Program.cs sample.
Note: Tessnet2.dll needs the Visual C++ 2008 runtime. When deploying your application, be sure to install the C++ runtime (x86, x64).
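The directory layout described above can be checked at startup before any OCR call is made. The following is only a sketch using standard .NET file APIs: the class and method names are hypothetical and not part of Tessnet2, and it assumes the language data files are named with the language code as a prefix (for example eng.inttemp).

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Tessnet2.
static class TessdataCheck
{
    // Path where the tessdata directory is expected: next to the running exe.
    public static string ExpectedPath()
    {
        return Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "tessdata");
    }

    // True when the directory exists and holds at least one file whose name
    // starts with "<lang>." (assumed naming convention for language data).
    public static bool LooksInstalled(string tessdataPath, string lang)
    {
        if (!Directory.Exists(tessdataPath))
            return false;
        return Directory.GetFiles(tessdataPath, lang + ".*").Length > 0;
    }
}
```

Calling TessdataCheck.LooksInstalled(TessdataCheck.ExpectedPath(), "eng") at startup lets the application report a missing language file with a clear message instead of failing inside the OCR engine.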
Tessnet2 usage
using System;
using System.Collections.Generic;
using System.Drawing;

Bitmap image = new Bitmap("eurotext.tif");
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // Only if the image contains digits only
ocr.Init(@"c:\temp", "fra", false); // Point to the correct tessdata location and language
List<tessnet2.Word> result = ocr.DoOCR(image, Rectangle.Empty);
foreach (tessnet2.Word word in result)
    Console.WriteLine("{0} : {1}", word.Confidence, word.Text);
Tessnet2 source code and recompiling
- Download the Tesseract source code here and extract it into a directory.
- Download the Tessnet2 source code here and extract it into the Tesseract source code root directory (this should create a dotnet subdirectory).
- Open the solution tessnet2.sln. It is a Visual Studio 2008 C++/CLI project.
Memory leak
The Tesseract C++ source code is full of memory leaks. Using the tessnet2 assembly many times in the same process will make memory usage grow until the process runs out of memory. This is not a tessnet2 leak but a Tesseract leak; I spent two days in the Tesseract source code trying to improve this, with no success. See what I think about this.
Tessnet2 demo
The Tessnet2 source code includes two C# demo projects. TesseractOCR is a multithreaded WinForms demo with a progress bar. TesseractConsole is a console demo.
The confidence score is shown in brackets. A value below 160 means the result is not bad.
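The 160 threshold can be applied in code to drop unreliable words. This is a minimal sketch, not part of the demos: OcrWord is a stand-in type that mirrors only the Confidence and Text members used in the usage sample, so the snippet compiles without the Tessnet2 assembly, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for tessnet2.Word (assumption: only Confidence and Text are needed).
class OcrWord
{
    public double Confidence;
    public string Text;

    public OcrWord(double confidence, string text)
    {
        Confidence = confidence;
        Text = text;
    }
}

static class ConfidenceFilter
{
    // Lower is better: 0 = perfect, 255 = reject; below 160 is "not bad".
    public const double RejectThreshold = 160.0;

    // Returns the words whose confidence is strictly below the threshold.
    public static List<OcrWord> KeepReliable(IEnumerable<OcrWord> words, double threshold)
    {
        List<OcrWord> kept = new List<OcrWord>();
        foreach (OcrWord word in words)
        {
            if (word.Confidence < threshold)
                kept.Add(word);
        }
        return kept;
    }
}
```

With the real assembly, the same loop would run over the list returned by DoOCR, comparing word.Confidence against the threshold.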
Version history
07JUN08: First release on Tesseract 2.03
10JUN08: Version 2.03.1. Changed the Confidence behavior: it is now calculated from every letter of the word rather than from the first letter only. Its type changed from byte to double. 0 = perfect, 100 = reject.
13JUN08: Version 2.03.2
After 3 days in the Tesseract code (urgh), here is Tessnet2 version 2.03.2.
The corrections deal with the following problems:
* Confidence was not very useful; the value was strange. This has been corrected by setting the variable tessedit_write_ratings=true. After many tests I found this mode gives the best confidence accuracy. Values range from 0 (perfect) to 255 (reject). A value over 160 really means the OCR was bad.
* Calling DoOCR twice did not give the same result. It was, as expected, a problem with global variables. The problem is almost fixed; sometimes it still doesn't work, but right now I can't find what is not correctly reinitialized.