Note: I downloaded the source from Google Code and built LIB_Debug first, then DLL_Debug, so I copied
tesseract-dlld.exe, unicharset_extractord.exe, mftrainingd.exe, cntrainingd.exe, and combine_tessdatad.exe directly from E:\BuildFolder\tesseract-ocr\vs2008\LIB_Debug into E:\BuildFolder\tesseract-ocr\testing.
The steps are:
1.1. Make Box Files
E:\BuildFolder\tesseract-ocr\testing>tesseract-dlld ABC.Roman.exp0.tif ABC.Roman.exp0 -l eng batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica
1.2. Fix Box
Edit the box file with CowBoxer; read its help first.
1.3. Run Tesseract for Training
E:\BuildFolder\tesseract-ocr\testing>tesseract-dlld ABC.Roman.exp0.tif ABC.Roman.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 14
Found 14 good blobs.
TRAINING ... Font name = Roman
Generated training data for 2 words
1.4. Compute the Character Set
E:\BuildFolder\tesseract-ocr\testing>unicharset_extractord ABC.Roman.exp0.box
Extracting unicharset from ABC.Roman.exp0.box
Wrote unicharset file ./unicharset.
1.5. Clustering
This step first requires creating a file named font_properties.txt, with contents in the following format:
- <fontname> <italic> <bold> <fixed> <serif> <fraktur>
- Roman 0 0 0 0 0
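A minimal sketch of how such a line could be generated, assuming the leading dashes above are list markers rather than part of the file, and that each of the five properties is written as 0 or 1. The function name `font_properties_line` is hypothetical and exists only for this illustration.

```python
def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Build one font_properties line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    # Each flag is written as 0 or 1, separated by single spaces.
    return " ".join([fontname] + [str(int(bool(flag))) for flag in flags])


if __name__ == "__main__":
    # Reproduces the entry used in this walkthrough.
    print(font_properties_line("Roman"))
```

The font name has to match the one embedded in the training file name (Roman in ABC.Roman.exp0.tif).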
E:\BuildFolder\tesseract-ocr\testing>cntrainingd ABC.Roman.exp0.tr
Reading ABC.Roman.exp0.tr ...
Clustering ...
Writing normproto ...
1.6. Combine
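By now several files should have been generated in the directory (inttemp and pffmtable are written by mftrainingd, normproto by cntrainingd). Add the prefix "Roman." to these four files: unicharset, inttemp, normproto, and pffmtable. Then run the command: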
E:\BuildFolder\tesseract-ocr\testing>combine_tessdatad Roman.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 939
Offset for type 4 is 140232
Offset for type 5 is 140335
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 141961
Offset for type 14 is -1
Offset for type 15 is -1
1.7. Test
Copy the generated Roman.traineddata to E:\BuildFolder\tesseract-ocr\testing\tessdata, then run:
tesseract ABC.Roman.exp0.tif result -l Roman -psm 7 nobatch
That's it.
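The sequence above can be collected into a small script. This is only a sketch under stated assumptions: the helper names `training_commands` and `add_prefix` are hypothetical, the command lines are copied from steps 1.1 to 1.7 as shown in this post, and nothing is executed (the commands are only printed). The mftrainingd step is not included because its command line does not appear above.

```python
from pathlib import Path

# The four files that get the "Roman." prefix in step 1.6.
TRAINING_FILES = ("unicharset", "inttemp", "normproto", "pffmtable")


def training_commands(base="ABC.Roman.exp0", prefix="Roman."):
    """Return the command lines from the steps above, in order, as argument lists."""
    tif = base + ".tif"
    lang = prefix.rstrip(".")
    return [
        ["tesseract-dlld", tif, base, "-l", "eng", "batch.nochop", "makebox"],  # 1.1
        ["tesseract-dlld", tif, base, "nobatch", "box.train"],                  # 1.3
        ["unicharset_extractord", base + ".box"],                               # 1.4
        ["cntrainingd", base + ".tr"],                                          # 1.5
        ["combine_tessdatad", prefix],                                          # 1.6
        ["tesseract", tif, "result", "-l", lang, "-psm", "7", "nobatch"],       # 1.7
    ]


def add_prefix(directory, prefix="Roman.", names=TRAINING_FILES):
    """Rename each existing file `name` in `directory` to `prefix + name`."""
    renamed = []
    for name in names:
        src = Path(directory) / name
        if src.exists():  # skip files that have not been generated yet
            src.rename(src.with_name(prefix + name))
            renamed.append(prefix + name)
    return renamed


if __name__ == "__main__":
    for cmd in training_commands():
        print(" ".join(cmd))
```

Calling `add_prefix` on the testing directory before the combine step would take the place of the manual renaming; files that are missing are skipped rather than treated as errors, so the returned list shows which of the four were actually found.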
References: http://blog.wudilabs.org/entry/f25efc5f/?lang=zh-CN
http://www.lixin.me/blog/2012/05/26/29536
http://wenku.baidu.com/view/5eafc201e87101f69e3195f4.html
http://www.84kf.com/html/22453.html
http://blog.csdn.net/fengbingchun/article/details/7022421
--------------------------------------------------------------------------------
The following is reposted from: http://blog.wudilabs.org/entry/f25efc5f/?lang=zh-CN
Programs needed
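(1) (2) (3) (4) (optional)
Use Universal Extractor to unpack the Tesseract installer, then overwrite the original main program with the tesseract.exe from the Bugfix package; Tesseract is then ready to use. CowBoxer is a program for editing box files.
Generating the first box file
In this walkthrough, Tesseract was unpacked to the E:\tesseract-ocr directory. A build directory was then created inside it to hold the source data and the files generated during training. There are 3 source images in total (test.001.tif - test.003.tif):
First, generate the box file for the first image, test.001.tif, using the official eng language data for text recognition: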