Training Language Data for Tesseract 3 [Repost] http://blog.csdn.net/dragoo1/article/details/8439373 - 白红宇

Published: 2019-06-14

2012-12-26 15:42

Note: I downloaded the source code from Google Code and built LIB_Debug first, then DLL_Debug, so I simply copied tesseract-dlld.exe, unicharset_extractord.exe, mftrainingd.exe, cntrainingd.exe, and combine_tessdatad.exe from E:\BuildFolder\tesseract-ocr\vs2008\LIB_Debug into E:\BuildFolder\tesseract-ocr\testing.

The steps are:

1.1. Make Box Files

E:\BuildFolder\tesseract-ocr\testing>tesseract-dlld ABC.Roman.exp0.tif ABC.Roman.exp0 -l eng batch.nochop makebox

Tesseract Open Source OCR Engine v3.02 with Leptonica

1.2. Fix Box

Use CowBoxer to correct the contents of the box file; read its Help first.
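
For reference, a box file is plain text with one line per character: the symbol, then the left, bottom, right, and top pixel coordinates of its bounding box (measured from the bottom-left corner of the image), then the page number. The sketch below parses such lines in Python so that a corrected box file can be sanity-checked. The Box type, the parse_box_line helper, and the sample coordinates are illustrative assumptions, not part of Tesseract or of the original ABC.Roman.exp0.box.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse '<symbol> <left> <bottom> <right> <top> [<page>]'."""
    parts = line.split()
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected box line: {line!r}")
    symbol = parts[0]
    numbers = [int(p) for p in parts[1:]]
    # Assumption: treat a missing page field as page 0.
    page = numbers[4] if len(numbers) == 5 else 0
    left, bottom, right, top = numbers[:4]
    if right < left or top < bottom:
        raise ValueError(f"inverted box: {line!r}")
    return Box(symbol, left, bottom, right, top, page)


# Hypothetical sample lines, made up for illustration.
sample = ["A 10 20 30 50 0", "B 35 20 55 50 0"]
boxes = [parse_box_line(s) for s in sample]
```

A check like this only catches malformed lines and inverted rectangles; whether each symbol matches the glyph in the image still has to be verified by eye in CowBoxer.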

1.3. Run Tesseract for Training

E:\BuildFolder\tesseract-ocr\testing>tesseract-dlld ABC.Roman.exp0.tif ABC.Roman.exp0 nobatch box.train

Tesseract Open Source OCR Engine v3.02 with Leptonica

APPLY_BOXES:

Boxes read from boxfile: 14

Found 14 good blobs.

TRAINING ... Font name = Roman

Generated training data for 2 words
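
The font name reported above (Roman) appears to be taken from the training file name, which in Tesseract 3 follows the pattern [lang].[fontname].exp[num].tif. As a minimal sketch, the Python helper below splits a name of that shape into its parts; the function name split_training_name and the accepted extensions are assumptions made for illustration.

```python
import re

# [lang].[fontname].exp[num].<ext>, for example ABC.Roman.exp0.tif
NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>tif|tiff|box|tr)$"
)


def split_training_name(filename):
    """Return (lang, fontname, exp_number) or raise ValueError."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a [lang].[fontname].exp[num] name: {filename!r}")
    return match.group("lang"), match.group("font"), int(match.group("num"))


parts = split_training_name("ABC.Roman.exp0.tif")
```

The font part returned here is the name that must also appear in font_properties.txt in the clustering step.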

1.4. Compute the Character Set

E:\BuildFolder\tesseract-ocr\testing>unicharset_extractord ABC.Roman.exp0.box

Extracting unicharset from ABC.Roman.exp0.box

Wrote unicharset file ./unicharset.
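
Conceptually, this step gathers the distinct symbols that occur in the box files. The Python sketch below shows only that collection step, on made-up box lines; the real unicharset file also records properties for each character, which this illustration does not attempt to reproduce. The helper name collect_symbols is an assumption.

```python
def collect_symbols(box_lines):
    """Return the distinct symbols from box-file lines, in first-seen order."""
    seen = {}
    for line in box_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        seen.setdefault(parts[0], None)  # dict keys preserve insertion order
    return list(seen)


# Hypothetical box lines, made up for illustration.
sample = ["A 10 20 30 50 0", "B 35 20 55 50 0", "A 60 20 80 50 0"]
symbols = collect_symbols(sample)
```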

1.5. Clustering

This step first requires a font_properties.txt file, with one line per font in the following format:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

Mine contains:

Roman 0 0 0 0 0
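
When there are several fonts, the lines of font_properties.txt can be generated rather than typed. The Python sketch below builds one line in the format given above, with each property written as 0 or 1; the helper name font_properties_line is an assumption for illustration.

```python
def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Build '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    if not fontname or any(ch.isspace() for ch in fontname):
        raise ValueError("font name must be non-empty and contain no spaces")
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


line = font_properties_line("Roman")
```

The font name must match the one embedded in the training file name, Roman in this case.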

E:\BuildFolder\tesseract-ocr\testing>mftrainingd -F font_properties.txt -U unicharset ABC.Roman.exp0.tr

Warning: No shape table file present: shapetable

Reading ABC.Roman.exp0.tr ...

Flat shape table summary: Number of shapes = 12 max unichars = 1 number with multiple unichars = 0

Done!

E:\BuildFolder\tesseract-ocr\testing>cntrainingd ABC.Roman.exp0.tr

Reading ABC.Roman.exp0.tr ...

Clustering ...

Writing normproto ...

1.6. Combine

At this point several files should have been generated in the directory. Add the prefix "Roman." to these four files: unicharset, inttemp, normproto, and pffmtable. Then run:

E:\BuildFolder\tesseract-ocr\testing>combine_tessdatad Roman.

Combining tessdata files

TessdataManager combined tesseract data files.

Offset for type 0 is -1

Offset for type 1 is 140

Offset for type 2 is -1

Offset for type 3 is 939

Offset for type 4 is 140232

Offset for type 5 is 140335

Offset for type 6 is -1

Offset for type 7 is -1

Offset for type 8 is -1

Offset for type 9 is -1

Offset for type 10 is -1

Offset for type 11 is -1

Offset for type 12 is -1

Offset for type 13 is 141961

Offset for type 14 is -1

Offset for type 15 is -1
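
In this output, an offset of -1 means that no component of that type was packed into the combined file, while a non-negative offset is where that component starts. In the log above, types 1, 3, 4, 5, and 13 are present. The Python sketch below extracts that list from such a log as a quick check that the renamed files were picked up; the helper name present_components is an assumption, and the sample log is a shortened copy of the output above.

```python
import re

OFFSET_RE = re.compile(r"^Offset for type\s+(\d+) is (-?\d+)$")


def present_components(output):
    """Return the type numbers whose offset is not -1, in ascending order."""
    offsets = {}
    for line in output.splitlines():
        match = OFFSET_RE.match(line.strip())
        if match:
            offsets[int(match.group(1))] = int(match.group(2))
    return sorted(t for t, offset in offsets.items() if offset != -1)


log = """Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 3 is 939
Offset for type 13 is 141961
"""
present = present_components(log)
```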

1.7. Test

Copy the generated Roman.traineddata to E:\BuildFolder\tesseract-ocr\testing\tessdata, then run:

tesseract ABC.Roman.exp0.tif result -l Roman -psm 7 nobatch

That completes the process.

References: http://blog.wudilabs.org/entry/f25efc5f/?lang=zh-CN

http://www.lixin.me/blog/2012/05/26/29536

http://wenku.baidu.com/view/5eafc201e87101f69e3195f4.html

http://www.84kf.com/html/22453.html

http://blog.csdn.net/fengbingchun/article/details/7022421

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The following is reposted from: http://blog.wudilabs.org/entry/f25efc5f/?lang=zh-CN

Required programs

(1)

(2)

(3)

(4) (optional)

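Unpack the Tesseract installer with Universal Extractor, then overwrite the original main program with the tesseract.exe from the Bugfix package; Tesseract is then ready to use. CowBoxer is a program for editing box files.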

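Generating the first box file

In this walkthrough Tesseract was unpacked to E:\tesseract-ocr. A build directory was then created inside it to hold the source data and the files generated during training. There are 3 source images in total (test.001.tif - test.003.tif).

First, generate the box file for the first image, test.001.tif, using the official eng language data for recognition: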

E:\tesseract-ocr\build>..\tesseract test.001.tif test.001 -l eng batch.nochop makebox

Tesseract Open Source OCR Engine with Leptonica

Number of found pages: 1.

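After this command finishes, a test.001.box file appears in the build directory. Open this box file in CowBoxer, which automatically finds and displays the tif file of the same name.

Instructions for CowBoxer are under Help -> About. When the corrections are done, save the file with File -> Save box file.

Generating the initial traineddata

Next, use this single box file to build a first traineddata. Using it when generating the box files for the other images improves recognition accuracy and reduces the number of corrections needed.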

..\tesseract test.001.tif test.001 nobatch box.train

..\training\unicharset_extractor test.001.box

..\training\mftraining -U unicharset -O test.unicharset test.001.tr

..\training\cntraining test.001.tr

rename normproto test.normproto

rename Microfeat test.Microfeat

rename inttemp test.inttemp

rename pffmtable test.pffmtable

..\training\combine_tessdata test.

After running this series of commands in the build directory, a usable test.traineddata has been generated.
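
The four rename commands above can also be done in one pass. The Python sketch below adds the "test." prefix to the same four files; the helper name add_prefix is an assumption, and the demonstration runs on empty placeholder files in a temporary directory rather than on real training output.

```python
import tempfile
from pathlib import Path


def add_prefix(directory, names, prefix):
    """Rename each listed file in directory to prefix + name."""
    renamed = []
    for name in names:
        src = Path(directory) / name
        if not src.exists():
            raise FileNotFoundError(src)
        dst = src.with_name(prefix + name)
        src.rename(dst)  # on Windows this fails if dst already exists
        renamed.append(dst.name)
    return renamed


# Demonstration on empty placeholder files, not real training output.
with tempfile.TemporaryDirectory() as tmp:
    names = ["normproto", "Microfeat", "inttemp", "pffmtable"]
    for name in names:
        (Path(tmp) / name).touch()
    result = add_prefix(tmp, names, "test.")
```

Raising on a missing file makes it obvious when one of the training tools did not produce its output, which a batch of rename commands would only report as a scattered error.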

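Generating the remaining box files

Move the test.traineddata generated in the previous step into the tesseract-ocr\tessdata directory. It can then be selected with the -l test parameter when generating the other box files.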

..\tesseract test.002.tif test.002 -l test batch.nochop makebox

..\tesseract test.003.tif test.003 -l test batch.nochop makebox

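Only 3 source files are used here as an example. In real training, when to build an intermediate traineddata depends on the situation. Its only purpose is to improve recognition accuracy so that the box files generated afterwards need fewer corrections.

Generating the final traineddata

Once all the box files have been prepared, the final traineddata can be generated.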

..\tesseract test.001.tif test.001 nobatch box.train

..\tesseract test.002.tif test.002 nobatch box.train

..\tesseract test.003.tif test.003 nobatch box.train

..\training\unicharset_extractor test.001.box test.002.box test.003.box

..\training\mftraining -U unicharset -O test.unicharset test.001.tr test.002.tr test.003.tr

..\training\cntraining test.001.tr test.002.tr test.003.tr

rename normproto test.normproto

rename Microfeat test.Microfeat

rename inttemp test.inttemp

rename pffmtable test.pffmtable

..\training\combine_tessdata test.

When there are many files, a script like this can be generated by a program and then executed.
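
As one possible way to do that, the Python sketch below generates the same command sequence for any number of images named prefix.001.tif, prefix.002.tif, and so on. The function name training_script and its parameters are assumptions for illustration; the commands themselves are the ones listed above.

```python
def training_script(prefix, count, tool_dir="..\\training", tesseract="..\\tesseract"):
    """Return the batch commands for training on prefix.001 .. prefix.<count>."""
    bases = [f"{prefix}.{i:03d}" for i in range(1, count + 1)]
    tr_files = " ".join(base + ".tr" for base in bases)
    box_files = " ".join(base + ".box" for base in bases)

    # One box.train run per image produces the .tr files.
    lines = [f"{tesseract} {base}.tif {base} nobatch box.train" for base in bases]
    lines.append(f"{tool_dir}\\unicharset_extractor {box_files}")
    lines.append(f"{tool_dir}\\mftraining -U unicharset -O {prefix}.unicharset {tr_files}")
    lines.append(f"{tool_dir}\\cntraining {tr_files}")
    for name in ("normproto", "Microfeat", "inttemp", "pffmtable"):
        lines.append(f"rename {name} {prefix}.{name}")
    lines.append(f"{tool_dir}\\combine_tessdata {prefix}.")
    return "\n".join(lines)


script = training_script("test", 3)
print(script)
```

Writing the returned text to a .bat file in the build directory and running it there reproduces the manual steps; the zero-padded three-digit numbering matches the test.001 naming used in this walkthrough.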

Reposted from: https://www.cnblogs.com/songtzu/archive/2013/01/28/2880497.html
