Official GitHub repository:
https://github.com/tesseract-ocr/tesseract
Note: the JAI library used by tess4j only supports the following image types:
For details, see the link below:
https://github.com/jai-imageio/jai-imageio-core
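To see which formats the current runtime can actually decode, here is a minimal sketch that uses only the standard `javax.imageio` API to list the registered readers and look up a file name's extension. The class `ImageFormatCheck` and its methods are illustrative names, not part of tess4j. It also assumes that the reader is chosen from the file extension, so treat the result as a hint rather than a guarantee.

```java
import java.util.Locale;
import java.util.TreeSet;
import javax.imageio.ImageIO;

// Hypothetical helper (not part of tess4j): inspects the ImageIO readers
// registered in the current JVM, including any contributed by JAI Image I/O jars.
final class ImageFormatCheck {

    // Returns the lower-cased, de-duplicated format names ImageIO can read.
    static TreeSet<String> readableFormats() {
        TreeSet<String> names = new TreeSet<>();
        for (String name : ImageIO.getReaderFormatNames()) {
            names.add(name.toLowerCase(Locale.ROOT));
        }
        return names;
    }

    // Returns true if a reader is registered for the file name's extension.
    static boolean hasReaderFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension to look up
        }
        String suffix = fileName.substring(dot + 1);
        return ImageIO.getImageReadersBySuffix(suffix).hasNext();
    }

    public static void main(String[] args) {
        System.out.println("Readable formats: " + readableFormats());
        for (String name : args) {
            System.out.println(name + " -> " + hasReaderFor(name));
        }
    }
}
```

Running it with the project's dependencies on the classpath shows whether adding the JAI Image I/O jars changes the list. The extension is compared as written, so stray whitespace in a file name also makes the lookup fail.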
Install the system environment (optional)
https://github.com/UB-Mannheim/tesseract/wiki
If it is not installed, the following error appears when running OCR:
java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:215) ~[tess4j-4.5.5.jar:4.5.5]
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:195) ~[tess4j-4.5.5.jar:4.5.5]
Download the language data
https://github.com/tesseract-ocr/tessdata
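Open the link above and download the zip package:
If tessdata is not configured before running OCR, the following error occurs:
Of course, you can also download only eng.traineddata.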
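Before running OCR, it can help to confirm that the language file is where you expect it. The sketch below uses only `java.nio.file`. The class `TessdataCheck`, the default folder name `tessdata`, and the default language `eng` are placeholders chosen for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of tess4j): verifies that a traineddata
// file exists in the folder you plan to use as the data path.
final class TessdataCheck {

    // Returns true if <tessdataDir>/<language>.traineddata is a regular file.
    static boolean hasLanguage(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        // "tessdata" is a placeholder; pass the real folder as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String lang = args.length > 1 ? args[1] : "eng";
        if (hasLanguage(dir, lang)) {
            System.out.println("Found " + lang + ".traineddata in " + dir.toAbsolutePath());
        } else {
            System.err.println("Missing " + lang + ".traineddata in " + dir.toAbsolutePath());
        }
    }
}
```

If the file is reported missing, either copy it into that folder or point the data path at the folder that contains it.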
Add the dependencies
// OCR dependency
implementation 'net.sourceforge.tess4j:tess4j:4.5.5'
// JAI Image I/O extension library
implementation group: 'com.github.jai-imageio', name: 'jai-imageio-jpeg2000', version: '1.4.0'
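Code example
import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

File ocrFile = new File("ocr.png");
// Use OCR to extract the text from the image
Tesseract tesseract = new Tesseract();
// Set the path to the Tesseract data files if they are not in the default location
// tesseract.setDatapath("path_to_your_tessdata_folder");
try {
    String result = tesseract.doOCR(ocrFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}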
Other notes
java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:215) ~[tess4j-4.5.5.jar:4.5.5]
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:195) ~[tess4j-4.5.5.jar:4.5.5]
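If you see this error, check whether your image type falls outside the supported range; the JAI project page linked above lists the supported formats. Don't doubt it: PNG and JPG are both unsupported!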
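One possible workaround, sketched here with only the standard `javax.imageio` and `java.awt` classes, is to re-encode the image into a format from the supported list before passing it to OCR. The class `ImageReencoder` is a hypothetical name, and the target format `bmp` in `main` is only an example; which format works in your setup is an assumption to check against the list above.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper (not part of tess4j): re-encodes an image file
// into another format that an ImageIO writer is registered for.
final class ImageReencoder {

    static void reencode(File source, File target, String formatName) throws IOException {
        BufferedImage input = ImageIO.read(source);
        if (input == null) {
            throw new IOException("No ImageIO reader could decode " + source);
        }
        // Redraw onto an opaque RGB canvas, since some writers (e.g. BMP) reject alpha.
        BufferedImage rgb = new BufferedImage(
                input.getWidth(), input.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.setColor(Color.WHITE); // white background behind transparent pixels
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(input, 0, 0, null);
        } finally {
            g.dispose();
        }
        // ImageIO.write returns false when no writer handles the format.
        if (!ImageIO.write(rgb, formatName, target)) {
            throw new IOException("No ImageIO writer for format " + formatName);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ImageReencoder <source> <target> [format]
        String format = args.length > 2 ? args[2] : "bmp";
        reencode(new File(args[0]), new File(args[1]), format);
        System.out.println("Wrote " + args[1] + " as " + format);
    }
}
```

The re-encoded file can then be passed to `doOCR` in place of the original.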