As a follow-up to my previous questions, I am trying to extract text from a PDF file using the CGPDF* functions. Given a:
CGPDFStringRef pdfString
I found that it can be converted to an array of character codes like this:
const unsigned char *characterCodes = CGPDFStringGetBytePtr(pdfString);
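Now, the text I am trying to extract is set in one of the 14 base Type 1 fonts, which is not embedded in the PDF itself. To deal with that, I parsed the font's AFM file, which gives the mapping from character codes to glyph names along with their metrics, like this: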
C 61 ; WX 600 ; N equal ; B 80 138 520 376 ;
C 63 ; WX 600 ; N question ; B 129 -15 492 572 ;
C 64 ; WX 600 ; N at ; B 77 -15 533 622 ;
C 65 ; WX 600 ; N A ; B 3 0 597 562 ;
C 66 ; WX 600 ; N B ; B 43 0 559 562 ;
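For reference, a minimal sketch of how one such line could be parsed in C. The struct and function names are hypothetical, and it assumes every entry uses the decimal "C" field and an integer "WX" width, as in the lines above; entries that use other fields are simply rejected.

```c
#include <stdio.h>

/* One character-metric entry from an AFM file (hypothetical struct name). */
typedef struct {
    int  code;      /* character code from the "C" field, in decimal */
    int  width;     /* advance width from the "WX" field */
    char name[64];  /* glyph name from the "N" field */
} AFMCharMetric;

/* Parses a line such as "C 61 ; WX 600 ; N equal ; B 80 138 520 376 ;".
   The bounding box is ignored. Returns 1 on success, 0 if the line
   does not have this shape. */
static int afm_parse_char_metric(const char *line, AFMCharMetric *out)
{
    int fields = sscanf(line, "C %d ; WX %d ; N %63s ;",
                        &out->code, &out->width, out->name);
    return fields == 3;
}
```

Feeding it the first line above yields code 61, width 600 and the glyph name "equal".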
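My question is: knowing a character code, say "61", how do I get from its glyph name, "equal", to the NSString @"="?
Especially when that character code has been remapped to a different glyph name, say "question", by the PDF's font encoding options.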
Previous questions:
iOS PDF parsing Type 1 Fonts metrics and
iOS PDF to plain text parser
Best answer
I haven't tested this, but it looks to me like you need to use the
Adobe Glyph Naming convention:
The purpose of the Adobe Glyph Naming convention is to support the
computation of a Unicode character string from a sequence of glyphs.
This is achieved by specifying a mapping from glyph names to character
strings.
The glyphlist.txt linked on that page seems relevant to your problem.
Sample snippet:
…
epsilon;03B5
epsilontonos;03AD
equal;003D
equalmonospace;FF1D
equalsmall;FE66
equalsuperior;207C
…
Then all you need to do is put those Unicode values into your NSString instance.
Edit:
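As a rough, untested-on-device sketch of that step in plain C: the two helpers below (hypothetical names) match a glyph name against a single glyphlist.txt-style line and encode the resulting code points as UTF-8. It assumes the "name;XXXX" layout shown in the snippet, with optional further hex values separated by spaces, and treats lines starting with '#' as comments.

```c
#include <stdlib.h>
#include <string.h>

/* Encodes one Unicode code point as UTF-8.
   Returns the number of bytes written, or 0 for an invalid code point. */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* If `line` is the entry for `glyph_name`, writes the NUL-terminated UTF-8
   text into `utf8` (capacity `cap` bytes) and returns 1; otherwise returns 0. */
static int agl_line_to_utf8(const char *line, const char *glyph_name,
                            char *utf8, size_t cap)
{
    size_t name_len = strlen(glyph_name);
    size_t used = 0;
    const char *p;

    if (line[0] == '#') return 0;                        /* comment line */
    if (strncmp(line, glyph_name, name_len) != 0) return 0;
    if (line[name_len] != ';') return 0;                 /* only a prefix matched */

    p = line + name_len + 1;
    for (;;) {
        char *end;
        char buf[4];
        size_t n;
        unsigned long cp = strtoul(p, &end, 16);
        if (end == p) break;                             /* no more hex values */
        n = utf8_encode(cp, buf);
        if (n == 0 || used + n + 1 > cap) return 0;
        memcpy(utf8 + used, buf, n);
        used += n;
        p = end;
    }
    if (used == 0) return 0;
    utf8[used] = '\0';
    return 1;
}
```

For the line "equal;003D" and the glyph name "equal" this produces the one-byte string "=", which could then be handed to NSString's stringWithUTF8String:. The exact ';' check is what keeps "equal" from matching "equalmonospace" or "equalsmall".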
Confirming the information given above, I found the following explanation in the PDF Reference Document from Adobe, Section 5.9, Extraction of Text Content:
If the font is a simple font that uses one of the predefined encodings
MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has
an encoding whose Differences array includes only character names
taken from the Adobe standard Latin character set and the set of named
characters in the Symbol font (see Appendix D):
- Map the character code to a character name according to Table D.1 on
page 996 and the font’s Differences array.
- Look up the character name in the Adobe Glyph List (see the
Bibliography) to obtain the corresponding Unicode value.
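The two steps quoted above might be sketched in C roughly as follows. Everything here is illustrative: the struct and function names are hypothetical, the tables are assumed to have been filled in beforehand (the base table from the AFM file or Table D.1, the overrides from the font's Differences array flattened into code/name pairs, and the name-to-Unicode table from glyphlist.txt), and only single code point glyph list entries are handled.

```c
#include <stddef.h>
#include <string.h>

/* code -> glyph name, used for both the base encoding and the overrides */
typedef struct { int code; const char *name; } CodeName;

/* glyph name -> Unicode code point, as loaded from the Adobe Glyph List */
typedef struct { const char *name; long unicode; } NameUnicode;

/* Step 1: map a character code to a glyph name. Entries taken from the
   Differences array win over the base encoding. Returns NULL if unmapped. */
static const char *glyph_name_for_code(int code,
                                       const CodeName *base, size_t nbase,
                                       const CodeName *diffs, size_t ndiffs)
{
    size_t i;
    for (i = 0; i < ndiffs; i++)
        if (diffs[i].code == code) return diffs[i].name;
    for (i = 0; i < nbase; i++)
        if (base[i].code == code) return base[i].name;
    return NULL;
}

/* Step 2: look the glyph name up in the glyph list. Returns -1 if unknown. */
static long unicode_for_glyph_name(const char *name,
                                   const NameUnicode *agl, size_t nagl)
{
    size_t i;
    for (i = 0; i < nagl; i++)
        if (strcmp(agl[i].name, name) == 0) return agl[i].unicode;
    return -1;
}

/* Both steps combined: character code -> Unicode code point, or -1. */
static long unicode_for_code(int code,
                             const CodeName *base, size_t nbase,
                             const CodeName *diffs, size_t ndiffs,
                             const NameUnicode *agl, size_t nagl)
{
    const char *name = glyph_name_for_code(code, base, nbase, diffs, ndiffs);
    return name ? unicode_for_glyph_name(name, agl, nagl) : -1;
}
```

With a base table that maps 61 to "equal" and no overrides, code 61 resolves to U+003D. If the Differences array remaps 61 to "question", the same code resolves to U+003F instead, which covers the remapping case from the question. A linear search is used only to keep the sketch short; a real lookup over the full glyph list would probably want a hash table or a sorted array.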