Abstract:
We describe algorithms for identifying the language of text in document images that are complex, unoriented, and degraded. We distinguish among seven languages: Chinese, English, French, German, Italian, Japanese, and Spanish. The page layouts may be complex, containing text blocks in unknown, roughly Manhattan arrangements. The pages may be unoriented, that is, upright or rotated by 90, 180, or 270 degrees. The images may be degraded by digitization at coarse and unequal spatial sampling rates, as in FAXes. We begin by segmenting the page into text lines in a manner oblivious to page skew and to both page and text-line orientation. Then we distinguish between Asian and Latin scripts at any orientation. For Asian scripts, Chinese versus Japanese is decided at any orientation, and then the orientation is detected. For Latin scripts, we detect first orientation and then language. A variety of decision procedures are used, some hand-crafted (e.g., using spatial features and optical density distributions) and others trainable (e.g., using word unigram relative entropy models). Tests on 1088 standard (low) resolution FAX images show that our method accurately identifies scripts (98.16%) and language and page orientation (94.76%).
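
As an illustration of the trainable decision procedures mentioned above, the following is a minimal sketch of language identification with word unigram relative entropy. It is not the implementation evaluated in the paper. It assumes that word tokens are already available as plain strings (the abstract does not say how words are represented after image analysis), that each language model is a table of word counts, and that add-one smoothing is used for unseen words. The names `relative_entropy`, `identify_language`, and the toy training sentences are hypothetical.

```python
import math
from collections import Counter


def relative_entropy(doc_tokens, model_counts, vocab_size):
    """Relative entropy D(p || q) between the document's word unigram
    distribution p and a smoothed language model q.

    Assumption: q uses add-one smoothing over a shared vocabulary of
    size vocab_size, so unseen words get a small nonzero probability.
    """
    doc_counts = Counter(doc_tokens)
    n = sum(doc_counts.values())
    if n == 0:
        raise ValueError("document contains no tokens")
    total = sum(model_counts.values())
    divergence = 0.0
    for word, count in doc_counts.items():
        p = count / n
        q = (model_counts.get(word, 0) + 1) / (total + vocab_size)
        divergence += p * math.log(p / q)
    return divergence


def identify_language(doc_tokens, models):
    """Return the language whose unigram model has the smallest
    relative entropy with respect to the document."""
    # Shared vocabulary: every word seen in training or in the document.
    vocab = set(doc_tokens)
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab)
    return min(
        models,
        key=lambda lang: relative_entropy(doc_tokens, models[lang], vocab_size),
    )


# Toy training text (hypothetical; real models would need large corpora).
MODELS = {
    "english": Counter("the cat and the dog are in the house".split()),
    "french": Counter("le chat et le chien sont dans la maison".split()),
    "german": Counter("die katze und der hund sind in dem haus".split()),
}

if __name__ == "__main__":
    print(identify_language("the dog and the cat".split(), MODELS))
```

Because the entropy of the document distribution is the same for every candidate language, choosing the minimum relative entropy is equivalent to choosing the model with the lowest cross-entropy on the document.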