Table structure recognition based on robust block segmentation


Author(s) : Thomas G. Kieninger
Publisher : N/A
Publication Date : 1998
ISSN : N/A
Abstract : This paper presents an efficient approach to identifying tabular structures within either electronic or paper documents. The resulting T-Recs system takes word bounding box information as input and outputs the corresponding logical text block units (e.g. the cells within a table environment). Starting with an arbitrary word as block seed, the algorithm recursively expands this block to all words that interleave with their vertical (north and south) neighbors. Since even the smallest gaps between table columns prevent their words from mutually interleaving, this initial segmentation is able to identify and isolate such columns. To deal with some inherent segmentation errors caused by isolated lines (e.g. headers), overhanging words, or cells spanning more than one column, a series of postprocessing steps is added. These steps benefit from a very simple distinction between type 1 and type 2 blocks: type 1 blocks are those with at most one word per line; all others are of type 2. This distinction allows the selective application of heuristics to each group of blocks. The conjoint decomposition of column blocks into subsets of table cells leads to the final block segmentation at a homogeneous abstraction level. These segments serve the final layout analysis, which identifies table environments and cells that stretch over several rows and/or columns.
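
The seed-and-expand segmentation and the type 1 / type 2 distinction described in the abstract can be illustrated with a minimal Python sketch. This is not the T-Recs implementation. It assumes that each word already carries a line index, and it reads "interleave" as overlap of horizontal extents between words on adjacent lines. The names Word, overlaps, segment, and block_type are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    line: int   # index of the text line (assumed to be known in advance)
    x0: float   # left edge of the bounding box
    x1: float   # right edge of the bounding box


def overlaps(a, b):
    """True if the horizontal extents of two words share some range."""
    return a.x0 < b.x1 and b.x0 < a.x1


def segment(words):
    """Grow blocks from seed words via overlap with words one line up or down.

    Assumes all Word values are distinct (they are used as set members).
    """
    by_line = defaultdict(list)
    for w in words:
        by_line[w.line].append(w)

    unseen = set(words)
    blocks = []
    for seed in words:
        if seed not in unseen:
            continue
        unseen.discard(seed)
        block, stack = [], [seed]
        while stack:  # iterative form of the recursive expansion
            w = stack.pop()
            block.append(w)
            for ln in (w.line - 1, w.line + 1):  # north and south neighbors
                for n in by_line.get(ln, []):
                    if n in unseen and overlaps(w, n):
                        unseen.discard(n)
                        stack.append(n)
        blocks.append(sorted(block, key=lambda w: (w.line, w.x0)))
    return blocks


def block_type(block):
    """Type 1: at most one word per line. Type 2: everything else."""
    per_line = defaultdict(int)
    for w in block:
        per_line[w.line] += 1
    return 1 if max(per_line.values()) <= 1 else 2


# A two-column table: the gap between columns keeps the words apart.
table = [
    Word("Name", 0, 0, 40), Word("Price", 0, 100, 140),
    Word("Apple", 1, 0, 45), Word("1.20", 1, 100, 130),
    Word("Pear", 2, 0, 35), Word("0.90", 2, 100, 130),
]
for block in segment(table):
    print(block_type(block), [w.text for w in block])
# prints:
# 1 ['Name', 'Apple', 'Pear']
# 1 ['Price', '1.20', '0.90']
```

In this sketch each table column comes out as its own type 1 block, while ordinary paragraph text, whose words overlap across lines, merges into a single type 2 block. The postprocessing heuristics mentioned in the abstract are not covered here.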