Home

Hardening soft information sources


Author(s) : David Mcallester Henry Kautz William Cohen, 
Publisher : N/A
Publication Date : 2000
ISSN : N/A
Abstract : The web contains a large quantity of unstructured information. In many cases, it is possible to heuristically extract structured information, but the resulting databases are \soft": they contain inconsistencies and duplication, and lack unique, consistentlyused object identiers. Examples include the large bibliographic databases harvested from raw scientic papers by systems such as ResearchIndex or Cora, and information extracted by various crawlers from sources such as classied ads, movie listings, and online catalogs. Soft databases can also arise as a result of merging heterogenous \hard " databases. Here we formally model a soft database as a noisy version of some unknown hard database. We then consider the hardening problem, i.e., the problem of inferring the most likely underlying hard database given a particular soft database. A key feature of our approach is that hardening is global | many sources of evidence for a given hard fact are taken into account. We formulate hardening as an optimization problem and give a nontrivial nearly linear time algorithm for nding a local optimum. It is hoped that global hardening methods will allow a greater variety of soft sources to be reliably structured. 1,