<technical report>
Optimized Substructure Discovery for Semi-structured Data

Abstract We address the problem of finding interesting substructures in a collection of semi-structured data such as XML or HTML. Our data mining framework is optimized pattern discovery, introduced by Fukuda et al., where the goal of a mining algorithm is to discover a pattern that optimizes a given statistical measure, such as the information entropy, over a class of simple patterns. In this paper, modeling semi-structured data with labeled ordered trees, we study an efficient algorithm for the optimized pattern discovery problem for this class. In a previous paper, we developed the rightmost expansion technique and the incremental occurrence update technique by generalizing the enumeration technique developed by Bayardo (SIGMOD'98) for discovering long itemsets, in order to implement an efficient frequent pattern miner for the class of labeled ordered trees. By combining these techniques with the pruning technique for optimized patterns of Morishita and Sese (PODS'00), we present an efficient algorithm for finding optimized patterns for labeled ordered trees of bounded size. Experimental results show that our algorithm performs well across a range of data sizes and parameter settings. We also show an approximation hardness result for labeled ordered trees of unbounded size.


pdf trcs206 pdf 930 KB 304  
gz trcs206.ps gz 0.98 MB 172  

Details

Created Date 2009.04.22
Modified Date 2018.08.31
