<technical report>
A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

Creator
Language
Publisher
Date
Source Title
Vol
Publication Type
Access Rights
Related DOI
Related URI
Relation
Abstract We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-word association pattern is an expression such as $ (TATA, 30, AG...GAGGT) Rightarrow C $ that expresses a rule that if a text contains a subword TATA followed by another subword AGGAGGT with distance no more than 30 letters then a property C will hold with a probability. We present an efficient algorithm for computing frequent patterns ($ alpha $, $k$ , $\beta $) that optimize the confidence with respect to a given collection of texts. The algorithm runs in time $ O(mn^2) $ and space $ O(kn) $, where $ m $ and $ n $ are the number and the total length of classification examples, respectively, and $ k $ is a small constant around 30 ~ 50. Furthermore for most random and nearly random texts like DNA sequences, the algorithm runs very efficiently in time $ O(kn log^2 n) $. Thus, this algorithm is much faster than a straightforward algorithm that enumerates all the possible patterns in time $ O(n^5) $. We also discuss some heuristics such as sampling and pruning for practical improvement. Then, we evaluate the efficiency and the performance of the algorithm with experiments on genetic sequences.show more

Hide fulltext details.

pdf trcs148 pdf 276 KB 450  
gz trcs148.ps gz 152 KB 29  

Details

Record ID
Peer-Reviewed
Type
Created Date 2009.04.22
Modified Date 2017.01.20

People who viewed this item also viewed