A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases - Collections | Kyushu University Library

＜technical report＞
A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

Creator	Arimura, Hiroki (有村, 博紀), Department of Informatics, Kyushu University
	Wataki, Atsushi (渡木, 厚), Department of Informatics, Kyushu University
	Fujino, Ryoichi (藤野, 亮一), Department of Informatics, Kyushu University
	Arikawa, Setsuo (有川, 節夫), Department of Informatics, Kyushu University
Language	English
Publisher	Department of Informatics, Kyushu University
Date	1998-03-19
Source Title	DOI Technical Report
Vol	148
Publication Type	Accepted Manuscript
Access Rights	open access
Related DOI	DOI Technical Report || 148
Related DOI	http://www.i.kyushu-u.ac.jp/research/report.html
Abstract	We consider a data mining problem in a large collection of unstructured texts, based on association rules over subwords of texts. A two-word association pattern is an expression such as $(TATA, 30, AG...GAGGT) \Rightarrow C$, which expresses the rule that if a text contains the subword TATA followed by the subword AGGAGGT at a distance of no more than 30 letters, then property $C$ holds with some probability. We present an efficient algorithm for computing frequent patterns $(\alpha, k, \beta)$ that optimize the confidence with respect to a given collection of texts. The algorithm runs in time $O(mn^2)$ and space $O(kn)$, where $m$ and $n$ are the number and the total length of the classification examples, respectively, and $k$ is a small constant, around 30 to 50. Furthermore, for most random and nearly random texts such as DNA sequences, the algorithm runs very efficiently, in time $O(kn \log^2 n)$. Thus, the algorithm is much faster than a straightforward algorithm that enumerates all possible patterns in time $O(n^5)$. We also discuss heuristics such as sampling and pruning for practical improvement, and we evaluate the efficiency and performance of the algorithm with experiments on genetic sequences.
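
To make the pattern notation in the abstract concrete, the following is a minimal Python sketch of how a single two-word pattern (alpha, k, beta) might be matched against a text and scored. It is not the report's algorithm, which searches for optimal patterns; it only evaluates one given pattern by brute force. The function names `matches` and `support_and_confidence` are hypothetical, and two definitions are assumptions made for illustration: "distance" is taken as the number of letters between the end of alpha and the start of beta (0 to k), and "confidence" is taken as the fraction of matching texts that carry the property C.

```python
from typing import List, Tuple


def matches(text: str, alpha: str, k: int, beta: str) -> bool:
    """Return True if alpha occurs in text and beta starts at most k letters
    after the end of that occurrence (assumed meaning of "distance")."""
    start = text.find(alpha)
    while start != -1:
        end = start + len(alpha)
        # Any occurrence of beta inside this window starts at offset 0..k
        # after the end of alpha.
        window = text[end:end + k + len(beta)]
        if beta in window:
            return True
        # Try the next (possibly overlapping) occurrence of alpha.
        start = text.find(alpha, start + 1)
    return False


def support_and_confidence(
    examples: List[Tuple[str, bool]], alpha: str, k: int, beta: str
) -> Tuple[float, float]:
    """Score one pattern on labeled texts.

    Assumed definitions (not taken from the report):
      support    = matching texts / all texts
      confidence = matching texts with property C / matching texts
    """
    matched_labels = [label for text, label in examples
                      if matches(text, alpha, k, beta)]
    support = len(matched_labels) / len(examples) if examples else 0.0
    confidence = (sum(matched_labels) / len(matched_labels)
                  if matched_labels else 0.0)
    return support, confidence


if __name__ == "__main__":
    # Toy labeled texts; the boolean says whether property C holds.
    toy = [
        ("TATA" + "C" * 5 + "AGGAGGT", True),
        ("TATA" + "C" * 40 + "AGGAGGT", True),
        ("AGGAGGT" + "TATA", False),
        ("TATA" + "AGGAGGT", False),
    ]
    print(support_and_confidence(toy, "TATA", 30, "AGGAGGT"))
```

Scoring every candidate pattern this way is the kind of exhaustive enumeration the abstract describes as straightforward but slow; the sketch is meant only to pin down what a single pattern expresses.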

File	FileType	Size	Views	Description
trcs148	pdf	276 KB	450
trcs148.ps	gz	152 KB	29

Details

Record ID	3016
Peer-Reviewed	Unrefereed
Type	Technical Report
Created Date	2009.04.22
Modified Date	2017.01.20