A Java Implementation of the Ngram Statistics Package
Copyright © 2008 by
Cam-Tu Nguyen (ncamtu at gmail
Thu-Trang Nguyen (trangnt84 at gmail
2.2.3. Option Parameters
2.3. Output Data Format
2.4. Case Study
JNSP is a Java implementation of the Ngram Statistics Package (NSP), which is available at http://ngram.sourceforge.net/ . It lets users count n-grams and find collocations through hypothesis testing with measures such as the t-score and mutual information.
- September 07, 2008:
· Released version 2.0
We greatly appreciate any suggestions, comments, and bug reports.
JNSP is free software; you can
redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation.
JNSP is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You can find and download the documentation and source code of JNSP at http://sourceforge.net/projects/jnsp
Here are some other tools developed by the same author:
· JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool
· JGibbLDA: A Java-based implementation of Latent Dirichlet Allocation using Gibbs sampling for estimation/inference
In this section, we describe how to use this tool to count n-grams and to compute association statistics from those counts. Suppose that the current working directory is the JNSP home directory and that we are on a Linux platform. The command lines for other platforms are similar.
$ java [-mx512M] -cp bin:lib/tokenfilter.jar jnsp.Counter <option file>
$ java [-mx512M] -cp bin:lib/tokenfilter.jar jnsp.Statistic <option file>
The parameters that can be set in the option file are as follows:
For example, to refer to the file D:\count2.cnt (here, 2 is the n-gram size and cnt is the suffix of counting file names), we specify cntFile=D:\count. Combined with the n-gram size, this cntFile value is enough to reconstruct the full counting file name.
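The naming rule in the example above can be sketched in a few lines of Java. This is only an illustration, not JNSP's own code: the class name CountFileName and the method build are hypothetical, and the ".cnt" suffix is taken from the D:\count2.cnt example.

```java
// Illustrative helper (not part of JNSP): rebuilds a counting file name
// from the cntFile prefix and the n-gram size.
final class CountFileName {
    // Suffix assumed from the example file name count2.cnt.
    static final String SUFFIX = ".cnt";

    // Concatenates prefix + n-gram size + suffix.
    static String build(String cntFilePrefix, int ngramSize) {
        if (ngramSize < 1) {
            throw new IllegalArgumentException("ngramSize must be >= 1");
        }
        return cntFilePrefix + ngramSize + SUFFIX;
    }

    public static void main(String[] args) {
        // Prefix as given in the option file, with 2-grams.
        System.out.println(build("D:\\count", 2)); // prints D:\count2.cnt
    }
}
```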
[Number of ngrams]
[Ngram 1]<>[Counting Parameters]
[Ngram 2]<>[Counting Parameters]
[Ngram 1]<>[statistic value]
[Ngram 2]<>[statistic value]
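Both layouts above put the n-gram tokens first, separated by "<>", followed by a final field of numbers. A minimal sketch of a reader for one such line could look like the following. The class NgramLine is hypothetical and not part of JNSP, and it assumes that the numbers in the last field are separated by whitespace, as in the sample output.

```java
// Illustrative parser (not part of JNSP) for one output line such as
// "token1<>token2<>4 6 5" or "token1<>token2<>3.14".
final class NgramLine {
    final String[] tokens; // the n-gram's tokens, in order
    final double[] values; // counting parameters or a single statistic value

    NgramLine(String[] tokens, double[] values) {
        this.tokens = tokens;
        this.values = values;
    }

    static NgramLine parse(String line) {
        // Fields are separated by the literal marker "<>".
        String[] fields = line.trim().split("<>");
        if (fields.length < 2) {
            throw new IllegalArgumentException("not an n-gram line: " + line);
        }
        // Every field except the last one is a token of the n-gram.
        String[] tokens = new String[fields.length - 1];
        System.arraycopy(fields, 0, tokens, 0, tokens.length);
        // The last field holds one or more whitespace-separated numbers.
        String[] raw = fields[fields.length - 1].trim().split("\\s+");
        double[] values = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            values[i] = Double.parseDouble(raw[i]);
        }
        return new NgramLine(tokens, values);
    }

    public static void main(String[] args) {
        NgramLine r = parse("glenn<>mcgrath<>4 6 5");
        System.out.println(r.tokens.length + " tokens, " + r.values.length + " values");
    }
}
```

The first line of a counting file, which holds only the number of n-grams, contains no "<>" marker and would need to be read separately before calling parse.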
run<>begins<>1 33 11
months<>ago<>21 35 123
test<>match<>4 74 108
team<>effort<>3 207 10
aussie<>bowlers<>1 5 15
glenn<>mcgrath<>4 6 5
colin<>miller<>1 34 19
clb<>benfica<>4 808 11
benfica<>chuẩn_bị<>1 4 85
chuẩn_bị<>xây_dựng<>1 67 28
xây_dựng<>svđ<>1 31 11
ban<>điều_hành<>1 163 48
điều_hành<>clb<>15 38 447
phá_bỏ<>svđ<>1 2 11
svđ<>hiện_thời<>1 24 4
vòng<>chung_kết<>69 543 318
chung_kết<>euro<>5 217 191
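To show how counts like these feed the association measures mentioned earlier, here is a hedged sketch of pointwise mutual information and the t-score for a bigram. It assumes the NSP convention that the three numbers on each line are n11 (joint frequency of the pair), n1p (frequency of the first word in first position), and np1 (frequency of the second word in second position); the text above does not state this explicitly. The total number of bigrams, npp, is not shown in the sample, so the value used in main is purely hypothetical, and the class BigramMeasures is not part of JNSP.

```java
// Illustrative computation (not JNSP's own code) of two association
// measures from bigram counts, using the standard definitions.
final class BigramMeasures {
    // Expected joint count if the two words were independent:
    // m11 = n1p * np1 / npp.
    static double expected(long n1p, long np1, long npp) {
        return (double) n1p * np1 / npp;
    }

    // Pointwise mutual information: log2(n11 / m11).
    static double pmi(long n11, long n1p, long np1, long npp) {
        return Math.log(n11 / expected(n1p, np1, npp)) / Math.log(2.0);
    }

    // t-score: (n11 - m11) / sqrt(n11).
    static double tScore(long n11, long n1p, long np1, long npp) {
        return (n11 - expected(n1p, np1, npp)) / Math.sqrt(n11);
    }

    public static void main(String[] args) {
        // Counts from the sample line glenn<>mcgrath<>4 6 5.
        // npp is a made-up total used only for illustration.
        long npp = 100000L;
        System.out.println("PMI     = " + pmi(4, 6, 5, npp));
        System.out.println("t-score = " + tScore(4, 6, 5, npp));
    }
}
```

Under these assumptions, a pair whose observed count is well above its expected count receives high values on both measures.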
Our code is based on the NSP of Ted Pedersen et al. and the design descriptions in [Banerjee & Pedersen]. We would like to thank Ted Pedersen, Satanjeev Banerjee, and others for sharing the code and a comprehensive technical report.
We would like to thank Sourceforge.net for hosting this project.
· [Banerjee & Pedersen] Satanjeev Banerjee and Ted Pedersen. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, February 17-21, 2003, Mexico City.
Last updated September , 2008