JNSP

JNSP is a Java implementation of the Ngram Statistics Package (NSP, http://ngram.sourceforge.net/). It lets users count ngrams and find collocations through hypothesis testing with measures such as the t-score and mutual information.

1.2. News, Comments, and Bug Reports

- September 07, 2008:

· Released version 2.0

We greatly appreciate any suggestions, comments, and bug reports.

1.3. License

JNSP is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

JNSP is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

2. How to Use JNSP from Command Line

2.1. Download

You can download the documentation and source code of JNSP at http://sourceforge.net/projects/jnsp

Here are some other tools developed by the same author:

· JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool

· JGibbLDA: A Java-based implementation of Latent Dirichlet Allocation using Gibbs Sampling for estimation/inference

2.2. Command Line & Input Parameters

In this section, we describe how to use this tool to count ngrams in a text collection and to compute statistics on the counts. Suppose that the current working directory is the home directory of JNSP and that we are on a Linux platform. The command lines for other platforms are similar.

2.2.1. Counting a collection of text

$ java [-mx512M] -cp bin:lib/tokenfilter.jar jnsp.Counter <option file>

<option file>: path to the option file. The parameters that can be set in the option file are listed in Section 2.2.3.

2.2.2. Computing statistics on the counted data

$ java [-mx512M] -cp bin:lib/tokenfilter.jar jnsp.Statistic <option file>

<option file>: path to the option file. The parameters that can be set in the option file are listed in Section 2.2.3.

2.2.3. Option Parameters

The following parameters can be set in the option file:

window: the number of grams (n) in each ngram
datadir: the directory containing the text data to be counted and analyzed
stopFile: the file containing the stop words of a specific language
freqCutoff: discard ngrams whose frequency is larger than this threshold
rareCutoff: discard ngrams whose frequency is less than this threshold
cntFile: <path to counted file> + <file path separator> + <prefix of counting file>

For example, to refer to the file D:\count2.cnt (here, 2 is the number of grams and cnt is the suffix of the counting file name), we specify cntFile=D:\count. The counting file name is then constructed from this cntFile value and the number of grams.

agressiveCount: if this is set to true, then for a collection and a specified window we generate counts for 1-grams, 2-grams, ..., window-grams in a single run. This avoids scanning the collection once for each type of ngram. The counts for each type of ngram are written to a separate file named cntFile + <number of grams> + ".cnt".
freqComboFile: the file containing the frequency combos to count. Since we can count several types of ngrams at the same time (with agressiveCount), each frequency combo must state which type of ngram it applies to. For example, for trigrams with the frequency combo 3=0:1:2|0|1|2|0:1|0:2, we count the frequency of the trigram w0 w1 w2, of w0 in the first position, of w1 in the second position, and so on.
statFile: path to the file containing the statistics of the collection
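
The cntFile naming rule and the frequency combo notation described above can be sketched in Java as follows. This is only an illustration, not code from JNSP: the class OptionSketch and its methods are hypothetical names, and the sketch assumes that in a combo entry the part before "=" is the number of grams, "|" separates combos, and ":" separates token positions, as the example suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JNSP: illustrates two option conventions.
class OptionSketch {

    // Builds the counting file name: cntFile + <number of grams> + ".cnt".
    static String countFileName(String cntFile, int grams) {
        return cntFile + grams + ".cnt";
    }

    // Parses the right-hand side of a frequency combo entry, such as
    // "0:1:2|0|1|2|0:1|0:2", into lists of token positions.
    static List<List<Integer>> parseCombos(String spec) {
        List<List<Integer>> combos = new ArrayList<>();
        for (String combo : spec.split("\\|")) {
            List<Integer> positions = new ArrayList<>();
            for (String index : combo.split(":")) {
                positions.add(Integer.parseInt(index.trim()));
            }
            combos.add(positions);
        }
        return combos;
    }

    public static void main(String[] args) {
        // cntFile=D:\count with bigrams refers to D:\count2.cnt
        System.out.println(countFileName("D:\\count", 2));

        // Split "3=..." into the number of grams and its combos.
        String entry = "3=0:1:2|0|1|2|0:1|0:2";
        int eq = entry.indexOf('=');
        int grams = Integer.parseInt(entry.substring(0, eq));
        System.out.println(grams + " -> " + parseCombos(entry.substring(eq + 1)));
    }
}
```

Each inner list names the positions whose joint frequency is counted, so 0:1:2 stands for the whole trigram and 0:2 for the first and third words together.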

2.3. Output Data Format

Format of the output of Counter:

[Number of ngrams]
[Ngram 1]<>[Counting Parameters]
[Ngram 2]<>[Counting Parameters]
.....

Format of the output of Statistic:

[Ngram 1]<>[statistic value]
[Ngram 2]<>[statistic value]
...
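
A line of the Counter output in the format above can be read back with a few lines of Java. The sketch below is an illustration only: CounterLineSketch is a hypothetical class, not part of JNSP, and it assumes that the tokens of an ngram are joined by "<>", that the counting parameters are whitespace-separated integers after the last "<>", and that no token contains "<>" itself.

```java
import java.util.Arrays;

// Hypothetical reader for one line of Counter output, not part of JNSP.
class CounterLineSketch {
    final String[] tokens; // the words of the ngram, in order
    final long[] counts;   // the counting parameters that follow the ngram

    private CounterLineSketch(String[] tokens, long[] counts) {
        this.tokens = tokens;
        this.counts = counts;
    }

    // Splits "w1<>w2<>c1 c2 c3" into its tokens and its counts.
    static CounterLineSketch parse(String line) {
        String[] fields = line.trim().split("<>");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Not a Counter line: " + line);
        }
        // Every field except the last one is a token of the ngram.
        String[] tokens = Arrays.copyOf(fields, fields.length - 1);
        // The last field holds the whitespace-separated counts.
        String[] raw = fields[fields.length - 1].trim().split("\\s+");
        long[] counts = new long[raw.length];
        for (int i = 0; i < raw.length; i++) {
            counts[i] = Long.parseLong(raw[i]);
        }
        return new CounterLineSketch(tokens, counts);
    }

    public static void main(String[] args) {
        CounterLineSketch r = parse("months<>ago<>21 35 123");
        System.out.println(String.join(" ", r.tokens) + " " + Arrays.toString(r.counts));
    }
}
```

The first line of the file, which holds only the number of ngrams, is not an ngram line and should be read separately before calling parse.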

2.5. Case Study

2.5.1. Find collocation in English

The data and option files used for counting and statistical analysis are included as samples in the JNSP package, which you can download.
Sample output of the counting process is shown below. Here, the bigram "months ago" occurs 21 times in the collection, the word "months" occurs 35 times, and the word "ago" occurs 123 times.

29843
run<>begins<>1 33 11
months<>ago<>21 35 123
test<>match<>4 74 108
team<>effort<>3 207 10
aussie<>bowlers<>1 5 15
glenn<>mcgrath<>4 6 5
colin<>miller<>1 34 19
....

Sample output of the statistical analysis process is shown below. Computing the t-score for the counted ngrams gives the following results. The larger the t-score, the more likely the bigram is a collocation. For example, "world champion" is a collocation with a confidence of more than 99.5% because its t-score (9.51860...) is larger than the critical value 2.576.

bbc<>sport 10.392742222776956
world<>champion 9.518601451101189
grand<>prix 9.486832980505138
world<>number 9.221981556055333
ve<>got 8.888194417315589
west<>ham 8.831760866327846
world<>record 8.728715609439696
years<>ago 8.485281374238571
world<>cup 8.410956309868196
sri<>lanka 8.246211251235321
.....
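
For readers who want to see how a t-score can be derived from the counts, here is a minimal Java sketch of the t-test approximation for bigrams described in [Manning], t = (c12 - c1*c2/N) / sqrt(c12). It is not taken from the JNSP source, so its values need not match the sample output above. TScoreSketch is a hypothetical class, and using the first line of the Counter output (29843) as the total N is an assumption made only for this example.

```java
// Hypothetical t-score sketch following [Manning], not JNSP's own code.
class TScoreSketch {

    // pairCount   : frequency of the bigram w1 w2 (c12)
    // firstCount  : frequency of w1 (c1)
    // secondCount : frequency of w2 (c2)
    // total       : total number of bigrams in the collection (N)
    static double tScore(long pairCount, long firstCount, long secondCount, long total) {
        if (pairCount <= 0 || total <= 0) {
            throw new IllegalArgumentException("pairCount and total must be positive");
        }
        // Expected bigram frequency if w1 and w2 were independent.
        double expected = (double) firstCount * secondCount / total;
        // The sample variance is approximated by the observed mean,
        // which reduces the standard error term to sqrt(c12).
        return (pairCount - expected) / Math.sqrt(pairCount);
    }

    public static void main(String[] args) {
        // Counts of "months ago" from the sample Counter output;
        // N = 29843 is assumed to be the total number of bigrams.
        System.out.println(tScore(21, 35, 123, 29843));
    }
}
```

Under the independence hypothesis the expected count of "months ago" is well below one, so the observed 21 occurrences give a large positive score.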

2.5.2. Find collocation in Vietnamese

As in the case study above, we found collocations in a collection of Vietnamese text. The results are shown below. The output of the counting process:

54166
clb<>benfica<>4 808 11
benfica<>chuẩn_bị<>1 4 85
chuẩn_bị<>xây_dựng<>1 67 28
xây_dựng<>svđ<>1 31 11
ban<>điều_hành<>1 163 48
điều_hành<>clb<>15 38 447
phá_bỏ<>svđ<>1 2 11
svđ<>hiện_thời<>1 24 4
vòng<>chung_kết<>69 543 318
chung_kết<>euro<>5 217 191
....

The output of the statistical analysis with the t-score:

world<>cup 22.116209726652244
champions<>league 19.774293161202728
mùa<>giải 16.743157806499145
tp<>hcm 16.401331959231246
hlv<>trưởng 14.565737684227129
real<>madrid 14.526000648343672
trận<>chung_kết 14.39746489892214
mùa<>bóng 13.089776391875638
đội<>chủ_nhà 13.005370613883684
ghi<>bàn 12.96170831152056
vòng<>loại 12.728868621583858
châu<>âu 12.409673645990857
chức<>vô_địch 12.125448035690276
.....

3. Acknowledgements and References

3.1. Acknowledgements

Our code is based on the NSP of Ted Pedersen et al. and the design descriptions in [Banerjee and Pedersen]. We would like to thank Ted Pedersen, Satanjeev Banerjee, and others for sharing the code and a comprehensive technical report.

We would like to thank Sourceforge.net for hosting this project.

3.2. References

· [Banerjee and Pedersen] Banerjee, S. and Pedersen, T. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, February 17-21, 2003, Mexico City.

· [Manning] Manning, C. D. and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Last updated September , 2008