A Java Implementation of the Ngram Statistics Package

Copyright © 2008 by

Cam-Tu Nguyen (ncamtu at gmail dot com), College of Technology, Vietnam National University, Hanoi

Thu-Trang Nguyen (trangnt84 at gmail dot com), College of Technology, Vietnam National University, Hanoi

1. Introduction

        1.1. Description

        1.2. News, Comments, and Bug Reports

        1.3. License

2. How to Use JNSP from Command Line

        2.1. Download

        2.2. Command Line & Input Parameters

                2.2.1. Counting a collection of text

                2.2.2. Doing statistics on the counted data

                2.2.3. Option Parameters

        2.3. Output Data Format

        2.4. Case Study

3. Acknowledgements, and References

1. Introduction

1.1. Description

JNSP is a Java implementation of the Ngram Statistics Package (NSP). It allows users to count n-grams and to find collocations through hypothesis testing with measures such as the t-score and mutual information.

1.2. News, Comments, and Bug Reports

- September 07, 2008:

·         Released version 2.0

We highly appreciate any suggestions, comments, and bug reports.

1.3. License

JNSP is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

JNSP is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

2. How to Use JNSP from Command Line

2.1. Download

The documentation and source code of JNSP are available for download.

Here are some other tools developed by the same author:

·         JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool

·         JGibbLDA: A Java-based Latent Dirichlet Allocation using Gibbs Sampling for estimation/inference

2.2. Command Line & Input Parameters

In this section, we describe how to use this tool to count n-grams in a collection of text and to compute statistics on the counted data. We assume that the current working directory is the home directory of JNSP and that we are on a Linux platform. The command lines for other cases are similar.

2.2.1. Counting a collection of text

$ java [-mx512M] -cp bin:lib/tokenfilter.jar jnsp.Counter <option file>

2.2.2. Doing statistics on the counted data

$ java [-mx512M] -cp bin:lib/tokenfilter.jar jnsp.Statistic <option file>

2.2.3. Option Parameters

The parameters that can be set in the option file are as follows:

For example, to refer to the file D:\count2.cnt (here, 2 is the number of grams and cnt is the suffix of the counting file name), we specify cntFile=D:\count. The counting file name is then constructed from this cntFile value together with the number of grams.
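The naming rule in this example can be sketched in a few lines of Java. This is only an illustration of the rule as stated here (prefix, then the number of grams, then the cnt suffix); the class and method names are hypothetical and are not taken from the JNSP source.

```java
// Hypothetical helper; CountFileNames and countFileName are illustrative
// names, not part of JNSP.
final class CountFileNames {

    /**
     * Builds "<cntFile><n>.cnt", following the example in the text:
     * cntFile = "D:\count" and n = 2 give "D:\count2.cnt".
     */
    static String countFileName(String cntFile, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        return cntFile + n + ".cnt";
    }

    public static void main(String[] args) {
        // Prints D:\count2.cnt
        System.out.println(countFileName("D:\\count", 2));
    }
}
```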

2.3. Output Data Format

[Number of ngrams]
[Ngram 1]<>[Counting Parameters]
[Ngram 2]<>[Counting Parameters]

[Ngram 1]<>[statistic value]
[Ngram 2]<>[statistic value]
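A line of the counting file can be read back with a small parser. The sketch below assumes that the words of an n-gram are themselves separated by <> (as in the case study lines such as months<>ago<>21 35 123) and that the counting parameters are whitespace-separated integers in the last field. The class name CountLine is hypothetical and is not part of JNSP.

```java
import java.util.Arrays;

// Hypothetical value class for one line of a counting file; not part of JNSP.
final class CountLine {
    final String[] tokens; // the words of the n-gram, in order
    final long[] counts;   // the counting parameters that follow the n-gram

    private CountLine(String[] tokens, long[] counts) {
        this.tokens = tokens;
        this.counts = counts;
    }

    /** Parses a line such as "months<>ago<>21 35 123". */
    static CountLine parse(String line) {
        String[] fields = line.trim().split("<>");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Not a count line: " + line);
        }
        // Every field except the last one is a word of the n-gram.
        String[] tokens = Arrays.copyOf(fields, fields.length - 1);
        // The last field holds the whitespace-separated counts.
        String[] raw = fields[fields.length - 1].trim().split("\\s+");
        long[] counts = new long[raw.length];
        for (int i = 0; i < raw.length; i++) {
            counts[i] = Long.parseLong(raw[i]);
        }
        return new CountLine(tokens, counts);
    }

    public static void main(String[] args) {
        CountLine c = parse("months<>ago<>21 35 123");
        System.out.println(Arrays.toString(c.tokens)); // prints [months, ago]
        System.out.println(Arrays.toString(c.counts)); // prints [21, 35, 123]
    }
}
```

The first line of the file ([Number of ngrams]) is a single number and should be read separately before the n-gram lines are passed to this parser.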

2.4. Case Study

2.4.1. Finding collocations in English

run<>begins<>1 33 11
months<>ago<>21 35 123
test<>match<>4 74 108
team<>effort<>3 207 10
aussie<>bowlers<>1 5 15
glenn<>mcgrath<>4 6 5
colin<>miller<>1 34 19

bbc<>sport 10.392742222776956
world<>champion 9.518601451101189
grand<>prix 9.486832980505138
world<>number 9.221981556055333
ve<>got 8.888194417315589
west<>ham 8.831760866327846
world<>record 8.728715609439696
years<>ago 8.485281374238571
world<>cup 8.410956309868196
sri<>lanka 8.246211251235321
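The listing above does not say which measure produced the scores. As an illustration of one measure named in section 1.1, the following sketch computes a t-score for a bigram using the formula t = (n11 - m11) / sqrt(n11), where m11 = n1p * np1 / N is the count expected under independence. It assumes, as in NSP, that the three counts of a bigram are its joint frequency (n11), the frequency of the first word in first position (n1p), and the frequency of the second word in second position (np1), and that N is the total number of bigrams. Whether JNSP computes exactly this value is an assumption, and the class and method names are hypothetical.

```java
// Hypothetical helper illustrating the t-score for bigrams; not JNSP code.
final class TScore {

    /**
     * t = (n11 - m11) / sqrt(n11), with m11 = n1p * np1 / total.
     *
     * @param n11   joint frequency of the bigram (assumed meaning)
     * @param n1p   frequency of the first word in first position (assumed)
     * @param np1   frequency of the second word in second position (assumed)
     * @param total total number of bigrams in the collection (assumed)
     */
    static double tScore(long n11, long n1p, long np1, long total) {
        if (n11 <= 0 || total <= 0) {
            throw new IllegalArgumentException("n11 and total must be positive");
        }
        // Expected joint count if the two words were independent.
        double m11 = (double) n1p * np1 / total;
        return (n11 - m11) / Math.sqrt(n11);
    }

    public static void main(String[] args) {
        // Made-up counts: m11 = 40 * 20 / 400 = 2, so t = (16 - 2) / 4.
        System.out.println(tScore(16, 40, 20, 400)); // prints 3.5
    }
}
```

A higher score means that the observed joint frequency exceeds the frequency expected by chance by a wider margin, which is why candidate collocations are ranked in descending order of the score.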

2.4.2. Finding collocations in Vietnamese

3. Acknowledgements, and References

3.1. Acknowledgements

Our code is based on the NSP of Ted Pedersen et al. and the design descriptions in [Banerjee and Pedersen]. We would like to thank Ted Pedersen, Satanjeev Banerjee, and their colleagues for sharing the code and a comprehensive technical report.

We would like to thank for hosting this project.



3.2. References

·         [Banerjee and Pedersen] Banerjee, S. and Pedersen, T. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, February 17-21, 2003, Mexico City.

·         [Manning] Manning, C. D. and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Last updated September 2008