This is a library for indexing a document: it extracts the unique non-stopword tokens and counts how often each one occurs.
To index a document, call IndexDocument and listen for the indexFinished event, which fires when indexing has completed. The extracted tokens are then available through the tokens property, a Map from each term to its frequency.
const HtmlIndexer = require('./htmlIndexer');
const indexer = new HtmlIndexer();
indexer.IndexDocument("tests/test.html");
indexer.on("indexFinished", () => {
  // tokens is a Map of term -> frequency
  for (const key of indexer.tokens.keys()) {
    console.log(`Term : ${key} Frequency : ${indexer.tokens.get(key)}`);
  }
});
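Because tokens is a Map, its entries can be sorted to find the most frequent terms. The following is a minimal sketch rather than part of the library: topTerms is a hypothetical helper, and sampleTokens is hand-built stand-in data with the same shape as indexer.tokens.

```javascript
// Hypothetical helper, not part of the library: returns the n most frequent
// terms from a Map of term -> frequency, such as indexer.tokens.
function topTerms(tokens, n) {
  return [...tokens.entries()]
    .sort((a, b) => b[1] - a[1])            // highest frequency first
    .slice(0, n)                            // keep only the first n entries
    .map(([term, freq]) => ({ term, freq }));
}

// Stand-in data (an assumption for illustration); in real use, pass
// indexer.tokens from inside the indexFinished handler.
const sampleTokens = new Map([['html', 3], ['index', 5], ['token', 1]]);
console.log(topTerms(sampleTokens, 2));
```

The helper copies the entries before sorting, so the original Map is left unchanged.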
You can also read the generated tokens as a stream by calling getOutPutStream and passing the chunk size, that is, the number of tokens per chunk.
The output is JSON-based, with the format { term: 'test', freq: 1, isFirstChunk: true, isLastChunk: true }
const stream = indexer.getOutPutStream(2);
stream.on('data', (data) => console.log(data));