TF-IDF Document Search
TF-IDF Document Search is a Node.js library for information retrieval and keyword search. It lets you build a corpus of documents and compute each document's similarity to an input query or document. Using this similarity measure, you can determine which document in the corpus is most relevant to a particular query or input document.
Background
The library is based on the vector space model, in which each document is tokenized into a vector whose indices each correspond to a unique term in the input query or document. We then calculate the TF-IDF weight for each term in the document vector, where TF-IDF, in its standard form, is defined as:
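$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)$$
$$\text{tf}(t, d) = \text{frequency of term } t \text{ in document } d$$
$$\text{idf}(t) = \log\frac{N}{n_t}$$
where $N$ is the number of documents in the corpus and $n_t$ is the number of documents in which term $t$ appears.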
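Once we have TF-IDF vectors for each document in the corpus, we calculate the similarity of each document vector $d$ to a query vector $q$ as their cosine similarity:
$$\text{similarity}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$$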
Ranking the documents by this similarity measure shows which one is most relevant to a given query.
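The calculation described in this section can be sketched in plain JavaScript. This is a minimal, from-scratch illustration and not the library's own code: the function names (`tokenize`, `termFrequency`, `inverseDocumentFrequency`, `tfIdfVector`, `cosineSimilarity`, `rank`), the simple tokenizer, and the unsmoothed natural-log IDF are all assumptions made for the example.

```javascript
// Illustrative sketch only; not the library's internals.

// Assumed tokenizer: lowercase alphanumeric runs.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// tf: raw count of `term` in a tokenized document.
function termFrequency(term, tokens) {
  return tokens.filter((token) => token === term).length;
}

// idf: log(N / n_t); defined here as 0 when no document contains the term.
function inverseDocumentFrequency(term, corpus) {
  const containing = corpus.filter((tokens) => tokens.includes(term)).length;
  return containing === 0 ? 0 : Math.log(corpus.length / containing);
}

// One tf-idf weight per query term, so all vectors share the same indices.
function tfIdfVector(terms, tokens, corpus) {
  return terms.map(
    (term) => termFrequency(term, tokens) * inverseDocumentFrequency(term, corpus)
  );
}

// Cosine similarity; returns 0 if either vector has zero length.
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, value) => sum + value * value, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// Rank documents (plain strings) against a query, most similar first.
function rank(query, documents) {
  const corpus = documents.map(tokenize);
  const queryTokens = tokenize(query);
  const terms = [...new Set(queryTokens)];
  const queryVector = tfIdfVector(terms, queryTokens, corpus);
  return corpus
    .map((tokens, index) => ({
      index,
      similarityIndex: cosineSimilarity(queryVector, tfIdfVector(terms, tokens, corpus)),
    }))
    .sort((a, b) => b.similarityIndex - a.similarityIndex);
}

const ranked = rank('cat mat', [
  'the cat sat on the mat',
  'dogs chase cats',
  'the dog sat',
]);
console.log(ranked[0].index); // 0
```

Note that with this unsmoothed IDF, a term that appears in every document gets a weight of zero, so it contributes nothing to the ranking.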
Installation
npm install tf-idf-search
Usage
To begin using the package, instantiate a TF-IDF object and load your documents. Documents can be loaded by passing either an array of strings or an array of paths to the respective files:
TfIdf = require('tf-idf-search');
tf_idf = new TfIdf();

//initialize corpus from an array of file paths, returns the current state of the corpus
var corpus = tf_idf;

//initialize corpus from an array of Strings, returns the current state of the corpus
var corpus = tf_idf;

//add document to the corpus by string, returns the current state of the corpus
var corpus = tf_idf;

//add document to the corpus by file path, returns the current state of the corpus
var corpus = tf_idf;
After instantiating a corpus and loading your documents, you can calculate the tf-idf vectors, determine cosine similarity, and rank the relevance of documents based on queries:
//Rank documents relative to a query containing a String of keywords
var search_result = tf_idf
This outputs an array of objects containing the search results, ordered from most relevant to least relevant. The index property corresponds to the document's position in the corpus:
[
  {
    document: ['this', 'is', 'content', 'of', 'document', '1'],
    similarityIndex: 0.534,
    index: 0,
  },
  {
    document: ['this', 'is', 'content', 'of', 'document', '2'],
    similarityIndex: 0.102,
    index: 1,
  },
  {
    document: ['this', 'is', 'content', 'of', 'document', '3'],
    similarityIndex: 0.003,
    index: 2,
  },
]
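As a rough sketch of how such a result array might be consumed, the snippet below picks the top hit and drops weak matches. The array literal simply repeats the sample output shown here instead of coming from a live call, and `filterBySimilarity` and the `0.1` threshold are hypothetical choices for illustration, not part of the library.

```javascript
// Sample data mirroring the documented result shape (not produced by a live call).
const searchResult = [
  { document: ['this', 'is', 'content', 'of', 'document', '1'], similarityIndex: 0.534, index: 0 },
  { document: ['this', 'is', 'content', 'of', 'document', '2'], similarityIndex: 0.102, index: 1 },
  { document: ['this', 'is', 'content', 'of', 'document', '3'], similarityIndex: 0.003, index: 2 },
];

// Hypothetical helper: keep only results at or above a minimum similarity.
function filterBySimilarity(results, minSimilarity) {
  return results.filter((result) => result.similarityIndex >= minSimilarity);
}

// Results are already sorted, so the first entry is the best match.
const best = searchResult[0];
const relevant = filterBySimilarity(searchResult, 0.1);

console.log(best.index, best.document.join(' '));
console.log(relevant.map((result) => result.index));
```

A suitable threshold depends on the corpus and the queries, so it is best tuned against real data.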
If you wish to see the mapping between the returned indices and the original input format of your documents, you can use the indicesOfInputs() function as follows:
/**
 * Find mapping between indices and original inputs
 * returns an array of objects
 */
var indices = tf_idf.indicesOfInputs();
This function returns an array outlining the mapping of every document in the corpus as follows:
[
  {
    index: 0,
    document: '\test1.txt',
  },
  {
    index: 1,
    document: '\test2.text',
  },
  {
    index: 2,
    document: "document added as a string format",
  },
]
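One possible use of this mapping is to label ranked results with their original input. The sketch below works on hand-written sample data shaped like the two outputs in this README; the similarity values are invented for the example, and `attachSources` is a hypothetical helper, not a library function.

```javascript
// Sample data shaped like the documented outputs (values are illustrative).
const rankedResults = [
  { similarityIndex: 0.61, index: 2 },
  { similarityIndex: 0.2, index: 0 },
  { similarityIndex: 0.05, index: 1 },
];
const inputMapping = [
  { index: 0, document: '\\test1.txt' },
  { index: 1, document: '\\test2.text' },
  { index: 2, document: 'document added as a string format' },
];

// Hypothetical helper: look up each result's original input by corpus index.
function attachSources(results, mapping) {
  const sourceByIndex = new Map(mapping.map((entry) => [entry.index, entry.document]));
  return results.map((result) => ({
    index: result.index,
    similarityIndex: result.similarityIndex,
    source: sourceByIndex.get(result.index),
  }));
}

const labelled = attachSources(rankedResults, inputMapping);
console.log(labelled[0].source); // document added as a string format
```

Building a `Map` first keeps the lookup cheap when the corpus is large, since each result is then resolved without rescanning the mapping array.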
In addition to the operations above, you can compute individual parts of the tf-idf model as follows:
// Create the idf vector for an input query
var vector = tf_idf; // [1.2341, 0.124, 1.531]

// Create tf-idf vector for a document given input query
var vector = tf_idf; // [0.257, 1.245, 0.1]

// Calculate similarity index between a document and a given input query
var similarity = tf_idf; // 0.3421
Contributing
Pull requests are welcome. If you wish to contribute feel free to fork the repo :).