naivebayes
A Naive-Bayes classifier for Node.js, built for learning from text.
naivebayes takes a document (a piece of text) and tells you what category that document belongs to.
In short: it learns from text and labels, then tells you which label/category a new, unseen text should belong to.
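Core formula:
Document: D = [W1, W2, W3, ..., Wn]
Categories: [C1, C2, C3, ..., Cm]
P(C|D) = P(D|C) * P(C) / P(D)
=> P(C|W1W2...Wn) = P(W1W2...Wn|C) * P(C) / P(W1W2...Wn)
=> for each category C: compute P(W1W2...Wn|C) * P(C) (the denominator is the same for every C, so it can be ignored when comparing)
=> for each word W in D: P(W1W2...Wn|C) = P(W1|C) * P(W2|C) * ... * P(Wn|C) (the "naive" assumption that words are independent given the category)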
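The formula can be made concrete with a small, self-contained sketch. This is not the library's source code: the whitespace tokenizer, the add-one (Laplace) smoothing, the log-space scoring, and the names `train`, `logScore`, and `classify` are all assumptions chosen for illustration.

```javascript
// Minimal multinomial naive Bayes sketch (illustrative only, not the library's code).
// Assumptions: whitespace tokenizer, add-one smoothing, scores kept in log space.
const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

function train(samples) {
  const model = { docCount: {}, wordCount: {}, totalWords: {}, vocab: new Set(), totalDocs: 0 };
  for (const [text, category] of samples) {
    model.docCount[category] = (model.docCount[category] || 0) + 1;
    model.wordCount[category] = model.wordCount[category] || Object.create(null);
    model.totalWords[category] = model.totalWords[category] || 0;
    model.totalDocs += 1;
    for (const word of tokenize(text)) {
      model.wordCount[category][word] = (model.wordCount[category][word] || 0) + 1;
      model.totalWords[category] += 1;
      model.vocab.add(word);
    }
  }
  return model;
}

// log P(C) + sum of log P(W|C); P(D) is dropped because it is identical for every C.
function logScore(model, text, category) {
  let score = Math.log(model.docCount[category] / model.totalDocs);
  for (const word of tokenize(text)) {
    const count = model.wordCount[category][word] || 0;
    score += Math.log((count + 1) / (model.totalWords[category] + model.vocab.size));
  }
  return score;
}

// Pick the category with the highest score.
function classify(model, text) {
  let best = null;
  let bestScore = -Infinity;
  for (const category of Object.keys(model.docCount)) {
    const score = logScore(model, text, category);
    if (score > bestScore) {
      best = category;
      bestScore = score;
    }
  }
  return best;
}

const model = train([
  ['great amazing movie', 'positive'],
  ['awesome great fun', 'positive'],
  ['terrible boring movie', 'negative'],
]);
console.log(classify(model, 'amazing fun')); // prints "positive"
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on long documents.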
Example
A web version is available for trying it out: Web experience Page
What can I use this for?
You can use this for categorizing any text content into any arbitrary set of categories, which makes it suitable for any text-learning project. For example:
- Is an email spam, or not spam?
- Is a news article about technology, politics, or sports?
- Is a piece of text expressing positive emotions, or negative emotions?
- Which author does the style of an unknown text correspond to?
- Which category does an unknown text belong to, along any dimension you choose?
Installing
npm install naivebayes --save
Usage
Basic usage
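    // Import
    const NaiveBayes = require('naivebayes')

    // Instantiate (create a classifier)
    const classifier = new NaiveBayes()

    // Teach it positive phrases (the sample strings in this snippet are illustrative)
    classifier.learn('amazing, awesome movie!! Yeah!!', 'positive')
    classifier.learn('Sweet, this is incredibly amazing, perfect, great!!', 'positive')

    // Teach it a negative phrase
    classifier.learn('terrible, awful thing. It sucks!!', 'negative')

    // Now ask it to categorize a document it has never seen before
    classifier.categorize('awesome, cool, amazing!! Yay.')
    // => 'positive'

    // Serialize the classifier's state as a JSON string
    const stateJson = classifier.toJson()

    // Load the classifier back from its JSON representation
    const revivedClassifier = NaiveBayes.fromJson(stateJson)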
Practical scenario
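    const NaiveBayes = require('naivebayes')

    // Use a third-party Chinese word-segmentation library
    const Segment = require('segment')
    const segment = new Segment()

    // Use the default recognition modules and dictionaries. Loading the dictionary
    // files takes about 1 second and only needs to run once, at initialization
    segment.useDefault()

    // Segmentation test
    console.log('Chinese segmentation test', segment.doSegment('这是一个基于Node.js的中文分词模块。', { simple: true }))
    // Chinese segmentation test [ '这是', '一个', '基于', 'Node.js', '的', '中文', '分词', '模块', '。' ]

    const classifier = new NaiveBayes({
      // Custom tokenizer
      tokenizer(sentence) {
        // Keep only English, Chinese, and digits
        const sanitized = sentence.replace(/[^a-zA-Z0-9\u4e00-\u9fa5\s]/g, ' ')
        // Segment Chinese and English words
        return segment.doSegment(sanitized, { simple: true })
      }
    })

    // Train on labelled phrases ('脏话' = profanity, '正常' = normal).
    // The training phrases and test inputs are elided here; supply your own.
    classifier.learn('...', '脏话')
    classifier.learn('...', '正常')
    // ...repeat for each training phrase

    // Test
    console.log(classifier.categorize('...')) // 脏话 (profanity)
    console.log(classifier.categorize('...')) // 脏话 (profanity)
    console.log(classifier.categorize('...')) // 正常 (normal)
    console.log(classifier.categorize('...', true)) // { category: '正常', probability: xxx }

    // Get the array of probabilities for every category
    console.log(classifier.probabilities('...'))
    // [{ category: 'xx', probability: xxx }, { ... }, ...]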
API
Class
const classifier = new NaiveBayes([options])
Returns an instance of a Naive-Bayes classifier.
Pass in an optional options object to configure the instance. If you specify a tokenizer function in options, it will be used as the instance's tokenizer. It receives a (string) text argument, which is the string you pass in when you call .learn() or .categorize(), and it must return an array of tokens.
A custom tokenizer lets you preprocess the text to be learned before returning the token array. The default tokenizer keeps only Chinese, English, and digit characters; it splits English into words on spaces and splits Chinese into individual characters (see the library source for its code).
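Eg.
    const classifier = new NaiveBayes({
      tokenizer(text) {
        // For example, split on spaces; any function that returns an array of tokens works
        return text.split(' ')
      }
    })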
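To make the described default behaviour easier to picture, here is a rough approximation of such a tokenizer. It is a sketch written from the description above, not the library's actual implementation; the name `sketchTokenizer` and the exact character ranges are assumptions.

```javascript
// Hypothetical tokenizer approximating the behaviour described above
// (not the library's actual default tokenizer).
function sketchTokenizer(text) {
  // Match either a run of ASCII letters/digits (one English word or number)
  // or a single CJK ideograph (one Chinese character). Everything else,
  // such as punctuation and whitespace, is skipped.
  return text.match(/[a-zA-Z0-9]+|[\u4e00-\u9fa5]/g) || [];
}

console.log(sketchTokenizer('Hello, 世界! Node.js 2024'));
// [ 'Hello', '世', '界', 'Node', 'js', '2024' ]
```

A function of this shape could be supplied as the tokenizer option. Note that splitting Chinese into single characters loses word boundaries, which is why a dedicated segmentation library can give better results for Chinese text.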
Learn
classifier.learn(text, category)
Teach your classifier what category the text belongs to. Each call supplies a text and its label/category, and the category may be one the classifier has already seen. The more you teach your classifier, the more reliable it becomes. It will use what it has learned to identify new documents that it hasn't seen before.
Probabilities
classifier.probabilities(text)
Returns an array of { category, probability } objects with the probability calculated for each category, sorted from highest to lowest. classifier.categorize(text) uses the top entry of this array. Its judgement is based on what you have taught it with .learn().
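As an illustration of how a sorted { category, probability } array might be produced, the sketch below converts per-category log scores into normalised probabilities. The normalisation used here (a softmax over log scores) and the helper name `toProbabilities` are assumptions; the library may compute its probabilities differently.

```javascript
// Illustrative helper: turn per-category log scores into a sorted
// [{ category, probability }] array. The softmax normalisation is an
// assumption, not necessarily what the library does.
function toProbabilities(logScores) {
  const entries = Object.entries(logScores);
  // Subtract the maximum before exponentiating, for numerical stability.
  const max = Math.max(...entries.map(([, score]) => score));
  const weights = entries.map(([category, score]) => ({ category, weight: Math.exp(score - max) }));
  const total = weights.reduce((sum, entry) => sum + entry.weight, 0);
  return weights
    .map(({ category, weight }) => ({ category, probability: weight / total }))
    .sort((a, b) => b.probability - a.probability); // highest probability first
}

const ranked = toProbabilities({ negative: Math.log(0.01), positive: Math.log(0.03) });
console.log(ranked); // 'positive' first (about 0.75), then 'negative' (about 0.25)
```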
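Categorize
classifier.categorize(text, [probability])
Returns the category it thinks text belongs to. The optional probability argument controls whether the probability is returned as well: if it is true, the result is an object { category: xxx, probability: xxx }; otherwise the category is returned directly. Its judgement is based on what you have taught it with .learn().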
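The two return shapes can be pictured with a small stand-in function that picks the top entry from an already sorted probability array. The helper `pickCategory` and its sample data are hypothetical and only mirror the behaviour described above.

```javascript
// Hypothetical helper mirroring the two return shapes described above.
// `ranked` is assumed to be sorted by probability, highest first.
function pickCategory(ranked, probability = false) {
  if (ranked.length === 0) return undefined; // nothing has been learned yet
  const top = ranked[0];
  return probability
    ? { category: top.category, probability: top.probability }
    : top.category;
}

const sampleRanking = [
  { category: 'positive', probability: 0.75 },
  { category: 'negative', probability: 0.25 },
];
console.log(pickCategory(sampleRanking));       // positive
console.log(pickCategory(sampleRanking, true)); // { category: 'positive', probability: 0.75 }
```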
ToJson
classifier.toJson()
Returns the JSON representation of a classifier as a string, capturing everything the instance has learned so that it can be imported later for incremental learning. This is the same as JSON.stringify(classifier.toJsonObject()).
ToJsonObject
classifier.toJsonObject()
Returns a JSON-friendly representation of the classifier as an object. This is the same data as toJson(), but as an object that can be used directly in code rather than as a string.
FromJson
const classifier = NaiveBayes.fromJson(json)
Returns a classifier instance from the JSON representation, which may be a JSON string or an object. Use this with the JSON representation obtained from classifier.toJson(). You can also convert results learned elsewhere into the JSON format NaiveBayes expects and initialize a classifier from them; the exact format of the JSON object can be found in the library source.
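The export/import round trip can be sketched with a tiny stand-in class. `TinyModel` and its fields (`docCount`, `totalDocs`) are hypothetical and do not reflect the library's real JSON schema; the sketch only shows the pattern of serializing state, reviving it from a string or an object, and continuing to learn.

```javascript
// Hypothetical stand-in class; field names are illustrative and are not
// the library's JSON schema.
class TinyModel {
  constructor() {
    this.docCount = {}; // category -> number of documents learned
    this.totalDocs = 0;
  }

  learn(category) {
    this.docCount[category] = (this.docCount[category] || 0) + 1;
    this.totalDocs += 1;
    return this;
  }

  // Plain-object snapshot of the learned state.
  toJsonObject() {
    return { docCount: { ...this.docCount }, totalDocs: this.totalDocs };
  }

  // String form of the same snapshot.
  toJson() {
    return JSON.stringify(this.toJsonObject());
  }

  // Accepts either a JSON string or an already parsed object.
  static fromJson(json) {
    const state = typeof json === 'string' ? JSON.parse(json) : json;
    const model = new TinyModel();
    model.docCount = { ...state.docCount };
    model.totalDocs = state.totalDocs;
    return model;
  }
}

const original = new TinyModel().learn('positive').learn('negative').learn('positive');
const revived = TinyModel.fromJson(original.toJson());
revived.learn('negative'); // incremental learning continues from the saved state
console.log(revived.toJsonObject());
// { docCount: { positive: 2, negative: 2 }, totalDocs: 4 }
```

Copying the state on export and import keeps the revived instance independent of the original, so further learning on one does not change the other.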
Related libraries
Chinese word-segmentation library:
English tokenization library:
Credits
This project was forked from bayes by @Tolga Tezel 👍