stopwords-json
Stopwords for various languages in JSON format. Per Wikipedia:
Stop words are words which are filtered out prior to, or after, processing of natural language data [...] these are some of the most common, short function words, such as the, is, at, which, and on.
You can load the stopwords for every language at once from stopwords-all.json (keyed by ISO 639-1 language code), or use the individual per-language files listed in the table below.
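As a rough illustration of using the combined file, here is a minimal Python sketch. It assumes that stopwords-all.json is a JSON object whose keys are ISO 639-1 codes and whose values are arrays of words; the array shape is an assumption, since only the keying is stated here. The function names `load_stopwords` and `remove_stopwords` are hypothetical helpers, not part of this project.

```python
import json
from pathlib import Path


def load_stopwords(path, lang):
    """Return the stopword set for one language from a combined JSON file.

    Assumption: the file is a JSON object mapping ISO 639-1 codes
    (for example "en") to arrays of stopword strings.
    """
    with Path(path).open(encoding="utf-8") as f:
        by_lang = json.load(f)
    return set(by_lang[lang])


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]
```

With a local copy of the file, a call such as `remove_stopwords(text.split(), load_stopwords("stopwords-all.json", "en"))` would filter a whitespace-tokenized English string. Converting the list to a set keeps each membership check cheap on the longer lists.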
Languages
A total of 50 languages are supported:
Language | Stopword count | Filename |
---|---|---|
Afrikaans | 51 | af.json |
Arabic | 162 | ar.json |
Armenian | 45 | hy.json |
Basque | 98 | eu.json |
Bengali | 116 | bn.json |
Breton | 126 | br.json |
Bulgarian | 259 | bg.json |
Catalan | 218 | ca.json |
Chinese | 542 | zh.json |
Croatian | 179 | hr.json |
Czech | 346 | cs.json |
Danish | 101 | da.json |
Dutch | 275 | nl.json |
English | 570 | en.json |
Esperanto | 173 | eo.json |
Estonian | 35 | et.json |
Finnish | 772 | fi.json |
French | 606 | fr.json |
Galician | 160 | gl.json |
German | 596 | de.json |
Greek | 75 | el.json |
Hausa | 39 | ha.json |
Hebrew | 194 | he.json |
Hindi | 225 | hi.json |
Hungarian | 781 | hu.json |
Indonesian | 355 | id.json |
Irish | 109 | ga.json |
Italian | 619 | it.json |
Japanese | 109 | ja.json |
Korean | 679 | ko.json |
Latin | 49 | la.json |
Latvian | 161 | lv.json |
Marathi | 99 | mr.json |
Norwegian | 172 | no.json |
Persian | 332 | fa.json |
Polish | 260 | pl.json |
Portuguese | 408 | pt.json |
Romanian | 282 | ro.json |
Russian | 539 | ru.json |
Slovak | 110 | sk.json |
Slovenian | 446 | sl.json |
Somali | 30 | so.json |
Southern Sotho | 31 | st.json |
Spanish | 577 | es.json |
Swahili | 74 | sw.json |
Swedish | 401 | sv.json |
Thai | 115 | th.json |
Turkish | 279 | tr.json |
Yoruba | 60 | yo.json |
Zulu | 29 | zu.json |
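The per-language files in the table can be combined when a text mixes languages. The sketch below assumes each file (for example en.json) is a plain JSON array of words and that the files sit together in one local directory; both points are assumptions, and `load_language_file` and `load_languages` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_language_file(directory, code):
    """Load one per-language file such as en.json as a set of words.

    Assumption: the file holds a single JSON array of strings.
    """
    path = Path(directory) / f"{code}.json"
    with path.open(encoding="utf-8") as f:
        return set(json.load(f))


def load_languages(directory, codes):
    """Union the stopword sets of several languages."""
    combined = set()
    for code in codes:
        combined |= load_language_file(directory, code)
    return combined
```

For instance, `load_languages("stopwords", ["en", "fr"])` would return one set covering English and French, given a hypothetical `stopwords` directory holding en.json and fr.json.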
Sources
- Apache Lucene - Apache 2.0 License
- Carrot2 - License
- cue.language - Apache 2.0 License
- Jacques Savoy - BSD License
- SMART Information Retrieval System: ftp://ftp.cs.cornell.edu/pub/smart/
- ASP Stoplist Project - CC-BY and Apache 2.0
License and Copyright
Copyright (c) 2017 Peter Graham, contributors. Released under the Apache-2.0 license.