This is just a little library of helper functions for doing data science stuff in JS. It's probably not very well written.
npm install --save https://github.com/jrc03c/js-data-science-helpers
Pull out individual functions:
const {
containsOnlyNumbers,
getCorrelationMatrix,
...,
} = require("@jrc03c/js-data-science-helpers")
Or dump all of the functions into the global scope:
require("@jrc03c/js-data-science-helpers").dump()
<script src="/path/to/dist/js-data-science-helpers.js"></script>
<script>
// pull out individual functions:
const {
containsOnlyNumbers,
getCorrelationMatrix,
...,
} = JSDataScienceHelpers
// or dump everything into the global scope:
JSDataScienceHelpers.dump()
</script>
Makes it easy to remove outliers by clipping all values in a dataset (x) to the range [median(x) - maxScore * MAD(x), median(x) + maxScore * MAD(x)] and then potentially taking the log of those values. (See: MAD)
The constructor accepts an options object with these options:
- isAllowedToClip = a boolean indicating whether or not the model is allowed to clip outliers into the range described above; the default is true
- isAllowedToTakeTheLog = a boolean indicating whether or not the model is allowed to take the log of all values if any of the values fall outside the range described above; the default is false
- maxScore = a non-negative number that helps to define the range described above by determining how far (i.e., how many MADs) values are allowed to fall from the median of the training dataset before they're marked as outliers; the default is 5
A boolean indicating whether or not the model is allowed to clip outliers into the range described above.
A boolean indicating whether or not the model is allowed to take the log of all values if any of the values fall outside the range described above.
A number representing the median absolute deviation (MAD) from the median of the training dataset.
A non-negative number that helps to define the range described above by determining how far (i.e., how many MADs) values are allowed to fall from the median of the training dataset before they're marked as outliers.
A number representing the median of the training dataset.
Computes the median, MAD, and minimum value of x, which can be an array, a DataFrame, or a Series. Returns the OutlierMitigator instance.
Runs the fit method on x, passes all of the arguments into the transform method, and returns the results.
Modifies and returns the given dataset(s) according to whether or not (1) outliers can be clipped into the range described above, and/or (2) whether or not the log of all values can be taken if any outliers exist in the dataset.
Note that each argument is transformed on its own; so outliers in x, for example, won't have any bearing on how y, z, etc., are transformed.
const { OutlierMitigator } = require("@jrc03c/js-data-science-helpers")
const x = [-10000, 1, 2, 3, 4, 10000]
const gator = new OutlierMitigator()
gator.fit(x)
console.log(gator.transform(x))
// [ -5, 1, 2, 3, 4, 10 ]
gator.isAllowedToTakeTheLog = true
console.log(gator.transform(x))
// [
// 0,
// 1.9459101490553132,
// 2.0794415416798357,
// 2.1972245773362196,
// 2.302585092994046,
// 2.772588722239781,
// ]
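The numbers in the example above can be reproduced by hand. What follows is a minimal, self-contained sketch of the clipping step plus one log transform that happens to match the printed output. The helper names (median, clipToMadRange, shiftedLog) and the log(x - min + 1) formula are assumptions made for illustration; they are not taken from the library's source.

```javascript
// Hypothetical helpers for illustration only; not part of the library.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid]
}

// Clip each value into [med - maxScore * mad, med + maxScore * mad].
function clipToMadRange(x, maxScore = 5) {
  const med = median(x)
  const mad = median(x.map(v => Math.abs(v - med)))
  const lo = med - maxScore * mad
  const hi = med + maxScore * mad
  return x.map(v => Math.min(Math.max(v, lo), hi))
}

// Assumed log transform: shift so that the smallest value maps to log(1) = 0.
function shiftedLog(x) {
  const min = Math.min(...x)
  return x.map(v => Math.log(v - min + 1))
}

const clipped = clipToMadRange([-10000, 1, 2, 3, 4, 10000])
console.log(clipped) // [ -5, 1, 2, 3, 4, 10 ]
console.log(shiftedLog(clipped)[0]) // 0
```

Here the median is 2.5 and the MAD is 1.5, so with maxScore = 5 the allowed range is [-5, 10].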
Returns the Cohen's D value for two vectors or Series instances a vs. b.
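For readers unfamiliar with the statistic, here is a rough sketch using the common pooled-standard-deviation definition. The function names are hypothetical, and the library may use a slightly different variant (for example, a different way of pooling the variances).

```javascript
// Hypothetical sketch; the library's exact formula may differ.
function mean(x) {
  return x.reduce((sum, v) => sum + v, 0) / x.length
}

// Sample variance (divides by n - 1).
function sampleVariance(x) {
  const m = mean(x)
  return x.reduce((sum, v) => sum + (v - m) ** 2, 0) / (x.length - 1)
}

// Difference of means divided by the pooled standard deviation.
function cohensDSketch(a, b) {
  const pooledVariance =
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
    (a.length + b.length - 2)
  return (mean(a) - mean(b)) / Math.sqrt(pooledVariance)
}

console.log(cohensDSketch([2, 4, 6], [1, 3, 5])) // 0.5
```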
Returns a boolean indicating whether or not x contains only numbers.
Given a matrix or DataFrame and an optional config object, returns a cleaned-up matrix or DataFrame that contains only numerical values (including NaN values).
If provided, the config object can contain these values:
- correlationThreshold = the coefficient of correlation threshold above which two columns will be considered to be virtually identical (and one will be dropped); the default is 1 - 1e-5 = 0.99999
- maxUniqueValues = the number of unique values above which a column will no longer be eligible for one-hot-encoding; the default is 7
- minNonMissingValues = the number of non-missing values below which the column will be dropped; the default is 15
- progress = a callback function that is passed a single value in the range [0, 1] that represents the fraction of preprocessing completed
The cleaning process involves:
- inferring the types of all columns and casting column values into those inferred types, including:
  - converting booleans to 0s and 1s
  - converting dates to integers (as milliseconds since the Unix epoch (midnight on January 1, 1970); see Date.prototype.getTime())
- dropping all but 1 of any duplicate or highly correlated columns (i.e., r > correlationThreshold)
- dropping any columns with fewer than 15 non-missing values
- dropping any columns with only 1 unique value
- one-hot-encoding any columns where the top maxUniqueValues unique values account for at least 90% of all of the values (and if there are any other values left over, then they're returned in an "other" column)
- dropping all other columns that cannot be converted to numbers (e.g., string columns containing all unique values)
Turns a 1-dimensional array or Series x into a square matrix with the values of x along the main diagonal (top left to bottom right) and zeros everywhere else. For example:
const { diagonalize } = require("@jrc03c/js-data-science-helpers")
diagonalize([1, 2, 3])
// [
// [1, 0, 0],
// [0, 2, 0],
// [0, 0, 3],
// ]
Returns a correlation matrix containing the correlations of every column in a against every column in b. If b === null, then a is just compared against itself.
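To make the shape of the result concrete, here is a small sketch of a column-by-column Pearson correlation matrix over plain arrays of rows. The function names are hypothetical and this is not the library's implementation; in particular, it does no NaN handling and does not support DataFrame inputs.

```javascript
// Hypothetical sketch: a and b are arrays of rows with the same number of rows.
function column(matrix, j) {
  return matrix.map(row => row[j])
}

// Pearson correlation coefficient of two equal-length vectors.
function pearson(x, y) {
  const n = x.length
  const mx = x.reduce((s, v) => s + v, 0) / n
  const my = y.reduce((s, v) => s + v, 0) / n
  let sxy = 0
  let sxx = 0
  let syy = 0
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my)
    sxx += (x[i] - mx) ** 2
    syy += (y[i] - my) ** 2
  }
  return sxy / Math.sqrt(sxx * syy)
}

// Entry [i][j] is the correlation of column i of a with column j of b.
function correlationMatrixSketch(a, b = null) {
  if (b === null) b = a
  return a[0].map((_, i) =>
    b[0].map((_, j) => pearson(column(a, i), column(b, j)))
  )
}

const a = [
  [1, 2],
  [2, 4],
  [3, 5],
]
console.log(correlationMatrixSketch(a))
```

When b is omitted, the diagonal is all 1s because every column is perfectly correlated with itself.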
getHighlyCorrelatedColumns(a, b=null, threshold=(1 - 1e-5))
getHighlyCorrelatedColumns(c, threshold=(1 - 1e-5))
Returns a dictionary of columns and their highly correlated counterparts given (1) a and b, two matrices or DataFrame instances for which a correlation matrix has not yet been computed, or (2) a correlation matrix c. An optional threshold can be specified, which defines the correlation (r) value above which columns are considered to be highly correlated.
The return value might look something like this:
{
col1: ["col1", "col7", ...],
col2: ["col2", "col4", ...]
}
Note that literally identical columns will be included among the results. So, for example, if you only pass an a value into the function, then every column will at least be identical to itself (meaning that there will be at least one column name in every array in the object), though it might also be highly correlated with other columns.
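As a rough illustration of the second calling form, the lookup over an already-computed correlation matrix might be sketched like this. The function name and the explicit names argument are hypothetical; the real function presumably reads column names from the DataFrame.

```javascript
// Hypothetical sketch: c is a square correlation matrix (array of rows) and
// names holds the column names in the same order.
function highlyCorrelatedSketch(c, names, threshold = 1 - 1e-5) {
  const out = {}
  c.forEach((row, i) => {
    // Keep every column whose correlation with column i exceeds the threshold.
    out[names[i]] = names.filter((_, j) => row[j] > threshold)
  })
  return out
}

const c = [
  [1, 0.2, 1],
  [0.2, 1, 0.2],
  [1, 0.2, 1],
]
console.log(highlyCorrelatedSketch(c, ["col1", "col2", "col3"]))
// {
//   col1: ["col1", "col3"],
//   col2: ["col2"],
//   col3: ["col1", "col3"]
// }
```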
Returns the Euclidean length (i.e., the 2-norm) of x.
Given a vector containing n unique values, returns a dictionary with n-1 key-value pairs where each key is name + "_" + a unique value and each value is a vector of binary values indicating whether or not x matches that particular unique value. For example:
const { getOneHotEncodings } = require("@jrc03c/js-data-science-helpers")
const x = [2, 2, 3, 4, 2, 4]
const encodings = getOneHotEncodings("foo", x)
console.log(encodings)
// { foo_3: [ 0, 0, 1, 0, 0, 0 ], foo_4: [ 0, 0, 0, 1, 0, 1 ] }
Returns a matrix containing the p-values of every column in a against every column in b. If b === null, then a is just compared against itself.
Returns an array in which each value is an object representing a unique value in x with the properties value, count, and percentage. For example:
const { getPercentages } = require("@jrc03c/js-data-science-helpers")
const x = ["a", "a", "b", "c"]
const percentages = getPercentages(x)
console.log(percentages)
// [
// { value: "a", count: 2, percentage: 0.5 },
// { value: "b", count: 1, percentage: 0.25 },
// { value: "c", count: 1, percentage: 0.25 }
// ]
Sorts a correlation matrix (array or DataFrame) so that variables near each other in the visualization are also highly correlated with one another. The first variable chosen is the one with the highest sum of squared correlations. The second variable chosen is the remaining variable most highly correlated with the first; the third variable chosen is the remaining variable most highly correlated with the second; and so on. This algorithm is called the Hunter chain method. (See: "Methods of Reordering the Correlation Matrix to Facilitate Visual Inspection and Preliminary Cluster Analysis" by John Edward Hunter)
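The ordering rule described above can be sketched as a short greedy loop. This is an illustration only, with hypothetical function names; it assumes "most highly correlated" means the largest signed r value, which may not be how the library breaks ties or treats negative correlations.

```javascript
// Hypothetical sketch of the ordering described above; c is a square
// correlation matrix given as an array of rows.
function hunterChainOrder(c) {
  const remaining = c.map((_, i) => i)

  // First variable: highest sum of squared correlations.
  const sumSq = row => row.reduce((s, r) => s + r * r, 0)
  let current = remaining.reduce((best, i) =>
    sumSq(c[i]) > sumSq(c[best]) ? i : best
  )
  const order = [current]
  remaining.splice(remaining.indexOf(current), 1)

  // Each next variable: the remaining one most correlated with the last pick.
  while (remaining.length > 0) {
    const next = remaining.reduce((best, i) =>
      c[current][i] > c[current][best] ? i : best
    )
    order.push(next)
    remaining.splice(remaining.indexOf(next), 1)
    current = next
  }
  return order
}

// Reorder both rows and columns using the computed order.
function reorder(c, order) {
  return order.map(i => order.map(j => c[i][j]))
}

const corr = [
  [1, 0.2, 0.9],
  [0.2, 1, 0.1],
  [0.9, 0.1, 1],
]
const order = hunterChainOrder(corr)
console.log(order) // [ 0, 2, 1 ]
console.log(reorder(corr, order))
```

In this example, variable 0 has the largest sum of squared correlations (1.85), variable 2 is the one most correlated with it (0.9), and variable 1 is all that remains.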
The IndexMatcher class makes it relatively easy to make sure that two Series or DataFrame instances have the same index. The constructor takes a single argument, the mode, which is one of:
IndexMatcher.DROP_MISSING_MODE
IndexMatcher.DROP_NAN_MODE
In the first mode, rows are dropped only if they contain null, undefined, or NaN values. In the second mode, rows are dropped if they contain any non-numerical values.
Records the index which is common to all of the given datasets.
Transforms the given datasets to have the index that was recorded by the fit function. Note that a single array containing all of the transformed datasets is returned. So, a common syntax might be something like:
const a = new DataFrame(...)
const b = new DataFrame(...)
const c = new DataFrame(...)
const matcher = new IndexMatcher()
const [d, e, f] = matcher.fit(a, b, c).transform(a, b, c)
Performs the fitting and transforming in a single step. So, similar to the example above:
const a = new DataFrame(...)
const b = new DataFrame(...)
const c = new DataFrame(...)
const [d, e, f] = new IndexMatcher().fitAndTransform(a, b, c)
Returns a boolean indicating whether or not x contains only binary data (0s and 1s).
Returns a boolean indicating whether or not the array x is jagged / ragged (i.e., whether or not it has nested arrays of inconsistent length).
The two primary K-means models from which to choose are KMeansPlusPlus and KMeansMeta. The former should be used if you already know how many clusters there are in your data; otherwise, the latter model can be used to find the optimal K-value.
Although I don't think sklearn has a comparable KMeansMeta model, I've nevertheless tried to mimic their API in both of these classes. Importantly, just like sklearn's model, the score method returns the negative of the K-means objective. Since the K-means objective is the within-cluster sum of squared errors, the score method returns the negative of that value such that higher scores are better than lower scores, which follows the sklearn scoring convention.
To use them, import them from the KMeans namespace, like this:
const { KMeansMeta, KMeansPlusPlus } =
require("@jrc03c/js-data-science-helpers").KMeans
Note that the API is virtually identical across the two classes. The main differences appear in the constructor functions. But the fit, predict, and score methods, as well as the centroids property, should work the same way in both classes.
The constructor for the base model takes a configuration object argument. The only required property in this object is k, the number of cluster centers (AKA centroids). Optional properties include:
- maxRestarts = the number of times that the algorithm is allowed to start over with a new batch of centroids; the default value is 25
- maxIterations = the number of times within a single restart that the algorithm is allowed to update the centroids' positions; the default value is 100
- tolerance = the update distance threshold below which the algorithm stops iterating; the update distance is the Euclidean distance between one iteration's centroid positions and the next iteration's centroid positions, so if the update is sufficiently small, then we consider the algorithm to have converged and thus stop iterating; the default value is 1e-4
These four values all become properties of the KMeansPlusPlus instance (keeping their same names).
The constructor for the meta model takes a configuration object argument. There are no required properties for this object. Optional properties include:
- ks = the K-values to test; the default value is the range [1, 16)
- maxRestarts = the number of restarts to pass into the constructor of the final fitted model (after finding the best K)
- maxIterations = the number of iterations to pass into the constructor of the final fitted model (after finding the best K)
- tolerance = the update distance threshold to pass into the constructor of the final fitted model (after finding the best K)
- modelClass = the class definition to use during the fitting process; the default value is the KMeansPlusPlus class
Fits the model to the two-dimensional data, x. Optionally, a progress callback function can be provided. This function takes a single argument that represents the overall completion of the fit method (in terms of restarts and iterations) expressed as a fraction between 0 and 1.
Returns a function that updates the fitting state. Most of the time, you'll probably want to use the fit method. But there may be cases where it's preferable to use a step function that slowly increments the fitting state, like when drawing an animation of the fitting process or when trying to avoid locking up the browser window. Here's an example of how to use it:
// set up model, then:
const fitStep = model.getFitStepFunction(x, progress)
let state
while (!state || !state.isFinished) {
state = fitStep()
}
// done!
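The while loop above still runs to completion in one go. To actually yield to the browser between steps, the calls can be scheduled one at a time. The sketch below assumes only what the example above shows, namely that the step function returns a state object with an isFinished flag. The runInSteps helper and its schedule parameter are hypothetical names, and the fake step function stands in for the result of model.getFitStepFunction(x, progress).

```javascript
// Hypothetical helper: calls fitStep once per scheduled tick until finished.
function runInSteps(fitStep, onDone, schedule = fn => setTimeout(fn, 0)) {
  function tick() {
    const state = fitStep()
    if (state.isFinished) onDone(state)
    else schedule(tick)
  }
  schedule(tick)
}

// Stand-in for a real step function; it "finishes" after three calls.
let calls = 0
const fakeFitStep = () => ({ isFinished: ++calls >= 3 })

runInSteps(fakeFitStep, () => console.log("done after", calls, "steps"))
```

In a browser, passing fn => requestAnimationFrame(fn) as the third argument would run one step per frame, which suits the animation use case mentioned above.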
Returns the labels for each point in x. A label is an index into the model's centroids array. Optionally, an alternative set of centroids can be supplied as the second argument.
Returns the negative of the K-means objective. The K-means objective is the within-cluster sum of squared errors; so the score method returns the negative of that value (so that higher scores are better than lower scores). See the note at the start of this section for more info.
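To make the labeling and scoring definitions concrete, here is a minimal sketch over plain arrays. The function names are hypothetical and the code is not the library's implementation; it only illustrates "label = index of the nearest centroid" and "score = negative within-cluster sum of squared errors".

```javascript
// Hypothetical sketch; x and centroids are arrays of points (arrays of numbers).
function squaredDistance(p, q) {
  return p.reduce((s, v, i) => s + (v - q[i]) ** 2, 0)
}

// Label = index of the nearest centroid.
function predictSketch(x, centroids) {
  return x.map(p => {
    let best = 0
    centroids.forEach((c, i) => {
      if (squaredDistance(p, c) < squaredDistance(p, centroids[best])) best = i
    })
    return best
  })
}

// Score = negative within-cluster sum of squared errors.
function scoreSketch(x, centroids) {
  const labels = predictSketch(x, centroids)
  const sse = x.reduce(
    (s, p, i) => s + squaredDistance(p, centroids[labels[i]]),
    0
  )
  return -sse
}

const points = [[0, 0], [2, 0], [10, 10]]
const centroids = [[1, 0], [10, 10]]
console.log(predictSketch(points, centroids)) // [ 0, 0, 1 ]
console.log(scoreSketch(points, centroids)) // -2
```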
The array of learned centroids. It's only available after the fit method has been run.
Identical to the standardize function. Returns a transformed copy of x in which the values have been converted to z-scores. In other words: (x - mean(x)) / stdev(x)
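A bare-bones version of that formula for a flat array of numbers might look like the sketch below. The function name is hypothetical. It uses the population standard deviation (dividing by n), which is an assumption chosen because it reproduces the normalize output shown in the NaN section of this document.

```javascript
// Hypothetical sketch; uses the population standard deviation (divide by n).
function zScoresSketch(x) {
  const n = x.length
  const mean = x.reduce((s, v) => s + v, 0) / n
  const stdev = Math.sqrt(x.reduce((s, v) => s + (v - mean) ** 2, 0) / n)
  return x.map(v => (v - mean) / stdev)
}

console.log(zScoresSketch([2, 3, 5, 6]))
// approximately [-1.2649, -0.6325, 0.6325, 1.2649]
```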
Returns a transformed copy of a matrix or DataFrame x in which all of the columns have been made orthogonal to each other. (See: Gram-Schmidt process) This is particularly useful for generating random datasets with uncorrelated features.
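For reference, the classical Gram-Schmidt process on the columns of a plain array-of-rows matrix can be sketched as follows. The function names are hypothetical, the sketch assumes linearly independent columns, and it leaves the resulting columns unscaled; whether the library also rescales them to unit length is not stated here.

```javascript
// Hypothetical sketch of classical Gram-Schmidt on the columns of a matrix
// (array of rows). Columns are made orthogonal but are not rescaled.
function dot(u, v) {
  return u.reduce((s, ui, i) => s + ui * v[i], 0)
}

function orthogonalizeColumns(x) {
  const cols = x[0].map((_, j) => x.map(row => row[j]))
  const out = []
  for (const col of cols) {
    let v = [...col]
    // Subtract the projection of the original column onto each earlier result.
    for (const u of out) {
      const scale = dot(u, col) / dot(u, u)
      v = v.map((vi, i) => vi - scale * u[i])
    }
    out.push(v)
  }
  // Convert the list of columns back into rows.
  return x.map((_, i) => out.map(c => c[i]))
}

const result = orthogonalizeColumns([
  [1, 1],
  [1, 0],
  [0, 1],
])
console.log(result) // rows [1, 0.5], [1, -0.5], [0, 1]
```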
Returns the p-value of two vectors using Welch's t-test. (See: Welch's t-test) Note that this function returns results that are very, very close to scipy's ttest_ind function when the latter is invoked this way:
ttest_ind(a, b, equal_var=False, nan_policy="omit")
I'm not sure why there's a very slight variation in returned p-values between my version of the function and scipy's. It's possible that there's some subtle degrees-of-freedom difference in our implementations; or maybe they have a better way of computing the probability of t (because mine uses a table of values and theirs may use a continuous function or whatever). However, after lots of testing, I feel pretty confident that these small differences are probably not significant. Let me know if you disagree, though. 😊
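For context, the two quantities that feed into the p-value in Welch's test (the t statistic and the Welch-Satterthwaite degrees of freedom) can be sketched as below. The function name is hypothetical, the sketch stops short of computing the p-value itself, and it is not a copy of this library's code.

```javascript
// Hypothetical sketch: returns the Welch t statistic and its approximate
// degrees of freedom. It does not compute the p-value.
function welchSketch(a, b) {
  const mean = x => x.reduce((s, v) => s + v, 0) / x.length
  const variance = x => {
    const m = mean(x)
    return x.reduce((s, v) => s + (v - m) ** 2, 0) / (x.length - 1)
  }
  // Per-sample variance of the mean.
  const va = variance(a) / a.length
  const vb = variance(b) / b.length
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb)
  // Welch-Satterthwaite approximation.
  const df =
    (va + vb) ** 2 / (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1))
  return { t, df }
}

console.log(welchSketch([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
// t is about -1.897 and df is about 5.88
```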
Returns the projection of vector or Series v onto vector or Series u.
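The standard formula for this is (u · v / u · u) * u. A tiny sketch for plain arrays follows; the function name and the (v, u) argument order are assumptions for illustration.

```javascript
// Hypothetical sketch: projection of v onto u = (u . v / u . u) * u
function projectSketch(v, u) {
  const dot = (p, q) => p.reduce((s, pi, i) => s + pi * q[i], 0)
  const scale = dot(u, v) / dot(u, u)
  return u.map(ui => ui * scale)
}

console.log(projectSketch([3, 4], [1, 0])) // [ 3, 0 ]
```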
Returns (roughly) the square root of the R^2 value of xTrue versus xPred. Since R^2 can be negative, the actual value returned is sign(R^2) * sqrt(abs(R^2)). The two datasets can be any shape provided that they have the same shape as each other.
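As a worked illustration of that formula, the sketch below assumes the usual definition R^2 = 1 - SS_residual / SS_total and flattens nested arrays before comparing. The function name is hypothetical, and the library's handling of shapes and NaN values is not reproduced here.

```javascript
// Hypothetical sketch; nested arrays are flattened before comparing.
function rScoreSketch(xTrue, xPred) {
  const t = xTrue.flat(Infinity)
  const p = xPred.flat(Infinity)
  const mean = t.reduce((s, v) => s + v, 0) / t.length
  const ssRes = t.reduce((s, v, i) => s + (v - p[i]) ** 2, 0)
  const ssTot = t.reduce((s, v) => s + (v - mean) ** 2, 0)
  const r2 = 1 - ssRes / ssTot
  // Keep the sign of R^2 while taking the square root of its magnitude.
  return Math.sign(r2) * Math.sqrt(Math.abs(r2))
}

console.log(rScoreSketch([1, 2, 3], [1, 2, 3])) // 1
console.log(rScoreSketch([1, 2, 3], [3, 2, 1])) // about -1.732
```

In the second call, R^2 is 1 - 8 / 2 = -3, so the returned value is -sqrt(3).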
Identical to the normalize function.
Transforms and/or untransforms 1- or 2-dimensional data in basically the same way as the normalize function. However, having the functionality wrapped in a class makes it easier to apply the same transformation or untransformation across multiple datasets. Its functionality is supposed to mimic sklearn's StandardScaler.
Learns the means and standard deviations of each column in x. If x is a 1-dimensional array or Series, then it's treated as a 1-column matrix or DataFrame respectively.
Transforms each column in x by subtracting that column's corresponding mean and dividing by that column's corresponding standard deviation (i.e., the means and standard deviations learned in the fit method). If x is a 1-dimensional array or Series, then it's treated as a 1-column matrix or DataFrame respectively. The number of columns in x must be the same as the number of columns in the data on which the StandardScaler instance was trained.
Reverses the transformation done by the transform method. Specifically, it transforms each column in x by multiplying by that column's corresponding standard deviation and adding that column's corresponding mean. If x is a 1-dimensional array or Series, then it's treated as a 1-column matrix or DataFrame respectively. The number of columns in x must be the same as the number of columns in the data on which the StandardScaler instance was trained.
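The fit / transform / reverse cycle described above can be illustrated with a stripped-down stand-in class. Everything in this sketch is hypothetical: the class name ScalerSketch, the untransform method name, the choice of population standard deviation, and the decision to have fit return the instance. It handles only 2-dimensional arrays of rows.

```javascript
// Hypothetical sketch for 2-dimensional arrays (arrays of rows) only.
class ScalerSketch {
  // Learn per-column means and (population) standard deviations.
  fit(x) {
    const n = x.length
    this.means = x[0].map((_, j) => x.reduce((s, row) => s + row[j], 0) / n)
    this.stdevs = x[0].map((_, j) =>
      Math.sqrt(x.reduce((s, row) => s + (row[j] - this.means[j]) ** 2, 0) / n)
    )
    return this
  }

  // Subtract the learned mean, then divide by the learned stdev.
  transform(x) {
    return x.map(row => row.map((v, j) => (v - this.means[j]) / this.stdevs[j]))
  }

  // Reverses transform: multiply by the stdev, then add the mean.
  untransform(x) {
    return x.map(row => row.map((v, j) => v * this.stdevs[j] + this.means[j]))
  }
}

const train = [[1, 10], [2, 20], [3, 30]]
const scaler = new ScalerSketch().fit(train)
const z = scaler.transform(train)
console.log(scaler.untransform(z)) // approximately the original rows
```

Because the learned means and standard deviations live on the instance, the same scaler can be reused to transform a separate test set with the training statistics.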
Splits the given datasets into train and test sets in the same way as sklearn's train_test_split function. Can optionally be passed an options object. Properties on this object can include:
- shouldShuffle = a boolean indicating whether or not the dataset(s) should be shuffled row-wise before being split into train and test sets
- testSize = a number in the range [0, 1] representing the fraction of rows to reserve as the test dataset(s)
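To show how the two options interact, here is a small self-contained sketch of a row-aligned split. The function name, the array-of-datasets signature, the default values, and the choice to take the test rows from the end are all assumptions for illustration; the returned order mirrors sklearn's convention of [aTrain, aTest, bTrain, bTest, ...].

```javascript
// Hypothetical sketch. Every dataset must have the same number of rows.
function trainTestSplitSketch(datasets, options = {}) {
  // These defaults are assumptions, not the library's documented defaults.
  const { shouldShuffle = true, testSize = 0.1 } = options
  const n = datasets[0].length
  const indices = datasets[0].map((_, i) => i)

  if (shouldShuffle) {
    // Fisher-Yates shuffle of the row indices.
    for (let i = n - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1))
      ;[indices[i], indices[j]] = [indices[j], indices[i]]
    }
  }

  // The same index split is applied to every dataset so rows stay aligned.
  const split = n - Math.round(n * testSize)
  const out = []
  for (const data of datasets) {
    out.push(indices.slice(0, split).map(i => data[i]))
    out.push(indices.slice(split).map(i => data[i]))
  }
  return out // [aTrain, aTest, bTrain, bTest, ...]
}

const x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
const y = x.map(v => v * 2)
const [xTrain, xTest, yTrain, yTest] = trainTestSplitSketch([x, y], {
  shouldShuffle: false,
  testSize: 0.2,
})
console.log(xTest, yTest) // [ 9, 10 ] [ 18, 20 ]
```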
Many of the calculations in this library can't be done correctly if the given dataset includes NaN values. By default, the library automatically drops NaN values in every relevant calculation, no errors are thrown, and no warnings are given.
For example, the normalize function relies on being able to compute the mean and standard deviation of a dataset; but any NaNs in the dataset will cause the mean and standard deviation to be NaN as well. So the function drops NaN values first, then computes the mean and standard deviation from the remaining values, and then uses the mean and standard deviation to transform the original dataset.
However, if you'd prefer that these functions should include NaN values in their calculations, then you can override the default setting this way:
const { common } = require("@jrc03c/js-data-science-helpers")
common.shouldIgnoreNaNValues = false
By default, the normalize function would return results like this:
const { normalize } = require("@jrc03c/js-data-science-helpers")
const x = [2, 3, "four", 5, 6]
normalize(x)
// [
// -1.2649110640673518,
// -0.6324555320336759,
// NaN,
// 0.6324555320336759,
// 1.2649110640673518
// ]
But if we override the default setting and allow NaN values to be used in the calculations, then we get results like this:
const { common, normalize } = require("@jrc03c/js-data-science-helpers")
common.shouldIgnoreNaNValues = false
const x = [2, 3, "four", 5, 6]
normalize(x)
// [NaN, NaN, NaN, NaN, NaN]