napkinxc.models.HSM

class napkinxc.models.HSM(output, tree_type='hierarchicalKmeans', arity=2, max_leaves=100, kmeans_eps=0.0001, kmeans_balanced=True, flatten_tree=0, tree_structure=None, hash=None, features_threshold=0, norm=True, bias=1.0, pick_one_label_weighting=False, optimizer='liblinear', loss='log', weights_threshold=0.1, liblinear_c=10, liblinear_eps=0.1, liblinear_solver=None, liblinear_max_iter=100, eta=1.0, epochs=1, adagrad_eps=0.001, load_as='map', ensemble=1, seed=None, threads=0, verbose=0, **kwargs)[source]

Bases: LabelTreeModel

Hierarchical Softmax (multi-class) classifier with linear node estimators, using a C++ core.

__init__(output, tree_type='hierarchicalKmeans', arity=2, max_leaves=100, kmeans_eps=0.0001, kmeans_balanced=True, flatten_tree=0, tree_structure=None, hash=None, features_threshold=0, norm=True, bias=1.0, pick_one_label_weighting=False, optimizer='liblinear', loss='log', weights_threshold=0.1, liblinear_c=10, liblinear_eps=0.1, liblinear_solver=None, liblinear_max_iter=100, eta=1.0, epochs=1, adagrad_eps=0.001, load_as='map', ensemble=1, seed=None, threads=0, verbose=0, **kwargs)[source]

Construct a Hierarchical Softmax model.

Parameters:
  • output (str) – Directory where the model will be stored

  • tree_type (str, optional) –

    Tree type to construct. Available tree types:

    • 'hierarchicalKmeans'

    • 'balancedInOrder'

    • 'balancedRandom'

    • 'completeKaryInOrder'

    • 'completeKaryRandom'

    • 'huffman'

    Defaults to 'hierarchicalKmeans'

  • arity (int, optional) – Arity of tree nodes, k for k-means clustering used in hierarchical k-means tree building procedure, defaults to 2

  • max_leaves (int, optional) – Maximum degree of pre-leaf nodes, defaults to 100

  • kmeans_eps (float, optional) – Tolerance of termination criterion of the k-means clustering used in hierarchical k-means tree building procedure, defaults to 0.0001

  • kmeans_balanced (bool, optional) – Use balanced k-means clustering, defaults to True

  • hash (int, optional) – Hash features to a space of given size, value of this argument is saved with model weights, if None or 0 disable hashing, defaults to None

  • features_threshold (float, optional) – Prune features below given threshold, value of this argument is saved with model weights, defaults to 0

  • norm (bool, optional) – Unit norm feature vector, value of this argument is saved with model weights, defaults to True

  • bias (float, optional) – Value of the bias features, value of this argument is saved with model weights, defaults to 1.0

  • optimizer (str, optional) – Optimizer used for training node classifiers {'liblinear', 'sgd', 'adagrad'}, defaults to 'liblinear'

  • loss (str, optional) – Loss optimized while training node classifiers {'log' (alias 'logistic'), 'l2' (alias 'squaredHinge')}, defaults to 'log'

  • weights_threshold (float, optional) – Threshold value for pruning weights, defaults to 0.1

  • liblinear_c (float, optional) – LIBLINEAR cost coefficient (inverse regularization strength), smaller values specify stronger regularization, takes effect only if optimizer='liblinear', defaults to 10.0

  • liblinear_eps (float, optional) – LIBLINEAR tolerance of termination criterion, takes effect only if optimizer='liblinear', defaults to 0.1

  • liblinear_solver (str, optional) –

    Override LIBLINEAR solver set by loss parameter (default for loss='log': 'L2R_LR_DUAL', for loss='l2': 'L2R_L2LOSS_SVC_DUAL'), takes effect only if optimizer='liblinear'. Available solvers:

    • 'L2R_LR_DUAL'

    • 'L2R_LR'

    • 'L1R_LR'

    • 'L2R_L2LOSS_SVC_DUAL'

    • 'L2R_L2LOSS_SVC'

    • 'L2R_L1LOSS_SVC_DUAL'

    • 'L1R_L2LOSS_SVC'

    'L2R_LR_DUAL' and 'L2R_L2LOSS_SVC_DUAL' usually work best in the XC (extreme classification) setting, defaults to None

  • liblinear_max_iter (int, optional) – Maximum number of LIBLINEAR iterations, takes effect only if optimizer='liblinear', defaults to 100

  • eta (float, optional) – Step size (learning rate) for online optimizers, defaults to 1.0

  • epochs (int, optional) – Number of training epochs for online optimizers, defaults to 1

  • adagrad_eps (float, optional) – Defines starting step size for AdaGrad, defaults to 0.001

  • ensemble (int, optional) – Number of trees in the ensemble, defaults to 1

  • seed (int, optional) – Random seed, if None use current system time, defaults to None

  • threads (int, optional) – Number of threads used for training and prediction, if 0 use number of available CPUs, if -1 use number of available CPUs - 1, defaults to 0

  • verbose (bool, optional) – If True print progress, defaults to False
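A minimal usage sketch of the constructor together with fit and predict, assuming the napkinxc package is installed. The toy data, the chosen keyword values, and the 'hsm-model' output directory are made up for illustration; the import is deferred into the function so the data definitions can be read on their own.

```python
# Toy multi-class data: each row is a list of (feature id, value) pairs,
# each target is a single label id. All values are illustrative only.
X_train = [
    [(1, 0.5), (3, 1.0)],
    [(2, 1.0)],
    [(1, 1.0), (2, 0.3)],
    [(3, 0.7)],
]
Y_train = [0, 1, 0, 2]

# Keyword arguments taken from the parameter list above.
hsm_kwargs = {
    "tree_type": "huffman",
    "optimizer": "liblinear",
    "loss": "log",
    "liblinear_c": 10,
    "seed": 0,
    "threads": 1,
}


def train_and_predict(output_dir="hsm-model"):
    """Fit an HSM model on the toy data and return top-1 predictions."""
    from napkinxc.models import HSM  # requires the napkinxc package

    model = HSM(output_dir, **hsm_kwargs)
    model.fit(X_train, Y_train)
    return model.predict(X_train, top_k=1)
```

Calling train_and_predict() writes the model under the given directory, as described for the output parameter.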

Methods

__init__(output[, tree_type, arity, ...])

Construct a Hierarchical Softmax model.

build_tree(X, Y)

Build the tree for the given data (without training node classifiers)

fit(X, Y)

Fit the model to the given training data.

fit_on_file(path)

Fit the model to the training data in the given file in multi-label svmlight/libsvm format.

get_nodes_to_update(Y)

Based on the current tree, get list of updates for each set of labels in Y.

get_nodes_updates(Y)

Based on the current tree, get list of updates for each node for dataset Y.

get_params([deep])

Get parameters of this model.

get_tree_structure()

Return internal label tree structure

load()

Load the model to RAM.

ofo(X, Y[, type, a, b, epochs])

Perform Online F-measure Optimization procedure on the given data to find optimal thresholds.

predict(X[, top_k, threshold, labels_weights])

Predict labels for data points in X.

predict_for_file(path[, top_k, threshold, ...])

Predict labels for data points in the given file in multi-label svmlight/libsvm format.

predict_proba(X[, top_k, threshold, ...])

Predict labels with probability estimates for data points in X.

predict_proba_for_file(path[, top_k, ...])

Predict labels with probability estimates for data points in the given file in multi-label svmlight/libsvm format.

remap_tree_structure(tree_structure)

Remaps tree structure to list of tuples of ints.

set_params(**params)

Set parameters for this model.

set_tree_structure(tree_structure)

Set internal label tree structure

unload()

Unload the model from RAM.

build_tree(X, Y)

Build the tree for the given data (without training node classifiers)

Parameters:
  • X (ndarray, csr_matrix, list[list[int]], list[list[tuple[int, float]]]) – Data points as a matrix or list of lists of int or tuples of int and float (feature id, value).

  • Y (list[int], list[list|tuple[int]]) – Target labels as list of ints (multi-class data) or lists or tuples of ints (multi-label data).

fit(X, Y)

Fit the model to the given training data.

Parameters:
  • X (csr_matrix, ndarray, list[list[int]|tuple[int]], list[list[tuple[int, float]]]) – Training data points as a matrix or list of lists of int or tuples of int and float (feature id, value).

  • Y (csr_matrix, ndarray, list[list[int]|tuple[int]], list[list[tuple[int, float]]], list[int]) – Target labels as a matrix or lists or tuples of ints (multi-label data) or list of ints (multi-class data).

fit_on_file(path)

Fit the model to the training data in the given file in multi-label svmlight/libsvm format.

Parameters:

path (str) – Path to the file.
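A small sketch of producing such a file, assuming the common multi-label libsvm convention of comma-separated label ids followed by index:value pairs on each line. The helper names are hypothetical, and whether a header line is expected is not covered here.

```python
def to_multilabel_libsvm_line(labels, features):
    """Format one example as 'l1,l2 idx:val idx:val' (assumed convention)."""
    label_part = ",".join(str(label) for label in labels)
    # Feature indices are written in increasing order.
    feature_part = " ".join(f"{idx}:{val}" for idx, val in sorted(features))
    return f"{label_part} {feature_part}"


def write_multilabel_libsvm(path, X, Y):
    """Write rows of (feature id, value) pairs and label lists to a file."""
    with open(path, "w") as out:
        for features, labels in zip(X, Y):
            out.write(to_multilabel_libsvm_line(labels, features) + "\n")
```

The resulting path could then be passed to fit_on_file(path).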

get_nodes_to_update(Y)

Based on the current tree, get list of updates for each set of labels in Y.

Parameters:

Y (list[int], list[list|tuple[int]]) – Target labels as list of ints (multi-class data) or lists or tuples of ints (multi-label data).

Returns:

List of lists of nodes and their updates (0 - negative or 1 - positive) for each set of labels in Y.

Return type:

list[list[tuple[int, float]]]

get_nodes_updates(Y)

Based on the current tree, get list of updates for each node for dataset Y.

Parameters:

Y (list[int], list[list[int]|tuple[int]]) – Target labels as list of ints (multi-class data) or lists or tuples of ints (multi-label data).

Returns:

List of lists of examples and their updates (0 - negative or 1 - positive) for each node in the current tree.

Return type:

list[list[tuple[int, float]]]
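The two methods above appear to describe the same updates from opposite directions: get_nodes_to_update is indexed by example, get_nodes_updates by node. The sketch below regroups a per-example list into a per-node list under that assumption; the helper name, the assumption that the outer list is indexed by node id, and the sample values are illustrative only.

```python
from collections import defaultdict


def per_example_to_per_node(updates_per_example):
    """Regroup [(node, update), ...] per example into [(example, update), ...] per node."""
    per_node = defaultdict(list)
    for example_idx, updates in enumerate(updates_per_example):
        for node, value in updates:
            per_node[node].append((example_idx, value))
    # Assumption: the outer list is indexed by node id, with empty lists for unused nodes.
    n_nodes = max(per_node) + 1 if per_node else 0
    return [per_node.get(node, []) for node in range(n_nodes)]


# Hypothetical updates for two examples over nodes 0..2.
example_view = [
    [(0, 1.0), (1, 1.0), (2, 0.0)],
    [(0, 1.0), (2, 1.0), (1, 0.0)],
]
node_view = per_example_to_per_node(example_view)
```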

get_params(deep=False)

Get parameters of this model.

Parameters:

deep – Ignored, added for Scikit-learn compatibility, defaults to False

Returns:

Mapping of string to any

Return type:

dict

get_tree_structure()

Return internal label tree structure

Returns:

Tree structure, represented as a list of tuples representing nodes, where the first value is an index of a parent node, if equal to -1, then the node is a root node, the second value is an index of the node, and the third, a label assigned to the node, if equal to -1, then no label is assigned to the node.

Return type:

list[tuple[int, int, int]]
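To illustrate the (parent, node, label) format, the sketch below walks a small hand-written tree and recovers the root-to-leaf path for every label. The helper name and the example tree are hypothetical.

```python
def label_paths(tree_structure):
    """Map each label to the list of node indices from the root to its node."""
    parent_of = {node: parent for parent, node, _ in tree_structure}
    paths = {}
    for _, node, label in tree_structure:
        if label == -1:  # internal node, no label assigned
            continue
        path = [node]
        while parent_of[path[-1]] != -1:  # climb until the root is reached
            path.append(parent_of[path[-1]])
        paths[label] = path[::-1]
    return paths


# Root 0 with children 1 and 2; node 1 has leaves 3 and 4.
tree = [(-1, 0, -1), (0, 1, -1), (0, 2, 2), (1, 3, 0), (1, 4, 1)]
paths = label_paths(tree)
```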

load()

Load the model to RAM.

ofo(X, Y, type='micro', a=10, b=20, epochs=1)

Perform Online F-measure Optimization procedure on the given data to find optimal thresholds.

Parameters:
  • X (csr_matrix, ndarray, list[list[int]|tuple[int]], list[list[tuple[int, float]]]) – Data points as a matrix or list of lists of int or tuples of int and float (feature id, value).

  • Y (csr_matrix, ndarray, list[list[int]|tuple[int]], list[list[tuple[int, float]]], list[int]) – Target labels as a matrix or lists or tuples of ints (multi-label data) or list of ints (multi-class data).

  • type (str) – Type of OFO procedure {'micro', 'macro'}, defaults to 'micro'

  • a (int) – Parameter of OFO procedure, defaults to 10

  • b (int) – Parameter of OFO procedure, defaults to 20

  • epochs (int, optional) – Number of OFO epochs, defaults to 1

Returns:

Single threshold in case of type='micro' and list of thresholds in case of type='macro'

Return type:

float, list[float]
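A sketch of the general idea behind micro-averaged Online F-measure Optimization, written in pure Python for illustration. It is not the library's implementation: it assumes a and b act as the initial numerator and denominator of the running threshold a / b, and that inputs follow the predict_proba output format. The function name is hypothetical.

```python
def ofo_micro_sketch(proba_rows, label_rows, a=10, b=20, epochs=1):
    """Return one threshold tuned online over (label id, probability) rows."""
    for _ in range(epochs):
        for row, true_labels in zip(proba_rows, label_rows):
            threshold = a / b
            predicted = {label for label, prob in row if prob > threshold}
            truth = set(true_labels)
            a += len(predicted & truth)        # true positives
            b += len(predicted) + len(truth)   # predicted plus actual positives
    return a / b
```

With the default a=10 and b=20 the procedure starts from a threshold of 0.5; a threshold obtained this way can be passed as the threshold argument of predict.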

predict(X, top_k=0, threshold=0, labels_weights=None)

Predict labels for data points in X.

Parameters:
  • X (csr_matrix, ndarray, list[list[int]|tuple[int]], list[list[tuple[int, float]]]) – Data points as a matrix or list of lists of int or tuples of int and float (feature id, value).

  • top_k (int) – Predict top-k labels, if 0, the option is ignored, defaults to 0

  • threshold (float, list[float], ndarray, optional) – Predict labels with probability above the threshold in case of single value or above the specific threshold for each label in case of list or array of values, if 0, the option is ignored, defaults to 0

  • labels_weights (list[float], ndarray, optional) – Predict labels according to their weights multiplied by probability, if None, the option is ignored, defaults to None

Returns:

List of lists with predicted labels.

Return type:

list[list[int]]

predict_for_file(path, top_k=0, threshold=0, labels_weights=None)

Predict labels for data points in the given file in multi-label svmlight/libsvm format.

Parameters:
  • path (str) – Path to the file

  • top_k (int) – Predict top-k labels, if 0, the option is ignored, defaults to 0

  • threshold (float, list[float], ndarray, optional) – Predict labels with probability above the threshold in case of single value or above the specific threshold for each label in case of list or array of values, if 0, the option is ignored, defaults to 0

  • labels_weights (list[float], ndarray, optional) – Predict labels according to their weights multiplied by probability, if None, the option is ignored, defaults to None

Returns:

List of lists with predicted labels.

Return type:

list[list[int]]

predict_proba(X, top_k=0, threshold=0, labels_weights=None)

Predict labels with probability estimates for data points in X.

Parameters:
  • X (csr_matrix, ndarray, list[list[int]|tuple[int]], list[list[tuple[int, float]]]) – Data points as a matrix or list of lists of int or tuples of int and float (feature id, value).

  • top_k (int) – Predict top-k labels, if 0, the option is ignored, defaults to 0

  • threshold (float, list[float], ndarray, optional) – Predict labels with probability above the threshold in case of single value or above the specific threshold for each label in case of list or array of values, if 0, the option is ignored, defaults to 0

  • labels_weights (list[float], ndarray, optional) – Predict labels according to their weights multiplied by probability, if None, the option is ignored, defaults to None

Returns:

List of lists of tuples (label id, probability) with predicted labels.

Return type:

list[list[tuple[int, float]]]
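The top_k and threshold options can be pictured as filters over one row of (label id, probability) pairs. The sketch below is an illustration only: the helper name is hypothetical, and applying the threshold before the top-k cut when both are set is an assumption, not documented behavior.

```python
def select_labels(row, top_k=0, threshold=0):
    """Filter one row of (label id, probability) pairs by threshold and top-k."""
    ranked = sorted(row, key=lambda pair: pair[1], reverse=True)
    if hasattr(threshold, "__len__"):
        # Per-label thresholds, indexed by label id.
        ranked = [(label, prob) for label, prob in ranked if prob > threshold[label]]
    elif threshold > 0:
        # Single threshold shared by all labels.
        ranked = [(label, prob) for label, prob in ranked if prob > threshold]
    if top_k > 0:
        ranked = ranked[:top_k]
    return ranked


row = [(0, 0.1), (1, 0.7), (2, 0.4)]
```

Here select_labels(row, top_k=2) keeps the two most probable labels, while select_labels(row, threshold=0.3) keeps every label whose probability exceeds 0.3.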

predict_proba_for_file(path, top_k=0, threshold=0, labels_weights=None)

Predict labels with probability estimates for data points in the given file in multi-label svmlight/libsvm format.

Parameters:
  • path (str) – Path to the file.

  • top_k (int) – Predict top-k labels, if 0, the option is ignored, defaults to 0

  • threshold (float, list[float], ndarray, optional) – Predict labels with probability above the threshold in case of single value or above the specific threshold for each label in case of list or array of values, if 0, the option is ignored, defaults to 0

  • labels_weights (list[float], ndarray, optional) – Predict labels according to their weights multiplied by probability, if None, the option is ignored, defaults to None

Returns:

List of lists of tuples (label id, probability) with predicted labels.

Return type:

list[list[tuple[int, float]]]

static remap_tree_structure(tree_structure)

Remaps tree structure to list of tuples of ints.

Parameters:

tree_structure (list[tuple[any, any, any]]) – Tree structure in format of a list of tuples representing nodes, where the first value is the name/index of a parent node, if equal to None or -1, then the node is a root node, the second value is the name/index of the node, and the third, a label assigned to the node, if equal to None or -1, then no label is assigned to the node.

Returns:

Tree structure, represented as a list of tuples representing nodes, where the first value is an index of a parent node, if equal to -1, then the node is a root node, the second value is an index of the node, and the third, a label assigned to the node, if equal to -1, then no label is assigned to the node.

Return type:

list[tuple[int, int, int]]
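A pure-Python sketch of what such a remapping could look like, for illustration only. The function name is hypothetical, and numbering nodes in order of first appearance is an assumption; the library may assign indices differently.

```python
def remap_tree_structure_sketch(tree_structure):
    """Turn (parent, node, label) tuples with arbitrary names into int tuples."""
    index_of = {}

    def to_index(name):
        if name is None or name == -1:
            return -1  # marks "no parent"
        if name not in index_of:
            index_of[name] = len(index_of)  # next free index
        return index_of[name]

    remapped = []
    for parent, node, label in tree_structure:
        parent_idx = to_index(parent)
        node_idx = to_index(node)
        label_idx = -1 if label is None else int(label)
        remapped.append((parent_idx, node_idx, label_idx))
    return remapped


named_tree = [(None, "root", None), ("root", "a", 0), ("root", "b", 1)]
int_tree = remap_tree_structure_sketch(named_tree)
```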

set_params(**params)

Set parameters for this model. Should be used only if you know what you are doing.

Parameters:

**params – Parameter names with their new values.

Returns:

self

Return type:

Model

set_tree_structure(tree_structure)

Set internal label tree structure

Parameters:

tree_structure (list[tuple[int, int, int]]) – Tree structure in format of a list of tuples representing nodes, where the first value is an index of a parent node, if equal to -1, then the node is a root node, the second value is an index of the node, and the third, a label assigned to the node, if equal to -1, then no label is assigned to the node.
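Before passing a hand-written structure, it can help to check it for obvious mistakes. The checks below (a single root, unique node indices, known parents, unique labels) are assumptions about what a sensible label tree needs, not the library's own validation; the helper name is hypothetical.

```python
def check_tree_structure(tree_structure):
    """Return a list of human-readable problems found in the structure."""
    nodes = [node for _, node, _ in tree_structure]
    roots = [node for parent, node, _ in tree_structure if parent == -1]
    labels = [label for _, _, label in tree_structure if label != -1]
    known = set(nodes)
    problems = []
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    if len(known) != len(nodes):
        problems.append("duplicate node indices")
    for parent, node, _ in tree_structure:
        if parent != -1 and parent not in known:
            problems.append(f"node {node} has unknown parent {parent}")
    if len(set(labels)) != len(labels):
        problems.append("a label is assigned to more than one node")
    return problems


good_tree = [(-1, 0, -1), (0, 1, 0), (0, 2, 1)]
bad_tree = [(-1, 0, -1), (-1, 1, 0), (5, 2, 0)]
```

An empty result, as for good_tree, means none of these checks failed.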

unload()

Unload the model from RAM.