Welcome to napkinXC’s documentation!

Note

Documentation is currently a work in progress!

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification that implements the following methods both in Python and C++:

  • Probabilistic Label Trees (PLTs) - for multi-label log-time training and prediction,
  • Hierarchical softmax (HSM) - for multi-class log-time training and prediction,
  • Binary Relevance (BR) - multi-label baseline,
  • One Versus Rest (OVR) - multi-class baseline.

All of these methods decompose a multi-class or multi-label problem into a set of binary learning problems.

Currently, a detailed description of the methods and their parameters can be found in this paper: Probabilistic Label Trees for Extreme Multi-label Classification

Python Quick Start

Installation

The Python (3.5+) version of napkinXC can be easily installed from the PyPI repository on Linux and macOS (Windows is currently not supported). It requires a modern C++17 compiler, CMake, and Git to be installed:

pip install napkinxc

or directly from the GitHub repository:

pip install git+https://github.com/mwydmuch/napkinXC.git

Usage

The napkinxc module contains three submodules: models, which contains all the model classes, and two additional modules, datasets and measures, with helpers for loading datasets and evaluating predictions.

Minimal example of usage:

from napkinxc.datasets import load_dataset
from napkinxc.models import PLT
from napkinxc.measures import precision_at_k

# Download and load the EURLex-4K dataset
X_train, Y_train = load_dataset("eurlex-4k", "train")
X_test, Y_test = load_dataset("eurlex-4k", "test")

# Train a PLT model, stored in the "eurlex-model" directory
plt = PLT("eurlex-model")
plt.fit(X_train, Y_train)

# Predict the most probable label for each test example and evaluate precision at 1
Y_pred = plt.predict(X_test, top_k=1)
print(precision_at_k(Y_test, Y_pred, k=1))
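
The measures submodule contains more evaluation utilities than precision_at_k (they are listed in the Python API section below). For example, continuing the snippet above, recall at 1 and the F1 measure can be computed in the same way:

from napkinxc.measures import recall_at_k, f1_measure

print(recall_at_k(Y_test, Y_pred, k=1))  # recall at 1
print(f1_measure(Y_test, Y_pred))        # balanced F-score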

Using C++ executable

napkinXC can also be built as an executable that can be used to train and evaluate models and to make predictions.

Building

To build napkinXC, first clone the project repository and run the following commands in the root directory of the project. Building requires a modern C++17 compiler, CMake, and Git to be installed. Set the CXX and CC environment variables before running the cmake command if you want to build with a specific C++ compiler.

cmake .
make

The -B option can be passed to the cmake command to specify a different build directory. After successful compilation, the nxc executable should appear in the root directory or in the specified build directory.
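
For example, an out-of-source build using a specific compiler could look like this (the compiler names and the build directory name are only examples):

CXX=clang++ CC=clang cmake -B build .
make -C build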

LIBSVM data format

napkinXC supports a multi-label svmlight/libsvm-like format (less strict) as well as the format of the datasets from The Extreme Classification Repository, which has an additional header line with the number of data points, features, and labels.

The format is text-based. Each line contains a single instance and ends with a \n character.

<label>,<label>,... <feature>(:<value>) <feature>(:<value>) ...

<label> and <feature> are indexes that should be positive integers. Unlike in the standard svmlight/libsvm format, labels and features do not have to be sorted in ascending order. The :<value> part can be omitted after <feature>; in that case, value = 1 is assumed.
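
For example, here are two instances in this format, the first with labels 3 and 7, the second with only label 7; features 6 and 5 have no explicit value, so value = 1 is assumed for them:

3,7 4:2.0 1:0.5 6
7 2:1.0 5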

Usage

The nxc executable requires a command, e.g. train, test, or predict, as its first argument. The -i/--input and -o/--output arguments always need to be provided.

nxc <command> -i <path to dataset> -o <path to model directory> <args> ...
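
For example, to train a PLT model and then evaluate it with a few precision@k measures (the dataset file names here are only placeholders):

nxc train -i eurlex_train.txt -o eurlex-model -m plt
nxc test -i eurlex_test.txt -o eurlex-model --measures p@1,p@3,p@5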

Command line options

Usage: nxc <command> <args>

Commands:
    train                   Train model on given input data
    test                    Test model on given input data
    predict                 Predict for given data
    ofo                     Use online f-measure optimization
    version                 Print napkinXC version
    help                    Print help

Args:
    General:
    -i, --input             Input dataset, required
    -o, --output            Output (model) dir, required
    -m, --model             Model type (default = plt):
                            Models: ovr, br, hsm, plt, oplt, svbopFull, svbopHf, brMips, svbopMips
    --ensemble              Number of models in ensemble (default = 1)
    -t, --threads           Number of threads to use (default = 0)
                            Note: -1 to use #cpus - 1, 0 to use #cpus
    --hash                  Size of features space (default = 0)
                            Note: 0 to disable hashing
    --featuresThreshold     Prune features below given threshold (default = 0.0)
    --seed                  Seed (default = system time)
    --verbose               Verbose level (default = 2)

    Base classifiers:
    --optimizer             Optimizer used for training binary classifiers (default = liblinear)
                            Optimizers: liblinear, sgd, adagrad, fobos
    --bias                  Value of the bias features (default = 1)
    --weightsThreshold      Threshold value for pruning models weights (default = 0.1)

    LIBLINEAR:              (more about LIBLINEAR: https://github.com/cjlin1/liblinear)
    -s, --liblinearSolver   LIBLINEAR solver (default for log loss = L2R_LR_DUAL, for l2 loss = L2R_L2LOSS_SVC_DUAL)
                            Supported solvers: L2R_LR_DUAL, L2R_LR, L1R_LR,
                                               L2R_L2LOSS_SVC_DUAL, L2R_L2LOSS_SVC, L2R_L1LOSS_SVC_DUAL, L1R_L2LOSS_SVC
    -c, --liblinearC        LIBLINEAR cost coefficient, the inverse of regularization strength; must be a positive float,
                            smaller values specify stronger regularization (default = 10.0)
    --eps, --liblinearEps   LIBLINEAR tolerance of termination criterion (default = 0.1)

    SGD/AdaGrad:
    -l, --lr, --eta         Step size (learning rate) for online optimizers (default = 1.0)
    --epochs                Number of training epochs for online optimizers (default = 1)
    --adagradEps            Defines starting step size for AdaGrad (default = 0.001)

    Tree:
    -a, --arity             Arity of tree nodes (default = 2)
    --maxLeaves             Maximum degree of pre-leaf nodes (default = 100)
    --tree                  File with tree structure
    --treeType              Type of tree to build if a file with the tree structure is not provided
                            Tree types: hierarchicalKmeans, huffman, completeKaryInOrder, completeKaryRandom,
                                        balancedInOrder, balancedRandom, onlineComplete

    K-Means tree:
    --kmeansEps             Tolerance of termination criterion of the k-means clustering
                            used in hierarchical k-means tree building procedure (default = 0.001)
    --kmeansBalanced        Use balanced K-Means clustering (default = 1)

    Prediction:
    --topK                  Predict top-k labels (default = 5)
    --threshold             Predict labels with probability above the threshold (default = 0)
    --thresholds            Path to a file with a threshold for each label

    Test:
    --measures              Evaluate test using set of measures (default = "p@1,r@1,c@1,p@3,r@3,c@3,p@5,r@5,c@5")
                            Measures: acc (accuracy), p (precision), r (recall), c (coverage), hl (hamming loss)
                                      p@k (precision at k), r@k (recall at k), c@k (coverage at k), s (prediction size)

Python API

Models

models.PLT
models.HSM
models.BR
models.OVR

Datasets

datasets.download_dataset
datasets.load_dataset
datasets.load_libsvm_file
datasets.load_json_lines_file
datasets.to_csr_matrix
datasets.to_np_matrix
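
A minimal sketch of loading a custom dataset with these helpers; it assumes that load_libsvm_file takes a path to a file in the LIBSVM format described above and returns features and labels, and that to_csr_matrix converts the loaded data to a SciPy CSR matrix (check the API reference for the exact signatures):

from napkinxc.datasets import load_libsvm_file, to_csr_matrix

# "train.txt" is a hypothetical file in the LIBSVM format described above
X, Y = load_libsvm_file("train.txt")
X_csr = to_csr_matrix(X)  # sparse SciPy representation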

Measures

measures.precision_at_k(Y_true, Y_pred[, k]) Calculate precision at 1-k places.
measures.recall_at_k(Y_true, Y_pred[, k, …]) Calculate recall at 1-k places.
measures.coverage_at_k(Y_true, Y_pred[, k]) Calculate coverage at 1-k places.
measures.dcg_at_k(Y_true, Y_pred[, k]) Calculate Discounted Cumulative Gain (DCG) at 1-k places.
measures.Jain_et_al_inverse_propensity(Y[, A, B]) Calculate inverse propensity as proposed in Jain et al.
measures.Jain_et_al_propensity(Y[, A, B]) Calculate propensity as proposed in Jain et al.
measures.ndcg_at_k(Y_true, Y_pred[, k, …]) Calculate normalized Discounted Cumulative Gain (nDCG) at 1-k places.
measures.psprecision_at_k(Y_true, Y_pred, inv_ps) Calculate Propensity Scored Precision (PSP) at 1-k places.
measures.psrecall_at_k(Y_true, Y_pred, inv_ps) Calculate Propensity Scored Recall (PSR) at 1-k places.
measures.psdcg_at_k(Y_true, Y_pred, inv_ps) Calculate Propensity Scored Discounted Cumulative Gain (PSDCG) at 1-k places.
measures.psndcg_at_k(Y_true, Y_pred, inv_ps) Calculate Propensity Scored normalized Discounted Cumulative Gain (PSnDCG) at 1-k places.
measures.hamming_loss(Y_true, Y_pred) Calculate unnormalized hamming loss - the average number of misclassified labels (unnormalized to avoid very small numbers caused by the large number of labels).
measures.f1_measure(Y_true, Y_pred[, …]) Calculate F1 measure, also known as balanced F-score or F-measure.
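
The propensity-scored (ps*) measures are typically used together with the propensity helpers: first compute inverse propensities from the training labels, then pass them to the measure. A short sketch, assuming these functions also accept a k argument like the other *_at_k measures:

from napkinxc.measures import Jain_et_al_inverse_propensity, psprecision_at_k

inv_ps = Jain_et_al_inverse_propensity(Y_train)       # default A and B as proposed in Jain et al.
print(psprecision_at_k(Y_test, Y_pred, inv_ps, k=5))  # propensity-scored precision@5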
