napkinxc.datasets.load_dataset

napkinxc.datasets.load_dataset(dataset, subset='train', format='bow', root='./data', verbose=False)

Downloads the dataset from the internet into the root directory; if the dataset has already been downloaded, it is not downloaded again. Then loads the requested subset into a feature matrix and labels.

Parameters:
  • dataset (str) –

    Name of the dataset to load, case insensitive, available datasets:

    • 'Eurlex-4K' ('bow' format only),

    • 'Eurlex-4.3K' ('bow' format only),

    • 'AmazonCat-13K',

    • 'AmazonCat-14K',

    • 'Wiki10-31K' (alias: 'Wiki10', 'bow' format only),

    • 'DeliciousLarge-200K' (alias: 'DeliciousLarge', 'bow' format only),

    • 'WikiLSHTC-325K' (alias: 'WikiLSHTC', 'bow' format only),

    • 'WikiSeeAlsoTitles-350K',

    • 'WikiTitles-500K',

    • 'WikipediaLarge-500K' (alias: 'WikipediaLarge'),

    • 'AmazonTitles-670K',

    • 'Amazon-670K',

    • 'AmazonTitles-3M',

    • 'Amazon-3M',

    • 'LF-AmazonTitles-131K' (for now 'bow' format only),

    • 'LF-Amazon-131K' (for now 'bow' format only),

    • 'LF-WikiSeeAlsoTitles-320K' (for now 'bow' format only),

    • 'LF-WikiSeeAlso-320K' (for now 'bow' format only),

    • 'LF-WikiTitles-500K' (for now 'bow' format only),

    • 'LF-AmazonTitles-1.3M' (for now 'bow' format only).

  • subset (str, optional) – Subset of the dataset to load, one of {'train', 'test', 'validation'}, defaults to 'train'

  • format (str, optional) – Format of dataset to load {'bow' (bag-of-words/tf-idf weights, alias 'tf-idf'), 'raw' (raw text)}, defaults to 'bow'

  • root (str, optional) – Location of datasets directory, defaults to './data'

  • verbose (bool, optional) – If True, prints download and loading progress, defaults to False

Returns:

Tuple of the feature matrix and labels.

Return type:

(csr_matrix, list[list[int]]) or (list[str], list[list[str]])
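
Example:

A minimal usage sketch, assuming only the behaviour documented above (dataset names are illustrative; in 'bow' format the feature matrix is a csr_matrix and labels are lists of integer indices):

    from napkinxc.datasets import load_dataset

    # Download (if not already cached under ./data) and load Eurlex-4K
    # in bag-of-words (tf-idf) format; dataset names are case insensitive.
    X_train, Y_train = load_dataset("eurlex-4k", subset="train", format="bow", verbose=True)
    X_test, Y_test = load_dataset("eurlex-4k", subset="test", format="bow", verbose=True)

    # In 'bow' format X is a scipy.sparse.csr_matrix and Y is a list of lists of label indices.
    print(X_train.shape)   # (number of examples, number of features)
    print(Y_train[0])      # label indices of the first training example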