Wasserstein Dictionary Learning (WDL) model for topic modeling

Usage

wdl(docs, ...)

# S3 method for class 'character'
wdl(docs, specs = wdl_specs(), verbose = TRUE, ...)

# S3 method for class 'wdl'
print(x, topic = 0, token_per_topic = 10, ...)

# S3 method for class 'wdl'
summary(object, topic = 1, token_per_topic = 10, ...)

Arguments

docs

character vector, sentences to be analyzed

...

additional arguments, only for S3 method compatibility

specs

list, model specification for the WDL; see wdl_specs for reference

verbose

logical, whether to print progress information

x

a fitted WDL model object

topic

int, number of topics to be printed

token_per_topic

int, number of tokens to be printed per topic

object

a fitted WDL model object

Value

the topics and weights computed by the WDL model from the input data

Details

This is a ground-up re-implementation of the WDL model. Under the hood it calls barycenter (more precisely, it calls the underlying C++ barycenter routine directly).
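The barycenter routine mentioned above is the core building block of WDL: following Schmitz et al. (2018), each document's word histogram is represented as a Wasserstein barycenter of learned topic distributions. As a conceptual illustration only (not the package's actual C++ routine or its interface), here is a minimal base-R sketch of an entropic Sinkhorn barycenter via iterative Bregman projections; the function name and arguments (`A`, `w`, `C`, `eps`) are illustrative:

```r
# Entropic Wasserstein barycenter via iterative Bregman projections
# (conceptual sketch, not the package internals).
# A:    n x S matrix; each column is a histogram (sums to 1)
# w:    length-S barycentric weights (sum to 1)
# C:    n x n ground cost matrix
# eps:  entropic regularization strength
sinkhorn_barycenter <- function(A, w, C, eps = 0.1, iters = 200) {
  K  <- exp(-C / eps)                 # Gibbs kernel
  Kt <- t(K)
  V  <- matrix(1, nrow(A), ncol(A))   # dual scalings, one column per input
  for (it in seq_len(iters)) {
    U   <- A / (K %*% V)              # match the marginals of the inputs
    KtU <- Kt %*% U
    b   <- exp(as.vector(log(KtU) %*% w))  # weighted geometric mean
    V   <- b / KtU                    # match the barycenter marginal
  }
  b / sum(b)                          # normalize to a probability vector
}
```

In the WDL setting, the topic distributions play the role of the columns of `A`, and the per-document weights `w` are the quantities being learned.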

References

Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport: With Applications to Data Science. Foundations and Trends® in Machine Learning, 11(5–6), 355–607. https://doi.org/10.1561/2200000073

Schmitz, M. A., Heitz, M., Bonneel, N., Ngolè, F., Coeurjolly, D., Cuturi, M., Peyré, G., & Starck, J.-L. (2018). Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1), 643–678. https://doi.org/10.1137/17M1140431

Xie, F. (2025). Deriving the Gradients of Some Popular Optimal Transport Algorithms (No. arXiv:2504.08722). arXiv. https://doi.org/10.48550/arXiv.2504.08722

Examples

# simple WDL example
sentences <- c("this is a sentence", "this is another one")
wdl_fit <- wdl(
  sentences,
  specs = wdl_specs(
    wdl_control = list(num_topics = 2),
    word2vec_control = list(min_count = 1)
  ),
  verbose = TRUE
)
#> Preprocessing the data...
#> Running tokenizer on the sentences...
#> Running Word2Vec for the embeddings and distance matrix...
#> `method` is automatically switched to "log"
#> Running WDL in CPU mode...
#> This might take a while depending on the problem size...
#> Running in serial mode...
#> Initializing WDL model with 4 vocabs, 2 docs, and 2 topics...
#> Training WDL model with 2 epochs, 1 batches
#> Epoch 1 of 2, batch 1 of 1:
#> avg speed: 0.00 sec, last speed: 0.00 sec
#> Epoch 2 of 2, batch 1 of 1:
#> avg speed: 0.00 sec, last speed: 0.00 sec
#> Inference on the dataset
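
The fitted model can then be inspected with the print and summary methods documented above. A short continuation of the example (it assumes `wdl_fit` from the fit above; no output is shown since it depends on the fitted topics):

```r
# inspect the fitted topics using the S3 methods documented in Usage
print(wdl_fit, topic = 1, token_per_topic = 5)
summary(wdl_fit, topic = 1, token_per_topic = 5)
```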