Wasserstein Dictionary Learning (WDL) model for topic modeling
Arguments
- docs
character vector, sentences to be analyzed
- ...
only for compatibility
- specs
list, model specification for the WDL model; see wdl_specs for reference
- verbose
logical, whether to print progress information
- x
WDL model
- topic
int, number of topics to be printed
- token_per_topic
int, number of tokens to be printed
- object
WDL model
Details
This is a re-implementation of the WDL model from the ground up; it calls the barycenter under the hood (to be precise, it directly calls the underlying C++ routine for the barycenter).
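To illustrate the barycenter step that WDL repeatedly solves, here is a minimal entropic Wasserstein barycenter via iterative Bregman projections, as described in Peyré & Cuturi (2019). This is a NumPy sketch for illustration only, not the package's C++ routine; the function name `sinkhorn_barycenter` and all parameter defaults are made up for this example.

```python
import numpy as np

def sinkhorn_barycenter(A, C, weights, eps=0.02, n_iter=500):
    """Entropic Wasserstein barycenter by iterative Bregman projections
    (illustrative sketch; see Peyre & Cuturi 2019).

    A       : (n, S) array, each column a histogram summing to 1
    C       : (n, n) ground-cost matrix between the n support points
    weights : (S,) barycentric weights summing to 1
    """
    K = np.exp(-C / eps)                        # Gibbs kernel
    V = np.ones_like(A)                         # one scaling vector per input
    for _ in range(n_iter):
        U = A / (K @ V)                         # match each plan's row marginal to A[:, s]
        KtU = K.T @ U
        b = np.prod(KtU ** weights, axis=1)     # geometric mean = barycenter update
        V = b[:, None] / KtU                    # match each plan's column marginal to b
    return b / b.sum()                          # renormalize for numerical safety

# toy example: barycenter of two bumps on a 1-D grid
x = np.linspace(0, 1, 50)
C = (x[:, None] - x[None, :]) ** 2              # squared-distance ground cost
a1 = np.exp(-(x - 0.2) ** 2 / 0.001); a1 /= a1.sum()
a2 = np.exp(-(x - 0.8) ** 2 / 0.001); a2 /= a2.sum()
b = sinkhorn_barycenter(np.stack([a1, a2], axis=1), C, np.array([0.5, 0.5]))
# by symmetry, the barycenter mass concentrates near x = 0.5
```

In WDL, each document's word histogram is modeled as such a barycenter of the topic histograms, with the barycentric weights playing the role of topic proportions.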
References
Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport: With Applications to Data Science. Foundations and Trends® in Machine Learning, 11(5–6), 355–607. https://doi.org/10.1561/2200000073
Schmitz, M. A., Heitz, M., Bonneel, N., Ngolè, F., Coeurjolly, D., Cuturi, M., Peyré, G., & Starck, J.-L. (2018). Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1), 643–678. https://doi.org/10.1137/17M1140431
Xie, F. (2025). Deriving the Gradients of Some Popular Optimal Transport Algorithms (No. arXiv:2504.08722). arXiv. https://doi.org/10.48550/arXiv.2504.08722
Examples
# simple WDL example
sentences <- c("this is a sentence", "this is another one")
wdl_fit <- wdl(
sentences,
specs = wdl_specs(wdl_control = list(num_topics = 2), word2vec_control = list(min_count = 1)),
verbose = TRUE)
#> Preprocessing the data...
#> Running tokenizer on the sentences...
#> Running Word2Vec for the embeddings and distance matrix...
#> `method` is automatically switched to "log"
#> Running WDL in CPU mode...
#> This might take a while depending on the problem size...
#> Running in serial mode...
#> Initializing WDL model with 4 vocabs, 2 docs, and 2 topics...
#> Training WDL model with 2 epochs, 1 batches
#> Epoch 1 of 2, batch 1 of 1:
#> avg speed: 0.00 sec, last speed: 0.00 sec
#> Epoch 2 of 2, batch 1 of 1:
#> avg speed: 0.00 sec, last speed: 0.00 sec
#> Inference on the dataset