Topic Model LDA: Source Code Release

June 1, 2024

Please credit the source when reposting: http://blog.csdn.net/yihucha166/article/details/9046835

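Latent Dirichlet Allocation (LDA) is one of the most widely used machine learning methods in industry today. This is a C++ implementation, as-lda, that uses an asymmetric prior. As the number of topics grows, its topic distribution stays more stable than that of the traditional model, and fewer tiny niche topics appear merely because the topic count is large. Reference: "Rethinking LDA: Why Priors Matter". The code directory includes Chinese test data.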

Code: https://code.google.com/p/as-lda/

Asymmetric-prior Latent Dirichlet Allocation (LDA) in C++

LDA implementations usually use a symmetric Dirichlet prior. "Rethinking LDA: Why Priors Matter" shows that an asymmetric prior gives better results and a more stable topic distribution as the number of topics increases, so this project adopts that approach.
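To make the role of the prior concrete, here is a minimal sketch of where a per-topic (asymmetric) alpha enters the standard collapsed Gibbs sampling conditional, p(z = k) ∝ (n_dk + alpha_k) · (n_kw + beta) / (n_k + V · beta). This is not taken from the as-lda source; the function name `topic_weights` and its parameters are hypothetical, and how as-lda actually sets or updates each alpha_k is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unnormalized weights for assigning one token (word w in document d)
// to each topic. With a symmetric prior every alpha[k] would be equal;
// an asymmetric prior simply lets alpha[k] differ per topic.
// NOTE: illustrative sketch only, not the as-lda implementation.
std::vector<double> topic_weights(
    const std::vector<int>& doc_topic_counts,   // n_dk: tokens in doc d assigned to topic k
    const std::vector<int>& topic_word_counts,  // n_kw: times word w is assigned to topic k
    const std::vector<int>& topic_totals,       // n_k: total tokens assigned to topic k
    const std::vector<double>& alpha,           // per-topic document-topic prior
    double beta,                                // symmetric topic-word prior
    std::size_t vocab_size) {                   // V
    const std::size_t K = alpha.size();
    assert(doc_topic_counts.size() == K &&
           topic_word_counts.size() == K &&
           topic_totals.size() == K);

    std::vector<double> weights(K);
    for (std::size_t k = 0; k < K; ++k) {
        weights[k] = (doc_topic_counts[k] + alpha[k]) *
                     (topic_word_counts[k] + beta) /
                     (topic_totals[k] + vocab_size * beta);
    }
    return weights;
}
```

With all counts at zero the weights are proportional to alpha alone, so a topic with a larger alpha_k is favored before any data has been seen; once counts accumulate, the data terms dominate.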

Other features:
#easy to use, easy to understand
#low memory usage

ML tools source code:
as-lda: https://code.google.com/p/as-lda/
gbdt: http://code.google.com/p/simple-gbdt/
adaboost: http://code.google.com/p/simple-adaboost/

-------- how to use it --------

Usage:  
  -c  corpus file, default './corpus.txt'  
  -v  vocab file, default './vocab.txt'  
  -e or -i  act type(e for estimate,i for inference)  
  -m  model files dir, default './models'  
  -z  pre model assignment file ( inference )  
  -a  hyperparameter alpha, default 500/topic_num 
  -b  hyperparameter beta, default 0.1 
  -k  topic number, default 100 
  -n  max iteration number, default 1000
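
The defaults listed above can be collected in a small struct. This is a hypothetical sketch, not the project's actual option handling: `AsLdaOptions`, its field names, and `effective_alpha` are invented here for illustration. It only makes explicit that the default alpha depends on the topic number, so with the default `-k 100` the default alpha is 500 / 100 = 5.

```cpp
#include <string>

// Hypothetical mirror of the documented command-line defaults.
struct AsLdaOptions {
    std::string corpus_file = "./corpus.txt";  // -c
    std::string vocab_file = "./vocab.txt";    // -v
    std::string model_dir = "./models";        // -m
    double alpha = -1.0;   // -a; a negative value here means "not set by the user"
    double beta = 0.1;     // -b
    int topic_num = 100;   // -k
    int max_iter = 1000;   // -n

    // Documented default: alpha = 500 / topic_num when -a is not given.
    double effective_alpha() const {
        return alpha > 0.0 ? alpha : 500.0 / topic_num;
    }
};
```

Under this reading, raising the topic number without passing `-a` lowers the default alpha proportionally (for example, 200 topics gives 2.5).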

Examples:

estimate: ./as_lda -e -c ./corpus.txt -v ./vocab.txt -n 2000

inference: ./as_lda -i -n 100 -c corpus.txt.test -v vocab.txt -z ./models/model.z

-------- input format --------
For corpus:

    one document per line; each number is a word id, and ids are separated by tab characters (written as \t in the example)
    example:
    2699\t10608\t52656\t17781\t17781\t7900\t24007
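
A minimal sketch of reading this corpus format follows. It is not the as-lda loader; `parse_doc_line` and `read_corpus` are hypothetical names, and the sketch assumes ids fit in an `int` and accepts any whitespace (not only tabs) between them.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Parse one corpus line into word ids. Stream extraction skips tabs and
// spaces alike, and stops at the first token that is not an integer.
std::vector<int> parse_doc_line(const std::string& line) {
    std::vector<int> ids;
    std::istringstream in(line);
    int id;
    while (in >> id) {
        ids.push_back(id);
    }
    return ids;
}

// Read a whole corpus: one document per line, in file order.
std::vector<std::vector<int>> read_corpus(std::istream& in) {
    std::vector<std::vector<int>> docs;
    std::string line;
    while (std::getline(in, line)) {
        docs.push_back(parse_doc_line(line));
    }
    return docs;
}
```

Repeated ids on a line (such as 17781 in the example) are kept as separate tokens, since each occurrence of a word gets its own topic assignment during sampling.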

For vocab:
One word per line; the word id is the line number.
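
A minimal sketch of loading the vocab file and mapping ids back to words is below. The text does not say whether line numbers start at 0 or 1; this sketch assumes 0-based ids (the first line is word id 0), which should be checked against the test data shipped with the code. `read_vocab` and `decode_doc` are hypothetical names, not functions from as-lda.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Load the vocabulary so that words[id] is the word for that id.
// ASSUMPTION: ids are 0-based line numbers.
std::vector<std::string> read_vocab(std::istream& in) {
    std::vector<std::string> words;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();  // tolerate Windows line endings
        }
        words.push_back(line);
    }
    return words;
}

// Turn a document of word ids back into space-separated words.
// Ids outside the vocabulary are rendered as "<unk>".
std::string decode_doc(const std::vector<int>& ids,
                       const std::vector<std::string>& vocab) {
    std::ostringstream out;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i > 0) out << ' ';
        const int id = ids[i];
        if (id >= 0 && static_cast<std::size_t>(id) < vocab.size()) {
            out << vocab[static_cast<std::size_t>(id)];
        } else {
            out << "<unk>";
        }
    }
    return out.str();
}
```

If the ids turn out to be 1-based, subtracting 1 before indexing in `decode_doc` is the only change needed.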