GBDT (Gradient Boosted Decision Tree) is currently one of the most popular machine learning tools in industry. Drawing on a few open-source implementations, I recently wrote a stripped-down GBRT (Gradient Boosted Regression Tree); in our applications we are essentially solving ranking problems, so plain regression is usually all that is needed.
The main point of this version is that the code logic is simple, and it uses the TBB library to parallelize across multiple cores (other parallel libraries drag in a pile of dependencies and are awkward to use, while TBB's shared library is only about 1 MB). The code is released on Google Code: http://code.google.com/p/simple-gbdt/
Usage notes:
It is a simple implementation of GBDT (actually a gradient boosted regression tree); compared to other implementations, it is parallelized with the TBB library.
This project can easily be used for regression problems or pointwise learning-to-rank (L2R) problems, and it is also good material for learning how GBDT works.
There are several more features:
1. configurable data sampling
2. parallelization with the TBB library (http://threadingbuildingblocks.org/); see the sketch after this list
3. empty features in L2R data are skipped (to avoid unnecessary cost)
4. two input formats are supported (L2R and CSV)
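The README does not spell out exactly which loops are parallelized, so the following is only a minimal sketch of the typical pattern (my own code, not taken from the project): tbb::parallel_for evaluates candidate split features of a regression tree node in parallel, writing per-feature results into a preallocated vector, followed by a serial reduction.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t n_samples = 1000, n_features = 8;
    // synthetic data: x[f][i] is feature f of sample i, y[i] is the regression target
    std::vector<std::vector<double> > x(n_features, std::vector<double>(n_samples));
    std::vector<double> y(n_samples);
    for (size_t i = 0; i < n_samples; ++i) {
        for (size_t f = 0; f < n_features; ++f)
            x[f][i] = rand() / (double)RAND_MAX;
        y[i] = 2.0 * x[0][i] + x[1][i];
    }

    // best squared error and threshold found for each feature, filled in parallel
    std::vector<double> best_err(n_features), best_thr(n_features);

    tbb::parallel_for(tbb::blocked_range<size_t>(0, n_features),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t f = r.begin(); f != r.end(); ++f) {
                double best = 1e30, thr = 0.0;
                // brute-force a few candidate thresholds for feature f
                for (double t = 0.1; t < 1.0; t += 0.1) {
                    double sl = 0, nl = 0, sr = 0, nr = 0;
                    for (size_t i = 0; i < n_samples; ++i) {
                        if (x[f][i] < t) { sl += y[i]; ++nl; }
                        else             { sr += y[i]; ++nr; }
                    }
                    double ml = nl > 0 ? sl / nl : 0.0;   // mean of the left partition
                    double mr = nr > 0 ? sr / nr : 0.0;   // mean of the right partition
                    double err = 0.0;
                    for (size_t i = 0; i < n_samples; ++i) {
                        double d = y[i] - (x[f][i] < t ? ml : mr);
                        err += d * d;
                    }
                    if (err < best) { best = err; thr = t; }
                }
                best_err[f] = best;
                best_thr[f] = thr;
            }
        });

    // serial reduction: pick the globally best feature/threshold
    size_t best_f = 0;
    for (size_t f = 1; f < n_features; ++f)
        if (best_err[f] < best_err[best_f]) best_f = f;
    printf("best split: feature %zu, threshold %.1f\n", best_f, best_thr[best_f]);
    return 0;
}

Something like g++ -std=c++11 split_sketch.cpp -ltbb (file name hypothetical) builds it; the lambda form of tbb::parallel_for requires C++11, and the project itself may well use the older functor style instead.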
Usage:
-i input file
-f [cvs, l2r] input format, default 'l2r'
-c gbrt model config file, default './gbrt.conf'
-t or -p action type (-t for train, -p for predict)
-m model file, default './gbrt.model'
-d max dimension (only for the L2R format)
--------------------------------------
Example
train example: ./gbrt -t -i train_l2r.dat -d 415
predict example: ./gbrt -p -i test_l2r.dat -d 415 -m gbrt.model
--------------------------------------
Input Format:
l2r: "-1 qid:1234 1:1 3:1 8:1"
cvs: "1,3,-1" last column is label
--------------------------------------
Configure Example:
max_tree_leafes=20
feature_subspace_size=40
use_opt_splitpoint=true
learn_rate=0.01
data_sample_ratio=0.4
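Judging only from the parameter names (the README does not define them here), these knobs map onto the standard stochastic gradient boosting recipe: each round fits a regression tree h_m with at most max_tree_leafes leaves to the current residuals, using a data_sample_ratio fraction of the training rows and feature_subspace_size randomly chosen features, with use_opt_splitpoint presumably toggling the exact split-point search; learn_rate is the shrinkage factor \nu in the additive update

    F_0(x) = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad
    r_i^{(m)} = y_i - F_{m-1}(x_i), \qquad
    F_m(x) = F_{m-1}(x) + \nu \, h_m(x).

A smaller learn_rate generally needs more trees but tends to generalize better.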