Which machine learning algorithms does Spark MLlib support?

June 17, 2023 · Source: HxLiang

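Spark 2.1 MLlib
For anyone evaluating Spark as a machine learning platform, the first question is usually how many machine learning algorithms Spark MLlib supports.
The answer is simple: the lists below cover what is available, so pick whatever fits your needs.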
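Data types:
• Local vector (dense or sparse)
• Labeled point (a dense or sparse vector paired with a label)
• Local matrix (dense or sparse)
• Distributed matrix
o RowMatrix (row-oriented matrix; each row is a local vector)
o IndexedRowMatrix (row-oriented matrix with row indices; each row is an IndexedRow)
o CoordinateMatrix (coordinate-format matrix, suited to sparse matrices; each entry is a MatrixEntry)
o BlockMatrix (matrix stored as blocks)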
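The vector (1.0, 0.0, 1.0, 3.0) is written as [1.0, 0.0, 1.0, 3.0] in dense format and as (4, [0, 2, 3], [1.0, 1.0, 3.0]) in sparse format. The leading 4 is the length of the vector (its number of elements), [0, 2, 3] is the indices array, and [1.0, 1.0, 3.0] is the values array: the value at index 0 is 1.0, the value at index 2 is 1.0, the value at index 3 is 3.0, and every other position is 0. Matrices follow the same idea.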
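The sparse notation described above can be expanded mechanically. The following is a minimal plain-Python sketch of that layout, not Spark's own API: the helper names `sparse_to_dense` and `dense_to_sparse` are hypothetical and exist only to illustrate the (size, indices, values) triple.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size  # every position not listed in indices stays 0
    for i, v in zip(indices, values):
        if not 0 <= i < size:
            raise IndexError(f"index {i} out of range for size {size}")
        dense[i] = v
    return dense


def dense_to_sparse(dense):
    """Collapse a dense list into a (size, indices, values) triple,
    keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


if __name__ == "__main__":
    # The example vector from the text, in both directions.
    print(sparse_to_dense(4, [0, 2, 3], [1.0, 1.0, 3.0]))
    print(dense_to_sparse([1.0, 0.0, 1.0, 3.0]))
```

In PySpark itself, the same two representations are normally built with `Vectors.dense([1.0, 0.0, 1.0, 3.0])` and `Vectors.sparse(4, [0, 2, 3], [1.0, 1.0, 3.0])` from `pyspark.mllib.linalg`; the sketch above only mirrors the storage idea and needs no Spark installation.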
Algorithm packages:
• Basic statistics
o summary statistics
o correlations
o stratified sampling
o hypothesis testing
o streaming significance testing (significance tests on streaming data)
o random data generation
• Classification and regression
o linear models (SVMs, logistic regression, linear regression)
o naive Bayes
o decision trees
o ensembles of trees (Random Forests and Gradient-Boosted Trees)
o isotonic regression
• Collaborative filtering
o alternating least squares (ALS)
• Clustering
o k-means
o Gaussian mixture
o power iteration clustering (PIC)
o latent Dirichlet allocation (LDA), a three-level Bayesian probabilistic model
o bisecting k-means
o streaming k-means
• Dimensionality reduction
o singular value decomposition (SVD)
o principal component analysis (PCA)
• Feature extraction and transformation
• Frequent pattern mining
o FP-growth (association analysis algorithm)
o association rules
o PrefixSpan (sequential pattern analysis algorithm)
• Evaluation metrics
• PMML model export
• Optimization (developer)
o stochastic gradient descent
o limited-memory BFGS (L-BFGS), a quasi-Newton method

    Original author: HxLiang
    Original URL: https://www.jianshu.com/p/b2a4886b90f0