The Kaggle NLP competition Mercari Price Suggestion Challenge ends tomorrow; I imagine everyone who entered is praying their kernel doesn't blow up on the new data.
Judging by the public leaderboard standings, China may well gain two new Grandmasters from this competition. Good luck to everyone taking part.
I took a careful look at the baseline kernels from several recent NLP competitions and found that NLP is not as complicated as it seemed before I knew anything about it; a typical pipeline boils down to roughly three steps:
- Split sentences and tokenize with regex or NLTK; depending on the task, also handle stopwords, lemmatization, and so on.
- Vectorize the text with sklearn's CountVectorizer and TfidfTransformer (or Keras's text utilities), and add some other statistical features.
- Build models with NB, GBDT, FM, LR, NN, etc., and ensemble them.
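The three steps above can be sketched end to end in a few lines of sklearn. This is a minimal illustration on a tiny made-up sentiment dataset (the texts and labels are invented for the example), using a regex tokenizer, TF-IDF vectorization, and a Naive Bayes classifier:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 1: a simple regex tokenizer (NLTK would add sentence splitting,
# stopword removal, lemmatization, etc.)
def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

# Toy training data, invented for illustration (1 = positive, 0 = negative)
texts = [
    "great movie, loved it",
    "terrible plot, a waste of time",
    "loved the acting, great fun",
    "a waste of money, terrible",
]
labels = [1, 0, 1, 0]

# Steps 2-3: TF-IDF vectorization feeding a Naive Bayes model
model = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, token_pattern=None),
    MultinomialNB(),
)
model.fit(texts, labels)

preds = model.predict(["loved it", "terrible waste"])
print(preds)
```

A real baseline would swap MultinomialNB for GBDT/FM/LR/NN models, concatenate extra statistical features onto the TF-IDF matrix, and ensemble the results, but the skeleton stays the same.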
Of course, this only gets you a baseline. I have never actually competed myself, so there are probably many tricks for squeezing out more performance, and I can't say whether the key lies in feature engineering or in model design.
Following my usual practice, here is a list of all the NLP competitions I could find, to work through gradually:
Kaggle:
Mercari Price Suggestion Challenge
Toxic Comment Classification Challenge
Personalized Medicine: Redefining Cancer Treatment
Text Normalization Challenge – English Language
Text Normalization Challenge – Russian Language
Transfer Learning on Stack Exchange Tags
The Allen AI Science Challenge
Bag of Words Meets Bags of Popcorn
Microsoft Malware Classification Challenge (BIG 2015)
Sentiment Analysis on Movie Reviews
The Hunt for Prohibited Content
Tradeshift Text Classification
KDD Cup 2014 – Predicting Excitement at DonorsChoose.org
Greek Media Monitoring Multilabel Classification (WISE 2014)
The Big Data Combine Engineered by BattleFin
KDD Cup 2013 – Author-Paper Identification Challenge (Track 1)
KDD Cup 2013 – Author Disambiguation Challenge (Track 2)
Predict Closed Questions on Stack Overflow
Detecting Insults in Social Commentary
Chinese: