2008-05-10

Some thoughts on canopy clustering

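Keywords: data mining, cluster analysis, redpoll
1. The first question is the choice of the cheap distance measure: whether to use one attribute of the data model or some other external attribute. This choice matters most for how the canopies are distributed.
2. The values of T1 and T2 affect the canopy overlap rate f as well as the granularity of the canopies.
3. Canopies help eliminate outliers, which K-means cannot do on its own. After building the canopies, you can delete those that contain few data points, since these canopies often consist of outliers.
4. Choosing the number of cluster centers k based on the number of points inside the canopies tends to give good results.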
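The points above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the redpoll code: the function names (`cheap_distance`, `build_canopies`, `prune_small_canopies`), the choice of the first attribute as the cheap metric, and the `min_size` cutoff are all hypothetical. It assumes T1 > T2 > 0 and follows the usual canopy procedure: pick a remaining point as center, put every point within T1 into its canopy, and drop points within T2 from the candidate list.

```python
def cheap_distance(a, b):
    # Assumption: the cheap metric looks at a single attribute (the first one).
    return abs(a[0] - b[0])


def build_canopies(points, t1, t2, distance=cheap_distance):
    """Return a list of (center, members) pairs. Requires t1 > t2 > 0."""
    if not (t1 > t2 > 0):
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        # Loose threshold T1: canopies may overlap, so scan all points.
        members = [p for p in points if distance(center, p) < t1]
        canopies.append((center, members))
        # Tight threshold T2: these points can no longer become centers.
        remaining = [p for p in remaining if distance(center, p) >= t2]
    return canopies


def prune_small_canopies(canopies, min_size):
    # Hypothetical outlier filter: drop canopies with fewer than min_size points.
    return [(c, m) for c, m in canopies if len(m) >= min_size]
```

One possible reading of point 4 is to take the number of surviving canopies as a starting value for k; weighting by canopy size is another option that this sketch does not attempt.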
2008-05-08

canopy-clustering execution order

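Keywords: canopy, clustering, data mining, mapreduce
The palest ink beats the best memory, so here is a note: NetflixDataPrep (prepare the data) -> NetflixCanopyMaker (produce the canopy centers) -> NetflixCanopyData (assign all points to the canopies) -> NetflixKMeansIter (run k-means clustering). Suppose the data has n records and the second step produces c canopies. The third step then costs n * c distance computations, which is a lot of work even when spread across mappers. It needs to be reworked into an incremental method. Something to look into.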
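To make the n * c cost concrete, here is a small single-machine sketch of the assignment step. It is an illustration only, not the NetflixCanopyData mapper: the function name `assign_to_canopies`, the one-attribute `cheap_distance`, and the comparison counter are hypothetical, added to show where the work goes.

```python
def cheap_distance(a, b):
    # Assumption: a cheap metric over the first attribute only.
    return abs(a[0] - b[0])


def assign_to_canopies(points, centers, t1, distance=cheap_distance):
    """Assign each point to every canopy whose center is within t1.

    Returns (assignment, comparisons), where assignment maps a center
    index to its list of points and comparisons counts distance calls.
    """
    assignment = {i: [] for i in range(len(centers))}
    comparisons = 0
    for p in points:                      # n points
        for i, c in enumerate(centers):   # c canopy centers
            comparisons += 1
            if distance(p, c) < t1:
                assignment[i].append(p)
    return assignment, comparisons
```

Every point is compared against every center, so the counter always ends at exactly len(points) * len(centers), however the loop is split across workers.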
2008-04-27

popular clustering techniques

Keywords: redpoll, clustering, data mining
k-Means, k-Medoids, Kernel Clustering, Spectral Clustering (uses eigenvectors), Gravitational Clustering, Canopy Clustering, Self-Organizing Maps, Expectation Maximization, AGNES, CLARA, DBSCAN, DIANA, BIRCH, and many others.
Today I accidentally found something interesting that may help us operate on large-scale data sets for redpoll. It is a matrix computation library based on Hadoop HBase: http://code.google.com/p/hama/
A few days ago I submitted an application to participate in Apache Mahout, and I have now received a reply from the guru of the project. It gave us a lot of encouragement. We decided that if I am selected by the ASF, we will integrate redpoll into Mahout, which has the same end goals, same lice ...
We are pleased to introduce a new open source project today. It is another machine learning library built on Hadoop, besides Mahout from the ASF (Apache Software Foundation). The project is named redpoll, which means any of several small finches of northern North America and Eurasia, having a red cr ...
coderplay