Shortly after getting the principal component analysis of my sample data set working, I also got k-means clustering working. As far as I can tell, the cluster quality looks pretty good.
Sample code is up on GitHub: jrbl/wikilytics.
To use it, you'll want a dump of the descriptive statistics for two or more wikis in CSV format. For the 2098 wikis in my sample data, the dataset includes the following statistics:
* Wiki UUID
* RecentChanges count
* Page View count
* Page Edit count
* ViewedPages (count of distinct pages)
* Viewers
* EditedPages (count of distinct pages receiving edits)
* Editors (number of distinct editors, out of registered users)
* CommentedPages (number of pages with comments)
* Feed
* New
* Search
* Attachments
* Email
* Export
* Result of dividing ViewedPages/EditedPages (or 0)
* Result of dividing Viewers/Editors (or 0)
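The two ratio columns at the end of that list could be computed along these lines. This is only a sketch, not the code in the repository: the dictionary keys and the helper names `safe_ratio` and `add_ratio_columns` are made up for illustration, and it assumes that "(or 0)" means the ratio is recorded as 0 when the denominator is 0.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0.

    Assumption: this is what "(or 0)" means in the column list.
    """
    return numerator / denominator if denominator else 0.0


def add_ratio_columns(row):
    """Return a copy of one wiki's stats with the two ratio columns added.

    The key names are hypothetical; adjust them to match your CSV header.
    """
    out = dict(row)
    out["ViewedPerEditedPages"] = safe_ratio(row["ViewedPages"], row["EditedPages"])
    out["ViewersPerEditors"] = safe_ratio(row["Viewers"], row["Editors"])
    return out


# A wiki with readers but no editors gets 0 for the second ratio.
example = add_ratio_columns(
    {"ViewedPages": 10, "EditedPages": 4, "Viewers": 7, "Editors": 0}
)
print(example["ViewedPerEditedPages"], example["ViewersPerEditors"])
```

Recording 0 for an undefined ratio keeps every row numeric, at the cost of making "no editors at all" look the same as "no viewers at all" in that column.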
Clustering to two centroids gives a large set of wikis that are clearly unhealthy (every stat is at, or very close to, 0) and a slightly smaller group of wikis that are harder to classify. Three centroids still gives largely ambiguous results. But when you cluster to four centroids and take the smallest cluster, you get back a list of wikis that looks good to me: when I eyeball their values in the spreadsheet, the three principal columns match my expectations about health.
The four clusters (on my most recent run) break down into groups of sizes 1290, 16, 165, and 627.
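The "cluster, then take the smallest group" step can be sketched in plain Python. This is a minimal illustration of standard (Lloyd's) k-means, not the code from the repository: the function names, the seeded random initialization, and the toy points are all assumptions, and it does no feature scaling, which may or may not match what was done for the real data.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, seed=0, max_iterations=100):
    """Lloyd's algorithm; initial centroids are k points picked at random."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # New centroid is the per-column mean of the cluster members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep the old centroid if a cluster ends up empty.
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids)


def smallest_cluster(points, labels):
    """Return the points belonging to the cluster with the fewest members."""
    sizes = Counter(labels)
    smallest = min(sizes, key=sizes.get)
    return [p for p, label in zip(points, labels) if label == smallest]


# Toy data (not from the wiki dataset): four quiet points and two busy ones.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (100.0, 0.0), (101.0, 0.0)]
labels = kmeans(points, k=2, seed=1)
print(sorted(Counter(labels).values()))
print(smallest_cluster(points, labels))
```

Because the starting centroids are random, different seeds can give different clusters on real data, which is why the sizes are quoted for one particular run.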
Looking more closely at that group of 16, we see the following values:
- RecentChanges, Views, Edits
- 46,1856,268
- 211,1557,380
- 0,133287,19
- 7,1199,1
- 43,4388,783
- 27,4727,62
- 27,23987,144
- 81,19220,405
- 919,3410,231
- 3,2281,31
- 25,7207,29
- 81,3039,687
- 6,2688,3
- 232,1520,327
- 1013,9530,563
- 15,26550,7837
As we can see, all of these wikis have nonzero RecentChanges except for one, which has a very large (comparatively speaking) number of Views. From what I know about collaboration and knowledge sharing tools, I would expect this set to be the wikis that we can call "Healthy". Of the other clusters, the size-1290 one is obviously the "Long Tail" of wikis without users or traffic: most or all of the data fields for most or all of these are zeros.
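The observation about the group of 16 is easy to check against the rows listed earlier. The snippet below just re-enters those sixteen (RecentChanges, Views, Edits) triples; the variable names are mine.

```python
# (RecentChanges, Views, Edits) for the 16 wikis in the smallest cluster.
rows = [
    (46, 1856, 268), (211, 1557, 380), (0, 133287, 19), (7, 1199, 1),
    (43, 4388, 783), (27, 4727, 62), (27, 23987, 144), (81, 19220, 405),
    (919, 3410, 231), (3, 2281, 31), (25, 7207, 29), (81, 3039, 687),
    (6, 2688, 3), (232, 1520, 327), (1013, 9530, 563), (15, 26550, 7837),
]

# Wikis in the group with no recent changes at all.
zero_recent_changes = [row for row in rows if row[0] == 0]

# The wiki with the most views in the group.
most_viewed = max(rows, key=lambda row: row[1])

print(zero_recent_changes)
print(most_viewed)
```

The only row with zero RecentChanges is also the row with the most Views in the group.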
The other two groups are more interesting. They occupy a shadowy realm of wikis-which-might-be-healthy. Or perhaps wikis-for-very-particular-target-groups.
I'm hoping to get a chance to meet with some luminaries of the social software space in the coming weeks and talk over whether there is - or can be - any good single characterization for these collections of items.
Besides that, the obvious next steps (insofar as purely numerical methods are concerned) seem to be refining the k-means clustering experiments on the one hand and, on the other, trying for a reliable, deterministic classification, probably via some kind of simple SVM.
In both cases, my thinking is that the probabilistic nature of k-means is a little troubling. One way to soften the effect might be to cluster the data repeatedly and do a best-of-n over which wikis end up in each cluster, with clusters identified by their size ranking. Another approach might be to calculate the variance explained by each principal component of the data, keeping enough components to explain, say, 95% of the variance and throwing away the rest. In big-O terms this doesn't save a lot of work for pairwise Euclidean distance calculations, but it does save some. You could also just trim that long tail before doing the classification; it's easy to tell what belongs in that set without a fancy algorithm. That removes more than half of the data points to be clustered, which is a big win for any step whose cost grows faster than linearly with the number of points.
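Here is a rough sketch of the three ideas in that paragraph, under stated assumptions. None of this is in the repository: the function names are hypothetical, the long-tail rule ("every stat at or below a threshold") is just one plausible way to define that set, and the voting scheme assumes clusters can be matched across runs by size rank, which breaks down when two clusters are the same size.

```python
from collections import Counter


def trim_long_tail(rows, threshold=0):
    """Split rows into (active, tail); a row is 'tail' if no stat exceeds threshold."""
    active = [row for row in rows if max(row) > threshold]
    tail = [row for row in rows if max(row) <= threshold]
    return active, tail


def components_to_keep(variances, target=0.95):
    """Smallest number of components whose variances sum to >= target of the total."""
    total = sum(variances)
    running = 0.0
    for count, variance in enumerate(sorted(variances, reverse=True), start=1):
        running += variance
        if running / total >= target:
            return count
    return len(variances)


def rank_by_size(labels):
    """Relabel clusters by size rank: 0 for the largest cluster, 1 for the next, etc."""
    sizes = Counter(labels)
    order = sorted(sizes, key=lambda label: (-sizes[label], label))
    rank = {label: position for position, label in enumerate(order)}
    return [rank[label] for label in labels]


def consensus(runs):
    """Best-of-n vote: each point gets the size rank it received most often."""
    ranked = [rank_by_size(labels) for labels in runs]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*ranked)]


# Three made-up runs over six points; cluster ids differ between runs,
# and the third run disagrees with the others on a few points.
runs = [
    [0, 0, 0, 1, 1, 2],
    [2, 2, 2, 0, 0, 1],
    [1, 1, 0, 0, 0, 2],
]
print(consensus(runs))

# Made-up per-component variances that sum to 10.
print(components_to_keep([6.0, 3.0, 0.6, 0.3, 0.1]))

active, tail = trim_long_tail([(0, 0, 0), (5, 0, 1), (0, 0, 0), (0, 2, 0)])
print(len(active), len(tail))
```

Ranking by size is what lets the runs be compared at all, since k-means assigns arbitrary cluster ids on each run.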
Lest I forget to mention it publicly, Eugene pointed out that Luke Closs initially suggested that RecentChanges might be pretty important. Looks like he was right. Thanks, Luke.