This CRAN Task View contains a list of packages that can be
used for finding groups in data and modelling unobserved
cross-sectional heterogeneity. Many packages provide functionality for
more than one of the topics listed below; the section headings are
mainly meant as quick starting points rather than a definitive
categorization. Except for packages stats and cluster (which ship with
base R and hence are part of every R installation), each package is
listed only once.
Hierarchical Clustering:
-
Functions
hclust()
from package stats and
agnes()
from
cluster
are the primary
functions for agglomerative hierarchical clustering; function
diana()
can be used for divisive hierarchical
clustering. A faster alternative to
hclust()
is
provided by
flashClust.
-
Function
as.dendrogram()
from stats and associated methods can
be used for improved visualization of cluster dendrograms.
-
pvclust
is a package for assessing the uncertainty in
hierarchical cluster analysis. It provides approximately
unbiased p-values as well as bootstrap p-values.
-
hybridHclust
implements hybrid hierarchical
clustering via mutual clusters.
-
Package
dynamicTreeCut
contains methods for detection
of clusters in hierarchical clustering dendrograms.
-
Package
LLAhclust
provides likelihood linkage analysis
hierarchical clustering.
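As a minimal sketch of the agglomerative and divisive functions mentioned above, the following uses only stats and cluster on the built-in USArrests data. The choice of data set, the average linkage and the cut at four groups are assumptions made for illustration, not recommendations.

```r
library(cluster)  # provides agnes() and diana()

# Standardize the built-in USArrests data and compute Euclidean distances.
x <- scale(USArrests)
d <- dist(x)

# Agglomerative clustering with stats::hclust() and cluster::agnes().
hc <- hclust(d, method = "average")
ag <- agnes(d, method = "average")

# Divisive clustering with cluster::diana().
di <- diana(d)

# Cut each tree into four groups (k = 4 is an arbitrary illustrative choice).
groups_hc <- cutree(hc, k = 4)
groups_ag <- cutree(as.hclust(ag), k = 4)
groups_di <- cutree(as.hclust(di), k = 4)

# Cross-tabulate two of the partitions to see how far they agree.
print(table(hclust = groups_hc, agnes = groups_ag))

# Convert to a dendrogram object, which has its own plot and cut methods.
dend <- as.dendrogram(hc)
print(dend)
```

Calling plot(dend) on the resulting object draws the tree; the agnes and diana results can be plotted directly as well.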
Partitioning Clustering:
-
Function
kmeans()
from package stats provides
several algorithms for computing partitions with respect to
Euclidean distance.
-
Function
pam()
from package
cluster
implements
partitioning around medoids and can work with arbitrary
distances. Function
clara()
is a
wrapper to
pam()
for larger data sets. Silhouette plots
and spanning ellipses can be used for visualization.
-
Package
clusterSim
supports the search for the optimal
clustering procedure for a given dataset.
-
Package
flexclust
provides k-centroid cluster
algorithms for arbitrary distance measures, hard competitive
learning, neural gas and QT clustering. Neighborhood graphs and
image plots of partitions are available for visualization. Some of
this functionality is also provided by package
cclust.
-
Package
trimcluster
provides trimmed k-means
clustering.
-
Package
bayesclust
allows testing and searching for
clusters in a hierarchical Bayes model.
-
Package
clues
provides a clustering method based on
local shrinking.
-
Package
skmeans
provides spherical k-means clustering,
i.e. k-means clustering with cosine similarity. It features several
methods, including a genetic and a simple fixed-point algorithm and
an interface to the CLUTO vcluster program for clustering
high-dimensional datasets.
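The partitioning functions from stats and cluster described above can be tried side by side. This is a rough sketch on the built-in iris measurements; the data set, k = 3, the Manhattan distance for pam() and the use of the average silhouette width as a summary are all illustrative assumptions.

```r
library(cluster)  # provides pam(), clara() and silhouette()

set.seed(1)  # kmeans() and clara() involve random starts or subsamples
x <- scale(iris[, 1:4])

# k-means with respect to Euclidean distance, restarted 20 times.
km <- kmeans(x, centers = 3, nstart = 20)

# pam() accepts an arbitrary dissimilarity; Manhattan is used as an example.
d_man <- dist(x, method = "manhattan")
pm <- pam(d_man, k = 3)

# clara() runs the PAM idea on subsamples, which helps with larger data sets.
cl <- clara(x, k = 3, samples = 20)

# Average silhouette width of each partition as a rough quality summary.
sil_km <- silhouette(km$cluster, dist(x))
avg_sil <- c(kmeans = mean(sil_km[, "sil_width"]),
             pam    = pm$silinfo$avg.width,
             clara  = cl$silinfo$avg.width)
print(avg_sil)
```

The three values are not strictly comparable, because each silhouette is computed on the dissimilarity its own method used. plot(silhouette(pm)) and clusplot(cl) produce the silhouette plot and the spanning ellipses mentioned above.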
Model-based Clustering:
-
ML estimation:
-
Package
mclust
fits mixtures of Gaussians using the EM
algorithm. It allows fine control of volume and shape of
covariance matrices and agglomerative hierarchical clustering
based on maximum likelihood. It provides comprehensive strategies
using hierarchical clustering, EM and the Bayesian Information Criterion
(BIC) for clustering, density estimation, and discriminant
analysis.
-
prabclus
clusters a presence-absence matrix
object by calculating an MDS
from the distances, and applying maximum likelihood Gaussian
mixtures clustering to the MDS
points.
-
Package
MFDA
implements model-based functional data
analysis.
-
Package
GLDEX
fits mixtures of generalized lambda
distributions; for grouped conditional data, package
mixdist
can be used.
-
Package
mixRasch
estimates mixture Rasch models,
including the dichotomous Rasch model, the rating scale model, and the
partial credit model with joint maximum likelihood estimation.
-
Bayesian estimation:
-
Bayesian estimation of finite mixtures of multivariate Gaussians
is possible using package
bayesm. The package provides
functionality for sampling from such a mixture as well as estimating
the model using Gibbs sampling. Additional functionality for
analyzing the MCMC chains covers averaging
the moments over MCMC draws, determining the marginal densities,
clustering observations, and plotting the uni- and bivariate
marginal densities.
-
Package
bayesmix
provides Bayesian estimation using
JAGS. Bayesian estimation using a variational approach for
multivariate Gaussian distributions with a diagonal covariance matrix
is provided by package
vabayelMix.
-
Package
Bmix
provides Bayesian sampling for
stick-breaking mixtures.
-
Package
bclust
allows Bayesian clustering using a
spike-and-slab hierarchical model and is suitable for clustering
high-dimensional data.
-
Package
mixAK
contains a mixture of statistical
methods, including MCMC methods to analyze normal mixtures with
possibly censored data.
-
Package
EMCC
provides evolutionary Monte Carlo (EMC)
methods for clustering.
-
Package
GSM
fits mixtures of gamma distributions.
-
Package
mcclust
implements methods for processing a
sample of (hard) clusterings, e.g. the MCMC output of a Bayesian
clustering model. Among them are methods that find a single best
clustering to represent the sample, which are based on the posterior
similarity matrix or a relabelling algorithm.
-
Package
rjags
provides an interface to the JAGS
MCMC library which includes a module for mixture modelling.
-
Other estimation methods:
-
Package
AdMit
fits an adaptive mixture of Student-t
distributions to approximate a target density through its kernel
function.
-
Robust estimation using Weighted Likelihood can be done with
package
wle.
-
Package
pendensity
estimates densities with a penalized
mixture approach.
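To make the idea of maximum likelihood estimation of a Gaussian mixture via EM concrete, here is a hand-written base R sketch for the univariate two-component case. It is not the implementation used by mclust or any other package listed above; the function name em_gauss2, the median-split starting values and the faithful data set are assumptions of this example.

```r
# EM for a two-component univariate Gaussian mixture (illustrative only).
em_gauss2 <- function(y, max_iter = 500, tol = 1e-8) {
  # Crude starting values from a median split (an assumption of this sketch).
  lower <- y <= median(y)
  p  <- 0.5
  mu <- c(mean(y[lower]), mean(y[!lower]))
  s  <- c(sd(y[lower]), sd(y[!lower]))
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that each point belongs to component 1,
    # together with the log-likelihood at the current parameters.
    d1 <- p * dnorm(y, mu[1], s[1])
    d2 <- (1 - p) * dnorm(y, mu[2], s[2])
    loglik <- sum(log(d1 + d2))
    z <- d1 / (d1 + d2)
    # M-step: weighted updates of mixing weight, means and standard deviations.
    p  <- mean(z)
    mu <- c(sum(z * y) / sum(z), sum((1 - z) * y) / sum(1 - z))
    s  <- c(sqrt(sum(z * (y - mu[1])^2) / sum(z)),
            sqrt(sum((1 - z) * (y - mu[2])^2) / sum(1 - z)))
    # Stop once the log-likelihood no longer improves noticeably.
    if (loglik - loglik_old < tol) break
    loglik_old <- loglik
  }
  # loglik and cluster refer to the last E-step that was carried out.
  list(weight = c(p, 1 - p), mean = mu, sd = s, loglik = loglik,
       n_par = 5, cluster = ifelse(z > 0.5, 1L, 2L))
}

fit <- em_gauss2(faithful$eruptions)

# BIC in the "smaller is better" convention; some packages use the opposite sign.
bic <- -2 * fit$loglik + fit$n_par * log(length(faithful$eruptions))
print(fit[c("weight", "mean", "sd")])
print(bic)
```

In practice the packages above add what this sketch leaves out, such as multivariate covariance structures, better initialization and model selection over the number of components.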
Other Cluster Algorithms:
-
Package
amap
provides alternative implementations
of k-means and agglomerative hierarchical clustering.
-
Package
biclust
provides several algorithms to find
biclusters in two-dimensional data.
-
Package
cba
implements clustering techniques for
business analytics such as "rock" and "proximus".
-
Package
CHsharp
clusters 3-dimensional data into
their local modes based on a convergent form of Choi and Hall's
(1999) data sharpening method.
-
Package
clue
implements ensemble methods for both
hierarchical and partitioning cluster methods.
-
Fuzzy clustering and bagged clustering are available in
package
e1071.
-
Package
compHclust
provides complementary
hierarchical clustering which was especially designed for microarray
data to uncover structures present in the data that arise from
'weak' genes.
-
Package
FactoClass
performs a combination of
factorial methods and cluster analysis.
-
The
hopach
algorithm is a hybrid between
hierarchical methods and PAM and builds a tree by
recursively partitioning a data set.
-
For graphs and networks, model-based clustering approaches are
implemented in packages
latentnet
and
mixer.
-
Package
nnclust
allows fast clustering of large data sets
by constructing a minimum spanning tree for each cluster. For each
cluster the procedure is stopped when the nearest-neighbour distance
rises above a specified threshold. A set of clusters and a
set of "outliers" not in any cluster is returned. The algorithm works best for
well-separated clusters in up to 8 dimensions, and sample sizes up
to hundreds of thousands.
-
Package
randomLCA
provides the fitting of latent
class models which optionally also include a random effect. Package
poLCA
allows for polytomous variable latent class
analysis and regression.
-
Package
RPMM
fits recursively partitioned mixture
models for Beta and Gaussian mixtures. This is a model-based
clustering algorithm that returns a hierarchy of classes, similar to
hierarchical clustering, but also similar to finite mixture
models.
-
Package
segclust
fits a segmentation/clustering
model. A mixture of univariate Gaussian distributions is used for
the cluster structure, and segments are assumed to arise from
switching between clusters over time.
-
Self-organizing maps are available in package
som.
-
Several packages provide cluster algorithms which have been
developed for bioinformatics applications. These packages include
FunCluster
for profiling microarray expression data,
MMG
for mixture models on graphs,
ORIClust
for order-restricted information-based clustering and
varmixt
for mixture models on the variance.
Additional Functionality:
-
Mixtures of univariate normal distributions can be printed
and plotted using package
nor1mix.
-
Packages
gcExplorer
and
clusterfly
allow
visualization of the results of clustering algorithms.
-
Package
clusterGeneration
contains functions for
generating random clusters and random covariance/correlation
matrices, calculating a separation index (data and population
version) for pairs of clusters or cluster distributions, and 1-D and
2-D projection plots to visualize clusters.
Alternatively,
MixSim
generates a finite mixture model
with Gaussian components for prespecified levels of maximum and/or
average overlaps. This model can be used to simulate data for
studying the performance of cluster algorithms.
-
For cluster validation, package
clusterRepro
tests the
reproducibility of a cluster. Package
clv
contains
popular internal and external cluster validation methods ready to
use for most of the outputs produced by functions from package
cluster.
Package
clValid
calculates several
stability measures.
-
Package
clustTool
provides a GUI for clustering data
with spatial information.
-
Package
clustvarsel
provides variable selection for
model-based clustering.
Cluster-wise Regression:
-
Package
flexmix
implements a user-extensible
framework for EM-estimation of mixtures of regression models,
including mixtures of (generalized) linear models.
-
Package
fpc
provides fixed-point methods both for
model-based clustering and linear regression. A collection of
asymmetric projection methods can be used to plot various
aspects of a clustering.
-
Multigroup mixtures of latent Markov models on mixed categorical
and continuous data (including time series) can be fitted using
depmix
or
depmixS4. The parameters are
optimized using a general purpose optimization routine given linear
and nonlinear constraints on the parameters.
-
Package
mixreg
fits mixtures of one-variable
regressions and provides the bootstrap test for the number of
components.
-
Mixed-mode latent class regression with special focus on
longitudinal data is implemented by
mmlcr. The components
can follow a multivariate distribution of a (censored) Gaussian,
multinomial, negative binomial or Poisson distribution. In addition
concomitant variables can be specified to model the priors.
-
moc
fits mixture models to multivariate mixed data
using a Newton-type algorithm. The component specific distribution
may have one, two or three parameters. Covariates and concomitant
variables can be specified as well as constraints for the
parameters.
-
mixtools
provides fitting with the EM algorithm of
mixtures of multinomials, multivariate normals, normals with
repeated measures, Poisson regressions and Gaussian regressions
(with random effects) and with the Metropolis-Hastings algorithm of
mixtures of Gaussian regressions.
-
mixPHM
fits mixtures of proportional hazard models
with the EM algorithm.
-
Package
gamlss.mx
fits finite mixtures of gamlss
family distributions.
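The EM estimation of a mixture of linear regressions, as offered by several packages in this section, can be sketched in base R with weighted least squares. This is a hand-rolled illustration on simulated data, not the algorithm of flexmix or any other listed package; the simulated lines, the residual-sign initialization and all variable names are assumptions of this example.

```r
set.seed(42)  # simulated data, purely illustrative
n <- 200
x <- runif(n, 0, 10)
comp <- rep(1:2, each = n / 2)
y <- ifelse(comp == 1, 2 * x, 15 + 0.5 * x) + rnorm(n, sd = 0.5)

X <- cbind(1, x)  # design matrix used to compute residuals

# Soft initial responsibilities for component 1 from the sign of the
# pooled OLS residuals (a simple heuristic, assumed for this sketch).
z <- ifelse(residuals(lm(y ~ x)) > 0, 0.9, 0.1)
loglik_path <- numeric(0)

for (iter in seq_len(200)) {
  # M-step: weighted least squares per component, weighted residual
  # standard deviations, and the mixing proportion.
  b1 <- coef(lm(y ~ x, weights = z))
  b2 <- coef(lm(y ~ x, weights = 1 - z))
  r1 <- y - drop(X %*% b1)
  r2 <- y - drop(X %*% b2)
  s1 <- sqrt(sum(z * r1^2) / sum(z))
  s2 <- sqrt(sum((1 - z) * r2^2) / sum(1 - z))
  pi1 <- mean(z)

  # E-step: update responsibilities and record the log-likelihood.
  d1 <- pi1 * dnorm(r1, sd = s1)
  d2 <- (1 - pi1) * dnorm(r2, sd = s2)
  loglik_path <- c(loglik_path, sum(log(d1 + d2)))
  z <- d1 / (d1 + d2)

  # Stop once the log-likelihood gain becomes negligible.
  if (iter > 1 && diff(tail(loglik_path, 2)) < 1e-8) break
}

print(rbind(component1 = b1, component2 = b2))
print(c(sd1 = s1, sd2 = s2, weight1 = pi1))
```

Like any EM fit, the result depends on the starting values, so real analyses usually restart from several initializations and compare the final log-likelihoods.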