Dear Emacs, please make this -*-Text-*- mode! CHANGES IN tm VERSION 0.5-1 BUG FIXES o preprocessReut21578XML() no longer generates invalid file names. CHANGES IN tm VERSION 0.5 SIGNIFICANT USER-VISIBLE CHANGES o All classes, functions, and generics are reimplemented using the S3 class system. o Following functions have been renamed: activateCluster() to tm_startCluster() , asPlain() to as.PlainTextDocument(), deactivateCluster() to tm_stopCluster() , tmFilter() to tm_filter() , tmIndex() to tm_index() , tmIntersect() to tm_intersect() , and tmMap() to tm_map() . o Mail handling functionality is factored out to the tm.plugin.mail package. DEPRECATED & DEFUNCT o Following functions have been removed: tmTolower() (use tolower() instead), and replacePatterns() (use gsub() instead). CHANGES IN tm VERSION 0.4 SIGNIFICANT USER-VISIBLE CHANGES o The Corpus class is now virtual providing an abstract interface. o VCorpus, the default implementation of the abstract corpus interface (by subclassing), provides a corpus with volatile (= standard R object) semantics. It loads all documents into memory, and is designed for small to medium-sized data sets. o PCorpus, an implementation of the abstract corpus interface (by subclassing), provides a corpus with permanent storage semantics. The actual data is stored in an external database (file) object (as supported by the filehash package), with automatic (un-)loading into memory. It is designed for systems with small memory. o Language codes are now in ISO 639-2 (instead of ISO 639-1). o Reader functions no longer have a load argument for lazy loading. NEW FEATURES o The reader function readReut21578XMLasPlain() was added and combines the functionality of readReut21578XML() and asPlain(). BUG FIXES o weightTfIdf() no longer applies a binary weighting to an input matrix in term frequency format (which happened only in 0.3-4). CHANGES IN tm VERSION 0.3-4 SIGNIFICANT USER-VISIBLE CHANGES o .onLoad() no longer tries to start a MPI cluster (which often failed in misconfigured environments). Use activateCluster() and deactivateCluster() instead. o DocumentTermMatrix (the improved reimplementation of defunct TermDocMatrix) does not use the Matrix package anymore. NEW FEATURES o The DirSource() constructor now accepts the two new (optional) arguments pattern and ignore.case. With pattern one can define a regular expression for selecting only matching files, and ignore.case specifies whether pattern-matching is case-sensitive. o The readNewsgroup() reader function can now be configured for custom date formats (via the DateFormat argument). o The readPDF() reader function can now be configured (via the PdfinfoOptions and PdftotextOptions arguments). o The readDOC() reader function can now be configured (via the AntiwordOptions argument). o Sources now can be vectorized. This allows faster corpus construction. o New XMLSource class for arbitrary XML files. o The new readTabular() reader function allows to create a custom tailor-made reader configured via mappings from a tabular data structure. o The new readXML() reader function allows to read in arbitrary XML files which are described with a specification. o The new tmReduce() transformation allows to combine multiple maps into one transformation. DEPRECATED & DEFUNCT o CSVSource is defunct (use DataframeSource instead). o weightLogical is defunct. o TermDocMatrix is defunct (use DocumentTermMatrix or TermDocumentMatrix instead). CHANGES IN tm VERSION 0.3-3 NEW FEATURES o The abstract Source class gets a default implementation for the stepNext() method. It increments the position counter by one, a reasonable value for most sources. For special purposes custom methods can be created via overloading stepNext() of the subclass. o New URISource class for a single document identified by a Uniform Resource Identifier. o New DataframeSource for documents stored in a data frame. Each row is interpreted as a single document. BUG FIXES o Fix off-by-one error in convertMboxEml() function. Reported by Angela Bohn. o Sort row indices in sparse term-document matrices. Kudos to Martin Mächler for his suggestions. o Sources and readers no longer evaluate calls in a non-standard way. CHANGES IN tm VERSION 0.3-2 NEW FEATURES o Weighting functions now have an Acronym slot containing abbreviations of the weighting functions' names. This is highly useful when generating tables with indications which weighting scheme was actually used for your experiments. o The functions tmFilter(), tmIndex(), tmMap() and TermDocMatrix() now can use a MPI cluster (via the snow and Rmpi packages) if available. Use (de)activateCluster() to manually override cluster usage settings. Special thanks to Stefan Theussl for his constructive comments. o The Source class receives a new Length slot. It contains the number of elements provided by the source (although there might be rare cases where the number cannot be determined in advance---then it should be set to zero).