Converting to and from Document-Term Matrix and Corpus objects
Julia Silge and David Robinson, Tidying Document-Term Matrices.
Many existing text mining datasets are in the form of a DocumentTermMatrix class (from the tm package). For example, consider the corpus of 2246 Associated Press articles from the topicmodels package:
If we want to analyze this with tidy tools, we need to turn it into a one-term-per-document-per-row data frame first. The tidy function does this. (For more on the tidy verb, see the broom package.)
Just as shown in this vignette, having the text in this format is convenient for analysis with the tidytext package. For example, you can perform sentiment analysis on these newspaper articles.
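A minimal sketch of that workflow, assuming the topicmodels and tidytext packages are installed (the AssociatedPress data ships with topicmodels):

```r
library(tidytext)
library(dplyr)
data("AssociatedPress", package = "topicmodels")

# tidy() gives one row per document-term pair
ap_td <- tidy(AssociatedPress)

# Join each term against the Bing sentiment lexicon
ap_sentiments <- ap_td %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))
```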
We can find the most negative documents:
Or visualize which words contributed to positive and negative sentiment:
Note that a tidier is also available for the dfm class from the quanteda package:
Casting tidy text data into a DocumentTermMatrix
Some existing text mining tools or algorithms work only on sparse document-term matrices. Therefore, tidytext provides cast_ verbs for converting from a tidy form to these matrices.
This allows for easy reading, filtering, and processing to be done using dplyr and other tidy tools, after which the data can be converted into a document-term matrix for machine learning applications.
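For example, a tidy one-term-per-row table with document, term, and count columns (here called `ap_td`, as in the tidied AssociatedPress data) can be cast back with any of the three verbs:

```r
library(tidytext)

ap_td %>% cast_dtm(document, term, count)     # tm's DocumentTermMatrix
ap_td %>% cast_dfm(document, term, count)     # quanteda's dfm
ap_td %>% cast_sparse(document, term, count)  # Matrix package sparse matrix
```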
Tidying corpus data
You can also tidy Corpus objects from the tm package. For example, consider a Corpus containing 20 documents, one for each
The tidy verb creates a table with one row per document:
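A short sketch using the 20-document crude corpus that ships with tm:

```r
library(tidytext)
data("crude", package = "tm")

# One row per document, with metadata (author, datetimestamp, etc.) as columns
crude_td <- tidy(crude)
```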
Similarly, you can tidy a corpus object from the quanteda package:
This lets us work with tidy tools like unnest_tokens to analyze the text alongside the metadata.
We could then, for example, see how the appearance of a word changes over time:
For example, we can use the broom package to perform logistic regression on each word.
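A hedged sketch of that per-word regression, assuming a tidy table `word_counts` with columns word, year, count, and year_total (the table and column names are mine):

```r
library(dplyr)
library(broom)

word_models <- word_counts %>%
  group_by(word) %>%
  do(tidy(glm(cbind(count, year_total - count) ~ year,
              data = ., family = "binomial"))) %>%
  ungroup() %>%
  filter(term == "year")   # keep the slope on year for each word
```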
You can show these models as a volcano plot, which compares the effect size with the significance:
We can also use the ggplot2 package to display the top 6 terms that have changed in frequency over time.
15 Ways to Create a Document-Term Matrix in R
Original post in December 2020. Updated in August 2021.
The Document-Term Matrix (DTM) is the foundation of computational text analysis, and as a result there are several R packages that provide a means to build one. What is a DTM? It is a matrix where each document in some sample of texts (called a corpus) is a row, and the columns are all the unique words (often called types or vocabulary) in the corpus. The cells of the matrix are typically a count of how many times each unique word occurs in a given document (occurrences are often called tokens).
Below, I attempt a comprehensive overview and comparison of 15 different methods for creating a DTM . Two are custom functions written in base R . The rest are from eleven text analysis packages. One of these is an R package, text2map , that I developed with Marshall Taylor (Stoltz and Taylor 2022). (The dtm_builder() function was developed in tandem with writing the original comparison back in December 2020).
Below are the non-text analysis packages we’ll be using.
And, here are the text analysis R packages we'll be using. These include every package with functions to create a DTM that I could find (the koRpus package does provide a document-term matrix method, but I could not get it to work). Feel free to let me know if I've missed a package that creates a DTM.
Getting Started #
We need some text data! Let’s get scripts from Star Trek from the rtrek package, because, why not? Let’s create a corpus for all series, and then a corpus of just Star Trek: The Next Generation . We will filter the scripts by series, and then collapse each line so it is one script per episode.
Let’s do a tiny bit of preprocessing (lowercasing, smooshing contractions, removing punctuation, numbers, and getting rid of extra spaces).
Base R, Dense DTMs #
To get started, let’s create two base R methods for creating dense DTM s. There are three necessary steps: (1) tokenize, (2) create vocabulary, and (3) match and count.
First, each document is split into list of individual tokens. Second, from these lists of tokens, we need to extract only the unique tokens to create a vocabulary. Finally, we will count each time we find a match between a token in a document with a token in the vocabulary.
While the above are essential, there are a few optional steps which functions may or may not take by default. First, the most basic DTM uses the raw counts of each word in a document. Some functions may include the option to weight the matrix. The most common is to normalize by the row count to get relative frequencies. Since all weightings require raw counts anyway, we will just stop at a count DTM (not to mention relative frequencies will turn an integer matrix into a real number matrix, which will result in a larger object in terms of memory).
Second, the columns of our DTM may be sorted by (1) the order they appear in the corpus, (2) alphabetic order, or (3) their frequency in the corpus. The first option is the fastest and the third the slowest. The function may also incorporate the removal or "stopping" of certain tokens. It is more efficient to build the DTM first, and then simply remove the columns that match a given stoplist.
Finally, we can tokenize using a variety of rules. Both methods below will use strsplit() with a literal, single space ( fixed = TRUE significantly speeds up this process). This is a very simple tokenizing rule. This also means it is very fast in comparison to more complex rules. For example, we could tokenize by every two word bi-gram or instead of a literal single space, we could use other kinds of whitespace (tabs, carriage returns, newlines, etc…). Both methods below will also use unique() for getting the unique tokens (i.e. vocabulary) of the corpus.
The first function uses a for loop with the table function to count the number of instances of a given token, and the second uses only lapply with the tabulate function to count tokens. In base R, matrices are represented in a "dense" format, which is given the class matrix (this will make more sense when we discuss sparsity below). Generally, lapply is more efficient; however, because our first function initializes an empty matrix (and thus preallocates memory) and then fills it in, the for loop approach may perform better.
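The two approaches described above might look like this (the function names are mine; both assume a named character vector of documents):

```r
dtm_loop <- function(texts) {
  tokens <- strsplit(texts, " ", fixed = TRUE)   # (1) tokenize
  vocab  <- unique(unlist(tokens))               # (2) vocabulary
  dtm <- matrix(0L, nrow = length(texts), ncol = length(vocab),
                dimnames = list(names(texts), vocab))
  for (i in seq_along(tokens)) {                 # (3) match and count
    tab <- table(tokens[[i]])
    dtm[i, names(tab)] <- as.integer(tab)
  }
  dtm
}

dtm_lapply <- function(texts) {
  tokens <- strsplit(texts, " ", fixed = TRUE)
  vocab  <- unique(unlist(tokens))
  counts <- lapply(tokens, function(x)
    tabulate(match(x, vocab), nbins = length(vocab)))
  dtm <- do.call(rbind, counts)
  dimnames(dtm) <- list(names(texts), vocab)
  dtm
}
```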
Let’s double check that these two methods produced identical results. We’ll (1) check that they’re the same dimensions, (2) check they sum to the same number of total tokens in the corpus, (3) and check that the words and document IDs (episode titles) are the same.
Which method is more efficient? To compare methods, I will use the mark() function from the bench package.
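The equivalence checks and the benchmark can be sketched as follows, assuming `corpus` is the named character vector of cleaned scripts and calling the two base R functions dtm_loop() and dtm_lapply() (hypothetical names):

```r
dtm1 <- dtm_loop(corpus)
dtm2 <- dtm_lapply(corpus)

stopifnot(
  identical(dim(dtm1), dim(dtm2)),           # (1) same dimensions
  sum(dtm1) == sum(dtm2),                    # (2) same total token count
  identical(colnames(dtm1), colnames(dtm2)), # (3) same words
  identical(rownames(dtm1), rownames(dtm2))  #     and same episode titles
)

bench::mark(loop = dtm_loop(corpus), lapply = dtm_lapply(corpus))
```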
Similar, but our for loop function beat the lapply version. It's possible that with a larger corpus the lapply method would close the gap or pull ahead.
Sparse DTMs #
As a result of the nature of language, DTMs tend to be very “sparse” —meaning, they have lots and lots of zeros. For example, let’s see how many cells are zeros in one of the dense DTMs we just created based on Star Trek: TNG scripts.
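For instance, with one of the dense count matrices (here called `dtm`):

```r
# Proportion of cells that are zero -- the matrix's sparsity
sum(dtm == 0) / length(dtm)
```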
That is a lot of zero cells! The two functions above produced a basic “dense” matrix, which can quickly take up a lot of memory. There are several strategies for dealing with the memory issues of very large matrices in R , but, for the most part you will need enough RAM on your machine to hold the entirety of the matrix in memory at once. The most straightforward way to deal with memory limitations is to simply represent a matrix as a slightly different kind of data object called a “sparse” matrix. Simply put, when a matrix has a lot of zeros, the sparse format will produce a smaller object. Many of the dedicated DTM functions use this format, so let’s dive a little deeper into them.
There are two popular R packages that offer sparse matrix formats, Matrix and slam . The latter package offers one main kind of sparse matrix called a simple_triplet_matrix . The former package has several more. We will use the dgCMatrix class to represent integer and real number DTMs and lgCMatrix for the binary DTM (the Matrix package also has a triplet format: dgTMatrix ).
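Converting a dense base R matrix to these classes is straightforward (a sketch, assuming `dtm` is a dense count matrix as built above):

```r
library(Matrix)
sparse_dtm <- Matrix(dtm, sparse = TRUE)      # dgCMatrix for integer/real counts
binary_dtm <- Matrix(dtm > 0, sparse = TRUE)  # lgCMatrix for a binary DTM

library(slam)
triplet_dtm <- as.simple_triplet_matrix(dtm)  # slam's triplet format
```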
We built text2map ’s dtm_builder() first for speed, and second for memory efficiency. We were able to speed up the vocabulary creation step and the matching and counting step. And, just like the previous functions, we also limit tokenizing to the fixed, literal space.
Comparing DTM Functions #
Now that we have a better understanding of what is happening “under the hood” when creating a DTM in R , we can turn to comparing how well dedicated text analysis packages produce DTM s. We will measure the time and memory used to turn our lightly pre-processed Star Trek scripts into a DTM .
The dedicated functions all use either dgCMatrix or simple_triplet_matrix formats to represent the final outputs, but several intermediary steps are taken prior. This often involves converting the texts into different kinds of data structures. For example, in both our base R functions, we turn each individual episode into a list of each individual token, then we turn that into a list of token counts . A popular package, tidytext , uses the tokenizers package and then outputs a three-column token-count data frame, where each row is a document, term, and value – also called a tripletlist . Next, we can use cast_dtm() to get the equivalent to the tm package’s dtm or cast_dfm() to produce the equivalent to the quanteda package’s dfm . The udpipe package also creates a tripletlist before creating a DTM .
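The tidytext route described here can be sketched as follows (assuming `corpus` is a named character vector of episode scripts):

```r
library(dplyr)
library(tidytext)

trek_dtm <- tibble(doc_id = names(corpus), text = corpus) %>%
  unnest_tokens(term, text) %>%   # one token per row (tokenizers package)
  count(doc_id, term) %>%         # the triplet: document, term, value
  cast_dtm(doc_id, term, n)       # tm-style DocumentTermMatrix
```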
Next, we’ll create a unique function for each of the different packages that will go directly from lightly cleaned text to a DTM . Because a lot of these packages use the same or similar nomenclature, we will use the explicit package declaration for each function ( :: ).
We use our two base R functions as baselines, along with the dtm_builder() function from our text2map package. Then we use the two most commonly used packages, tm and quanteda . By default these two output a simple_triplet_matrix and a dgCMatrix matrix, respectively. Next, we will use tidytext and corpustools which both provide methods of producing matrices compatible with the tm and quanteda packages. The next group of five packages are explicitly oriented toward optimization (note: as of writing, gofastr and wactor have not been updated in a while). Finally, there is udpipe , which is particularly well known for providing parsers for numerous non-English languages. Together, we will compare 15 different methods.
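For example, the text2vec wrapper might look like this (a sketch; the other wrappers follow the same pattern of explicit `::` calls):

```r
dtm_text2vec <- function(texts) {
  it <- text2vec::itoken(texts,
                         tokenizer   = text2vec::space_tokenizer,
                         progressbar = FALSE)
  vocab <- text2vec::create_vocabulary(it)
  # Returns a dgCMatrix by default
  text2vec::create_dtm(it, text2vec::vocab_vectorizer(vocab))
}
```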
Now, we ran 100 iterations using the bench package. We can turn the output into violin plots to show the overall range in terms of time, and we can plot a bar chart to show the overall memory allocated during each iteration.
The text2map function dtm_builder() is the fastest – which was to be expected since we built it specifically for this purpose. As far as the other text analysis packages go, the winner, in terms of being fast and memory efficient is text2vec . textTinyR is written almost entirely in Rcpp, meaning that it is very memory efficient, but seems to lose some time interfacing between R and C++ . tidytext loses a lot of time because it first creates a tripletlist tibble, then creates a DTM (our base R loop function beats it in terms of speed and almost ties it in terms of memory despite operating on a base R dense matrix). In terms of the two most popular packages, quanteda edges out the tm package.
It’s important to note that these packages may use different tokenizing rules or remove some kinds of words by default. I attempted to standardize this across each function as best I could, but two functions produce a DTM of a different size from the rest. The tm option in corpustools produced a DTM with 25,583 words, while the gofastr package produced a DTM with only 13,096 words. I was unable to figure out what was up with these two packages.
Furthermore, these tests all create a DTM with 176 episodes and a vocabulary of 20,858 words. Perhaps varying these parameters would change the rankings. So, let's look at every Star Trek episode across all series, totaling 716 episodes and 39,324 unique words (er, 50,805 for corpustools' tm option and 24,994 words for gofastr). Note: this may take a while for most personal computers.
Comparison of all DTM methods on all Star Trek series
With a much larger corpus, the ranks stay remarkably similar. Our text2map function is hovering at the fastest end with around 300 milliseconds to create this DTM , followed by quanteda and text2vec . textTinyR still has that time overhead, but beats every other function in overall memory allocated. tidytext continues to offer by far the slowest DTM creation, and middle of the pack in terms of memory.
For the plot aesthetics, I used the following:
Stoltz, Dustin S., and Marshall A. Taylor. 2022. "text2map: R Tools for Text Matrices." Journal of Open Source Software 7(72): 3741.
- Published on December 19, 2021
- In Mystery Vault
A Guide to Term-Document Matrix with Its Implementation in R and Python
- by Yugesh Verma
In natural language processing, we must perform various text preprocessing tasks so that mathematical operations can be applied to the data. Before applying mathematics to text, the data must be represented in a mathematical format. The term-document matrix is one such representation, converting text data into a mathematical matrix. In this article, we discuss the term-document matrix and see how to make one, with hands-on implementations in R and Python for a better understanding. The major points to be discussed are listed below.
Table of Contents
What is a Term-Document Matrix?
- Term-Document Matrix in R
- Using Text Mining
- Application of Term-Document Matrix
Let’s start the discussion by understanding what the term-document matrix is.
In natural language processing, we see many methods of representing text data. The term-document matrix is one such method: the text data is represented in the form of a matrix. The rows of the matrix represent the sentences (documents) from the data that need to be analyzed, and the columns of the matrix represent the words. The cells of the matrix hold the number of occurrences of each word. Let's understand it with an example.
Here, we can see a set of text responses. The term-document matrix of these responses will look like this:
The above table is a representation of the term-document matrix. From this matrix, we can get the total number of occurrences of any word in the whole corpus, and by analyzing these counts we can reach many fruitful results. Term-document matrices are among the most common representations used when processing and analyzing text data. More formally, we can say it is a way to represent the relationship between the words and the sentences present in the corpus.
Since R and Python are two common languages used for NLP, we are going to see how to implement a term-document matrix in both of them. Let's start with the R language.
Implementation in R
In this section of the article, we are going to see how we can create a term-document matrix using the R language. For this purpose, we are required to install the tm(text mining) library in our environment.
Using the above lines of code, we can install the text mining library. Besides the term-document and document-term matrix, the library offers various other facilities from the field of text mining.
Importing the library:
Using the above lines of code we can call the library.
For making a term-document matrix in R, we use the crude dataset which comes with the tm library; it is a volatile corpus of 20 news articles dealing with crude oil.
Let's inspect the crude VCorpus:
Here is the output. We can see the character counts and metadata information in the VCorpus. For more detailed information, we can use R's help function.
We could also use a plain corpus for making the term-document matrix, but we use the volatile corpus because it is easier to inspect and explain after conversion to a term-document matrix.
Making Term-Document Matrix:
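A minimal sketch of this step:

```r
library(tm)
data("crude")

# Terms as rows, documents as columns
tdm <- TermDocumentMatrix(crude)
tdm
```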
Here we can see the details of the term-document matrix. Let’s inspect some values from it.
Here in the output, we can see some of the values of the term-document matrix and some of the information regarding these values. We can also inspect the values using our chosen words from the documents.
inspect(tdm[c("price", "prices", "texas"), c("127", "144", "191", "194")])
We can also make the document term matrix using the functions provided by the tm library as,
Let’s inspect the document term matrix.
The basic difference between the term-document matrix and the document-term matrix is orientation: in a term-document matrix the rows are terms and the columns are documents, while in a document-term matrix the rows are documents and the columns are terms. Both default to term frequency (TF) weighting in tm, though other weightings such as term frequency-inverse document frequency (TF-IDF) are available.
The below image is a representation of a word cloud using the document term matrix that we have made earlier. We can make it using the following codes:
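A sketch of that word cloud, assuming the wordcloud package is installed and `dtm` is the document-term matrix made above:

```r
library(wordcloud)

m <- as.matrix(dtm)                          # densify the document-term matrix
freq <- sort(colSums(m), decreasing = TRUE)  # total count of each term
wordcloud(words = names(freq), freq = freq, max.words = 50)
```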
The image shows that we would need to clean the data further to get better results. Since the motive of this article is the basic implementation of the document-term matrix, we will stay focused on that. Let's see how to do the same in the Python programming language.
Implementation in Python
In this section of the article, we are going to see how to make the document-term matrix using Python and libraries built on top of it. In Python, there are various ways to do this. Before going through any of them, let's define our documents. Here we take the sentences from the table given above. Let's start by defining the documents.
As we have said, Python offers several approaches; here we will discuss two of the simplest. The first way of making the term-document matrix is to use functions from the pandas and scikit-learn libraries. Let's see how to do this.
Importing the libraries
Adding the sentences
Defining and fitting the count vectorizer on the document.
Converting the vectors to a DataFrame using pandas
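The counting that the count vectorizer performs can also be sketched with only the Python standard library, which makes the structure of the matrix explicit (the example sentences here are hypothetical stand-ins for the table above):

```python
from collections import Counter

docs = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

# Vocabulary: sorted unique tokens across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# Document-term matrix: one row per document, one count per vocabulary word
counts = [Counter(doc.split()) for doc in docs]
dtm = [[c[word] for word in vocab] for c in counts]
```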
Here we can see the document-term matrix of the documents we defined. Now let's look at our second way, using a library named textmining which has a function for making the document-term matrix from text data.
Using Text Mining
Installing the library:
pip install textmining3
Initializing function for making term-document matrix.
Here we can see the type of object in the output which we have defined for making the term-document matrix.
Fitting the documents in the function.
Converting the term-document matrix to a pandas data frame.
Here we can see the document term matrix which we have created using the text mining library.
Application of Term-Document Matrix
Making a term-document matrix from text data is an intermediate step in many NLP projects. Term-document matrices can be used in various types of NLP tasks; some of the tasks we can perform with them are as follows:
- By performing the singular value decomposition on the term-document matrix, search results can be improved to an extent. Using it on the search engine, we can improve the results of the searches by disambiguating polysemous words and searching for synonyms of the query.
- Most NLP processes are focused on extracting one or more patterns of behaviour from a corpus of text. Term-document matrices are very helpful here: by performing multivariate analysis on the document-term matrix we can uncover the different themes of the data.
In this article, we have seen what a term-document matrix is, with an example, and how to make one using the R and Python programming languages. In the end, we also discussed some major applications of the term-document matrix.
- Link for the R codes
- Link for the python codes
- Text mining library
Text Analysis: Hooking up Your Term Document Matrix to Custom R Code
I have previously written about some of the text analysis options that are available in Displayr: sentiment analysis , text cleaning , and the predictive tree . As text analysis is a growing field, you likely want to use your own tools on top of those already built into Displayr. To feed information about your text into a statistical algorithm, it must first be converted into a form which is amenable to doing calculations. One approach to this is to use a term document matrix - the topic of this blog post. I'll explain what a term document matrix is, a more compact representation of it called a sparse matrix, and how to use it.
What is a term document matrix?
A term document matrix is a way of representing the words in the text as a table (or matrix) of numbers. The rows of the matrix represent the text responses to be analysed, and the columns of the matrix represent the words from the text that are to be used in the analysis. The most basic version is binary. A 1 represents the presence of a word and 0 its absence. Consider, as an example, the following, very basic, set of text responses:
The term document matrix for this would look something like the following:
The steps to creating your own term matrix in Displayr are:
- Clean your text responses using Insert > More > Text Analysis > Setup Text Analysis . Options for cleaning the text with this item are discussed in How to Set Up Your Text Analysis in Displayr .
- Add your term-document matrix using Insert > More > Text Analysis > Techniques > Create Term Document Matrix.
Like my other posts on text analysis, I will use the example of Donald Trump's tweets. In this example data set, which you can play with here in Displayr, there are 1,512 tweets from @realDonaldTrump. Using Displayr's tool to create the term document matrix, we instead start with an output that looks somewhat different from the one in our easy example:
This version of the matrix is called a sparse matrix (believe it or not!) and it is a more efficient representation of the information contained in the term document matrix. It is necessary for us to use this representation whenever there are a large number of cases or words. The matrix tends to be mostly 0's and in this case the output tells us that the proportion of entries that are zero (called the Sparsity ) is 97%. Because this representation does not store this information, we save a lot of computer memory. The downside is that it doesn't display as nicely on the screen, and you'll need to convert it into a normal matrix when you want to use it in a calculation.
If your data set contains only a few hundred text entries then you can use some R code to display the matrix:
- Click on Insert > R Output . This creates a new output which will display the output of any R code that you type.
- Click on Properties > R CODE on the right of the screen.
- Enter the following:
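A sketch of that snippet, based on the explanation that follows:

```r
library(tm)                       # knows how to handle the sparse matrix format
as.matrix(term.document.matrix)   # convert the sparse matrix to a normal R matrix
```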
There are a couple of important things to note about this very simple snippet of code. Firstly, we have loaded the R package called tm (which stands for text mining ). We did this because this package knows how to handle the sparse matrix format that we have used. It contains a version of the generic function as.matrix() , which converts the sparse matrix into a normal R matrix. In addition, term.document.matrix is the name of our original sparse term document matrix. In Displayr you can, consequently, use outputs in your document as inputs to other calculations by referring to their name . To find the name of an output, first click on it, and then look in Properties > GENERAL > Name .
The result looks like this:
We are now ready to analyze the tweets with a statistical algorithm. To begin with, we will use a random forest model to see how the presence of particular words can be used to predict which device the tweet was sent from - iPhone or Android. Why do we care? The working hypothesis is that Trump himself tweets from an Android, whereas his media team tweet on his behalf from an iPhone (see a previous post on sentiment analysis). This results in differences in the language coming from those devices.
Displayr has a built in option for running a random forest model. This type of model predicts the relationship between variables in the data set. Use it by selecting Insert > More > Machine Learning > Random Forest . However, the term document matrix lives in an R output and is not saved as a set of variables in our data set. In fact, due to its size, it is undesirable to save a term document matrix into your data set. Instead, we can modify the code for the existing random forest option to work as follows:
- Click on Insert > R Output .
- Use the following code:
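A hedged sketch of that code using the randomForest package (Displayr's built-in option may use a different wrapper internally; `tweetSource` is assumed to be a factor variable in the data set):

```r
library(randomForest)

# Convert the sparse term document matrix and combine with the dependent variable
dtm <- as.matrix(term.document.matrix)
dat <- data.frame(tweetSource = tweetSource, dtm)

# Work out a formula relating the dependent variable to every term column
predictors <- setdiff(names(dat), "tweetSource")
f <- formula(paste("tweetSource ~", paste(predictors, collapse = " + ")))

model <- randomForest(f, data = dat, importance = TRUE)
importance(model)   # MeanDecreaseAccuracy per word
```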
The code above first converts the term document matrix, before combining it with the dependent variable ( tweetSource ), working out an appropriate R formula which relates the dependent variable to the columns of the term document matrix, and finally runs the random forest routine. Similarly, the same process could be used for a regression model, or other R routines which gets their data in, using this basic structure. This leads to a table showing how important each word is in improving the accuracy of predicting the source of each tweet:
The MeanDecreaseAccuracy figures provide a measure of how much each word improves the accuracy of the random forest model in predicting the source of the tweet. The first three columns show the importance for each possible source. In the second row, we see that the presence of the @realDonaldTrump (which is where the account re-tweets mentions of Trump), is by far the most important term. Looking at the relative frequencies of words used between the two devices, we therefore conclude that the presence of such a mention is almost always from the Android (theorized to be Trump's own device). The first row, on the other hand, shows that the presence of the tag #trump2016 was very good at predicting a tweet did not come from the Android device, but as it was fairly infrequent overall, was not a great predictor of a tweet being from an iPhone.
TRY IT OUT Feel free to try out these examples in Displayr.
Many packages for doing text analysis have been written in the R language. We've made some of them available in Displayr already, including tm , tidytext , text2vec , stringr , hunspell , and SnowballC . If you come across one that you want to use, but which is unavailable in Displayr, you should contact us at [email protected] to let us know. We can, when needed, typically make new packages available within a few days.
Notes for “Text Mining with R: A Tidy Approach”
5.1 Tidying a document-term matrix
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. This is a matrix where
each row represents one document
each column represents one term (word)
each value (typically) contains the number of appearances of that term in that document
Document-term matrices are often stored as a sparse matrix object. These objects can be treated as though they were matrices (for example, accessing particular rows and columns), but are stored in a more efficient format.
tidytext provides ways of converting between these two formats:
tidy() turns a document-term matrix into a tidy data frame (one-token-per-row)
cast() turns a tidy data frame into a matrix. There are three variations of this verb corresponding to different classes of matrices: cast_sparse() (converting to a sparse matrix from the Matrix package), cast_dtm() (converting to a DocumentTermMatrix object from tm), and cast_dfm() (converting to a dfm object from quanteda)
DocumentTermMatrix class is built into the tm package. Notice that this DTM is 99% sparse (99% of document-word pairs are zero).
Terms() is an accessor function to extract the vector of distinct terms
tidy it to get a tidy data frame
quanteda uses dfm (document-feature matrix) as a common data structure for text data. For example, the quanteda package comes with a corpus of presidential inauguration speeches, which can be converted to a dfm using the appropriate function.
We, of course, want to tidy it
Suppose we would like to see how the usage of certain user-specified words changes over time. We start by applying complete() to the data frame, and then compute the total words per speech:
TermDocumentMatrix: Term-Document Matrix
Constructs or coerces to a term-document matrix or a document-term matrix.
An object of class TermDocumentMatrix or class DocumentTermMatrix (both inheriting from a simple triplet matrix in package slam ) containing a sparse term-document matrix or document-term matrix. The attribute weighting contains the weighting applied to the matrix.
x: For the constructors, a corpus or an R object from which a corpus can be generated via Corpus(VectorSource(x)); for the coercing functions, either a term-document matrix, a document-term matrix, a simple triplet matrix (package slam), or a term frequency vector.
control: A named list of control options. There are local options, which are evaluated for each document, and global options, which are evaluated once for the constructed matrix. Available local options are documented in termFreq and are internally delegated to a termFreq call.
This is different for a SimpleCorpus . In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost ( https://www.boost.org ) Tokenizer (via Rcpp ) and takes no custom functions as option arguments.
Available global options are:
bounds: A list with a tag global whose value must be an integer vector of length 2. Terms that appear in fewer documents than the lower bound bounds$global[1], or in more documents than the upper bound bounds$global[2], are discarded. Defaults to list(global = c(1, Inf)) (i.e., every term will be used).
weighting: A weighting function capable of handling a TermDocumentMatrix. It defaults to weightTf for term frequency weighting. Available weighting functions shipped with the tm package are weightTf, weightTfIdf, weightBin, and weightSMART.
When coercing a simple triplet matrix to a term-document or document-term matrix, the additional argument weighting (typically a WeightFunction) is allowed.
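Putting the constructor and control options together, a minimal sketch (the documents are invented for illustration):

```r
library(tm)

docs <- c("the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs")
corpus <- VCorpus(VectorSource(docs))

# Keep only terms appearing in at least 2 documents,
# weighted by raw term frequency (the default)
tdm <- TermDocumentMatrix(
  corpus,
  control = list(
    bounds = list(global = c(2, Inf)),
    weighting = weightTf
  )
)

inspect(tdm)
```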
See also: termFreq for available local control options.