-
Definitions
- first mentioned in Feldman et al. [FD95]
(Hotho, Nürnberger, and Paaß 2005)
- Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text
(Hotho, Nürnberger, and Paaß 2005)
- the application of algorithms and methods from the fields of machine learning and statistics to texts with the goal of finding useful patterns. For this purpose it is necessary to pre-process the texts accordingly. Many authors use information extraction methods, natural language processing or some simple pre-processing steps to extract data from texts. Data mining algorithms can then be applied to the extracted data (see [NM02, Gai03]).
(Hotho, Nürnberger, and Paaß 2005)
- Text Mining [1] is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.
(Gupta and Lehal 2009)
- refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text.
(Gupta and Lehal 2009)
- can work with unstructured or semi-structured data sets such as emails, full-text documents, and HTML files
(Gupta and Lehal 2009)
-
a.k.a
- knowledge discovery from text (KDT)
(Hotho, Nürnberger, and Paaß 2005)
- Intelligent Text Analysis, Text Data Mining or Knowledge-Discovery in Text (KDT)
(Gupta and Lehal 2009)
- Text Analytics
(Ppts APomares)
- Text Data Mining
(Ppts APomares)
-
Related Areas
- information retrieval, machine learning, statistics, computational linguistics and especially data mining.
(Hotho, Nürnberger, and Paaß 2005)
-
Data Mining
(Gupta and Lehal 2009)
- tries to find interesting patterns from large databases.
(Gupta and Lehal 2009)
-
Web mining
- explores interesting information and potential patterns from web page contents, web access and linkage information, and e-commerce resources using data mining techniques, which can help people extract knowledge, improve web site design, and develop e-commerce.
(Gupta and Lehal 2009)
- Graph Mining
- Computational Linguistics
(Gupta and Lehal 2009)
-
Information Retrieval
- Information retrieval is the finding of documents which contain answers to questions, not the finding of the answers themselves [Hea99]
(Hotho, Nürnberger, and Paaß 2005)
-
Natural Language Processing NLP
- The general goal of NLP is to achieve a better understanding of natural language by use of computers [Kod99].
(Hotho, Nürnberger, and Paaß 2005)
-
Information Extraction
- The goal of information extraction methods is the extraction of specific information from text documents. These are stored in database-like patterns (see [Wil97])
(Hotho, Nürnberger, and Paaß 2005)
- IE addresses the problem of transforming a corpus of textual documents into a more structured database; the database constructed by an IE module can then be provided to the KDD module for further mining of knowledge
(Gupta and Lehal 2009)
-
Databases
- are necessary in order to analyze large quantities of data efficiently.
(Hotho, Nürnberger, and Paaß 2005)
-
Machine Learning
- is an area of artificial intelligence concerned with the development of techniques which allow computers to “learn” by the analysis of data sets.
(Hotho, Nürnberger, and Paaß 2005)
-
Statistics
- Statistics has its grounds in mathematics and deals with the science and practice of the analysis of empirical data. It is based on statistical theory, which is a branch of applied mathematics
(Hotho, Nürnberger, and Paaß 2005)
-
Knowledge Discovery in Databases (KDD)
- is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
(Hotho, Nürnberger, and Paaß 2005)
-
Sub Areas
-
Text stream mining
-
Text Stream
- Created by very large-scale interactions of individuals, or by structured creation of particular kinds of content by dedicated organizations, e.g. news-wire services (Reuters, AP)
(Aggarwal, 2012)
- Provides unprecedented challenges to data mining algorithms from an efficiency perspective
(Aggarwal, 2012)
- Ubiquitous in recent years because of the wide variety of applications in social networks and news collection; in general, massive streams are created continuously
(Aggarwal, 2012)
-
Applications
- Social networks
(Aggarwal, 2012)
- Users continuously communicate with one another with the use of text messages
- Interesting because text messages reflect user interests; the same applies to chat and email networks
- News aggregator services i.e. Google News
(Aggarwal, 2012)
- Receives news articles continuously over time
- Web crawlers
(Aggarwal, 2012)
- Collect large volume of documents from networks in small time frame
- Combines search results from major search engines like Google, Yahoo! and Bing
-
Opportunities
- Methods for online summarization need to be designed
(Aggarwal, 2012)
-
Motivations
- There are estimates that 85% of business information lives in the form of text
(Hotho, Nürnberger, and Paaß 2005)
- As most information (over 80%) is stored as text, text mining is believed to have a high commercial potential value.
(Gupta and Lehal 2009), (Tan 1999)
- Humans have the ability to distinguish and apply linguistic patterns to text and humans can easily overcome obstacles that computers cannot easily handle such as slang, spelling variations and contextual meaning. However, although our language capabilities allow us to comprehend unstructured data, we lack the computer’s ability to process text in large volumes or at high speeds.
(Gupta and Lehal 2009)
- As the most natural form of storing information is text, text mining is believed to have a commercial potential higher than that of data mining
(Tan 1999)
-
Methodologies
-
Cross-Industry Standard Process for Data Mining (CRISP-DM)
(Hotho, Nürnberger, and Paaß 2005)
- (1) business understanding, (2) data understanding, (3) data preparation, (4) modelling, (5) evaluation, (6) deployment
-
Phases
-
Preprocessing
- For mining large document collections it is necessary to pre-process the text documents and store the information in a data structure, which is more appropriate for further processing than a plain text file
(Hotho, Nürnberger, and Paaß 2005)
-
Filtering
(Hotho, Nürnberger, and Paaß 2005)
- remove words from the dictionary and thus from the documents.
-
Lemmatization
(Hotho, Nürnberger, and Paaß 2005)
- Tries to map verb forms to the infinitive and nouns to the singular form.
-
Stemming
(Hotho, Nürnberger, and Paaß 2005)
- Tries to build the basic forms of words, i.e. strip the plural 's' from nouns, the 'ing' from verbs, or other affixes.
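A minimal sketch of the filtering and stemming steps above (the stopword list and suffix rules are illustrative, not a full stemmer such as Porter's):

```python
# Illustrative stopword list; real systems use larger, curated lists.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "of"}

def filter_words(tokens):
    """Filtering: remove stopwords from the token stream."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def naive_stem(token):
    """Crude stemming: strip a few common affixes ('ing', 'es', 's')."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = "the cats are chasing mice".split()
print([naive_stem(t) for t in filter_words(tokens)])  # ['cat', 'chas', 'mice']
```

Note that naive suffix stripping can over- or under-stem ("chasing" → "chas"), which is why dictionary-based lemmatization is often preferred.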
-
Linguistic
(Hotho, Nürnberger, and Paaß 2005)
-
Part-of-speech tagging (POS)
- determines the part of speech tag, e.g. noun, verb, adjective, etc. for each term.
-
Text chunking
- aims at grouping adjacent words in a sentence.
-
Word Sense Disambiguation (WSD)
- Tries to resolve the ambiguity in the meaning of single words or phrases.
-
Parsing
- produces a full parse tree of a sentence.
-
Index Term Selection
(Hotho, Nürnberger, and Paaß 2005)
- In this case, only the selected keywords are used to describe the documents.
-
The most commonly used criterion is entropy
- The entropy gives a measure of how well a word is suited to separate documents by keyword search.
- The entropy can be seen as a measure of the importance of a word in the given domain context.
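A sketch of one common entropy-based term weight (the exact formula variant is an assumption; it follows the idea above that a term concentrated in few documents separates better than one spread evenly):

```python
import math

def term_entropy_weight(term_freqs):
    """W(t) = 1 + (1/log2 n) * sum_d p_d*log2(p_d), where p_d is the
    term's relative frequency in document d and n the number of documents."""
    n = len(term_freqs)
    total = sum(term_freqs)
    if total == 0 or n < 2:
        return 0.0
    entropy = 0.0
    for f in term_freqs:
        if f > 0:
            p = f / total
            entropy += p * math.log2(p)
    return 1.0 + entropy / math.log2(n)

# A term in only one of four documents scores 1.0 (good separator);
# a term spread evenly over all four scores 0.0.
print(term_entropy_weight([8, 0, 0, 0]))  # 1.0
print(term_entropy_weight([2, 2, 2, 2]))  # 0.0
```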
-
Dimensionality reduction
-
Principal Components Analysis (PCA)
- a.k.a. the Karhunen-Loève procedure, eigenvector analysis, and empirical orthogonal functions, depending on the context in which it is used
(Berry et al. 2008)
- recently it has been used primarily in statistical data analysis and image processing
(Berry et al. 2008)
- For text and data mining, a variant focuses on covariance matrix analysis (COV)
(Berry et al. 2008)
-
Latent Semantic Indexing (LSI)
- Given a database with M documents and N distinguishing attributes for relevancy ranking, let A denote the corresponding M-by-N document-attribute matrix model with entries a(i, j) that represent the importance of the ith term in the jth document. The fundamental idea in LSI is to reduce the dimension of the IR problem to k, where k ≪ M, N, by projecting the problem into the space spanned by the rows of the closest rank-k matrix to A in the Frobenius norm [DDF+90].
(Berry et al. 2008)
- Singular Value Decomposition (SVD)
(Ppts APomares)
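A minimal LSI sketch via the SVD (toy matrix, illustrative only): the truncated SVD gives the closest rank-k matrix to A in the Frobenius norm, and documents are then compared in the k-dimensional latent space.

```python
import numpy as np

# Toy 4-terms x 3-documents matrix; entries are term importances.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# Rank-k approximation: closest rank-k matrix to A in the Frobenius norm.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Document coordinates in the k-dimensional latent space.
doc_coords = np.diag(s[:k]) @ Vt[:k, :]  # shape (k, num_documents)
print(A_k.round(2))
```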
-
Storage
-
The Vector Space Model
-
Each element of the vector usually represents a word (or a group of words) of the document collection, i.e. the size of the vector is defined by the number of words (or groups of words) of the complete document collection
(Hotho, Nürnberger, and Paaß 2005)
-
Bag of words
- A document is represented as a set of words, together with their associated frequency in the document.
(Aggarwal, 2012)
- This representation is essentially independent of the sequence of words in the collection
(Aggarwal, 2012)
- simplicity for classification purposes
(Aggarwal, 2012)
-
Techniques
-
Classification
(Hotho, Nürnberger, and Paaß 2005)
- Classification aims at assigning pre-defined classes to text documents
-
a.k.a Categorization
- Categorization involves identifying the main themes of a document by placing the document into a pre-defined set of topics.
(Gupta and Lehal 2009)
-
Naïve Bayes Classifier
- We may assign the class with the highest posterior probability to our document.
- Combining this “naïve” independence assumption with the Bayes formula yields the posterior probabilities of the classes.
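A from-scratch multinomial naive Bayes sketch (the training data is made up for illustration): P(c | d) ∝ P(c) · Π_w P(w | c), with Laplace smoothing so unseen words do not zero out the product.

```python
import math
from collections import Counter, defaultdict

train = [("buy cheap pills now", "spam"),
         ("cheap pills cheap", "spam"),
         ("meeting agenda for monday", "ham"),
         ("monday project meeting", "ham")]

class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

vocab = {w for words in class_docs.values() for w in words}
priors = {c: sum(1 for _, l in train if l == c) / len(train) for c in class_docs}
counts = {c: Counter(words) for c, words in class_docs.items()}

def classify(text):
    """Return the class with the highest (log) posterior probability."""
    scores = {}
    for c in class_docs:
        total = sum(counts[c].values())
        score = math.log(priors[c])  # log prior
        for w in text.split():
            # Laplace-smoothed log likelihood of each word given the class
            score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("cheap pills"))     # spam
print(classify("monday meeting"))  # ham
```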
-
Nearest Neighbor Classifier
- Instead of building explicit models for the different classes we may select documents from the training set which are “similar” to the target document. The class of the target document subsequently may be inferred from the class labels of these similar documents.
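A k-nearest-neighbour sketch with cosine similarity over word-count vectors (toy data; in practice tf-idf weights and larger k are common):

```python
import math
from collections import Counter

train = [("cheap pills buy now", "spam"),
         ("project meeting monday", "ham"),
         ("cheap cheap pills", "spam"),
         ("agenda for the meeting", "ham")]

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(text, k=3):
    """Infer the class from the labels of the k most similar documents."""
    q = Counter(text.split())
    sims = sorted(((cosine(q, Counter(t.split())), label) for t, label in train),
                  reverse=True)
    top = [label for _, label in sims[:k]]
    return Counter(top).most_common(1)[0][0]

print(knn_classify("cheap pills today"))  # spam
```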
-
Decision Trees
- Decision trees are classifiers which consist of a set of rules which are applied in a sequential way and finally yield a decision.
-
Support Vector Machines
- The SVM algorithm determines a hyperplane which is located between the positive and negative examples of the training set
- The most important property of SVMs is that learning is nearly independent of the dimensionality of the feature space. It rarely requires feature selection as it inherently selects data points (the support vectors) required for a good classification.
-
Clustering
(Hotho, Nürnberger, and Paaß 2005)
- Clustering methods can be used to find groups of documents with similar content.
- Clustering [7] is a technique used to group similar documents, but it differs from categorization in that documents are clustered on the fly instead of through the use of predefined topics.
(Gupta and Lehal 2009)
-
K-means
- the documents are assigned to the nearest of the k centroids (also called cluster prototypes); new centroids are then calculated on the basis of the new allocations. These two steps are repeated until the cluster centroids no longer change
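A bare-bones k-means sketch on 2-D points standing in for document vectors (real text clustering would use high-dimensional tf-idf vectors):

```python
import math

def kmeans(points, centroids):
    while True:
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: recompute centroids from the new assignments.
        new = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # stop when the centroids no longer change
            return new, clusters
        centroids = new

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)  # [(0.0, 0.5), (10.0, 10.5)]
```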
-
Self organizing maps (SOM)
- The neurons in the input layer correspond to the input dimensions, here the words of the document vector. The output layer (map) contains as many neurons as clusters needed.
-
Fuzzy Clustering
- While most classical clustering algorithms assign each datum to exactly one cluster, thus forming a crisp partition of the given data, fuzzy clustering allows for degrees of membership with which a datum belongs to different clusters
-
Information Extraction
(Gupta and Lehal 2009)
- The main task is to extract parts of text and assign specific attributes to it.
-
Hidden Markov Models
- Hidden Markov models require the conditional independence of features of different words given the labels.
-
Visualization
- Visual text mining, or information visualization [3], puts large textual sources in a visual hierarchy or map and provides browsing capabilities, in addition to simple searching.
(Gupta and Lehal 2009)
-
Topic Tracking
- users choose keywords and are notified when news relating to those topics becomes available.
(Gupta and Lehal 2009)
-
Summarization
- Text summarization is immensely helpful for trying to figure out whether or not a lengthy document meets the user’s needs and is worth reading for further information.
(Gupta and Lehal 2009)
-
Concept Linkage
- Concept linkage tools [3] connect related documents by identifying their commonly-shared concepts and help users find information that they perhaps wouldn’t have found using traditional searching methods.
(Gupta and Lehal 2009)
-
Question Answering
- Another application area of natural language processing is natural language queries, or question answering (Q&A), which deals with how to find the best answer to a given question.
(Gupta and Lehal 2009)
-
Association Rule Mining
- Association rule mining (ARM) [33] is a technique used to discover relationships among a large set of variables in a data set.
(Gupta and Lehal 2009)
- ARM discovers what items customers typically purchase together.
(Gupta and Lehal 2009)
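A minimal support/confidence sketch of association rule mining over toy "market basket" transactions (the data is made up for illustration):

```python
transactions = [{"bread", "milk"},
                {"bread", "butter"},
                {"bread", "milk", "butter"},
                {"milk"}]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears when the antecedent does."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {milk}
print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 0.666...
```

Algorithms such as Apriori make this scale by pruning itemsets below a minimum support threshold before generating rules.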
-
Application Areas
(Hotho, Nürnberger, and Paaß 2005)
-
Patent Analysis
-
Text Classification for News Agencies
- The users like to have these stories tagged with categories and the names of important persons, organizations and places
- Automatic clustering of news to assign topics
-
Bioinformatics
- Bio-entity recognition aims to identify and classify technical terms in the domain of molecular biology that correspond to instances of concepts that are of interest to biologists.
-
Knowledge and Human Resource management
- The need to organize and adapt their strategies to market demands and opportunities requires that companies collect information about themselves, the market, and their competitors, manage enormous amounts of data, and analyze it to make plans.
(Gupta and Lehal 2009)
-
Customer Relationship Management (CRM)
- automatically rerouting specific requests to the appropriate service or supplying immediate answers to the most frequently asked questions.
(Gupta and Lehal 2009)
-
Market Analysis
- analyze competitors and/or monitor customers' opinions to identify new potential customers, as well as to determine the companies’ image through the analysis of press reviews and other relevant sources.
(Gupta and Lehal 2009)
-
Technology watch
- analyses the characteristics of existing technologies and identifies emerging technologies
(Gupta and Lehal 2009)
-
Open Problems and future directions
- It remains a challenge to see how semantic analysis can be made much more efficient and scalable for very large text corpora.
(Tan 1999)
- It is essential to develop text refining algorithms that process multilingual text documents and produce language-independent intermediate forms
(Tan 1999)
- It is interesting to explore how one can take advantage of domain information to improve parsing efficiency and derive a more compact intermediate form.
(Tan 1999)
- Current text mining products and applications are still tools designed for trained knowledge specialists. Future text mining tools, as part of the knowledge management systems, should be readily usable by technical users as well as management executives.
(Tan 1999)
-
Real Applications
(Ppts APomares)
-
opinioncrawl.com
- Sentiment Analysis
- allows visitors to assess Web sentiment on a topic (a person, an event, a company or a product)
-
socialmention.com
- a social media search and analysis platform that aggregates user-generated content from across social media into a single stream of information.
-
nlp.stanford.edu:8080/sentiment/rntnDemo.html
- Sentiment Analysis