Mining Collection of Documents: Clustering and Categorization

  • Mr. Kolli.Srikanth, Prof. N.V.E.S.Murthy, Prof. P.V.G.D. Prasad Reddy


Technical documents from web pages and news articles, transcripts of phone calls with customers, customer complaint letters sent by e-mail, and insurance claims are corporate knowledge “ore”. This information is in a highly unstructured form that is not readily usable for statistical analysis, and the sheer size of such document collections (Big Data) is a challenge in text mining. Sentiment analysis (SA) is an active subject of research in the text mining field. SA is the computational treatment of opinions, feelings, and subjectivity in text. The paper's key contributions include sophisticated clustering and categorization of a large number of recent articles, as well as an illustration of the current research trend in sentiment analysis and related topics. The corpus-based technique starts with a seed list of opinion words and then searches a large corpus for additional opinion words, which helps discover opinion words with context-specific orientations. This can be accomplished with a statistical two-phase text mining approach: feature extraction and information distillation. In text categorization, a document collection is processed and assigned to predefined categories based on a user-supplied taxonomy. In clustering, documents are processed and grouped into clusters that are created dynamically by an algorithm. Text mining makes extensive use of information from blogs, microblogs, and forums, as well as news sources. This media information is extremely important for conveying people's sentiments or ideas about a certain issue or product. The use of social networking and microblogging sites as data sources still needs further investigation. Some real-world data sets, such as movie-pang02 and chicago-affnia, especially review data, are used to evaluate machine learning classification and clustering algorithms with the R tool.

Keywords—Text mining, corpus, tf-idf, partition-based clustering, Naive Bayes classification, movie-pang02, chicago-affnia
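
As a rough illustration of the two-phase approach described in the abstract (feature extraction, then grouping), the following minimal sketch computes tf-idf vectors and runs a partition-based clustering (k-means) over them. It is written in Python with the standard library only, not in the R tool used for the paper's evaluation. The four toy documents, the function names `tokenize`, `tfidf_vectors`, and `kmeans`, the `log(N/df)` idf variant, and the deterministic choice of the first k documents as initial centroids are all assumptions made for illustration; this is not the authors' pipeline and it does not use the movie-pang02 or chicago-affnia data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep purely alphabetic whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised tf-idf vectors), using idf = log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        length = max(len(toks), 1)  # guard against empty documents
        vec = [tf[w] / length * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; initial centroids are the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus: two movie reviews and two hotel reviews.
docs = [
    "the movie plot was great and the acting was great",
    "the hotel room was clean and the hotel staff was friendly",
    "a great movie with great acting",
    "friendly staff and a clean hotel room",
]

vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy corpus the two movie reviews end up in one cluster and the two hotel reviews in the other. With real review data, stop-word removal and a less naive centroid initialisation would normally be needed, since frequent function words such as "the" and "was" otherwise contribute to document similarity.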