Topic Modeling Using PLSA (Probabilistic Latent Semantic Analysis) Method for Copy Paste Identification in Sentences

Authors

  • Gentur Dwi Teguh Santoso*, Rizki Febriansyah, Ega Kamalludin, Agustian Putra Pamungkas, Ari Purno Wahyu W

Abstract

The document processing application for Final Project reports has two main features. The first feature
searches Final Project reports by author, by title, and by words. Searching by author and title is done
with a string comparison method. For searching by words contained in the document, topic modeling is
performed using PLSA (Probabilistic Latent Semantic Analysis) to provide more relevant results than a
purely syntactic word search. The second feature compares sequences of sentences to identify sentences
copied and pasted from other documents, which is one form of plagiarism. This comparison is based on the
positions of words and sentences. The application is built around two processes: document acquisition and
query processing. Document acquisition extracts the terms of each document and their frequencies into an
inverted index matrix. The extraction must also store word and sentence positions in the matrix. The matrix
is then used to compute the topic model with the PLSA formula. The topic model in turn supports searching
documents by words, with results presented through a ranking system. The experiment was carried out on 17
Final Project reports from different years, chosen for diversity of topics. For a single keyword, the
application gives more relevant results within the scope of the topic cluster. This is due to the use of a
simple ranking system. The application can also report which documents contain copy-pasted sentences, using
an exactly identical sequence of words as the indicator.
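
As an illustration of the topic-modeling step described in the abstract, the following is a minimal sketch of
standard PLSA fitted by expectation-maximization on a term-frequency matrix, followed by a simple
single-keyword ranking. It is not the authors' implementation. The function names plsa and rank_documents,
the parameters n_topics, n_iter and seed, and the scoring rule P(w|d) = sum over z of P(w|z) P(z|d) are
assumptions made for this example.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM (illustrative sketch, not the paper's code).

    counts[d][w] is the frequency of term w in document d.
    Returns (p_z_d, p_w_z): P(topic | document) and P(word | topic).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        return [v / total for v in row]

    # Random positive initialisation, each row normalised to sum to 1.
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_topics):
                    weight = n_dw * joint[z] / total
                    new_z_d[d][z] += weight
                    new_w_z[z][w] += weight
        # M-step: renormalise the expected counts (keep old row if empty).
        p_z_d = [normalize(new) if sum(new) > 0 else old
                 for new, old in zip(new_z_d, p_z_d)]
        p_w_z = [normalize(new) if sum(new) > 0 else old
                 for new, old in zip(new_w_z, p_w_z)]
    return p_z_d, p_w_z


def rank_documents(word, p_z_d, p_w_z):
    """Rank document indices by P(word | d) = sum_z P(word | z) P(z | d)."""
    n_topics = len(p_w_z)
    scores = [sum(p_z_d[d][z] * p_w_z[z][word] for z in range(n_topics))
              for d in range(len(p_z_d))]
    return sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)


if __name__ == "__main__":
    # Toy term-frequency matrix: 4 documents, 4 terms (hypothetical data).
    toy_counts = [
        [4, 3, 0, 0],
        [3, 5, 0, 0],
        [0, 0, 4, 2],
        [0, 0, 3, 4],
    ]
    p_z_d, p_w_z = plsa(toy_counts, n_topics=2, n_iter=30, seed=1)
    print(rank_documents(0, p_z_d, p_w_z))
```

Because the score mixes word probabilities through the latent topics, a document can receive a nonzero
score for a keyword it never contains, which is one way a topic-based search can differ from a purely
syntactic match.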

Published

2020-04-30

Section

Articles