Towards stop words identification in Tamil text clustering

dc.contributor.author: Faathima Fayaza, M. S.
dc.contributor.author: Fathima Farhath, F.
dc.date.accessioned: 2022-02-21T10:50:04Z
dc.date.available: 2022-02-21T10:50:04Z
dc.date.issued: 2021
dc.description.abstract: Nowadays, digital documents have become the primary source of information. Natural language processing is therefore widely used in information retrieval, topic modeling, document classification, and document clustering. Preprocessing plays a significant role in all of these applications, and one of its critical steps is removing stopwords. Many languages have defined stopword lists. However, no publicly available stopword list exists for Tamil because it is an under-resourced language. This study identified 93 general stopwords and some domain-specific stopwords for sports, entertainment, local news, and foreign news by analyzing more than 1.7 million Tamil documents with more than 21 million words. The study also shows that removing stopwords improves the accuracy of a Tamil document clustering system: the F-score improved by 2.4% for TF-IDF with the one-pass algorithm and by 0.95% for FastText with the one-pass algorithm. [en_US]
dc.identifier.citation: International Journal of Advanced Computer Science and Applications, Vol. 12, No. 12, 2021; pp. 524-529. [en_US]
dc.identifier.issn: 2156-5570
dc.identifier.uri: http://ir.lib.seu.ac.lk/handle/123456789/5994
dc.language.iso: en_US [en_US]
dc.publisher: The Science and Information Organization [en_US]
dc.subject: Stopwords [en_US]
dc.subject: Tamil [en_US]
dc.subject: Pre-processing [en_US]
dc.subject: TF-IDF [en_US]
dc.subject: Clustering [en_US]
dc.title: Towards stop words identification in Tamil text clustering [en_US]
dc.type: Article [en_US]

Files

Original bundle

Name: Paper_67-Towards_Stopwords_Identification.pdf
Size: 626.62 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission