ISSN : 2663-2187

An Intensified text clustering algorithm for Noisy Big Data: Haphazardly Dense Clustering with more Noise (HDCN)


U. Vageeswari , B. Lavanya
doi: 10.33472/AFJBS.6.6.2024.5561-5576

Abstract

Large multidimensional databases contain large amounts of noise, and sometimes only a small portion of the data contributes to the clustering. Text clustering involves several preprocessing stages, since text data are semi-structured or unstructured. Transforming text into a numerical form is referred to as text vectorization; TF-IDF and Word2Vec are widely used vectorization methods. This paper proposes a distance-based text clustering algorithm, Haphazardly Dense Clustering with more Noise (HDCN), for noisy and unknown datasets. The algorithm is tested on two datasets using two distinct vectorization approaches, and it is compared with available state-of-the-art methods: K-Means, Hierarchical, and DBSCAN clustering. The DBSCAN algorithm detects 30.5% of noise on average, whereas the HDCN algorithm detects 96.675%.
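To make the vectorization step and the reported noise percentages concrete, the following is a minimal sketch of plain TF-IDF weighting and of a noise-share calculation. It is not the HDCN algorithm, which the abstract does not specify. The function names `tfidf_vectors` and `noise_percentage`, the whitespace tokenizer, the unsmoothed IDF formula, and the use of `-1` as the noise label (a convention borrowed from common DBSCAN implementations) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with plain, unsmoothed TF-IDF weights.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def noise_percentage(labels, noise_label=-1):
    """Percentage of points whose cluster label marks them as noise.

    Assumption: noise points carry the label -1.
    """
    if not labels:
        return 0.0
    return 100.0 * sum(1 for lab in labels if lab == noise_label) / len(labels)


# Hypothetical toy corpus, used only to exercise the functions.
docs = ["noisy text data", "clean text data", "random noise"]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one row of `vectors`, with one weight per vocabulary term; a distance-based clustering method would then operate on these rows. Given the label list produced by such a method, `noise_percentage` yields the kind of figure quoted above for DBSCAN and HDCN.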
