Patrick L. Marcos, Dana Justine D. Pacatang
doi.org/10.36647/TTACA/04.01.A003
Abstract: Support Vector Machines (SVMs) have shown strong performance in many classification tasks. However, their performance deteriorates on high-dimensional data because of the curse of dimensionality: as the number of features grows, the model’s ability to generalize declines and its computational cost rises. This study addresses the problem with an enhanced SVM model that incorporates a Term Frequency-Inverse Document Frequency-Class Variance (TF-IDF-CV) feature extraction method, applied to spam email classification. Unlike traditional methods, TF-IDF-CV takes class variance into account during feature extraction, which helps mitigate the negative effects of high dimensionality. Experimental results show that the enhanced SVM outperforms approaches based on traditional feature extraction techniques, including TF-IDF, Bag of Words, and Word2Vec, achieving an accuracy of 99.42%, a precision of 99.43%, a recall of 99.42%, and an F1-score of 99.42%. These results indicate improved robustness and reliability, making the model a promising option for accurate and efficient spam detection in high-dimensional datasets.
Keywords: Classification, Curse of Dimensionality, Feature Extraction, High-Dimensional Data, Machine Learning, Support Vector Machine, TF-IDF.
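To make the dimensionality issue concrete, the following is a minimal sketch of the plain TF-IDF baseline named in the abstract, written in standard-library Python. It is not the TF-IDF-CV method, whose class-variance term is not defined in this section. The function name `tfidf_vectors`, the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the toy emails are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using textbook TF-IDF weights.

    tf = term count / document length, idf = log(N / df).
    Baseline only: no class-variance weighting is applied here.
    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[tok] / len(toks) * idf[tok] for tok in vocab])
    return vocab, vectors


# Toy, made-up emails used purely for illustration.
emails = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "project meeting notes",
]
vocab, vectors = tfidf_vectors(emails)
print(len(vocab), "features for", len(emails), "emails")
```

Even in this four-email example, each email becomes a vector with one dimension per distinct word, so the feature count already exceeds the number of samples. On a real email corpus the vocabulary is far larger, which is the high-dimensional setting the abstract describes.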