Evaluating Semantic Geometry of Indonesian News Texts: Agglomerative Clustering Study using IndoBERT Embeddings

Joni Wilson Sitopu

Abstract


This study evaluates how well different Agglomerative Clustering configurations reveal the Semantic Geometry of a large corpus of Indonesian news texts represented with IndoBERT Embeddings. The IndoBERT transformer model addresses a limitation of traditional methods such as TF-IDF, which fail to capture semantic equivalence across lexical variation. However, this research finds that the embeddings are dense and homogeneous, so the clustering methodology must be chosen carefully. Clustering with Cosine Similarity produced a highly uneven distribution in which a single cluster held over 99% of the documents, showing that the high directional similarity of the vectors limits its ability to distinguish thematic nuances. In contrast, combining Euclidean Distance with UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction proved optimal among the configurations tested. As a non-linear technique, UMAP exposed the finer structure of the data and yielded the most balanced cluster sizes (4254 to 8204 documents per cluster), with clusters that were also thematically representative. Thematic profiling of the UMAP-Euclidean clusters identified five distinct, granular main themes: Politics, Health & Technology, Macroeconomics & Finance, Economy & Industry, and Education & Social Issues. The study concludes that non-linear dimensionality reduction with UMAP is a crucial step for clarifying the Semantic Geometry and achieving granular, meaningful clustering on IndoBERT embeddings.
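
The cluster-balance comparison described in the abstract can be illustrated with a minimal sketch. It does not reproduce the study's pipeline: the vectors are synthetic stand-ins for IndoBERT embeddings (a large shared offset plus small topic shifts, chosen to mimic high directional similarity), the linkage methods (average for cosine, Ward for Euclidean) are assumptions because the abstract does not state them, and the UMAP step is left out so that the example depends only on NumPy and SciPy. All function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def make_synthetic_embeddings(n_per_topic=60, n_topics=5, dim=32, seed=0):
    """Synthetic stand-in for document embeddings (NOT real IndoBERT output)."""
    rng = np.random.default_rng(seed)
    shared = 10.0 * rng.normal(size=dim)        # large component common to all documents
    centers = rng.normal(size=(n_topics, dim))  # small topic-specific shifts
    blocks = [shared + c + 0.3 * rng.normal(size=(n_per_topic, dim)) for c in centers]
    return np.vstack(blocks)


def mean_pairwise_cosine(X):
    """Average cosine similarity over all distinct document pairs."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(X)
    return (sims.sum() - n) / (n * (n - 1))     # subtract the n self-similarities (each 1)


def cluster_sizes(X, n_clusters, method, metric):
    """Agglomerative clustering cut into at most n_clusters; sizes sorted descending."""
    tree = linkage(X, method=method, metric=metric)
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")  # labels start at 1
    return np.sort(np.bincount(labels)[1:])[::-1]


def dominant_share(sizes):
    """Fraction of documents that fall in the largest cluster."""
    return sizes.max() / sizes.sum()


X = make_synthetic_embeddings()
cosine_sizes = cluster_sizes(X, 5, method="average", metric="cosine")   # assumed linkage
euclid_sizes = cluster_sizes(X, 5, method="ward", metric="euclidean")   # assumed linkage

print("mean pairwise cosine similarity:", round(float(mean_pairwise_cosine(X)), 3))
print("cosine    sizes:", cosine_sizes, "dominant share:", round(float(dominant_share(cosine_sizes)), 3))
print("euclidean sizes:", euclid_sizes, "dominant share:", round(float(dominant_share(euclid_sizes)), 3))
```

The dominant share is the quantity behind the "over 99%" figure reported for Cosine Similarity. On real embeddings, a non-linear reduction such as UMAP would presumably be applied to `X` before the Euclidean clustering call; results on this synthetic data say nothing about the study's actual numbers.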


Keywords


IndoBERT Embeddings; Semantic Geometry; Agglomerative Clustering; UMAP; Euclidean Distance; Indonesian News Analysis


DOI: http://dx.doi.org/10.30829/zero.v9i3.26549



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Publisher :
Department of Mathematics
Faculty of Science and Technology
Universitas Islam Negeri Sumatera Utara Medan