K-Nearest Neighbor Regression for Predicting Song Popularity Using Gower Distance
DOI: https://doi.org/10.37134/ejsmt.vol12.sp.3.2025
Keywords: song popularity, k-nearest neighbor regression, audio feature, Gower distance, weighting method
Abstract
Machine learning is widely used to study human activities, including the arts. In the music industry, it is valuable to predict a song's popularity before the song is released. In this paper, we predict song popularity using k-nearest neighbor regression. Information on the audio features of each song was gathered from the Spotify app: song duration, instrumentalness, loudness, acousticness, danceability, energy, liveness, speechiness, audio valence, key, audio mode, tempo, and time signature. Because these variables are of mixed type, the dissimilarity between songs is measured using the Gower distance. Two weighting methods were also compared. Using 10-fold cross-validation, we found that weights inversely proportional to distance gave better prediction performance than equal weights. The best performance was obtained with k = 5 nearest neighbors, with a mean square error (MSE) of 636.75 and a mean absolute percentage error (MAPE) of 41.58%, which implies a reasonable prediction result.
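As a rough illustration of the approach summarized in the abstract, the following Python sketch combines a Gower distance for mixed-type features with k-nearest neighbor regression under equal and inverse-distance weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy data, the column layout, and the small constant added to the distances are all hypothetical, and the Gower distance is taken in its common unweighted form (range-scaled absolute difference for numeric features, simple mismatch for categorical ones).

```python
import numpy as np

def gower_distance(x, X, num_idx, cat_idx, ranges):
    """Gower distance from one query row x to every row of X."""
    # Numeric features: absolute difference scaled by each feature's range.
    num = np.abs(X[:, num_idx] - x[num_idx]) / ranges
    # Categorical features: 0 if the categories match, 1 otherwise.
    cat = (X[:, cat_idx] != x[cat_idx]).astype(float)
    # Average the per-feature dissimilarities over all features.
    return (num.sum(axis=1) + cat.sum(axis=1)) / (len(num_idx) + len(cat_idx))

def knn_predict(x, X_train, y_train, k, num_idx, cat_idx, ranges, weighted=True):
    """Predict the target of x from its k nearest training rows."""
    d = gower_distance(x, X_train, num_idx, cat_idx, ranges)
    nearest = np.argsort(d)[:k]
    if not weighted:
        # Equal weights: plain average of the neighbors' targets.
        return y_train[nearest].mean()
    # Weights inversely proportional to distance; the small constant
    # (an assumption) avoids division by zero for exact matches.
    w = 1.0 / (d[nearest] + 1e-12)
    return np.sum(w * y_train[nearest]) / np.sum(w)

def mse(y_true, y_pred):
    """Mean square error."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be nonzero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

# Hypothetical toy data. Columns: danceability, tempo (numeric),
# audio mode, time signature (categorical, stored as numeric codes).
X_train = np.array([
    [0.2, 100.0, 0.0, 1.0],
    [0.8, 140.0, 1.0, 4.0],
    [0.5, 120.0, 0.0, 4.0],
    [0.9, 180.0, 1.0, 3.0],
    [0.4, 110.0, 1.0, 4.0],
    [0.7, 160.0, 0.0, 3.0],
])
y_train = np.array([30.0, 70.0, 50.0, 90.0, 40.0, 80.0])  # popularity scores
num_idx, cat_idx = [0, 1], [2, 3]

# Range of each numeric feature, computed on the training data.
ranges = X_train[:, num_idx].max(axis=0) - X_train[:, num_idx].min(axis=0)

query = np.array([0.6, 130.0, 1.0, 4.0])
for weighted in (False, True):
    pred = knn_predict(query, X_train, y_train, 5, num_idx, cat_idx, ranges, weighted)
    print("inverse-distance" if weighted else "equal", "weights:", pred)
```

In a cross-validation setting, the ranges would be recomputed on each training fold, and the two weighting schemes would be compared by averaging `mse` and `mape` over the held-out folds for each candidate k.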
License
Copyright (c) 2025 Hazmira Yozza, Riswan Efendi, Nor Azah Samat, Izzati Rahmi, Aqil Burney S.M.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

