
Abstract

The adoption of Machine Learning (ML)-enabled systems is growing rapidly, introducing novel challenges in maintaining quality and managing technical debt in these complex systems. Among the key quality threats are ML-specific code smells (ML-CSs), suboptimal implementation practices in ML pipelines that can compromise system performance, reliability, and maintainability. Although these smells have been defined in the literature, detailed insights into their characteristics, evolution, and mitigation strategies are still needed to help developers address these quality issues effectively. In this paper, we investigate the emergence and evolution of ML-CSs through a large-scale empirical study focusing on (i) their prevalence in real ML-enabled systems, (ii) how they are introduced and removed, and (iii) their survivability. We analyze over 400,000 commits from 337 ML-enabled projects, leveraging CodeSmile, a novel ML smell detector that we developed to enable our investigation and identify ML-specific code smells. Our results reveal that: (1) CodeSmile can detect ML-CSs with precision and recall rates of 87.4% and 78.6%, respectively; (2) ML-CSs are frequently introduced during file modifications in new feature tasks; (3) smells are typically removed during tasks related to new features, enhancements, or refactoring; and (4) the majority of ML-CSs are resolved within the first 10% of commits. Based on these findings, we provide actionable conclusions and insights to guide future research and quality assurance practices for ML-enabled systems.
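For readers who want to check how the reported detection rates of CodeSmile relate to a confusion matrix, the standard definitions of precision and recall can be sketched in Python. The counts below are hypothetical, chosen only so that the formulas reproduce the reported 87.4% and 78.6% figures; they are not taken from the study's data.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported smells that are actual smells: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual smells that are reported: TP / (TP + FN)."""
    return tp / (tp + fn)

# Hypothetical counts, picked only to match the reported rates:
# 874 true positives, 126 false positives, 238 false negatives.
print(round(precision(874, 126), 3))  # 0.874
print(round(recall(874, 238), 3))     # 0.786
```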


Data Availability Statement

The manuscript includes data as electronic supplementary material. In particular, datasets generated and analyzed during the current study, detailed results, scripts, and additional resources useful for reproducing the study are available as part of our online appendix on Figshare (Recupito et al. 2024). In addition, we included the GitHub repository link for CodeSmile: https://github.com/giammariagiordano/smell_ai/tree/main.

Notes

  1. https://chat.openai.com/

  2. Source code available at: https://github.com/huggingface/transformers/blob/main/examples/research_projects/bertology/run_bertology.py

  3. Available at https://github.com/acmsigsoft/EmpiricalStandards

  4. Qualtrics sample size: https://www.qualtrics.com/blog/calculating-sample-size/

  5. Case of a ‘Columns and DataType Not Explicitly Set’ smell-introducing commit when performing refactoring in BrikerMan/Kashgari project: https://github.com/BrikerMan/Kashgari/commit/f7fb43d2f3651fbba92eb6e5cee8bfd279b0317a

  6. Case of a ‘Columns and DataType Not Explicitly Set’ smell-removing commit when performing enhancement in RTIInternational/gobbli project: https://github.com/RTIInternational/gobbli/commit/b93d184c610c3ae779607679501b4b1dafd30b28

  7. Example of a smell-removing commit aware of the ML-CSs: https://github.com/geyang/ml-logger/commit/25aff14cf101ac4db06be02e65d845c54fc38c84

References

  • Azeem MI, Palomba F, Shi L, Wang Q (2019) Machine learning techniques for code smell detection: A systematic literature review and meta-analysis. Inf Softw Technol 108:115–138


  • Basili VR, Caldiera G, Rombach HD (1994) The goal question metric approach. Encyclopedia of software engineering pp 528–532

  • Bessghaier N, Ouni A, Mkaouer MW (2020) On the diffusion and impact of code smells in web applications. In: Wang Q, Xia Y, Seshadri S, Zhang LJ (eds) Services Computing – SCC 2020. Springer International Publishing, Cham, pp 67–84


  • Cardozo N, Dusparic I, Cabrera C (2023) Prevalence of code smells in reinforcement learning projects

  • Cliff N (1993) Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol Bull 114:494–509. https://api.semanticscholar.org/CorpusID:120113824

  • Conover WJ (1999) Practical nonparametric statistics, vol 350. John Wiley & Sons

  • Costal D, Gómez C, Martínez-Fernández S (2024) Metrics for code smells of ml pipelines. In: Kadgien R, Jedlitschka A, Janes A, Lenarduzzi V, Li X (eds) Product-focused software process improvement. Springer Nature Switzerland, Cham, pp 3–9


  • Cunningham W (1992) The wycash portfolio management system. ACM Sigplan Oops Messenger 4(2):29–30


  • Fowler M (2018) Refactoring. Addison-Wesley Professional


  • Fowler M, Beck K (1997) Refactoring: Improving the design of existing code. In: 11th European conference on object-oriented programming (ECOOP). Jyväskylä, Finland

  • Giordano G, Annunziata G, De Lucia A, Palomba F et al (2023) Understanding developer practices and code smells diffusion in ai-enabled software: A preliminary study. In: IWSM-Mensura

  • Giordano G, Fasulo A, Catolino G, Palomba F, Ferrucci F, Gravino C (2022) On the evolution of inheritance and delegation mechanisms and their impact on code quality. In: 2022 IEEE International conference on software analysis, evolution and reengineering (SANER), pp 947–958. IEEE

  • Giordano G, Sellitto G, Sepe A, Palomba F, Ferrucci F (2023) The yin and yang of software quality: On the relationship between design patterns and code smells. In: 2023 49th Euromicro conference on software engineering and advanced applications (SEAA), pp 227–234. IEEE

  • Khomh F, Penta MD, Guéhéneuc YG, Antoniol G (2012) An exploratory study of the impact of antipatterns on class change-and fault-proneness. Empir Softw Eng 17:243–275


  • Lehman MM, Ramil JF, Wernick PD, Perry DE, Turski WM (1997) Metrics and laws of software evolution-the nineties view. In: Proceedings fourth international software metrics symposium, pp 20–32. IEEE

  • Lenarduzzi V, Lomio F, Moreschini S, Taibi D, Tamburri DA (2021) Software quality for ai: Where we are now? In: Software quality: future perspectives on software engineering quality: 13th international conference, SWQD 2021, Vienna, Austria, January 19–21, 2021, Proceedings 13, pp 43–53. Springer

  • Martínez-Fernández S, Bogner J, Franch X, Oriol M, Siebert J, Trendowicz A, Vollmer AM, Wagner S (2022) Software engineering for ai-based systems: a survey. ACM Trans Softw Eng Methodol (TOSEM) 31(2):1–59


  • Murphy-Hill E, Black AP (2007) Why don’t people use refactoring tools? In: Proceedings of the 1st workshop on refactoring tools, pp 61–62

  • Palomba F, Bavota G, Di Penta M, Fasano F, Oliveto R, De Lucia A (2018) On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation. In: Proceedings of the 40th international conference on software engineering, pp 482–482

  • Palomba F, Bavota G, Di Penta M, Oliveto R, De Lucia A (2014) Do they really smell bad? a study on developers’ perception of bad code smells. In: 2014 IEEE International conference on software maintenance and evolution, pp 101–110. IEEE

  • Palomba F, Panichella A, Zaidman A, Oliveto R, De Lucia A (2018) The scent of a smell: An extensive comparison between textual and structural smells. IEEE Trans Softw Eng 44(10):977–1000. https://doi.org/10.1109/TSE.2017.2752171


  • de Paulo Sobrinho EV, De Lucia A, de Almeida Maia M (2018) A systematic literature review on bad smells-5 w’s: which, when, what, who, where. IEEE Trans Softw Eng 47(1):17–66


  • Ratzinger J, Sigmund T, Gall HC (2008) On the relation of refactorings and software defect prediction. In: Proceedings of the 2008 international working conference on Mining software repositories, pp 35–38

  • Recupito G, Giordano G, Ferrucci F, Di Nucci D, Palomba F (2024) When code smells meet ml: On the lifecycle of ml-specific code smells in ml-enabled systems – appendix. https://doi.org/10.6084/m9.figshare.28167065


  • Recupito G, Pecorelli F, Catolino G, Lenarduzzi V, Taibi D, Di Nucci D, Palomba F (2024) Technical debt in ai-enabled systems: On the prevalence, severity, impact, and management strategies for code and architecture. J Syst Softw 216:112151


  • Rhmann W (2021) Quantitative software change prediction in open source web projects using time series forecasting. Int J Open Source Softw Processes (IJOSSP) 12(2):36–51


  • Riquet N, Devroey X, Vanderose B (2022) Gitdelver enterprise dataset (gded) an industrial closed-source dataset for socio-technical research. In: Proceedings of the 19th international conference on mining software repositories, pp 403–407

  • Rosner B, Glynn RJ (2009) Power and sample size estimation for the wilcoxon rank sum test with application to comparisons of c statistics from alternative prediction models. Biometrics 65(1):188–197. https://doi.org/10.1111/j.1541-0420.2008.01062.x


  • Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, Chaudhary V, Young M, Crespo JF, Dennison D (2015) Hidden technical debt in machine learning systems. Advances in neural information processing systems 28

  • Spadini D, Aniche M, Bacchelli A (2018) Pydriller: Python framework for mining software repositories. In: Proceedings of the 2018 26th ACM Joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp 908–911

  • Taibi D, Janes A, Lenarduzzi V (2017) How developers perceive smells in source code: A replicated study. Inf Softw Technol 92:223–235


  • Tang Y, Khatchadourian R, Bagherzadeh M, Singh R, Stewart A, Raja A (2021) An empirical study of refactorings and technical debt in machine learning systems. In: 2021 IEEE/ACM 43rd international conference on software engineering (ICSE), pp 238–250. IEEE

  • Tufano M, Palomba F, Bavota G, Oliveto R, Di Penta M, De Lucia A, Poshyvanyk D (2017) When and why your code starts to smell bad (and whether the smells go away). IEEE Trans Softw Eng 43(11):1063–1088


  • Van Oort B, Cruz L, Aniche M, Van Deursen A (2021) The prevalence of code smells in machine learning projects. In: 2021 IEEE/ACM 1st Workshop on AI engineering-software engineering for AI (WAIN), pp 1–8. IEEE

  • Walter B, Alkhaeir T (2016) The relationship between design patterns and code smells: An exploratory study. Inf Softw Technol 74:127–142


  • Wang G, Wang Z, Chen J, Chen X, Yan M (2022) An empirical study on numerical bugs in deep learning programs. In: Proceedings of the 37th IEEE/ACM international conference on automated software engineering, pp 1–5

  • Widyasari R, Yang Z, Thung F, Qin Sim S, Wee F, Lok C, Phan J, Qi H, Tan C, Tay Q, Lo D (2023) Niche: A curated dataset of engineered machine learning projects in python. In: 2023 IEEE/ACM 20th International conference on mining software repositories (MSR), pp 62–66

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2012) Experimentation in software engineering. Springer Science & Business Media

  • Zhang H, Cruz L, Van Deursen A (2022) Code smells for machine learning applications. In: Proceedings of the 1st international conference on AI engineering: software engineering for AI, pp 217–228

  • Zhou Y, Leung H, Xu B (2009) Examining the potentially confounding effect of class size on the associations between object-oriented metrics and change-proneness. IEEE Trans Softw Eng 35(5):607–623



Acknowledgements

This work has been partially supported by the European Union – NextGenerationEU through the Italian Ministry of University and Research, PRIN 2022 project “QualAI: Continuous Quality Improvement of AI-based Systems” (grant n. 2022B3BP5S, CUP: H53D23003510006), and by the EMELIOT national project funded by the MUR under the PRIN 2020 program (Contract 2020W3A5FY).

Author information

Authors and Affiliations

Contributions

Gilberto Recupito: Formal analysis, Investigation, Data Curation, Validation, Writing – Original Draft, Visualization. Giammaria Giordano: Formal analysis, Investigation, Data Curation, Validation, Writing – Original Draft, Visualization. Filomena Ferrucci: Writing – Review & Editing. Dario Di Nucci: Supervision, Writing – Review & Editing. Fabio Palomba: Supervision, Resources, Writing – Review & Editing.

Corresponding author

Correspondence to Gilberto Recupito.

Ethics declarations

Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by: Gema Rodriguez-Perez and Ben Hermann.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: SI: Registered Reports

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Recupito, G., Giordano, G., Ferrucci, F. et al. When code smells meet ML: on the lifecycle of ML-specific code smells in ML-enabled systems. Empir Software Eng 30, 139 (2025). https://doi.org/10.1007/s10664-025-10676-4


Keywords

  • Software engineering for artificial intelligence
  • Software quality for artificial intelligence
  • Technical debt
  • Empirical software engineering