Comparing machine learning algorithms and linear model for detecting significant SNPs for genomic evaluation of growth traits in F2 chickens

Document Type: Original Research

Authors
1 Department of Animal Science, Faculty of Agriculture, Tarbiat Modares University, Tehran, Islamic Republic of Iran.
2 Department of Animal Science and Aquaculture, Dalhousie University, Truro, NS, Canada.
3 Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, Victoria 3083, Australia.
Abstract
High-density single nucleotide polymorphism (SNP) panels are expensive, especially for developing countries. However, methods have been developed to identify the most informative SNPs in these panels so that low-density chips can be designed for genomic evaluation at lower cost. This study aimed to assess the efficiency of the Random Forest (RF) and Gradient Boosting Machine (GBM) algorithms and a Linear Model (LM) in identifying SNP subsets for predicting genomic estimated breeding values (GEBVs) for body weight at 6 (BW6) and 9 (BW9) weeks in broiler chickens, and to compare these GEBVs with those obtained from the full 60K SNP panel. Data were collected on 312 F2 chickens genotyped with the Illumina 60K SNP BeadChip. After quality control, the remaining 45,512 SNPs were ranked by p-value, mean square error percentage, and relative influence, obtained with the LM, RF, and GBM methods, respectively. Subsets of the top 400, 1,000, 3,000, and 5,000 SNPs selected by each method were then used to construct genomic relationship matrices for predicting GEBVs with a genomic best linear unbiased prediction (GBLUP) model. Prediction accuracies obtained with RF and GBM were generally higher than those obtained with LM. Compared with using all SNPs, a subset of 1,000 SNPs selected by RF and GBM increased accuracy from 0.38 to 0.64 and 0.66, respectively, for BW6, and from 0.42 to 0.60 and 0.66, respectively, for BW9. These findings suggest that machine learning methods, especially GBM, can outperform LM in selecting important SNPs and can increase the accuracy of genomic prediction in broiler chickens.
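
The rank-then-subset workflow summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several assumptions: genotypes are coded 0/1/2 with no missing values; SNPs are ranked by absolute Pearson correlation with the phenotype, which for single-marker regression on a fixed sample gives the same ordering as ranking by p-value; and the genomic relationship matrix follows the widely used VanRaden-type form G = ZZ' / (2 Σ p(1 − p)), which the abstract does not specify. The function names and the toy data are hypothetical.

```python
from statistics import mean


def rank_snps_by_correlation(genotypes, phenotypes):
    """Return SNP column indices ordered from strongest to weakest |r|.

    genotypes: list of rows (one per animal) of 0/1/2 allele counts.
    phenotypes: list of trait values, one per animal.
    Assumption: with no missing data, ordering by |r| matches ordering
    by single-marker regression p-value.
    """
    y_mean = mean(phenotypes)
    y_dev = [y - y_mean for y in phenotypes]
    y_ss = sum(d * d for d in y_dev)
    scores = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        x_mean = mean(col)
        x_dev = [x - x_mean for x in col]
        x_ss = sum(d * d for d in x_dev)
        if x_ss == 0 or y_ss == 0:
            r = 0.0  # monomorphic SNP or constant trait: no signal
        else:
            r = sum(a * b for a, b in zip(x_dev, y_dev)) / (x_ss * y_ss) ** 0.5
        scores.append((abs(r), j))
    # Strongest association first; ties broken by column index.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores]


def genomic_relationship_matrix(genotypes, snp_idx):
    """Build G = ZZ' / (2 * sum p(1-p)) from the selected SNP columns only."""
    freqs = [mean(row[j] for row in genotypes) / 2 for j in snp_idx]
    denom = 2 * sum(p * (1 - p) for p in freqs)
    if denom == 0:
        raise ValueError("all selected SNPs are monomorphic")
    # Centre each genotype by twice the observed allele frequency.
    z = [[row[j] - 2 * p for j, p in zip(snp_idx, freqs)] for row in genotypes]
    n = len(genotypes)
    return [
        [sum(a * b for a, b in zip(z[i], z[k])) / denom for k in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): 4 animals, 3 SNPs; SNP 1 is monomorphic.
toy_genotypes = [[0, 2, 1], [1, 2, 0], [2, 2, 1], [0, 2, 2]]
toy_phenotypes = [0.1, 1.0, 2.1, 0.0]

ranking = rank_snps_by_correlation(toy_genotypes, toy_phenotypes)
top_snps = ranking[:2]  # analogous to keeping the top 400, 1,000, ... SNPs
G = genomic_relationship_matrix(toy_genotypes, top_snps)
```

In the study itself, the analogous matrix built from each SNP subset would then enter a GBLUP model; for RF or GBM, the ranking step would use the importance measure reported by the respective method in place of the correlation-based score.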
