This section describes the methods CropGBM uses for data processing and briefly introduces the principles behind them.
Using the LightGBM algorithm to predict phenotypic values and select SNPs
At present, methods such as GBLUP are commonly used to predict phenotypic values. GBLUP calculates a kinship (genomic relationship) matrix between samples from genome-wide markers and then uses a mixed linear model (MLM) to compute the estimated breeding values (EBVs) of the samples. However, GBLUP has several shortcomings:
- It has difficulty capturing complex non-linear relationships among genotypes, and its cross-population prediction ability is not ideal when faced with complex population structures.
- The kinship matrix is not extensible: if the training population changes, it must be recalculated, and its computation time grows dramatically with the sample size of the training population.
- It cannot handle multi-class problems, although many genotype-to-phenotype prediction tasks require classification.
With the development of genotyping, phenotyping, and related technologies, the cost of sample acquisition has continued to decline, and genotype and phenotype data have accumulated rapidly, with sample sizes growing from hundreds to thousands. Because GBLUP and similar methods must calculate the correlation matrix between samples, their computing performance degrades significantly as population size increases. CropGBM instead uses the LightGBM algorithm, a heavily optimized implementation of the Gradient Boosting Decision Tree (GBDT) algorithm, as the kernel of phenotype prediction. When processing large batches of sample data, LightGBM, compared with GBLUP and without loss of accuracy, can:
- Dramatically reduce memory consumption.
- Significantly improve prediction speed and support GPU-accelerated model training.
- Support prediction of discrete (categorical) phenotype values.
CropGBM can also mine phenotype-associated SNP sites according to the weight of each SNP in the model. In general, the larger a SNP's weight, the stronger its association with the phenotype. Users are advised to compare the SNP weights provided by CropGBM with GWAS results to eliminate false positives from either method and improve reliability.
Introducing the t-SNE method to reduce the dimensionality of genotype data
The t-distributed stochastic neighbor embedding (t-SNE) algorithm is a nonlinear dimensionality-reduction algorithm for exploring high-dimensional data. It maps data points from high to low dimensions via conditional probabilities and uses the t-distribution to alleviate the crowding problem that arises when high-dimensional data are projected down, so that the result preserves both the local and global structure of the data. Linear dimensionality-reduction algorithms such as PCA cannot guarantee that the structure of nonlinear high-dimensional data is mapped correctly after reduction. Moreover, when reducing data to two dimensions, t-SNE re-arranges the points on a two-dimensional plane, whereas PCA merely selects the two directions in the high-dimensional space that best explain the variation in the data. The t-SNE algorithm is therefore better suited to visualizing dimensionality-reduced data.
Because the t-SNE algorithm has quadratic time and space complexity, it demands substantial system resources. Before applying t-SNE, the program therefore first reduces the dimensionality of the SNP data with a sliding window: the SNPs are divided into groups according to a user-set window size (the number of SNPs per window), and the sum of the SNP values in each group becomes the value of a new feature. This reduces system-resource consumption and increases the richness of the SNP feature values.
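The sliding-window reduction followed by t-SNE can be sketched as below. The window size and t-SNE parameters here are illustrative assumptions, not CropGBM defaults, and scikit-learn's `TSNE` stands in for whatever implementation the tool uses.

```python
# Sketch: sliding-window SNP grouping, then t-SNE to two dimensions.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 1000)).astype(float)  # samples x SNPs

window = 20                       # assumed user-set window size (SNPs per group)
n_windows = X.shape[1] // window
# Sum each group of `window` consecutive SNPs into a single new feature
X_windowed = (
    X[:, : n_windows * window]
    .reshape(X.shape[0], n_windows, window)
    .sum(axis=2)
)

# t-SNE on the reduced matrix; far cheaper than running it on raw SNPs
emb = TSNE(n_components=2, perplexity=30, random_state=1).fit_transform(X_windowed)
```

The windowed matrix has one feature per group (here 1000 SNPs become 50 features), which is what keeps t-SNE's quadratic cost manageable.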
Introducing the OPTICS method to cluster data
The ordering points to identify the clustering structure (OPTICS) algorithm is a density-based clustering algorithm. By finding high-density regions separated by low-density regions in the data set and dividing those high-density regions into clusters, it can discover group structures of arbitrary shape. In contrast, the K-Means and hierarchical clustering algorithms commonly used in current research are distance-based: they perform well on spherical data sets but poorly when the data set has a non-spherical structure. K-Means also requires the user to supply the number of clusters in advance, yet researchers may not know the population structure of the sample set beforehand. OPTICS determines the number of clusters automatically from the density distribution of the data, helping users discover the population structure and sort out the relationships between samples.
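A minimal sketch of density-based clustering with scikit-learn's `OPTICS` follows. The synthetic two-dimensional points stand in for t-SNE-reduced genotype coordinates, and the `min_samples` and `eps` values are illustrative assumptions; note that the cluster count is discovered from the data rather than supplied up front as K-Means requires.

```python
# Sketch: OPTICS finds dense regions separated by low-density gaps.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(2)
# Two synthetic dense groups separated by a low-density gap
g1 = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
g2 = rng.normal(loc=5.0, scale=0.3, size=(100, 2))
X = np.vstack([g1, g2])

# DBSCAN-style cluster extraction from the OPTICS reachability ordering;
# label -1 marks noise points
labels = OPTICS(min_samples=10, cluster_method="dbscan", eps=1.0).fit_predict(X)
n_clusters = len(set(labels) - {-1})
```

Because the two dense groups are well separated, OPTICS recovers two clusters without being told the number in advance.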
Plotting distribution histograms of genotype data statistics
The pre-processing module of the program computes statistics on the input genotype data, including the sample heterozygosity rate, SNP heterozygosity rate, minor allele frequency, SNP missing rate, and sample missing rate, and outputs them as histograms. This makes it convenient for users to understand the overall state of the data and plan subsequent analyses.
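One of these statistics, the minor allele frequency (MAF) per SNP, can be sketched as below. The 0/1/2 coding (count of the alternate allele per sample) and the output filename are assumptions for illustration, not CropGBM's actual input format or output paths.

```python
# Sketch: per-SNP minor allele frequency, plotted as a histogram.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
geno = rng.integers(0, 3, size=(300, 500))  # samples x SNPs, coded 0/1/2

alt_freq = geno.mean(axis=0) / 2.0           # alternate-allele frequency per SNP
maf = np.minimum(alt_freq, 1.0 - alt_freq)   # minor allele frequency, in [0, 0.5]

plt.hist(maf, bins=30)
plt.xlabel("Minor allele frequency")
plt.ylabel("Number of SNPs")
plt.savefig("maf_hist.png")
```

The other statistics (heterozygosity and missing rates) follow the same pattern: compute a per-sample or per-SNP rate, then pass the resulting vector to `plt.hist`.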