QuickStart
This section briefly introduces the functions and usage of CropGBM to help users get started quickly. For a detailed introduction to CropGBM, see the Tutorial.
Test data download address 1: https://gitee.com/cau-xyt/CropGBM-Tutorial-data
Test data download address 2: https://github.com/YuetongXU/CropGBM-Tutorial-data
Installation
Install via Conda
$ conda install -c xu_cau_cab cropgbm
Install via PyPI
$ pip install --user cropgbm
Install from source code
$ tar -zxf CropGBM.tar.gz
# Install Python package dependencies of CropGBM: setuptools, wheel, numpy, scipy, pandas, scikit-learn, lightgbm, matplotlib, seaborn
$ pip install setuptools wheel numpy scipy pandas scikit-learn lightgbm matplotlib seaborn
# Install external dependencies of CropGBM: PLINK 1.90
$ wget s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20191028.zip
$ mkdir plink_1.90
$ unzip plink_linux_x86_64_20191028.zip -d ./plink_1.90
# Add CropGBM, PLINK to the system environment variables for quick use:
$ vi ~/.bashrc
export PATH="/userpath/CropGBM:$PATH"
export PATH="/userpath/plink_1.90:$PATH"
$ source ~/.bashrc
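After editing PATH, it can help to confirm that both executables are actually visible. The following is a minimal sketch, not part of CropGBM, that looks up each program with Python's standard library. The executable name `plink` is an assumption about how the PLINK 1.90 binary is named after unzipping; adjust it if your installation differs.

```python
import shutil


def find_tools(names=("cropgbm", "plink")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        # A missing entry usually means the PATH export was not applied to this shell
        print(f"{name}: {path if path else 'NOT FOUND on PATH'}")
```

If a tool is reported as not found, re-check the export lines in ~/.bashrc and run `source ~/.bashrc` again.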
Parameter configuration
CropGBM accepts parameters in two ways: through a configuration file and on the command line. It reads the configuration file first and then the command line. When a parameter is set in both places, the command-line value takes precedence and the value in the configuration file is ignored.
# CropGBM reads the values of each parameter in the configuration file (-c config_path) and calls the genotype data preprocessing module (-pg all)
$ cropgbm -c config_path -o ./gbm_result/ -pg all
# CropGBM ignores the fileformat value in the configuration file and uses ped instead
$ cropgbm -c config_path -o ./gbm_result/ -pg all --fileformat ped
NOTE: If the program does not run, try prefixing the command with python, e.g. $ python cropgbm -c config_path -pg all
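The precedence rule described above (configuration file first, command line wins) can be sketched in a few lines of Python. This is an illustration of the rule only, not CropGBM's actual parsing code: the INI layout, the `[DEFAULT]` section, and the helper name `resolve_params` are assumptions made for the example.

```python
import argparse
import configparser

# Hypothetical configuration content; the real CropGBM file layout may differ
CONFIG = """
[DEFAULT]
fileformat = bed
fileprefix = genofile
"""


def resolve_params(config_text, argv):
    """Merge config-file values with command-line values; the command line wins."""
    # Step 1: read every parameter from the configuration file
    config = configparser.ConfigParser()
    config.read_string(config_text)
    params = dict(config["DEFAULT"])

    # Step 2: parse the command line; default=None marks options that were not given
    parser = argparse.ArgumentParser()
    parser.add_argument("--fileformat", default=None)
    parser.add_argument("--fileprefix", default=None)
    args = parser.parse_args(argv)

    # Step 3: only options actually given on the command line override the file
    for key, value in vars(args).items():
        if value is not None:
            params[key] = value
    return params


print(resolve_params(CONFIG, []))
print(resolve_params(CONFIG, ["--fileformat", "ped"]))
```

The second call mirrors the `--fileformat ped` example: the file says bed, the command line says ped, and ped is used.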
Genotype data pre-processing
The genotype data pre-processing module extracts the genotype data of specific sample IDs and SNP IDs, displays the SNP missing rate and heterozygosity rate as histograms, and recodes genotypes. Its output provides data in file formats accepted by the downstream analysis modules. CropGBM currently supports the ped and bed input formats for genotype files.
# Call the genotype data pre-processing module to count and display the missing rate and heterozygosity rate of genotype data
$ cropgbm -o ./gbm_result/ -pg all --fileprefix genofile --fileformat ped
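To make the two statistics concrete, here is a minimal sketch of how a per-SNP missing rate and heterozygosity rate could be computed. It is not CropGBM's implementation. It assumes genotype calls are two-character strings with "00" marking a missing call, and that heterozygosity is measured among non-missing calls only; the function name `snp_stats` is hypothetical.

```python
MISSING = "00"  # assumed missing-call code for this example


def snp_stats(genotypes):
    """Return one (missing_rate, heterozygosity_rate) pair per SNP.

    genotypes: list of samples; each sample is a list of two-character calls,
    one per SNP, e.g. "AA", "AG", or "00" for missing.
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    stats = []
    for j in range(n_snps):
        calls = [sample[j] for sample in genotypes]
        called = [c for c in calls if c != MISSING]
        missing_rate = (n_samples - len(called)) / n_samples
        # A call is heterozygous when its two alleles differ
        n_het = sum(1 for c in called if c[0] != c[1])
        het_rate = n_het / len(called) if called else 0.0
        stats.append((missing_rate, het_rate))
    return stats


samples = [
    ["AA", "AG", "00"],
    ["AA", "GG", "CT"],
    ["00", "AG", "CC"],
    ["AT", "AA", "CC"],
]
for snp_index, (miss, het) in enumerate(snp_stats(samples)):
    print(f"SNP {snp_index}: missing={miss:.2f} heterozygosity={het:.2f}")
```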
Phenotype data pre-processing
The phenotype data pre-processing module extracts phenotype data for specific sample IDs, SNPIDs, and group IDs, and performs phenotype normalization and phenotype recoding. It can also display the distribution of the data as histograms or boxplots.
# Call the phenotype data preprocessing module (-pp) for normalization (--phe-norm)
$ cropgbm -o ./gbm_result/ -pp --phe-norm --phefile-path phefile.txt --phe-name DTT
# Extract phenotypic data based on the groupID (--ppgroupfile-path phefile.txt, --ppgroupid-name paternal_line) and display it as a boxplot
$ cropgbm -o ./gbm_result/ -pp --phe-plot --phefile-path phefile.txt --phe-name DTT --ppgroupfile-path phefile.txt --ppgroupid-name paternal_line
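The two operations above, normalizing a phenotype and splitting it by group for a boxplot, can be sketched as follows. This is an illustration under assumptions, not CropGBM's code: z-score scaling is assumed as the normalization method (the source does not say which method --phe-norm applies), and the DTT values and group labels are made-up sample data.

```python
import statistics


def zscore(values):
    """Scale values to mean 0 and sample standard deviation 1 (assumed method)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]


def group_by(values, groups):
    """Collect phenotype values per group ID, e.g. as input for one box per group."""
    grouped = {}
    for value, group in zip(values, groups):
        grouped.setdefault(group, []).append(value)
    return grouped


# Hypothetical DTT phenotype values and paternal_line group IDs
dtt = [60.0, 62.0, 64.0, 66.0, 68.0]
paternal_line = ["P1", "P2", "P1", "P2", "P1"]

normalized = zscore(dtt)
print(normalized)
print(group_by(dtt, paternal_line))
```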
Population structure analysis
The population structure analysis module analyzes the population structure of the samples based on genotype data. CropGBM supports dimensionality reduction using t-SNE or PCA and clustering using OPTICS or K-means. It can also display the population structure of the samples as scatter plots.
# Call the population structure analysis module (-s) to cluster and display samples based on genotype data (--structure-plot)
$ cropgbm -o ./gbm_result/ -s --structure-plot --genofile-path genofile_filter.geno --redim-mode pca --cluster-mode kmeans --n-clusters 30
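For readers unfamiliar with the clustering step, the sketch below shows a plain-Python k-means (Lloyd's algorithm) applied to points that are assumed to have already been reduced to two dimensions. It is not CropGBM's implementation, which may rely on a library instead. The coordinates are invented, and centroids are initialized from the first k points purely to keep the example deterministic.

```python
def kmeans(points, k, n_iter=100):
    """Cluster equal-length numeric tuples with Lloyd's algorithm.

    Centroids start at the first k points (a deterministic simplification;
    real implementations usually use randomized or k-means++ initialization).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid (squared distance)
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids would not move either
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical 2-D coordinates, e.g. the first two principal components per sample
points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (9.5, 10.1), (0.1, 0.4), (10.2, 9.8)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```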
Building models and SNP selection
The model training module is built mainly on the LightGBM algorithm. To improve model accuracy, it is recommended to provide a validation set to assist in tuning. If no validation set is available, cross-validation can be used to select appropriate parameter values. CropGBM selects SNPs based on the gain of each SNP in the trained model. It also supports boxplots and heatmaps to show the importance of the selected SNPs.
# Cross-validation (-e -cv)
$ cropgbm -o ./gbm_result/ -e -cv --traingeno train.geno --trainphe train.phe
# Build model (-e -t). If there is no validation set data, the --validgeno and --validphe parameters can be omitted.
$ cropgbm -o ./gbm_result/ -e -t --traingeno train.geno --trainphe train.phe --validgeno valid.geno --validphe valid.phe
# SNP selection (-e -t -sf), showing changes in model prediction accuracy (--bygain-boxplot)
$ cropgbm -o ./gbm_result/ -e -t -sf --bygain-boxplot --traingeno train.geno --trainphe train.phe
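Gain-based SNP selection amounts to ranking SNPs by how much each one contributed to the trained model and keeping the top ones. The sketch below illustrates that ranking on a made-up gain table. It is not CropGBM's code: the function name `select_snps_by_gain`, the SNP IDs, and the gain values are hypothetical. In LightGBM itself, per-feature gains are available from a trained booster via `feature_importance(importance_type="gain")`.

```python
def select_snps_by_gain(gains, top_n):
    """Return up to top_n SNP IDs ordered by descending gain.

    gains: mapping of SNP ID -> total gain in the trained model.
    SNPs with zero gain were never used in a split and are dropped.
    """
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    return [snp for snp, gain in ranked[:top_n] if gain > 0]


# Hypothetical gain values for four SNPs
gains = {"snp1": 12.5, "snp2": 0.0, "snp3": 40.1, "snp4": 3.2}
print(select_snps_by_gain(gains, top_n=2))
print(select_snps_by_gain(gains, top_n=4))
```

Dropping zero-gain SNPs is a design choice for this example; it keeps the selection from being padded with markers the model never used.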
Phenotype prediction
The phenotype prediction module uses the trained model to predict the phenotype of the test set.
# Phenotype prediction (-e -p)
$ cropgbm -o ./gbm_result/ -e -p --testgeno test.geno --modelfile-path train.lgb_model