Parameters

In this section, the parameters of each module are divided into four types: Boolean, necessary, semi-necessary, and optional.

  • Boolean parameters are flags that are either set or not set. When a flag is set, the program executes the corresponding function; otherwise, it does not.
  • If a necessary parameter is not assigned a value, an error is reported when the module it belongs to is called.
  • If a semi-necessary parameter is not assigned a value, an error is reported when a specific function of the module that depends on it is called.
  • If an optional parameter is not assigned a value, the functions that depend on it cannot be used, but no error is reported.

Except for the DEFAULT module parameters, parameters are used only when the module they belong to is called.
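
The four categories can be pictured as a small validation routine. The sketch below is illustrative only: the check_parameters helper and its argument layout are assumptions made for this example, not the program's actual code.

```python
def check_parameters(args, necessary, semi_necessary, called_function=None):
    """Raise ValueError if a required parameter is unassigned (hypothetical helper).

    args            -- dict mapping parameter name to value (None if unassigned)
    necessary       -- names that must be assigned whenever the module is called
    semi_necessary  -- dict: function name -> names required by that function
    called_function -- the specific function being invoked, if any
    """
    # Necessary parameters are checked as soon as the module is called.
    missing = [name for name in necessary if args.get(name) is None]
    if missing:
        raise ValueError(f"missing necessary parameter(s): {missing}")

    # Semi-necessary parameters are checked only for the function that needs them.
    required = semi_necessary.get(called_function, [])
    missing = [name for name in required if args.get(name) is None]
    if missing:
        raise ValueError(f"{called_function} requires parameter(s): {missing}")

    # Optional parameters are never checked: leaving them unassigned only
    # disables the feature that depends on them.
```

For example, calling the helper with an empty args dict and necessary=["fileprefix"] raises an error, while an unassigned optional parameter passes silently.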

DEFAULT module

The following parameters are used by every module:
-c,--config-path (str) : The path of the configuration file, which specifies the value of each parameter the program needs. When a parameter is set both in the file and on the command line, the command-line value takes precedence.
-o,--output-folder (str) : The output folder for the program results. Default is the current working directory.
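
The precedence rule (command line over configuration file over built-in default) can be sketched as follows. This is a minimal illustration, not the program's implementation: the resolve_parameters function is hypothetical, the configuration file is represented by an already-parsed dict because its on-disk format is not described here, and "." stands in for the current working directory.

```python
import argparse

def resolve_parameters(config_values, argv):
    """Merge defaults, config-file values and command-line values (hypothetical)."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not given on the command line" from a real value.
    parser.add_argument("-c", "--config-path", default=None)
    parser.add_argument("-o", "--output-folder", default=None)
    cli = vars(parser.parse_args(argv))

    merged = {"output_folder": "."}   # built-in default
    merged.update(config_values)      # values read from the configuration file
    # Command-line values are applied last, so they override the file.
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged
```

With config_values = {"output_folder": "from_config"}, passing ["-o", "from_cli"] yields "from_cli", while an empty argument list keeps "from_config".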

Genotype data preprocessing module

  • Boolean

-pg, --preprocessed-geno: Call the genotype data preprocessing module, optional values [all, filter]. 'all' runs the complete preprocessing module: computing and plotting basic statistics of the data; screening out samples and SNP sites that fail the missing-rate and MAF thresholds; imputing missing SNP data with the most frequent genotype; removing redundant SNPs according to r^2; and extracting the genotype data of specific samples or SNPs by sampleID or SNPID. 'filter' runs only part of the preprocessing module: the extraction of specific samples or SNPs by sampleID or SNPID.

  • necessary

--fileprefix (str) : Genotype data filename prefix.

  • optional

--fileformat (str) : Genotype data file format, optional values [ped, bed]. Default is bed
--plink-path (str) : The path of the PLINK executable. Default is plink
--remove-sampleid-path (str) : The file path of the sampleID to be deleted. The format is two columns with spaces or tabs as separators. The first column is familyID and the second column is within-familyID. The program will delete the corresponding sample data in the genotype file according to sampleID.
--keep-sampleid-path (str) : The file path of the sampleID to be extracted. The file format is the same as --remove-sampleid-path.
--extract-snpid-path (str) : The file path of the SNPID to be extracted, the format is a list of SNPID. The program will extract the corresponding SNP data from the genotype file based on the SNPID.
--exclude-snpid-path (str) : The file path of the SNPID to be deleted. The file format is the same as --extract-snpid-path.
--snpmaxmiss (float) : The maximum missing rate of a SNP. SNPs whose missing rate exceeds the threshold are screened out. Default is 0.05
--samplemaxmiss (float) : The maximum missing rate of a sample. Samples whose missing rate exceeds the threshold are screened out. Default is 0.05
--maf-max (float) : The minor allele frequency (MAF) threshold. SNPs whose MAF is less than the threshold are screened out. Default is 0.01
--r2-cutoff (float) : The r^2 threshold for removing redundant SNPs, equivalent to the r^2 threshold of the --indep-pairwise parameter in PLINK. Default is 0.8
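
The missing-rate filter, the MAF filter and the imputation step can be illustrated in plain Python. This is a sketch under stated assumptions rather than the program's own code (which works on PLINK files): genotypes are assumed to be coded as 0/1/2 copies of one allele with None for a missing call, and all function names are hypothetical.

```python
from collections import Counter

def missing_rate(values):
    """Fraction of missing (None) calls in a list of genotypes."""
    return sum(v is None for v in values) / len(values)

def minor_allele_frequency(snp):
    """MAF of one SNP coded as 0/1/2 allele copies (None = missing)."""
    called = [g for g in snp if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    return min(freq, 1 - freq)

def filter_snps(genotypes, snp_max_miss=0.05, maf_min=0.01):
    """Return the indices of SNP columns that pass both thresholds.

    genotypes -- list of samples, each a list with one genotype call per SNP
    """
    kept = []
    for j in range(len(genotypes[0])):
        column = [sample[j] for sample in genotypes]
        if missing_rate(column) > snp_max_miss:
            continue  # too many missing calls
        if minor_allele_frequency(column) < maf_min:
            continue  # minor allele too rare
        kept.append(j)
    return kept

def impute_with_mode(column):
    """Replace missing calls with the most frequent observed genotype."""
    called = [g for g in column if g is not None]
    if not called:
        return list(column)  # nothing observed, nothing to impute from
    mode = Counter(called).most_common(1)[0][0]
    return [mode if g is None else g for g in column]
```

The same missing_rate helper applied to a row (one sample across all SNPs) gives the quantity compared against --samplemaxmiss.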

Phenotype data preprocessing module

  • Boolean

-pp,--preprocessed-phe: Call the phenotype data preprocessing module.
--phe-norm: Use z-score to normalize phenotypic data.
--phe-plot: Call phenotype data visualization function.
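
For reference, z-score normalization rescales the phenotype values to zero mean and unit standard deviation. The sketch below assumes the population standard deviation; whether the program uses the population or the sample form is not stated here, and the z_score function is hypothetical.

```python
from statistics import mean, pstdev

def z_score(values):
    """Return (v - mean) / std for each value (hypothetical helper)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```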

  • necessary

--phefile-path (str) : The phenotype data file path.

  • semi-necessary

--ppexsampleid-path (str) : The file path of the sampleID to be extracted. The file content is only a list of sampleID.
--ppgroupfile-path (str) : GroupID data file path.
--ppgroupfile-sep (str) : GroupID data file separator. Default is ','
--ppgroupid-name (str) : The column name that stores the groupID.
--num2wordfile-path (str) : The file path of the conversion table between integers and phenotypes. This parameter is only used when --phe-recode=num2word.

  • optional

--phefile-sep (str) : Phenotype data file separator. Default is ','
--phe-name (str) : The column name in the phenotype data file that stores phenotype.
--phe-recode (str) : Call the phenotype data recoding function, optional values [word2num, num2word]. 'word2num' re-encodes the phenotype data as consecutive non-negative integers. 'num2word' converts consecutive non-negative integers back into phenotype data (a conversion table between integers and phenotypes must be provided). For classification tasks, LightGBM only accepts consecutive integer labels in [0, N). If the training samples come from 5 groups, the labels must be [0, 1, 2, 3, 4], which usually does not match the actual groupID. With this parameter, the program performs a reversible conversion between sample labels and consecutive integers in [0, N), providing compatible phenotype data for downstream classification tasks.
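
The reversible conversion described for --phe-recode can be sketched as a pair of small functions. The names mirror the option values but the code is only an illustration: the order in which integers are assigned (here, order of first appearance) and the in-memory form of the conversion table are assumptions.

```python
def word2num(labels):
    """Map each distinct label to an integer in [0, N); return (codes, table)."""
    table = {}
    codes = []
    for label in labels:
        if label not in table:
            table[label] = len(table)  # next unused integer
        codes.append(table[label])
    return codes, table

def num2word(codes, table):
    """Invert word2num using the same conversion table."""
    inverse = {number: label for label, number in table.items()}
    return [inverse[c] for c in codes]
```

Applying num2word to the output of word2num with the same table returns the original labels, which is what makes the recoding safe to use around a classifier that only accepts labels in [0, N).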

Population structure analysis module

  • Boolean

-s,--structure: Call the population structure analysis module.
--structure-plot: Call the population structure plot function. A total of 2 or 3 scatter plots are displayed to show the results of dimensionality reduction and clustering.

  • necessary

--genofile-path (str) : The path of the genotype data file. The pre-processed results are often used as input.

  • semi-necessary

--n-clusters (int) : The number of clusters. This parameter is only used when --cluster-mode = kmeans.
--sgroupfile-path (str) : GroupID data file path.
--sgroupfile-sep (str) : GroupID data file separator. Default is ','
--sgroupfile-name (str) : The column name that stores groupID.

  • optional

--redim-mode (str) : Dimension reduction algorithm, optional value [tsne, pca]. Default is pca
--pca-explained-var (int or float) : The parameter value is an integer greater than 1 or a decimal between 0-1. When the parameter value is an integer, it indicates the dimension of the data after pca dimensionality reduction; when the parameter value is a decimal, it indicates the amount of variation that can be explained by the data after pca dimensionality reduction. Default is 0.95. This parameter is only used when --redim-mode = pca
--window-size (int) : Sliding window size in dimensionality reduction. Because the calculation speed of the t-SNE dimensionality reduction algorithm decreases significantly with increasing dimensions, the program performs a preliminary dimensionality reduction on the data by sliding windows. Default is 20. This parameter is only used when --redim-mode = tsne.
--cluster-mode (str) : Clustering algorithm, optional value [kmeans, optics]. Default is kmeans.
--optics-min-samples (int or float) : The parameter value is an integer greater than 1 or a decimal between 0-1. When the parameter value is an integer, it indicates the minimum number of samples required to form the core point; when the parameter value is a decimal number, it indicates the ratio of the minimum number of samples required to form the core point to the total sample number. This parameter is only used when --cluster-mode = optics. Default is 0.025
--optics-xi (int or float) : The minimum steepness of the reachability distance required to mark a cluster boundary. This parameter is only used when --cluster-mode = optics. Default is 0.05
--optics-min-cluster-size (int or float) : The parameter value is an integer greater than 1 or a decimal between 0-1. When the parameter value is an integer, it indicates the minimum number of samples required to form a cluster; when the parameter value is a decimal, it indicates the ratio of the minimum number of samples required to form a cluster to the total number of samples. This parameter is only used when --cluster-mode = optics. Default is 0.03
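
Several parameters above accept either an integer or a decimal with different meanings. The sketch below shows one way these values could be resolved; both helper names are hypothetical, and the rounding rule and the lower bound of 2 in as_sample_count are assumptions rather than documented behavior.

```python
def n_components_for(explained_var, variance_ratios):
    """Number of principal components to keep.

    explained_var   -- int > 1 (fixed dimension) or float in (0, 1) (variance target)
    variance_ratios -- explained variance ratio of each component, descending
    """
    if isinstance(explained_var, int):
        return min(explained_var, len(variance_ratios))
    total = 0.0
    for k, ratio in enumerate(variance_ratios, start=1):
        total += ratio
        if total >= explained_var:
            return k  # first k components reach the requested variance
    return len(variance_ratios)

def as_sample_count(value, n_samples):
    """Convert an int-or-float setting into an absolute number of samples."""
    if isinstance(value, int):
        return value
    # A decimal is read as a fraction of the total number of samples.
    return max(2, int(value * n_samples))
```

For instance, with variance ratios [0.6, 0.3, 0.08, 0.02] and a target of 0.95, three components are kept, because the first two explain only 0.9 of the variance.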

Training model, SNP selection and phenotype prediction module

  • Boolean

-e,--engine: Call the training model, SNP selection and phenotype prediction module.
-t,--train: Call the training model function.
-cv: Call the cross-validation function.
-p,--predict: Call the phenotype prediction function.
-sf,--select-feature: Call SNP selection function.
--bygain-boxplot: Draw a boxplot showing the change as SNPs are successively added to the model. This parameter is used when -sf is specified.

  • semi-necessary

--traingeno (str) : The path to the genotype data file for the training set. The file separator is ','; the first line contains the SNPIDs and the first column contains the sampleIDs. This parameter is used when -t or -cv is specified.
--trainphe (str) : The path to the phenotype data file for the training set. The file separator is ','; the file includes a header, and the first column contains the sampleIDs. This parameter is used when -t or -cv is specified.
--testgeno (str) : The path to the genotype data file for the test set. File format is the same as --traingeno. This parameter is used when -p is specified.
--modelfile-path (str) : The path of the LightGBM model file. The program uses this model to complete the phenotype prediction of the test set. This parameter is used when -p is specified.

  • optional

--validgeno (str) : The path to the genotype data file for the validation set. File format is the same as --traingeno. This parameter is used when -t is specified.
--validphe (str) : The path to the phenotype data file for the validation set. File format is the same as --trainphe. This parameter is used when -t is specified.
--init-model-path (str) : The path of the initial model file. The LightGBM algorithm continues training from this model. Commonly used for batch training on large data sets. This parameter is used when -t or -cv is specified.
--min-detal (float) : The minimum percentage improvement in model accuracy per training round. When the improvement falls below this threshold, the program returns the number of training rounds completed and the cross-validation accuracy of the model. When objective=regression, the accuracy is the mean of the MSE between the predicted and true values. This parameter is used when -cv is specified. Default is 0.05
--cv-times (int) : Number of repetitions for cross-validation. This parameter is used when -sf is specified. Default is 5
--cv-nfold (int) : The number of folds per cross-validation. This parameter is used when -cv is specified. Default is 5
--min-gain (float) : The program calculates the total gain of each SNP in the model and uses the product of the maximum gain and --min-gain as the threshold. SNPs whose total gain in the model is less than the threshold are not selected. This parameter is used when -sf and --bygain-boxplot are specified. Default is 0.05
--max-colorbar (float) : The product of the maximum gain of each SNP in each tree and --max-colorbar is used as the maximum value of the colorbar in the heat map. This parameter is used when -sf is specified. Default is 0.6
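
The --min-gain rule amounts to a relative threshold on per-SNP total gain. A minimal sketch, assuming the total gains have already been collected into a dict keyed by SNPID (the select_snps function and that layout are hypothetical):

```python
def select_snps(total_gain, min_gain=0.05):
    """Keep SNPs whose total gain reaches min_gain * (largest total gain)."""
    if not total_gain:
        return []
    threshold = max(total_gain.values()) * min_gain
    # SNPs with a total gain below the threshold are not selected.
    return [snp for snp, gain in total_gain.items() if gain >= threshold]
```

With gains {"snp1": 100.0, "snp2": 4.0, "snp3": 20.0} and the default 0.05, the threshold is 5.0, so snp1 and snp3 are kept and snp2 is dropped.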

LightGBM parameters

--learning-rate (float) : The learning rate of the model. Default is 0.1
--num-leaves (int) : The maximum number of leaves per decision tree. Default is 10
--num-threads (int) : The number of CPU threads available to the core module. Default is 0, which uses the default number of threads in OpenMP.
--min-data-in-leaf (int) : The minimum number of samples in each leaf. Default is 1
--objective (str) : The training objective of the model, optional values [regression, multiclass]. Default is 'regression'
--device-type (str) : The device type used for model training, optional values ['cpu', 'gpu']. Default is 'cpu'
--max-depth (int) : The maximum depth of the decision tree. If the value is <= 0, there is no maximum depth limit. Default is -1
--feature-fraction (float) : The proportion of features used in training the model to the total number of features. Default is 1
--verbosity (int) : Controls the level of detail of the information output during model training. < 0: Fatal, = 0: Error (Warning), = 1: Info, > 1: Debug. Default is 0
--num-class (int) : The number of classes in the training set. This parameter is only used for multi-class tasks, that is, when --objective=multiclass. Default is 1
--num-boost-round (int) : The number of training iterations (boosting rounds) of the model. Default is 100
--early-stopping-rounds (int) : This parameter should only be used when -cv is specified or when -t is given a validation set. The program evaluates the prediction accuracy of the model on the validation set during training; if the accuracy has not improved for N consecutive rounds (N is the value of this parameter), training stops immediately. This parameter can be used to prevent overfitting. Default is 20
--verbose-eval (int) : This parameter should only be used when -cv is specified or when -t is given a validation set. After every N rounds of training (N is the value of this parameter), the program calculates the prediction accuracy of the model on the validation set and outputs it. Default is 10
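
LightGBM's early stopping halts training once the validation metric has not improved for the given number of consecutive rounds. The pure-Python sketch below mimics that rule on a list of precomputed validation errors; it is an illustration of the idea, not LightGBM's code, and the function name is hypothetical.

```python
def best_round_with_early_stopping(valid_scores, early_stopping_rounds=20):
    """Return the 1-based round with the best validation score.

    valid_scores -- validation error after each boosting round (lower is better)
    """
    best_score = float("inf")
    best_round = 0
    for round_no, score in enumerate(valid_scores, start=1):
        if score < best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= early_stopping_rounds:
            break  # no improvement for `early_stopping_rounds` rounds
    return best_round
```

Note the trade-off: a small value stops sooner and saves time but may miss a later improvement, while a large value trains longer before giving up.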