S: 22 Oct 2024 (b.7.7) D: 22 Oct 2024
Population stratification

Clustering

--cluster ['cc'] [{group-avg | old-tiebreaks}] ['missing'] ['only2']

--cluster uses IBS values calculated via "--distance ibs"/--ibs-matrix/--genome to perform complete linkage clustering. The clustering process can be customized in a variety of ways.
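To illustrate what complete linkage clustering does with an IBS-derived distance matrix, the following Python sketch repeatedly merges the two clusters whose farthest pair of members is closest. It is a minimal sketch under stated assumptions, not PLINK's implementation: the function name complete_linkage, the n_clusters stopping rule, the max_size parameter (a hypothetical stand-in for a maximum cluster size constraint), and the toy distances (taken here as 1 - IBS) are all invented for illustration.

```python
from itertools import combinations


def complete_linkage(dist, n_clusters=1, max_size=None):
    """Greedy complete linkage clustering on a symmetric distance matrix.

    dist: list of lists; dist[i][j] is the distance between samples i and j
          (assumed here to be 1 - IBS similarity).
    n_clusters: stop once this many clusters remain.
    max_size: hypothetical cap on cluster size; larger merges are skipped.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            if max_size is not None and len(clusters[a]) + len(clusters[b]) > max_size:
                continue
            # Complete linkage: cluster distance is the largest pairwise distance.
            d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        if best is None:  # no permitted merge remains
            break
        _, a, b = best  # a < b, so deleting index b leaves index a valid
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy distances for four samples: 0/1 and 2/3 are close pairs.
toy = [
    [0.00, 0.05, 0.30, 0.35],
    [0.05, 0.00, 0.32, 0.31],
    [0.30, 0.32, 0.00, 0.04],
    [0.35, 0.31, 0.04, 0.00],
]
print(complete_linkage(toy, n_clusters=2))  # [[0, 1], [2, 3]]
```

With max_size=2 and n_clusters=1, the same toy input also stops at two clusters, because the only remaining merge would exceed the size cap.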
--cluster automatically launches an appropriate IBS calculation when necessary, so you don't have to use it with --distance/--ibs-matrix/--genome unless you want to save the distance matrix to disk.

Reusing an IBS/IBD calculation

--read-genome <filename>

--read-genome lets you use the (possibly gzipped) results of a previous --genome run as the basis for clustering, instead of recomputing IBS and PPC test results from scratch. If any pair of samples is missing from the input file, an error is reported.

You can also invoke --read-dists to reuse the results of a "--distance triangle bin ibs" run. If --read-dists and --ppc are present in the same run, PPC test p-values are calculated from scratch (or loaded via --read-genome) while distances are loaded from the .mibs.bin file.

Adding clustering constraints

--ppc <minimum p-value>
--mc <maximum cluster size>
If the initial cluster assignment violates any of these constraints, a warning will be printed.

--match <filename> [missing value]

Given a file where each line has the following fields:
--match prevents any pair of samples which differ on at least one covariate from being merged into the same cluster. If you provide a second parameter, all covariates with that value are treated as missing (i.e. they don't induce any merge restrictions).

To instead force members of the same cluster to differ on some or all of these covariates, you can combine --match with --match-type. Its input file should contain a single line with up to M fields, each of which is '0', '1', or '-1' (or equivalently, '-', '+', or '*'); '0'/'-' entries specify "negative matches" (samples with equal covariate values cannot be in the same cluster), '1'/'+' entries specify "positive matches" (samples with differing covariate values cannot be in the same cluster), and '-1'/'*' indicates the covariate should be ignored. Thus, using --match without --match-type is equivalent to loading a --match-type file with M '1's.

To enforce within-cluster similarity (but not uniformity) on some quantitative trait(s), you can use --qmatch in combination with --qt. In this case, the --qmatch input file has the same structure as a --match input file (with the additional restriction that all covariates must be numeric), while the --qt input file should contain up to M lines with a single nonnegative tolerance per line (or '-1' to specify that the covariate should be ignored). Merges involving any pair of samples which differ by more than the tolerance for any --qmatch covariate will not be permitted. For backwards compatibility, if no second parameter is provided to --qmatch, the --missing-phenotype value (default '-9') is still treated as missing.

If there are fewer than M entries in the --match-type/--qt file, the trailing fields in the --match/--qmatch file are ignored. --match and --qmatch can be used in the same run (in which case their input files don't have to contain the same number of covariates).
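The merge rules described above can be read as a per-pair predicate. The Python sketch below is one such reading, not PLINK's code: the function names merge_allowed and qmerge_allowed, the in-memory list representation of covariates, and the toy values are all hypothetical. It assumes a cluster merge is permitted only if every cross-cluster pair of samples passes the check.

```python
def merge_allowed(cov_a, cov_b, match_types=None, missing=None):
    """Return True if two samples may share a cluster under --match-style rules.

    cov_a, cov_b: categorical covariate values (strings) for the two samples.
    match_types: per-covariate codes: '1'/'+' positive match, '0'/'-' negative
                 match, '-1'/'*' ignore. None means all positive, i.e. --match
                 used without --match-type.
    missing: optional value treated as missing (imposes no restriction).
    """
    if match_types is None:
        match_types = ["1"] * len(cov_a)
    # zip() stops at the shortest input, so trailing covariates without a
    # match-type entry are ignored.
    for a, b, t in zip(cov_a, cov_b, match_types):
        if t in ("-1", "*"):
            continue
        if missing is not None and (a == missing or b == missing):
            continue
        if t in ("1", "+") and a != b:
            return False  # positive match violated: values differ
        if t in ("0", "-") and a == b:
            return False  # negative match violated: values are equal
    return True


def qmerge_allowed(q_a, q_b, tolerances, missing=-9.0):
    """--qmatch/--qt analogue: numeric covariates must lie within tolerance."""
    for a, b, tol in zip(q_a, q_b, tolerances):
        if tol == -1:  # '-1' tolerance: ignore this covariate
            continue
        if a == missing or b == missing:
            continue
        if abs(a - b) > tol:
            return False
    return True


print(merge_allowed(["EUR", "F"], ["EUR", "M"]))               # False
print(merge_allowed(["EUR", "F"], ["EUR", "M"], ["1", "0"]))   # True
print(qmerge_allowed([30.0, 1.70], [34.0, 1.95], [5, -1]))     # True
```

The second call passes because the first covariate is a positive match (equal values) and the second is a negative match (differing values); the third passes because the age difference is within tolerance and the second covariate is ignored.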
If the initial cluster assignment violates a --match or --qmatch constraint, a warning will be printed.

Dimension reduction

PLINK 1.9 provides two dimension reduction routines: --pca, for principal components analysis (PCA) based on the variance-standardized relationship matrix, and --mds-plot, for multidimensional scaling (MDS) based on raw Hamming distances. Top principal components are generally used as covariates in association analysis regressions to help correct for population stratification, while MDS coordinates help with visualizing genetic distances.

--pca [count] ['header'] ['tabs'] ['var-wts']
--pca-cluster-names <name(s)...>

By default, --pca extracts the top 20 principal components of the variance-standardized relationship matrix; you can change the number by passing a numeric parameter. Eigenvectors are written to plink.eigenvec, and top eigenvalues are written to plink.eigenval. The 'header' modifier adds a header line to the .eigenvec file(s), and the 'tabs' modifier makes the .eigenvec file(s) tab- instead of space-delimited. You can request variant weights with the 'var-wts' modifier, and dump the matrix by using --pca in combination with --make-rel/--make-grm-gz/--make-grm-bin.

This is a simple port of GCTA's --pca flag, which generates the same files from a previously computed relationship matrix. For more full-featured principal component analysis, including automatic outlier removal, high-speed randomized approximation for very large datasets, and LD regression, try EIGENSOFT 6.

If clusters are defined (via --within), you can base the principal components off a subset of samples and then project everyone else onto those PCs with --pca-cluster-names and/or --pca-clusters. --pca-cluster-names accepts a space-delimited sequence of cluster names on the command line, while --pca-clusters takes the name of a file with one cluster name per line.
If you also want the MAFs used in the relationship matrix calculation to be based on only samples in those clusters, dump those MAFs in a separate run with --freqx + --keep-cluster-names/--keep-clusters, and then load them during your PCA run with --read-freq.

--mds-plot <dimension count> ['by-cluster'] ['eigendecomp'] ['eigvals']

In combination with --cluster, --mds-plot produces a Haploview-friendly multidimensional scaling report. By default, multidimensional scaling is performed on an inter-sample distance matrix; use the 'by-cluster' modifier to perform it on an inter-cluster distance matrix (calculated by averaging all inter-sample distances for each cluster pair) instead.

The default, singular value decomposition-based algorithm is designed to give the same results as PLINK 1.07 and the R cmdscale() function (up to rounding errors and sign flips, anyway). The 'eigendecomp' modifier requests a faster eigendecomposition-based algorithm which yields slightly different results. The 'eigvals' modifier causes top eigenvalues to be written to plink.mds.eigvals (one per line; first value corresponds to the first dimension in the .mds file, etc.).

Outlier detection diagnostics

--neighbour <n1> <n2>

For each sample, --neighbour looks at genomic distances to the n1th- through n2th-nearest neighbors, and reports how they compare with the same statistics for other samples. See the PLINK 1.07 documentation for discussion of this diagnostic. Note that PLINK 1.9 does not require --neighbour to be used with --cluster.
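One way to picture the --neighbour diagnostic: for each n from n1 to n2, take every sample's distance to its n-th nearest neighbour and standardize it against the same quantity across all samples, so that unusually isolated samples stand out. The Python sketch below assumes this z-score reading and a plain distance matrix as input; the function name neighbour_z_scores, the use of the population standard deviation, and the toy data are assumptions for illustration, and the output is not PLINK's report format.

```python
from statistics import mean, pstdev


def neighbour_z_scores(dist, n1, n2):
    """Z-score each sample's n-th nearest-neighbour distance, for n in n1..n2.

    dist: symmetric matrix (list of lists) of pairwise genomic distances.
    Returns {n: [z-score for sample 0, z-score for sample 1, ...]}.
    """
    result = {}
    for n in range(n1, n2 + 1):
        nth = []
        for i, row in enumerate(dist):
            # Sorted distances to all other samples (self excluded).
            others = sorted(d for j, d in enumerate(row) if j != i)
            nth.append(others[n - 1])
        mu, sd = mean(nth), pstdev(nth)
        result[n] = [(d - mu) / sd if sd > 0 else 0.0 for d in nth]
    return result


# Toy data: samples 0-3 are mutually close; sample 4 is far from everyone.
toy = [
    [0.0, 0.1, 0.1, 0.1, 0.5],
    [0.1, 0.0, 0.1, 0.1, 0.5],
    [0.1, 0.1, 0.0, 0.1, 0.5],
    [0.1, 0.1, 0.1, 0.0, 0.5],
    [0.5, 0.5, 0.5, 0.5, 0.0],
]
z = neighbour_z_scores(toy, 1, 2)
print(max(range(5), key=lambda i: z[1][i]))  # 4
```

In this toy input, sample 4 has the largest standardized nearest-neighbour distance, which is the kind of pattern the diagnostic is meant to surface.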