Developer information

Source code

Our public GitHub repository is at https://github.com/chrchang/plink-ng. Here's the source code snapshot for our latest posted build, and here's a mirror of PLINK 1.07's source. PLINK 1.9 is GPLv3+ software: you are free to modify and rerelease it, as long as you do not restrict others' freedom to do the same, and include proper attribution.

Compilation instructions

We haven't tested this across a wide range of machine configurations. If you are running into problems, or if you have experience setting up portable installation scripts, let us know. (Note that the plink_first_compile script currently puts zlib-1.3 under the parent directory; you may want to change this configuration.)

First time on Amazon Linux:

    sudo yum install -y gcc gcc-c++ libstdc++ gcc-gfortran glibc glibc-devel make blas-devel lapack lapack-devel atlas-devel
    ./plink_first_compile

(Ubuntu is similar, but you'll need both libatlas-dev and libatlas-base-dev.)

First time on macOS:

    ./plink_first_compile

Subsequent compiles on both platforms:

    make plink

Due to the need for MinGW-w64, our Windows build process is not as easy to replicate. However, if you have experience with that compiler, you should be able to adapt the 'plinkw' Makefile build target to your machine. We'll respond to requests for assistance as time allows.

Core algorithms

Partial sum lookup
For example, the GCTA genomic relationship matrix is defined by the following per-marker increments, where q is the marker's minor allele frequency and x_j, x_k are the minor allele counts (0, 1, or 2) observed in samples j and k:

    A_jk += (x_j - 2q)(x_k - 2q) / (2q(1 - q))

Since each genotype takes one of three values or is missing, only seven distinct cases can occur at a given marker: the six unordered genotype pairs, plus "at least one genotype missing".
This suggests the following matrix calculation algorithm, as a first draft:

1. Initialize all relationship partial sums to zero.
2. For each marker, compute the seven possible increments once, then, for every pair of samples, look up the appropriate increment and add it to the pair's partial sum.
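Here is a minimal C sketch of that first draft (hypothetical function and variable names; for clarity it assumes genotypes are stored one byte per call, with 0/1/2 = minor allele count and 3 = missing, rather than PLINK's packed 2-bit format):

    #include <stddef.h>

    /* Hypothetical first-draft GRM accumulation: one lookup table per marker.
       geno[m * n_samples + j] is 0/1/2 (minor allele count) or 3 (missing). */
    void grm_first_draft(const unsigned char* geno, const double* maf,
                         int n_markers, int n_samples, double* grm) {
      for (int m = 0; m < n_markers; m++) {
        double q = maf[m];
        double denom = 2.0 * q * (1.0 - q);
        double incr[4][4];
        /* Precompute the possible increments (seven distinct values;
           pairs involving a missing genotype contribute zero). */
        for (int a = 0; a < 4; a++) {
          for (int b = 0; b < 4; b++) {
            if ((a == 3) || (b == 3) || (denom == 0.0)) {
              incr[a][b] = 0.0;
            } else {
              incr[a][b] = (a - 2.0 * q) * (b - 2.0 * q) / denom;
            }
          }
        }
        const unsigned char* row = &geno[(size_t)m * n_samples];
        /* Inner loop: a single table lookup and add per sample pair. */
        for (int j = 0; j < n_samples; j++) {
          for (int k = 0; k <= j; k++) {
            grm[(size_t)j * n_samples + k] += incr[row[j]][row[k]];
          }
        }
      }
    }

Even in this naive form, the per-marker table replaces the subtractions, multiplication, and division in the inner loop with a single lookup-and-add.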
We can substantially improve on this by handling multiple markers at a time. Since the seven cases can be distinguished by three bits, we can compose a sequence of bitwise operations which maps a pair of padded 2-bit PLINK genotypes to seven different 3-bit values in the appropriate manner. On 64-bit machines, 20 3-bit values can be packed into a machine word (for example, let bits 0-2 describe the relation at marker #0, bits 3-5 describe the relation at marker #1, etc., all the way up to bits 57-59 describing the relation at marker #19), so this representation lets us instruct the processor to act on 20 markers simultaneously.

Then, we need to perform the update

    A_jk := A_jk + f_0(x_0) + f_1(x_1) + ... + f_19(x_19)

where the x_i's are bit trios, and the f_i's map them to increments. This could be done with 20 table lookups and floating point addition operations. Or, the update could be restructured as

    A_jk := A_jk + f_{0-4}(x_{0-4}) + ... + f_{15-19}(x_{15-19})

where x_{0-4} denotes the lowest-order 15 bits, and f_{0-4} maps them directly to f_0(x_0) + f_1(x_1) + f_2(x_2) + f_3(x_3) + f_4(x_4); similarly for f_{5-9}, f_{10-14}, and f_{15-19}. In exchange for a bit more precomputation (four tables of size 2^15 each; total size 1 MB, which isn't too hard on today's L2/L3 caches), this restructuring cuts the work to four table lookups and adds per update instead of twenty.

(Given probabilistic calls, where there's no seven-cases-per-marker restriction, we recommend expressing the result as the sum of a few large matrix multiplications; you can then rely on a machine-optimized dgemm/sgemm to do the heavy lifting.)

Bit population count

The bit population count ("popcount") of a bit vector is the number of set bits it contains. Most newer (post-2008) processors offer a specialized POPCNT instruction to evaluate this directly in hardware. But it turns out that the best portable vector algorithms, refined over nearly 50 years of hacking, can come within 10 percent of the speed of algorithms exploiting the hardware instruction, while reliance on the GCC __builtin_popcountll() intrinsic leads to a ~300% slowdown when POPCNT is not actually present. Therefore, 64-bit PLINK relies on a vector algorithm. Building on work and discussion by Andrew Dalke, Robert Harley, Cédric Lauradoux, Terje Mathisen, and Kim Walisch (cf. Dalke's testing, and the comp.arch.arithmetic thread tracing the algorithm's origin to Harley), we developed an implementation that achieves near-optimal performance with only SSE2 instructions, which are supported by all Intel and AMD x86-64 processors dating back to 2003. Even older processors are supported by 32-bit PLINK, which evaluates bit population counts with just integer operations. (See popcount_vecs() and popcount_longs() in plink_common.c.) Either way, when evaluating unweighted distances, this approach is even more efficient than using partial sum lookups. Many of PLINK's other calculations also have a bit population count, or a minor variant of it, in the innermost loop. Here's an example.

Ternary dot product

Encoding homozygous minor genotypes as the 2-bit value 00, heterozygous genotypes as 01, and homozygous major genotypes as 10, we evaluate the dot product of two encoded genotype vectors x and y as follows:
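As a concrete illustration, here is a C sketch of one way to carry this out (hypothetical names; it assumes no missing calls, and substitutes a portable 64-bit SWAR popcount for PLINK's SSE2 popcount_vecs(), so it is not necessarily the exact operation sequence used in plink_common.c):

    #include <stddef.h>
    #include <stdint.h>

    /* Classic 64-bit SWAR popcount, from the same family of tricks as
       popcount_longs() in plink_common.c. */
    static uint64_t popcount64(uint64_t w) {
      w -= (w >> 1) & 0x5555555555555555ULL;
      w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
      w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
      return (w * 0x0101010101010101ULL) >> 56;
    }

    /* Hypothetical ternary dot product: 32 genotypes per 64-bit word,
       encoded 00 = hom minor (0), 01 = het (1), 10 = hom major (2);
       assumes no missing calls. */
    uint64_t ternary_dot(const uint64_t* x, const uint64_t* y,
                         size_t word_ct) {
      const uint64_t low = 0x5555555555555555ULL;
      uint64_t dot = 0;
      for (size_t i = 0; i < word_ct; i++) {
        uint64_t hx = x[i] & low;         /* het flags for x */
        uint64_t mx = (x[i] >> 1) & low;  /* hom major flags for x */
        uint64_t hy = y[i] & low;
        uint64_t my = (y[i] >> 1) & low;
        dot += popcount64(hx & hy)                    /* 1*1 products */
             + 2 * popcount64((hx & my) | (mx & hy))  /* 1*2 products */
             + 4 * popcount64(mx & my);               /* 2*2 products */
      }
      return dot;
    }

Each slot's product is 0, 1, 2, or 4, so three masked popcounts per word pair classify and sum all 32 products at once; slots where either sample is homozygous minor fall out of all three masks and contribute nothing.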
Handling of missing values is discussed in our upcoming paper. It slows things down by a factor of three or so, but this approach still beats partial sum lookup. The same basic idea can be used to calculate covariance matrix terms (without variance standardization); however, in that case we have not found a good way to handle missing calls, so partial sum lookups are still employed.

Vertical population count

Brian Browning noted that it is not necessary to explicitly evaluate all four sums; given three of them, the fourth can be determined via subtraction. This key optimization, introduced by his PRESTO software, produces a large speedup on the low-MAF variants that will be increasingly common in tomorrow's datasets. However, it is not compatible with the 'horizontal' bit population count described above (which sums one full row at a time, instead of one column at a time). Therefore, we searched for an efficient "vertical population count" which could be used with Browning's optimization.

One obvious approach involves ordinary parallel adds. Unfortunately, the 0/1 matrix entries must be padded out to 32 bits to avoid integer overflow, so only 4 terms can be updated simultaneously by SSE2 instructions; the padding also has a horrible effect on memory bus traffic. Now, you could buy an additional factor of 2 by limiting the number of cases to 65535 and using 16-bit integers, but that's a crude hack which limits scalability... unless...

...unless you have an inner loop which runs to 65535, and an outer loop which uses the 16-bit counts to update an array of 32-bit totals before overflow would actually happen. PLINK uses a 4-bit inner loop, an 8-bit middle loop, and a 32-bit outer loop instead of just a 16-bit and a 32-bit loop, but the basic idea is the same. In addition, instead of spending even 4 bits per permutation matrix entry, we use bit shifts and masks in the inner loop to efficiently unpack our 1-bit-per-entry storage format, and we use a specially reordered representation of the permutation matrix to streamline the inner loop logic. More precisely, we use an inner loop like this:
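A minimal C sketch of the inner-loop idea (hypothetical names and scalar 64-bit code; PLINK's actual loop is SSE2-vectorized):

    #include <stdint.h>

    #define M1 0x1111111111111111ULL  /* one bit per 4-bit lane */

    /* Hypothetical inner loop: rows[] holds one 64-bit word per
       permutation-matrix row (1 bit per entry, 64 columns).  Up to 15 rows
       are folded into four accumulator words, each packing 16 4-bit column
       counters; 15 iterations is the most a 4-bit counter can absorb
       without overflowing. */
    static void vertical_popcount_inner(const uint64_t* rows, int row_ct,
                                        uint64_t acc4[4]) {
      for (int r = 0; r < row_ct; r++) {  /* row_ct <= 15 */
        uint64_t w = rows[r];
        acc4[0] += w & M1;                /* columns 0, 4, 8, ...  */
        acc4[1] += (w >> 1) & M1;         /* columns 1, 5, 9, ...  */
        acc4[2] += (w >> 2) & M1;         /* columns 2, 6, 10, ... */
        acc4[3] += (w >> 3) & M1;         /* columns 3, 7, 11, ... */
      }
    }

Note that the counter for column c lands in accumulator word c & 3, lane c >> 2; this interleaved column-to-lane mapping is exactly the kind of scrambling that a specially reordered permutation-matrix layout can absorb for free.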
and a middle loop like so:
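A matching sketch of the middle loop, reusing vertical_popcount_inner() from above (again hypothetical; the caller's 32-bit outer loop must drain acc8[] into full-width totals at least every 255 rows):

    #define M4 0x0f0f0f0f0f0f0f0fULL  /* low nibble of each byte */

    /* Hypothetical middle loop: repeatedly run the 4-bit inner loop, then
       fold its 16 4-bit counters into 8-bit counters.  Each pass adds at
       most 15 to every byte-sized lane, so up to 17 passes (17 * 15 = 255)
       fit before an 8-bit counter could overflow. */
    static void vertical_popcount_middle(const uint64_t* rows, int row_ct,
                                         uint64_t acc8[8]) {
      while (row_ct > 0) {
        int chunk = (row_ct > 15) ? 15 : row_ct;
        uint64_t acc4[4] = {0, 0, 0, 0};
        vertical_popcount_inner(rows, chunk, acc4);
        for (int i = 0; i < 4; i++) {
          acc8[2 * i] += acc4[i] & M4;            /* even 4-bit lanes */
          acc8[2 * i + 1] += (acc4[i] >> 4) & M4; /* odd 4-bit lanes */
        }
        rows += chunk;
        row_ct -= chunk;
      }
    }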
This has a lot in common with Edel and Klein's vertical population count.

Hardy-Weinberg equilibrium and Fisher exact tests

Using the probability distribution's super-geometric decay to put upper bounds on tail sums, we have also developed an even faster routine for comparing a contingency table against a specific p-value. The Fisher 2x2 exact test has a nearly identical form, and we have applied all the same optimizations to it. Our permutation test routines go one step further: comparisons against a known p-value are usually performed by checking whether the upper-left contingency table value is within a precomputed range. This almost eliminates the need to fall back on chi-square approximations, which can be a significant win when rare alleles are involved.

The code for our Fisher-Freeman-Halton 2x3 exact test is an extension of the SNP-HWE strategy to two dimensions, with O(n) time complexity, constant space complexity, and small enough coefficients that even tables with entries in the millions can be evaluated within seconds in a web browser on a commodity PC (see the demo link below). This is a very large improvement over Mehta and Patel's network algorithm and its descendants (including e.g. Requena's work), which have been the standard approach for the last 30 years. Extension to larger table sizes is conceptually straightforward (determine the relative likelihoods of the df-ball of tables near enough to the most probable one not to cause a machine precision underflow); the general-case time complexity is O(n^(df/2)). We expect that the network algorithm is still the best basic approach once there are enough degrees of freedom, but if you're writing any serious software for df ≤ 6 we strongly recommend trying our algorithm. For your convenience, we have provided GPLv3 interactive JavaScript calculators and C/C++ standalone code utilizing these ideas. (That page also includes an O(sqrt(n)) exact binomial test applying the same idea.)
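For orientation, here is a compact C sketch of the baseline SNP-HWE recurrence (after Wigginton, Cutler, and Abecasis) that this strategy builds on; it deliberately omits the tail-bounding and precomputed-range optimizations described above, and the names are ours:

    #include <stdint.h>
    #include <stdlib.h>

    /* HWE exact test p-value via the SNP-HWE recurrence: walk outward from
       the most probable heterozygote count, accumulating the relative
       likelihoods of all genotype tables with the observed allele counts. */
    double hwe_exact_p(int obs_hets, int obs_hom1, int obs_hom2) {
      int obs_homr = (obs_hom1 < obs_hom2) ? obs_hom1 : obs_hom2;  /* rarer hom */
      int obs_homc = (obs_hom1 < obs_hom2) ? obs_hom2 : obs_hom1;
      int rare = 2 * obs_homr + obs_hets;      /* rare allele count */
      int n = obs_hets + obs_homr + obs_homc;  /* genotype count */
      if (n == 0) return 1.0;
      double* probs = (double*)calloc(rare + 1, sizeof(double));
      if (!probs) return -1.0;
      /* Most probable het count, adjusted to match the parity of 'rare'. */
      int mid = (int)(((int64_t)rare * (2 * n - rare)) / (2 * n));
      if ((mid & 1) != (rare & 1)) mid++;
      probs[mid] = 1.0;
      double sum = 1.0;
      int hets;
      /* Walk down: two hets convert to one rare hom + one common hom. */
      int homr = (rare - mid) / 2;
      int homc = n - mid - homr;
      for (hets = mid; hets > 1; hets -= 2) {
        probs[hets - 2] = probs[hets] * hets * (hets - 1.0)
            / (4.0 * (homr + 1.0) * (homc + 1.0));
        sum += probs[hets - 2];
        homr++;
        homc++;
      }
      /* Walk up: one rare hom + one common hom convert to two hets. */
      homr = (rare - mid) / 2;
      homc = n - mid - homr;
      for (hets = mid; hets <= rare - 2; hets += 2) {
        probs[hets + 2] = probs[hets] * 4.0 * homr * homc
            / ((hets + 2.0) * (hets + 1.0));
        sum += probs[hets + 2];
        homr--;
        homc--;
      }
      /* p-value: total probability of tables no likelier than the observed. */
      double target = probs[obs_hets];
      double p = 0.0;
      for (hets = rare & 1; hets <= rare; hets += 2) {
        if (probs[hets] <= target * (1.0 + 1e-12)) p += probs[hets];
      }
      free(probs);
      p /= sum;
      return (p > 1.0) ? 1.0 : p;
    }

The super-geometric decay of these relative likelihoods away from the mode is what permits the early-termination bounds described above.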
Multithreaded gzip

For generic multithreaded gzip compression we use Mark Adler's pigz code[1]; for multithreaded block-gzip, we use code from htslib 1.2.1.

[1]: Well, mostly. Since pigz doesn't support Windows yet, we just use single-threaded compression there.

Guidelines for adding new functionality

It's not too hard to add a new flag to PLINK if you're experienced with C programming: define one or two more variables in main() and extend the argument list of the appropriate downstream function (usually plink()), add command-line parsing logic for your flag in the appropriate alphabetic branch of the big switch() block, and then insert your actual calculation. Some technical notes:

If you want to learn more, you are encouraged to watch the GitHub repository. Feel free to open discussions about the code on its issues page.