|Published (Last):||14 March 2014|
We focus on standard procedures for QC of genomewide array data, including genotype calling, sex verification, sample identity verification, relationship checking and population structure, that are complicated by the HumanExome panel's enrichment in rare, exonic variation. Genome-wide association studies (GWAS) have successfully identified thousands of common genetic variants of small to moderate effect associated with common, heritable diseases (Hindorff et al.).
The cost of whole-exome sequencing (WES), however, is still daunting for samples of the size required to detect single rare variants of moderate effect (Page et al.). Other considerations, including data storage requirements and a labor-intensive data processing pipeline, also make WES less attractive. Exome-based genotyping arrays combine targeted interrogation of exonic variation with the low cost and convenience of a GWAS chip.
Researchers may add up to , custom variants. The chip, moreover, includes a panel of 4, ancestry-informative markers (3, AA vs. European, 1, Native American vs. European) and a scaffold of 5, markers with high minor allele frequency (MAF) in several continental populations, suitable for linkage analysis or relationship estimation. Quality control (QC) is an often underappreciated process crucial to the success of any genetic association study. Two previous publications have specifically addressed QC for exome arrays.
Grove et al. Recently, Guo et al. presented detailed protocols for running Illumina GenomeStudio, Illumina's sample processing and genotype calling software, with references to accessory R and Python scripts to perform specific data-processing tasks (Guo et al.). Both of these reports have some limitations: the Grove et al.
Here, we emphasize procedures and potential problems not previously described, though we outline the complete process. In addition, considerable text-file manipulation and processing is required, for which a scripting language such as Perl or Python is best suited. GenomeStudio's GenCall algorithm is optimized for common variants (Perreault et al.).
For this reason, we verified rare variant genotypes by means of a second genotyping program, zCall Goldstein et al. Guo et al. Additionally, Grove et al. Genotyping from raw fluorescence data Illumina. When data are initially loaded to create a GenomeStudio project, GenomeStudio either clusters genotypes empirically, using only the data, or calls genotypes from predefined clusters in a cluster. We used Illumina's standard clustering file to call genotypes within GenomeStudio.
Raw fluorescence data are contained in two. A Sample Sheet, provided to GenomeStudio by the user to organize the project, links the sample identifiers (IDs) provided by the investigators with the. Merging data from several genotyping projects into a single GenomeStudio project involves extra steps and requires that the arrays be exactly the same version across all data sets. After loading the first dataset, choose Load Additional Samples from the File drop-down menu, load the data as with the first dataset, and repeat until all samples are loaded.
Examine low-quality samples by means of genomewide B allele frequency (BAF) and log R ratio (LRR) intensity plots (see Sample Quality, below) for large chromosomal deletions that may affect overall call rate.
Such samples may be usable for analysis. Here, final genotypes were not based on zCall output directly; rather, zCall genotypes were used to identify rare variants with true heterozygous genotypes missed by GenCall. To conduct the zCall review, first generate a GenomeStudio Final Report containing genotype calls and intensity values in X, Y coordinates as input for zCall.
This change increases the required distance from homozygote clusters for a genotype to be called as heterozygous. Hence, both input and output files may be very large, especially the input Final Report (about 80 MB per sample). Compare minor allele counts from GenCall and zCall by running an allele frequency report on each version of the genotypes using PLINK's --freq function with the --counts option, which reports allele counts.
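The comparison of the two allele-count reports can be sketched in Python as follows. This is a minimal illustration, not the authors' script: it assumes each report is whitespace-delimited with a header row containing SNP, C1 and C2 columns (the layout PLINK writes for allele counts), and the function names are hypothetical. It flags markers for which zCall finds more minor alleles than GenCall, which are the candidates for manual cluster review.

```python
def read_minor_counts(report_text):
    """Parse an allele-count report into {marker name: minor allele count}.

    Assumes a header row with SNP, C1 and C2 columns (an assumption about
    the report layout); the minor count is taken as min(C1, C2) so that the
    result does not depend on which allele each report labels as A1.
    """
    rows = [line.split() for line in report_text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    snp = header.index("SNP")
    c1 = header.index("C1")
    c2 = header.index("C2")
    return {r[snp]: min(int(r[c1]), int(r[c2])) for r in body}


def markers_to_review(gencall_report, zcall_report):
    """Return markers where zCall reports more minor alleles than GenCall."""
    gencall = read_minor_counts(gencall_report)
    zcall = read_minor_counts(zcall_report)
    shared = set(gencall) & set(zcall)
    return sorted(m for m in shared if zcall[m] > gencall[m])
```

In practice the two strings would be read from the report files produced for the GenCall and zCall versions of the genotypes; markers present in only one report are ignored here and would need separate handling.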
If a distinct heterozygote cluster is evident and the cluster locations are consistent with high-quality marker data see Guo et al. Otherwise, discard the marker. Also manually review all pseudoautosomal, Y-linked and mitochondrial markers in GenomeStudio for poor clustering. Where misclustering is evident, adjust cluster locations manually where possible and discard markers without well-defined clusters.
GenomeStudio's plug-in module exports genotypes to PLINK plain-text format, which can result in extremely large pedigree files. As an optional step to reduce file size, filter out within GenomeStudio any markers with a minor allele frequency of 0 in the sample, since these SNPs contain no variation useful for detecting genetic association or linkage.
There are likely to be many of these monomorphic markers, unless the sample is extremely large. Do not discard monomorphic markers, however, if the exported data are to be combined with other sets of genotypes in which these markers may be polymorphic.
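The text performs this filter inside GenomeStudio; the same logic can be sketched outside it for illustration. The example below assumes genotypes are coded as the number of B alleles (0, 1 or 2) with None for a missing call, which is an assumed encoding, and the function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF for one marker, or None if no genotype was called.

    genotypes: B-allele counts per sample (0, 1 or 2), None when missing.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1 - freq_b)


def split_monomorphic(marker_genotypes):
    """Split markers into (polymorphic_or_uncalled, monomorphic) name lists.

    Markers with no called genotypes are left in the first list so that
    they are handled by call-rate screening rather than dropped here.
    """
    keep, monomorphic = [], []
    for name, genotypes in marker_genotypes.items():
        if minor_allele_frequency(genotypes) == 0:
            monomorphic.append(name)
        else:
            keep.append(name)
    return keep, monomorphic
```

As noted above, the monomorphic list should be retained rather than discarded when the data will later be merged with other genotype sets.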
Finally, omit markers mapping to chromosome 0. Extract Y chromosome genotypes separately, with a call rate threshold somewhat below the proportion of male samples in the data set. Conduct further call rate screening for the Y chromosome using male samples only, after resolving sex-mismatch errors described under Sample Quality, below.
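The two-stage Y chromosome screen described above can be sketched as follows. This is only an illustration under stated assumptions: genotypes are held in memory with None for missing calls, the `margin` subtracted from the male proportion is a hypothetical placeholder for "somewhat below", and none of the function names come from the original protocol.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotypes; None if the list is empty."""
    if not genotypes:
        return None
    return sum(g is not None for g in genotypes) / len(genotypes)


def y_export_threshold(is_male, margin=0.05):
    """Initial Y call rate threshold: male proportion minus a margin.

    The margin value is an assumption chosen for illustration only.
    """
    male_fraction = sum(is_male) / len(is_male)
    return max(0.0, male_fraction - margin)


def y_call_rates(y_genotypes, is_male):
    """Return {marker: (call rate in all samples, call rate in males only)}.

    y_genotypes: {marker: [genotype per sample]}, None for missing calls.
    is_male: one boolean per sample, in the same order, taken after
    sex-mismatch errors have been resolved.
    """
    rates = {}
    for marker, genotypes in y_genotypes.items():
        male_only = [g for g, male in zip(genotypes, is_male) if male]
        rates[marker] = (call_rate(genotypes), call_rate(male_only))
    return rates
```

The first element of each pair corresponds to the initial extraction step, and the second to the later screen using male samples only.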
DNA strand orientation is an important consideration when genotype data sets from different arrays are combined and when untyped markers are imputed Unit 1.
A small percentage of variants may not have rsIDs. Sample quality QC is concerned primarily with identifying (a) poorly typed DNA samples; (b) sample identity errors, including samples erroneously duplicated and switched; and (c) chromosome anomalies that may interfere with or confound genetic analysis.
Many, but not all, poor-quality DNA samples will be identified and removed during genotype calling; thus, the sample QC procedures primarily concern sample identity. All standard sample QC objectives are readily achieved with HumanExome data, with a few complicating factors associated with the structure of the HumanExome panel. Some screens do not immediately identify samples for exclusion, but rather flag samples for additional screening and resolution.
Table 1 lists sample-based screening criteria, further explained in the following sections. The primary screen for sample quality is call rate.
The desired final call rate may vary depending on original DNA quality: DNA samples stored for many years generally yield fewer high-quality genotypes. Whole-genome amplification, a common method to increase abundance of DNA samples in limited quantity before genotyping, may cause anomalous results from both sample and marker QC, particularly in marker panels enriched in rare variants.
Consequently, whole-genome-amplified (WGA) samples require somewhat different QC screens from non-amplified DNA, and require particular care in screens involving total fluorescence intensity or DNA copy number. WGA samples have a lower call rate, on average, but should be screened for call rate in the same way as other samples. Subtle differences in fluorescence intensity may also occur between DNA extracted from blood samples and DNA from lymphoblastoid cell lines (Wellcome Trust Case Control Consortium), but these are unlikely to cause detectable differences in genotype frequencies in samples of moderate size, particularly among rare variants.
Before screening sex chromosomes for sample QC, the X- and Y-linked markers must be properly annotated. Comparing the apparent genetic sex of each sample with the sex recorded in the clinical data can reveal DNA samples mismatched with the participants' clinical records, as well as clerical errors in sample reporting.
We recommend both of two complementary procedures for screening sex-chromosome data: a plot of mean fluorescence intensity of the X vs. Measures of fluorescence intensity used for QC include the B allele frequency (BAF), a value between 0 and 1 representing the relative strength of the A and B allele intensities; and the log R ratio (LRR), a normalized measure of overall signal strength.
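The per-sample values behind a mean-intensity plot can be computed with a short script. The sketch below assumes the LRR values have already been exported as (sample ID, chromosome, LRR) records, for example from a GenomeStudio report; that record layout and the function name are assumptions for illustration.

```python
import math
from collections import defaultdict


def mean_lrr_by_sex_chromosome(records):
    """Return {sample: {"X": mean LRR or None, "Y": mean LRR or None}}.

    records: iterable of (sample_id, chromosome, lrr) tuples. Autosomal
    markers and missing (None or NaN) LRR values are skipped. Any
    pseudoautosomal markers should be removed or relabeled beforehand.
    """
    totals = defaultdict(lambda: {"X": [0.0, 0], "Y": [0.0, 0]})
    for sample, chrom, lrr in records:
        if chrom not in ("X", "Y") or lrr is None or math.isnan(lrr):
            continue
        accumulator = totals[sample][chrom]
        accumulator[0] += lrr
        accumulator[1] += 1
    return {
        sample: {
            chrom: (total / count if count else None)
            for chrom, (total, count) in by_chrom.items()
        }
        for sample, by_chrom in totals.items()
    }
```

Plotting the Y mean against the X mean for each sample, colored by recorded sex, gives the kind of scatter plot used to spot sex mismatches and possible sex-chromosome anomalies.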
Female genotypes on the Y chromosome will generally be missing. Consequently, for the Y chromosome, do not set a standard high threshold for call rate, but rather one somewhat less than the proportion of males in the study sample e. Adding an upper limit to call rate e. A tail extending toward lower intensity for the Y in males likely reflects age-related loss of the Y chromosome (Forsberg et al.). Plot of mean LRR of Y vs. X for recorded putative males (blue squares) and females (magenta circles), without (A) and including (B) WGA samples.
Outlying values are circled and annotated with putative sex chromosome anomalies. An asterisk indicates a confirmed XXX individual. Apparent sex-mismatch errors are evident as points of the opposite color within the two major clusters. Outlying points in Fig.
One individual in Fig. Whole-genome-amplified samples have more variable intensity profiles compare Fig. In contrast, the Y intensities are less dramatically affected in WGA samples. Like PLINK's sex check (see next paragraph), this test is not very detailed, and should not be used without confirmation via a mean intensity plot.
A value of F X between 0. Because the HumanExome chip is scarce in common variants, the estimated F X of female samples is imprecise Fig. Plot of Y chromosome call rate vs. Recorded sex, and potentially sex-aneuploid samples, are indicated as in Fig. Nonetheless, X chromosome heterozygosity, when paired with chromosome Y call rate, can provide useful information that a plot of fluorescence intensity can miss, particularly for WGA samples, whose mean intensity values are not a reliable indicator of true DNA copy number.
Indeed, these genotype-based sex chromosome profiles are considerably more robust to WGA status than the intensity profiles compare Fig. F X Fig. We discuss this sample further below. Thus, the intensity- and genotype-based approaches to examining sex chromosomes have unique advantages when applied to HumanExome data.
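A genotype-based profile pairs an X chromosome homozygosity estimate with the Y chromosome call rate for each sample. The sketch below uses a simplified method-of-moments inbreeding coefficient, F = (O - E) / (N - E), where O is the observed number of homozygous X genotypes, E the number expected from allele frequencies, and N the number of called X markers. PLINK's own estimator may differ in detail (for example, in small-sample corrections), and the genotype encoding, the supplied allele frequencies and the function names are all assumptions.

```python
def x_inbreeding_coefficient(x_genotypes, b_allele_freqs):
    """Simplified F estimate from one sample's X chromosome genotypes.

    x_genotypes: B-allele counts per marker (0, 1 or 2), None if missing.
    b_allele_freqs: B allele frequency per marker, in the same order.
    Returns None when F is undefined (no calls, or no expected
    heterozygosity among the called markers).
    """
    observed_hom = 0
    expected_hom = 0.0
    called = 0
    for genotype, p in zip(x_genotypes, b_allele_freqs):
        if genotype is None:
            continue
        called += 1
        observed_hom += genotype != 1
        expected_hom += 1 - 2 * p * (1 - p)
    if called == 0 or called == expected_hom:
        return None
    return (observed_hom - expected_hom) / (called - expected_hom)


def sex_profile(x_genotypes, b_allele_freqs, y_genotypes):
    """Return (F on X, Y call rate) for one sample."""
    y_called = sum(g is not None for g in y_genotypes)
    y_rate = y_called / len(y_genotypes) if y_genotypes else None
    return x_inbreeding_coefficient(x_genotypes, b_allele_freqs), y_rate
```

A sample with no heterozygous X genotypes and a high Y call rate has a male-like profile, while heterozygous X genotypes and a Y call rate near zero indicate a female-like profile. Because the estimate depends on few common variants on this array, it is best read alongside the intensity plot rather than in isolation.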
PLINK, by default, ignores samples assigned as female when analyzing Y-linked markers, whereas call rates for all samples are required for the genotype-based sex chromosome check. Using R's base graphics package, we created more easily interpretable plots Fig. The absence of a heterozygote band in Fig. Filtering out improperly annotated and poor-quality markers dramatically improves ease of interpreting the BAF plot compare Fig.
In Fig. This sample is marked with an asterisk in Fig. However, the BAF of these genotypes is more variable than expected, and closer inspection reveals that this band is off center, with an average BAF of about 0.
An intermediate, off-center BAF band is also visible on the Y chromosome, although the mean intensity is typical of a single copy of the Y.
Quality Control for the Illumina HumanExome BeadChip.