Implications of imputing missing clinical records on GWAS analysis

Author(s)
Lee, Chanyoung
Advisor
Bhak, JongHwa
Issued Date
2024-02
URI
https://scholarworks.unist.ac.kr/handle/201301/82024 http://unist.dcollection.net/common/orgView/200000743178
Abstract
Genome-wide association studies (GWAS) rely on analyzing single-nucleotide polymorphisms (SNPs) to identify genetic variants associated with specific phenotypes. Sample size directly determines the statistical power of these studies. Larger sample sizes are crucial, particularly for detecting variants with small effect sizes that are common in complex diseases but difficult to identify in smaller cohorts. However, in the special case where the number of participants is limited and clinical phenotype data are missing, imputation may be considered as a way to increase statistical power. This study explores the efficacy of imputing missing clinical data and its implications for GWAS analysis. Because data sourced from various hospitals often contain missing values, the completeness and integrity of clinical data are essential for conducting reliable research. Factors such as non-responses, data entry errors, and differing data collection protocols can lead to missing data, and mishandling them can result in biased outcomes and erroneous conclusions. To address these issues, I assessed the suitability of different imputation methods, such as Multiple Imputation by Chained Equations (MICE), Multiple Imputation (MI), and K-Nearest Neighbors (KNN), depending on the characteristics of the data. The 4K and 10K datasets, consisting of 2,492 and 6,085 people, respectively, were imputed to examine the differences in 20 clinical phenotypes. Comparing GWAS results before and after imputation in the 4K dataset revealed diverse patterns. For some traits, significant loci disappeared and reappeared elsewhere; for others, the existing loci were maintained while new ones emerged. Some traits showed minimal changes. In the 10K dataset, by contrast, many SNP sites displayed outlier values before imputation because allele balance correction, a step in batch effect removal, had not been applied. When GWAS was run after imputation, however, these outliers were reduced for some phenotypes. Considering these trends, using imputed clinical information for GWAS may not be as reliable as well-established SNP-based batch effect removal techniques such as kinship removal, allele balance correction, and heterozygosity checks. It can nevertheless be considered a supplementary approach, particularly when the results are treated as reference notes, as in the case of the 4K dataset, or when preprocessing is incomplete, as in the 10K dataset.
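To make the idea of imputing missing clinical values concrete, the following is a minimal sketch of K-Nearest Neighbors imputation, one of the methods named in the abstract. It is not the thesis's pipeline: the function name `knn_impute`, the toy records, the column meanings, and the choice of `k=2` are all assumptions made for illustration. Distances are computed on raw values here; in practice, phenotypes on different scales would normally be standardized first.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest donor rows.

    Distance is Euclidean over the columns that both rows have observed,
    rescaled by the share of columns actually compared so that rows with
    few shared columns are not unfairly favoured.
    """
    n_cols = len(rows[0])

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing to compare on
        squared = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(squared * n_cols / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which column j is observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no usable donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical records: [height_cm, weight_kg, fasting_glucose]
records = [
    [170.0, 65.0, 90.0],
    [172.0, 68.0, 95.0],
    [160.0, 50.0, None],
    [158.0, 52.0, 80.0],
    [171.0, None, 92.0],
]
completed = knn_impute(records, k=2)
```

Each missing value is filled from the original observed data only, so earlier imputations never feed into later ones. Running GWAS on the original and the completed phenotype table and comparing the significant loci would mirror the before/after comparison the abstract describes.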
Publisher
Ulsan National Institute of Science and Technology
