Deep Learning-Integrated Systems Biology Framework for Elucidating Strain-Level Diversity in Bacterial Transcriptional Regulatory Networks

Author(s)
Bang, Ina
Advisor
Kim, Donghyuk
Issued Date
2026-02
URI
https://scholarworks.unist.ac.kr/handle/201301/91103 http://unist.dcollection.net/common/orgView/200000964664
Abstract
Transcription factors (TFs) enable bacteria to rapidly redirect cellular metabolism in response to environmental changes, making their regulatory function essential for survival. Understanding these regulatory mechanisms requires systematic reconstruction of transcriptional regulatory networks (TRNs). Systematic analysis of TRNs has advanced from characterizing individual regulons to resolving genome-wide regulatory interactions using high-throughput sequencing. Recent TRN reconstruction leverages integrated multi-omics datasets, combining TF binding profiles, condition-specific transcriptomes, and promoter features to infer regulatory logic with higher resolution. Chromatin immunoprecipitation (ChIP)-based approaches have become one of the central tools for defining genome-wide TF binding, thereby enabling us to understand how bacteria coordinate gene expression in response to metabolic and environmental cues. Among these methods, ChIP-exo provides near-single-nucleotide resolution by precisely trimming immunoprecipitated DNA, outperforming ChIP-seq and ChIP-chip in defining protein binding positions and resolving motif patterns. However, despite its resolution advantage, ChIP-exo analysis remains bottlenecked by extensive manual curation, particularly when distinguishing actual binding events from noise. Beyond these challenges, bacterial TRNs are further complicated by biological diversity. Even when TF protein sequences are nearly identical, evolutionary diversification of genome content can lead to significantly different regulatory outcomes across strains. As a result, TRNs derived from a single model organism cannot fully represent the regulatory logic operating across a species.
Reconstructing TRNs beyond traditional model strains and identifying which regulatory features are conserved versus diversified is therefore essential for understanding how transcriptional control evolves and how bacterial strains achieve distinct physiological and adaptive behaviors. At the same time, computational prediction of TF binding and regulatory behavior also faces important limitations. Traditional motif-based predictors offer only binary assessments of motif presence and fail to capture non-canonical patterns or condition-dependent regulatory activity. Recent deep learning approaches have enhanced the capacity to discern TF-DNA interactions directly from sequence. Nevertheless, the majority of these approaches remain classification-focused and lack quantitative, condition-dependent predictive capability. This underscores the need for frameworks that integrate experimental data with interpretable, quantitative modeling of transcription regulation.
To address these challenges, we developed DEOCSU, a CNN-based peak-calling suite optimized for the unique border structure of ChIP-exo data. Trained on image-converted peak representations from Escherichia coli sigma factor ChIP-exo datasets, DEOCSU accurately distinguished bona fide peaks from alignment noise and consistently outperformed existing tools. Benchmarking across various other bacteria, archaea, and eukaryotic TFs demonstrated that DEOCSU reliably recovers canonical motifs and characteristic peak width distributions without organism-specific tuning. Together with ChEAP, an integrated preprocessing and visualization pipeline specialized for ChIP-exo data analysis, DEOCSU provides a generalizable, automated workflow for high-resolution mapping of DNA-protein interactions.
Using these tools, the conservation and divergence of CRP-mediated transcription regulation were investigated across thirteen E. coli strains, spanning laboratory, industrial, environmental, and pathogenic strains. High-resolution CRP ChIP-exo under acetate (high cAMP) and glucose (low cAMP) conditions, coupled with RNA-seq, enabled reconstruction of strain-specific CRP regulons and definition of a CRP pan-regulon. Although the CRP motif was universally conserved, regulatory outputs varied markedly. iModulon-based decomposition revealed that metabolism-related modules remain broadly conserved, whereas stress and condition-responsive modules exhibit substantial strain-specific divergence. Comparison with the E. coli pan-genome revealed that regulatory diversity exceeds genomic diversity, as many core genes participate in accessory or unique CRP regulons, highlighting regulatory rewiring as a significant evolutionary force. Mechanistic contributors to this divergence included pseudogenization of genes, mutations in regulatory sequences, variation in crp and cAMP metabolism-related gene expression, and heterogeneity in local TF networks. A targeted examination of the E. coli K-12 W3110 CRP(K29T) variant revealed that even a single amino acid change can significantly alter CRP binding intensity and downstream transcriptional responses.
Despite differences in regulatory outcomes, CRP exhibits a conserved DNA-binding pattern. Recognizing this, the deep learning-based SIRU framework was designed to test whether differences in flanking sequences and the local sequence context they establish can quantitatively predict CRP binding intensity. SIRU combines a SentencePiece-based tokenizer trained on experimentally validated CRP binding sequences with a transformer encoder that learns how token identities, their pairwise relationships, and their distance-dependent dependencies contribute to CRP binding patterns across the region. Trained on CRP ChIP-exo datasets from twelve E. coli strains, the model accurately predicts condition-specific normalized intensity values and generalizes to unseen strains.
Even in densely packed regulatory regions where adjacent input windows share substantial sequence overlap, SIRU preserves high positional resolution and distinguishes closely spaced binding sites with sequence-level precision. Attention-based interpretation revealed that SIRU does not rely solely on motif presence. Instead, it infers binding intensity from the interplay of features within the biological context, consistently attending to biologically meaningful contextual relationships despite never being trained on regulatory annotations.
By integrating experimental, comparative, and deep learning-based analyses, this work demonstrates how conserved DNA-binding grammars interact with strain-specific genomic and regulatory contexts to shape diverse transcriptional outputs. Through this integrated approach, I establish a deep learning-integrated systems biology framework that elucidates strain-level diversity in bacterial transcriptional regulatory networks. More broadly, the combined framework can provide a generalizable foundation for modeling transcription factor activity and for understanding how regulatory networks evolve across bacterial lineages.
Publisher
Ulsan National Institute of Science and Technology
Degree
Doctor
Major
School of Energy and Chemical Engineering

