COMPUTATIONAL TOOLS FOR THE DNA TEXT COMPLEXITY ESTIMATES FOR MICROBIAL GENOMES STRUCTURE ANALYSIS
Abstract and keywords
Abstract (English):
One of the fundamental tasks in bioinformatics is the search for repeats, that is, statistically heterogeneous segments within DNA sequences and complete genomes of microorganisms. Theoretical approaches to analysing the complexity of macromolecule sequences (DNA, RNA, and proteins) were established before complete genomic sequences became available. These approaches have experienced a resurgence owing to the spread of massively parallel sequencing technologies and the exponential growth of accessible data. This article reviews contemporary computational methods and existing programs for assessing DNA text complexity and for constructing property profiles used to analyse the genomic structure of microorganisms. It gives an overview of the available online programs for detecting and visualising repeats within genetic text. The paper also introduces a new software implementation of a method that evaluates the linguistic complexity of a text and its compressibility by the Lempel-Ziv algorithm. This approach aims to identify structural features and anomalies within the genomes of microorganisms. Examples of profiles generated by the text complexity analysis are provided. The application of these complexity estimates to genome sequences, such as those of the SARS-CoV-2 coronavirus and the mumps virus (Mumps orthorubulavirus), is discussed. Specific regions of low complexity were identified in these genetic texts.

Keywords:
bioinformatics, biophysical models, text complexity, microbial genomes
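
The two complexity measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sum-based form of linguistic complexity (observed distinct k-mers divided by the maximum possible, for k up to a hypothetical cut-off max_k), the LZ76-style phrase count used as a stand-in for Lempel-Ziv compressibility, and the window and step sizes are all assumptions chosen for illustration.

```python
def linguistic_complexity(seq, max_k=6, alphabet_size=4):
    """Ratio of observed distinct k-mers to the maximum possible, summed over k = 1..max_k.

    Assumed sum-based variant; values near 1 mean a diverse text, values near 0 a repetitive one.
    """
    n = len(seq)
    observed = 0
    possible = 0
    for k in range(1, min(max_k, n) + 1):
        windows = n - k + 1  # number of k-mers in the sequence
        observed += len({seq[i:i + k] for i in range(windows)})
        possible += min(alphabet_size ** k, windows)
    return observed / possible if possible else 0.0


def lz_phrase_count(seq):
    """Number of phrases in an LZ76-style parsing (assumed variant).

    Each phrase is the shortest prefix of the remaining text that has not
    occurred earlier; fewer phrases mean a more compressible, repetitive text.
    """
    i, phrases, n = 0, 0, len(seq)
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from the preceding text.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def complexity_profile(seq, window=100, step=10, measure=linguistic_complexity):
    """Slide a window along the sequence and return (start, score) pairs."""
    return [
        (start, measure(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a poly-A run followed by a mixed segment (illustrative only).
    toy = "A" * 20 + "ACGTGCTAGCATCGATGACT"
    for start, score in complexity_profile(toy, window=20, step=20):
        print(start, round(score, 3))
```

In such a profile, windows with low scores mark candidate low-complexity regions (for example simple repeats). A real analysis would read genome sequences from FASTA files and choose the window size to suit the genome length.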
