<!DOCTYPE article
PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20190208//EN"
       "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.4" xml:lang="en">
 <front>
  <journal-meta>
   <journal-id journal-id-type="publisher-id">Russian Journal of Biological Physics and Chemistry</journal-id>
   <journal-title-group>
    <journal-title xml:lang="en">Russian Journal of Biological Physics and Chemistry</journal-title>
    <trans-title-group xml:lang="ru">
     <trans-title>АКТУАЛЬНЫЕ ВОПРОСЫ БИОЛОГИЧЕСКОЙ ФИЗИКИ И ХИМИИ</trans-title>
    </trans-title-group>
   </journal-title-group>
   <issn publication-format="print">2499-9962</issn>
  </journal-meta>
  <article-meta>
   <article-id pub-id-type="publisher-id">54569</article-id>
   <article-categories>
    <subj-group subj-group-type="toc-heading" xml:lang="ru">
     <subject>Общая биофизика</subject>
    </subj-group>
    <subj-group subj-group-type="toc-heading" xml:lang="en">
     <subject>General biophysics</subject>
    </subj-group>
    <subj-group>
     <subject>Общая биофизика</subject>
    </subj-group>
   </article-categories>
   <title-group>
    <article-title xml:lang="en">Influence of the degree of sequencing data filtering on the quality and completeness of the de novo transcriptome assembly</article-title>
    <trans-title-group xml:lang="ru">
     <trans-title>Влияние степени фильтрации данных секвенирования на качество и полноту de novo сборки транскриптома</trans-title>
    </trans-title-group>
   </title-group>
   <contrib-group content-type="authors">
    <contrib contrib-type="author">
     <name-alternatives>
      <name xml:lang="ru">
       <surname>Мегер</surname>
       <given-names>Я В</given-names>
      </name>
      <name xml:lang="en">
       <surname>Meger</surname>
       <given-names>Y V</given-names>
      </name>
     </name-alternatives>
     <email>meger_yakov@mail.ru</email>
     <xref ref-type="aff" rid="aff-1"/>
    </contrib>
    <contrib contrib-type="author">
     <name-alternatives>
      <name xml:lang="ru">
       <surname>Лантушенко</surname>
       <given-names>А О</given-names>
      </name>
      <name xml:lang="en">
       <surname>Lantushenko</surname>
       <given-names>A O</given-names>
      </name>
     </name-alternatives>
     <xref ref-type="aff" rid="aff-2"/>
    </contrib>
    <contrib contrib-type="author">
     <name-alternatives>
      <name xml:lang="ru">
       <surname>Водясова</surname>
       <given-names>Е А</given-names>
      </name>
      <name xml:lang="en">
       <surname>Vodiasova</surname>
       <given-names>E A</given-names>
      </name>
     </name-alternatives>
     <xref ref-type="aff" rid="aff-3"/>
    </contrib>
   </contrib-group>
   <aff-alternatives id="aff-1">
    <aff>
     <institution xml:lang="ru">Севастопольский государственный университет</institution>
     <country>ru</country>
    </aff>
    <aff>
     <institution xml:lang="en">Sevastopol State University</institution>
     <country>ru</country>
    </aff>
   </aff-alternatives>
   <aff-alternatives id="aff-2">
    <aff>
     <institution xml:lang="ru">Севастопольский государственный университет</institution>
     <country>ru</country>
    </aff>
    <aff>
     <institution xml:lang="en">Sevastopol State University</institution>
     <country>ru</country>
    </aff>
   </aff-alternatives>
   <aff-alternatives id="aff-3">
    <aff>
     <institution xml:lang="ru">Севастопольский государственный университет; ФИЦ «Институт биологии южных морей им. А.О. Ковалевского»</institution>
     <country>ru</country>
    </aff>
    <aff>
     <institution xml:lang="en">Sevastopol State University; A.O. Kovalevsky Institute of Biology of the Southern Seas of RAS</institution>
     <country>ru</country>
    </aff>
   </aff-alternatives>
   <pub-date publication-format="print" date-type="pub" iso-8601-date="2020-12-25T20:22:29+03:00">
    <day>25</day>
    <month>12</month>
    <year>2020</year>
   </pub-date>
   <pub-date publication-format="electronic" date-type="pub" iso-8601-date="2020-12-25T20:22:29+03:00">
    <day>25</day>
    <month>12</month>
    <year>2020</year>
   </pub-date>
   <volume>5</volume>
   <issue>4</issue>
   <fpage>580</fpage>
   <lpage>586</lpage>
   <history>
    <date date-type="received" iso-8601-date="2020-12-20T20:22:29+03:00">
     <day>20</day>
     <month>12</month>
     <year>2020</year>
    </date>
    <date date-type="accepted" iso-8601-date="2020-12-20T20:22:29+03:00">
     <day>20</day>
     <month>12</month>
     <year>2020</year>
    </date>
   </history>
   <self-uri xlink:href="https://rusjbpc.ru/en/nauka/article/54569/view">https://rusjbpc.ru/en/nauka/article/54569/view</self-uri>
   <abstract xml:lang="ru">
    <p>Для сборки de novo транскриптома существует множество сборщиков, которые имеют различающиеся алгоритмы. В то же время этап фильтрации, являясь одним из ключевых, также имеет несколько подходов и алгоритмов. Однако на сегодняшний день работ по изучению влияния степени фильтрации на сборку de novo транскриптома крайне мало. В данной работе были проанализированы транскриптомы, полученные с помощью двух наиболее распространенных программ (rnaSPAdes и Trinity), а также применены различные подходы к этапу фильтрации прочтений. Были показаны ключевые различия для двух сборок и выявлены параметры, которые оказались чувствительными к степени фильтрации и длине входных прочтений. Также был предложен эффективный алгоритм фильтрации, который является двухэтапным и позволяет максимально сохранить объем входных данных при необходимом качестве всех прочтений после фильтрации и обрезки.</p>
   </abstract>
   <trans-abstract xml:lang="en">
    <p>Many assemblers, based on different algorithms, are available for de novo transcriptome assembly. Read filtering, one of the key stages of the workflow, can likewise be performed with several approaches and algorithms. However, very few studies to date have examined how the degree of filtering affects de novo transcriptome assembly. In this paper, we analyzed transcriptomes assembled with two of the most widely used programs (rnaSPAdes and Trinity) and applied different approaches to the read-filtering stage. We show the key differences between the two assemblies and identify the parameters that are sensitive to the degree of filtering and to the length of the input reads. We also propose an effective two-stage filtering algorithm that retains as much of the input data as possible while ensuring the required quality of all reads after filtering and trimming.</p>
   </trans-abstract>
   <kwd-group xml:lang="ru">
    <kwd>rnaSPAdes</kwd>
    <kwd>Trinity</kwd>
    <kwd>сборка de novo транскриптома</kwd>
    <kwd>RNA-seq</kwd>
    <kwd>фильтрация прочтений</kwd>
   </kwd-group>
   <kwd-group xml:lang="en">
    <kwd>RNA-seq</kwd>
    <kwd>rnaSPAdes</kwd>
    <kwd>Trinity</kwd>
    <kwd>de novo transcriptome assembly</kwd>
    <kwd>read filtering</kwd>
   </kwd-group>
   <funding-group>
    <funding-statement xml:lang="ru">Работа выполнена в рамках государственной бюджетной темы (№ 0828-2018-0003), при поддержке Министерства образования и науки РФ (грант № 14.W03.31.0015) и внутреннего гранта СевГУ 2020 № 33/06-31.</funding-statement>
   </funding-group>
  </article-meta>
 </front>
 <body>
  <p></p>
 </body>
 <back>
  <ref-list>
   <ref id="B1">
    <label>1.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Marinov G.K. On the design and prospects of direct RNA sequencing. Briefings in functional genomics, 2017, vol. 16, pp. 326-335.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Marinov G.K. On the design and prospects of direct RNA sequencing. Briefings in functional genomics, 2017, vol. 16, pp. 326-335.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B2">
    <label>2.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
               Liu L., Song B., Ma J., Song Y., Zhang S.Y., Tang Y., Wu X., Wei Z., Chen K., Su J., Rong R., Lu Z., de Magalhães J.P., Rigden D.J., Zhang L., Zhang S.W., Huang Y., Lei X., Liu H., Meng J. Bioinformatics approaches for deciphering the epitranscriptome: Recent progress and emerging topics. Computational and structural biotechnology journal, 2020, vol. 18, pp. 1587-1604.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
               Liu L., Song B., Ma J., Song Y., Zhang S.Y., Tang Y., Wu X., Wei Z., Chen K., Su J., Rong R., Lu Z., de Magalhães J.P., Rigden D.J., Zhang L., Zhang S.W., Huang Y., Lei X., Liu H., Meng J. Bioinformatics approaches for deciphering the epitranscriptome: Recent progress and emerging topics. Computational and structural biotechnology journal, 2020, vol. 18, pp. 1587-1604.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B3">
    <label>3.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Fu M., Su H., Su Z., Yin Z., Jin J., Wang L., Zhang Q., Xu X. Transcriptome analysis of Corynebacterium pseudotuberculosis-infected spleen of dairy goats. Microbial pathogenesis, 2020, vol. 34, pp. 104-120.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Fu M., Su H., Su Z., Yin Z., Jin J., Wang L., Zhang Q., Xu X. Transcriptome analysis of Corynebacterium pseudotuberculosis-infected spleen of dairy goats. Microbial pathogenesis, 2020, vol. 34, pp. 104-120.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B4">
    <label>4.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
               Seweryn M.T., Pietrzak M., Ma Q. Application of information theoretical approaches to assess diversity and similarity in single-cell transcriptomics. Computational and structural biotechnology journal, 2020, vol. 18, pp. 1830-1837.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
               Seweryn M.T., Pietrzak M., Ma Q. Application of information theoretical approaches to assess diversity and similarity in single-cell transcriptomics. Computational and structural biotechnology journal, 2020, vol. 18, pp. 1830-1837.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B5">
    <label>5.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Tamames J., Cobo-Simón M., Puente-Sánchez F. Assessing the performance of different approaches for functional and taxonomic annotation of metagenomes. BMC genomics, 2019, vol. 20, pp. 960.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Tamames J., Cobo-Simón M., Puente-Sánchez F. Assessing the performance of different approaches for functional and taxonomic annotation of metagenomes. BMC genomics, 2019, vol. 20, pp. 960.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B6">
    <label>6.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Hölzer M., Manja M. De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers. GigaScience, 2019, vol. 8, pp. 247-260.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Hölzer M., Manja M. De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers. GigaScience, 2019, vol. 8, pp. 247-260.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B7">
    <label>7.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Longone P. Percolation of aligned rigid rods on two-dimensional triangular lattices. Physical review. E, 2019, vol. 100, pp. 52-64.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Longone P. Percolation of aligned rigid rods on two-dimensional triangular lattices. Physical review. E, 2019, vol. 100, pp. 52-64.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B8">
    <label>8.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Andrews S. FastQC: A Quality Control Tool for High Throughput Sequence Data [Online], 2010. URL: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Andrews S. FastQC: A Quality Control Tool for High Throughput Sequence Data [Online], 2010. URL: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B9">
    <label>9.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Chen S. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 2018, vol. 34, pp. 884-890.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Chen S. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 2018, vol. 34, pp. 884-890.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B10">
    <label>10.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Grabherr M.G., Haas B.J., Yassour M., Levin J.Z., Thompson D.A., Amit I., Adiconis X., Fan L., Raychowdhury R., Zeng Q., Chen Z., Mauceli E., Hacohen N., Gnirke A., Rhind N., di Palma F., Birren B.W., Nusbaum C., Lindblad-Toh K., Friedman N., Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol, 2011, vol. 29, pp. 644-702.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Grabherr M.G., Haas B.J., Yassour M., Levin J.Z., Thompson D.A., Amit I., Adiconis X., Fan L., Raychowdhury R., Zeng Q., Chen Z., Mauceli E., Hacohen N., Gnirke A., Rhind N., di Palma F., Birren B.W., Nusbaum C., Lindblad-Toh K., Friedman N., Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol, 2011, vol. 29, pp. 644-702.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B11">
    <label>11.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
               Bushmanova E., Antipov D., Lapidus A., Prjibelski A.D. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. GigaScience, 2019, vol. 8, pp. 103-147.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
               Bushmanova E., Antipov D., Lapidus A., Prjibelski A.D. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. GigaScience, 2019, vol. 8, pp. 103-147.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B12">
    <label>12.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Gurevich A., Saveliev V., Vyahhi N., Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 2013, vol. 29(8), pp. 1072-1075.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Gurevich A., Saveliev V., Vyahhi N., Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 2013, vol. 29(8), pp. 1072-1075.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B13">
    <label>13.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Langmead B., Wilks C., Antonescu V., Charles R. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics, 2019, vol. 35, pp. 421-432.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Langmead B., Wilks C., Antonescu V., Charles R. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics, 2019, vol. 35, pp. 421-432.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B14">
    <label>14.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
               Seppey M., Manni M., Zdobnov E.M. BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods in Molecular Biology, 2019, vol. 6, pp. 19-62.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
               Seppey M., Manni M., Zdobnov E.M. BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods in Molecular Biology, 2019, vol. 6, pp. 19-62.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
   <ref id="B15">
    <label>15.</label>
    <citation-alternatives>
     <mixed-citation xml:lang="ru">
            
              Edgar R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 2010, vol. 26, pp. 2460-2461.
            
          </mixed-citation>
     <mixed-citation xml:lang="en">
            
              Edgar R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 2010, vol. 26, pp. 2460-2461.
            
          </mixed-citation>
    </citation-alternatives>
   </ref>
  </ref-list>
 </back>
</article>
