Removing Repetitive Sequences

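When analyzing a genome sequence, the first step is usually to identify and mask repetitive sequences. This article describes the types of repeats, the workflow for annotating them, the tools used, and the databases and wiki resources currently available online.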

Types of repetitive sequences

Workflow for annotating repeats in a new genome

  1. Identify known transposable elements;
  2. Predict new (previously unknown) repeats;
  3. Identify tandem repeats.

Characterizing the repeats

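Report the fraction of the genome occupied by repeats and the number of repeats of each type, then compare these figures across species to characterize the repeat content of the species under study.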
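
The per-class counts and genome fractions described above can be tallied from a RepeatMasker `.out` annotation file. The following is a minimal sketch, not part of RepeatMasker itself: the function name `summarize_out`, the sample rows, and the genome size are made up for illustration. It assumes the standard whitespace-separated `.out` layout, where columns 6 and 7 are the begin and end positions in the query and column 11 is the repeat class/family. Overlapping annotations are counted twice, so the percentages are an upper bound.

```python
from collections import defaultdict


def summarize_out(lines, genome_size):
    """Return {repeat_class: (count, masked_bp, percent_of_genome)}."""
    bp_by_class = defaultdict(int)
    count_by_class = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        repeat_class = fields[10]
        bp_by_class[repeat_class] += end - begin + 1  # positions are inclusive
        count_by_class[repeat_class] += 1
    return {
        cls: (count_by_class[cls], bp_by_class[cls],
              100.0 * bp_by_class[cls] / genome_size)
        for cls in bp_by_class
    }


# Made-up rows in .out layout, for illustration only.
SAMPLE_OUT = """\
   SW   perc perc perc  query  position in query   matching  repeat
score   div. del. ins.  sequence  begin end (left)  repeat  class/family

 1320 15.6  6.2  0.0  chr1  1001  1200 (8800) C  MER7A  DNA/hAT-Charlie  (0) 336 103  1
  500 10.0  0.0  0.0  chr1  3001  3100 (6900) +  AluY   SINE/Alu  1 100 (200)  2
  300  5.0  0.0  0.0  chr1  5001  5050 (4950) +  (CA)n  Simple_repeat  1 50 (0)  3
"""

summary = summarize_out(SAMPLE_OUT.splitlines(), genome_size=10000)
for cls, (count, bp, pct) in sorted(summary.items()):
    print(f"{cls}\t{count}\t{bp}\t{pct:.2f}%")
```

Running the same function on the `.out` files of several species gives directly comparable tables.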

Software and parameter settings

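RepeatMasker is the most widely used tool for identifying repeats. Together with its bundled libraries and companion tools, it covers both the identification of known repeats and the prediction of unknown ones.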

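  • Known TEs can be annotated by running RepeatMasker against the Repbase library and RepeatProteinMask against the TE protein library;
  • New repeats can be predicted with RepeatModeler, which calls the RECON and RepeatScout prediction programs;
  • Tandem repeats can be marked with RepeatMasker -noint.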
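
The steps above can be sketched as an ordered list of command lines. This is only an illustration under stated assumptions: the helper name `annotation_commands`, the database name, the input file name, and the path of the predicted library (`new_library`) are hypothetical placeholders, and re-masking with `-lib` after RepeatModeler is an assumed follow-up step rather than something prescribed above. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def annotation_commands(genome, species, db_name="genome_db",
                        new_library="new_repeats.fa"):
    """Build the command lines for the three annotation steps (not executed)."""
    return [
        # Step 1: known TEs, via the species library and the TE protein library.
        ["RepeatMasker", "-species", species, genome],
        ["RepeatProteinMask", genome],
        # Step 2: de novo prediction. BuildDatabase prepares RepeatModeler's input.
        ["BuildDatabase", "-name", db_name, genome],
        ["RepeatModeler", "-database", db_name],
        # Assumed follow-up: mask with the predicted library (placeholder file name).
        ["RepeatMasker", "-lib", new_library, genome],
        # Step 3: low-complexity and simple repeats only.
        ["RepeatMasker", "-noint", genome],
    ]


for cmd in annotation_commands("genome.fa", "Nasonia giraulti"):
    print(shlex.join(cmd))
```

Each entry can be passed to `subprocess.run` once the tools are installed and the placeholder file names are replaced with real paths.
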
The RepeatMasker command NCBI uses to mask repeats, and the meaning of each of its parameters:
RepeatMasker -engine "wublast" -s -cutoff 255 -species "Nasonia giraulti" -no_is -frag 20000

-no_is         skips bacterial insertion element check

-cutoff [number] sets cutoff score for masking repeats when using -lib
               (default cutoff 225)

-s             Slow search; 0-5% more sensitive, 2.5 times slower than default.

-engine [crossmatch|wublast|decypher]
               Select a non-default search engine to use.  If not specified
                 RepeatMasker will use the default configured at install time.

-frag [number] Maximum sequence length masked without fragmenting
                 (default 40000).

RepeatMasker installation and parameter reference

For installation, see http://www.repeatmasker.org/RMDownload.html

1 OPTIONS reference

1.1 Species options

-species     Indicate source species of query DNA

-lib [filename]             Allows the use of a custom library

contamination checking options
-is_only       only clips E coli insertion elements out of FASTA  and .qual files
-is_clip       clips IS elements before analysis (default: IS only reported)
-no_is         skips bacterial insertion element check
-rodspec       only checks for rodent specific repeats (no RepeatMasker run)
-primspec      only checks for primate specific repeats (no RepeatMasker run)

1.2 Masking options (options that determine what kind of repeats are masked)

-cutoff [number] sets cutoff score for masking repeats when using -lib
               (default cutoff 225)
-nolow         does not mask low complexity DNA or simple repeats
-l(ow)         same as nolow (historical)
-(no)int       only masks low complex/simple repeats (no interspersed repeats)
-alu           only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
-div [number]  masks only those repeats that are less than [number] percent
               diverged from the consensus sequence
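
To make the two thresholds concrete, here is a small sketch that applies the same kind of filtering after the fact to already-parsed annotations. RepeatMasker applies `-cutoff` and `-div` internally; the function `filter_hits`, the dictionary keys, and the sample values below are hypothetical, and treating a score equal to the cutoff as passing is an assumption.

```python
def filter_hits(hits, max_div=None, min_score=None):
    """Keep hits below max_div percent divergence and at or above min_score."""
    kept = []
    for hit in hits:
        if max_div is not None and hit["perc_div"] >= max_div:
            continue  # too diverged from the consensus, analogous to -div
        if min_score is not None and hit["score"] < min_score:
            continue  # score below the threshold, analogous to -cutoff
        kept.append(hit)
    return kept


# Made-up annotations, for illustration only.
hits = [
    {"name": "AluY", "score": 500, "perc_div": 10.0},
    {"name": "MER7A", "score": 1320, "perc_div": 15.6},
    {"name": "L2", "score": 210, "perc_div": 28.3},
]

young = filter_hits(hits, max_div=12)         # keeps only AluY
confident = filter_hits(hits, min_score=225)  # keeps AluY and MER7A
print([h["name"] for h in young], [h["name"] for h in confident])
```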

1.3 Options affecting speed and search parameters

-q             Quick search; 5-10% less sensitive, 3-4 times faster than default
-qq            Rush job; about 10% less sensitive,
-s             Slow search; 0-5% more sensitive, 2.5 times slower than default.
-pa(rallel) [number]
               Number of processors to use in parallel (only works for
                 batch files or sequences larger than 50 kb)
-engine [crossmatch|wublast|decypher]
               Select a non-default search engine to use.  If not specified
                 RepeatMasker will use the default configured at install time.
-w(ublast)     Use WU-blast, rather than cross_match as engine
                 **DEPRECATED** Use -engine [crossmatch|wublast|decypher] now.
-frag [number] Maximum sequence length masked without fragmenting
                 (default 40000).
-maxsize [nr]  Maximum length for which IS- or repeat clipped sequences
                  can be produced (default 4000000). Memory requirements go
                  up with higher maxsize.
-gc [number]   Use matrices calculated for 'number' percentage background
                  GC level.
-gccalc        Program calculates the GC content even for batch files/small
                  sequences.
-nocut         Skips the steps in which repeats are excised.
-noisy         Prints cross_match progress report to screen (defaults to
                 .stderr file)

1.4 Output options

-a      shows the alignments in a .align output file; -ali(gnments) also works
-inv    alignments are presented in the orientation of the repeat (with option -a)

-cut    saves a sequence (in file.cut) from which full-length repeats are excised
        (temporarily disfunctional)
-small  returns complete .masked sequence in lower case
-xsmall returns repetitive regions in lowercase (rest capitals) rather than masked
-x      returns repetitive regions masked with Xs rather than Ns

-poly   reports simple repeats that may be polymorphic (in file.poly)
-ace    creates an additional output file in ACeDB format
-gff    creates an additional General Feature Finding format output
-u      creates an untouched annotation file besides the manipulated file
-xm     creates an additional output file in cross_match format (for parsing)

-fixed  creates an (old style) annotation file with fixed width columns
-no_id  leaves out final column with unique ID for each element
-e(xcln) calculates repeat densities (in .tbl) excluding runs of >25 Ns in query

-noisy  prints cross_match progress report to screen (defaults to .stderr file)
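
The difference between the default N masking, `-x`, and `-xsmall` is easiest to see on a toy sequence. The sketch below only imitates the three output styles; it is not RepeatMasker's own code, and `mask_sequence`, the sequence, and the interval are invented for illustration. Intervals are 1-based and inclusive, as in the annotation table.

```python
def mask_sequence(seq, intervals, style="N"):
    """Mask 1-based inclusive intervals.

    style "N" imitates the default output, "X" imitates -x,
    and "xsmall" imitates -xsmall (repeats lowercase, rest capitals).
    """
    chars = list(seq.upper())
    for begin, end in intervals:
        for i in range(begin - 1, min(end, len(chars))):
            chars[i] = chars[i].lower() if style == "xsmall" else style
    return "".join(chars)


seq = "ACGTACGTAC"
repeat = [(3, 6)]
print(mask_sequence(seq, repeat))            # ACNNNNGTAC
print(mask_sequence(seq, repeat, "X"))       # ACXXXXGTAC
print(mask_sequence(seq, repeat, "xsmall"))  # ACgtacGTAC
```

The lowercase style keeps the underlying bases, so downstream tools that understand soft-masking can still read the repeat sequence.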

Related resources
