GFF Conversion Tools: A Roundup

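A roundup of scripts that convert various formats to GFF. The scripts are scattered across different software packages and can be downloaded as needed.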

BioPerl

  • search2gff: turns a protein search report (BLASTP, FASTP, SSEARCH, AXT, WABA) into a GFF file.
  • genbank2gff3.pl: converts GenBank to GBrowse-friendly GFF3.
  • gff2ps: provides GFF-to-PostScript handling.

Converting GenBank Format to GFF3

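In day-to-day data processing, the conversion you run into most often is probably GenBank to GFF3. The recommended script is genbank2gff3.pl: it is an official script, it is fast and flexible, and its output is fairly standard. Be sure to update to the latest version. Earlier versions used the gene name as the ID identifier, which causes a fairly serious problem: GBrowse displays genes with repeated names incorrectly, treating them all as the same gene. Usage: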

genbank2gff3.pl [options] filename(s)
   Options:
        --dir     -d  path to a list of genbank flatfiles (the directory holding the GenBank files)
        --outdir  -o  location to write GFF files (the directory for the converted GFF3 files)
        --zip     -z  compress GFF3 output files with gzip
        --summary -s  print a summary of the features in each contig
        --filter  -x  genbank feature type(s) to ignore
        --split   -y  split output to separate GFF and fasta files for
                      each genbank record
        --nolump  -n  separate file for each reference sequence
                      (default is to lump all records together into one
                       output file for each input file)
        --ethresh -e  error threshold for unflattener
                      set this high (>2) to ignore all unflattener errors
        --help    -h  display this message
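
The duplicate-ID problem mentioned above can be checked for after conversion. The following is a minimal Python sketch, not part of genbank2gff3.pl: the function name `duplicate_ids` and the sample lines are made up for illustration, and it assumes the output is tab-separated GFF3 with an `ID=` key in the ninth column. Only `gene` features are counted, because GFF3 allows multi-part features such as a CDS to repeat an ID legitimately.

```python
import re
from collections import Counter

# Matches the ID attribute in the ninth GFF3 column (attributes are ';'-separated).
ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def duplicate_ids(gff3_lines, feature_type="gene"):
    """Return {ID: count} for IDs used by more than one feature of the given type."""
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence data follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature_type:
            continue
        match = ID_PATTERN.search(columns[8])
        if match:
            counts[match.group(1)] += 1
    return {feature_id: n for feature_id, n in counts.items() if n > 1}


# Hypothetical sample: two different loci that share the gene name "dnaA".
sample = [
    "##gff-version 3\n",
    "chr1\tGenBank\tgene\t100\t900\t.\t+\t.\tID=dnaA\n",
    "chr1\tGenBank\tgene\t5000\t5800\t.\t-\t.\tID=dnaA\n",
    "chr1\tGenBank\tgene\t7000\t7600\t.\t+\t.\tID=recA;Name=recA\n",
]
print(duplicate_ids(sample))
```

To check a real file, pass an open file handle instead of the sample list, for example `duplicate_ids(open("output.gff"))`, where the file name is a placeholder.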

The GFF Format Explained

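The GFF format was defined by the Sanger Institute. It is a simple, convenient data format for describing features of DNA, RNA, and protein sequences, for example which stretch of a sequence is a gene. It has become the common format for sequence annotation, such as gene prediction on genomes, and many programs accept GFF as input or produce it as output. The latest version of the format definition is version 3. For the original definition, see the SONG website.
GFF is a plain-text file in which each line consists of 9 tab-separated columns.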
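
The nine-column layout can be illustrated with a small parser. This is a minimal Python sketch rather than anything from the original post: the column names follow the GFF3 specification (seqid, source, type, start, end, score, strand, phase, attributes), while `GffRecord` and `parse_gff3_line` are hypothetical names chosen here. It assumes GFF3-style `key=value` attributes and does not decode percent-escaped characters.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqid: str              # column 1: sequence the feature lies on
    source: str             # column 2: program or database that produced it
    type: str               # column 3: feature type, e.g. gene or CDS
    start: int              # column 4: 1-based start coordinate
    end: int                # column 5: end coordinate (inclusive)
    score: Optional[float]  # column 6: score, "." when absent
    strand: str             # column 7: "+", "-", "." or "?"
    phase: Optional[int]    # column 8: 0, 1 or 2 for CDS features, "." otherwise
    attributes: dict        # column 9: ';'-separated key=value pairs


def parse_gff3_line(line):
    """Split one feature line into its nine named columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = columns
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return GffRecord(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


# Hypothetical feature line, for illustration only.
record = parse_gff3_line("chr1\tGenBank\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=dnaA\n")
print(record.type, record.start, record.end, record.attributes["Name"])
```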

The CGView Server: a comparative genomics tool for circular genomes.

Sample output from the CGView Server

The CGView Server generates graphical maps of circular genomes that show sequence features, base composition plots, analysis results and sequence similarity plots. Sequences can be supplied in raw, FASTA, GenBank or EMBL format. Additional feature or analysis information can be submitted in the form of GFF (General Feature Format) files. The server uses BLAST to compare the primary sequence to up to three comparison genomes or sequence sets. The BLAST results and feature information are converted to a graphical map showing the entire sequence, or an expanded and more detailed view of a region of interest. Several options are included to control which types of features are displayed and how the features are drawn. The CGView Server can be used to visualize features associated with any bacterial, plasmid, chloroplast or mitochondrial genome, and can aid in the identification of conserved genome segments, instances of horizontal gene transfer, and differences in gene copy number. Because a collection of sequences can be used in place of a comparison genome, maps can also be used to visualize regions of a known genome covered by newly obtained sequence reads. The CGView Server can be accessed at http://stothard.afns.ualberta.ca/cgview_server/
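In short, CGView is a drawing tool that produces circular genome maps showing sequence features, base composition plots, analysis results, and sequence similarity. The sequence can be supplied in raw, FASTA, GenBank, or EMBL format, and extra feature or analysis information goes in a submitted GFF file. The server uses BLAST to compare the sequence against up to three comparison genomes or sequence sets, then draws the BLAST results and features on a map of the whole sequence or of an expanded region of interest. Options control which features are shown and how they are drawn. The maps work for any bacterial, plasmid, chloroplast, or mitochondrial genome and help identify conserved genome segments, horizontal gene transfer, and differences in gene copy number. Because a collection of sequences can stand in for a comparison genome, the maps can also show which regions of a known genome are covered by newly obtained sequence reads. The CGView Server is available at http://stothard.afns.ualberta.ca/cgview_server/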

Circos: An information aesthetic for comparative genomics.

We created a visualization tool, called Circos, to facilitate the identification and analysis of similarities and differences arising from comparisons of genomes. Our tool is effective in displaying variation in genome structure and, generally, any other kind of positional relationships between genomic intervals. Such data are routinely produced by sequence alignments, hybridization arrays, genome mapping, and genotyping studies. Circos uses a circular ideogram layout to facilitate the display of relationships between pairs of positions by the use of ribbons, which encode the position, size, and orientation of related genomic elements. Circos is capable of displaying data as scatter, line and histogram plots, heat maps, tiles, connectors and text. Bitmap or vector images can be created from GFF-style data inputs and hierarchical configuration files, which can be easily generated by automated tools, making Circos suitable for rapid deployment in data analysis and reporting pipelines.