Wellcome Trust Centre for Human Genetics

The GSCANDB software

GSCANDB is a portable system for managing and viewing genome scan data.

This page describes the GSCANDB software.

We use GSCANDB for a number of projects, including our large-scale QTL-mapping project in heterogeneous stock mice and GWA projects involving several hundred thousand SNPs like Gabriel. A user's guide to the GSCANDB web interface is available here.

System Requirements

GSCANDB comprises a MySQL backend server and a Perl-CGI interface, with a number of Perl scripts used for uploading and managing data. We run our version of GSCANDB on a 64-bit Debian Linux system using Perl v5.8.8 and MySQL version 14.12. To use GSCANDB you will therefore need:

  1. A Linux server running instances of MySQL and Apache, or their equivalents on another OS. The Perl modules GD, CGI and GD::SVG are required. In what follows the name of this server will be denoted by $server.
  2. A MySQL account with privileges to create, drop and alter databases. In what follows this user name will be denoted as $user, with password $passwd.
  3. For creating and populating the GSCAN database, Java (JDK 1.6) and Ant have to be installed.
  4. Biomart...

Download, configuration and installation

  1. Download and unzip the file: gscandb_v1.zip
  2. Go to the gscandb_v1 directory
    cd gscandb_v1
  3. Ask your web server administrator to create a web alias for the gscandb_v1 directory.
  4. Configure the database.properties file (located in the etc sub-directory of gscandb_v1) by replacing xxx with your username and password:
    •  username=xxx  
    •  password=xxx  
      If required, edit some or all of the following parameter settings, to suit your environment:
    •  server=localhost 
    •  systemURL=http://localhost
    •  baseURL=http://localhost/yourGScanWebAlias 
    •  tmpdir=/tmp/www-data/   This is the temporary directory for images generated by GScan
    •  biomart=/yourPathToTheBioMartDirectory/biomart/lib
      	
  5. Create the database and build the tables by issuing the command
     ant
    from within the gscandb_v1 directory.
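
Step 4 can be scripted. The following is a minimal sketch, assuming that database.properties still contains the literal placeholder lines username=xxx and password=xxx as shipped. The function name set_db_credentials is hypothetical and is not part of GSCANDB.

```sh
# Hypothetical helper, not part of GSCANDB: fill in the username and
# password placeholders in a database.properties file.
# Usage: set_db_credentials FILE USER PASSWD
set_db_credentials() {
    file=$1 user=$2 passwd=$3
    tmp=$(mktemp) || return 1
    # Replace only the two placeholder lines; all other settings are kept.
    # Assumes the values contain no '|', '&' or '\' characters (special to sed).
    sed -e "s|^username=xxx[[:space:]]*\$|username=$user|" \
        -e "s|^password=xxx[[:space:]]*\$|password=$passwd|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

From within the gscandb_v1 directory this might be called as set_db_credentials etc/database.properties "$user" "$passwd" before running ant.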

The Genome Scan Viewer

  1. Check that the web server works by pointing your browser at the URL:
    http://localhost/yourGScanWebAlias/wwwqtl.cgi
  2. This should generate a web page with the header 'Genome Scan Viewer', similar to that on our GSCANDB web site http://mus.well.ox.ac.uk/gscandb, except that the scrolling lists and pulldown menus will be empty.
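
The same check can be made from the command line. This is only a sketch: has_viewer_header is a hypothetical function name, and the usage line assumes that curl is installed on the machine you test from.

```sh
# Hypothetical check, not part of GSCANDB: read an HTML page on standard
# input and succeed if it contains the expected page header text.
has_viewer_header() {
    grep -q 'Genome Scan Viewer'
}

# Example usage (requires curl and a running web server):
#   curl -s http://localhost/yourGScanWebAlias/wwwqtl.cgi | has_viewer_header \
#       && echo "Genome Scan Viewer is reachable"
```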

Uploading data

  1. The database will need to be populated with your data. We provide some example data from our mouse QTL mapping experiment in the zip archive gscandb.examples.zip. Download and unpack it, preferably into a directory other than the gscandb directory. The archive contains comma-separated files, whose format and content are described in the Input files section below.
  2. GSCANDB can be populated using different arguments, depending on whether it is being populated for the first time or whether data is being updated or added to the database.

    • Option A: Initial data upload (populate an empty database)
      sh gscanLoad.sh dir=/data/infiles/ marker=marker.csv sample=sample.csv genotype=genotype.csv 
    • The command line argument dir specifies the directory containing the input files (marker.csv, sample.csv and genotype.csv). This argument can be omitted if the full path to the input file is provided instead, i.e.
      sh gscanLoad.sh dir=/data/infiles marker=marker.csv
      is the same as writing:
      sh gscanLoad.sh marker=/data/infiles/marker.csv
      Arguments marker, sample, genotype and gscan indicate the type of the input file. See section Input files for further description of the input files.

    • Option B: Update and add data to the database.
      sh gscanLoad.sh dir=/gscan/gscandb/csvInfiles marker=marker.csv update
    • To see all the available command line options for populating the database, issue the command:
      sh gscanLoad.sh help
      See section Examples for more examples on how to populate gscandb using gscanLoad.sh.
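
Before an upload it can help to confirm that the input files are where the dir argument says they are. The following is a minimal sketch; check_infiles is a hypothetical helper and is not part of the GSCANDB distribution.

```sh
# Hypothetical pre-flight check, not part of GSCANDB: succeed only if every
# named file exists and is readable inside the given directory.
# Usage: check_infiles DIR FILE...
check_infiles() {
    dir=$1
    shift
    for f in "$@"; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing or unreadable: $dir/$f" >&2
            return 1
        fi
    done
}

# Example usage with the Option A command shown above:
#   check_infiles /data/infiles marker.csv sample.csv genotype.csv &&
#       sh gscanLoad.sh dir=/data/infiles/ marker=marker.csv sample=sample.csv genotype=genotype.csv
```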

Input files

    All the input files should be comma-separated.
  1. marker.csv
    containing basic marker information
  2. marker_mapping.csv
    containing positions of the markers on genome builds
  3. sample.csv
    containing information about samples (individuals with genotypes)
  4. genotype.csv
    containing the genotypes of the markers on the samples
  5. hapmap.csv
    containing haplotype map information for the markers
  6. files named
    Biochem.ALP.chr*.scan
    containing genome-scan data for one phenotype across 20 chromosomes in a special format described below.
  7. threshold.csv
    containing significance threshold information for genome scans
  • The headers for the csv files are as follows. Null fields should be entered as ",,". Fields in bold cannot be null.
    TABLE NAME       FIELD NAMES
    genome_build     name, date, species, comments, ensembl, ensembldb, ensemblspecies, liftover
    phenotype        name, description, public_name
    population       name, species, size, comments
    marker           name, marker_type, leftseq, rightseq, alias
    marker_mapping   marker, genome_build, chromosome, bp_position, strand, cm
    trait_locus      name, population, genome_scan, subscan_label, phenotype, marker1, marker2, species, chromosome, start_bp, end_bp, threshold, score, peak, label, comment, url
    sample           name, gender, notes
    genotype         marker, sample, genotype
    hapblock         genome_build, chromosome, marker_start, marker_end, info
    chromosome       name, genome_build, length
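
As an illustration of the header and null-field conventions, the sketch below writes a small marker.csv and checks that every row has as many fields as the header. The marker names, the marker_type value "SNP", the sequences and the alias are invented, the helper names are hypothetical, and the header line is assumed to list the field names without spaces.

```sh
# Hypothetical example data: all values below are invented.
# Null fields are left empty, so the first data row ends in ",,,".
write_example_marker_csv() {
    cat > "$1" <<'EOF'
name,marker_type,leftseq,rightseq,alias
markerA,SNP,,,
markerB,SNP,ACGT,TTGA,aliasB
EOF
}

# Succeed if every line has as many comma-separated fields as the header.
# Assumes that no field contains a quoted comma.
check_csv_fields() {
    awk -F, 'BEGIN { bad = 0 }
             NR == 1 { n = NF }
             NF != n { bad = 1 }
             END { exit bad }' "$1"
}
```

A row with a missing trailing comma is a common cause of shifted columns, so running such a check before an upload is cheap insurance.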

  • Most of the fields in the tables are self-explanatory, but in detail:
    • genome_build contains basic information about a genome assembly against which the genome scan data are to be plotted and the genome annotations matched. At present the only information which is required is the name of the genome build; the other ensembl-related fields are unused but will be used in later releases. The liftover field is used to lift genome scans defined on one genome build onto another build (described below).
    • phenotype describes a phenotype. The most important field other than the name is public_name, which is the text displayed in the interface. If the option "public.version" => 1 is set in qtlOptions.pl then only phenotypes with public names are displayed. This mechanism is used so that public and development versions can be run off the same database.
    • population defines a mapping population.
    • marker contains information about a marker. Note that in GSCANDB a marker is a unique location on the genome.
    • marker_mapping contains the locations of markers on genome builds
    • trait_locus contains information about QTL. A QTL can be defined in two ways.
    • If the QTL is associated with a genome subscan, i.e. corresponds to some region in the scan that is likely to contain a functional variant, then both the genome_scan and subscan_label must be defined appropriately, and the range of the QTL is defined in terms of markers (marker1 to marker2) rather than by base-pair position (chromosome, start_bp, end_bp). The advantage of the marker-based method is that it is possible to lift over between genome builds.
    • species defines species.
    • sample defines individuals with genotypes
    • genotype defines genotypes, each of which connects a sample to a marker.
    • hapblock contains special information about the locations of haplotype blocks.
    • chromosome defines chromosome lengths.
  • In GSCANDB a genome scan is associated with a mapping population, phenotype and genome build. Each scan contains one or more named subscans. A subscan is a series of quantitative measurements along the genome, where each measurement is associated with a marker or marker interval. The subscan mechanism is useful for storing different analyses of the same underlying data; for example, we analyse all our phenotypes in at least four ways, looking for singlepoint additive and dominance effects and multipoint additive and dominance effects. Note that the marker order of the data depends on the genome build, and is therefore defined by uploading marker_mapping files. Although genome scan files may contain positional information, this is ignored. Genome scan input files have two accepted formats:
    • tabular files are tab-separated tables with a header line. Additional information (about population, genome build, phenotype, subscan label) is provided by command-line arguments. We recommend you use this format. This format was designed initially for uploading microarray expression data, but can be used for other singlepoint data as well. The header line of the file must contain exactly one column named either "marker", "Transcript" or "gene", and another corresponding to the scan data, and whose name matches the command-line argument -colname. Each row of the file corresponds to data for a specific marker/probe/transcript/gene.
    • scan files have a header section with key-value data defining the phenotype, the mapping population, the genome build, the unit of measurement, the type of plot ("interval" or "point") and the formula used in the analysis (or any other piece of text), followed by a space-separated scan data section. Both single-point and multi-point data are supported. A minimal example is given here. The data section starts after the line BEGIN_SCAN_DATA and ends before the line END_SCAN_DATA. The first row gives the column names. We use this format to upload files produced by our QTL-mapping pipeline written in R.
    • To upload subscans named "additive" and "full" for the phenotype "EMO", population "HS" and genome build "34" from a genome scan file called "EMO.txt" containing columns named "additive" and "full", one would type
      sh gscanLoad.sh gscan=EMO.txt pop=HS build=34 pheno=EMO labels=additive,full
      The command-line argument -colname specifies a comma-separated list of column names to upload, corresponding to the subscan names.
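
To check which columns a scan file offers before choosing subscan names, the data section can be pulled out with awk. This is a minimal sketch that relies only on the BEGIN_SCAN_DATA and END_SCAN_DATA lines described above; the function names are hypothetical and nothing is assumed about the key-value header section.

```sh
# Print the lines strictly between BEGIN_SCAN_DATA and END_SCAN_DATA.
# The END test comes first and the BEGIN test last, so neither delimiter
# line is itself printed.
extract_scan_data() {
    awk '/^END_SCAN_DATA/   { inside = 0 }
         inside             { print }
         /^BEGIN_SCAN_DATA/ { inside = 1 }' "$1"
}

# Print the column names, one per line (the first row of the data section).
scan_columns() {
    extract_scan_data "$1" | awk 'NR == 1 { for (i = 1; i <= NF; i++) print $i }'
}
```

For the example above, scan_columns EMO.txt would be expected to list "additive" and "full" among its output if those columns are present.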

    Examples

      Examples will be shown here

    Genome lift over

    GSCANDB can lift the scan data from one genome build onto another. This means that if genome scans are performed using one genome build then the results can be projected onto another (usually more recent) build in order to take advantage of improvements to genome annotations. This is done in a virtual manner by setting the liftover field in the table genome_build to point to the genome_build_id of the build that actually contains the scan data. For the liftover to work it is necessary to have defined the marker mappings for both genome builds. As an example, in our live database, all scans were computed against build 34 of the mouse genome. In order to display them against build 36, the following lines of the table genome_build are defined:

        +-----------------+------+------------+--------------+----------+-----------------------+-----------------+----------------+----------+
        | genome_build_id | name | date       | species      | comments | ensembl               | ensembldb       | ensemblspecies | liftover |
        +-----------------+------+------------+--------------+----------+-----------------------+-----------------+----------------+----------+
        | 4               | 34   |            | Mus musculus |          |                       |                 | Mus_musculus   | NULL     |
        | 36              | 36   |            | Mus musculus |          |                       |                 | Mus_musculus   | 4        |
        +-----------------+------+------------+--------------+----------+-----------------------+-----------------+----------------+----------+
    

    Genome liftover is only appropriate when the genome scans are either singlepoint, and therefore independent of marker order, or multipoint where the differences between the genome builds mainly leave the marker order unchanged but change the underlying genome annotation.
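
Setting the liftover field amounts to a single SQL update on the genome_build table. The sketch below only prints the statement, so that it can be inspected before being run. The helper name liftover_sql is hypothetical, and the database name gscandb in the usage comment is an assumption.

```sh
# Hypothetical helper: print the SQL that points the liftover field of build
# TARGET_ID at SOURCE_ID, the genome_build_id of the build holding the scans.
# Usage: liftover_sql TARGET_ID SOURCE_ID
liftover_sql() {
    target_id=$1 source_id=$2
    printf 'UPDATE genome_build SET liftover = %d WHERE genome_build_id = %d;\n' \
        "$source_id" "$target_id"
}

# Example usage, reproducing the table above (build id 36 lifted from id 4):
#   liftover_sql 36 4 | mysql -h "$server" -u "$user" -p gscandb
```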

    Making a local copy of EnsMart

    GSCANDB fetches its genome annotations from EnsMart. You can use the Perl script CopyEnsembl.pl to make a local MySQL database containing a copy of the relevant tables. You then need to edit qtlOptions.pl so that the correct version of EnsMart is associated with each genome build. Querying a local copy of EnsMart should be faster than making remote queries to Ensembl. An example invocation showing the script's command-line options is:

        CopyEnsembl.pl -ensembl ensembldb.ensembl.org -user anonymous -mart ensembl_mart_41 -localdb gscandb 
                                                       -localmart ensembl_mart_41 -create -drop -host gscan -local_user root 

    This will create a local copy of part of ensembl_mart_41 (the tables are listed inside the script). It will first drop any local copy of the database, then create it, and then copy the tables into it. The script takes a minute or so to run. Note that by default the script runs locally as the MySQL root user, and assumes that the root password is in your ~/.my.cnf file (otherwise you will be prompted for your password many times).
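
MySQL client programs read per-user options from the [client] section of ~/.my.cnf. The following sketch writes such a file with owner-only permissions. The function name is hypothetical, and it assumes that the password contains no characters that need quoting in an option file (such as '#').

```sh
# Hypothetical helper: write a MySQL client option file holding a user name
# and password, readable only by its owner.
# Usage: write_mysql_client_cnf FILE USER PASSWD
write_mysql_client_cnf() {
    cnf=$1 user=$2 passwd=$3
    # Create or truncate the file and restrict it before the password is written.
    : > "$cnf" && chmod 600 "$cnf" &&
        printf '[client]\nuser=%s\npassword=%s\n' "$user" "$passwd" > "$cnf"
}

# Example usage (this overwrites any existing file):
#   write_mysql_client_cnf "$HOME/.my.cnf" root 'yourRootPassword'
```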

    Next, in order for your local copy of GSCANDB to access the local EnsMart, edit qtlOptions.pl: there is a hash table defining the EnsMart connection details for each genome build (in this example the builds are named "33", "34" and "36") so that different builds can be associated with different genome annotations.

    	  'ensembl' => {
                  "33" => { # mouse build 33
                      assembly=>'1',
                      dbi=> "",
                      user=> $martuser,
                      passwd=> $martpasswd,
                  },
                  "34" => { # mouse build 34
                      assembly=>'4',
                      dbi=> "DBI:mysql:ensembl_mart_37:gscan",
                      user=> $martuser,
                      passwd=> $martpasswd,
                  },
                  "36" => { # mouse build 36
                      assembly=>'36',
                      dbi=> "DBI:mysql:ensembl_mart_41:gscan",
                      user=> $martuser,
                      passwd=> $martpasswd,
                  },
              },
    

    You will need to edit these as appropriate. The value of "assembly" is equal to the value of genome_build_id in the GSCANDB table genome_build. The "dbi" connection string should connect to your local copy of EnsMart. The user and passwd are to log into the local EnsMart: we set these to the variables $martuser and $martpasswd in qtlOptions.pl, but you could configure your system so that these are the same as $user and $passwd.

    Enabling password protection

    In qtlOptions.pl several options control whether the web site is password-protected:

    	  "public.version" => 1, # Determines whether to use public (1) or private (0) data by default.
                                 # A setting of 1 can be overridden by providing a valid password.
    	  "valid.passwords" => { "HS" => "XXXXX" },
    	  "project.urls" => { "HS" => 0 },
    

    Edit public.version to be 0 if you wish to enable password protection. Then edit the passwords and projects as appropriate.

    GSCANDB has a password authentication page, qtlAuth.cgi. Typing a valid password into it generates a single-session cookie (i.e. this will need to be repeated each time the browser is restarted, but will last for as long as the browser stays open). Setting a cookie and instructing a redirect to wwwqtl.cgi at the same time appears to cause problems for some browsers, so currently the page provides a link to the GSCANDB viewer.

    If a user wants to move from the private view into the public view, then they can point their browser at qtlAuth.cgi and it will clear the cookie. This system seems stable and is extensible (in principle there could be multiple distinct private views, though at the moment this would require a change in the database schema).

    Customising the appearance of GSCANDB

    The file qtlInclude.pl contains HTML code for the headers and margins used in the GSCANDB interface; these can be edited to change the layout, titles etc.

    The file qtlOptions.pl contains editable options controlling the appearance of the genome scan plots, e.g. the default scaling between base pairs and pixels, and the colours associated with different scan types (this mechanism will be improved in a later version).

    Extras

    • Publication-quality plots. The Perl script drawQTL.pl can be used to generate publication-quality SVG (or screen-dump-quality PNG) genome scan plots like those produced in the region view of GSCANDB, but using scalable vector graphics. The .svg files can then be edited or converted into PDF using CorelDRAW or a similar package. Internally it uses the Perl GD::SVG library. The resulting SVG plots are slightly different from the PNG screen images produced by GSCANDB, principally in the use of fonts and colours. The usage is:
       drawQTL.pl -phenotype EMO -subscans additive,full -population HS -chromosome 1 \
                      -from 143000000 -to 146000000 -build 34 -out EMO.svg -format SVG -width 1000 -height 200 

      This will create a file called EMO.svg containing a plot of the additive and full subscans for the phenotype EMO on chromosome 1 between 143 and 146 Mb, relative to build 34. The width and height of the image are in pixel units. You can create PNG format files by specifying -format PNG.

    • The Perl API for GSCANDB. The functions in the file qtlDB.pl may be used to access the GSCANDB data in a form suitable for analysis. More documentation soon.....
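
To plot the same region for several phenotypes, the drawQTL.pl usage shown above can be wrapped in a small function. This sketch only prints the commands (a dry run) and uses only the options from the usage example. The function name draw_cmd is hypothetical, as is the assumption that Biochem.ALP is a phenotype name in your database.

```sh
# Hypothetical wrapper: print the drawQTL.pl command for one phenotype,
# using the region and image settings of the usage example above.
draw_cmd() {
    pheno=$1
    printf '%s\n' "drawQTL.pl -phenotype ${pheno} -subscans additive,full -population HS -chromosome 1 -from 143000000 -to 146000000 -build 34 -out ${pheno}.svg -format SVG -width 1000 -height 200"
}

# Example usage: print one command per phenotype; pipe the output to sh to run them.
#   for p in EMO Biochem.ALP; do draw_cmd "$p"; done
```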

    Legal Matters

    The GSCANDB software package is distributed under the GNU General Public License

    This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

    Funding Details

    The GSCANDB system is funded through grants from the Wellcome Trust and the European Commission. The EU contribution comes via the BioSapiens project, which is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2003-503265. It is administered and led by the European Bioinformatics Institute. See the EBI's Press Release.


    Contact Richard Mott, Jonathan Flint or William Valdar for more details.
