The GSCANDB software
GSCANDB is a portable system for managing and viewing genome scan data.
This page describes the GSCANDB software.
We use GSCANDB for a number of projects, including our large-scale QTL-mapping project in heterogeneous stock mice and GWA projects involving several hundred thousand SNPs, such as Gabriel. A user's guide to the GSCANDB web interface is available here.
GSCANDB comprises a MySQL backend server and a Perl-CGI interface, with a number of Perl scripts used for uploading and managing data. We run our version of GSCANDB on a 64-bit Debian Linux system using Perl v5.8.8 and MySQL version 14.12. To use GSCANDB you will therefore need:
- A Linux server running instances of MySQL and Apache, or their equivalents on another OS. The Perl modules GD, CGI and GD::SVG are required. In what follows, the name of this server is denoted by $server.
- A MySQL account with privileges to create, drop and alter databases. In what follows, this user name is denoted by $user, with password $passwd.
- For creating and populating the GSCAN database, Java (JDK 1.6) and Ant must be installed.
- Biomart...
- Download and unzip the file: gscandb_v1.zip
- Go to the gscandb_v1 directory
cd gscandb_v1
- Ask your web server administrator to create a web alias for the gscandb_v1 directory.
- Configure the database.properties file (located in the etc sub-directory of gscandb_v1) by replacing xxx with your username and password:
username=xxx
password=xxx
If required, edit some or all of the following parameter settings, to suit your environment:
server=localhost
systemURL=http://localhost
baseURL=http://localhost/yourGScanWebAlias
# Temporary directory for images generated by GScan
tmpdir=/tmp/www-data/
biomart=/yourPathToTheBioMartDirectory/biomart/lib
- Create the database and build the tables by issuing the command
ant
from within the gscandb_v1 directory.
- Check that the web server works by pointing your browser at the URL:
http://localhost/yourGScanWebAlias/wwwqtl.cgi
- This should generate a web page with the header 'Genome Scan Viewer', similar to that on our GSCANDB web site http://mus.well.ox.ac.uk/gscandb, except that the scrolling lists and pull-down menus will be empty.
- The database will need to be populated with your data. We provide some example data from our mouse QTL mapping experiment in the compressed archive gscandb.examples.zip. Download and unpack it, preferably into a directory other than the gscandb directory. The directory contains comma-separated files, whose format and content are described in the Input files section below.
- GSCANDB can be populated using different arguments, depending on whether it is being populated for the first time or whether data are being updated or added to the database.
- Option A: Populate the database for the first time.
sh gscanLoad.sh dir=/data/infiles marker=marker.csv
is the same as writing:
sh gscanLoad.sh marker=/data/infiles/marker.csv
The arguments marker, sample, genotype and gscan indicate the type of the input file. See the Input files section for a further description of the input files.
- Option B: Update and add data to the database.
sh gscanLoad.sh dir=/gscan/gscandb/csvInfiles marker=marker.csv update
To see all the available command line options for populating the database, issue the command:
sh gscanLoad.sh help
See section Examples for more examples on how to populate gscandb using gscanLoad.sh.
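The equivalence between the dir= form and the full-path form of the gscanLoad.sh arguments can be illustrated with a small sketch. It is written in Python purely for illustration and is not how gscanLoad.sh is implemented; the function name resolve_args is hypothetical, and only the argument names mentioned above (dir, marker, sample, genotype, gscan, update) are handled.

```python
import posixpath

# File-type arguments named in the text above.
FILE_ARGS = ("marker", "sample", "genotype", "gscan")


def resolve_args(argv):
    """Return ({file_type: path}, update_flag) for gscanLoad-style arguments."""
    base_dir = None
    files = {}
    update = False
    for arg in argv:
        if arg == "update":
            update = True
        elif "=" in arg:
            key, value = arg.split("=", 1)
            if key == "dir":
                base_dir = value
            elif key in FILE_ARGS:
                files[key] = value
            else:
                raise ValueError(f"unknown argument: {key}")
        else:
            raise ValueError(f"unrecognised argument: {arg}")
    if base_dir is not None:
        # posixpath.join leaves absolute file paths untouched.
        files = {k: posixpath.join(base_dir, v) for k, v in files.items()}
    return files, update


print(resolve_args(["dir=/data/infiles", "marker=marker.csv"]))
print(resolve_args(["marker=/data/infiles/marker.csv"]))
```

Both calls resolve the marker file to the same path, which mirrors the two equivalent command lines shown above.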
All input files should be comma-separated.
- marker.csv, containing basic marker information
- marker_mapping.csv, containing positions of the markers on genome builds
- sample.csv, containing information about samples (individuals with genotypes)
- genotype.csv, containing the genotypes of the markers on the samples
- hapmap.csv, containing haplotype map information for the markers
- files named Biochem.ALP.chr*.scan, containing genome-scan data for one phenotype across 20 chromosomes, in a special format described below
- threshold.csv, containing significance threshold information for genome scans
The headers for the CSV files are as follows. Null fields should be entered as ",,". Fields in bold cannot be null.
| TABLE NAME | FIELD NAMES |
| genome_build | name, date, species, comments, ensembl, ensembldb, ensemblspecies, liftover |
| phenotype | name, description, public_name |
| population | name, species, size, comments |
| marker | name, marker_type, leftseq, rightseq, alias |
| marker_mapping | marker, genome_build, chromosome, bp_position, strand, cm |
| trait_locus | name, population, genome_scan, subscan_label, phenotype, marker1, marker2, species, chromosome, start_bp, end_bp, threshold, score, peak, label, comment, url |
| sample | name, gender, notes |
| genotype | marker, sample, genotype |
| hapblock | genome_build, chromosome, marker_start, marker_end, info |
| chromosome | name, genome_build, length |
Most of the fields in the tables are self-explanatory, but in detail:
- genome_build contains basic information about a genome assembly against which the genome scan data are to be plotted and the genome annotations matched. At present the only information required is the name of the genome build; the other ensembl-related fields are unused but will be used in later releases. The liftover field is used to lift genome scans defined on one genome build onto another build (described below).
- phenotype describes a phenotype. The most important field other than the name is public_name, which is the text displayed in the interface. If the option "public.version" => 1 is set in qtlOptions.pl, then only phenotypes with public names are displayed. This mechanism allows public and development versions to be run off the same database.
- population defines a mapping population.
- marker contains information about a marker. Note that in GSCANDB a marker is a unique location on the genome.
- marker_mapping contains the locations of markers on genome builds.
- trait_locus contains information about QTL. A QTL can be defined in two ways.
- If the QTL is associated with a genome subscan, i.e. it corresponds to some region in the scan that is likely to contain a functional variant, then both the genome_scan and subscan_label must be defined appropriately. The range of the QTL is then defined in terms of markers (marker1 to marker2), rather than by base-pair position (chromosome, start_bp, end_bp). The advantage of the marker-based method is that it is possible to lift over between genome builds.
- species defines species.
- sample defines individuals with genotypes.
- genotype defines genotypes, each of which connects a sample to a marker.
- hapblock contains special information about the locations of haplotype blocks.
- chromosome defines chromosome lengths.
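As a rough illustration of the input conventions described above (a header line of field names, and null fields entered as ",,"), the following Python sketch checks a header against the field list for a table and turns empty fields into None. It is not part of GSCANDB; the function name read_table and the sample marker row are hypothetical, and only two of the tables are included.

```python
import csv
import io

# Header fields copied from the table above for two of the tables.
EXPECTED_HEADERS = {
    "marker": ["name", "marker_type", "leftseq", "rightseq", "alias"],
    "genotype": ["marker", "sample", "genotype"],
}


def read_table(table, text):
    """Parse CSV text for `table`; empty fields (",,") become None."""
    reader = csv.reader(io.StringIO(text))
    header = [field.strip() for field in next(reader)]
    if header != EXPECTED_HEADERS[table]:
        raise ValueError(f"unexpected header for {table}: {header}")
    rows = []
    for record in reader:
        if len(record) != len(header):
            raise ValueError(f"wrong number of fields: {record}")
        rows.append({k: (v if v != "" else None) for k, v in zip(header, record)})
    return rows


# Hypothetical marker row with three null fields.
sample_text = "name,marker_type,leftseq,rightseq,alias\nrs123,SNP,,,\n"
print(read_table("marker", sample_text))
```

Running a check of this kind before loading can catch a missing or misspelt column early, though which fields may be null is decided by the database schema, not by this sketch.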
In GSCANDB a genome scan is associated with a mapping population, phenotype and genome build. Each scan contains one or more named subscans. A subscan is a series of quantitative measurements along the genome, where each measurement is associated with a marker or marker interval. The subscan mechanism is useful for storing different analyses of the same underlying data. For example, we analyse all our phenotypes in at least four ways, looking for singlepoint additive and dominance effects and multipoint additive and dominance effects. Note that the marker order of the data depends on the genome build, and is therefore defined by uploading marker_mapping files. Although genome scan files may contain positional information, this is ignored.
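The relationship between scans, subscans and marker order can be sketched as a small data model. This is an illustrative Python sketch, not the GSCANDB schema or its Perl API: the class and function names are hypothetical, measurements are keyed by marker name only (marker intervals are omitted), and marker positions are supplied separately to reflect that order comes from the marker mappings, not from the scan file.

```python
from dataclasses import dataclass, field


@dataclass
class Subscan:
    label: str
    # marker name -> measurement; positions are NOT stored with the scan
    scores: dict = field(default_factory=dict)


@dataclass
class GenomeScan:
    population: str
    phenotype: str
    genome_build: str
    subscans: dict = field(default_factory=dict)

    def add_score(self, subscan_label, marker, score):
        """Record one measurement in the named subscan, creating it if needed."""
        sub = self.subscans.setdefault(subscan_label, Subscan(subscan_label))
        sub.scores[marker] = score


def ordered_scores(scan, subscan_label, marker_positions):
    """Order a subscan's scores by positions taken from a marker mapping.

    marker_positions maps marker name -> bp position on one chromosome.
    """
    sub = scan.subscans[subscan_label]
    markers = sorted(sub.scores, key=lambda m: marker_positions[m])
    return [(m, sub.scores[m]) for m in markers]


scan = GenomeScan("HS", "EMO", "34")
scan.add_score("additive", "m2", 3.5)
scan.add_score("additive", "m1", 1.2)
print(ordered_scores(scan, "additive", {"m1": 100, "m2": 200}))
```

Changing the marker positions changes the order of the returned scores without touching the scan itself, which is the property the text above relies on.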
Genome Scan input files have two accepted formats:
Examples will be shown here
GSCANDB can lift the scan data from one genome build onto another. This means that if genome scans are performed using one genome build, the results can be projected onto another (usually more recent) build in order to take advantage of improvements to genome annotations. This is done in a virtual manner by setting the liftover field in the table genome_build to point to the genome_build_id of the build that actually contains the scan data. For the liftover to work, the marker mappings must be defined for both genome builds. As an example, in our live database, all scans were computed against build 34 of the mouse genome. In order to display them against build 36, the following rows of the table genome_build are defined:
+-----------------+------+------------+--------------+----------+-----------------------+-----------------+----------------+----------+
| genome_build_id | name | date | species | comments | ensembl | ensembldb | ensemblspecies | liftover |
+-----------------+------+------------+--------------+----------+-----------------------+-----------------+----------------+----------+
| 4 | 34 | | Mus musculus | | | | Mus_musculus | NULL |
| 36 | 36 | | Mus musculus | | | | Mus_musculus | 4 |
+-----------------+------+------------+--------------+----------+-----------------------+-----------------+----------------+----------+
Genome liftover is only appropriate when either the genome scans are singlepoint and therefore independent of marker order, or they are multipoint and the differences between genome builds mainly leave the marker order unchanged while changing the underlying genome annotation.
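The virtual liftover can be pictured as a pointer lookup followed by re-positioning of marker-indexed scores. The Python sketch below is an illustration only, not the GSCANDB implementation: the function names are hypothetical, following a chain of liftover pointers is an assumption (the example above has a single hop), and dropping markers that have no mapping on the target build is also an assumption.

```python
# Rows from the genome_build example above: genome_build_id -> liftover target.
LIFTOVER = {4: None, 36: 4}


def scan_source_build(build_id, liftover=LIFTOVER):
    """Follow liftover pointers to the build that actually holds the scan data."""
    seen = set()
    while liftover[build_id] is not None:
        if build_id in seen:
            raise ValueError("liftover pointers form a cycle")
        seen.add(build_id)
        build_id = liftover[build_id]
    return build_id


def lift_scores(scores, target_positions):
    """Place marker-indexed scores at the target build's marker positions.

    scores: marker name -> score (computed on the source build).
    target_positions: marker name -> bp position on the target build.
    Markers without a mapping on the target build are dropped (assumption).
    """
    return sorted(
        (target_positions[m], s) for m, s in scores.items() if m in target_positions
    )


print(scan_source_build(36))
print(lift_scores({"m1": 2.0, "m2": 5.0}, {"m1": 150, "m2": 120}))
```

Because the scores are attached to markers and not to coordinates, only the marker mapping for the display build is needed to redraw them.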
GSCANDB fetches its genome annotations from EnsMart. You can use the Perl script CopyEnsembl.pl to make a local MySQL database containing a copy of the relevant tables. You then need to edit qtlOptions.pl so that the correct version of EnsMart is associated with each genome build. Querying a local copy of EnsMart should be faster than making remote queries to Ensembl. A typical invocation of the script is:
CopyEnsembl.pl -ensembl ensembldb.ensembl.org -user anonymous -mart ensembl_mart_41 -localdb gscandb \
-localmart ensembl_mart_41 -create -drop -host gscan -local_user root
This will create a local copy of part of ensembl_mart_41 (the tables are listed inside the script). It will first drop any local copy of the database, then create it, and then copy the tables into it. The script takes a minute or so to run. Note that by default the script runs locally as the MySQL root user, and assumes that the root password is in your ~/.my.cnf file (otherwise you will be prompted for your password many times).
Next, in order for your local copy of GSCANDB to access the local EnsMart, edit qtlOptions.pl: there is a hash table defining the EnsMart connection details for each genome build (in this example the builds are named "33", "34", "36"), so that different builds can be associated with different genome annotations.
'ensembl' => {
    "33" => { # mouse build 33
        assembly => '1',
        dbi      => "",
        user     => $martuser,
        passwd   => $martpasswd,
    },
    "34" => { # mouse build 34
        assembly => '4',
        dbi      => "DBI:mysql:ensembl_mart_37:gscan",
        user     => $martuser,
        passwd   => $martpasswd,
    },
    "36" => { # mouse build 36
        assembly => '36',
        dbi      => "DBI:mysql:ensembl_mart_41:gscan",
        user     => $martuser,
        passwd   => $martpasswd,
    },
},
You will need to edit these as appropriate. The value of "assembly" is equal to the value of genome_build_id in the GSCANDB table genome_build. The "dbi" connection string should connect to your local copy of EnsMart. The user and passwd are to log into the local EnsMart: we set these to the variables $martuser and $martpasswd in qtlOptions.pl, but you could configure your system so that these are the same as $user and $passwd.
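To make the structure of these entries concrete, the Python sketch below mirrors the hash and splits each "dbi" string into its parts. It is only an illustration, since the real configuration is the Perl hash in qtlOptions.pl. The names ENSEMBL, parse_dbi and mart_for_build are hypothetical, and two readings are assumptions: that the third and fourth colon-separated fields are the database name and host, and that an empty string means no EnsMart is configured for that build.

```python
# Mirror of the 'ensembl' hash above (user/passwd omitted).
ENSEMBL = {
    "33": {"assembly": "1", "dbi": ""},
    "34": {"assembly": "4", "dbi": "DBI:mysql:ensembl_mart_37:gscan"},
    "36": {"assembly": "36", "dbi": "DBI:mysql:ensembl_mart_41:gscan"},
}


def parse_dbi(dsn):
    """Split 'DBI:<driver>:<database>[:<host>]'; return None for an empty string."""
    if not dsn:
        return None
    parts = dsn.split(":")
    if len(parts) < 3 or parts[0].upper() != "DBI":
        raise ValueError(f"not a DBI connection string: {dsn}")
    return {
        "driver": parts[1],
        "database": parts[2],
        "host": parts[3] if len(parts) > 3 else None,
    }


def mart_for_build(name):
    """Return (genome_build_id, connection details) for a named build."""
    entry = ENSEMBL[name]
    return entry["assembly"], parse_dbi(entry["dbi"])


print(mart_for_build("36"))
print(mart_for_build("33"))
```

A check like this makes it easy to see which builds point at which local mart before the web interface is started.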
In qtlOptions.pl several options control whether the web site is password-protected:
"public.version" => 1, # Determines whether to use public (1) or private (0) data by default.
# A setting of 1 can be overridden by providing a valid password.
"valid.passwords" => { "HS" => "XXXXX" },
"project.urls" => { "HS" => 0 },
Edit public.version to be 0 if you wish to enable password protection. Then edit the passwords and projects as appropriate.
GSCANDB has a password authentication page, qtlAuth.cgi. Typing a valid password into it generates a single-session cookie (i.e. this will need to be repeated each time the browser is restarted, but will last for the entire session the browser is open). Setting a cookie and instructing a redirect to wwwqtl.cgi at the same time appears to cause problems for some browsers, so currently the page provides a link to the GSCANDB viewer.
If a user wants to move from the private view into the public view, they can point their browser at qtlAuth.cgi and it will clear the cookie. This system seems stable and is extensible (in principle there could be multiple distinct private views, though at the moment this would require a change in the database schema).
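One reading of how the public.version option, the password list and the public_name field interact is sketched below in Python. This is an interpretation of the option comments, not the code in qtlAuth.cgi or wwwqtl.cgi: the function names are hypothetical, cookies are left out entirely, and the phenotype names and public name in the usage lines are made up for illustration.

```python
def use_private_view(options, password=None):
    """Decide whether private data are shown.

    Assumed reading: public.version 0 means private data by default;
    public.version 1 means public data unless a valid password is given.
    """
    if options["public.version"] == 0:
        return True
    return password is not None and password in options["valid.passwords"].values()


def visible_phenotypes(phenotypes, private_view):
    """phenotypes: list of (name, public_name) pairs; public_name may be None.

    In the public view only phenotypes with a public name are listed.
    """
    if private_view:
        return [name for name, _ in phenotypes]
    return [name for name, public_name in phenotypes if public_name]


options = {"public.version": 1, "valid.passwords": {"HS": "XXXXX"}}
phenotypes = [("EMO", "Emotionality"), ("DEV.test", None)]  # hypothetical rows
print(visible_phenotypes(phenotypes, use_private_view(options)))
print(visible_phenotypes(phenotypes, use_private_view(options, "XXXXX")))
```

Keeping the decision in one place like this is what lets public and development views run off the same database.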
The file qtlInclude.pl contains HTML code for the headers and margins used in the GSCANDB interface; these can be edited to change the layout, titles etc.
The file qtlOptions.pl contains editable options controlling the appearance of the genome scan plots, e.g. the default scaling between base pairs and pixels, and the colours associated with different scan types (this mechanism will be improved in a later version).
- Publication-quality plots. The Perl script drawQTL.pl can be used to generate publication-quality SVG (or screen-dump-quality PNG) genome scan plots like those produced in the region view of GSCANDB, but using scalable vector graphics. The SVG files can then be edited or converted into PDF using CorelDRAW or a similar package. Internally it uses the Perl GD::SVG library. The resulting SVG plots are slightly different from the PNG screen images produced by GSCANDB, principally in the use of fonts and colours. The usage is:
drawQTL.pl -phenotype EMO -subscans additive,full -population HS -chromosome 1 \
-from 143000000 -to 146000000 -build 34 -out EMO.svg -format SVG -width 1000 -height 200
This will create a file called EMO.svg containing a plot of the additive and full subscans for the phenotype EMO on chromosome 1 between 143 and 146 Mb, relative to build 34. The width and height of the image are in pixel units. You can create PNG format files by specifying -format PNG.
- The Perl API for GSCANDB. The functions in the file qtlDB.pl may be used to access the GSCANDB data in a form suitable for analysis. More documentation soon.
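When many plots are needed, the drawQTL.pl command line shown above can be assembled programmatically. The Python sketch below only builds the argument list, using the options that appear in the example and no others. The helper name draw_qtl_command is hypothetical, and running the result assumes drawQTL.pl is executable and on your PATH.

```python
def draw_qtl_command(phenotype, subscans, population, chromosome,
                     start, end, build, out,
                     fmt="SVG", width=1000, height=200):
    """Build an argument list matching the drawQTL.pl example above."""
    return [
        "drawQTL.pl",
        "-phenotype", phenotype,
        "-subscans", ",".join(subscans),
        "-population", population,
        "-chromosome", str(chromosome),
        "-from", str(start),
        "-to", str(end),
        "-build", str(build),
        "-out", out,
        "-format", fmt,
        "-width", str(width),
        "-height", str(height),
    ]


cmd = draw_qtl_command("EMO", ["additive", "full"], "HS", 1,
                       143000000, 146000000, 34, "EMO.svg")
print(" ".join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) to run the script, for example inside a loop over phenotypes or regions.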
The GSCANDB software package is distributed under the GNU General Public License:
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
The GSCANDB system is funded through grants from the Wellcome Trust and the European Commission. The EU contribution comes via the BioSapiens project, which is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2003-503265. It is administered and led by the European Bioinformatics Institute. See the EBI's Press Release.
Contact Richard Mott, Jonathan Flint or William Valdar for more details.