Database Import

First, download the pre-built database files that we provide and unpack them. Please make sure that you have enough disk space available: the packed file takes up 31 GB, and unpacking it requires an additional 188 GB.

$ cd /plenty/space
$ wget https://file-public.bihealth.org/transient/varfish/varfish-server-background-db-20201006.tar.gz{,.sha256}
$ sha256sum -c varfish-server-background-db-20201006.tar.gz.sha256
$ tar xzvf varfish-server-background-db-20201006.tar.gz
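
Before starting the download, it can help to confirm that the target directory really has room for both the archive and its unpacked contents. The following is a minimal sketch, not part of VarFish: the helper name `has_enough_space` is hypothetical, and it assumes that the sizes quoted above are decimal gigabytes (10^9 bytes).

```python
import shutil

# Sizes quoted in the text above, in gigabytes (assumed to mean 10**9 bytes).
PACKED_GB = 31
UNPACKED_GB = 188


def has_enough_space(free_bytes, required_gb=PACKED_GB + UNPACKED_GB):
    """Return True if free_bytes covers the archive plus its unpacked contents."""
    return free_bytes >= required_gb * 1000**3


if __name__ == "__main__":
    # "." stands for the download directory, e.g. /plenty/space.
    free = shutil.disk_usage(".").free
    print(f"free: {free / 1000**3:.0f} GB, enough: {has_enough_space(free)}")
```

Run it from the directory you intend to download into. The archive can be deleted after unpacking, so the combined figure is only needed temporarily.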

We recommend excluding the large databases: frequency tables, extra annotations, and dbSNP. Also, keep in mind that importing the whole database takes more than 24 hours, depending on the speed of your disk.

This is a list of the possible imports, sorted by size:

Component            Size   Exclude             Function
gnomAD_genomes       80G    highly recommended  frequency annotation
extra-annos          50G    highly recommended  diverse
dbSNP                32G    highly recommended  SNP annotation
thousand_genomes     6.5G   highly recommended  frequency annotation
gnomAD_exomes        6.0G   highly recommended  frequency annotation
knowngeneaa          4.5G   highly recommended  alignment annotation
clinvar              3.3G   highly recommended  pathogenicity classification
ExAC                 1.9G   highly recommended  frequency annotation
dbVar                573M   recommended         SNP annotation
gnomAD_SV            250M   recommended         SV frequency annotation
ncbi_gene            151M   -                   gene annotation
ensembl_regulatory   77M    -                   frequency annotation
DGV                  43M    -                   SV annotation
hpo                  22M    -                   phenotype information
hgnc                 15M    -                   gene annotation
gnomAD_constraints   13M    -                   frequency annotation
mgi                  10M    -                   mouse gene annotation
ensembltorefseq      8.3M   -                   identifier mapping
hgmd_public          5.0M   -                   gene annotation
ExAC_constraints     4.6M   -                   frequency annotation
refseqtoensembl      2.0M   -                   identifier mapping
ensembltogenesymbol  1.6M   -                   identifier mapping
ensembl_genes        1.2M   -                   gene annotation
HelixMTdb            1.2M   -                   MT frequency annotation
refseqtogenesymbol   1.1M   -                   identifier mapping
refseq_genes         804K   -                   gene annotation
mim2gene             764K   -                   phenotype information
MITOMAP              660K   -                   MT frequency annotation
kegg                 632K   -                   pathway annotation
mtDB                 336K   -                   MT frequency annotation
tads_hesc            108K   -                   domain annotation
tads_imr90           108K   -                   domain annotation
vista                104K   -                   orthologous region annotation
acmg                 16K    -                   disease gene annotation
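
To get a feeling for how much space the exclusions save, the sizes of all components marked "highly recommended" or "recommended" in the list above can be added up. This is only a sketch: the helper `to_gigabytes` is hypothetical, and it assumes that the K/M/G suffixes are binary multiples (factor 1024), which the list does not state. It accepts both comma and dot as decimal separator.

```python
# Conversion factors to G, assuming binary multiples (K -> M -> G by 1024).
UNITS = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0}


def to_gigabytes(size):
    """Convert a size such as '6.5G', '6,5G' or '573M' to a float in G."""
    number, unit = size[:-1], size[-1]
    return float(number.replace(",", ".")) * UNITS[unit]


# Components whose exclusion is marked (highly) recommended in the list above.
excludable = [
    ("gnomAD_genomes", "80G"),
    ("extra-annos", "50G"),
    ("dbSNP", "32G"),
    ("thousand_genomes", "6.5G"),
    ("gnomAD_exomes", "6.0G"),
    ("knowngeneaa", "4.5G"),
    ("clinvar", "3.3G"),
    ("ExAC", "1.9G"),
    ("dbVar", "573M"),
    ("gnomAD_SV", "250M"),
]

total = sum(to_gigabytes(size) for _, size in excludable)
print(f"{total:.1f}G")  # prints 185.0G
```

Under these assumptions, the components recommended for exclusion account for roughly 185G, which is nearly all of the unpacked data.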

You can find the import_versions.tsv file in the root folder of the package. This file determines which components (called table_group and represented as folders in the package) get imported when the import command is issued. To exclude a table group, comment out its line with a leading # or delete the line. Excluding tables that are not required for development reduces both import time and disk usage. The GRCh38 tables can be excluded as well.

A space-consumption-friendly version of the file would look like this:

build       table_group     version
GRCh37      acmg    v2.0
#GRCh37     clinvar 20200929
#GRCh37     dbSNP   b151
#GRCh37     dbVar   latest
GRCh37      DGV     2016
GRCh37      ensembl_genes   r96
GRCh37      ensembl_regulatory      latest
GRCh37      ensembltogenesymbol     latest
GRCh37      ensembltorefseq latest
GRCh37      ExAC_constraints        r0.3.1
#GRCh37     ExAC    r1
#GRCh37     extra-annos     20200704
GRCh37      gnomAD_constraints      v2.1.1
#GRCh37     gnomAD_exomes   r2.1
#GRCh37     gnomAD_genomes  r2.1
#GRCh37     gnomAD_SV       v2
GRCh37      HelixMTdb       20190926
GRCh37      hgmd_public     ensembl_r75
GRCh37      hgnc    latest
GRCh37      hpo     latest
GRCh37      kegg    april2011
#GRCh37     knowngeneaa     latest
GRCh37      mgi     latest
GRCh37      mim2gene        latest
GRCh37      MITOMAP 20200116
GRCh37      mtDB    latest
GRCh37      ncbi_gene       latest
GRCh37      refseq_genes    r105
GRCh37      refseqtoensembl latest
GRCh37      refseqtogenesymbol      latest
GRCh37      tads_hesc       dixon2012
GRCh37      tads_imr90      dixon2012
#GRCh37     thousand_genomes        phase3
GRCh37      vista   latest
#GRCh38     clinvar 20200929
#GRCh38     dbVar   latest
#GRCh38     DGV     2016
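
Editing the file by hand works fine, but the same change can also be scripted. The following sketch comments out rows by table group or by genome build. The function name `comment_out` and its parameters are hypothetical and not part of VarFish. It assumes whitespace-separated columns in the order build, table_group, version, as shown above.

```python
def comment_out(lines, exclude_groups=(), exclude_builds=()):
    """Prefix matching import_versions.tsv rows with '#'; keep other lines unchanged."""
    result = []
    for line in lines:
        fields = line.split()
        is_data_row = (
            len(fields) >= 3
            and not line.startswith("#")  # already commented out
            and fields[0] != "build"  # header row
        )
        if is_data_row and (fields[1] in exclude_groups or fields[0] in exclude_builds):
            line = "#" + line
        result.append(line)
    return result


example = [
    "build\ttable_group\tversion\n",
    "GRCh37\tacmg\tv2.0\n",
    "GRCh37\tdbSNP\tb151\n",
    "GRCh38\tclinvar\t20200929\n",
]
for line in comment_out(example, exclude_groups={"dbSNP"}, exclude_builds={"GRCh38"}):
    print(line, end="")
```

To apply it to the real file, read import_versions.tsv with `readlines()`, pass the lines through `comment_out`, and write the result back. Keeping a backup copy of the original file makes it easy to re-enable table groups later.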

To perform the import, issue:

$ python manage.py import_tables --tables-path /plenty/space/varfish-server-background-db-20201006

Performing the import twice will automatically skip tables that are already imported. To re-import tables, add the --force parameter to the command:

$ python manage.py import_tables --tables-path /plenty/space/varfish-server-background-db-20201006 --force