Wait a year. It’ll be cheaper.

diskdrive by AlexWitherspoon
The $4400 Genome:
[Via ScienceNOW]

The cost of sequencing an entire human genome continues to plummet. Complete Genomics, a Mountain View, California-based biotechnology company last year claimed it would soon be able to sell full human genome sequences for as little as $5000 apiece. That now appears within reach. In tomorrow’s Science, the company will report that it sequenced three human genomes for about $4400 each, at least in the cost of reagents. Such cheap sequencing could vastly accelerate studies designed to pinpoint genes underlying complex diseases.
[More]

And they claim to be able to do it in a day! I’d like to see how accurate the data are, but it does represent a huge decrease in time and money:


The rapid fall in sequencing prices may give genomics an equivalent of Moore’s Law, which describes how the number of transistors on computer chips doubles every 18 months, steadily driving down the cost of computing power. In 2003, the cost of sequencing a human genome was an estimated $300 million. That was down to $1 million in 2007 and $60,000 last year.


It is now dropping about an order of magnitude every 18 months or so. So by the end of next year, we could be looking at $500 genomes. As I said below, buy stock in hard drive makers.
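As a rough check on that extrapolation, here is a minimal sketch that assumes the price keeps falling tenfold every 18 months, starting from the $4400 reagent cost quoted above. The function name and the month offsets are my own, purely for illustration, and nothing guarantees the trend will hold.

```python
def projected_cost(start_cost, months, months_per_tenfold=18.0):
    """Cost after `months`, assuming a tenfold drop every `months_per_tenfold` months."""
    return start_cost * 10 ** (-months / months_per_tenfold)


if __name__ == "__main__":
    # Starting point: the $4400 reagent cost per genome reported by Complete Genomics.
    for months in (0, 6, 12, 18):
        print(f"{months:2d} months out: ${projected_cost(4400, months):,.0f}")
```

Under that assumption the $500 mark arrives a little under 18 months out.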

Buy stock in hard drive companies

What’s in an Illumina GA run directory?:
[Via PolITiGenomics]

One of the main things that differentiates genomics from other endeavors that use a lot of disk space is that genomics file systems tend to have a lot of files (millions). This was true with Sanger sequencing, and it seems to be even more true with next-generation sequencing technologies, especially Illumina/Solexa and AB SOLiD. This large number of files and the parallel access of these files by large computational clusters tends to give most storage solutions great difficulty.

So what, exactly, is in an Illumina run directory? Well, to get breakdowns of file statistics there is a nifty little tool called fsstats. It is just a simple Perl script that crawls through a directory stat’ing files and reporting metrics. For example, when you run it on an Illumina GA IIx 2×100, high cluster density run after the primary analysis has completed, you get the following information about the distribution of file sizes. (I have rearranged and condensed the information to make it fit.)

total 7.46 TB used to store 7.46 TB user data, overhead 0.04%
count=991227 avg=8076.50 KB
min=0.00 KB max=13128679.30 KB
size range    count   %tot  %tot cum       total size   %tot  %tot cum
[       0-       2 KB):   4019 ( 0.41) (  0.41)       3009.03 KB ( 0.00) (  0.00)
[       2-       4 KB):      2 ( 0.00) (  0.41)          6.99 KB ( 0.00) (  0.00)
[       4-       8 KB):    981 ( 0.10) (  0.50)       5964.82 KB ( 0.00) (  0.00)
[       8-      16 KB): 193351 (19.51) ( 20.01)    2588619.88 KB ( 0.03) (  0.03)
[      16-      32 KB):   2656 ( 0.27) ( 20.28)      58586.79 KB ( 0.00) (  0.03)
[      32-      64 KB):    901 ( 0.09) ( 20.37)      31369.79 KB ( 0.00) (  0.03)
[      64-     128 KB):   2893 ( 0.29) ( 20.66)     303872.38 KB ( 0.00) (  0.04)
[     128-     256 KB):      2 ( 0.00) ( 20.66)        345.34 KB ( 0.00) (  0.04)
[     256-     512 KB):      4 ( 0.00) ( 20.66)       1222.53 KB ( 0.00) (  0.04)
[     512-    1024 KB):      1 ( 0.00) ( 20.66)        622.26 KB ( 0.00) (  0.04)
[    1024-    2048 KB):      2 ( 0.00) ( 20.66)       3199.89 KB ( 0.00) (  0.04)
[    2048-    4096 KB):     12 ( 0.00) ( 20.66)      41779.69 KB ( 0.00) (  0.04)
[    4096-    8192 KB): 776654 (78.35) ( 99.02) 5863161178.18 KB (73.24) ( 73.28)
[   16384-   32768 KB):     21 ( 0.00) ( 99.02)     487156.46 KB ( 0.01) ( 73.28)
[   32768-   65536 KB):   3856 ( 0.39) ( 99.41)  163552521.17 KB ( 2.04) ( 75.32)
[   65536-  131072 KB):   3825 ( 0.39) ( 99.79)  307535341.32 KB ( 3.84) ( 79.17)
[  131072-  262144 KB):    133 ( 0.01) ( 99.81)   32458046.12 KB ( 0.41) ( 79.57)
[  262144-  524288 KB):   1787 ( 0.18) ( 99.99)  658830514.46 KB ( 8.23) ( 87.80)
[ 2097152- 4194304 KB):     16 ( 0.00) ( 99.99)   47898262.36 KB ( 0.60) ( 88.40)
[ 4194304- 8388608 KB):     64 ( 0.01) (100.00)  432084134.39 KB ( 5.40) ( 93.80)
[ 8388608-16777216 KB):     47 ( 0.00) (100.00)  496603147.67 KB ( 6.20) (100.00)

So the total size of the run directory is nearly 7.5 TB and there are almost one million files. The average size of a file in the run directory is about 8 MB and the maximum size is over 13 GB. The images (represented in the 4096-8192 KB range), comprise over 78% of the files and 73% of the total size of the run directory. This significant penalty can be avoided by using RTA and not transferring image files. The largest files are the alignment (ELAND) outputs and the FASTQ files in the GERALD directory. Speaking of directories, here is a breakdown by number of files in each directory.

[More]
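For anyone curious what that kind of crawl involves, here is a minimal Python sketch of the idea: walk a directory, stat every file, and tally counts and bytes in power-of-two size buckets like the ones in the output above. This is not the fsstats code (which is Perl); the function names are hypothetical, and I am assuming a KB of 1024 bytes.

```python
import os
from collections import Counter


def size_bucket(size_bytes):
    """Lower bound, in KB, of the power-of-two bucket a file falls into."""
    kb = size_bytes / 1024  # assumption: 1 KB = 1024 bytes
    if kb < 2:
        return 0  # the first bucket covers [0, 2) KB
    return 1 << (int(kb).bit_length() - 1)


def size_histogram(root):
    """Walk `root`, stat each non-directory entry, and total files and bytes per bucket."""
    counts, totals = Counter(), Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            bucket = size_bucket(size)
            counts[bucket] += 1
            totals[bucket] += size
    return counts, totals


def print_histogram(root):
    """Print one line per non-empty bucket, smallest files first."""
    counts, totals = size_histogram(root)
    for bucket in sorted(counts):
        upper = 2 if bucket == 0 else bucket * 2
        print(f"[{bucket:>8}-{upper:>8} KB): {counts[bucket]:>8} files "
              f"{totals[bucket] / 1024:>16.2f} KB")
```

Calling `print_histogram("/path/to/run")` on a run directory (the path is a placeholder) would give a table in the same spirit as the one quoted, though on a million files even a simple crawl like this will take a while.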

7.5 terabytes! One million files! That just boggles the imagination. I love how the sizes are still in kilobytes. 1 terabyte is about a billion kilobytes, or 1000 gigabytes. A dual-layer Blu-ray disc can store up to 50 gigabytes, so 1 terabyte would take 20 of these discs.
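To put numbers on the disc count, here is a small sketch of the arithmetic. It assumes a 50 GB disc in decimal units and a KB of 1024 bytes in the fsstats output. That second assumption is a guess on my part, though it fits the totals: 991227 files at an average of 8076.50 KB only comes to 7.46 TB if the units are binary.

```python
import math

BYTES_PER_KB = 1024        # assumption: fsstats reports 1024-byte kilobytes
DISC_BYTES = 50 * 10**9    # dual-layer Blu-ray capacity, decimal gigabytes


def discs_needed(total_bytes, disc_bytes=DISC_BYTES):
    """Number of whole discs required to hold `total_bytes`."""
    return math.ceil(total_bytes / disc_bytes)


# One decimal terabyte.
one_tb = 10**12

# The run directory: file count times average file size, taken from the fsstats output.
run_bytes = 991227 * 8076.50 * BYTES_PER_KB

if __name__ == "__main__":
    print("1 TB:", discs_needed(one_tb), "discs")
    print("run directory:", discs_needed(run_bytes), "discs")
```

With those assumptions, a single terabyte needs 20 discs and the whole run directory needs 164.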

With all the sequencing going on, there will be lots of huge storage centers to hold all the data. I wonder how long it takes to back up several terabytes of information?
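A back-of-the-envelope answer is just size divided by throughput. The sketch below makes that explicit. The 100 MB/s sustained rate and the per-file overhead are made-up illustrative figures, not measurements, so plug in numbers from your own storage.

```python
def backup_hours(total_bytes, bytes_per_second, n_files=0, per_file_overhead_s=0.0):
    """Hours to copy `total_bytes` at a sustained rate, plus an optional fixed cost per file."""
    transfer_s = total_bytes / bytes_per_second
    overhead_s = n_files * per_file_overhead_s
    return (transfer_s + overhead_s) / 3600


if __name__ == "__main__":
    size = 7.5e12   # roughly the run directory, in bytes
    rate = 100e6    # hypothetical sustained throughput: 100 MB/s
    print(f"streaming only: {backup_hours(size, rate):.1f} hours")
    # Hypothetical 5 ms of overhead per file across about a million files.
    print(f"with per-file overhead: {backup_hours(size, rate, 10**6, 0.005):.1f} hours")
```

At that hypothetical rate, 7.5 TB takes about 21 hours before counting any per-file cost, which is part of why having a million files matters as much as the raw size.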
