What’s in an Illumina GA run directory?
[Via PolITiGenomics]
One of the main things that differentiates genomics from other endeavors that use a lot of disk space is that genomics file systems tend to have a lot of files (millions). This was true with Sanger sequencing, and it seems to be even more true with next-generation sequencing technologies, especially Illumina/Solexa and AB SOLiD. This large number of files and the parallel access of these files by large computational clusters tends to give most storage solutions great difficulty.

So what, exactly, is in an Illumina run directory? Well, to get breakdowns of file statistics there is a nifty little tool called fsstats. It is just a simple Perl script that crawls through a directory stat’ing files and reporting metrics. For example, when you run it on an Illumina GA IIx 2×100, high cluster density run after the primary analysis has completed, you get the following information about the distribution of file sizes. (I have rearranged and condensed the information to make it fit.)
total 7.46 TB used to store 7.46 TB user data, overhead 0.04%
count=991227 avg=8076.50 KB min=0.00 KB max=13128679.30 KB

size range                count    %tot %tot cum        total size    %tot %tot cum
[       0-       2 KB):    4019 ( 0.41) (  0.41)        3009.03 KB ( 0.00) (  0.00)
[       2-       4 KB):       2 ( 0.00) (  0.41)           6.99 KB ( 0.00) (  0.00)
[       4-       8 KB):     981 ( 0.10) (  0.50)        5964.82 KB ( 0.00) (  0.00)
[       8-      16 KB):  193351 (19.51) ( 20.01)     2588619.88 KB ( 0.03) (  0.03)
[      16-      32 KB):    2656 ( 0.27) ( 20.28)       58586.79 KB ( 0.00) (  0.03)
[      32-      64 KB):     901 ( 0.09) ( 20.37)       31369.79 KB ( 0.00) (  0.03)
[      64-     128 KB):    2893 ( 0.29) ( 20.66)      303872.38 KB ( 0.00) (  0.04)
[     128-     256 KB):       2 ( 0.00) ( 20.66)         345.34 KB ( 0.00) (  0.04)
[     256-     512 KB):       4 ( 0.00) ( 20.66)        1222.53 KB ( 0.00) (  0.04)
[     512-    1024 KB):       1 ( 0.00) ( 20.66)         622.26 KB ( 0.00) (  0.04)
[    1024-    2048 KB):       2 ( 0.00) ( 20.66)        3199.89 KB ( 0.00) (  0.04)
[    2048-    4096 KB):      12 ( 0.00) ( 20.66)       41779.69 KB ( 0.00) (  0.04)
[    4096-    8192 KB):  776654 (78.35) ( 99.02)  5863161178.18 KB (73.24) ( 73.28)
[   16384-   32768 KB):      21 ( 0.00) ( 99.02)      487156.46 KB ( 0.01) ( 73.28)
[   32768-   65536 KB):    3856 ( 0.39) ( 99.41)   163552521.17 KB ( 2.04) ( 75.32)
[   65536-  131072 KB):    3825 ( 0.39) ( 99.79)   307535341.32 KB ( 3.84) ( 79.17)
[  131072-  262144 KB):     133 ( 0.01) ( 99.81)    32458046.12 KB ( 0.41) ( 79.57)
[  262144-  524288 KB):    1787 ( 0.18) ( 99.99)   658830514.46 KB ( 8.23) ( 87.80)
[ 2097152- 4194304 KB):      16 ( 0.00) ( 99.99)    47898262.36 KB ( 0.60) ( 88.40)
[ 4194304- 8388608 KB):      64 ( 0.01) (100.00)   432084134.39 KB ( 5.40) ( 93.80)
[ 8388608-16777216 KB):      47 ( 0.00) (100.00)   496603147.67 KB ( 6.20) (100.00)

So the total size of the run directory is nearly 7.5 TB and there are almost one million files. The average size of a file in the run directory is about 8 MB and the maximum size is over 13 GB. The images (represented in the 4096-8192 KB range) comprise over 78% of the files and 73% of the total size of the run directory.
This significant penalty can be avoided by using RTA and not transferring image files. The largest files are the alignment (ELAND) outputs and the FASTQ files in the GERALD directory. Speaking of directories, here is a breakdown by number of files in each directory.
[More]
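For anyone who wants a similar breakdown without the Perl script, here is a minimal Python sketch of the same idea: walk a directory, stat each file, and count files and bytes in power-of-two size bins. It is not fsstats itself. The function names (size_bin, size_histogram, format_histogram) are made up for illustration, and it assumes 1 KB = 1024 bytes.

```python
import os
from collections import Counter


def size_bin(size_bytes):
    """Return the (low_kb, high_kb) power-of-two bin that holds a file size."""
    kb = size_bytes / 1024  # assumption: 1 KB = 1024 bytes
    if kb < 2:
        return (0, 2)
    low = 2
    while kb >= low * 2:
        low *= 2
    return (low, low * 2)


def size_histogram(root):
    """Walk root and return (file counts, byte totals), both keyed by size bin."""
    counts = Counter()
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            b = size_bin(st.st_size)
            counts[b] += 1
            totals[b] += st.st_size
    return counts, totals


def format_histogram(counts, totals):
    """Return one text line per bin with cumulative percentages."""
    n_files = sum(counts.values())
    n_bytes = sum(totals.values())
    lines = []
    cum_files = cum_bytes = 0
    for low, high in sorted(counts):
        c, t = counts[(low, high)], totals[(low, high)]
        cum_files += c
        cum_bytes += t
        pct_files = 100 * cum_files / n_files
        pct_bytes = 100 * cum_bytes / n_bytes if n_bytes else 0.0
        lines.append(
            f"[{low:>8}-{high:>8} KB): {c:>7} ({pct_files:6.2f}) "
            f"{t / 1024:>14.2f} KB ({pct_bytes:6.2f})"
        )
    return lines

# Example use (the path is a placeholder):
#   counts, totals = size_histogram("/path/to/run_directory")
#   print("\n".join(format_histogram(counts, totals)))
```

On a million-file run directory this does one stat call per file, so expect it to take a while on a busy shared file system.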
7.5 terabytes! One million files! That just boggles the imagination. I love how the sizes are still in kilobytes. One terabyte is about a billion kilobytes, or 1000 gigabytes. A Blu-ray disc can store up to 50 gigabytes on a dual-layer disc, so 1 terabyte would take 20 of these discs.
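The unit arithmetic is easy to check with a few lines of Python. This sketch assumes decimal (SI) units throughout, to match the "1000 gigabytes" figure, and treats the run's 7.46 TB as decimal terabytes, which may not be how fsstats counts them. The discs_needed helper is just an illustrative name.

```python
# Decimal (SI) units, matching the "1000 gigabytes per terabyte" arithmetic.
KB = 10**3
GB = 10**9
TB = 10**12


def discs_needed(size_bytes, disc_bytes):
    """Whole discs needed to hold size_bytes (integer ceiling division)."""
    return -(-size_bytes // disc_bytes)


kb_per_tb = TB // KB            # kilobytes in one terabyte
gb_per_tb = TB // GB            # gigabytes in one terabyte
bluray_dual_layer = 50 * GB     # 50 GB dual-layer Blu-ray disc

discs_per_tb = discs_needed(TB, bluray_dual_layer)

# Assumption: the 7.46 TB run is read as decimal terabytes.
run_bytes = 7_460_000_000_000
discs_for_run = discs_needed(run_bytes, bluray_dual_layer)

print(f"{kb_per_tb:,} KB per TB, {gb_per_tb:,} GB per TB")
print(f"{discs_per_tb} discs per TB, {discs_for_run} discs for the whole run")
```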
With all the sequencing going on, there will be lots of huge storage centers to hold all the data. I wonder how long it takes to back up several terabytes of information?
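As a rough answer to the backup question, here is a back-of-the-envelope sketch that divides size by throughput. The 100 MB/s sustained rate is purely a hypothetical figure picked for illustration, not a measurement, and transfer_hours is an illustrative name. A real backup of a million files will usually be slower than raw throughput suggests because of per-file overhead.

```python
def transfer_hours(size_bytes, bytes_per_second):
    """Hours needed to move size_bytes at a constant bytes_per_second."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return size_bytes / bytes_per_second / 3600


TB = 10**12                      # decimal terabyte
ASSUMED_RATE = 100 * 10**6       # hypothetical sustained 100 MB/s

for terabytes in (1, 5, 7.46):
    hours = transfer_hours(terabytes * TB, ASSUMED_RATE)
    print(f"{terabytes} TB at 100 MB/s: about {hours:.1f} hours")
```

Plug in your own measured throughput to get a number that means something for your storage.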


