Using DNA for archival storage

double helix, by AndreaLaurel

DNA ‘perfect for digital storage’
[Via BBC News | Science/Nature | World Edition]

UK scientists demonstrate how DNA could be used to archive digital data, encoding Shakespeare’s sonnets and other information in the “life molecule”.


DNA is very stable when stored properly (heck, we can isolate and sequence the DNA of woolly mammoths that are 50,000 years old).

This idea, which relies on next-generation DNA synthesis and sequencing technologies, was first explored last September by George Church’s lab. The idea is to break the data up into ‘packets’, much as data is broken up today when it is sent over the Internet. The data are split into small groups, or packets, of about 120 ‘bits’. Each packet carries not only the data but also added indexing bits to simplify reassembly. This also allows redundancies to be used to make sure the data can be fully reassembled.
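To make the packet idea concrete, here is a minimal sketch in Python. It is not the scheme from either paper: the function names and the 12-byte payload size are my own assumptions, chosen only to show how an index stored with each packet lets you put the data back in order.

```python
def make_packets(data: bytes, payload_size: int = 12):
    """Split data into (index, payload) pairs.

    payload_size is an arbitrary illustrative value, not the papers' packet size.
    """
    return [
        (i // payload_size, data[i:i + payload_size])
        for i in range(0, len(data), payload_size)
    ]


def reassemble(packets):
    """Sort packets by their index and join the payloads back together."""
    return b"".join(payload for _, payload in sorted(packets))


message = b"Shall I compare thee to a summer's day?"
packets = make_packets(message)
# Packets can arrive (or be sequenced) in any order; the index restores it.
restored = reassemble(reversed(packets))
```

Because every packet carries its own position, the order in which the packets are read back does not matter.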

As a refresher, DNA consists of two strands composed of four nucleotide bases (A, G, T and C). These nucleotides are complementary, meaning that an A on one strand always pairs with a T on the other. The same goes for G and C. Thus, if you know the sequence of one strand, you automatically know the sequence of the other. This provides a layer of error checking not found in computer network packets. You can sequence each strand separately to verify the actual sequence. If they are not complementary, then you know there is an error.
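The complementarity check is easy to sketch in code. The snippet below is only an illustration (the function names are mine). It assumes both strands are written in the usual 5' to 3' direction, so the partner of a strand is its reverse complement.

```python
# Watson-Crick pairing: A<->T, G<->C
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def strands_agree(top: str, bottom: str) -> bool:
    """True if two independently sequenced strands are consistent."""
    return reverse_complement(top) == bottom


print(reverse_complement("ATGC"))      # GCAT
print(strands_agree("ATGC", "GCAT"))   # True
print(strands_agree("ATGC", "GCAA"))   # False: one read contains an error
```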


The data are converted to digital form and encoded in a way that accounts for the properties of DNA (b and c). The approximately 120-nucleotide packets (over 50,000 in all) were then synthesized using next-generation approaches.

With DNA, the segments themselves also overlap, creating even more redundancy. In the recent paper, the overlapping segments were offset from one another by 25 nucleotides. Thus, every bit of sequence should also be found in at least 4 separate packets (d).
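Here is a small sketch of why a 25-nucleotide offset gives fourfold coverage. The 100-nucleotide segment length is my assumption (it is the length at which a 25-nucleotide step works out to exactly four copies), and the function names are hypothetical.

```python
def overlapping_fragments(seq: str, length: int = 100, step: int = 25):
    """Cut seq into fragments of `length`, starting a new one every `step` bases."""
    return [
        (start, seq[start:start + length])
        for start in range(0, len(seq) - length + 1, step)
    ]


def coverage(seq_len: int, length: int = 100, step: int = 25):
    """Count how many fragments contain each position of the sequence."""
    counts = [0] * seq_len
    for start in range(0, seq_len - length + 1, step):
        for pos in range(start, start + length):
            counts[pos] += 1
    return counts


counts = coverage(300)
print(counts[0], counts[150], max(counts))   # 1 4 4
```

In this toy version the very ends of the sequence are covered fewer than four times; everything in between appears in exactly four fragments, so losing any one packet still leaves three copies of its contents.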



Then, to read the data, next-generation sequencing techniques were used to rapidly read all the sequences at once. Using modern bioinformatics software, the thousands of sequence reads could then be reassembled. Finally, using the indexing data from each packet, the entire set of data could be decoded.

The main difference in this new paper lies in the encoding procedures. For example, they create the packets so that alternating segments are converted to their reverse complement. This means that, effectively, the top strand of a DNA double helix encodes the material for one packet. Then, in the next packet, the bottom strand has the relevant data encoded. Then the top, and so on.

Thus they use the inherent complementarity of DNA to help their assembly. As they recreate the broken-up data, each packet should alternate from top strand to bottom strand and back to the top. It would be like reading from right to left, then from left to right, then back again.

If this does not occur during assembly, then they know that their assembly is incorrect.
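The alternating-strand trick can be sketched as follows. This is a simplified illustration with names of my own choosing, not the paper's actual encoder: every second fragment is stored as its reverse complement, and applying the same step again on decoding restores the originals.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def alternate_strands(fragments):
    """Flip every odd-numbered fragment to its reverse complement.

    Flipping twice restores a fragment, so the same function decodes.
    """
    return [
        reverse_complement(frag) if i % 2 else frag
        for i, frag in enumerate(fragments)
    ]


original = ["AACG", "AACG", "TTGA"]
stored = alternate_strands(original)
print(stored)                                  # ['AACG', 'CGTT', 'TTGA']
print(alternate_strands(stored) == original)   # True
```

If a fragment ends up in the wrong slot during assembly, it gets flipped (or not flipped) when it should not be, and the decoded sequence no longer makes sense. That mismatch is the signal that the assembly went wrong.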

The data encoded in the DNA takes up very little space (they got storage densities of about 2.2 petabytes per gram of DNA), is stable for long periods of time and is inherently redundant, making recovery of the data possible. Since the DNA used never goes through a living organism, there is no need to worry about mutations.
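To get a feel for that density, here is a back-of-the-envelope calculation. It assumes decimal petabytes (10^15 bytes) and takes the 2.2 petabytes per gram figure at face value; the function name is mine.

```python
BYTES_PER_PETABYTE = 10**15  # decimal petabyte, an assumption


def grams_needed(num_bytes: float, petabytes_per_gram: float = 2.2) -> float:
    """Grams of DNA needed to hold num_bytes at the quoted density."""
    return num_bytes / (petabytes_per_gram * BYTES_PER_PETABYTE)


one_exabyte = 10**18
print(round(grams_needed(one_exabyte), 1))   # 454.5
```

So, under these assumptions, an exabyte of data would fit in a bit under half a kilogram of DNA.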

The difficulty currently is the write/read speed. This is not something that will ever replace a computer’s memory. It takes a while to create the DNA and quite some time to read it later. Thus the best use for this is deep archives of material that simply are not needed very often.

The costs are, at the moment, prohibitive for any archive meant to last less than about 1000 years. For shorter periods, copying tapes every 5 years is cheaper. This is because the cost of synthesizing the DNA is still high. If that cost drops 10- to 100-fold (which could happen in a decade), then it becomes more economical to use the DNA archiving system rather than copying tapes.
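A toy model shows how the break-even point moves with synthesis cost. This is my own simplification, not the paper's cost analysis: it assumes DNA is a one-time cost, that each tape copy costs the same, and that nothing else matters. Under those assumptions, a 1000-year break-even with copies every 5 years implies DNA currently costs about 200 tape copies.

```python
def break_even_years(dna_to_tape_cost_ratio: float, copy_interval_years: float = 5):
    """Years after which one-time DNA storage beats repeated tape copies.

    Toy model: tape cost grows by one copy every copy_interval_years,
    while DNA is paid for once.
    """
    return dna_to_tape_cost_ratio * copy_interval_years


ratio_now = 1000 / 5  # implied by the ~1000-year figure: 200 tape copies
print(break_even_years(ratio_now))         # 1000.0
print(break_even_years(ratio_now / 10))    # 100.0
print(break_even_years(ratio_now / 100))   # 10.0
```

In this simple picture, a 100-fold drop in synthesis cost brings the break-even point down from about a millennium to about a decade.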

There are many things that need to be archived but not accessed very often. This could be a nice way to accomplish that at a lower eventual cost.