Thursday, December 18, 2008

What's 5 % between friends?

You gotta love the breakneck speed at which whole genome sequencing is proceeding. Hell, by the end of 2010 we will have a complete genome for 1000 USD. But the big rub is: how accurate will it be?

I was just telling a partner at Burrill and Company about how the whole field of esoteric genetic testing goes out the window when you can have a genome for 1000 USD. But what I didn't say was, "That's assuming the data is valid."

You see, we can be pretty accurate when sequencing a gene or 2. But when it comes to whole genome sequencing the best these companies can get is 95% correct. So I say, what's 5% between friends..........For clarity, this is per read according to some companies. But I maintain that even with 30x coverage, you will still have too much error to trust this, at least when it comes to making healthcare decisions....

I say this in response to the "pinheads" selling SNP scans DTC and claiming that they hold the keys to all disease.....That 0.1% difference is not all that matters, in fact I would guess it is merely one of approximately 7 or 8 factors that play heavily into common human disease.....That's why the SNP Chip companies dropped prices and may have destroyed the commercial market for this test....Slide pic brought to you by Andrew....Thanks Drew!

In fact, that is precisely why Coriell will be tremendously successful at recruiting volunteers for their study. People realize 2500 USD is not the cost of a 1 million SNP scan, in fact by the end of the year it may very well be the cost of a complete genome....

But here's the question: what is the accuracy of that "Discount DNA," to quote the author David Ewing Duncan......

We know that Helicos actually kept their error rate quiet for some time. But then

"Initial commercial specifications for the Helicos™ Genetic Analysis Platform were set at 50 Mb per hour; 10 Gb per run in 8 days. Early adopters can expect 8 million reads at length-of-read from 25 to 50 bases in each of the 50 flow cell channels utilized, totaling 400 million reads per run. Aftermarket costs are approximately $1.80 per megabase sequenced or $45 per million reads. Additionally, performance is independent of template sizes anywhere from 25 b to 8 Kb. The total error rate is less than or equal to 5%, with a competitive 0.5% substitution error rate. Further, the error rate is independent of the read length. The HeliScope Sequencer is capable of accurately sequencing samples with 20% to 80% GC content. "

The dirty truth..........Yes, now those SNP scans are looking a heck of a lot more accurate than the wonderful complete genomes we may have....Hmmmmm...can someone do the math for me.....what's 5% of 3 billion? What about 6 billion? Even if not that high, what is 0.1%....isn't that what the SNP companies tell you can cause "all disease???"
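
To do that math, here is a minimal sketch in Python. It assumes round genome sizes of 3 billion bases (haploid) and 6 billion (diploid), and it applies each rate as a flat fraction of all bases with no correction from repeated coverage, which is a worst-case reading of the quoted figure. The names `HAPLOID_BASES`, `DIPLOID_BASES`, and `bases_affected` are made up for illustration.

```python
from fractions import Fraction

# Round genome sizes used in the post (assumed, not measured values).
HAPLOID_BASES = 3_000_000_000
DIPLOID_BASES = 6_000_000_000


def bases_affected(genome_size: int, rate: Fraction) -> int:
    """Number of bases covered by a flat rate (hypothetical helper)."""
    return int(genome_size * rate)


# Print the count for each genome size at 5% and at 0.1%.
for label, size in (("3 billion", HAPLOID_BASES), ("6 billion", DIPLOID_BASES)):
    for rate in (Fraction(5, 100), Fraction(1, 1000)):
        print(f"{float(rate):.1%} of {label}: {bases_affected(size, rate):,} bases")
```

On those assumptions, 5% works out to 150 million bases of 3 billion and 300 million of 6 billion, while 0.1% is 3 million and 6 million respectively.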

Uh-Oh.....we may have a complete genome for 2500 USD, but whose genome is it? 0.1% makes us different, right? This problem will hamper the entire field for quite some time. Imagine all the false positive data that may be generated here. If you think GWAS needed replication, wait until you see W-GWAS studies. We are going to have so much false positive data out there until we can perfect the technology...I can't wait to see the next level of commercial ventures to arise prematurely from Complete Genomics....Have they said their error rate? I have a big problem with any company who says "Don't do the sequencing....we'll do the sequencing and give you the data."

In fact, there is a guy on house arrest in Manhattan for doing a similar thing. His name is Madoff. He said "I have a secret formula, trust me, I can give you fantastic returns.....Until my kids turn me in"

I wonder if Complete Genomics is the Genome's Ponzi scheme. I wouldn't assume so, since George Church is on their board and if anyone is a purist it's him. But hey, how will anyone know that without "double checking" the books???? Even if they have a dramatically lower error rate than Helicos.......

Unless the error rate gets to about 0.001% then we may have to wait a while before we can do most meaningful things with the "Whole Genome and nothing but the Genome."

The Sherpa Says: My take, we need to study how SNP scans AND whole genome scans may affect healthcare outcomes. If we aren't answering questions for clinicians and the public what are we doing? Providing neat websites based on false or lackluster data? We have to be serious here and figure out what it all means before we start using this. That's precisely why 23andMe says that their scan shouldn't be used for healthcare....because they too know it is not ready for prime time.....I wonder how Complete Genomics feels about that? Because, I for one am a little concerned about the whole "black box" technology movement and applying it to my patients' care. But it is pretty obvious MDVIP doesn't care.


Andrew said...

Oh hm, where did that slide come from?

It's still viable that health predictions can be made with inaccurate tests (5% is quite high...), but my estimate is that in the time it would take to do a rigorous study, the error rate will be down significantly. I'm willing to give a company with George Church the benefit of the doubt when it comes to generating obnoxious press. The problem may lie with stupid journalists and PR flaks. (not unusual)

As for existing SNP chip tests like 23andMe, their accuracy is good, and the test could be repeated if it were a critical element in some medical decision. However, 23andMe wants to be non-medical, and as long as they don't imply otherwise, that's fine with me. When they're ready to be medically responsible, and I suspect that day will happen, yah sure, I'll include their data in medical services.

But, I'm sure as hell not going to promote medical advice in which some salesman leers through a bleached smile "yah, well, we can't be liable, but trust me." And I'm DEFINITELY going to be outraged when the car salesmen of DTC Navigenics try to pull this shit on the medical community.

I've seen a lot of "wink wink nudge nudge" out of 23andMe, but not the flat out fraudulent bullshit like "we're provided by physicians as a medical tool" that I see out of Navigenics.

Andrew said...

Note: Navigenics has a similar disclaimer, though it's less clear than 23andMe's. However, I'm going to award a point to 23andMe for at least publishing clear and obvious prose in their TOS and Consent Form regarding the use of its test. Both 23andMe and Navigenics are still negative ethically, though.

Ricardipus said...

Let's not get confused with error rates. 5% raw error rate on a single read is not the same as 5% consensus error rate across a 40x (or whatever) coverage. I imagine with appropriate error correction and sufficient depth of coverage, Helicos and the rest of them will be robust enough.

Whether or not Helicos is still around six months from now is a totally different question, however. ;)

Daniel said...


You should have talked to someone familiar with next-gen sequencing before you posted this.

The 5% error rate quoted is for an individual sequencing read. Diagnostic sequencing would typically use somewhere in the realm of 30X coverage, meaning every base in the target region is covered an average of 30 times - so at a heterozygous site, each allele would have ~15 independent reads, of which only 5% (0.75 reads on average) will contain an error at that position.

With 30 independent reads per base a 5% per-read error rate translates into a level of accuracy that is higher than most traditional diagnostic tests. Solexa sequencing at 30X has an accuracy of over 99.999%.

Short-read sequencing definitely has some major issues - it can't sequence highly repetitive DNA, and it's still pretty bad at picking up large-scale structural rearrangements - but with sufficient sequencing depth mis-calling bases is not really an issue.

Given that misunderstanding the rest of your post is just embarrassing. Sorry.

Steve Murphy MD said...

I am not embarrassed by my post. If you sample a base pair 30 times, and have an independent error rate of 5% each time, what are the odds that you still have an incorrect base pair? Add that to each additional base pair that could have the same problem and you are still nowhere near 99.99% accurate.... That is just not correct. I understand 30x coverage, which is on the low end actually.....But you have to take each read as an independent event, just like reproduction and independent assortment. So if you have a 5% chance each time it is still a very high pretest probability that you will have a wrong base pair. You can try for consensus at 30x and hope that it is the right sequence, but you know as well as I do that there can be a whole host of things to monkey this up, whether it is next-gen with Solexa or whoever or it is next-next gen with PacBio or Helicos....To tout an accuracy rate of 99.99% is probably even more embarrassing than erring on the side of a much higher inaccuracy rate..

For Sanger versus Next-Next in Salmon you can read here

When caring for humans, we aren't talking about salmon. You have to be damn sure.....

At least that is the case when practicing medicine...maybe not for research, but then again, look at all the inaccurate research. You can't underestimate inaccuracy when caring for human life, you have to play on the cautious side...

And BTW the article on TOTAL error rate was:

"Helicos to Place Single-Molecule Sequencer at Broad Institute at No Cost"

[December 16, 2008]

By a GenomeWeb staff reporter

NEW YORK (GenomeWeb News) - Helicos BioSciences said today in a filing with the US Securities and Exchange Commission that it is preparing to place a Helicos Genetic Analysis System at the Broad Institute.

The company said that it is providing the system to the Broad at no cost and that it expects to install it in early 2009.

Helicos said that it has recently improved its single-molecule sequencing chemistry and other aspects of the system so that it now generates approximately 100 megabases to 140 megabases per hour — an improvement over the 50 megabases per hour that the company reported in its third-quarter earnings statement.

The company said that the system provides median read lengths of 33 bases and has an approximate total error rate of 5 percent.

Helicos has also placed its sequencer at Stanford University and genomic services firm Expression Analysis, but the company has not disclosed the financial terms for either of those agreements.

So I ask you, is that single read OR "TOTAL ERROR RATE?"

The people up here think it is "Total Error"......


p.s. and the "Experts" here at Yale say you probably need 100x coverage to make up for base pair mistakes.....

Daniel said...

Hi Steve,

Sure, 30X is at the low end - your "experts" (your quotes) at Yale are probably right that 100X will be closer to the routine depth for serious diagnostics.

However, your "people up here" are wrong - the 5% error rate quoted is for a single read. Ask someone who knows what they're talking about. In fact you could just stop and think about it for yourself: how could someone quote an error rate for a multiple-pass sequencing run without specifying the coverage or the alignment and SNP-calling algorithms used? It's obvious the error rate must be a single-read raw rate. (And you even tacitly admit as much in your first paragraph, before veering into perplexed disagreement at the end of your comment.)

In fact your first paragraph is all over the place, so I'm not going to spend too much time trying to unravel it. I will say this: the 99.999% accuracy I quoted is an empirical value, obtained from sequencing a section of known DNA to 30X coverage - I linked you to the article (it's the recent "African genome" paper in Nature). But you can do your own calculations, multiplying out a 5% error rate over 30 independent reads, and tell me what you get.

Sure, that accuracy will drop somewhat when you look at heterozygous sites, but you can just increase the coverage - 100X would nail pretty much every mappable base with near-perfect accuracy.
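
A minimal sketch of that calculation in Python, under a deliberately simple model: every read is an independent trial with the same 5% error rate, and the consensus call at a base is wrong only when at least half of the reads covering it are wrong. The function name `consensus_error` and the depths tried (15 reads for one allele at a heterozygous site under 30X, then 30 and 100) are illustrative choices, and the model ignores systematic errors, misalignment, and repeats.

```python
from math import comb


def consensus_error(depth: int, per_read_error: float) -> float:
    """Probability that at least half of `depth` independent reads are wrong.

    This is the upper tail of a binomial distribution. A tie is counted as a
    wrong call, which keeps the estimate on the pessimistic side.
    """
    threshold = (depth + 1) // 2  # smallest count that is at least half
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(threshold, depth + 1)
    )


# 15 reads models one allele at a heterozygous site under 30X coverage.
for depth in (15, 30, 100):
    print(f"{depth} reads: consensus error ~ {consensus_error(depth, 0.05):.2e}")
```

Under these assumptions the 30-read figure comes out far below the 0.001% error implied by 99.999% accuracy. Real data can do worse than this model wherever errors are correlated rather than independent.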

As for your salmon paper - what, did you just do a PubMed search for "30x coverage" and just link to the first abstract you found? That paper doesn't refer to a "next-next" platform, it's just the lamest of the three next-gen platforms (454) - and the problems with assembly aren't a result of single-base error rates, but problems spanning repetitive regions (an issue I already noted in my comment above).

Finally, don't pay too much attention to Helicos - as Ricardipus hints, it's unlikely they'll be in the market for much longer, and the 454 technology may not be far behind (unless Roche can do something astonishing to its throughput).

By the way, I didn't say that you should be embarrassed by your post - I've been reading your blog too long to expect you to be introspective about anything you write. I meant that I was embarrassed by reading it.

Andrew said...

I do have to side with Daniel on this one. This is a sloppy attack, and I have a stance about sloppy attacks.

Andrew said...