#### Selected (Mostly Recent) Publications

April, 2012 – Gusfield

#### Books:

ReCombinatorics: The Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks.

D. Gusfield (MIT Press, 2014)

Algorithm on Strings, Trees, and Sequences: Computer Science and Computational Biology

D. Gusfield (Cambridge Press)

The Stable Marriage Problem: Structure and Algorithms

D. Gusfield and R. Irving (MIT Press)

#### Papers on Multi-State Perfect Phylogeny

- Constructing Perfect Phylogenies and Proper Triangulations for Three-State Characters

Rob Gysel, Fumei Lam and Dan Gusfield.

Proceedings of WABI 2011, Springer LNCS - Reducing Multi-State to Binary Perfect Phylogeny with Applications to Missing, Removable, Inserted and Deleted Data.

Kristian Stevens and Dan Gusfield

Proceedings of WABI 2010, Springer LNCS Vol. 6293 2010, p. 274-287 - Extensions and Improvements to the Chordal Graph Approach to the Multi-state Perfect Phylogeny Problem

R. Gysel and D. Gusfield

In Proceedings of the ISBRA Conference, 2010, Springer LNBI Vol. 6054, p. 52-60. - The Three-state Perfect Phylogeny Problem Reduces to 2-SAT.

D. Gusfield and Y. Wu

Communicationsand Information Systems, Special Issue Dedicated to Michael Waterman,

Volume 9, No. 4, 2009. Pages 195-301 - Generalizing the Splits Equivalence Theorem and Four Gamete Condition: Perfect Phylogeny on Three State Characters

Fumie Lam, D. Gusfield, S. Sridhar

This an expanded version of the WABI 2009 conference paper presented September 12, 2009.The main result of the paper is that there is a three-state perfect phylogeny (or in other terms, the data is compatible with a three-state tree) if and only if there is one for every subset of three characters. This generalizes the binary case where the classical four-gametes condition says that there is a binary perfect phylogeny in binary data if and only if there is one for every pair of characters. The paper also shows that there is a set of four subpatterns, each with three characters and five taxa, such that any three-state data that does not have a three-state perfect phylogeny must contain one of those four subpatterns. This generalizes the four-gametes condition which shows that when a pair of characters does not have a binary perfect phylogeny (is not compatible) the subpattern of all four binary combinations must appear in that pair of characters. The techniques are based on the chordal graph view of perfect phylogeny. - The Multi-State Perfect Phylogeny Problem with Missing and Removable Data; Solutions via Integer Linear Programming and Chordal Graph Theory

Dan Gusfield

In Proceedings of the 2009 RECOMB Conference, May 2009.

Expanded journal version (with complete proofs)

J. of Computational Biology, Vol. 17, no. 3, 2010, p. 383-399For software related to this paper see Software For powerpoint slides related to this paper see Recent Talks#### Papers on Phylogenetic Networks and ARGs with Recombination

- A Decomposition Theory for Phylogenetic Networks and Incompatible Characters

Dan Gusfield, Vikas Bansal, Vineet Bafna, Yun S. Song

Journal of Computational Biology Dec 2007, Vol. 14, No. 10: 1247-1272.

This is a large expansion of the RECOMB 2005 conference paper listed below. It solves the major open question from that conference paper and extends the earlier results. - Algorithms to Distinguish the Role of Gene-Conversion from Single-Crossover Recombination in the Derivation of SNP Sequences in Populations

Yun S. Song, Zhihong Ding, Dan Gusfield, Charles H. Langley, Yufeng Wu

Journal of Computational Biology Dec 2007, Vol. 14, No. 10: 1273-1286

This is the extended journal version of the RECOMB 2006 conference paper listed below. - A new recombination lower bound and the minimum perfect phylogenetic forest problem

Y. Wu and D. Gusfield

LNCS, vol. 4598, Proceedings of the 13’th Annual International Conference on Combinatorics and Computing, 2007, p. 16-26

The Journal version is in Journal of Combinatorial Optimization, Vol 16, 2008 - Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants

Y. Wu and D. Gusfield

LNCS, vol. 4580, Proceedings of the 18’th Annual Symposium on Combinatorial Pattern Matching, 2007, p. 150-161 - An efficiently-computed lower bound on the number of recombinations in phylogenetic networks: Theory and empirical study

link to Science Direct for the final journal manuscript

D. Gusfield, D. Hickerson, S. Eddhu

Discrete Applied Mathematics, Special issue on Computational Biology, Vol. 155, pg. 806-830, 2007 preprint in pdf - Efficient Computation of Minimum Recombination with Genotypes (not Haplotypes) (pdf)

Y.Wu and D. Gusfield

In Proceedings of The CSB Conference, Stanford CA., August 2006.

The journal version appeared in J. Bioinformatics and Computational Biology, Vol. 5 No. 2(a), 2007, p. 181-200 - Algorithms to distinguish the role of gene-conversion from single-crossover recombination in the derivations of SNP sequences in populations (pdf)

Y.S. Song, Z. Ding, D. Gusfield, C. Langley, Y. Wu

In Proceedings of RECOMB 2006.

ps version

The Journal version is in Journal of Computational Biology, Vol. 14, 2007 - Efficient computation of close lower and upper bounds on the minimum number of recombinations in biological sequence evolution

Yun S. Song, Yufeng Wu and Dan Gusfield

Proceedings of ISMB 2005, published as a special issue of BioinformaticsFor the associated software

HapBound and SHRUB - On the Full-Decomposition Optimality Conjecture for Phylogenetic Networks (pdf) D. Gusfield

UCD Technical Report, Jan. 2005 - A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters (pdf)

click here for postscript version

D. Gusfield and V. Bansal

Proceedings of RECOMB 2005, May 2005 - Optimal, Efficient Reconstruction of Root-Unknown Phylogenetic Networks with Constrained and Structured Recombination (pdf)

click here for postscript version

Technical Report UCDavis CSE-2004-31

D. Gusfield

November 25, 2004,

Journal version appears in J. Computer and Systems Sciences, 2005 Special issue on Computational Biology.

Vol. 70 (2005) p. 381-398 - A fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters (pdf)

click here for postscript version

Technical Report UCDavis CSE-2004-32

D. Gusfield

October 14, 2004, - Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination (pdf)

Click here for Postscript

D. Gusfield, S. Eddhu, C. Langley

Journal of Bioinformatics and Computational Biology, Vol. 2 no. 1 (2004) p. 173-213

Special Issue on the 2003 IEEE CSB Bioinformatics Conference.This is a much expanded journal version of the conference paper listed next. - Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination (pdf)

Click here for Postscript

D. Gusfield, S. Eddhu, C. Langley

Proceedings of the 2003 IEEE CSB Bioinformatics Conference. This won the best-paper award.

Thank you Hewlett Packard for donating the prize money. - The Fine Structure of Galls in Phylogenetic Networks (pdf)

D. Gusfield, S. Eddhu, C. Langley

INFORMS J. of Computing Special Issue on Computational Biology (2004).#### Papers on Haplotype Inference

- Analytical and Algorithmic Methods for Haplotype Frequency Inference: What do they tell us?

S. Orzack, D. Gusfield, L. Subrahmanyan, L. Essioux, S. Lissarrague

in Bioinformatics Algorithms: Techniques and Applications, I. Mandoiu and A. Zelikovsky (eds.), Wiley 2008, p. 373-394 - Algorithms for Imperfect Phylogeny Haplotyping with a Single Homoplasy or Recombination Event (pdf),

Y. S. Song, Y. Wu and Gusfield

Proceedings of WABI 2005, October 2005, Lecture Notes in Bioinformatics

In this paper we look at extensions of the PPH problem (see below) to situations where the underlying haplotypes that form the genotypes do not evolve on a perfect phylogeny, but rather evolve either on a tree with one homoplasy event (a recurrent or back mutation) or on a network with one recombination event. Using one emprically-justified assumption, we present a polynomial time algorithm for the first problem (the case of one homoplasy event) and an exponential-time algorithm for the second problem. We have subsequently developed a polynomial-time algorithm for the second problem as well. - A Linear-Time Algorithm for Perfect Phylogeny Haplotyping (pdf)

Z. Ding, V. Filkov, D. Gusfield

This is the expanded, full journal version of the RECOMB conference paper below.

Journal of Computational Biology, Vol. 13, No. 2, pg. 522-553, 2006 - A Linear-Time Algorithm for Perfect Phylogeny Haplotyping (pdf)

Z. Ding, V. Filkov, D. Gusfield

Proceedings of RECOMB 2005, May 2005 click here for postscript version

This paper solves the open question of whether the Perfect Phylogeny Haplotyping (PPH) Problem can be solved in linear time. The method presented is simple enough for easy implementation, and our implementation is on the web. - Haplotype Inference (pdf)

D. Gusfield and S.H. Orzack

In CRC Handbook on Bioinformatics, 2006 (S. Aluru Editor), p. 18-1 18-25 - A Note on Efficient Computation of Haplotypes via Perfect Phylogeny

Vineet Bafna, Dan Gusfield, Sridhar Hannenhalli, Shibu Yooseph

Journal of Computational Biology, Vol. 11, no. 5, 2004, p. 858-866This paper shows that two prior methods for the PPH problem are unlikely to be implementable in linear time. It also shows that the problem of finding a PPH solution that minimizes the number of distinct haplotypes used (over all PPH solutions) is NP-hard. - An Overview of Combinatorial Methods for Haplotype Inference (pdf)

In Computational Methods for SNPs and Haplotype Inference, S. Istrail, M. W aterman, and A. Clark (eds). Lecture Notes in Computer Science, vol. 2983, Springer, p. 9-25, 2004. click here for postscript version

(2004) A survey of combinatorial algorithms and software

for haplotype inference developed at UC Davis.

D. Gusfield - Empirical Exploration of Perfect Phylogeny Haplotyping and Haplotypers (pdf) click here for postscript version

R.H. Chung and D. Gusfield

Proceedings of the 2003 Cocoon Conference July 2003

Link to the proceedings

This reports on the performance of three perfect phylogeny haplotyping programs, and on two interesting phenomena observed when solving the perfect phylogeny haplotyping problem on simulated data. - Haplotyping by Pure Parsimony (pdf)

click here for postscript version

Technical Report UCDavis CSE-2003-2

D. Gusfield

January 23, 2003,

Final version in The Proceedings of the 2003 Combinatorial Pattern Matching Conference, June 2003

Link to conference proceedings The Pure Parsimony problem for Haplotype Inference is to find the smallest set of binary strings that can generate an input set of genotypes. The Pure Parsimony problem is NP-hard, and no paper has previously shown how an optimal Pure-Parsimony solution can be computed efficiently for problem instances of the size of current biological interest. In this paper, we show how to formulate the Pure-Parsimony problem as an integer linear program; we explain how to improve the practicality of the integer programming formulation; and we present the results of extensive experimentation we have done to show the time and memory practicality of the method, and to compare its accuracy against solutions found by the widely used general haplotyping program PHASE. The results are that the Pure Parsimony problem can be solved efficiently in practice for a wide range of problem instances of current interest in biology. Both the time needed for a solution, and the accuracy of the solution, depend on the level of recombination in the input strings. The speed of the solution improves with increasing recombination, but the accuracy of the solution decreases with increasing recombination. - Haplotyping as Perfect Phylogeny: A direct approach (pdf).

Haplotyping as Perfect Phylogeny: A direct approach (postscript)

Technical Report UCDavis CSE-2002-21

V. Bafna, D. Gusfield, G. Lancia, S. Yooseph

July 17, 2002.

An augmented version has appeared in Journal of Computational Biology, Vol. 10 No. 3 (2003) p. 323-340.This develops a more direct, self-contained method (than the first paper in the area: “Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions”) determining whether unphased genotype data can be explained by haplotypes that fit a perfect phylogeny. It gives a more intuitive and algorithmicly simpler way to create the representation of all such solutions. An implementation of this method is available at:

DPPH . - Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions. (pdf)

Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions (postscript)

D. Gusfield

Appeared in RECOMB 2002, April 2002.

This paper develops an almost-linear-time algorithm to determine if unphased genotype data can be explained by haplotype pairs which fit a rooted perfect phylogeny. If there is such an explanation, then the algorithm, in linear additional time, determines if the solution is unique, and if not, produces a representation of all the solutions. The method is based on reducing the haplotype problem to a problem of graph realization. We use this reduction to graph realization, in the software package GPPH (previously called PPH). However, in GPPH, we use a somewhat slower way to solve the graph realization problem than is described in the paper.Errata for: Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions (pdf)

Errata for: Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions (postscript) - Perfect Phylogeny Haplotyper: Haplotype Inferral Using a Tree Model (pdf)

Perfect Phylogeny Haplotyper: Haplotype Inferral Using a Tree Model (Postscript)

R.H. Chung and D. Gusfield.

This gives a short description of the program PPH (now called GPPH), which implements the solution method discussed in the prior paper and Errata.

Bioinformatics, Vol. 19, No. 6, pp. 780-781, 2003. - Inference of Haplotypes in Samples of Diploid Populations: Complexity and Algorithms.(pdf)

Inference of Haplotypes in Samples of Diploid Populations: Complexity and Algorithms (postscript)

D. Gusfield

In Journal of Computational Biology, Vol. 8, No. 3, 2001 p. 305-323 - The absolute and relative accuracy of haplotype inferral methods and a consensus approach to haplotype inferral. S. H. Orzack, D. Gusfield, V.P. Stanton.

Abstract in 51’th Annual Meeting of the American Society of Human Genetics, Fall 2001.

The full, much expanded paper (in Genetics, listed next) details how to modify Clark’s well-known and widely used method of haplotype inferral to greatly boost its accuracy, making it nearly competitive in accuracy with PHASE, and vastly faster. This paper is almost unique in the literature in that it compares the accuracy of the methods to laboratory determined haplotype data – determined by one of the authors. It is the joint collaboration of a computer scientist, a geneticist and molecular biologists. - Analysis and Exploration of the Use of Rule-Based Algorithms and Consensus Methods for the Inferral of Haplotypes

Genetics, Vol. 165, 915-928, October 2003

Steven Hecht Orzack, Daniel Gusfield, Jeffrey Olson, Steven Nesbitt, Lakshman Subrahmanyanc, and Vincent P. Stanton, Jr.#### Other Recent Papers and Proceedings in Computational Biology and Bioinformatics

- Worst-Case and Practical Speedup for the RNA co-folding Problem using the 4-Russians Idea.

Yelena Frid and Dan Gusfield

Proceedings of WABI 2010, Springer LNCS Vol. 6293, 2010, p. 1-12. - Untangling Tanglegrams: Comparing Trees by Their Drawings

Venkatachalam, B; Apple, J; St John, K; Gusfield, D,

In Proceedings of BIOINFORMATICS RESEARCH AND APPLICATIONS, 5542: 88-99

MAY 13-16, 2009Journal Version: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 7, 2010, p. 588-597 - A simple, practical and complete O(n^3/ log n)-time Algorithm for RNA folding using the Four-Russians Speedup

Yelena Frid, D. GusfieldPresented at the WABI 2009 conference, September 13, 2009

Expanded and corrected Journal Version: in Algorithms for Molecular Biology 2010, 5:13 - Integer Programming Formulations and Computations Solving Phylogenetic and Population Genetic Problems with Missing or Genotypic Data (pdf)

D. Gusfield and Y. Frid and D. Brown

LNCS, vol. 4598, Proceedings of the 13’th Annual International Conference on Combinatorics and Computing, 2007, p. 51-64 - Linear-Time Algorithms for Finding and Representing all Tandem Repeats in a String (pdf)

D. Gusfield and J. Stoye

JCSS, vol. 69, 2004, p. 525-546

We show how to decorate a suffix tree for a string with the endpoints of all the distinct substrings of the string that are tandem repeats. The algorithm runs in linear-time as a function of the length of the string. Of course, such an algorithmic result would not be possible without the remarkable mathematical result, proven by A. Frankel and J. Simpson in 1998, that there can only be a linear number of distinct substrings that are tandem repeats. The actual bound is 2n, for a string of length n. After the suffix tree is decorated, almost all imaginable questions about the tandem repeats or tandem arrays (how many occurrences there are, where they occurr, their average length, the maximum and minimum lengths from given positions etc.) that were previously studied individually, can be answered efficiently using this decorated suffix tree. - RECOMB 2004, Proceedings of the Eighth Annual International Conference on Computational Molecular Biology

ACM Press

D. Gusfield (Program Chair), P. Bourne, S. Istrail, P. Pevzner, M. Waterman (editors) - Algorithms in Bioinformatics

Proceedings of the Second International Workshop on Algorithms in Bioinformatics, WABI 2002, Rome, Italy, September 2002

Springer Lecture Notes in Computer Science Vol. 2452, 554 pages

R. Guigo and D. Gusfield (editors) - String Barcoding: Uncovering Optimal Virus Signatures (pdf)

S. Rash and D. Gusfield, Appeared in RECOMB 2002, April 2002 - On the Complexity of Fundamental Computational Problems in Pedigree Analysis (pdf)

On the Complexity of Fundamental Computational Problems in Pedigree Analysis (postscript)

A. Piccolboni and Dan Gusfield

An improved version has appeared in Journal of Computational Biology, Vol 10, No. 5, 2003, p. 763-774. - Partition-distance: A problem and class of perfect graphs arising in clustering (pdf),

Partition-distance: A problem and class of perfect graphs arising in clustering (postscript),

Dan Gusfield

Information Processing Letters, Volume 82, Issue 3, 16 May 2002, Pages 159-164 - Simple and flexible detection of contiguous repeats using a suffix tree (pdf) ,

Simple and flexible detection of contiguous repeats using a suffix tree (postscript) ,

Jens Stoye and Dan Gusfield

Theoretical Computer Science, Volume 270, Issues 1-2, 6 January 2002, Pages 843-856#### Other Recent Papers

- The structure and complexity of sports elimination numbers (pdf)

The structure and complexity of sports elimination numbers (postscript)

Dan Gusfield and Chip Martel,

In Algorithmica (2002) 32, pages 73-86

#### Hard to Find Older Papers

- Efficient Algorithms for Inferring Evolutionary Trees (pdf)

Efficient Algorithms for Inferring Evolutionary Trees (ps)

D. GusfieldI get many requests for this paper from people who can’t find it in their libraries. It is my most cited journal publication. It was also my first submitted paper in computational biology, submitted around 1985, and rejected (I won’t reveal the journal). It was published in Networks, Vol. 21 (1991) p. 19-28. Networks is no longer a major journal (judging by the libraries that don’t receive it, and I guess even by 1991, it was not widely received.). So I am posting the version of it that I found in my files (after much searching). It may not be absolutely identical to the published paper, but it has all the results. The paper has three results: How to find a perfect phylogeny (if there is one) in linear time; a lower bound proof that every entry in the data has to be examined; how to solve the tree compatibility problem in linear time. The first result is also obtained in a somewhat nicer way in my book Algorithms on Strings, Trees and Sequences. - A little knowledge goes a long way: Faster detection of compromised data in 2-D tables(pdf)

D. GusfieldThis paper is buried in the Proceedings of the 1990 Oakland conference on Research in Security and Privacy. I don’t think anyone knows it is there – I should have published it in a journal.The paper considers the following problem Statistical Data Security: In a 2-D table of non-negative numbers, say n by n, where all the row totals and column totals are known, but only some of the n^2 cell values are known, one wants to determine, for each cell whose value is not known, what is the maximum possible value that the cell could take on. That is, if cell (i,j) has an unknown value, what is the maximum value for cell (i,j) so that there is a setting of all the other unknown (non-negative) values that makes all the rows and columns sum to their correct, fixed values? In this paper I show that no matter how many cell values are unknown, one needs to solve this problem only 2n-1 times. More specifically, I give an algorithm that selects at most 2n-1 of the cells, so that after the maximum value is determined for each of those 2n-1 cells, the maximum value can be obtained for each of the remaining unknown values by doing a single lookup in a table built during the solving of the first 2n-1 problems. So, the maximum value can be determined in constant time for each of the remaining cells. The total time for the algorithm is the time needed for 4n-2 network flow computations in a directed graph of 2n nodes. - New Uses for Uniform Lifted Alignments (pdf)

D. Gusfield and L. Wang.

In Mathematical Support for Molecular Biology.

DIMACS Series on Discrete Math and Theoretical Computer Science, VOL. 47, p. 33-52 (1999)This paper is obscure due to a terrible choice of titles. It should have been called something like:How to Compute Evolutionary History Really Badly, and Why I Want to.The paper uses approximation algorithms in a way that is backwards from what they were designed for, in order to establish bounds on the accuracy of certain computations, rather than trying to find good solutions. The key to doing this is to make an approximation algorithm as inaccurate as possible, while still producing a solution that falls within the worst-case approximation bound. One reviewer of this paper wrote “This is the dumbest idea I have heard in a long time”. We use this approach to try to show that a particular tree alignment of RNA sequences that David Sankoff and Robert Cedergren constructed in 1983 is close to the optimal alignment (under a given objective function). A simple analysis showed that the cost of their alignment is no more than 62% larger than the optimal. In this paper we show that its cost is no more than 16.57% larger than the optimal. I think this idea of using an approximation algorithm backwards can be applied in other problems, but I have never seen the idea picked up by anyone else.

- Efficient Methods for Multiple Sequence Alignment with Guaranteed Error Bounds

Bulletin of Mathematical Biology, Vol. 55, p. 141-154, 1993 A pre-publication version with all the main results, but a bit more raw than the published journal version. This was the first use of a bounded approximation algorithm in computational biology. The particular multiple alignment method developed in the paper was only for the purpose of being able to prove a guaranteed bound on the error, and hence the result in the paper is mainly theoretical. However, it has been reported that the actual alignments are useful in some applications.A historical trivia note: This is the first (oldest) paper PubMed indexed using the search term Computational Biology. PubMed query

- Parametric and Inverse Parametric Sequence Alignment with XPARAL

D. Gusfield and P. StellingMethods in Enzymology, Vol. 266, 1996

## A full list of publications as of April 2012

Return to Gusfield Homepage

April, 2012

This research was partially supported by NSF grants ITR-9723346, SEI-BIO 0513910, CCF-0515378, IIS-0803564, and CCF-1017580.