k-link README

This tarball contains the Expressed Sequence Tag (EST) reference clusters used in the paper:

k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage. Bioinformatics 2009 25(18):2302-2308; doi:10.1093/bioinformatics/btp410.

The ESTs were downloaded from the NCBI public repository on November 10, 2008. Please read the paper for a full description of how these reference clusters were created.

ESTs from three species (Caenorhabditis elegans, Oryza sativa ssp. indica, and Sorghum bicolor) were used to benchmark the k-link clustering algorithm.

The prefix for each file indicates the source organism.

The *.fasta files contain the RBR/DUST-masked sequences used in the reference clusterings. Each sequence has been relabelled with a unique integer (its internal id). The corresponding EST metadata can be found in the *.est_data.sql file.
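
As an illustration, the masked sequences could be loaded into a dictionary keyed by internal id with a short Python sketch like the one below. It assumes that each FASTA header line is ">" followed by the integer internal id (anything after the first whitespace is ignored); check this against the actual files before relying on it. The function name read_fasta and the sample sequences are made up for the example.

```python
import io


def read_fasta(handle):
    """Yield (internal_id, sequence) pairs from an open FASTA file."""
    current_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                yield current_id, "".join(chunks)
            # Assumption: the first token after '>' is the integer internal id.
            current_id = int(line[1:].split()[0])
            chunks = []
        else:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if current_id is not None:
        yield current_id, "".join(chunks)


# Invented sample data, not taken from the tarball.
sample = io.StringIO(">0\nACGTNNNN\nACGT\n>1\nTTGACC\n")
sequences = dict(read_fasta(sample))
```

To use it on real data, pass a file handle opened on one of the *.fasta files instead of the in-memory sample.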

The *.wcd files contain the reference clusters in WCD format (WCD is an EST clustering algorithm; see Hazelhurst et al. 2008). k-link requires its input files to be in WCD format.

The *.est_data.sql files contain the tab-separated description of each EST, in the format:

<EST internal id>   <GI>    <GB>    <GB revision>   <Long description>
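
A minimal Python sketch for reading rows in this layout follows. It assumes that every non-empty line is a plain five-field, tab-separated row with no header line and no SQL statements around it, which should be verified against the actual files. The names EstRecord and read_est_data, and the sample row, are hypothetical.

```python
import io
from typing import Iterator, NamedTuple


class EstRecord(NamedTuple):
    internal_id: int   # <EST internal id>
    gi: str            # <GI>
    gb: str            # <GB>
    gb_revision: str   # <GB revision>
    description: str   # <Long description>


def read_est_data(handle) -> Iterator[EstRecord]:
    """Yield one EstRecord per non-empty tab-separated line."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split at most four times so the description keeps any extra tabs.
        fields = line.split("\t", 4)
        if len(fields) != 5:
            raise ValueError(f"line {line_number}: expected 5 fields, got {len(fields)}")
        yield EstRecord(int(fields[0]), fields[1], fields[2], fields[3], fields[4])


# Invented sample row, not taken from the tarball.
sample = io.StringIO("7\t12345\tAB000001\t1\tExample EST description\n")
records = list(read_est_data(sample))
```

Keeping the GI and GB fields as strings avoids guessing at their exact form; only the internal id is converted to an integer, since the README states it is one.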

The *.cluster_data.sql files contain the tab-separated cluster membership data, in the format:

<Cluster ID>        <EST internal id>
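
The membership rows can be grouped into clusters with a sketch along these lines. As above, it assumes plain two-field, tab-separated rows without a header or surrounding SQL. The cluster ID is kept as a string because its exact form is not described here, and the function name read_cluster_data and the sample rows are invented for the example.

```python
import io
from collections import defaultdict


def read_cluster_data(handle):
    """Return a dict mapping each cluster ID to the list of its EST internal ids."""
    clusters = defaultdict(list)
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 fields, got {len(fields)}")
        cluster_id, est_id = fields
        # Internal ids are integers; cluster IDs are left as they appear in the file.
        clusters[cluster_id].append(int(est_id))
    return dict(clusters)


# Invented sample rows, not taken from the tarball.
sample = io.StringIO("0\t10\n0\t11\n1\t12\n")
clusters = read_cluster_data(sample)
```

The resulting internal ids can then be looked up in the matching *.fasta and *.est_data.sql files for the same organism.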