k-link README
This tarball contains the Expressed Sequence Tag (EST) reference clusters used in the paper:
k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage. Bioinformatics 2009 25(18):2302-2308; doi:10.1093/bioinformatics/btp410.
The ESTs were downloaded from the NCBI public repository on November 10, 2008. Please read the paper for a full description of how these reference clusters were created.
ESTs from three species (Caenorhabditis elegans, Oryza sativa ssp. indica, and Sorghum bicolor) were used to benchmark the k-link clustering algorithm.
The prefix for each file indicates the source organism.
The *.fasta files contain the RBR/DUST-masked sequences used in the reference clusterings. Each sequence has been relabelled with a unique integer (called the internal id). The EST metadata can be found in the corresponding *.est_data.sql file if required.
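As an illustration of working with the relabelled sequences, here is a minimal Python sketch that reads FASTA-formatted lines into (internal id, sequence) pairs. It assumes that the first token after ">" on each header line is the integer internal id; that layout is an assumption, so check it against the files. The function name read_fasta and the example records are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (internal_id, sequence) pairs from FASTA-formatted lines."""
    est_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if est_id is not None:
                yield est_id, "".join(chunks)
            # Assumption: the first token after '>' is the integer internal id.
            est_id = int(line[1:].split()[0])
            chunks = []
        elif est_id is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if est_id is not None:
        yield est_id, "".join(chunks)


# Made-up records, for illustration only (not taken from the tarball).
example = [">0\n", "ACGTNNNN\n", "ACGT\n", ">1\n", "TTGA\n"]
records = dict(read_fasta(example))
```

To read a real file, pass an open file handle for one of the *.fasta files in place of the example list.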
The *.wcd files contain the reference clusters in WCD format (WCD is an EST clustering algorithm – see Hazelhurst et al. 2008). k-link requires input files to be in WCD format.
The *.est_data.sql file contains a tab-separated description of each EST. Each line has the format:
<EST internal id> <GI> <GB> <GB revision> <Long description>
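The five-field layout above could be parsed along the following lines. This is a sketch only: it assumes one record per line and no header row, and it keeps every field except the internal id as text. The names EstRecord and parse_est_data and the sample line are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class EstRecord(NamedTuple):
    internal_id: int
    gi: str
    gb: str
    gb_revision: str
    description: str


def parse_est_data(lines: Iterable[str]) -> Iterator[EstRecord]:
    """Parse tab-separated EST description lines into EstRecord tuples."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on at most four tabs so that any tab inside the long
        # description stays part of the description.
        fields = line.split("\t", 4)
        if len(fields) != 5:
            raise ValueError(f"line {number}: expected 5 fields, got {len(fields)}")
        yield EstRecord(int(fields[0]), fields[1], fields[2], fields[3], fields[4])


# Made-up line, for illustration only (not taken from the tarball).
sample = ["7\t123456\tAB000001\t1\tmade-up description of an EST\n"]
est_records = list(parse_est_data(sample))
```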
The *.cluster_data.sql file contains the tab-separated cluster membership data. Each line has the format:
<Cluster ID> <EST internal id>
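The membership lines can be grouped into clusters with a few lines of Python. The sketch below assumes one (cluster id, EST internal id) pair per line and no header row. Cluster ids are kept as strings because their exact form is not specified here. The function name load_clusters and the sample lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_clusters(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Group EST internal ids by cluster id."""
    clusters: Dict[str, List[int]] = defaultdict(list)
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 fields, got {len(fields)}")
        cluster_id, est_id = fields
        clusters[cluster_id].append(int(est_id))
    return dict(clusters)


# Made-up membership lines, for illustration only.
sample_membership = ["1\t10\n", "1\t11\n", "2\t12\n"]
clusters = load_clusters(sample_membership)
```

The resulting dictionary maps each cluster id to the list of EST internal ids it contains, which can then be matched against the ids in the *.fasta and *.est_data.sql files.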