BASARD.exe 

1. How to compile the program

The program is written in C++ and can be compiled with g++. For example, the following line compiles BASARD under Cygwin on Windows XP; the resulting BASARD.exe runs on Windows XP:

g++ -o BASARD.exe BASARD.cpp


2. How to use the program

The format of the command line is:

	BASARD InputFilePath OutputFilePath SequenceType SequenceNumber PatternWidth MaximumCopyNumber MaximumGapLength Epsilon_1 Epsilon_2 IterationNumber RunningMode

--Detailed explanation of the arguments:
*InputFilePath: The full path, including the file name, of the file containing the input sequences. This file should contain multiple nucleotide or amino acid sequences in a special FASTA format: each ID line starts with '>', and each sequence is on a single line. Example data sets can be found in the input_example folder. All letters in the input sequences should be uppercase.

*OutputFilePath: The full path, including the file name, of the .txt file to which the program writes its output. An example can be found in the output_example folder.

*SequenceType: the type of the input sequences: '0' for nucleotide sequences and '1' for amino acid sequences.

*SequenceNumber: the number of sequences in the input file.

*PatternWidth: the width of the target motif pattern.

*MaximumCopyNumber: the maximum copy number for each repeat segment.

*MaximumGapLength: the maximum gap length between two repeat units.

*Epsilon_1, Epsilon_2: the parameters of the prior distribution of repeat segment structures. Details can be found in the article.

*IterationNumber: The number of iterations that the MCMC chain will run for.

*RunningMode: There are two modes, fast and complete. In the complete mode, the algorithm is exactly the same as the one described in the article. In the fast mode, the algorithm does not update the repeat segment location parameters in every iteration; instead, these parameters are updated once every 10 iterations.
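
As an illustration of the argument order described above, the following C++ sketch reads the eleven values into a struct. It is not taken from the BASARD source: the names BasardArgs and parseBasardArgs, the field names, and the range check on SequenceType are assumptions made for this example only.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the eleven positional arguments. The fields
// follow the order given in this README, not the actual BASARD source code.
struct BasardArgs {
    std::string inputFilePath;
    std::string outputFilePath;
    int sequenceType;        // 0 = nucleotide, 1 = amino acid
    int sequenceNumber;
    int patternWidth;
    int maximumCopyNumber;
    int maximumGapLength;
    double epsilon1;
    double epsilon2;
    int iterationNumber;
    int runningMode;         // stored as given; not interpreted here
};

// Parses the arguments that follow the program name.
// Throws std::invalid_argument if the count is not 11, if SequenceType is
// neither 0 nor 1, or if a numeric field does not start with a number
// (the latter is raised by std::stoi / std::stod).
BasardArgs parseBasardArgs(const std::vector<std::string>& args) {
    if (args.size() != 11) {
        throw std::invalid_argument("expected exactly 11 arguments");
    }
    BasardArgs a;
    a.inputFilePath     = args[0];
    a.outputFilePath    = args[1];
    a.sequenceType      = std::stoi(args[2]);
    a.sequenceNumber    = std::stoi(args[3]);
    a.patternWidth      = std::stoi(args[4]);
    a.maximumCopyNumber = std::stoi(args[5]);
    a.maximumGapLength  = std::stoi(args[6]);
    a.epsilon1          = std::stod(args[7]);
    a.epsilon2          = std::stod(args[8]);
    a.iterationNumber   = std::stoi(args[9]);
    a.runningMode       = std::stoi(args[10]);
    if (a.sequenceType != 0 && a.sequenceType != 1) {
        throw std::invalid_argument("SequenceType must be 0 or 1");
    }
    return a;
}
```

From a main function, the vector could be built as std::vector<std::string>(argv + 1, argv + argc) before calling parseBasardArgs.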

--Example
For example, the following command lines run one of the synthetic data sets and the two real data sets described in the article (users can also refer to Table 6 in the supplementary material of the article):


BASARD input_example\synthetic_data_set_12M-L_2.txt output_example\report_synthetic_data_set_12M-L_2.txt 0 6 12 15 2 0.25 0.5 3000 1

BASARD.exe input_example\real_data_set_nucleotide.txt output_example\report_real_data_set_nucleotide.txt 0 24 18 20 6 1.0 1.0 3000 0

BASARD.exe input_example\real_data_set_amino_acid.txt output_example\report_real_data_set_amino_acid.txt 1 24 6 20 2 0.1 1.0 3000 0


--Evaluation
When using the provided synthetic data sets, users can compare the estimated parameters with the actual parameters given in two files in the Data_set folder: actual_motif_matrix_for_all_synthetic_data_sets.txt and actual_locations_and_structures_of_repeat_segments_for_all_synthetic_data_sets.txt.
