CONCISE REPORT ON THE ALGORITHMS IN SIM 970623 INTRODUCTION The general outline of the similarity checker is as follows: 1. the files are read in (pass 1) 2. a forward-reference table is prepared 3. the set of interesting runs is determined 4. the line numbers of the runs are determined (pass 2) 5. the contents of the runs are printed in order (pass 3) To keep the memory requirements (relatively) small, the exact positions of the tokens are not recorded. This necessitates pass 2. See, however, the pertinent chapter. READING THE FILES Each file is tokenized using an lex-generated scanner appropriate for the input. Each token fits in one byte, possibly using all 8 bits. The tokens are stored in the array TokenArray[], which is extended by reallocation if it overflows. See tokenarray.c. Also, to optimize away pass 2, an attempt is made to remember the token positions of all beginnings of lines. The token-positions at BOL are stored in the array nl_buff[], which is also extended by reallocation, if needed. If the attempt fails due to lack of memory, nl_buff[] is abandoned, and pass2 will read the files instead. PREPARING THE FORWARD-REFERENCE TABLE Text is compared by comparing every substring to all substrings to the right of it; this process is in essence quadratic. However, only substrings of length at least 'MinRunSize' are of interest, which gives us the possibility to speed up this process by using a hash table. Once the entire text has been read in, a forward-reference table forward_references[] is made (see hash.c). For every position in the text, we construct an index which gives the next position in the text where a run of MinRunSize tokens starts that has the same hash code. If there is no such run, the index is 0. To fill in this array, we use a hash table last_index[], such that last_index[i] is the index of the latest token with hash_code i, or 0 if there is none. If at a given position p, we find that the text ahead of us has hash code i, last_index[i] tells us which position in forward_references[] will have to be updated to p. See MakeForwardReferences(). For long text sequences (say hundreds of thousands of tokens), the hashing is not really efficient any more since too many spurious matches occur. Therefore, the forward reference table is scanned a second time, eliminating from any chain all references to runs that do not start with and end in the same token (actually this is a second hash code). For the UNIX manuals this reduced the number of matches from 91.9% to 1.9% (of which 0.06% was genuine). DETERMINING THE SET OF INTERESTING RUNS The overall structure of the routine Compare() (see compare.c) is: for all new files for all texts it must be compared to for all positions in the new file for all positions in the text for ever increasing sizes try to match and keep the best If for a given position in the new file a good run (i.e. on of at least minimum length) has been found, the run is registered using a call of add_run(), the run is skipped in the new file and searching continues at the position after it. This prevents duplicate reports of runs. Add_run() allocates a struct run for the run (see sim.h) which contains two struct chunks and a quality description. It fills in the two chunks with the pertinent info, one for the first file and one for the second (which may be the same, if the run relates two chunks in the same file). The run is then entered into the arbitrary-in-sorted-out store AISO (see aiso.spc and aiso.bdy, a genuine generic abstract data type in C!), in which it is inserted according to its quality. Both positions (struct position) in both chunks in the run (so four in total) are each entered in a linked list starting at the tx_pos field in the struct text of the appropriate file. When this is finished, the forward reference table can be deleted. So the final results of this phase are visible both through the tx_pos fields and through the aiso interface. DETERMINING THE EXACT POSITION OF EACH RUN (PASS 2) The purpose of this pass is to find for each chunk, which up to now is known by token position only, its starting and ending line number (which cannot be easily derived from the token position). For each file that has a non-zero tx_pos field, ie. that has some interesting chunks, the positions in the tx_pos list are sorted on ascending line number (they have been found in essentially arbitrary order) by sort_pos() in pass2.c. Next we scan the pos list and the file in parallel, updating the info in a position when we meet it. A position carries an indication whether it is a starting or an ending position, since slightly differing calculations have to be done in each case. Actually, if the nl_buff[] data structure still exists, the file is not accessed at all and the data from nl_buff[] is used instead. This is done transparently in buff.c. PRINTING THE CONTENTS OF THE RUNS (PASS 3) Since each struct run has now been completely filled in, this is simple; the hard work is calculating the page layout. Pass3() accesses the aiso store and retrieves from it the runs in descending order of importance. Show_run() opens both files, positions them using the line numbers and prints the runs. ================================================================ CODE EXCERPT OF THE SOFTWARE SIMILARITY TESTER SIM (980222) sim: get command line options check the options init language, to precompute tables pass1, read the files # there is an array TokenArray[] that holds all input tokens make forward reference table # there is an array forward_references[], with one entry for # each token in the input; forward_references[i] gives the # token number where a token sequence starts with the same # hash value as the one starting at i compare various files to find runs delete forward reference table pass2, find newline positions of found similarities pass3, print the similarities pass1, read the files: for each file divide the text into tokens store all tokens except newlines in TokenArray and try to keep a record of the newline positions make forward reference table: # there are two independent hash functions, hash1() and hash2(). # hash1(i) gives the hash value of the token sequence starting at i # likewise for hash2(i) set up the forward references using the last_index table: # there is an array last_index[], with one entry for each # possible hash value; last_index[i] gives the position in # forward_references[] at which i was most recently # encountered as a hash value for each file for all positions in file except the last MinRunSize set forward_references[] and update last_index[] use hash2() to clean out matches: for all tokens find first token in chain with same hash2 code short-circuit forward reference to it compare: for all new files for all texts it must be compared to for all positions in the new file for all positions in the text for ever increasing sizes try to match and keep the best try to match and keep the best: # using forward_references[], we find a list of positions in # which a matching token sequence will start; # scanning this list, we measure the maximum length of the # match and add the longest match to the run collection pass2, find positions of found runs: for all files: sort the positions in the runs # we scan the pos list and the file in parallel for all positions inside this file if it matches a token position in a run record line number pass3, print the similarities: for all runs # a run consists of two chunks open the files that hold the chunks and position them at the beginning of the chunk display the chunks