SimScan QuickStart Guide

Overview

SimScan (Similarity Scanner) is a utility for finding duplicated or similar pieces of code in large amounts of Java sources. The utility is distributed as a plugin for the IntelliJ IDEA, Borland JBuilder and Eclipse Java IDEs and as a stand-alone command-line tool.

To find similarities in code, SimScan relies on structural comparison of code, which gives it a distinct advantage over simpler alternatives. For example, renaming a variable or changing the code formatting will not affect the results. Furthermore, it can find duplicate code fragments even when there are inserted or deleted lines of code or other differences, as long as these differences are small relative to the size of the match. Not all matches are suitable for refactoring, so you have the option to tune the SimScan search for stricter matches.
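As an illustration of the kind of difference that structural comparison ignores, the two methods below differ only in identifier names and formatting. The class and method names are made up for this example, and whether a given pair is actually reported depends on the search settings.

```java
// Hypothetical example: two methods with the same structure.
class StructuralDuplicateExample {

    // Original version.
    static int sumOfSquares(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i] * values[i];
        }
        return total;
    }

    // Same structure: variables renamed and braces moved to their own lines.
    static int squareSum(int[] data)
    {
        int acc = 0;
        for (int k = 0; k < data.length; k++)
        {
            acc += data[k] * data[k];
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] sample = {1, 2, 3};
        // Both calls compute the same value for the same input.
        System.out.println(sumOfSquares(sample));
        System.out.println(squareSum(sample));
    }
}
```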

SimScan Release 1 Preview will expire on 31 August 2003, by which time you will be able to download a newer version. This version is free for non-commercial use AND for working on open source projects (please see license.txt). You can also register it so that it does not expire and is not restricted in its usage. To register, please visit http://blue-edge.bg/simscan.

SimScan is based on ANTLR (http://www.antlr.org), a tool for building parsers. We have run SimScan on JDK 1.3 and JDK 1.4, and we can parse JDK 1.3 and 1.4 compatible code.

For news and updates of SimScan, please visit http://blue-edge.bg/simscan.

Installing SimScan

A single jar acts as a plugin for IntelliJ IDEA, JBuilder, and as a stand-alone command line version. The plugin for Eclipse consists of a folder named "bg.blue-edge.simscan.plugin.eclipse_release_1".

It is often useful to increase the memory available to the JVM in which the plug-in is running (the -Xmx parameter of the JVM) to avoid OutOfMemoryError. Please read the Performance section.

Using the plugin for IntelliJ IDEA

The plugin for IDEA can be run using one of the following methods:

1. By selecting some files and/or folders in the project panel and running Tools -> Find Duplicated Code or pressing Ctrl-Alt-D.
2. By selecting some files and/or folders, right-clicking on any of them in the project panel and selecting Find Duplicated Code from the popup menu.
3. By right-clicking in the editor pane and selecting Find Duplicated Code (to search in the currently edited file).

Using the plugin for JBuilder

The plugin for JBuilder can be run using one of the following methods:

1. By opening a file in the editor pane and running Search -> Find Duplicated Code or pressing Ctrl-Alt-D (to search in the currently edited file).
2. By selecting some files and/or folders, right-clicking on any of them in the project view and selecting Find Duplicated Code from the popup menu.

Using the plugin for Eclipse

The plugin for Eclipse can be run using one of the following methods:

1. By selecting some files and/or folders in the package explorer and running Search -> Find Duplicated Code, or clicking the SimScan icon in the toolbar, or pressing Ctrl-Alt-D. These methods will work only if you have reset the current perspective, as explained in the installation notes.
2. By selecting some files and/or folders in the package explorer, right-clicking on any of them and selecting Find Duplicated Code from the popup menu.

Search Options

A dialog which lets you specify the options for the search appears when you run the tool.

Also search for duplicates with the rest of the project

When selected, the search performed matches code fragments from the current selection with all the other code. The intention is to allow the user to find duplicates in the code/module they are currently developing and the rest of the project, finding opportunities for reuse.

The files considered to be project files by the tool when this option is selected are:

Volume

The Volume setting defines the desired volume of the returned matches and acts as a filter: when Volume is higher, smaller matches will not be reported; when it is lower, both smaller and larger matches will be returned.

Similarity

The Similarity setting defines how strict the search for duplicates should be. When this setting is lower, more "fuzzy" matches will be reported, i.e. pieces of code which are less similar to each other (have greater differences). When it is higher, the search will be much stricter, returning matches with fewer differences.

Speed / Quality

With the Speed/Quality setting you may define what is more important to you - higher search speed, or higher quality of the results. The tool achieves higher speed by skipping statistically less significant areas of the search space. With the setting tuned for higher speed, the tool might miss some matches, though probably less important ones. A higher speed setting should be good enough for general refactoring purposes and running on a regular basis.

Sort-by

The Sort-by setting defines a criterion which will be used to sort the results. The sort criterion can be changed later from the result view.

Save report

Check this option to save the results from the search in a human-readable format to a folder. The output is the same as if the tool was used from the command line. If the selected folder exists and there are older report files in it, they will be deleted. If the selected folder does not exist, it will be created.

The initial values for all settings will be the same as the settings used for the last search.

Search Progress

During the search, the current progress is reported to the user. The indication does not always run smoothly, especially for the fast searches. Usually the first few percent take disproportionately long.

The progress is displayed in a modal dialog with two buttons. Cancel will stop the search. If the option to make a report for the results was selected, all preliminary results will be saved to the folder. Use Background to continue the search in background mode. This will hide the modal dialog and will decrease the priority of the search thread, giving you the ability to continue with your other work. When in background mode, the progress will be displayed in the duplicated code view. You can still cancel the search by pressing the Close View button in Idea and JBuilder or the Stop button in Eclipse.

Search Results

The results of the search are displayed in the form of a tree in a view titled "Duplicated code". The results will be grouped in groups of similarity. Each group represents a connected component of similar items. Not all items in the group are necessarily similar to each other (see the notes on interpreting the results). The groups of similarity will be sorted according to the defined sort criterion. You can re-sort the results by clicking on the sort-by button in the results panel and selecting a different sort criterion.

Each match in a similarity group refers to some piece of code and is represented as a node in the result tree (whose parent is the node of the group to which this match belongs). By clicking on any of the matches, the code that it refers to will be opened in the editor.

Each match in the group can itself be expanded to show all direct matches to this specific match (because a group is just a connected component, not all other matches in the group are necessarily direct matches to it). The expanded view enables you to look at the specific similarities between two matches. If you select one of the direct matches (let's name it B) of some match (we'll call it A), the tool will highlight the lines in A which are similar to B (in Eclipse, the lines will just be marked with a violet marker to the left of the source code). You may want to press the "Go To Pair" button: it will position you on the node in the tree which represents match A as a direct match to match B. Thus you will be able to see the lines in match B that are similar to A. By clicking rapidly several times on the "Go To Pair" button you will be able to grasp the similarities between the two matches.

If you edit some of the files included in the search after the search has started (e.g. if the search was run in background mode), this might make the results for these files inaccurate (by shifting the line numbers). That is why, if a file has been modified after the search started, all duplicates for this file are displayed in the result view with a red warning icon instead of the normal blue one.

Sometimes there might be groups with just one single match in them. We call such matches "self-intersecting". See the notes below for a brief description of this term. The two similar parts in a self-intersecting match can also be viewed using the expanded view and the "Go To Pair" button.

There are several more buttons in the results panel - to expand and to collapse the whole tree and to rerun the search.

The rerun button is an easy way to quickly run the search over the same files after you have made some corrections to them, or if you just want to slightly change some of the search options.

Since a single search might take quite a lot of time, especially if performed on a large code base, search results are kept between restarts of the IDE. So if you need to restart the IDE and want to keep the results from the previous search, you need not worry - the tool will take care of this.

Search Report

If you are running from the command-line or if you have checked the option to save reports from the search, after the search finishes a folder will be created with a detailed report on the search results and the parameters of the search.

The output consists of the following files and folders:

Notes on interpreting the results

The idea of similarity groups is the following: there might be three pieces of code, each of which is similar to the other two. If the tool just reported them as three different matches, the user might refactor and change the first two before he or she sees there is another piece of code, similar to these two, which could also have been refactored in the same fashion. That is why we prefer to report such matches as a single group. If the three pieces of code, let's name them 1, 2 and 3, are such that 1 is similar to 2 and 2 to 3, but 1 and 3 are not similar, then the user has to decide if he or she would refactor 1 and 2, or 2 and 3, or nothing at all. So a group is just a connected component of the "similarity graph".
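The notion of a connected component can be sketched in a few lines of Java. This is only an illustration of the concept, not SimScan's actual implementation: the class name, the integer fragment ids and the union-find approach are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: grouping fragments into connected components
// of a similarity graph using union-find.
class SimilarityGroups {

    // Returns the representative of x's component, compressing paths on the way.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Each pair {a, b} means "fragment a is similar to fragment b".
    static List<List<Integer>> group(int fragmentCount, int[][] pairs) {
        int[] parent = new int[fragmentCount];
        for (int i = 0; i < fragmentCount; i++) {
            parent[i] = i;
        }
        for (int[] pair : pairs) {
            parent[find(parent, pair[0])] = find(parent, pair[1]);
        }
        // Collect the members of each component in order of first appearance.
        List<List<Integer>> groups = new ArrayList<>();
        int[] groupIndex = new int[fragmentCount];
        Arrays.fill(groupIndex, -1);
        for (int i = 0; i < fragmentCount; i++) {
            int root = find(parent, i);
            if (groupIndex[root] < 0) {
                groupIndex[root] = groups.size();
                groups.add(new ArrayList<>());
            }
            groups.get(groupIndex[root]).add(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // 0~1 and 1~2 are similar (0 and 2 are not directly similar); 3~4 are similar.
        int[][] pairs = {{0, 1}, {1, 2}, {3, 4}};
        System.out.println(group(5, pairs)); // prints [[0, 1, 2], [3, 4]]
    }
}
```

Fragments 0 and 2 end up in the same group even though they are not directly similar, which is exactly the situation described above.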

Sometimes there might be groups with just a single item in them. This represents a self-intersecting match, i.e. a sequence of operators in which there are two subsequences of operators which are similar to one another. Such cases can be refactored by introducing a single method and calling it two times. For example:

1 int a = 0; 
2 int b = 0; 
3 a++; 
4 print(a); 
5 b++; 
6 print(b); 

In this example the sequence of lines 1, 3, 4 is a duplicate of the sequence 2, 5, 6. Such code will usually result in a similarity group with one single element - the whole code which contains the matching sequences.
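One possible refactoring of the example above is sketched here. The method name incrementAndPrint and the surrounding class are invented for illustration, and print is a stand-in for whatever the original print call does.

```java
// Sketch of the "introduce a single method and call it twice" refactoring.
class SelfIntersectingRefactoring {

    // Stand-in for the print(...) call used in the example.
    static void print(int value) {
        System.out.println(value);
    }

    // The duplicated sequence (initialize, increment, print) as one method.
    static int incrementAndPrint(int start) {
        int value = start;
        value++;
        print(value);
        return value;
    }

    public static void main(String[] args) {
        int a = incrementAndPrint(0); // replaces lines 1, 3 and 4
        int b = incrementAndPrint(0); // replaces lines 2, 5 and 6
    }
}
```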

Running from the command line

To run the utility use:

java bg.blue_edge.simscan.SimScan <directory or file name> [<directory or file name>|...]
        [-volume <0|1|2>] [-similarity <0|1|2>] [-speed <0|1|2|3|4>]
        [-output <output directory>]
        [-conf <properties file>] 

(simscan_all.jar must be on the classpath.)

or

java -jar simscan_all.jar <directory or file name> [<directory or file name>|...]
        [-volume <0|1|2>] [-similarity <0|1|2>] [-speed <0|1|2|3|4>]
        [-output <output directory>]
        [-conf <properties file>] 

The volume, speed and similarity options define scan parameters which affect the way similarity between fragments of code is estimated. These are high-level parameters used to easily set the low-level parameters that will actually be used for the scan. If any of these options is not specified, default values will be used. All low-level scan parameters will be reported to the user and written to the log.

The volume setting can be 0 - small, 1 - medium or 2 - large.

The similarity (strictness) setting can be 0 - less strict search, this setting will return the most fuzzy matches; 1 - medium strictness; or 2 - the strictest search.

The speed setting can be 0 - the slowest, exhaustive search, 1 - slow, 2 - normal, 3 - fast, and 4 - very fast.

Please read the Search Options section for more information on the volume, similarity and speed settings.

The output directory is the directory where the reports and the search results will be saved. If the -output option is not specified, the default output directory is "report". If the output directory already exists, a number will be automatically appended to resolve the conflict (nothing in the existing directory will be changed). The actual output directory will be reported to the user.

The remaining arguments should be directories containing Java files or individual Java files. All files in the directories and their subdirectories will be parsed and included in the scan.
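If the scan is to be launched from another Java program, the documented command line can be assembled as a list of arguments. The sketch below is only an illustration: the class SimScanCommand, the source path "src" and the assumption that simscan_all.jar sits in the working directory are hypothetical; only the option names and value ranges come from this guide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles the documented command line.
class SimScanCommand {

    // Rejects values outside the documented range 0..max.
    static void checkRange(String name, int value, int max) {
        if (value < 0 || value > max) {
            throw new IllegalArgumentException(name + " must be between 0 and " + max);
        }
    }

    static List<String> build(List<String> paths, int volume, int similarity,
                              int speed, String outputDir) {
        checkRange("volume", volume, 2);
        checkRange("similarity", similarity, 2);
        checkRange("speed", speed, 4);

        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add("simscan_all.jar"); // assumed to be in the working directory
        command.addAll(paths);          // directories or files to scan
        command.add("-volume");
        command.add(String.valueOf(volume));
        command.add("-similarity");
        command.add(String.valueOf(similarity));
        command.add("-speed");
        command.add(String.valueOf(speed));
        command.add("-output");
        command.add(outputDir);
        return command;
    }

    public static void main(String[] args) {
        // Medium volume, strictest similarity, normal speed, default output directory.
        List<String> command = build(Arrays.asList("src"), 1, 2, 2, "report");
        System.out.println(String.join(" ", command));
    }
}
```

Passing the resulting list to new ProcessBuilder(command).inheritIO().start() would then start the scan as a separate process.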

The output consists of a log file, a graph description file, a report file and a number of directories, each of which contains similar pieces of code. The log file duplicates the output to the console and holds information on the parameters and the duration of the search. Each of the numbered directories contains a single group of similar items. A group represents a connected component, so not all items in it are necessarily similar to each other (see the notes on interpreting the results). The report file gives the same information in another form. The graph file gives a full description of which pieces of code are similar to each other. Most of the time this information will not be necessary.

Performance

When searching on a larger code base, the search can take several hours, especially for code bases of more than several hundred KSLOC. The whole code base needs to be parsed and loaded into memory in the first phase of the search, so the tool needs a lot of memory to run on larger code bases.

All N nodes in the syntax tree need to be loaded into memory for the search algorithm. Assuming a rough estimate of 10 nodes per source line of code and about 50 bytes taken by each node, this gives 10*50*SLOC bytes for the whole tree. For a code base of 500 KSLOC, this makes 10*50*500 KB = 250 MB of allocated memory for the tree. This estimate is very rough, but it gives the idea that having enough memory is a necessity for the tool to run.
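The same back-of-the-envelope estimate, written out as a small Java sketch. The class and constant names are made up for the example; the figures of 10 nodes per line and 50 bytes per node are the rough values quoted above, not measured numbers.

```java
// Rough memory estimate for the syntax tree, using the figures from the text.
class MemoryEstimate {
    static final long NODES_PER_LINE = 10; // rough estimate
    static final long BYTES_PER_NODE = 50; // rough estimate

    // Estimated bytes needed to hold the syntax tree for the given number of lines.
    static long estimateTreeBytes(long sourceLines) {
        return NODES_PER_LINE * BYTES_PER_NODE * sourceLines;
    }

    public static void main(String[] args) {
        long bytes = estimateTreeBytes(500_000); // 500 KSLOC
        System.out.println(bytes / 1_000_000 + " MB"); // prints "250 MB"
    }
}
```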

So, if you intend to run SimScan on large code bases, you will most probably need to explicitly set the memory parameters for the JVM.

When running from the command line, please specify the -Xmx<size>m option with <size> at least 250, for example -Xmx250m (this allows the Java virtual machine to allocate up to 250 MB of heap when needed instead of throwing OutOfMemoryError earlier). To use the plugins on large code bases, you will need to change the JVM parameters with which your IDE starts. Please refer to the specific IDE's manual for instructions on how to do this.
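To check which heap limit a JVM actually ended up with, a small program can print the value reported by the standard Runtime.maxMemory() method. The class name HeapCheck is made up for this sketch.

```java
// Hypothetical helper: reports the maximum heap size of the running JVM.
class HeapCheck {

    // Runtime.maxMemory() returns the limit in bytes; convert it to megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as java -Xmx250m HeapCheck should print a value close to 250; the exact number may be somewhat lower depending on the JVM.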

Registration

After you receive a registration key, please open the SimScan dialog and click on Register / About, where you can enter the key.

Support, Bug Reports, Suggestions

Questions, bug reports, comments and suggestions should be sent to simscan@blue-edge.bg. We would be glad to hear from you!

Credits

Special thanks go to Aurelian Teglas, whose contribution to the plugin part was enormous. He helped a lot, not only with the development but also with many useful ideas and constructive criticism. Thank you, Rica!