This tool takes two texts as input, scans through them, and reports where they match. The match threshold is set as a number of consecutive shared characters (character n-grams). For Chinese, 10 is a good threshold; for English and other alphabetic languages, a higher threshold is recommended. The tool will sometimes miss a match when single-character mutations or shifts (often the result of typos) interrupt it. These cases can be identified using the "distance from last match" parameter. Here are several examples to help you interpret your results.
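As a rough illustration of how this kind of matching could work, here is a minimal Python sketch. It is not the tool's actual code. The function names `find_matches` and `gaps_between_matches` are hypothetical, and reading "distance from last match" as the number of characters between the end of one match and the start of the next is an assumption.

```python
def find_matches(text_a, text_b, n=10):
    """Return (start_a, start_b, length) for shared runs of at least n characters.

    Hypothetical sketch: n plays the role of the match threshold.
    """
    # Index every character n-gram of text_a by its start positions.
    index = {}
    for i in range(len(text_a) - n + 1):
        index.setdefault(text_a[i:i + n], []).append(i)

    matches = []
    j = 0
    while j <= len(text_b) - n:
        starts = index.get(text_b[j:j + n])
        if not starts:
            j += 1
            continue
        # Extend each candidate past n characters and keep the longest run.
        best_i, best_len = starts[0], n
        for i in starts:
            length = n
            while (i + length < len(text_a) and j + length < len(text_b)
                   and text_a[i + length] == text_b[j + length]):
                length += 1
            if length > best_len:
                best_i, best_len = i, length
        matches.append((best_i, j, best_len))
        j += best_len  # resume scanning after the matched run
    return matches


def gaps_between_matches(matches):
    """Assumed reading of "distance from last match": characters of text_b
    between the end of one match and the start of the next."""
    return [nxt[1] - (cur[1] + cur[2]) for cur, nxt in zip(matches, matches[1:])]
```

Under these assumptions, comparing "the quick brown fox jumps over the lazy dog" with a copy that has the typo "fax", using n=5, gives two matches separated by a gap of 1. A very small gap like this suggests one longer match interrupted by a single-character mutation.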
This tool will let users upload texts and will automatically sanitize them according to the user's preferences. It will output chunks of text normalized to a user-configurable length. For example, if the user specifies 5,000-character texts and provides a 20,000-character document, the tool will return 4 documents. How the program handles any leftover characters will also be configurable. This program will ease normalization for certain types of corpus linguistic analysis (some of which we discussed in the Digital China Lab).
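The chunking behavior could look something like the Python sketch below. The tool is still in development, so everything here is an assumption made for illustration: the function name `chunk_text`, the `remainder` option with its three modes ("keep", "drop", "merge"), and the choice to measure length in characters.

```python
def chunk_text(text, size=5000, remainder="keep"):
    """Split text into chunks of `size` characters.

    Hypothetical remainder modes for a final chunk shorter than `size`:
      "keep"  - return it as its own short chunk
      "drop"  - discard it
      "merge" - append it to the previous chunk
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if remainder not in ("keep", "drop", "merge"):
        raise ValueError("remainder must be 'keep', 'drop', or 'merge'")

    chunks = [text[i:i + size] for i in range(0, len(text), size)]

    # Only a trailing chunk shorter than `size` needs special handling.
    if chunks and len(chunks[-1]) < size:
        if remainder == "drop":
            chunks.pop()
        elif remainder == "merge" and len(chunks) > 1:
            tail = chunks.pop()
            chunks[-1] += tail
    return chunks
```

With these assumptions, a 20,000-character document and `size=5000` yields 4 chunks. A 21,000-character document yields 5 chunks with "keep" (the last one 1,000 characters long), 4 with "drop", and 4 with "merge" (the last one 6,000 characters long).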
Please see the projects page for more information on this tool, which is currently in alpha development.