This tool detects matching strings based on a configurable minimum n-gram length. It was designed with Chinese in mind, but should work for any language. The default minimum match length is 10 characters. This works well for Chinese, but a larger threshold will be necessary for English. The lower the threshold, the more noise. The higher the threshold, the more likely the tool will miss meaningful matches. Adjust as needed. This tool only finds exact matches, so it will not pick up paraphrasing. By default, capitalization is ignored. Punctuation and/or whitespace can optionally be ignored as well. The condense whitespace option collapses each run of consecutive whitespace into a single space. I recommend using texts that have already been sanitized.
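The options described above can be pictured with a minimal Python sketch. This is not the tool's own code: the function name `normalize` and its parameter names are hypothetical, and treating "punctuation" as the Unicode punctuation categories is an assumption made so that Chinese punctuation is covered too.

```python
import re
import unicodedata


def normalize(text, ignore_case=True, ignore_punctuation=False,
              ignore_whitespace=False, condense_whitespace=False):
    """Hypothetical pre-processing step applied to each text before comparison."""
    if ignore_case:
        # Capitalization is ignored by default.
        text = text.lower()
    if ignore_punctuation:
        # Assumption: drop every character in a Unicode punctuation
        # category (P*). Symbols such as "$" or "+" are left in place.
        text = "".join(ch for ch in text
                       if not unicodedata.category(ch).startswith("P"))
    if ignore_whitespace:
        # Remove all whitespace entirely.
        text = "".join(text.split())
    elif condense_whitespace:
        # Collapse each run of whitespace into a single space.
        text = re.sub(r"\s+", " ", text)
    return text


print(normalize("Hello,   World!", ignore_punctuation=True,
                condense_whitespace=True))
print(normalize("你好，世界。", ignore_punctuation=True))
```

Normalizing both texts the same way before matching is what lets an exact-match comparison tolerate differences in case, punctuation, or spacing.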
Distance from previous match illustrates the density of matches. A very low distance suggests that two matches are separated only by a typo or minor variation. A negative distance indicates that a phrase already found in the first text recurs and matches a previously matched string in the second text. Here are several examples to help you interpret your results.
You can also display the strings that do NOT match. This runs on the same algorithm.
The time it takes for the program to run can be highly variable, depending on the input texts. Two nearly identical texts will run very quickly (usually in under three seconds). Two long, completely different texts will take much longer. Chinese texts will generally run faster, as the algorithm runs on characters, and English-language works tend to have many more individual characters than Chinese works (comparing Moby Dick with Mabel's Mistake takes ten to twenty minutes). This tool finds only the first instance of a matching string, so if the same string is repeated later in the second text, the tool will not catch it. This tool will be updated to find repeated matches in the near future.
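To make the behavior described above concrete, here is a rough Python sketch of character-level matching that reports only the first instance of each match in the second text. It is not the tool's actual algorithm, which is not shown here. The function name `find_matches` is hypothetical, and the distance calculation (start of the current match in the second text minus the end of the previous match in the second text) is an assumption chosen because it yields negative values when a phrase recurs.

```python
def find_matches(first, second, min_length=10):
    """Return (string, first_index, second_index, length, distance) tuples."""
    # Record only the FIRST position of every n-gram in the second text,
    # so later repetitions in the second text are never reported.
    index = {}
    for j in range(len(second) - min_length + 1):
        index.setdefault(second[j:j + min_length], j)

    matches = []
    prev_end = None  # end of the previous match in the second text
    i = 0
    while i + min_length <= len(first):
        j = index.get(first[i:i + min_length])
        if j is None:
            i += 1
            continue
        # Extend the match one character at a time for as long as both
        # texts keep agreeing.
        length = min_length
        while (i + length < len(first) and j + length < len(second)
               and first[i + length] == second[j + length]):
            length += 1
        # Assumed definition of distance; None for the very first match.
        distance = None if prev_end is None else j - prev_end
        matches.append((first[i:i + length], i, j, length, distance))
        prev_end = j + length
        i += length  # continue scanning after the matched string
    return matches


for row in find_matches("abcdefXXabcdef", "ZZabcdefZZ", min_length=4):
    print(row)
```

In this toy input the phrase "abcdef" occurs twice in the first text and once in the second, so the second row points back to the same position in the second text and, under the assumed definition, gets a negative distance. The sketch also hints at why dissimilar texts are slow: when few n-grams match, the scan advances through the first text one character at a time.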
|String||First Text Index||Second Text Index||Length||Distance From Last Match|