High speed similar string search:over 100 million records. No omission is search. Get a list ordered by similarity

Similar Characters Search Algorithm

Supports 20 million records in the 32-bit version and over 100 million records in the 64-bit version. Zero missing searches. Ultra-fast search in an instant. Retrieve any number of results starting from the most similar ones.
Based on the paper "集合間類似度に対する簡潔かつ高速な類似文字検索アルゴリズム(Compact and Fast Similar Characters Search Algorithm for Set-based Similarity)" by Naoki Okazaki, available at this link. The algorithm follows the steps below:
  1. Create 3-gram statistics from a vast amount of character string data.
  2. Assign a unique ID to each string and store all links from 3-gram to ID (ranging from one to tens of thousands) in a data structure that allows for high-speed search.
  3. Also convert the search string into 3-gram and output the ones with more links to the same ID as more similar.
  4. This part involves multiple optimizations for high performance, as detailed in the aforementioned paper.
The development version has two key differences compared to Naoki Okazaki's publicly available high-speed similar characters search program.
  1. First, when constructing the 3-gram of the original text, it utilizes our proprietary n-gram database library.
  2. In the character recognition library, there was already a developed n-gram library used to predict candidate characters based on character co-occurrence frequency. We integrated Okazaki's method into that library as a starting point.
    As a result, it ended up being quite different from the pseudocode provided by Okazaki.
  3. The second difference is the ability to link arbitrary data records to each text.
  4. Specifically, we have developed:
back

Creating a Dictionary for Similar Character Search

We are calculating n-grams using an algorithm based on the description in Chapter 2 of 長尾真編「自然言語処理」岩波講座ソフトウェア科学(Makoto Nagao's book "Natural Language Processing" in the Iwanami Lecture Software Science) 15, 1996. Since it is difficult to obtain the book, let's briefly explain it.
Proceed with the processing as follows:
  1. Text normalization
  2. At this time, newline characters and tab codes, etc., should be converted to non-existent character codes or ignored as necessary.
    If you consider it as a separate text that does not connect to another line (a list of corporate names, personal names, addresses), you can also consider supplementing a suitable number of "non-existent character codes" before and after.
    This is a measure to handle corporate names and personal names with one or two characters, as similar character search uses 3-gram data.
    Depending on the case, it may also normalize 1-byte characters or 3-byte characters to 2-byte characters. Since it is only necessary to know the beginning of each character for now, this processing is not essential.
    In the end, it is desirable that the entire target text exists in memory.
  3. Obtain the addresses of all characters as an array
  4. If it is 100 million characters, it will be an array of 100 million elements. For example, if you target all Japanese corporate names (as of 2023), it will be an array of 70 million elements. Since it is an address, it requires an array of 280 megabytes in a 32-bit environment.
    Even with 2 billion characters of Japanese Wikipedia, if it is a 64-bit environment, it can be processed as an array of 16 gigabytes.
  5. Sort the above array
  6. If there is no need to process it in an instant, quicksort is sufficient. Even with 2 billion characters, it can be processed in a few minutes (about 10 seconds if parallel quicksort is used).
    If real-time processing is necessary, it is also possible to process it in an instant by dividing it into groups based on the code of the initial character and then quicksorting it.
  7. Count n-gram statistics
  8. The addresses of characters are the sorted addresses of all characters in the half-straight string that extends to the end of the text.
    For example, it looks like this:
    ...
    computeer.....
    ...
    computer-.....
    ...
    computer.....
    computer AI....
    ...
    computer account....
    computer eye....
    ...
    
    All character strings become half-straight texts that extend to the end of the text, which is hundreds of megabytes long.
    By examining up to the n-th character of the subsequent text, it is possible to determine if it is the same or how many same ones are there. It is possible to speed up the process by devising algorithms if necessary.
  9. Save to a file
  10. If random access is fast, any format is acceptable.
    If you want to examine the properties of n-grams, consider ensuring both random access and sequential access.
    Even if you don't load all n-gram information in memory, it is better to load (n-1)-grams into memory to reduce the number of file accesses.
    It would be even better if there are ways to cache only the ones with high frequency in memory.
    Of course, if you can load all the data into memory, that would be ideal.
back

Similar Character Search for All Corporate Numbers

Corporate My Number data can be downloaded by anyone. Only data of corporate names, postal codes, and addresses like the following are used.
...
釧路検察審査会 0850824 北海道釧路市柏木町4-7
...
一般社団法人日本色彩療法士協会 0030005 北海道札幌市白石区東札幌五条1丁目1番1号札幌市産業振興センター3階C7
...
有限会社アートロジック 2250002 神奈川県横浜市青葉区美しが丘2丁目17番地29
...
Names of corporate types such as "Independent Administrative Institution," "Corporation," and "(Ltd.)" are removed in advance as they do not fit the purpose of the similar character search program.
The block starting from "丁目" (district name) in the address is separated. Rule-based analysis accurately separates the portion after "丁目" since there are district names with "丁目" in them.
...
釧路検察審査会 0850824 北海道釧路市柏木町/4-7
...
日本色彩療法士協会 0030005 北海道札幌市白石区東札幌五条/1丁目1番1号/札幌市産業振興センター3階C7
...
アートロジック 2250002 神奈川県横浜市青葉区美しが丘/2丁目17番地29
...
It is necessary to remove clearly incorrect records such as records with company names that are over 200 characters long or seem to be registered jokingly.
After various outlier data removal and normalization processes, 3-grams are created from company names with "$$" attached at the beginning and end.
There are numerous companies where removing "株式会社 X" (X Corporation) makes the name one character long, not to mention two-character company names. Therefore, the complementing process mentioned above is essential.
Finally, readings are extracted and generated from company names using a separate database for company name readings or morphological analysis software such as MeCab + neologd.
  1. By performing a similar character search with the company name, postal code, address, and company name readings can be obtained.
  2. By performing a similar character search with the company address, the postal code and company name can be obtained.
This enables the execution of the search process described above.
back

Similar character search targeting all addresses within Japan

The entire address is based on the postal code database publicly available from JP (Japan Post).
In the upper-level system, the functionality to search for postal codes from company name databases is also used. This increases the robustness of the search as the registered addresses for My Number may contain variations.
Searching for addresses from postal codes does not require the use of a similar character search program. It is performed using regular postal code search.
Searching for postal codes from addresses is done using a similar character search.
The construction is the same as the similar character search targeting company names, except that only postal codes are included in the link record.
back