High-speed similar string search: over 100 million records, no missed matches, results ranked by similarity

Similar Characters Search Algorithm

The 32-bit version supports 20 million records and the 64-bit version supports over 100 million. The search is exhaustive, so no matching record is missed, and it completes almost instantly. Any number of results can be retrieved, starting from the most similar.
It is based on the paper "集合間類似度に対する簡潔かつ高速な類似文字検索アルゴリズム" (Compact and Fast Similar Characters Search Algorithm for Set-based Similarity) by Naoki Okazaki, available at this link. The algorithm works as follows:

Build 3-gram statistics from a large collection of character strings.
Assign a unique ID to each string and store every link from a 3-gram to the IDs of the strings that contain it (anywhere from one to tens of thousands of IDs per 3-gram) in a data structure that supports high-speed lookup.
Convert the search string into 3-grams as well, and rank each ID by how many of those 3-grams link to it: the more shared links, the more similar the string.
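
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the product's implementation or the algorithm from the paper: the function names, the in-memory dictionary, and the sample strings are all assumptions, and similarity is simply the raw number of shared distinct 3-grams.

```python
from collections import Counter, defaultdict

def trigrams(text):
    """Return the set of distinct 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(strings):
    """Map each 3-gram to the IDs (list positions) of the strings containing it."""
    index = defaultdict(set)
    for string_id, text in enumerate(strings):
        for gram in trigrams(text):
            index[gram].add(string_id)
    return index

def search(query, strings, index, limit=3):
    """Rank stored strings by how many distinct 3-grams they share with query."""
    hits = Counter()
    for gram in trigrams(query):
        for string_id in index.get(gram, ()):
            hits[string_id] += 1
    # Most shared 3-grams first; ties are broken by the lower ID.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [(strings[string_id], shared) for string_id, shared in ranked[:limit]]

# Hypothetical sample data for illustration only.
corpus = ["computer", "computer account", "commuter", "compute"]
index = build_index(corpus)
results = search("computr", corpus, index)
all_results = search("computr", corpus, index, limit=10)
```

Because only the IDs linked from the query's 3-grams are visited, strings that share nothing with the query are never touched, which is what keeps this kind of lookup fast on large collections.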

The version under development differs from Naoki Okazaki's publicly available high-speed similar characters search program in two key ways.

First, it builds the 3-grams of the source text with our proprietary n-gram database library.

Second, arbitrary data records can be linked to each text. Examples:

Similar character search over 4.5 million Japanese corporate names, where each name is linked to a record holding its reading, the corporate postcode, and the corporate address.
Similar character search over 4.5 million corporate addresses, where each address is linked to a record holding the corporate name and the corporate postcode.
Similar character search over 120,000 Japanese addresses (excluding building numbers), where each address is linked to a record holding its postcode.


Creating a Dictionary for Similar Character Search

The n-grams are computed with an algorithm based on the description in Chapter 2 of 長尾真編「自然言語処理」岩波講座ソフトウェア科学 15, 1996 ("Natural Language Processing," edited by Makoto Nagao, Iwanami Lecture Series on Software Science, vol. 15, 1996). Because the book is hard to obtain, the method is summarized briefly here.
Processing proceeds as follows:

Text normalization

Obtain the addresses of all characters as an array

Sort the above array

Count n-gram statistics

...
computeer.....
...
computer-.....
...
computer.....
computer AI....
...
computer account....
computer eye....
...

Save to a file
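
The five steps above can be illustrated with a short Python sketch. It is written under several assumptions that do not come from the book or from the library described here: normalization is taken to be NFKC folding plus lowercasing, the sort compares only the first n characters at each position (a full implementation would typically compare whole suffixes so that one sort serves every n), and the tab-separated output format is hypothetical.

```python
import io
import unicodedata
from itertools import groupby

def normalize(text):
    """Assumed normalization: NFKC folding (full-width to half-width) plus lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()

def count_ngrams(text, n):
    """Count n-grams by sorting character positions instead of hashing substrings."""
    # One start position per character that still has n characters after it.
    positions = list(range(len(text) - n + 1))
    # Sorting positions by the text found there puts equal n-grams side by side.
    positions.sort(key=lambda start: text[start:start + n])
    # Each run of equal neighbours is one n-gram; the run length is its frequency.
    counts = {}
    for gram, run in groupby(positions, key=lambda start: text[start:start + n]):
        counts[gram] = len(list(run))
    return counts

def write_counts(counts, stream):
    """Write one 'n-gram<TAB>count' line per entry (a hypothetical file format)."""
    for gram, count in sorted(counts.items()):
        stream.write(f"{gram}\t{count}\n")

text = normalize("Computer computer ＣＯＭＰＵＴＥ")
counts = count_ngrams(text, 3)
buffer = io.StringIO()  # stands in for the output file
write_counts(counts, buffer)
```

The sorted listing shown above (the "computer..." lines) is what the position array looks like after the sort: entries that begin with the same characters sit next to each other, so counting reduces to measuring the length of each run.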


Similar Character Search for All Corporate Numbers

Corporate My Number data can be downloaded by anyone. Only the corporate name, postal code, and address fields are used, as in the following records.

...
釧路検察審査会 0850824 北海道釧路市柏木町４－７
...
一般社団法人日本色彩療法士協会 0030005 北海道札幌市白石区東札幌五条１丁目１番１号札幌市産業振興センター３階Ｃ７
...
有限会社アートロジック 2250002 神奈川県横浜市青葉区美しが丘２丁目１７番地２９
...

Corporate-type designations such as "Independent Administrative Institution," "Corporation," and "(Ltd.)" are removed in advance because they do not serve the purpose of similar character search.
The address is split at the block that begins with "丁目" (chome, a numbered district block). Because some district names themselves contain "丁目", rule-based analysis is used to split off that block accurately.

...
釧路検察審査会 0850824 北海道釧路市柏木町/４－７
...
日本色彩療法士協会 0030005 北海道札幌市白石区東札幌五条/１丁目１番１号/札幌市産業振興センター３階Ｃ７
...
アートロジック 2250002 神奈川県横浜市青葉区美しが丘/２丁目１７番地２９
...
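
The two clean-up rules can be approximated with a short Python sketch that reproduces the three sample records above. It is a deliberately naive illustration, not the rule-based analysis actually used: the list of corporate types is a small hypothetical subset, the regular expression simply splits at the first run of Arabic digits, and it makes no attempt to handle district names that themselves contain "丁目".

```python
import re

# Hypothetical, incomplete list of corporate-type designations to strip.
CORPORATE_TYPES = ("一般社団法人", "独立行政法人", "株式会社", "有限会社")

# Head, then a block made of digits (half- or full-width), 丁目/番/地/号 and
# hyphen variants, then whatever remains (for example a building name).
ADDRESS_PATTERN = re.compile(
    r"^(.*?)"
    r"([0-9\uFF10-\uFF19][0-9\uFF10-\uFF19丁目番地号\-\u2010-\u2015\u2212\uFF0D]*)"
    r"(.*)$"
)

def strip_corporate_type(name):
    """Remove a known corporate-type designation from the start or end of name."""
    for corporate_type in CORPORATE_TYPES:
        if name.startswith(corporate_type):
            name = name[len(corporate_type):]
        if name.endswith(corporate_type):
            name = name[:-len(corporate_type)]
    return name

def split_address(address):
    """Insert '/' before the numbered block and before any trailing remainder."""
    match = ADDRESS_PATTERN.match(address)
    if match is None:
        return address  # no digits found, leave the address untouched
    return "/".join(part for part in match.groups() if part)

cleaned_name = strip_corporate_type("有限会社アートロジック")
cleaned_address = split_address("神奈川県横浜市青葉区美しが丘２丁目１７番地２９")
```

Kanji numerals such as the "五条" in "東札幌五条" are left in the head on purpose, since they belong to the district name rather than to the numbered block.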

Clearly invalid records must be removed, such as those with company names over 200 characters long or names that appear to have been registered as a joke.
After outlier removal and normalization, 3-grams are created from each company name with "$$" attached to both the beginning and the end.
Many companies are left with a one-character name once "株式会社" (Corporation) is removed from "株式会社 X", to say nothing of two-character company names. Such names are shorter than three characters and would yield no 3-grams on their own, so the padding described above is essential.
Finally, readings for the company names are either extracted from a separate database of company name readings or generated with morphological analysis software such as MeCab + neologd.
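
The effect of the "$$" padding is easy to see in code. The following Python sketch is only an illustration; the function names are hypothetical, and "X" and "AB" stand in for one- and two-character company names.

```python
PAD = "$$"  # marker attached to both ends of the name, as described above

def plain_trigrams(name):
    """3-grams of the name as-is; empty for names shorter than three characters."""
    return [name[i:i + 3] for i in range(len(name) - 2)]

def padded_trigrams(name):
    """3-grams of the name after attaching the padding marker to both ends."""
    padded = PAD + name + PAD
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

one_char = padded_trigrams("X")    # ["$$X", "$X$", "X$$"]
two_char = padded_trigrams("AB")   # ["$$A", "$AB", "AB$", "B$$"]
unpadded = plain_trigrams("X")     # []
```

With padding, a name of length k always yields k + 2 3-grams, so even a one-character name can be indexed and matched.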

A similar character search on the company name returns the postal code, the address, and the reading of the company name.
A similar character search on the company address returns the postal code and the company name.

This preparation makes it possible to run the searches described above.

Similar Character Search for All Addresses in Japan

The full set of addresses is taken from the postal code database published by Japan Post (JP).
The upper-level system also uses the postal code lookup from the company name database. This makes the search more robust, because the addresses registered for My Number may vary in how they are written.
Looking up an address from a postal code does not require the similar character search program; an ordinary postal code lookup is used.
Looking up a postal code from an address uses similar character search.
The dictionary is built in the same way as for the company name search, except that the link record contains only the postal code.
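
The two lookup directions can be contrasted in a small Python sketch. It is an assumption-laden illustration rather than the system's code: the rows reuse the address prefixes and postal codes from the sample records earlier in this document instead of the Japan Post database, the names are hypothetical, and similarity is the raw count of shared 3-grams.

```python
from collections import Counter, defaultdict

# (address, postal code) pairs; the postal code is the only linked data.
ROWS = [
    ("北海道釧路市柏木町", "0850824"),
    ("北海道札幌市白石区東札幌五条", "0030005"),
    ("神奈川県横浜市青葉区美しが丘", "2250002"),
]

def trigrams(text):
    """Return the set of distinct 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Postal code -> addresses: an exact lookup table, no similarity search needed.
addresses_by_code = defaultdict(list)
# 3-gram -> row numbers: used for the address -> postal code direction.
rows_by_gram = defaultdict(set)
for row_number, (address, postal_code) in enumerate(ROWS):
    addresses_by_code[postal_code].append(address)
    for gram in trigrams(address):
        rows_by_gram[gram].add(row_number)

def postal_codes_for(address_query, limit=3):
    """Return (postal code, matched address, shared 3-grams), most similar first."""
    shared = Counter()
    for gram in trigrams(address_query):
        for row_number in rows_by_gram.get(gram, ()):
            shared[row_number] += 1
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [(ROWS[n][1], ROWS[n][0], count) for n, count in ranked[:limit]]

without_prefecture = postal_codes_for("釧路市柏木町")
truncated = postal_codes_for("北海道釧路市柏木")
exact = addresses_by_code["2250002"]
```

The query with the prefecture omitted still finds the right postal code, which is the kind of variation in written addresses that the similarity search is meant to absorb.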