Bananas, Hananas, Jananas: A Tale of Typos
The core of InfoScout’s business intelligence platform is built on top of receipts. Hundreds of thousands of receipt images are uploaded by shoppers from all over the country, every single day. The quality and resolution of these images vary wildly: they could be rotated, shot in low light conditions, or taken by a poor camera. Consequently, optical character recognition (OCR) systems have a tough time extracting line items from these receipts.
Accuracy here is paramount to our business, since the number of purchases is one of the most important metrics that our clients use to evaluate their position in the market.
As the saying goes, “garbage in, garbage out”. When we detect garbage OCR output, we fall back to our army of human transcribers: Amazon MTurk. While humans are certainly more accurate than OCR when it comes to receipt transcription, mistakes still happen, accidentally or otherwise. My task over the last few months as a summer intern was to architect another line of defense against dirty data riddled with typos.
This problem is a classic application of one type of unsupervised machine learning: clustering. Centered around each correctly spelled line item string, there resides a cluster of incorrectly spelled variations for that string, each maybe one or two letters off.
104041: BANANAS
1603: BANANS
992: BANANA
621: ANANAS
379: BANNAS
378: BANAS
299: BANANANS
An observation we made is that “the mode is always right”. In other words, within each cluster of similarly spelled strings, the one we’ve seen the most times, often by an order of magnitude or more, is the correctly spelled one. Therefore, once we have broken down our line items into clusters, we can clean the entire cluster to the mode. Sounds simple, right?
So, how did we cluster our receipt line item strings? We turned to the DBSCAN algorithm. While I won’t go into the algorithm itself, one major component of the algorithm is the neighborhood query function. It is this function that determines the final runtime of the algorithm, whether that is minutes and hours, or days and weeks.
Given a line item string, how do we characterize its neighbors? Through experimentation, we have found that a Levenshtein distance of 2 works well. Levenshtein distance is a measure of how many edit operations are necessary to transform one string to another, where an edit operation can be inserting one character, deleting one character, or substituting one character for another. Therefore, given one line item string, we define its neighbors to be the set of all other line item strings that can be transformed to the original in at most 2 edit operations.
We initially toyed with the scikit-learn Python library, as it had an implementation of DBSCAN readily available. We quickly learned that it was not the answer to our problems. While its implementation of DBSCAN was well optimized for numerical data and had fast indices like ball trees and KD trees, there were no indices available for string data using Levenshtein distance. Thus, this degenerated to creating a $n \times n$ matrix of the pairwise distances between all $n$ line item strings, which has $O(n^2)$ space and time complexity.
As an optimization to our algorithm, we decided to partition the receipt line items by the stores they came from, since stores differ wildly in their receipt line item descriptions for the same items. Even with this optimization, one of our largest stores, Kroger, still had 1.1 million unique line item strings that we’ve seen throughout the years. The distance matrix for Kroger would need dozens of terabytes of memory, which would not fit in even the largest AWS EC2 instance.
To circumvent this issue, a custom solution was required. Fortunately, the core of DBSCAN is not a difficult algorithm to implement. For the index, I found a niche data structure to efficiently find nearest neighbors for strings: BK trees. BK Trees work for any metric, which by definition need to satisfy:
d(x, y) ≥ 0 (non-negativity)
d(x, y) = 0 ⇔ x = y (identity of indiscernibles)
d(x, y) = d(y, x) (symmetry)
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Fortunately, Levenshtein distance satisfies all these conditions. That means that we can take advantage of the numerous convenient properties of the BK tree: $O(n \log n)$ creation time, $O(\log n)$ query time, and $O(n)$ space. At this point, our dataset easily fits into any modern computer, and will complete within a few hours for a few hundred thousand unique line items, which is typical for a medium sized chain like Trader Joe’s. The cleansing output of this algorithm is quite promising, with solid precision and recall. Nevertheless, it can still be improved.
Now that we got the dataset to fit in memory, the next step would be to improve precision and recall. The intuition is that, for an OCR system, some types of mistakes are more common than others. For example, the number “0” and the capital letter “O” can be easily confused. On the other hand, it is unlikely for OCR to recognize an “X” when there is actually an “O”. However, using the Levenshtein distance algorithm, all these errors would have the exact same cost of 1.
To quantify these differences, we collected statistics from our proprietary data to determine the frequencies of each type of error, from both humans and OCR. These frequencies were then transformed into costs. For a typical error, the cost would be around 1. For common errors, such as substituting an “O” for a “0”, the cost would be less than 1, while uncommon errors would have costs greater than 1. These costs would then be used to calculate the Levenshtein distance. Thus, for the same threshold distance of 2, it allows a neighbor to be more than 2 edit operations away, if those edits are very common mistakes. Conversely, seldom seen edits are penalized, and perhaps only 1 edit operation is allowed before the distance threshold of 2 is reached.
Clean: | A | V | O | C | A | D | O | S | |
---|---|---|---|---|---|---|---|---|---|
Dirty: | A | Y | 0̸ | C | A | D | X | S | |
Costs: | 0.4 | 0.2 | 1.4 | Total = 2 |
Furthermore, human transcribed texts produce very different types of errors compared to OCR transcribed texts. For instance, it is not likely for an OCR system to transpose adjacent characters in a string. On the other hand, human typists make those mistakes constantly. How many times have you see “teh” or “adn” in a comment section or forum online? To address this difference, we instead used the Damerau-Levenshtein distance for human sourced line item strings. The Damerau-Levenshtein distance allows the same 3 edit operations from the Levenshtein distance, plus the transpose operation. In addition, we also used different weights for human sourced strings. For instance, the letters “I” and “O” are very close on a QWERTY keyboard, so it is likely for a human typist to type one instead of the other, while it is unlikely to be the case for OCR sourced strings, since the two characters look nothing alike.
We could not find any existing library out there that can calculate weighted Levenshtein and weighted Damerau-Levenshtein distance. Thus, we built a library to do this ourselves.
It is written in Cython for maximal performance and easy integration with Python. We have open sourced it, so check it out at https://github.com/infoscout/weighted-levenshtein! Usage is very simple:
import numpy as np
from weighted_levenshtein import lev, osa, dam_lev
# make an array of all 1's of size 128, the number of ASCII characters
insert_costs = np.ones(128, dtype=np.float64)
# make inserting the character 'D' have cost 1.5 (instead of 1)
insert_costs[ord('D')] = 1.5
# you can just specify the insertion costs
# delete_costs and substitute_costs default to 1 for all characters if unspecified
print lev('BANANAS', 'BANDANAS', insert_costs=insert_costs) # prints '1.5'
With this modification to our algorithm, both precision and recall were improved. Unlikely errors were pruned out of the clusterings, while more common errors found their way into their correct clusters. Unfortunately, BK trees can no longer be used to index the strings, since the weighted Levenshtein and Damerau-Levenshtein distance neither satisfies symmetry nor the triangle inequality. Once again, we are faced with a brute force $O(n)$ neighborhood query.
Fortunately, I stumbled upon this blog post. This is a highly efficient algorithm that allows constant neighborhood query time with respect to the size of the index. After using this algorithm to find an approximate neighborhood, we then use our weighted Levenshtein distance to find the exact neighborhood on the reduced search space. While this algorithm does use an obscene amount of memory, it cut down the algorithm’s runtime from hours to less than 10 minutes.
Thus, the final pipeline consists of:
- reading all the unique line item strings from the database
- applying some basic pre-cleansing rules
- creating clusters from those strings
- picking the most frequently occurring string within each cluster as the correct spelling.
For Trader Joe's, a store with around 480,000 unique line items that we've seen so far, this algorithm was able to clean that to around 200,000, a reduction of almost 60%. As we like to say up north, impressive, eh?