Fuzzy match in Query Service
Use a ‘fuzzy’ match on your Adobe Experience Platform data to return the most likely, approximate matches without the need to search for strings with identical characters. This allows for much a more flexible search of your data and makes your data more accessible by saving time and effort.
Instead of trying to re-format the search strings in order to match them, the fuzzy match analyzes the ratio of similarity between two sequences and returns the percentage of similarity. FuzzyWuzzy is recommended for this process as its functions are more suited to help match strings in more complex situations compared to regex or difflib.
The example provided in this use case focuses on matching similar attributes from a hotel room search across two different travel agency datasets. The document demonstrates how to match strings by their degree of similarity from large separate data sources. In this example, fuzzy match compares the search results for the features of a room from the Luma and Acme travel agencies.
Getting started getting-started
As part of this process requires you to train a machine learning model, this document assumes a working knowledge of one or more machine learning environments.
This example uses Python and the Jupyter Notebook development environment. Although there are many options available, Jupyter Notebook is recommended because it is an open-source web application that has low computational requirements. It can be downloaded from the official Jupyter site.
Before you begin, you must import the necessary libraries. FuzzyWuzzy is an open-sourced Python library built on top of the difflib library and used to match strings. It uses Levenshtein Distance to calculate the differences between sequences and patterns. FuzzyWuzzy has the following requirements:
- Python 2.4 (or higher)
- Python-Levenshtein
From the command line, use the following command to install FuzzyWuzzy:
pip install fuzzywuzzy
Or use the following command to install Python-Levenshtein as well:
pip install fuzzywuzzy[speedup]
More technical information on Fuzzywuzzy can be found in their official documentation.
Connect to Query Service
You must connect your machine learning model to Query Service by providing your connection credentials. Both expiring and non-expiring credentials can be provided. Please see the credentials guide for more information on how to acquire the necessary credentials. If you are using Jupyter Notebook, please read the full guide on how to connect to Query Service.
Also, be sure to import the numpy package into your Python environment to enable linear algebra.
import numpy as np
The commands below are necessary to connect to Query Service from Jupyter Notebook:
import psycopg2
conn = psycopg2.connect('''
sslmode=require
host=<YOUR_ORGANIZATION_ID>
port=80
dbname=prod:all
user=<YOUR_ADOBE_ID_TO_CONNECT_TO_QUERY_SERVICE>
password=<YOUR_QUERY_SERVICE_PASSWORD>
''')
cur = conn.cursor()
Your Jupyter Notebook instance is now connected to Query Service. If the connection is successful, no message will display. If the connection failed, an error will display.
Draw data from the Luma dataset luma-dataset
Data for analysis is drawn from the first dataset with the following commands. For brevity, the examples have been limited to the first 10 results of the column.
cur.execute('''SELECT * FROM luma;
''')
luma = np.array([r[0] for r in cur])
luma[:10]
Select Output to display the returned array.
code language-console |
---|
|
Draw data from the Acme dataset acme-dataset
Data for analysis is now drawn from the second dataset with the following commands. Again, for brevity, the examples have been limited to the first 10 results of the column.
cur.execute('''SELECT * FROM acme;
''')
acme = np.array([r[0] for r in cur])
acme[:10]
Select Output to display the returned array.
code language-console |
---|
|
Create a fuzzy scoring function fuzzy-scoring
Next, you must import fuzz
from the FuzzyWuzzy library and execute a partial ratio comparison of the strings. The partial ratio function allows you to perform substring matching. This takes the shortest string and matches it with all substrings that are of the same length. The function returns a percentage similarity ratio of up to 100%. For example, the partial ratio function would compare the following strings ‘Deluxe Room’, ‘1 King Bed’, and ‘Deluxe King Room’ and return a similarity score of 69%.
In the hotel room match use case, this is done using the following commands:
from fuzzywuzzy import fuzz
def compute_match_score(x,y):
return fuzz.partial_ratio(x,y)
Next, import cdist
from the SciPy library to compute the distance between each pair in the two collections of inputs. This computes the scores among all pairs of hotel rooms provided by each of the travel agencies.
from scipy.spatial.distance import cdist
pairwise_distance = cdist(luma.reshape((-1,1)),acme.reshape((-1,1)),compute_match_score)
Create mappings between the two columns using the fuzzy join score
Now that the columns have been scored based on distance, you can index the pairs and retain only matches that scored higher than a certain percentage. This example only retains pairs that matched with a score of 70% or higher.
matched_pairs = []
for i,c1 in enumerate(luma):
idx = np.where(pairwise_distance[i,:] > 70)[0]
for j in idx:
matched_pairs.append((luma[i].replace("'","''"),acme[j].replace("'","''")))
The results can be displayed with the following command. For brevity, the results are limited to ten rows.
matched_pairs[:10]
Select Output to see the results.
code language-console |
---|
|
The results are then matched using SQL with the following command:
matching_sql = ' OR '.join(["(e.luma = '{}' AND b.acme = '{}')".format(c1,c2) for c1,c2 in matched_pairs])
Apply the mappings to do fuzzy join in Query Service mappings-for-query-service
Next, the high-scoring matching pairs are joined using SQL to create a new dataset.
:
cur.execute('''
SELECT * FROM luma e
CROSS JOIN acme b
WHERE
{}
'''.format(matching_sql))
[r for r in cur]
Select Output to see the results of this join.
code language-console |
---|
|
Save fuzzy match results to Platform save-to-platform
Finally, the results of the fuzzy match can be saved as a dataset for use in Adobe Experience Platform using SQL.
cur.execute('''
Create table luma_acme_join
AS
(SELECT * FROM luma e
CROSS JOIN acme b
WHERE
{})
'''.format(matching_sql))