Advanced Records Matching
Records matching refers to the process of mapping existing data records into the Fabric. This can be done with an existing location_id value, by geographically locating records using coordinates, or by performing an address lookup. The best approach depends on the source data and how the matched data will be used, so there is no single “best way” that applies across all use cases.
This guide uses CostQuest PyTools and DuckDB with a small set of sample data (available here).
A vintage must be chosen and then held constant across all API calls. This guide uses a vintage of 202506; available vintages can be determined using the fabric/vintages API.
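Holding the vintage in a single constant makes it easy to keep every API call consistent. Below is a minimal sketch of picking the most recent vintage, assuming the fabric/vintages API returns vintage identifiers as YYYYMM strings (the sample list stands in for the actual API response):

```python
# Stand-in for the response of the fabric/vintages API (hypothetical values).
available_vintages = ["202406", "202412", "202506"]

# YYYYMM strings sort lexicographically in chronological order,
# so max() yields the most recent vintage.
vintage = max(available_vintages)
print(vintage)  # → 202506
```

Reuse the resulting `vintage` value in every subsequent locate and match call.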
Prerequisites
- CostQuest API Key granted access to the Fabric, Fabric Extensions, and Geoservices APIs.
- Python (3.10 tested)
- DuckDB (1.1.3 tested)
Assign Location IDs (Python)
Depending on which attributes are present in your source data, you can use a mix of CostQuest APIs to assign appropriate location_id values for your chosen vintage.
Locate API
For records with geographic coordinates (which must be WGS84 latitude and longitude), use the fabricext/locate API to geographically determine a suitable location_id. To maximize potential links to a Fabric location_id, set parceldistancem to 1000 meters and neardistancem to 100 meters.
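Because the Locate API expects WGS84 coordinates, it can be worth screening out records whose latitude or longitude is missing or out of range before submitting them. A small sketch — the field names mirror the sample file, and the filter itself is an assumption about how you want to pre-clean your data:

```python
def has_valid_wgs84(record):
    """Return True when the record carries usable WGS84 coordinates."""
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (KeyError, TypeError, ValueError):
        return False
    # Valid WGS84 ranges: latitude -90..90, longitude -180..180.
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

records = [
    {"sourcekey": "a1", "latitude": "39.74", "longitude": "-104.99"},
    {"sourcekey": "a2", "latitude": "", "longitude": ""},         # no coordinates
    {"sourcekey": "a3", "latitude": "139.74", "longitude": "0"},  # out of range
]
locatable = [r for r in records if has_valid_wgs84(r)]
print([r["sourcekey"] for r in locatable])  # → ['a1']
```

Records that fail the check can be routed to the Match API instead, provided they carry address text.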
Match API
Records with address data (either as components or as a single piece of text) can be matched to a location_id using the fabricext/match API.
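When address data arrives as separate components, it can be collapsed into the single text field that a match call accepts. A sketch, assuming hypothetical component names (address, city, state, zip) and skipping blanks — similar in spirit to the concat_ws step used later in the bridge query:

```python
def to_single_line(components):
    """Join non-empty address components into one match-ready string."""
    parts = [components.get(k, "") for k in ("address", "city", "state", "zip")]
    return " ".join(p.strip() for p in parts if p and p.strip())

row = {"address": "123 Main St", "city": "Denver", "state": "CO", "zip": "80202"}
print(to_single_line(row))  # → 123 Main St Denver CO 80202
```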
PyTools Code
Below is a code snippet that runs the sample file data.csv through both of the APIs above.
```python
import os

from cqazapipytools import *

with cqazapipytools(os.environ['CQAPIKEY']) as cp:
    # setup
    vintage = '202506'
    data = cp.csvRead('data.csv')

    # locate
    locate = cp.locate(vintage,
                       cp.transformList(data, 'select', ['sourcekey', 'latitude', 'longitude']),
                       parceldistancem=1000, neardistancem=100)
    cp.csvWrite('data_locate.csv', locate)

    # match
    match = cp.match(vintage, cp.transformList(data, 'select', ['sourcekey', 'text']))
    cp.csvWrite('data_match.csv', match)
```
Data Analysis (DuckDB)
Create a bridge table by joining the outputs of each of the processes above, bringing forward the relevant columns.
Launch DuckDB in the directory with your data.csv and output files from the API processes above.
```shell
duckdb data.duckdb
```
Load the data into the duckdb database.
```sql
create table data as (select * from 'data.csv');
create table data_locate as (select * from 'data_locate.csv');
create table data_match as (select * from 'data_match.csv');
```
The queries below will build the bridge table for analysis.
```sql
install spatial;
load spatial;

drop table if exists data_bridge;
create table data_bridge as (
select i.sourcekey
,i.latitude source_latitude
,i.longitude source_longitude
,i.text source_text
,case when m.location_id is not null then 1 else 0 end match_true
,case when l.location_id is not null then 1 else 0 end locate_true
,case when l.location_id is not null or m.location_id is not null then 1 else 0 end location_id_true
,case when l.location_id is not null then l.location_id when m.location_id is not null then m.location_id end location_id_selected
,case when l.location_id is not null and m.location_id is not null and l.location_id<>m.location_id then 1 else 0 end location_id_disagreement
,0 location_id_duplicates
,0 location_id_duplicaterank
,st_distance_spheroid(st_point(i.latitude,i.longitude),
case when l.location_id is not null then st_point(l.latitude,l.longitude) when m.location_id is not null then st_point(m.latitude,m.longitude) end
) distancem
,l.location_id locate_location_id
,l.latitude locate_latitude
,l.longitude locate_longitude
,l.matchtype locate_matchtype
,m.location_id match_location_id
,concat_ws(' ',m.address,m.city,m.state,m.zip) match_address
,m.latitude match_latitude
,m.longitude match_longitude
,m.matchtype match_matchtype
,m.matchcount match_count
from data i
left join (
select sourcekey
,location_id
,matchtype
,latitude
,longitude
from data_locate dl
) l on l.sourcekey=i.sourcekey
left join (
select sourcekey
,locations_0_location_id location_id
,matchtype
,matchcount
,locations_0_address address
,locations_0_city city
,locations_0_state state
,locations_0_zip zip
,locations_0_latitude latitude
,locations_0_longitude longitude
from data_match
) m on m.sourcekey=i.sourcekey
);

update data_bridge
set location_id_duplicates=upd.cnt
from (
select location_id_selected
,count(*) cnt
from data_bridge
where location_id_selected is not null
group by 1
having count(*)>1
) upd
where upd.location_id_selected=data_bridge.location_id_selected;

update data_bridge
set location_id_duplicaterank=upd.rnk
from (
select sourcekey
,row_number() over (partition by location_id_selected order by distancem,sourcekey) rnk
from data_bridge
where location_id_selected is not null
) upd
where upd.sourcekey=data_bridge.sourcekey;
```
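The two update statements above can be mirrored in plain Python, which may help when sanity-checking the SQL: group rows by location_id_selected, record the group size for groups larger than one, and rank each group's members by increasing distance (sourcekey as the tiebreaker). A sketch over hypothetical bridge rows:

```python
from collections import defaultdict

rows = [
    {"sourcekey": "a1", "location_id_selected": "L100", "distancem": 12.0},
    {"sourcekey": "a2", "location_id_selected": "L100", "distancem": 3.0},
    {"sourcekey": "a3", "location_id_selected": "L200", "distancem": 8.0},
]

# Group rows that share a location_id_selected (skip unmatched rows).
groups = defaultdict(list)
for row in rows:
    if row["location_id_selected"] is not None:
        groups[row["location_id_selected"]].append(row)

for members in groups.values():
    # Duplicates: group size when more than one record shares the location_id, else 0.
    dup = len(members) if len(members) > 1 else 0
    # Rank: nearest record first, sourcekey breaking ties.
    members.sort(key=lambda r: (r["distancem"], r["sourcekey"]))
    for rank, r in enumerate(members, start=1):
        r["location_id_duplicates"] = dup
        r["location_id_duplicaterank"] = rank

print({r["sourcekey"]: (r["location_id_duplicates"], r["location_id_duplicaterank"]) for r in rows})
# → {'a1': (2, 2), 'a2': (2, 1), 'a3': (0, 1)}
```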
Interpreting Results (DuckDB)
Using the data_bridge table, it is now possible to interrogate the results.
If you would like to export the data_bridge table from DuckDB, it is easily done using the COPY statement.

```sql
copy (select * from data_bridge) to 'data_bridge.csv';
```
Below are some examples showing how to interrogate the data_bridge results table.
Find source records that have no equivalent in the Fabric.

```sql
select *
from data_bridge
where location_id_true=0;
```
Find duplicative source records, where multiple records match a single location_id.

```sql
select *
from data_bridge
where location_id_true=1 and location_id_duplicaterank>1;
```
Find all records where there is no match to a location_id or the match is a duplicate.

```sql
select *
from data_bridge
where location_id_true=0 or location_id_duplicaterank>1;
```
Bridge Table Schema
The data_bridge table built by the process above includes a number of useful columns and is intended to represent a unified view of both the source data and matching results. Descriptions of the columns are provided below.
| Column | Definition |
|---|---|
| sourcekey | Reflects back the provided sourcekey. |
| source_latitude | Reflects back the provided latitude. |
| source_longitude | Reflects back the provided longitude. |
| source_text | Reflects back the provided address. |
| match_true | Indicates whether a location_id was identified using the Match API. |
| locate_true | Indicates whether a location_id was identified using the Locate API. |
| location_id_true | Indicates whether a location_id was identified using any API. |
| location_id_selected | The chosen location_id based on prioritization of API results (Locate first, then Match). |
| location_id_disagreement | Indicates whether the location_id values identified by different APIs disagree. |
| location_id_duplicates | When a location_id is assigned to multiple source records, the count of records sharing it; otherwise 0. |
| location_id_duplicaterank | Rank among records sharing the same location_id_selected, determined by increasing distance between the location_id_selected and the source coordinates. |
| distancem | The distance in meters between the location_id_selected and the source coordinates. |
| locate_location_id | location_id from the Locate API. |
| locate_latitude | latitude from the Locate API. |
| locate_longitude | longitude from the Locate API. |
| locate_matchtype | The quality of the match made by the Locate API. See Locate API documentation. |
| match_location_id | location_id from the Match API. |
| match_address | Concatenated address from the Match API. |
| match_latitude | latitude from the Match API. |
| match_longitude | longitude from the Match API. |
| match_matchtype | The quality of the match made by the Match API. See Match API documentation. |
| match_count | The count of matches made by the Match API. The first-ranked result is always chosen. |