Advanced Records Matching
Records matching refers to the process of mapping existing data records into the Fabric. This can be done by using an existing location_id value, by locating records geographically with coordinates, or by performing an address lookup. The best approach depends on the source data and how it will be used, so there is no single “best way” that applies across all use cases.
This guide uses CostQuest PyTools and DuckDB, along with a small set of sample data (available here).
A vintage must be chosen and then held constant across all API calls. This guide uses a vintage of 202506; available vintages can be retrieved from the fabric/vintages API.
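The snippet below is a minimal sketch of listing vintages over plain HTTP; the base URL and header name are assumptions, so confirm both against the CostQuest API documentation.
import os
import requests

# Placeholder base URL and auth header; use the values from the API docs.
BASE_URL = 'https://api.costquest.example'
response = requests.get(
    f'{BASE_URL}/fabric/vintages',
    headers={'x-api-key': os.environ['CQAPIKEY']},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # choose one vintage (e.g. 202506) and hold it constant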
Prerequisites
- A CostQuest API key with access granted to the Fabric, Fabric Extensions, and Geoservices APIs.
- Python (3.10 tested)
- DuckDB (1.1.3 tested)
Assign Location IDs (Python)
Depending on which attributes are present in your source data, you can use a mix of CostQuest APIs to assign appropriate location_id values for your chosen vintage.
Locate API
For records with geographic coordinates (which must be expressed as WGS84 latitude and longitude), use the fabricext/locate API to geographically determine a suitable location_id. To maximize potential links to a Fabric location_id, set parceldistancem to 1000 meters and neardistancem to 100 meters.
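Because the Locate API expects WGS84 coordinates, it can help to screen out records with missing or out-of-range values before submitting them. The helper below is a sketch of that idea; the function and sample records are illustrative, not part of PyTools.
records = [
    {'sourcekey': 'A1', 'latitude': '39.9612', 'longitude': '-82.9988'},
    {'sourcekey': 'A2', 'latitude': '', 'longitude': ''},
]

def has_valid_wgs84(record):
    """Return True when a record carries plausible WGS84 coordinates."""
    try:
        lat = float(record['latitude'])
        lon = float(record['longitude'])
    except (KeyError, TypeError, ValueError):
        return False
    # WGS84 latitude spans -90..90 and longitude spans -180..180.
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

locatable = [r for r in records if has_valid_wgs84(r)]
print(locatable)  # only A1 survives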
Match API
Records with address data (either as components or as a single piece of text) can be matched to a location_id using the fabricext/match API.
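If a source record only carries address components, one option is to assemble them into a single line of text before calling the Match API. The sketch below assumes component column names (address, city, state, zip); adjust them to your source schema.
def to_single_line(record):
    """Join address components into one line, skipping blank parts."""
    parts = (record.get(key, '') for key in ('address', 'city', 'state', 'zip'))
    return ' '.join(part.strip() for part in parts if part and part.strip())

row = {'address': '123 Main St', 'city': 'Columbus', 'state': 'OH', 'zip': '43215'}
print(to_single_line(row))  # 123 Main St Columbus OH 43215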
PyTools Code
Below is a code snippet that runs the sample file data.csv through both of the above APIs.
import os

from cqazapipytools import *

with cqazapipytools(os.environ['CQAPIKEY']) as cp:
    # setup
    vintage = '202506'
    data = cp.csvRead('data.csv')
    # locate: geographic lookup on latitude/longitude
    locate = cp.locate(
        vintage,
        cp.transformList(data, 'select', ['sourcekey', 'latitude', 'longitude']),
        parceldistancem=1000,
        neardistancem=100,
    )
    cp.csvWrite('data_locate.csv', locate)
    # match: address lookup on the single-line text field
    match = cp.match(vintage, cp.transformList(data, 'select', ['sourcekey', 'text']))
    cp.csvWrite('data_match.csv', match)
Data Analysis (DuckDB)
Create a bridge table by joining the output from each of the processes above back to the source data on sourcekey, carrying forward the relevant fields.
Launch DuckDB in the directory with your data.csv and output files from the API processes above.
duckdb data.duckdb
Load the data into the DuckDB database.
create table data as (select * from 'data.csv');
create table data_locate as (select * from 'data_locate.csv');
create table data_match as (select * from 'data_match.csv');
The queries below will build the bridge table for analysis.
install spatial;
load spatial;
drop table if exists data_bridge;
create table data_bridge as (
select i.sourcekey
,i.latitude source_latitude
,i.longitude source_longitude
,i.text source_text
,case when m.location_id is not null then 1 else 0 end match_true
,case when l.location_id is not null then 1 else 0 end locate_true
,case when l.location_id is not null or m.location_id is not null then 1 else 0 end location_id_true
,case when l.location_id is not null then l.location_id when m.location_id is not null then m.location_id end location_id_selected
,case when l.location_id is not null and m.location_id is not null and l.location_id<>m.location_id then 1 else 0 end location_id_disagreement
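-- placeholders; populated by the update statements below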
,0 location_id_duplicates
,0 location_id_duplicaterank
,st_distance_spheroid(st_point(i.latitude,i.longitude),
case when l.location_id is not null then st_point(l.latitude,l.longitude) when m.location_id is not null then st_point(m.latitude,m.longitude) end
) distancem
,l.location_id locate_location_id
,l.latitude locate_latitude
,l.longitude locate_longitude
,l.matchtype locate_matchtype
,m.location_id match_location_id
,concat_ws(' ',m.address,m.city,m.state,m.zip) match_address
,m.latitude match_latitude
,m.longitude match_longitude
,m.matchtype match_matchtype
,m.matchcount match_count
from data i
left join (
select sourcekey
,location_id
,matchtype
,latitude
,longitude
from data_locate dl
) l on l.sourcekey=i.sourcekey
left join (
select sourcekey
,locations_0_location_id location_id
,matchtype
,matchcount
,locations_0_address address
,locations_0_city city
,locations_0_state state
,locations_0_zip zip
,locations_0_latitude latitude
,locations_0_longitude longitude
from data_match
) m on m.sourcekey=i.sourcekey
);
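-- flag how many source records share each selected location_id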
update data_bridge
set location_id_duplicates=upd.cnt
from (
select location_id_selected
,count(*) cnt
from data_bridge
where location_id_selected is not null
group by 1
having count(*)>1
) upd
where upd.location_id_selected=data_bridge.location_id_selected;
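-- rank records sharing a location_id by distance, so the closest keeps rank 1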
update data_bridge
set location_id_duplicaterank=upd.rnk
from (
select sourcekey
,row_number() over (partition by location_id_selected order by distancem,sourcekey) rnk
from data_bridge
where location_id_selected is not null
) upd
where upd.sourcekey=data_bridge.sourcekey;
Interpreting Results (DuckDB)
Using the data_bridge table, it is now possible to interrogate the results.
If you would like to export the data_bridge table from DuckDB, it can be done using the copy syntax.
copy (select * from data_bridge) to 'data_bridge.csv';
Below are some examples showing how to interrogate the data_bridge results table.
Find source records that have no equivalent in the Fabric.
select *
from data_bridge
where location_id_true=0;
Find duplicative source records, where multiple source records match to a single location_id.
select *
from data_bridge
where location_id_true=1 and location_id_duplicaterank>1;
Find all records where there is no match to a location_id or the match is a duplicate.
select *
from data_bridge
where location_id_true=0 or location_id_duplicaterank>1;