This page documents how to hash data that is already in Postgres.
Hashing data from CouchDB
Use the couch2pg tool to import CouchDB data into Postgres as a single table.
Refer to the page Data migration from CouchDb for details on how to import data from CouchDB into Postgres.
Note: Skip steps 7 and 8 on that page, since we want to hash the data before migrating it to the Postgres model.
Enable the uuid-ossp extension for random UUID generation in Postgres. Run the command below at the psql prompt:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
Run the script below to create a function that generates a random integer within a range:
CREATE FUNCTION random_between(low integer, high integer)
RETURNS integer
STRICT
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN floor(random() * (high - low + 1) + low);
END;
$$;
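For example, calling the function at the psql prompt returns a random integer in the given inclusive range:

```sql
-- Returns a random integer between 1 and 100 (inclusive).
SELECT random_between(1, 100);
```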
To hash the data, follow the steps below in order. Before starting, the couchdb table that will store the data to be hashed must exist in the OpenSRP Postgres database.
- Check out the repository https://github.com/OpenSRP/path-zambia-etl. The hashing scripts are in the hashing_data directory.
- Run the script hash_couchdb_doc_id_and_rev.sql. This hashes each CouchDB document id to a random UUID and sets the revision to a constant, so that the data cannot be identified. The script runs on the couchdb staging table and must be run before any data migration scripts.
- Run the data migration scripts to transform the data into the correct domain tables of the Postgres model. This is step 7 on the page Data migration from CouchDb.
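The first hashing step, hash_couchdb_doc_id_and_rev.sql, can be pictured roughly as the statement below. This is a sketch only: it assumes couch2pg stored each document as jsonb in a column named doc, which you should verify against the staging table; the actual script in path-zambia-etl is authoritative.

```sql
-- Sketch only: replace the document id with a random UUID and pin the
-- revision to a constant, so neither can be linked back to CouchDB.
-- The column name "doc" is an assumption; check the staging table's schema.
UPDATE couchdb
SET doc = jsonb_set(
        jsonb_set(doc, '{_id}', to_jsonb(uuid_generate_v4()::text)),
        '{_rev}', '"1-hashed"'::jsonb);
```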
- We used a random generator to create 100,000 random person names; these are assigned randomly when hashing the data. The names are in the file person_names.csv, but you can use your own file.
- Import the random names used to hash the clients: run the script person_names.sql to load the CSV file of random names into a database table. It creates a table person_names and populates it with the names. Make sure to update the path to the CSV file; it must be an absolute path on the local computer.
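A minimal version of that import might look like the following; the two-column layout is an assumption made for illustration, so match it to the actual CSV and to person_names.sql:

```sql
-- Create the table that will hold the random replacement names.
-- The column layout here is an assumption; match it to the CSV file.
CREATE TABLE IF NOT EXISTS person_names (first_name text, last_name text);

-- The path must be absolute on the machine running the Postgres server
-- (alternatively, use \copy from psql to read a file on the client machine).
COPY person_names (first_name, last_name)
FROM '/absolute/path/to/person_names.csv' WITH (FORMAT csv, HEADER true);
```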
- Run the script hash_clients.sql. This script hashes personal identifiers and attributes, using random UUIDs and the random names imported by the script above.
- Run the script update_client_references.sql. This script updates the client reference (usually baseEntityId) of dependent documents, e.g. events, alerts, and multimedia, to the random UUID that was generated in step 7 above.
- Clean up the created tables. The scripts above create the tables ids_mapping and person_names in the public schema of the Postgres database being used. Drop these two tables along with the couchdb staging table. Use the commands below:
drop table couchdb;
drop table ids_mapping;
drop table person_names;
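For orientation, the reference update performed earlier by update_client_references.sql has roughly the shape below. Every table and column name here (an event table holding jsonb documents, and ids_mapping with old_id/new_id columns) is an assumption for illustration only; consult the actual script.

```sql
-- Sketch: point each dependent document's baseEntityId at the new UUID
-- recorded in ids_mapping when the corresponding client was hashed.
-- All table/column names here are assumptions, not the script's contents.
UPDATE event e
SET doc = jsonb_set(e.doc, '{baseEntityId}', to_jsonb(m.new_id::text))
FROM ids_mapping m
WHERE e.doc ->> 'baseEntityId' = m.old_id;
```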
Hashing data already in the Postgres model
To hash data that is already in the Postgres model, modify the script hash_couchdb_doc_id_and_rev.sql to hash the document id and revision in both the client table and the client_metadata table.
Then follow steps 5-9 above.
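Under the assumption that both tables store the document as jsonb in a column named json (verify this against the actual OpenSRP schema before running anything), the modified statement would look roughly like:

```sql
-- Sketch only: the table and column layout are assumptions.
UPDATE client
SET json = jsonb_set(
        jsonb_set(json, '{_id}', to_jsonb(uuid_generate_v4()::text)),
        '{_rev}', '"1-hashed"'::jsonb);
-- Repeat the same id/revision rewrite for client_metadata.
```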