This page documents how to hash data that is already in Postgres.

Enable the uuid-ossp extension in Postgres to allow random UUID generation. Run the command below at the psql prompt:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
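
Once the extension is installed, you can confirm that UUID generation works:

SELECT uuid_generate_v4();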

Run the script below to create a function that generates a random number within a range:

random_between function
-- Returns a random integer in the inclusive range [low, high].
CREATE FUNCTION random_between(low integer, high integer)
  RETURNS integer
STRICT
LANGUAGE plpgsql
AS $$
BEGIN
   RETURN floor(random() * (high - low + 1) + low);
END;
$$;
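
For example, the following returns a random integer between 1 and 10 inclusive:

SELECT random_between(1, 10);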


Hashing data from CouchDB


Use the couch2pg tool to import the CouchDB data into Postgres as a single table.

Refer to the page Data migration from CouchDb for details on how to import data from CouchDB into Postgres.

Note: Skip steps 7 and 8 on that page, since we want to hash the data before migrating it to the Postgres model.

To hash the data, follow the steps below in order. Before you start, the couchdb staging table that will hold the data to be hashed must already exist in the OpenSRP Postgres database.

  1. Check out the repository https://github.com/OpenSRP/path-zambia-etl
  2. The hashing scripts are in the directory hashing_data
  3. Run the script hash_couchdb_doc_id_and_rev.sql
    This replaces each CouchDB document id with a random UUID and sets the revision to a constant so that it is not possible to identify the data. The script runs on the couchdb staging table and must be run before any data migration scripts (see the first sketch after this list).
  4. Run the data migration scripts to transform the data into the correct domain tables in the Postgres model. This is step 7 on the page Data migration from CouchDb
  5. We used a random generator to produce 100,000 random person names, which are assigned randomly when hashing the data. The random names are in the file person_names.csv; you can use your own file.
  6. Import the random names used to hash the clients. Run the script person_names.sql to load the CSV file of random names into a database table. It creates a table person_names and populates it with the random person names (see the second sketch after this list). Make sure to update the path of the CSV file; it must be the absolute path on the local computer.
  7. Run the script hash_clients.sql
    This script hashes personal identifiers and attributes, using random UUIDs and the random names imported by the script above.
  8. Run the script update_client_references.sql
    This script updates the client reference (usually baseEntityId) in the dependent documents, e.g. events, alerts, and multimedia, to the random UUID generated in step 7 above.
  9. Clean up the created tables. The scripts above create the tables ids_mapping and person_names in the public schema of the Postgres database being used. Drop these two tables, along with the couchdb staging table, using the commands below:
             DROP TABLE couchdb;
             DROP TABLE ids_mapping;
             DROP TABLE person_names;
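
The exact statements are in hash_couchdb_doc_id_and_rev.sql; as a rough, hypothetical sketch of the idea in step 3 (assuming the couch2pg staging table is couchdb with a single jsonb column doc, that the mapping table is ids_mapping(old_id, new_id), and that the 1-hashed constant is illustrative):

-- Hypothetical sketch only; the real statements are in hash_couchdb_doc_id_and_rev.sql.
-- Map each original CouchDB _id to a fresh random UUID.
CREATE TABLE ids_mapping AS
SELECT doc->>'_id' AS old_id,
       uuid_generate_v4()::text AS new_id
FROM couchdb;

-- Replace _id with the new UUID and pin _rev to a constant value.
UPDATE couchdb c
SET doc = jsonb_set(
            jsonb_set(c.doc, '{_id}', to_jsonb(m.new_id)),
            '{_rev}', '"1-hashed"')
FROM ids_mapping m
WHERE c.doc->>'_id' = m.old_id;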
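
Similarly, person_names.sql (step 6) boils down to a CSV import. A minimal sketch, assuming a two-column CSV with no header row (the column names are illustrative):

-- Hypothetical sketch only; the real statements are in person_names.sql.
CREATE TABLE person_names (
    first_name text,
    last_name  text
);
-- Use the absolute path of the CSV file on the local computer.
COPY person_names FROM '/absolute/path/to/person_names.csv' WITH (FORMAT csv);

hash_clients.sql can then pair random_between with this table to pick a replacement name, along these lines:

-- Pick one of the 100,000 imported names at random (zero-based offset).
SELECT first_name, last_name
FROM person_names
OFFSET random_between(0, 99999)
LIMIT 1;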

Hashing data already in the Postgres model

To hash data that is already in the Postgres model, modify the script hash_couchdb_doc_id_and_rev.sql so that it hashes the document id and revision in both the client and client_metadata tables, then follow steps 5-9 above. A hedged sketch of this adaptation follows.
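
As an illustration only (the actual table and column names depend on the OpenSRP schema; client.json, client_metadata.document_id, and the 1-hashed constant below are assumptions), the adaptation might look like:

-- Hypothetical adaptation; verify table and column names against your schema.
-- Assumes client.json holds the document, client_metadata.document_id mirrors
-- the CouchDB _id, and ids_mapping was built as in the staging sketch above.
UPDATE client c
SET json = jsonb_set(
             jsonb_set(c.json, '{_id}', to_jsonb(m.new_id)),
             '{_rev}', '"1-hashed"')
FROM ids_mapping m
WHERE c.json->>'_id' = m.old_id;

UPDATE client_metadata cm
SET document_id = m.new_id
FROM ids_mapping m
WHERE cm.document_id = m.old_id;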
