BIOPHICS clean-up


Responsible team member: Pierre Dane
Current team member:
Status: IN PROGRESS
Targeted release date: 2020-02-28
Scoping complete date: 2020-01-31


Jira issue: RVL-803
LOE:
Priority: 2

Status

#	Step	Status
1	Responsible prepares rough business analysis/reqs doc	
2	Get CHAI review	
3	Get Akros review	
4	Get Ona feedback	
5	Responsible - 1 iteration for feedback	
6	Ona sign-off	
7	Ona tech spec scoping	
8	Ona LOE	
9	Ona scheduling	

Requirements

  • Validate that all records added or updated in the Android app are visible in the Biophics database (or other data warehouse)

    • Currently, household members and household updates are not reflected

    • Ensure that events, locations, clients, plans and tasks are all streamed to the server

  • Ensure that all fields from each form are present in Biophics and columns are named correctly

    • Check column names

    • Check completeness

TBD: Need confirmation from Ona that streaming of entities to the warehouse (Biophics or other) is the agreed path forward
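The column-name and completeness checks above can be sketched as a simple set comparison between the fields defined on a form and the columns present in the corresponding Biophics table. The field and column names below are hypothetical, standing in for the real form definitions and warehouse schema.

```python
def check_columns(form_fields, warehouse_columns):
    """Return form fields missing from the warehouse and unexpected extra columns."""
    form = set(form_fields)
    warehouse = set(warehouse_columns)
    return {
        "missing": sorted(form - warehouse),  # on the form, absent in Biophics
        "extra": sorted(warehouse - form),    # in Biophics, not on the form
    }

# Hypothetical example: a family-registration form vs its Biophics table
result = check_columns(
    ["first_name", "last_name", "dob", "national_id"],
    ["first_name", "last_name", "dob"],
)
print(result)  # {'missing': ['national_id'], 'extra': []}
```

Running this per form against the warehouse schema would turn the manual "check column names / check completeness" steps into a repeatable report.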

Proposed Solutions

The solution selected for Biophics should be generalizable to the Postgres data warehouse, and should be the same solution used for RVL-789.

  • Raw json data from the OpenSRP server should be made available to the Biophics server
  • These can be polled using Nifi and posted to SQL Server in a similar fashion to how the data is currently being sent, but without any processing on the Nifi side - Nifi would act as a simple pipe streaming entities
  • ENTITIES:
    • Clients (Families and Family Members)
    • Locations
    • Structures
    • Plans
    • Tasks
    • Events
    • Organisations
    • Providers
  • PAYLOAD:
    • The payload could be an exact representation of the entities in JSON as returned by the existing API queries e.g. /rest/plan, /rest/event, /rest/task
      • TODO: APIs may need to be updated to provide ALL entities, including those that have been archived or voided.
      • TODO: Decision logic may need to be replicated in the data warehouse e.g. which event is displayed for a task if 2 events for that task have been submitted
    • The payload could be packaged in a format that is easier for the relational database to consume - e.g. CSV
    • The payload could include all related entities necessary (e.g. the sync could be based on LOCATIONS and for any location that has had an action, all related entities could be packaged)
    • The payload SHOULD be optimized for size if possible, e.g. compressed, and possibly packaged as Protocol Buffers, Thrift, or Avro messages
    • The payload could be batches of records or single records
  • DESERIALIZATION
    • Deserialization will depend on the structure of the payload
    • Deserialization of JSON is described below in the Assumed Solution
  • PIPELINE
    • The pipeline could be a pub-sub append-only log, e.g. Kafka (easy to get up and running quickly for local testing)
    • The pipeline could be a polling service e.g. Nifi 
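The payload-optimization option above (batched records, compressed on the wire) can be sketched as a pair of inverse functions: the OpenSRP side packages a batch of entities as gzipped JSON, and the warehouse side unpacks it. The entity shapes are illustrative only, not the actual OpenSRP schemas.

```python
import gzip
import json

def package_batch(entities):
    """Serialize a batch of entities to compressed bytes for the pipe."""
    raw = json.dumps(entities).encode("utf-8")
    return gzip.compress(raw)

def unpackage_batch(blob):
    """Inverse of package_batch, as the warehouse side would run it."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

# Hypothetical batch mixing entity types
batch = [
    {"type": "Task", "id": "t-1", "status": "Completed"},
    {"type": "Event", "id": "e-9", "eventType": "blood_screening"},
]
blob = package_batch(batch)
assert unpackage_batch(blob) == batch  # lossless round trip
```

The same round-trip shape would hold if JSON-plus-gzip were later swapped for Protocol Buffers or Avro; only the two functions change, not the pipe.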

Assumed Solution:

  • JSON Entities will arrive in SQL Server via Nifi
  • JSON will be parsed using SQL Server's JSON query functions and inserted into raw relational tables (e.g. event, event_details, event_obs for the event entities)
  • The raw tables will be queried and relevant logic applied to populate HLD tables specific to the entity type and sub-type (e.g. blood screening vs family registration events) - these will be the 'materialized views'
  • These views should be partitioned by date (monthly partitions suggested) where there are large amounts of data, in both Postgres and SQL Server. Partitioning enables faster querying, as well as vastly improved repopulation of the 'materialized views' when data changes, since partitions can be repopulated individually (deleting a partition is also almost instant)

     Example SQL code for deserialization:

-- Flatten the JSON 'details' object of each valid event row into columns
select *
from raw_data
cross apply openjson(json_query(event_json, '$.details')) with (
	taskStatus   varchar(500)     '$.taskStatus',
	locationUUID uniqueidentifier '$.locationUUID'
) as details
where isjson(event_json) = 1
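The same deserialization can be expressed in Python for testing outside SQL Server: skip rows that are not valid JSON (the ISJSON filter), then pull the nested details object apart into flat columns. The sample row below is hypothetical; the extracted field names mirror the SQL above.

```python
import json

def deserialize_events(rows):
    """Flatten the 'details' object of each valid event JSON string."""
    out = []
    for event_json in rows:
        try:
            event = json.loads(event_json)
        except ValueError:
            continue  # equivalent of WHERE ISJSON(event_json) = 1
        details = event.get("details", {})
        out.append({
            "taskStatus": details.get("taskStatus"),
            "locationUUID": details.get("locationUUID"),
        })
    return out

rows = [
    '{"details": {"taskStatus": "READY", "locationUUID": "abc-123"}}',
    'not json at all',  # filtered out, like ISJSON would do
]
print(deserialize_events(rows))
# [{'taskStatus': 'READY', 'locationUUID': 'abc-123'}]
```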

Known Issues

  • Deletes and voids may not be easily available
  • Some raw event entities arriving in Biophics are not valid JSON e.g. the obs array in the Structure events (assume this is to do with character escaping when inserting into SQL as the events are correct in the API)

    "obs": [{
    "fieldType": "formsubmissionField",
    "fieldDataType": "text",
    "fieldCode": "structure",
    "parentCode": "",
    "values": ["

    Unknown macro: {" type "}

    "
    ]
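One way to contain the invalid-JSON issue noted above is to validate each payload before insert and divert failures to a quarantine list for inspection, rather than letting mangled obs arrays land in the raw tables. This is a sketch of that guard, not an existing pipeline component; the payloads are illustrative.

```python
import json

def partition_valid(payloads):
    """Split payload strings into (valid JSON, quarantined) lists."""
    valid, quarantined = [], []
    for p in payloads:
        try:
            json.loads(p)
            valid.append(p)
        except ValueError:
            quarantined.append(p)
    return valid, quarantined

good, bad = partition_valid([
    '{"obs": [{"fieldCode": "structure"}]}',
    '{"obs": [{"values": ["{" type "}"]}]}',  # unescaped quotes, like the example above
])
print(len(good), len(bad))  # 1 1
```

The quarantined payloads can then be compared against the API (where the source notes the events are correct) to pin down where the escaping is lost.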