Shared Analytics Across Secure and Unsecured Networks
Next week I will be attending an IEEE working group meeting on IEEE P2795. This standard identifies the requirements for shared analytics over secured and unsecured networks. It establishes a consistent method, built on an overarching interoperability framework, for drawing on one or more disparate data systems for analytic purposes without the analytic user having explicit access to, or sharing, the data within those systems.
This standard allows for a high-assurance method of sharing access to information for analysis without moving data beyond firewall protection. It facilitates sharing virtual access to aggregate data without the need for direct access to personal health information (PHI), personally identifiable information (PII), or other sensitive data. The standard supports a scenario in which one entity (an institution and/or technology) with an analytic capability wishes to analyze data stored at another entity (institution and/or technology) within appropriate cybersecurity constraints.
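To make that scenario concrete, here is a minimal sketch in Python of the pattern the standard describes: the analytic travels to the entity holding the data, executes inside that entity's security boundary, and only an aggregate result, never row-level PHI or PII, crosses back to the analyst. The names and the example analytic are illustrative assumptions, not part of IEEE P2795.

from statistics import mean

class DataHolder:
    """An institution that keeps its records behind its own firewall."""

    def __init__(self, records):
        self._records = records  # raw records never leave this object

    def run_analytic(self, analytic):
        """Execute a vetted analytic locally; return only its aggregate."""
        return analytic(self._records)

def mean_age(records):
    """An example analytic: reduces the records to a single aggregate."""
    return mean(r["age"] for r in records)

hospital = DataHolder([{"age": 34}, {"age": 51}, {"age": 47}])
print(hospital.run_analytic(mean_age))  # analyst sees 44.0, never the rows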
Defining the Problem
While this standard deals directly with PHI and PII in healthcare data, the solutions being explored here are applicable to data science problems in the ODNI agencies: how can we enable data science/analytics on data and still protect it? Metadata tagging (information about the data) is crucial to solving the data lake security problem and is the linchpin of effective data management throughout the data science/analytics lifecycle.
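As a minimal illustration of metadata tagging, the sketch below attaches a tag describing sensitivity and provenance to each data object so that downstream tooling can manage it across the lifecycle. The specific fields are assumptions for the example, not a prescribed schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class MetadataTag:
    classification: str   # e.g. "UNCLASSIFIED", "SECRET"
    contains_pii: bool    # does the object hold personally identifiable info?
    source_system: str    # which silo the object originated from

@dataclass(frozen=True)
class TaggedObject:
    payload: bytes
    tag: MetadataTag

obj = TaggedObject(
    payload=b"raw record bytes",
    tag=MetadataTag("SECRET", contains_pii=True, source_system="silo-7"),
)
print(obj.tag.classification)  # tooling routes and protects based on the tag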
Data Science/Analytics has changed the way we use data. Effective use of a data science platform is entirely dependent on access to data. One of the core ideas behind the practice of data science is unification. Smaller insights are aggregated into larger patterns, shedding light on opportunities to solve a problem. When structural barriers exist, even the most sophisticated algorithms will reach an impasse. No greater barrier to analysis exists than data silos.
Currently, the most sensitive data in the intelligence community is protected by physical and logical boundaries, with classification applied at the silo level. This arrangement inhibits functional data science, which needs access to the entire data estate. Combining data with different security levels creates another challenge: placing data that has a higher security level in a pool with data that has a lower security level. To enable the pooling of data with different classification levels, a more precise way to control security and classification at the object and file level is needed. Each file or object will need to have a security level applied to it, and each object or file will need to be able to “self-protect”, automatically enforcing its security requirements. There are petabytes of data within the ODNI’s control. Only when each object is capable of self-protecting, so that data sources can be pooled, will large-scale insights be unlocked by the talented data scientists working on behalf of the intelligence community.
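The sketch below illustrates what object-level self-protection could look like: each object carries its own classification and refuses to release its payload to an insufficiently cleared requester, so objects of mixed classification can sit in one pool. The clearance ordering and class names are illustrative assumptions, not an ODNI specification.

LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}

class SelfProtectingObject:
    """A data object that enforces its own security requirement."""

    def __init__(self, payload, classification):
        self._payload = payload
        self.classification = classification

    def read(self, clearance):
        """Release the payload only to a sufficiently cleared requester."""
        if LEVELS[clearance] < LEVELS[self.classification]:
            raise PermissionError(f"{self.classification} object denied "
                                  f"to {clearance} requester")
        return self._payload

pool = [
    SelfProtectingObject("routine report", "UNCLASSIFIED"),
    SelfProtectingObject("sensitive report", "SECRET"),
]
# A SECRET-cleared analyst can work across the whole pool; each object
# enforces its own requirement instead of relying on a silo boundary.
print([obj.read("SECRET") for obj in pool])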
Learn More
The graphic for this article comes from the Asimov Institute site, on a page called the “Neural Network Zoo”. The graphic describes a Boltzmann machine (BM). Some neurons are marked as input neurons and others remain “hidden”; the input neurons become output neurons at the end of a full network update. The network starts with random weights and learns through back-propagation, or more recently through contrastive divergence (a Markov chain is used to determine the gradients between two informational gains). Compared to a Hopfield network (HN), the neurons mostly have binary activation patterns. As the Markov chain training hints, BMs are stochastic networks. The training and running process of a BM is fairly similar to an HN: one sets the input neurons to certain clamped values, after which the network is set free (it doesn’t get a sock). While free, the cells can take any value, and we repetitively go back and forth between the input and hidden neurons. Activation is controlled by a global temperature value; lowering it lowers the energy of the cells, and this lower energy causes their activation patterns to stabilise. The network reaches an equilibrium given the right temperature.
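The clamped phase of those dynamics is easy to sketch in code. The toy below uses untrained random weights, so it illustrates the sampling dynamics only, not the contrastive divergence learning rule: it clamps the input neurons, repeatedly updates the hidden neurons with stochastic binary activations, and lowers a global temperature so the activation pattern stabilises.

import math
import random

random.seed(0)

N = 6                                  # six binary units: 3 visible, 3 hidden
W = [[0.0] * N for _ in range(N)]      # symmetric weights, no self-connections
for i in range(N):
    for j in range(i + 1, N):
        W[i][j] = W[j][i] = random.gauss(0, 0.5)

state = [random.choice([0, 1]) for _ in range(N)]
state[:3] = [1, 0, 1]                  # clamp the input (visible) neurons

T = 2.0                                # global temperature
for step in range(200):
    i = random.randrange(3, N)         # update only the unclamped hidden units
    net = sum(W[i][j] * state[j] for j in range(N))
    p_on = 1.0 / (1.0 + math.exp(-net / T))   # stochastic binary activation
    state[i] = 1 if random.random() < p_on else 0
    T = max(0.1, T * 0.99)             # lowering T lets the pattern stabilise

print(state)                           # an approximate equilibrium configuration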
The original paper:
Hinton, Geoffrey E., and Terrence J. Sejnowski. “Learning and relearning in Boltzmann machines.” Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1 (1986): 282-317.