On March 13, 2018, Carrie Hritz, Associate Director of Research, and Mary Shelley, Associate Director of Synthesis, submitted a response to a Request for Information from the Chief Statistician of the United States and the Statistical and Science Policy Branch (SSP) in the US Office of Management and Budget (OMB). The request sought information on using new technologies and methodologies for combining data from multiple sources. The goal of collecting such information was to inform the adoption of revised statistical standards for the federal use of combined data, including the production of principal key economic indicators and demographic statistical products.
SESYNC’s complete response can be found below.
“Based on our experience synthesizing diverse scientific and government data sets, SESYNC (www.sesync.org) offers the following perspective in response to OMB’s Request for Information in the January 12, 2018 Federal Register. Many specific actions will be required under a larger framework for modernizing the federal statistical system. As the Chief Statistician and SSP seek to establish priorities and coordinate research efforts across the Federal Statistical System to improve federal statistics, they should centralize agency metadata, develop inter-agency use cases, and examine existing models of data integration within the federal government. These efforts will be facilitated by a number of specific recommendations related to the first four areas in the request for information.
Central Repository for Agency Metadata
As a first step toward tackling the substantial challenges presented in the NAS reports, the SSP should direct the creation of a centralized system for metadata storage and retrieval across the statistical agencies. For federal statistical datasets, the system should collect, store, and disseminate information about which data sets are available within the government and their limitations on use. Ideally such a system would not only identify data sets, but would also provide units of analysis, field names, methods of collection, points of contact, and caveats for use for other federal users. There would be few to no problems with privacy at this stage since only metadata would be centralized, not the data themselves. This system would assist federal users at the various statistical agencies in knowing which other agencies might possess data relevant to their mission.
To prove the utility of such a system, the SSP should solicit “use cases” from the statistical agencies, i.e., they should ask, “What are examples of questions which your agency cannot answer now or which will be increasingly hard to answer given emerging challenges, such as declining survey participation and increased need for timeliness?” The SSP should then take an active role in collating these use cases and connecting agencies which may have complementary data. A central metadata repository would be critical to facilitating these connections and outlining plans for data synthesis that could address the use cases in question.
Existing Models within the Federal Government
The SSP should look to existing models of inter- and intra-agency data sharing in guiding the coordination of datasets and operationalizing the recommendations from the 2017 NAS report. For example, the establishment of standards, coordination, dissemination, and identification of priorities in federal geospatial data is enhanced by the efforts of the Federal Geographic Data Committee (FGDC, https://www.fgdc.gov/nsdi). The FGDC, composed of representatives of relevant data-holding agencies and the committee staff, advances the National Spatial Data Infrastructure initiative, establishing standards for geospatial metadata, data curation, and security. Since its establishment in accordance with OMB Circular A-16, the committee has coordinated efforts between geospatial professionals across agencies with collaborating partners including state, Tribal, and local governments; academic institutions; and a broad array of private sector geographic, statistical, demographic, and other business information providers. Data are disseminated, where possible, via the GeoPlatform web platform (https://www.geoplatform.gov), with pointers to data.gov for longer-term storage. Lessons learned from this existing effort to coordinate across agencies, private industry, and academia should be considered in the formation of any new efforts by the SSP to reduce inefficiencies and redundancies in federal data management. Other existing federal efforts, such as the Homeland Infrastructure Foundation-Level Data working group, can provide lessons regarding the integration of state and local level data and the challenges of integrating sensitive datasets.
Specific Recommendations for Priority Areas
(1) current and emerging techniques for linking and analyzing combined data;
- Creating a single standardized vocabulary and/or data ontology for the Federal Statistical System could greatly improve users' ability to link data across agencies and to be confident that the measures and the intended use of those different data products are commensurate.
- Supplying data products that are refactored to a variety of spatial resolutions would greatly help data integration. For example, if the Community and Demographics data in a product like the EnviroAtlas were refactored to the HUC12 scale (the scale at which the biophysical data are represented), this would lower the barrier to entry for data integration.
(2) on-going research on methods to describe the quality of statistical products that result from these techniques;
- Providing clear and concise methodologies that outline sampling weights for small data sets, so that they can be made nationally representative based on several demographic and/or geographic characteristics, will be helpful for researchers conducting large-scale trend analyses. Some of the ongoing work out of HHS focused on small-area estimates for various health trends seems to be refining these weighting approaches.
(3) computational frameworks and systems for conducting such work;
- When possible, move controlled access to data (like the Federal Statistical Research Data Centers) to the cloud and a virtual restricted environment, rather than requiring users to travel to a physical center. The increasing use of distributed management of sensitive data provides an opportunity to update this infrastructure.
(4) privacy or confidentiality issues that may arise from combining such data;
- The methodology and approach used by the USAID Demographic and Health Survey (DHS) data sets for geographic displacement of household-level data are well articulated for the end user and could be a model for other government agencies.
Ideally, these endeavors would culminate in a system based on a shared ontology that enables the integration and interoperability of the diverse datasets collected across agencies and assists stakeholders and other users of federal data. SESYNC has initiated an effort to accomplish a similar goal, using open data from the federal government to develop a platform that facilitates data discovery and motivates synthesis in the Food-Water-Energy nexus by integrating data from a diverse set of qualitative and quantitative sources.”