About Elfinwood Data Science Blog.
Elfinwood is a term used to describe the stunted forests characteristic of most subalpine and alpine regions worldwide. Also referred to as krummholz, these miniature forests occur at the upper altitudinal limits of trees, where environmental conditions, namely extremely cold temperatures and strong winds, impose restrictions on the trees’ physiology and force them to adapt. The most notable adaptation is the change in growth form from tall and erect to short and squat, or even prostrate. As a vegetation ecologist, I see elfinwood as a physical manifestation of ecology, the branch of biology that deals with the relationship of organisms to one another and to the environment. When I see stands of elfinwood, they tell me something about the environment (that I’m at the transition zone between subalpine and alpine physiography) and about the plant species that I’m likely to encounter.
In the digital age in which we now live, data management and analysis in the natural sciences have gone through a transition of their own. Over the past 2 decades, computational power and connectivity have increased exponentially, and the term “data science” was coined.
Data science has been defined in various ways; the following definition by Irizarry (2020) sums it up nicely:
“Data science is an umbrella term to describe the entire complex and multistep processes used to extract value from data.”
I interpret the term “value” here to refer to data-driven, actionable insights or improvements in the understanding of a system gained by analyzing data and synthesizing the results.
Irizarry (2020) lists 3 areas of expertise:
- Data Engineer: Deals with the hardware, efficient computing, and data storage infrastructure.
- Data Science Software Developer: Develops data science software.
- Data Analyst: Analyzes and explores data, fits models and applies machine learning algorithms, and presents the results.
I propose a fourth area of expertise, that of Data Steward. Data stewards oversee all elements of data management, from preparing a data management plan, to developing and maintaining the database infrastructure, to overseeing data quality assurance and quality control (QA/QC) and archiving. Going forward, I’ll refer to the 4 groups above collectively as “data scientists”.
It is my experience that the role of the data steward and the importance of properly managing data in the environmental sciences have been, until relatively recently, undervalued. Our colleges and universities do a good job of teaching us the philosophies that underlie our scientific disciplines, the field methods and protocols necessary for conducting field surveys, and the statistical underpinnings and data analysis techniques necessary to plan an effective study design and to summarize and synthesize our data. However, it seems that rarely, if ever, are we taught how to properly manage and curate the field data that we collect. Perhaps this is in part because data management isn’t sexy. Field work is sexy, and analyzing data and publishing the results in scientific journals is sexy, but data management… definitely not.
The objective of this blog is to help improve data management in general, with examples drawn from the biological and environmental sciences. To this end, I’ll present an introduction to data management and a database schema model, with the intent of moving us towards a more integrated approach to data management. The materials presented will be equivalent to those of a graduate-level course in data management. Here is a preliminary list of topics that I plan to cover:
- Why data management matters
- The Data Pipeline
- Preparing a data management plan
- Recording data in the field
- Field data management
- Data management software
- Version control
- PostgreSQL: An Overview
- Data types
- Schema models
- Data tables
- Data columns
- Reference tables and referential integrity
- Superplot/plot/subplot concepts
- Single visit sample units
- Multiple visit sample units
- Spatial data
- Missing data
- Managing voucher specimen and lab sample data
- Data completeness tiers and minimum dataset
- Database views
Next Time on Elfinwood Data Science Blog
In the next post I’ll discuss several misconceptions that contribute to the undervaluing and poor execution of data management.
Irizarry, R. A. (2020). The Role of Academia in Data Science Education. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.dd363929
Copyright © 2020, Aaron Wells