Elfin…what?

About Elfinwood Data Science Blog.

A stand of elfinwood, including lodgepole pine (Pinus contorta ssp. contorta) and subalpine fir (Abies lasiocarpa), in the northern Wind River Range, WY.

Elfinwood is a term used to describe the stunted forests characteristic of most subalpine and alpine regions worldwide. Also referred to as krummholz, these miniature forests occur at the upper altitudinal limits of trees, where environmental conditions, namely extremely cold temperatures and strong winds, impose restrictions on the trees’ physiology and force them to adapt. The most notable adaptation is the change in growth form from tall and erect to short and squat, or even prostrate. As a vegetation ecologist, I see elfinwood as a physical manifestation of ecology, the branch of biology that deals with the relationship of organisms to one another and the environment. When I see a stand of elfinwood, it tells me something about the environment (that I’m at the transition zone between subalpine and alpine physiography) and about the plant species that I’m likely to encounter.

In the digital age in which we now live, data management and analysis in the natural sciences have gone through a transition of their own. Over the past two decades, computational power and connectivity have increased exponentially, and the term “data science” was coined.

Data science has been defined in various ways; the following definition by Irizarry (2020) sums it up nicely:

“Data science is an umbrella term to describe the entire complex and multistep processes used to extract value from data.”

I interpret the term “value” here to refer to data-driven, actionable insights or improvements in the understanding of a system, gained by analyzing data and synthesizing the results.

Irizarry (2020) lists 3 areas of expertise:

  • Data Engineer: Deals with the hardware, efficient computing, and data storage infrastructure.
  • Data Science Software Developer: Develops data science software.
  • Data Analyst: Analyzes and explores data, fits models, applies machine learning algorithms, and presents the results.

I propose a fourth area of expertise, that of Data Steward. Data stewards oversee all elements of data management, from preparing a data management plan, to developing and maintaining the database infrastructure, to overseeing data quality assurance and quality control (QAQC) and archiving. Going forward, I’ll refer to the four groups above collectively as “data scientists”.

It is my experience that the role of the data steward, and the importance of properly managing data in the environmental sciences, have been undervalued until relatively recently. Our colleges and universities do a good job of teaching us the philosophies that underlie our scientific disciplines, the field methods and protocols necessary for conducting field surveys, and the statistical underpinnings and data analysis techniques necessary to plan an effective study design and to summarize and synthesize our data. However, it seems that rarely, if ever, are we taught how to properly manage and curate the field data that we collect. Perhaps this is in part because data management isn’t sexy. Field work is sexy, and analyzing data and publishing the results in scientific journals is sexy, but data management…definitely not.

Additionally, with the vast computational power available to us today, data scientists are capable of rapidly analyzing massive amounts of data. Take, for instance, Google Earth Engine (GEE), Google’s cloud computing geospatial software that combines petabytes (i.e., thousands of terabytes) of satellite and geospatial imagery with planetary-scale analysis capabilities, all freely available on the web. Google Earth Engine gives just about anyone who’s willing to learn a little JavaScript the power to perform spatial analysis at the scale of the entire globe. To fully benefit from these advances in technology, it’s more imperative than ever that we properly manage and share our data. To facilitate this, at a minimum we need to coordinate our database schemas and domain lists within, and between, our respective disciplines.

The objective of this blog is to help improve data management in general, with examples given from the biological and environmental sciences. To this end, I’ll present an introduction to data management and a database schema model, with the intent of moving us towards a more integrated approach to data management. The material presented will be equivalent to that of a graduate-level course in data management. Here is a preliminary list of topics that I plan to cover:

  • Why data management matters
  • The Data Pipeline
  • Preparing a data management plan
  • Recording data in the field
  • Field data management
  • Data management software
  • Version control
  • PostgreSQL: An Overview
  • Data types
  • Schema models
  • Data tables
  • Data columns
  • Reference tables and referential integrity
  • Superplot/plot/subplot concepts
  • Single visit sample units
  • Multiple visit sample units
  • Spatial data
  • Missing data
  • Managing voucher specimen and lab sample data
  • Data completeness tiers and minimum dataset
  • Database views
  • Metadata

Next Time on Elfinwood Data Science Blog

In the next post I’ll discuss several misconceptions that contribute to the undervaluing and poor execution of data management.

Literature Cited

Irizarry, R. A. (2020). The Role of Academia in Data Science Education. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.dd363929

Copyright © 2020, Aaron Wells
