Databases for Field Biologists, pt. I
I took two full semesters' worth of graduate level statistics courses during my tenure in grad school. I was intending to pursue a concurrent Master's in Statistics with my doctorate, but ran out of funding before I was able to complete the additional course requirements. However, during all of the time I spent in statistics classes, I was never introduced to SQL or relational databases. My understanding is that the business analytics program in the Statistics Department at my alma mater does have some introduction to SQL and databases, which is great as I consider that program more along the lines of a technical degree, and the Master's in Statistics more of a purely academic degree. The purpose of this post is to essentially provide a primer and some good training resources that I wish I had found once I entered full-time employment and was thrown into the deep end of SQL and databases. I am still learning and privileged to work with talented SQL writers who still help me with my coding today, but hopefully this helps someone who is in the same boat I was two years ago: deep knowledge of what to do with the data once I had it, expertise in the statistical tools and methods available to answer questions, fluent in R and/or Python, but no SQL experience.Why databases?
Say you are a graduate student in the life sciences like I was. If you are very lucky, your advisor is the steward of a long-term study propagated by a stream of M.S. and doctoral candidates. These projects are usually thoroughly funded, crank out publications, and would be the highlight of department reviews. There would be hundreds of gigabytes of data to work with. In my experience, there are *many* more professors and labs that do not have the benefit of working on a project like this. New assistant professors, striving to achieve tenure, only able to fund a handful of students every couple of years - this, I believe, is the state of affairs for the vast majority of research labs and positions in American research institutions. In the likely event that you receive a research assistantship in one of these smaller labs, you may be tasked with collecting all of your data in the field, manually transcribing the data into no more than a few Excel spreadsheets, and then performing the analyses for your thesis from those handful of flat Excel spreadsheets. This was my experience for all of the projects I lead in my 9.5 year experience in academia. In this situation, the cost in overhead of developing a relational database to store all of your data far exceeds the benefits it would provide.
Once you leave academia, however, you likely will encounter data stored in relational databases. This may be in a public sector position where governmental requirements mandate stringent data integrity and persistence structures (ideally), or in the private sector, where data storage and maintenance practices can mean the perseverance (or loss) of critical business data. Regardless of your future plans, it is beneficial to experience working with databases, and you can easily take the few spreadsheets leftover from you thesis and build your very own local database with only a few short command line commands!
To keep things simple, I will only discuss limited aspects of the full functionality of relational database management systems (RDBMS), both for clarity and so that I don't show the holes in my own understanding. Start where I did for the example I'll reference in this section: say I have three .csv files: one with in-stream physical habitat data, one with GIS land-use data, and one with stream invertebrate community data. All three datasets are linked by a column with some kind of reference to site ID. That column linking the three data sets together relates each data set to the other and in database parlance is called the 'key'. An RDBMS will then allow you to execute SQL (structured query language) code to parse the datasets together or separately, relating them with the key column. In my toy example, you could write something like "SELECT * FROM COMMUNITY WHERE SITE_ID = 'XYZ';", which would return all community data for the site XYZ, or '"SELECT COMMUNITY.*, LAND_USE.* FROM COMMUNITY INNER JOIN LAND_USE ON SITE_ID;" which would return all records in community and land_use as long as the community record has a corresponding site_id in land_use.
Side note: most RDBMS systems provide the benefit of separating the physical storage of the data from the abstract logic manipulating it in 'client-server' relationships. What this means is that the RDBMS also allows 'client' machines (i.e. your local computer) to reach out and manipulate/fetch data stored in a centralized repository called the 'server'. There are additional benefits to separating physical storage and data manipulation logic that are beyond the scope of this post.
Providing RDBMS services is big business, and companies that provide it include Oracle and Microsoft, two of the biggest companies on the planet. Licensing from these companies and building enterprise-level database infrastructure can employ dozens of people at even middling size companies. However, nearly all computers today come with a free, local-only DBMS: SQLite (website) In a later post I will document starting a new SQLite database right on your own computer, but you can see instructions at that website and here. It is a wonderful tool for beginning to experiment with SQL and databases using your own data.
Efficiently parsing your data and linking related datasets together have obvious benefits. In my toy example, it would be trivial to load all three files into an R session and then leverage dplyr code to join the data (coincidentally using logic and terms lifted from SQL!). However, R has a diverse ecosystem of interfaces with relational databases, including the DBI package currently being developed and maintained by RStudio. If you take the time to build a database for your own data, a next step would be to begin exploring the functionality of DBI (docs here).