Sunday, February 9, 2020

Create your own database for fun and profit

Databases for Field Biologists, pt. I

I took two full semesters' worth of graduate-level statistics courses during my time in grad school. I intended to pursue a concurrent Master's in Statistics alongside my doctorate, but ran out of funding before I was able to complete the additional course requirements. However, in all of the time I spent in statistics classes, I was never introduced to SQL or relational databases. My understanding is that the business analytics program in the Statistics Department at my alma mater does include some introduction to SQL and databases, which is great, as I consider that program more along the lines of a technical degree and the Master's in Statistics a more purely academic one. The purpose of this post is to provide the primer, and some good training resources, that I wish I had found when I entered full-time employment and was thrown into the deep end of SQL and databases. I am still learning, and privileged to work with talented SQL writers who still help me with my coding today, but hopefully this helps someone who is in the same boat I was in two years ago: deep knowledge of what to do with the data once I had it, expertise in the statistical tools and methods available to answer questions, fluency in R and/or Python, but no SQL experience.

Why databases?

Say you are a graduate student in the life sciences like I was. If you are very lucky, your advisor is the steward of a long-term study perpetuated by a stream of M.S. and doctoral candidates. These projects are usually well funded, crank out publications, and are the highlight of department reviews, with hundreds of gigabytes of data to work with. In my experience, there are *many* more professors and labs that do not have the benefit of working on a project like this. New assistant professors striving to achieve tenure, able to fund only a handful of students every couple of years - this, I believe, is the state of affairs for the vast majority of research labs and positions in American research institutions. In the likely event that you receive a research assistantship in one of these smaller labs, you may be tasked with collecting all of your data in the field, manually transcribing the data into no more than a few Excel spreadsheets, and then performing the analyses for your thesis from that handful of flat spreadsheets. This was my experience for all of the projects I led in my 9.5 years in academia. In that situation, the overhead cost of developing a relational database to store all of your data far exceeds the benefits it would provide.

Once you leave academia, however, you will likely encounter data stored in relational databases. This may be in a public-sector position, where governmental requirements (ideally) mandate stringent data integrity and persistence structures, or in the private sector, where data storage and maintenance practices can mean the preservation (or loss) of critical business data. Regardless of your future plans, it is beneficial to get experience working with databases, and you can easily take the few spreadsheets left over from your thesis and build your very own local database with just a few short commands at the command line!

To keep things simple, I will only discuss limited aspects of the full functionality of relational database management systems (RDBMS), both for clarity and so that I don't expose the holes in my own understanding. Start where I did with the example I'll reference throughout this section: say I have three .csv files: one with in-stream physical habitat data, one with GIS land-use data, and one with stream invertebrate community data. All three datasets are linked by a column holding some kind of site ID. That column relates each dataset to the others, and in database parlance it is called the 'key'. An RDBMS then allows you to execute SQL (Structured Query Language) code to query the datasets together or separately, relating them with the key column. In my toy example, you could write something like "SELECT * FROM COMMUNITY WHERE SITE_ID = 'XYZ';", which would return all community data for site XYZ, or "SELECT COMMUNITY.*, LAND_USE.* FROM COMMUNITY INNER JOIN LAND_USE ON COMMUNITY.SITE_ID = LAND_USE.SITE_ID;", which would return each community record paired with its land-use record, for every site ID that appears in both tables.
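
To make the key concept concrete, here is a sketch of what miniature versions of the three datasets might look like as R data frames. All of the column names and values here are my invention for illustration; SITE_ID is the key that relates the three.

    # Toy versions of the three datasets; names and values are invented
    habitat   <- data.frame(SITE_ID      = c("XYZ", "ABC"),
                            MEAN_DEPTH_M = c(0.4, 1.1))
    land_use  <- data.frame(SITE_ID    = c("XYZ", "ABC"),
                            PCT_FOREST = c(62, 18))
    community <- data.frame(SITE_ID = c("XYZ", "XYZ", "ABC"),
                            TAXON   = c("Baetis", "Hydropsyche", "Chironomus"),
                            COUNT   = c(14, 6, 31))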

Side note: most RDBMSs provide the benefit of separating the physical storage of data from the abstract logic manipulating it, via a 'client-server' relationship. What this means is that the RDBMS allows 'client' machines (e.g., your local computer) to reach out and fetch or manipulate data stored in a centralized repository called the 'server'. There are additional benefits to separating physical storage from data-manipulation logic that are beyond the scope of this post.

Providing RDBMS services is big business, and the companies that provide them include Oracle and Microsoft, two of the biggest companies on the planet. Licensing from these companies and building enterprise-level database infrastructure can employ dozens of people at even middling-sized companies. However, nearly all computers today come with a free, local-only DBMS: SQLite. In a later post I will document starting a new SQLite database right on your own computer, but you can find instructions on the SQLite website. It is a wonderful tool for beginning to experiment with SQL and databases using your own data.
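
As a small preview of that later post, here is a minimal sketch of building a local SQLite database from the toy example in R, using the RSQLite driver (which plugs into the DBI interface I mention below). The database file name, the .csv file names, and the table names are all my invention:

    library(DBI)       # generic database interface
    library(RSQLite)   # SQLite driver for DBI

    # Creates stream_study.sqlite in the working directory if it doesn't exist
    con <- dbConnect(RSQLite::SQLite(), "stream_study.sqlite")

    # Read each (hypothetical) .csv and write it to the database as a table
    dbWriteTable(con, "HABITAT",   read.csv("habitat.csv"),   overwrite = TRUE)
    dbWriteTable(con, "LAND_USE",  read.csv("land_use.csv"),  overwrite = TRUE)
    dbWriteTable(con, "COMMUNITY", read.csv("community.csv"), overwrite = TRUE)

    dbListTables(con)  # should list the three new tables
    dbDisconnect(con)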

Efficiently querying your data and linking related datasets together have obvious benefits. In my toy example, it would be trivial to load all three files into an R session and then leverage dplyr code to join the data (using logic and terms deliberately lifted from SQL!). R also has a diverse ecosystem of interfaces to relational databases, including the DBI package currently being developed and maintained by RStudio. If you take the time to build a database for your own data, a natural next step would be to begin exploring the functionality of DBI (see the package documentation).
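
As a rough sketch of what that exploration might look like, assuming the toy SQLite database built above (and with the dbplyr backend installed so dplyr can talk to the connection), you can either send SQL directly or let dplyr generate it for you:

    library(DBI)
    library(dplyr)   # dbplyr must also be installed for tbl() on a connection

    con <- dbConnect(RSQLite::SQLite(), "stream_study.sqlite")

    # Send SQL directly and get a data.frame back
    xyz <- dbGetQuery(con, "SELECT * FROM COMMUNITY WHERE SITE_ID = 'XYZ'")

    # Or let dplyr translate familiar verbs into SQL behind the scenes
    joined <- tbl(con, "COMMUNITY") %>%
      inner_join(tbl(con, "LAND_USE"), by = "SITE_ID") %>%
      collect()  # executes the query and pulls the result into R

    dbDisconnect(con)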

Monday, February 3, 2020

Restart

Beginning Again

My last post to this blog was in October of 2016. I began posting to this blog nearly five years ago from the depths of my doctoral program. The observation I made at the time, however, is as true today as it was then: in order to stand out from the crowd in data science, it is imperative that one maintains multiple extracurricular projects on top of whatever is happening during one's day job. Many of the data scientists I follow on social media appear to be experts at juggling their actual workload as well as contributing to open-source projects, performing science communication through social media and blog posts, and hacking on side projects. I am picking up this blog to achieve something similar: sharpen my rapidly oxidizing writing skills, write about topics that interest me, document data science projects I work on on the side, and of course pad my portfolio.

Post-Grad

Over the course of the last two years, I have more or less successfully made the transition I had hoped for, out of my doctoral program and into the private sector. I was privileged enough to be offered a part-time contracting gig at an environmental consulting firm before I had received my doctorate, which provided the perfect segue from public- or private-sector conservation biology work into a more purely business-oriented data science position. While I truly miss the deep personal connection I felt with conservation biology work, I am in a much better career track and industry for my overall emotional and financial health (not to mention career opportunity and stability). Of course there was a massive learning curve associated with my jump into business analytics, but I have taken refuge in the fact that the mathematical tools and statistics are the same, and I can learn the new context on the fly. My current job also required a dive into the deep end of SQL and Oracle database structure, an experience that has been at times painful but for which I am now very thankful. Working closely with our DBA, I have learned an incredible amount not only about retrieving the data I need in a useful format, but also about internal database construction and operations and database and query performance tuning.

Why Blogging?

I mentioned at the start that I observed several years ago that successful data scientists maintain a diverse extracurricular portfolio on top of whatever they are doing from 9 to 5. From the comfort of full-time employment, I can appreciate why this is necessary in this field. For the last two years, I have completed several projects wherein I've learned new R packages, learned Python and SQL basically from scratch, developed new analytical products and metrics, and so on; yet none of what I have done for my current employer can be shared, because it is all proprietary. The world of proprietary business knowledge was not something I thought much about in my past life as a biologist, to be honest. However, a company seeking to pay a mid- to senior-level data scientist the market rate would certainly be within its rights to seek some evidence that applicants can do what they say they can do. I have also grown to appreciate the difference between understanding in theory how some particular package or model works, and successfully implementing that package or model. Since all of my 9-to-5 work is proprietary, how would I be able to prove that I know what I'm talking about, or have done what I say I've done? Enter the extracurriculars.