Sunday, February 9, 2020

Create your own database for fun and profit

Databases for Field Biologists, pt. I

I took two full semesters' worth of graduate-level statistics courses during grad school. I intended to pursue a concurrent Master's in Statistics alongside my doctorate, but ran out of funding before I could complete the additional course requirements. In all that time in statistics classes, however, I was never introduced to SQL or relational databases. My understanding is that the business analytics program in the Statistics Department at my alma mater does include some introduction to SQL and databases, which is great; I consider that program more along the lines of a technical degree, and the Master's in Statistics more of a purely academic one. The purpose of this post is to provide a primer and some good training resources that I wish I had found when I entered full-time employment and was thrown into the deep end of SQL and databases. I am still learning, and I am privileged to work with talented SQL writers who help me with my code to this day, but hopefully this helps someone in the same boat I was in two years ago: deep knowledge of what to do with data once you have it, expertise in the statistical tools and methods available to answer questions, fluency in R and/or Python, but no SQL experience.

Why databases?

Say you are a graduate student in the life sciences like I was. If you are very lucky, your advisor is the steward of a long-term study propagated by a stream of M.S. and doctoral candidates. These projects are usually well funded, crank out publications, and are the highlight of department reviews, with hundreds of gigabytes of data to work with. In my experience, there are *many* more professors and labs that do not have the benefit of working on a project like this. New assistant professors striving to achieve tenure, able to fund only a handful of students every couple of years - this, I believe, is the state of affairs for the vast majority of research labs and positions at American research institutions. In the likely event that you receive a research assistantship in one of these smaller labs, you may be tasked with collecting all of your data in the field, manually transcribing it into no more than a few Excel spreadsheets, and then performing the analyses for your thesis from that handful of flat spreadsheets. This was my experience for every project I led in my 9.5 years in academia. In this situation, the overhead cost of developing a relational database to store all of your data far exceeds the benefits it would provide.

Once you leave academia, however, you will likely encounter data stored in relational databases. This may be in a public-sector position where governmental requirements (ideally) mandate stringent data integrity and persistence structures, or in the private sector, where data storage and maintenance practices can mean the preservation (or loss) of critical business data. Regardless of your future plans, it is beneficial to get experience working with databases, and you can easily take the few spreadsheets left over from your thesis and build your very own local database with just a few commands at the command line!

To keep things simple, I will only discuss limited aspects of the full functionality of relational database management systems (RDBMS), both for clarity and so that I don't show the holes in my own understanding. Start where I did with the example I'll reference in this section: say I have three .csv files: one with in-stream physical habitat data, one with GIS land-use data, and one with stream invertebrate community data. All three datasets are linked by a column containing some kind of site ID. That column relates each dataset to the others and, in database parlance, is called the 'key'. An RDBMS then allows you to execute SQL (Structured Query Language) code to query the datasets together or separately, relating them through the key column. In my toy example, you could write something like "SELECT * FROM COMMUNITY WHERE SITE_ID = 'XYZ';", which would return all community data for site XYZ, or "SELECT COMMUNITY.*, LAND_USE.* FROM COMMUNITY INNER JOIN LAND_USE ON COMMUNITY.SITE_ID = LAND_USE.SITE_ID;", which would return each community record together with its land-use data, as long as the community record has a corresponding SITE_ID in LAND_USE.
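If you want to see those two queries run end to end, here is a minimal sketch using Python's built-in sqlite3 module against a throwaway in-memory database (SQLite itself is introduced a bit further down). The column layouts and the sample rows are invented for illustration; only the table names and the SITE_ID key come from the toy example.

import sqlite3

# Throwaway in-memory database to experiment with the toy schema.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Three tables linked by the SITE_ID key column (the other columns are made up).
cur.execute("CREATE TABLE HABITAT (SITE_ID TEXT, MEAN_DEPTH_M REAL)")
cur.execute("CREATE TABLE LAND_USE (SITE_ID TEXT, PCT_FOREST REAL)")
cur.execute("CREATE TABLE COMMUNITY (SITE_ID TEXT, TAXON TEXT, ABUNDANCE INTEGER)")

cur.executemany("INSERT INTO LAND_USE VALUES (?, ?)",
                [("XYZ", 62.5), ("ABC", 17.0)])
cur.executemany("INSERT INTO COMMUNITY VALUES (?, ?, ?)",
                [("XYZ", "Ephemeroptera", 41),
                 ("XYZ", "Trichoptera", 12),
                 ("DEF", "Diptera", 88)])

# All community records for site XYZ.
print(cur.execute("SELECT * FROM COMMUNITY WHERE SITE_ID = 'XYZ'").fetchall())

# Community records joined to the land-use records that share a SITE_ID.
print(cur.execute(
    """SELECT COMMUNITY.*, LAND_USE.*
       FROM COMMUNITY
       INNER JOIN LAND_USE ON COMMUNITY.SITE_ID = LAND_USE.SITE_ID""").fetchall())

con.close()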

Side note: most RDBMSs provide the benefit of separating the physical storage of the data from the abstract logic that manipulates it through a 'client-server' relationship. What this means is that the RDBMS allows 'client' machines (i.e., your local computer) to reach out and fetch or manipulate data stored in a centralized repository called the 'server'. There are additional benefits to separating physical storage from data-manipulation logic that are beyond the scope of this post.

Providing RDBMS services is big business, and the companies that provide them include Oracle and Microsoft, two of the biggest companies on the planet. Licensing from these companies and building enterprise-level database infrastructure can employ dozens of people at even middling-sized companies. However, nearly all computers today come with a free, local-only DBMS: SQLite (website). In a later post I will document starting a new SQLite database right on your own computer, but you can see instructions at that website and here. It is a wonderful tool for beginning to experiment with SQL and databases using your own data.
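As a small preview of that later post, the sketch below shows one way to create a local SQLite database file and load a spreadsheet into it, again using Python's built-in sqlite3 and csv modules. The file names (streams.sqlite, community.csv) and the three-column layout are assumptions for illustration - swap in whatever your own exported spreadsheets look like.

import csv
import sqlite3

# Connecting to a file path creates the database file if it does not exist yet.
con = sqlite3.connect("streams.sqlite")
cur = con.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS COMMUNITY (
                   SITE_ID   TEXT,
                   TAXON     TEXT,
                   ABUNDANCE INTEGER)""")

# Assumes community.csv was exported with a header row: SITE_ID,TAXON,ABUNDANCE
with open("community.csv", newline="") as f:
    rows = [(r["SITE_ID"], r["TAXON"], int(r["ABUNDANCE"]))
            for r in csv.DictReader(f)]

cur.executemany("INSERT INTO COMMUNITY VALUES (?, ?, ?)", rows)
con.commit()

# Quick sanity check on the load.
print(cur.execute("SELECT COUNT(*) FROM COMMUNITY").fetchone())
con.close()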

Efficiently querying your data and linking related datasets together has obvious benefits. In my toy example, it would be trivial to load all three files into an R session and then leverage dplyr code to join the data (coincidentally using logic and terms lifted from SQL!). However, R also has a diverse ecosystem of interfaces to relational databases, including the DBI package currently being developed and maintained by RStudio. If you take the time to build a database for your own data, a natural next step would be to begin exploring the functionality of DBI (docs here).

Monday, February 3, 2020

Restart

Beginning Again

My last post to this blog was in October of 2016; I began posting nearly five years ago from the depths of my doctoral program. The observation I made at the time, however, is as true today as it was then: in order to stand out from the crowd in data science, it is imperative that one maintain multiple extra-curricular projects on top of whatever is happening in one's day job. Many of the data scientists I follow on social media appear to be experts at juggling their actual workload while also contributing to open-source projects, doing science communication through social media and blog posts, and hacking on side projects. I am picking this blog back up to do the same: sharpen my rapidly oxidizing writing skills, write about topics that interest me, document the data science projects I work on on the side, and of course pad my portfolio.

Post-Grad

Over the course of the last two years, I have more or less successfully made the transition I had hoped for out of my doctoral program and into the private sector. Before I had even received my doctorate, I was privileged to be offered a part-time contracting gig at an environmental consulting firm, which provided the perfect segue out of public- or private-sector conservation biology work and into a more purely business-oriented data science position. While I truly miss the deep personal connection I felt with conservation biology work, I am in a much better job track and industry for my overall emotional and financial health (not to mention career opportunity and stability). Of course there was a massive learning curve associated with my jump into business analytics, but I have taken refuge in the fact that the mathematical tools and statistics are the same, and I can learn the new context on the fly. My current job also required a dive into the deep end of SQL and Oracle database structure, an experience that has been at times painful but for which I am now very thankful. Working closely with our DBA, I have learned an incredible amount, not only about retrieving the data I need in a useful format, but also about internal database construction and operations, and about database and query performance tuning.

Why Blogging?

I mentioned at the start that, several years ago, I observed that successful data scientists seem to maintain a diverse extra-curricular portfolio on top of whatever they do from 9 to 5. From the comfort of full-time employment, I can now appreciate why this is necessary in this field. Over the last two years, I have completed several projects in which I've learned new R packages, learned Python and SQL basically from scratch, developed new analytical products and metrics, and so on - yet none of it can be shared, because it is all proprietary to my employer. The world of proprietary business knowledge was not something I thought much about in my past life as a biologist, to be honest. However, a company seeking to pay a mid- to senior-level data scientist the market rate would certainly be within its rights to ask for some evidence that applicants can do what they say they can do. I have also grown to appreciate the difference between understanding in theory how a particular package or model works and successfully implementing it. Since all of my 9-5 work is proprietary, how would I be able to prove that I know what I'm talking about, or that I have done what I say I've done? Enter the extra-curriculars.

Monday, October 17, 2016

The Quest for a Vectorized Image

In order to generate some (brief) new content for my blog, I'm posting here my first attempt at vectorizing an image. This is a process where an image is described by math instead of by fixed pixels; the benefit is that a vector is infinitely scalable without becoming pixelated the way typical .jpeg and .png files do when resized. I used Inkscape, which is free and open source. According to the little bit of research I did, my preferred image manipulation software, GIMP, doesn't handle vectors easily. So without further ado, here's my sweet new vectorized Lake Sturgeon!



I'm definitely a fan of the color scheme, but there's a lot of lost definition in the tail, scutes, and barbels. So I'll need to keep tweaking it to get an image that I can use in presentations, but I'm proud of this first attempt.

Friday, July 1, 2016

Lake Sturgeon Spawning GIF

Files in the .gif format are pretty ubiquitous on social media. They're essentially moving pictures, which are more entertaining than static images but have a much smaller file size than true video files. It's a pretty straightforward process in Photoshop to convert a video clip into a .gif, but I found today that the real trick is figuring out how to limit the file size for use on Facebook, in PowerPoint, etc. For my first GIF, I decided to use some USGS footage (which I think [hope] is public domain) of Lake Sturgeon spawning in the St. Clair River over an artificially constructed reef below a hydroelectric project. I was able to get the file down from over 32MB to 1.7MB by importing every other frame from the source video, shrinking the dimensions of the picture, and limiting the length of the clip being played back, all by toggling settings within Photoshop. So, here it is!



Video collected by USGS, and you can watch the source video here.

I highly recommend watching the video; the divers get up close and personal with the sturgeon on the spawning reef and highlight one individual carrying what I think are parasitic Sea Lampreys. At the end of the video, they show sturgeon eggs down in the interstitial spaces of the substrate.

Thursday, September 10, 2015

Is the recycling getting picked up next week?

To develop my Python skills, I have been working through the Codecademy beginner tutorials in Python. I really think that when learning a new language, the best way to get comfortable with it is to just work on your own projects. To that end, I wrote a simple script to solve a problem I run into in my daily life with unfortunate regularity. The City of Knoxville collects trash every Monday and recycling every other Monday, and I have a terrible time remembering whether the next Monday will be a recycling pick-up day or not. This script will let me know whether or not the coming Monday is a recycling day.

The conceptual flow for this script is:

1. Pull the current date from the computer

2. Assess how many weeks it has been since 31 August 2015, a day when the recycling was picked up.

3. Determine whether the current week is an odd week relative to the week when the recycling was known to be picked up; if so, print a message stating that this coming Monday will be recycling day. If not, return a different output.

This script utilizes the 'datetime' module from the Python 2.7.3 standard library. It was neat to see the differences and similarities between Python and R: importing a module in Python is just like loading a package in R. I like the syntax of the module, such as in the line 'now = datetime.datetime.now()', where I reference the module (the first 'datetime'), then the datetime class, and then its now() method.
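For reference, here is a minimal sketch of the three-step flow described above (this is not the gist's code, and it is written for Python 3 rather than 2.7); the 31 August 2015 reference date is the known recycling Monday from step 2.

import datetime

# Step 1: pull the current date from the computer.
today = datetime.date.today()

# Step 2: a Monday when the recycling was known to be picked up.
reference = datetime.date(2015, 8, 31)

# Find the coming Monday (weekday() == 0); if today is Monday, look a week ahead.
days_until_monday = (7 - today.weekday()) % 7 or 7
next_monday = today + datetime.timedelta(days=days_until_monday)

# Step 3: recycling runs every other Monday, i.e. every 14 days from the reference date.
if (next_monday - reference).days % 14 == 0:
    print("Recycling goes out this coming Monday (%s)." % next_monday)
else:
    print("No recycling this coming Monday (%s) - trash only." % next_monday)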

The code for the script can be found at:

https://gist.github.com/dwalke44/3d50b824de04e294bf9d

Tuesday, September 8, 2015

Fun with RMarkdown

This document is a practice exercise with RMarkdown, covering simulated data generation, ggplot2, and writing functions.
Packages to include: stats, ggplot2
  1. Generating new data from given distributions
    • Normally Distributed - n = 100, mean = 0, SD = 1
    library(stats)
    library(ggplot2)
    Normaldist<- rnorm(100, 0, 1)
    qplot(Normaldist, geom = 'histogram', main = 'Normally Distributed Data')
    ## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
    • Poisson Distribution - n = 1000, mean = 50
    Poisdist<- rpois(1000, 50)
    qplot(Poisdist, geom = 'histogram', main = 'Poisson Distributed Data')
    ## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
    • Binomial Distribution - 10 replicates of 100 coin tosses - ntotal = 1000, chance of success (heads) = 0.5
    cointoss<-rbinom(10, 100, 0.5)
    trials<-c(1,2,3,4,5,6,7,8,9,10)
    trials.success<-data.frame(trials, cointoss)
    cointossplot<-qplot(trials, cointoss, geom = 'bar', stat='identity', xlab = 'Trial Number', ylab = 'Number of Heads', main = 'Number of Heads per 100-Toss Trial')
    cointossplot + scale_x_discrete(levels(trials))
    #The call above adds space for an 11th trial, so instead I state the x-axis limits explicitly
    cointossplot + scale_x_discrete(limits=c(1,2,3,4,5,6,7,8,9,10))
  2. Function to generate summary statistics about the Normal dataset.
    #A function that returns a table of the mean, variance, and standard deviation of the input dataset
    sum.stat1<- function(x){
      mean<- mean(x)
      variance<- var(x)
      standarddev<- sqrt(var(x))
    
      result<-data.frame(mean, variance, standarddev)
    
      return(result)
    }
    
    #Running the function with the normally distributed simulated data
    fxn1<-sum.stat1(Normaldist)
    fxn1
    ##        mean  variance standarddev
    ## 1 0.1409396 0.9489135   0.9741219
    #Compare the results of my function to the summary() function
    
    fxn2<-summary(Normaldist)
    fxn2
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ## -2.3670 -0.5213  0.1364  0.1409  0.8603  2.7350