Experimental data

Today I want to talk about two-sample hypothesis testing, or A/B testing in data science parlance. This topic has been on my mind a lot as of late because I'm planning on leaving academia for data science, and that means I may also be transitioning from analyzing survey data to running experiments on unstructured, "big" data. Because academia and business have different goals and resources available to them, the methods they use also differ, even when they approach a similar problem.
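The classic two-sample test behind many A/B comparisons is Welch's t-test: compare the two group means, scaled by the standard error of their difference. A minimal sketch with made-up numbers (the data and metric here are purely illustrative, not from any real experiment):

```python
import numpy as np

# Hypothetical metric measured in two independent groups (toy data)
a = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2])   # "control"
b = np.array([10.6, 10.4, 10.9, 10.5, 10.7, 10.3])  # "treatment"

# Welch's t statistic: difference in means over the combined standard error.
# ddof=1 gives the sample variance; equal variances are NOT assumed.
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_stat = (a.mean() - b.mean()) / se
print(t_stat)  # negative here, since the "treatment" mean is higher
```

In practice you would hand the two samples to a library routine (e.g. `scipy.stats.ttest_ind` with `equal_var=False`) and read off the p-value rather than computing the statistic by hand.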
This week I’m not doing a data analysis project so much as a data cleaning project. One of the most common problems I come across in data cleaning is how to get summary statistics for various groups in the data. And it’s also one of the most annoying problems, because I invariably forget how to do it, and end up having to go back to old code to copy and paste.
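For what it's worth, the grouped-summary pattern I keep having to look up boils down to a group-by followed by an aggregation. A sketch in Python with pandas (the column names are made up for illustration):

```python
import pandas as pd

# Toy data; "group" and "value" are hypothetical column names
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row of summary statistics per group: mean, count, and std
summary = df.groupby("group")["value"].agg(["mean", "count", "std"])
print(summary)
```

The same idea in R would be a `dplyr::group_by()` followed by `summarise()`; the shape of the operation is identical.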
Since moving to San Francisco at the end of February, I have been driving and walking around town, trying to get to know my new city. And even though I've only been here a few weeks, I already feel affinity for my own neighborhood. Which is good, because I'm also too lazy to leave it. I also learned recently that you can download your location history via Google. In the spirit of discovering my new city, and possibly also discovering something about my (homebody) self, today I am going to make a map of where I have been since moving here three weeks ago.
Two weeks ago I claimed that women report higher job satisfaction when they work in countries where tech is more male-dominated. And then instead of backing up my claim last week, I got sidetracked by questions of sample size and statistical power. In a previous blog post I introduced the Kaggle survey on women in tech and I did some basic data cleaning for that survey. To save time and get to the point, I now pick up where I left off.
This week I am trying to embed a Shiny app in a static website using blogdown. In a couple of weeks I get to present a short introduction to blogdown at the first-ever R-Ladies meetup in the Netherlands, following a presentation on Rmarkdown and Shiny. It will be a nice bonus if I can show how to embed Shiny apps in blogdown!

Kaggle tech survey

For this demonstration I'm going to use data from the freely available Kaggle survey on data science and machine learning.
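As for the embedding itself, one common approach (an assumption on my part, not necessarily the route this post ends up taking) is to host the Shiny app separately, e.g. on shinyapps.io, and pull it into the static blogdown page with an iframe; the app URL below is a placeholder:

```html
<!-- Placeholder URL: replace with your deployed Shiny app -->
<iframe src="https://myaccount.shinyapps.io/my-app/"
        width="100%" height="600" frameborder="0"></iframe>
```

This works because the static site only needs to render HTML; the app itself keeps running on a server that can execute R.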
Just as the original Titanic VHS was released on two video cassettes, this Titanic analysis is also being published in two posts. In this post, part 2, I'm going to be exploring random forests for the first time, and I will compare the results to those of the logistic regression I did last time.

Random forest vs. logistic regression

Last time I explained how logistic regression uses a link function to transform non-linear relationships into linear ones.
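Concretely, the link function in logistic regression is the logit, which maps probabilities in (0, 1) onto the whole real line, where the model can be linear in the predictors. A minimal sketch of the transform itself (synthetic values, not the Titanic data):

```python
import numpy as np

def logit(p):
    """Log-odds: maps probabilities in (0, 1) to (-inf, inf)."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """Logistic function: maps real values back to probabilities."""
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
print(logit(p))             # roughly [-2.197, 0.0, 2.197]
print(inv_logit(logit(p)))  # recovers [0.1, 0.5, 0.9]
```

The regression is then fit on the log-odds scale, and predictions are pushed back through the inverse link to get probabilities; a random forest needs no such transform, which is part of what makes the comparison interesting.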
Yes, this is yet another post about using the open-source Titanic dataset to predict whether someone would live or die. At this point, there's not much new that I (or anyone) can add to the accuracy of predicting survival on the Titanic, so I'm going to treat this as an opportunity to explore a couple of R packages and teach myself some new machine learning techniques. I will be doing this over two blog posts.