## My 2018 in review – what had an impact on me

As 2018 comes to an end, I wanted to reflect on and write down some of the things that have impacted me this year and will continue to in the future. I kept these thoughts brief, as I want to be concise and prioritize what had the most impact. Hopefully readers find my thoughts useful in a practical or thought-provoking way. I’m happy to talk more about any of these topics; just reach out or comment!

The following thoughts are roughly categorized and in no particular order. Disclaimer: this page does not contain medical advice; every individual’s body and mind are different.

# 🏋️‍♂️ Health

• Stretching (before and after every workout) and continual rehab/strengthening (after every workout) have completely eliminated the re-emergence of weight training related injuries *knocks on wood*. Avoiding certain exercises that naturally aggravate old injuries has helped as well. Stretching and softening muscles is one of Tom Brady’s secrets to his longevity. And LeBron’s too: “play hard, have fun, and stretch”.
• Some of my favorite stretches and rehab/warmup exercises this year:
• Zinc supplements have staved off oncoming colds several times for me this year. This is my go-to immune health supplement, which I “superdose” (i.e. 3-5 tablets a day) when I feel a cold coming.
• A cup of coffee (caffeine) works wonders for me. I never realized how much more energy and alertness it gave me before this year, when I started drinking it more often because it’s free and tasty at work. I can only have one cup though, and earlier in the morning, or else I stay up all night. I’ve been using it together with L-theanine. I like to save this combo for special situations (also so that I don’t develop caffeine dependence and withdrawal).

# 🧠 Mental

• Floating has helped me relax and stay centered. It’s also given me some thought provoking experiences. I like Lift in Brooklyn. Sign up for their mailing list, they have deals/coupons a few times every year.
• I’ve really enjoyed Sam Harris’s Waking Up App. His meditations and lessons are educational and thought provoking, in addition to being very relaxing of course.
• Speaking of Sam, I found his recent podcast with the TV mentalist and hypnotist Derren Brown fascinating; hypnosis can be powerful. I’m exploring self-hypnosis, as well as acupuncture, after hearing of a friend of a friend having allergies “cured” from it. I expect the placebo effect—namely the power of expectation and belief—to play a huge role in why these things “work”. Even if that’s true though, it means these practices can still be beneficial.
• The mind and body are so connected that all of this might as well be under Health.

# 🕴 Career

• I continue to love building digital products that people use. Some of the things I created in 2018:
• Snowglobe: a more helpful salary database
• [in progress] ShiftReader: a better speed reading training tool than what Spreed was. The link is just a landing page with fake pricing (I’m doing price testing), so click “Sign Up” and enter your email if you’re interested in email updates.
• [sorta dead] CryptoMint: was previously a paid subscription newsletter for crypto news with automated sentiment analysis on scraped articles, which actually had a good number of subscribers. After deciding I did not want to be in the business of selling “predictions”, esp. in a market like crypto, I turned it into a free crypto newsletter (where the articles are still being scraped) that I only sometimes send out. I have about 430 people on the mailing list.
• [dead] CryptoSaver: a web app that automated dollar cost averaging into crypto. I killed it after realizing that users were still terrified of some web app placing crypto buy orders automatically through Coinbase, even though access was via OAuth, each buy order had to be manually approved, and the app wouldn’t have permission to do anything else on the account, like sell or transfer. I didn’t invest much before talking to users about this idea (and I try not to with most of my ideas): I only put up a legit looking landing page and did some light Python work to understand how the Coinbase API worked.
• Data Science Salary Predictor: a web app version of the data science salary model described in O’Reilly’s 2017 Data Science Salary Survey
• I’m really happy to have found “solo entrepreneurship” communities this year, like the Indie Hackers community and Microconf, and specific people in those communities I can talk to, like Christian.
• I’ve been working at Squarespace as a Data Scientist for a little over a year and a half now, working closely to support Product. The thoughts below are primarily about that kind of Data Science, vs. machine learning engineering type roles, or Data Scientists that support other stakeholders like Marketing or Sales. I’ve gotten a good chance to learn and think about:
• How Data Scientists and PMs should work together: more of a partnership and less of a conduit for data access. Like any good relationship, it takes time and effort to develop that partnership.
• Event data standardization, event tracking “grammars” that are intuitive and self-documenting, and the importance of data governance in a truly data-driven organization. By data-driven orgs I mean orgs that use data (and Data Science) in a meaningful way to drive product-level and even company strategy-level decisions, not orgs that only look at whether metrics are going up. 📈 Like all things in life, a balance of both is necessary.
• The power of quantitative + qualitative research in understanding users i.e. what Data Scientists (can) do + what User Researchers do. Data shows what users do. User interviews get at why users do what they do, or what they couldn’t do (which you can’t observe with data). Together, they are the voice of the user.
• I’m very bullish on Segment, and the massive and growing value they provide for Product orgs that want to be data driven (which is also a growing number). For example, I love what they’ve created with Protocols and Typewriter. Now that they’re the centralized data hub for companies, they can build powerful analytical products like Personas too.

# 📖 Books

• A few of the most impactful ones I read this year:

# And saving the best for last

I got engaged to the love of my life!

• May we all have another year full of learning and exploration.

# List of Firebase Analytics Nuances

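1. “Turn on” parameter reporting from the start if you have dimensions in your events that you want to see numbers for, at a glance, in Firebase.
2. By default, your raw event data is collected and made available to you only after you link Firebase to BigQuery.
3. Firebase’s default Funnel reports are “open” funnels, not “closed” funnels.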

## “Turn on” parameter reporting from the start if you have dimensions in your events that you want to see numbers for, at a glance, in Firebase Analytics.

Firebase Analytics gives you some basic visualizations out of the box, like how many times a certain event fires, over time. I had an event that would fire whenever an upgrade popup was shown to a user, and I specified a parameter called “source” which would note which action preceded the upgrade screen, so I could see the most common paid features that free-tier users tried to access. However, Firebase Analytics did not report on this “source” dimension at all until I manually set up “parameter reporting” for it. So don’t forget to enable “parameter reporting” for important event parameters/dimensions that you care about!

1. In the Event view, click the three vertical dots to the far right of your event, then add a parameter of your event to the table by clicking and dragging
2. Save
3. Firebase Analytics will start collecting numbers for that parameter (here, “source”), which you’ll be able to see in the report for the parent event (here, “upgrade_popup_show”)

## By default, your raw event data is collected and made available to you only after you link Firebase to BigQuery.

When I first implemented Firebase, launched my app, and got a handful of users, I could see a high level picture of their behavior via Firebase Analytics’ basic visualizations. A few weeks later, I found out that I had to link Firebase to BigQuery explicitly to start telling Firebase to “save” my raw event data, and only after doing so did I see that raw data coming in (and saved into tables in BigQuery). So I had “lost” the first several weeks of raw event data, which isn’t bad for my small app, but could be more costly for a high profile, heavily marketed app launch where mobile analytics and being able to mine insights from the data matter more.

Note that when you link Firebase to BigQuery, you’ll need to upgrade to Google Cloud Platform’s Blaze plan, which is pay-as-you-go: you pay only for the bandwidth, storage, etc. that you use. You can visit their calculator to estimate your costs, but so far, collecting the data and running infrequent BigQuery SQL queries for my app has been free.

## Firebase’s default Funnel reports are “open” funnels, not “closed” funnels.

If you go into Firebase Analytics’ Funnels page, you’ll see an area where you can create a funnel easily. After trying to do so, I found out that the funnels Firebase creates are “open” funnels, meaning that at each step of the funnel, a user doesn’t have to have completed the previous step of the funnel to be included in the count of that step. In my opinion, “closed” funnels, where at each step of a funnel a user at that step has to have completed the preceding step, are much more informative; it’s also a core feature of other event analytics tools like Mixpanel and Heap. Several others are also confused about Google’s decision to have Firebase only report open funnels.
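To make the open/closed distinction concrete, here is a minimal Python sketch on a made-up event log. The user IDs, the event names `photo_1` and `photo_2`, and the helpers `open_funnel` and `closed_funnel` are all illustrative assumptions; none of this comes from Firebase itself, and it simplifies things by giving each funnel step its own event name.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, event_name), already sorted by time.
EVENTS = [
    ("u1", "first_open"), ("u1", "photo_1"), ("u1", "photo_2"),
    ("u2", "first_open"),
    ("u3", "photo_1"), ("u3", "photo_2"),  # no first_open recorded for u3
]
STEPS = ["first_open", "photo_1", "photo_2"]


def open_funnel(events, steps):
    """Count users per step, whether or not they completed earlier steps."""
    users_by_event = defaultdict(set)
    for user, name in events:
        users_by_event[name].add(user)
    return [len(users_by_event[step]) for step in steps]


def closed_funnel(events, steps):
    """Count users per step only if they completed every earlier step, in order."""
    progress = defaultdict(int)  # user -> number of funnel steps completed so far
    for user, name in events:
        next_step = progress[user]
        if next_step < len(steps) and name == steps[next_step]:
            progress[user] += 1
    return [sum(1 for done in progress.values() if done > i)
            for i in range(len(steps))]


print("open:  ", open_funnel(EVENTS, STEPS))    # [2, 2, 2]
print("closed:", closed_funnel(EVENTS, STEPS))  # [2, 1, 1]
```

In the open version every step shows two users, so it looks like nobody drops off; the closed version shows that only one of the two first openers went on to take a photo.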

For example, I created a funnel in Firebase Analytics to report on what percentage of users who open my app for the first time go on to take their 1st photo with my app, then what percentage of those go on to take their 2nd photo, etc. I expected fewer and fewer users to make it to each step of the funnel, so I was surprised when I saw what appeared to be 100% of users who take one photo take two, 100% of users who take two photos take three, etc. It turned out that Firebase had constructed an open funnel.

There isn’t a setting in Firebase Analytics to see closed funnels yet, so I decided to create a closed funnel in BigQuery with SQL, on my raw event data.

I won’t go into the details here, but I tested a few different kinds of SQL queries for constructing closed funnels, and the following “LEFT JOIN”-based one had much better performance than a “subqueries”-based one that you may find elsewhere on the internet. You too can create closed funnels to better understand the flow of your users, if your event data is in BigQuery. Here’s my SQL query for the closed funnel “first open -> take 1st photo -> take 2nd photo -> take 3rd photo” (it uses UNNEST to flatten the event_dim array, since the BigQuery export stores each row’s events as a nested, repeated field):

```sql
-- Closed funnel: first open -> 1st photo -> 2nd photo -> 3rd photo.
-- Each LEFT JOIN can only match users who reached the previous step.
-- NOTE: 'photo_taken' is a placeholder event name (an assumption);
-- replace it with the name of your own photo event.
SELECT
  COUNT(DISTINCT e0.user_dim.app_info.app_instance_id) AS first_openers
  , COUNT(DISTINCT e1_user) AS photo_taken_1
  , COUNT(DISTINCT e2_user) AS photo_taken_2
  , COUNT(DISTINCT e3_user) AS photo_taken_3
FROM `youday_IOS.app_events_*` AS e0, UNNEST(e0.event_dim) AS e0_events
LEFT JOIN (
  SELECT
    events.name AS e1_eventname
    , e.user_dim.app_info.app_instance_id AS e1_user
    , events.timestamp_micros AS e1_ts
  FROM `youday_IOS.app_events_*` AS e, UNNEST(e.event_dim) AS events
  WHERE events.name = 'photo_taken'
) ON e0.user_dim.app_info.app_instance_id = e1_user
LEFT JOIN (
  SELECT
    events.name AS e2_eventname
    , e.user_dim.app_info.app_instance_id AS e2_user
    , events.timestamp_micros AS e2_ts
  FROM `youday_IOS.app_events_*` AS e, UNNEST(e.event_dim) AS events
  WHERE events.name = 'photo_taken'
) ON e1_user = e2_user
  AND e2_ts > e1_ts -- 2nd photo taken after 1st
LEFT JOIN (
  SELECT
    events.name AS e3_eventname
    , e.user_dim.app_info.app_instance_id AS e3_user
    , events.timestamp_micros AS e3_ts
  FROM `youday_IOS.app_events_*` AS e, UNNEST(e.event_dim) AS events
  WHERE events.name = 'photo_taken'
) ON e2_user = e3_user
  AND e3_ts > e2_ts -- 3rd photo taken after 2nd
WHERE e0_events.name = 'first_open'
```

# Firebase for Mobile Product Analytics

Firebase makes it easy to track events and collect all of them into a datastore, so you have the data you need to (quantitatively) understand how users are using your mobile app. There are just a few “manual switches” that someone using Firebase Analytics should know about, to ensure that they’re collecting complete behavioral data from the start. Firebase can also improve its visualizations to be more informative and insightful, so users don’t have to write SQL as much. Firebase certainly has the potential to get there, with its relatively affordable “utility” or “pay-as-you-go” pricing model and its superior data storage and querying capabilities (good luck trying to get your raw data out of the other event analytics platforms). I enjoy learning from my users to build a better product, and having the data to do so, and am excited to see what Firebase Analytics can do over time for the advancement of product analytics.

## Using Expected Value for Classifier Use in Business Problems

I’ve been reading Data Science for Business, by Provost and Fawcett, a very useful book that explains some of the most important principles and topics in data science. The authors’ language and structure helps a lot in developing an intuitive understanding of key data science concepts like model tuning, model evaluation, and various models themselves like decision trees, linear models, and k nearest neighbors. I highly recommend the book if you’re someone who works with data scientists, if you’re a beginner data scientist, or even if you’re a data science expert who’s looking for a good resource to refresh your fundamentals with.

I found this one chapter particularly interesting because it talks about a framework, or way of thinking, that I haven’t really heard about elsewhere. While specific tactics, such as how different kinds of models work, are definitely important and a large part of what a Data Scientist needs to know and be able to do, I think higher level strategy is also important. Anyways, the framework is highly practical, which fits the authors’ theme for the book: that data science isn’t just about analyzing data, but also about understanding the business problem in an analytical way. I wished there was something tangible and interactive to go along with their explanations in this chapter (and others), so I decided to create a guide of sorts, this blog post plus an interactive Jupyter Notebook you can download and play with. The blog post provides context if you haven’t read the corresponding chapter in the book yet, so the Jupyter Notebook is near the end.

If you have the book already, this blog post corresponds to the latter “half” of Chapter 7, “Decision Analytic Thinking I: What Makes a Good Model?”. This guide, and especially the Jupyter Notebook, assume that the reader already has some familiarity with the basic ideas of machine learning, such as supervised learning (specifically classification), data pre-processing, holdout set testing, and model evaluation.

### When applying data science to solve business problems: what is the real goal?

As with any sort of problem, you have to uncover what the real goal of a data analytic project is. It can be tempting to get caught up with the surface level question or jump straight into solutions.

For example, questions about customers come up a lot in business: which customers are most likely to churn? Which customers are most receptive to upselling? The idea is that once we can predict which customers are most likely to be upsold, we can call them, try to get them to buy more items like an add-on for the thingamajig they just bought, and generate more revenue for the business. Let’s run with this “upselling” case as an example.

The real business goal for answering “which customers are most receptive to upselling?” is so that we can not only generate more revenue from upselling customers, but also maximize the profit generated from our efforts. Not all customers will be equally likely to be upsold (some are curmudgeons, others might have a real need for the other products we’re selling), those who we do upsell could purchase different amounts of stuff, and the act of upselling costs us time and money (which can also be variable). So how do we even structure a problem like this, and then decide what to do?

### Introduction to the expected value framework, and how it helps break down problems

Let’s introduce the expected value framework, and weave it into how we’d structure and break down our business objective for this “upselling” project.

As a quick refresher:

expected value (of a variable) – a predicted value of a variable, calculated as the sum of all possible values, each multiplied by the probability of its occurrence

Basically: what do we anticipate, or expect, the value of some variable to be, given that there is some uncertainty about which outcome will happen?
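As a quick illustration of that definition, here is a minimal sketch in Python. The function name `expected_value` and the die example are my own illustrative choices, not something from the book.

```python
import math


def expected_value(outcomes):
    """Sum of value * probability over (value, probability) pairs."""
    total_probability = sum(p for _, p in outcomes)
    if not math.isclose(total_probability, 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in outcomes)


# A fair six-sided die: each face has probability 1/6.
die = [(face, 1 / 6) for face in range(1, 7)]
print(round(expected_value(die), 2))  # 3.5
```

No single roll ever shows 3.5; the expected value is the long-run average over many rolls, which is the same sense in which we will talk about the expected profit of calling a customer.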

#### Frame the question in terms of expected value

Back to our upselling question. Each customer has their own probability of being upsold and a likely amount that they will be upsold for; there’s also a cost to upselling, which we may have to eat if we call a customer who doesn’t want to buy anything else from us. So, thinking in terms of expected value, each customer will have an expected profit, given that we reach out to that customer to try and upsell them. More specifically:

$$E(Profit) = p_u \cdot Profit_{upsold} + (1 - p_u) \cdot Profit_{not\ upsold}$$

Which means that, assuming we reach out to a customer, the expected value of profit ($E(Profit)$) equals the probability of upselling the customer ($p_u$) times the profit we’d get from upselling the customer, plus the probability of failing to upsell the customer (1 minus the probability of upselling the customer) times the profit we’d get from failing to upsell the customer.

Breaking out profit in each potential outcome:

$$E(Profit) = p_u \cdot (v_u - c) + (1 - p_u) \cdot (-c)$$

Where $v_u$ is the value, or revenue generated, from upselling the customer, and $c$ is the cost of trying to upsell the customer (we assume the cost is constant across customers for simplicity). Notice in the second half of the equation that if we fail to upsell the customer, the outcome is that we get \$0 in revenue and eat the cost ($-c$) of trying.

Now, the path to obtaining our original business goal, to maximize total profits, is clear: try to upsell all customers where the expected profit of trying to upsell each one is greater than 0 (assuming we don’t have any budget or constraint on how many customers we can upsell to).
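A short derivation (my own algebra on the expected profit formula described above, not something taken from the book) shows that the rule “expected profit greater than 0” collapses to a simple probability threshold, assuming $v_u > 0$ and a constant cost $c$:

$$
\begin{aligned}
E(Profit) &= p_u (v_u - c) + (1 - p_u)(-c) \\
          &= p_u v_u - p_u c - c + p_u c \\
          &= p_u v_u - c \\
E(Profit) > 0 &\iff p_u > \frac{c}{v_u}
\end{aligned}
$$

In other words, a customer is worth calling whenever their predicted probability of being upsold exceeds the ratio of the cost of trying to the value of a successful upsell.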

#### Expected value breaks the problem down for us

Also, thinking in terms of expected value has now broken up the problem nicely for us: to figure out the expected profit of trying to upsell a customer, (1) figure out the probability that upselling will work $p_u$, the (2) value of a successful upsell $v_u$, and the (3) cost $c$ of trying to upsell a customer.

Now, we can go more low level and think about how we might address each piece analytically. We can build a machine learning model, a classifier, on historical customer data of which kinds of customers were successfully upsold and which kinds weren’t, to address (1) and generate a predicted $p_u$, or probability that upselling will work, for each customer. For simplicity, we’ll assume that both (2) and (3) are constant across all customers, but technically, you could build another model to predict (2), the value of a successful upsell for a given customer.

More specifically, for (1), our historical customer data is a snapshot of all customers that we’ve previously tried to upsell to, at time t. One column in the data is whether or not (e.g. a 1 or -1, or 1 or 0) we were able to successfully upsell each customer by some future date t+1, say 3 months later; this is the target variable. The other columns, or features, contain data on each customer before time t, such as number of previous purchases, number of times customer has been back to our online store, shipping zip code (which we can estimate income level with), etc.

Now we have a structure, thanks to EV (expected value), for evaluating whether we should try to upsell any individual customer in order to maximize company profits.

#### Example

Let’s plug in some numbers to see how we might use our structure to make decisions on whether we should try to upsell a customer or not.

Take Customer A. Based off of what we know about other customers that are similar to him, our machine learning model predicts that he has a 91% chance of being upsold, if we call him.

Let’s assume that if we upsell a customer, they will spend \$100 to buy an add-on to the thingamajig they already bought. Let’s also assume that on average, it takes a 30 minute phone call at a salesperson’s hourly wage of \$30 / hour, to try to upsell someone, so the cost of upselling is \$15.

Therefore, the expected profit for trying to upsell Customer A will be:

$E(Profit_A) = 0.91 * (\$100 - \$15) + 0.09 * (-\$15) = \$76$

And since the expected profit is positive, it is worth it to try and upsell him, because on average (if we keep trying to upsell people like him), we will generate \$76 in profits each time for the company.

Now let’s look at Customer B. Based off of what we know about other customers that are similar to her, our machine learning model predicts that she has a 4% chance of being upsold, if we call her.

So, the expected profit for trying to upsell Customer B will be:

$E(Profit_B) = 0.04 * (\$100 - \$15) + 0.96 * (-\$15) = -\$11$

We should not try to upsell customers like Customer B, because on average, we will lose \$11 each time.

If we do this expected value calculation for each customer we’re thinking about upselling to, we can arrive at a subset of customers where the expected profit of upselling each one is positive, and thus if we try to upsell all of them, our expected total profit will be maximized.
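The per-customer calculation can be sketched in a few lines of Python. This is a minimal illustration, not the code from the notebook: the function names are my own, Customers A and B use the numbers from the example, and Customers C and D are made-up extras.

```python
UPSELL_VALUE = 100.0  # v_u: revenue from a successful upsell
UPSELL_COST = 15.0    # c: one 30-minute call at $30/hour


def expected_profit(p_u, v_u=UPSELL_VALUE, c=UPSELL_COST):
    """E(Profit) = p_u * (v_u - c) + (1 - p_u) * (-c)."""
    return p_u * (v_u - c) + (1 - p_u) * (-c)


def customers_worth_calling(predicted_p_u):
    """Return (customer, expected profit) pairs with positive expected profit, best first."""
    scored = [(name, expected_profit(p)) for name, p in predicted_p_u.items()]
    worth_it = [(name, ev) for name, ev in scored if ev > 0]
    return sorted(worth_it, key=lambda pair: pair[1], reverse=True)


# Predicted upsell probabilities; C and D are hypothetical customers.
predicted = {"A": 0.91, "B": 0.04, "C": 0.30, "D": 0.10}

targets = customers_worth_calling(predicted)
for name, ev in targets:
    print(f"call customer {name}: expected profit ${ev:.2f}")
print(f"expected total profit: ${sum(ev for _, ev in targets):.2f}")
```

With these inputs only A and C make the list. B and D are left out because calling them has a negative expected profit, so including them would lower the expected total.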

See this Jupyter Notebook for a full example of training a machine learning model on historical customer data to predict whether or not a customer will be upsold or not, and the associated probabilities of each outcome happening. These probabilities, along with the expected value framework, are then used to show which customers we should try to upsell to maximize our company’s profit.

#### Considerations

Note that using the expected value framework to calculate something like expected profit depends entirely on two things: the probabilities of different outcomes (e.g. a customer successfully being upsold or not) and the benefit or cost of each outcome.  Both can be estimated with models and comprehensive data, but not always very well, or it may be impossible in the first place. This is where both business and data understanding come into play: a data scientist has to understand what data is available and what it can be used for, and also understand how the business works so that accurate cost/benefit numbers can be gathered. This also means that the results of using expected value are sensitive to changes in either type of variable, probabilities or cost/benefit numbers. Though the expected value framework can be a practical and structured way to break down a business analytic problem, the data scientist may have to use other methods to inform action if he/she doesn’t have enough confidence in the probability or cost/benefit estimates. Like all things in life, there is no one size fits all approach: the EV framework is a tool in a data scientist’s big toolbox.
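One rough way to see this sensitivity is to recompute the decision over a small grid of plausible estimates. The sketch below is only an illustration under assumed numbers: the probability and cost ranges are made up, and `decision_grid` is a hypothetical helper.

```python
def expected_profit(p_u, v_u, c):
    """E(Profit) = p_u * (v_u - c) + (1 - p_u) * (-c)."""
    return p_u * (v_u - c) + (1 - p_u) * (-c)


def decision_grid(p_estimates, cost_estimates, v_u):
    """Map each (p_u, c) pair to True if calling has positive expected profit."""
    return {
        (p, c): expected_profit(p, v_u, c) > 0
        for p in p_estimates
        for c in cost_estimates
    }


# Made-up ranges around a borderline customer, with a $100 upsell value.
grid = decision_grid(p_estimates=[0.12, 0.18, 0.25],
                     cost_estimates=[10, 15, 20],
                     v_u=100)
for (p, c), worth_it in sorted(grid.items()):
    print(f"p_u={p:.2f}, c=${c}: {'call' if worth_it else 'skip'}")
```

For a customer near break-even, a few points of error in the predicted probability or a few dollars of error in the cost estimate is enough to flip the decision, which is why confidence in both kinds of estimates matters.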

Thanks for reading, I’m always open to questions, suggestions, or other kinds of feedback!

## Creating a stock market sentiment Twitter bot with machine learning based image processing

One of the side projects I worked on in the past handful of months was Mr. Market Feels: a stock market sentiment Twitter bot that used automated image processing to extract and tweet the value of CNN Money’s Fear and Greed Index every day.

### Motivation

There have been attempts to backtest the predictive power of the Fear and Greed Index when buying and selling the overall stock market index depending on its value (the results suggest there isn’t much edge for that particular strategy). Anecdotally though, I’ve found the CNN Fear and Greed Index (what I’ll call FGI for short) to be a pretty good indicator of when this bull market has bottomed out during a short-term retracement, and back when I had more time, I used it to trade options with decent success. Going to CNN’s website every day to check the FGI was a pain, and I also wanted the numerical values in case I wanted to run some analyses in the future, so I wondered if I could automatically extract the daily Fear and Greed Index values.

### Challenge

I saw this as a fun and short coding project that would help me and others while giving me practice with image processing, so I dove in.

The goal was to extract the FGI “value” and “label” from CNN’s site every day. The value of the Index is 95 and the label is “Extreme Greed” in the screenshot of the FGI below:

Extracting the FGI value and label isn’t as easy as using OCR (optical character recognition) on the image and getting the results: for one, there is a lot of extraneous text in the image. Two: the pixel location of the value and label that we want changes as the FGI changes. Three: the relative position of the value and label also changes as the FGI changes. You can see points two and three in the image below: now, the FGI label (“Extreme Fear”) is to the top left of the FGI value (1). In the original image, the FGI label (“Neutral”) is directly right of the FGI value (53).

Why does all of this matter? Because for clean OCR, images need to be standardized. Or at least they do for Tesseract, the open source OCR engine created by Google. In Tesseract’s case, images of text shouldn’t contain any other artifacts (that the engine might try to interpret as text), should be scaled large enough, have as much image contrast as possible (e.g. black text on white), and be either horizontally or vertically aligned.

Most of the pre-processing of the FGI images to standardize them for Tesseract was straightforward enough. Without going into way too much detail, I used the Python Pillow library to automatically convert the image to black and white, apply image masks to eliminate extraneous parts of the image (like the “speed dial” and the “historical FGI table” on the right hand side), and crop the image down to leave only the FGI value and label, like this:

Or this:

Here’s where challenge number three came up: the FGI value and label aren’t always either horizontally or vertically aligned, and this reduced Tesseract’s accuracy. For example, in the first image, the FGI label is diagonal from the FGI value. Running Tesseract OCR on it returns “NOW:[newline]Extreme[newline]Fear”, which completely misses the value “10” because of the diagonal alignment. You can try out Tesseract OCR with the above images, or with your own, here.

### An Interdisciplinary Solution of Sorts

One solution to the challenge above was to split the resulting image into two images, one with the FGI value and a separate one with the label, so that Tesseract could be run on each, knowing that the text in both images was either horizontally or vertically aligned. Basically, from a single FGI image, I wanted two images that looked like these:

and

In thinking about ways to implement that, I first thought about the principles of unsupervised clustering, from the field of machine learning. With clustering, the intermediate, processed FGI image could be segmented and split appropriately by finding the cluster of pixels that corresponded to the FGI value (“10”), and the other cluster of pixels that corresponded to the FGI label (“Now: Extreme Fear”).

Turns out that using the k-means clustering algorithm for image segmentation is pretty common practice.

First, a copy of the image was “pixelated” to ensure that the k-means algorithm would converge on the two correct clusters:

Then, the code applied k-means to find the centroids of the two clusters (green dots). It then derived the line connecting the two centroids (green line), and calculated the bisecting perpendicular line (red line), which can be seen as a “partition” between the two clusters of black pixels.

From there, the original black and white FGI image could be split along the partition line, which would result in the desired two images: one for the FGI value and one for the FGI label. From here, Tesseract would have these two standardized images as inputs and would be able to cleanly extract the FGI value and label.
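The clustering-and-partition idea can be sketched in plain Python on a list of black-pixel coordinates. This is a simplified illustration rather than the bot’s actual code: it skips the pixelation step, uses a tiny hand-rolled 2-means instead of a library, and the helper names and the toy pixel blobs are assumptions.

```python
import math


def _mean_point(cluster):
    """Centroid (average x, average y) of a non-empty list of points."""
    return (sum(x for x, _ in cluster) / len(cluster),
            sum(y for _, y in cluster) / len(cluster))


def two_means(points, iterations=20):
    """Plain k-means with k=2; returns the two centroids."""
    # Deterministic start: the leftmost and rightmost points (an arbitrary choice).
    centroids = [min(points), max(points)]
    for _ in range(iterations):
        clusters = ([], [])
        for p in points:
            nearer_first = math.dist(p, centroids[0]) <= math.dist(p, centroids[1])
            clusters[0 if nearer_first else 1].append(p)
        new_centroids = [_mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments stopped changing
        centroids = new_centroids
    return centroids


def split_by_bisector(points, c0, c1):
    """Split points along the perpendicular bisector of the segment c0-c1."""
    mid = ((c0[0] + c1[0]) / 2, (c0[1] + c1[1]) / 2)
    direction = (c1[0] - c0[0], c1[1] - c0[1])
    side_0, side_1 = [], []
    for x, y in points:
        # The sign of the projection onto the centroid-to-centroid direction
        # says which side of the bisecting line the pixel falls on.
        projection = (x - mid[0]) * direction[0] + (y - mid[1]) * direction[1]
        (side_0 if projection < 0 else side_1).append((x, y))
    return side_0, side_1


# Toy "image": a small blob for the value, diagonal from a wider blob for the label.
value_pixels = [(x, y) for x in range(0, 4) for y in range(10, 14)]
label_pixels = [(x, y) for x in range(10, 18) for y in range(0, 3)]

c0, c1 = two_means(value_pixels + label_pixels)
value_side, label_side = split_by_bisector(value_pixels + label_pixels, c0, c1)
print(c0, c1, len(value_side), len(label_side))
```

Each side can then be cropped to its own bounding box and passed to the OCR engine separately.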

### Conclusion

Lastly, I put the script onto a web server, told a cron job to run it daily, and hooked it up to Twitter’s API to automatically post to the Twitter account Mr. Market Feels. I named it after Ben Graham’s moody Mr. Market.

I just finished reading Poor Charlie’s Almanack (an amazing book full of wisdom and life principles) so Charlie Munger’s multidisciplinary approach to life is on my mind. Though this project was probably a little less multidisciplinary than he means because machine learning and image processing are closely related fields, I still saw it as an example of how broad and varied knowledge and skills can come together to solve a problem effectively. To quote Munger on specialized knowledge: “To the man with only a hammer, every problem looks like a nail.”

UPDATE 6/9/2018: Mr. Market Feels has been broken for a handful of months because various financial data APIs that I’ve tried using have been deprecated. I recently found out about IEX’s free and publicly available financial data API, which Mr. Market Feels is now using, and it will hopefully make its first post-fix tweet on Monday. I would also highly recommend reading Flash Boys: Michael Lewis tells such an intriguing story about the arms race going on in high frequency trading and the birth of IEX.

Technologies used:

## Learning from machine learning: ensembling, and other important skills

In my downtime, I’ve been using Kaggle to get better at applying machine learning to solve problems. The process is not only teaching me new technical skills, but also reminding me of some useful principles that can be applied elsewhere. To keep things digestible, this is the second post of two (the first one is here).

### A short list of important skills for a data scientist

When trying to get better at a skill, I try to tackle the highest leverage points–here’s what I’ve been able to gather about three skills that are important in being a data scientist*, from talking with others and reading about machine learning, and experiencing it firsthand with the client projects I do.

1. Feature engineering
2. Communication (includes visualization)
3. Ensembling

The first two are relatively self-explanatory; ensembling brings some pretty interesting concepts that apply to decision-making, in my opinion.

*I’ll be referring to the “applier of machine learning” aspect of “data science”.

#### Feature engineering

Feature engineering is the process of cleaning, transforming, combining, disaggregating, etc. your data to improve your machine learning model’s predictive performance. Essentially, you’re using existing data to come up with new representations of the data in the hopes of providing more signal to the model–feature selection is removing less useful features, thus feeding the model less noise, which is also good. The practitioner’s own domain knowledge and experience is used a lot here to engineer features in a way that will improve the model’s performance instead of hurt it.

There are a few tactics that can be generally applied to engineer better features, such as normalizing the data to help certain kinds of machine learning models perform better. But usually, the largest “lift” in performance comes from engineering features in a way that’s specific to the domain or even problem.
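As a small example of such a general tactic, here is a sketch of two common ways to normalize a numeric feature using only the standard library. The function names and the sample incomes are illustrative assumptions; in practice a library such as scikit-learn provides equivalent scalers.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


incomes = [30_000, 60_000, 90_000]  # made-up annual incomes
print(min_max_scale(incomes))  # [0.0, 0.5, 1.0]
print([round(z, 3) for z in standardize(incomes)])
```

Scaling like this mostly matters for models that are sensitive to feature magnitudes, such as k nearest neighbors or regularized linear models.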

An example is using someone’s financial data to predict their likelihood of defaulting on a loan. You might have the person’s annual income and monthly debt payments (e.g. for auto loans, mortgages, credit cards, the new loan they’re applying for), but those somewhat closer to the lending industry will tell you that a “debt to income ratio” is a better metric for predicting default, because it essentially measures how capable the person is of paying off their debt, all in one number. After calculating it, a data scientist would add this feature to the training data, and would likely find that their machine learning model performs better at predicting default.
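A minimal sketch of engineering that feature is below. The field names, the sample applicants, and the choice to define the ratio as total monthly debt payments over monthly income are assumptions for illustration; lenders may define it slightly differently.

```python
def debt_to_income(annual_income, monthly_debt_payments):
    """Total monthly debt payments divided by monthly income."""
    if annual_income <= 0:
        raise ValueError("annual_income must be positive")
    monthly_income = annual_income / 12
    return sum(monthly_debt_payments) / monthly_income


# Hypothetical loan applicants with raw income and debt columns.
applicants = [
    {"id": 1, "annual_income": 60_000, "monthly_debts": [400, 1_100]},
    {"id": 2, "annual_income": 120_000, "monthly_debts": [500]},
]

# Add the engineered feature alongside the raw columns.
for applicant in applicants:
    applicant["dti"] = debt_to_income(applicant["annual_income"],
                                      applicant["monthly_debts"])

print([(a["id"], round(a["dti"], 2)) for a in applicants])  # [(1, 0.3), (2, 0.05)]
```

The new `dti` column combines two raw columns into one number that carries the domain insight directly, so the model does not have to discover the ratio on its own.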

As such, feature engineering (and in fact, most of machine learning) is more of an art than a science, where a creative spark for an innovative way to engineer a domain specific feature is more effective than hard and fast rules. They say feature engineering can’t be taught from books, only experience, which is why I think Kaggle is in an interesting position: it is essentially crowdsourcing the best machine learning methodologies for all sorts of problems and domains. There’s a treasure trove of knowledge on there, and if structured a little better, Kaggle could contribute a lot to machine learning education.

What potentially useful features/data could we engineer from timestamp strings? We could generate year, month, day, day of week, etc. numeric data columns–much more readable by a machine learning model.
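Here is one way this could look with pandas. The column name `timestamp`, the sample values, and the extra `is_weekend` flag are my own illustrative choices.

```python
import pandas as pd

# Two made-up timestamp strings
df = pd.DataFrame({"timestamp": ["2018-12-31 23:15:00", "2019-01-05 08:30:00"]})

# Parse the strings once, then pull out numeric components
ts = pd.to_datetime(df["timestamp"])
df["year"] = ts.dt.year
df["month"] = ts.dt.month
df["day"] = ts.dt.day
df["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
df["hour"] = ts.dt.hour
df["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)

print(df.drop(columns="timestamp"))
```

Which of these columns actually help depends on the problem; a day-of-week feature matters for retail traffic, for example, but maybe not for loan defaults.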

#### Communication

During a recent chat with one of the core developers of the Python scikit-learn package, I asked what he thought some of the most important skills for a data scientist are. I sort of expected technical skills, but one of the first things that came up was communication, or being able to convey findings and why those findings matter to both internal and external stakeholders, like customers. This one’s self explanatory–what good is data if you can’t act upon it?

In fact, it seems like communicating well might be even more important for data scientists than it is for professions like programmers or designers, because there’s a larger gap between result and action. For example, with a design or app, a decision maker can look at it or play around with it to understand it well enough to make a decision, whereas a decision maker usually can’t just see a bunch of numbers that were spit out by a machine learning model and know what to do: how are those numbers actionable, why should someone believe those numbers, etc. Visualization is a piece of this, as it’s choosing the right charts, design, etc. to communicate your data’s message most effectively.

#### Ensembling

In machine learning, an ensemble is a collection of models that can be combined into something that performs better than the individual models.

An example: one way this is done is via the voting method. The different base, or “level 0”, models each make a prediction on, say, whether a person is going to go into default in the next 90 days. Model A predicts “yes”, model B predicts “yes”, and model C predicts “no”. The final decision then becomes the majority vote, here “yes”.
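The voting example above fits in a few lines of Python. The model names are placeholders, and the predictions are hard-coded rather than coming from real models.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]


# Each level 0 model's prediction for "will this person default in the next 90 days?"
votes = {"model_a": "yes", "model_b": "yes", "model_c": "no"}

final = majority_vote(list(votes.values()))
print(final)
```

With an even number of models a tie is possible, so an odd number of voters (or an explicit tie-breaking rule) is a reasonable choice for a two-class problem.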

There are many other ways of ensembling models together. An important and powerful one is called stacking, and it is applying another machine learning model–called a “generalizer”, or “level 1” model–on the predictions of the base models themselves. This is better than the voting method because you’re letting the level 1 machine learning model decide which level 0 models to believe more than others based on the training data you feed into the system, instead of arbitrarily saying “the majority rules”.

A high level flow chart of how stacking works.
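A rough sketch of stacking, assuming a reasonably recent scikit-learn that ships `StackingClassifier`. The synthetic dataset and the particular choice of level 0 and level 1 models are arbitrary, and this is not necessarily how the code example linked in the resources does it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level 0: the base models
level0 = [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# Level 1: the "generalizer", trained on cross-validated predictions of the base models
stack = StackingClassifier(estimators=level0, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f"stacked accuracy: {accuracy:.3f}")
```

The `cv=5` setting matters: the level 1 model learns from predictions the base models made on data they were not fit on, which keeps it from simply trusting whichever base model overfits the most.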

Ensembling is a key technique in machine learning to improve predictive performance. Why does it work? We all have an intuitive understanding for why it should work, because it’s a decision making framework we all have probably used, or been a part of, before. Different people know different things, and so may make different decisions given a particular problem. When we combine them in some way–like a majority vote in Congress or at the company we work at–we “diversify” away the potential biases and randomness that come from just following one decision maker. Then, if you add in some mechanism to learn which decision makers should have their decisions weighed more than others based off of past performance, the system can become even more predictive–what areas could benefit from this improved, performance based decision-making process?*

*Proprietary trading companies, where every trade is a data point and thus generated very frequently, do this more intelligent way of ensembling, in a way, by allocating more money to traders who’ve performed better than others historically. A trader who is maybe slightly profitable but makes uncorrelated trades–for example by trading in another asset class–will still be given a decently sized allocation, because his trades hedge other traders’ trades, thus improving the overall performance of the prop trading company. Analogously, in machine learning, ensembling models that make uncorrelated predictions improves overall predictive performance.
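To put a number on the intuition about uncorrelated predictions, here is a small calculation under an idealized assumption: every model is right with the same probability and the models' errors are fully independent. Real models are usually correlated to some degree, so treat this as an upper-end illustration rather than a guarantee.

```python
from math import comb


def majority_accuracy(p, n):
    """Probability that a majority of n independent models is correct,
    when each model is correct with probability p (n should be odd)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


single = 0.6                              # accuracy of one model
three = majority_accuracy(single, 3)      # majority vote of 3 such models
eleven = majority_accuracy(single, 11)    # majority vote of 11 such models

print(round(three, 3), round(eleven, 3))
```

With three independent 60% models, the majority is right 0.6³ + 3 · 0.6² · 0.4 = 64.8% of the time, and the figure keeps rising as more independent models are added. If the models all made the same mistakes, the vote would be no better than a single model.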

### Resources

Here are some resources related to the topics described above that were recommended to me and that I found most useful. I hope they’re helpful to you too.

• A good overview of the principles of data science and machine learning for non-technical and technical folk alike: Data Science for Business
• Code example of stacking done with sklearn models
• An important thing for a data scientist to have before any of the stuff above is a good understanding of statistics. Elements of Statistical Learning is a detailed survey of the statistical underpinnings of machine learning.

## Learning from machine learning: deliberate practice

In my downtime, I’ve been using Kaggle to get better at applying machine learning to solve problems. The process is not only teaching me new technical skills, but also reminding me of some useful principles that can be applied elsewhere. To keep things digestible, this is the first post of two.

### Deliberate practice, with Kaggle

Deliberate practice–practice that is repeatable, hard, and has fast feedback (e.g. with a coach)–is needed to master any skill. Kaggle provides a great medium for machine learning deliberate practice: you can still solve the problems from old competitions, read about what the top performers did, and get instant feedback on how well your machine learning model performed vs. other people’s.

Aside from accessible deliberate practice, self-learning this way has another big benefit over some of the in-person data science/machine learning classes I’ve observed: the student has control. I can learn as fast or as slow as I need to. I can learn about what I want: not only about what I find most interesting, but about what the top performers on Kaggle and other experts are doing to be successful.

I attempt to solve a machine learning problem on Kaggle, see how I performed, read about and take notes on what the top performers did, and fill in my knowledge gaps with lots of research on Google, continuously cycling between writing down questions about new terms or concepts that come up and answering them. The self-paced, deliberate nature of this learning avoids what Sal Khan calls “Swiss cheese gaps” in education–though of course, it is up to the learner him/herself to stay disciplined and engaged.

The “cycle” of deliberate practice described. Important things to note: it is closed, which allows for learning from feedback, and it is fast, which allows for that learning to happen quickly and to be timely.

Something like Khan Academy provides a great structure for self-paced, deliberate-practice-oriented learning for more “traditional” academic topics. I see opportunity for more things like it in other educational areas. Also, if anyone has found any helpful tools for self-learning, I would love to hear about them. I personally use a lot of Google Docs for note-taking, mind42 for topic hierarchies, pinboard to keep track of my online research, and sometimes Quizlet to help me memorize things.

### Next: 80/20-ing machine learning

In the next post, I will get slightly more technical and into some of the “highest leverage” machine learning concepts and skills, as well as share some resources (including advice from one of the most helpful machine learning educators and practitioners I’ve had the pleasure to interact with). There should also be at least one principle/mental model for those less interested in the technicals of machine learning. As always, please be critical and feel free to discuss anything and everything. I love learning from other perspectives.

## Pharma Paid Physicians \$6.5B in 2014 – Looking Into The Open Payments Dataset

My friend Jesse introduced me to the Open Payments Dataset, which tracks the details of all payments made by “applicable” healthcare manufacturers (like pharmaceutical companies and medical device manufacturers) to any doctor they work with. A federal program maintains this database, which is a product of the Sunshine Act, part of the Affordable Care Act.

Why does this database exist? Basically because of the incentives created by industry being able to pay doctors to work on things that will ultimately help industry–like new drugs or medical devices. The hope is that more transparency will reduce any harmful influence that industry could have on medical research, education, and clinical decision making. In the words of Senator Grassley, co-author of the Sunshine Act:

Disclosure brings about accountability, and accountability will strengthen the credibility of medical research, the marketing of ideas and, ultimately, the practice of medicine. The lack of transparency regarding payments made by the pharmaceutical and medical device community to physicians has created a culture that this law should begin to change substantially. The reform represented in the Grassley-Kohl Sunshine Law is in patients’ best interest.

The healthcare industry pays physicians a lot, almost \$6.5B in 2014 alone. What is being paid for though (or, what does industry report the payments are for)? Who’s getting paid, and how much? I decided to do a quick analysis to start answering these questions and to see if there was anything interesting at a high level.

### Most top paid physicians get paid royalties or license fees

The most a single physician got paid in 2014 was almost \$44M. The interesting thing is that for this physician and several other top paid physicians, almost the entire total came from payments that were categorized in this unhelpfully-named category, “Compensation for services other than consulting, including serving as faculty or as a speaker at a venue other than a continuing education program” (orange).

A large majority of the other top paid physicians got paid primarily from “Royalty or License” (green), which makes sense: a surgeon may invent a new surgical technique and license it to a medical device company.

Another interesting phenomenon is that a handful of doctors in the top 100 earners were paid by industry solely for their research (purple). The status quo of industry having all the money and thus paying/funding research–sometimes both the design of and execution of the research–can create incentives with negative consequences for the validity of the results.

You can play around with the charts like the one below by zooming, mousing over data points to see their values, and showing/hiding different data series by clicking on each one in the legend. Physician names have been replaced with numbers for anonymity.

### Orthopedic surgeons received the most industry payments, followed by cardiovascular physicians

Orthopedic surgeons received the most money from industry in 2014, almost twice the amount that cardiovascular physicians received. Interestingly, most of the payments to orthopedic surgeons, and other types of surgeons, were for royalties or licenses (green), whereas most payments for physicians–cardiovascular and otherwise–were for “Compensation for services other than consulting” (orange), “Research” (purple), and “Consulting” (purple).

Click to show interactive chart (some labels are crazy long so embedding didn’t look good. “A&O” stands for “Allopathic & Osteopathic Physicians”):

### The healthcare industry pays a lot of money for research

Out of the \$6.5B total payments to physicians in 2014, \$3.2B, or almost half, of those payments were for research. We can see this when aggregating the payments by the name of the drug or device manufacturer: companies like Genentech, Pfizer, and Novartis dominate the dollar amount of payments made to physicians, and most of their payments are for “Research” (brown). Further down the line, you can see medical device manufacturers like Stryker and Medtronic paying physicians mostly for “Royalty and License” (green).

Click to show interactive chart:

### Physicians in CA received, by far, the most money from industry

The graph below shows how much money physicians received for research and “general” payments (any payment that isn’t classified as “Research”), grouped by the state they work in; the size of each bubble represents the number of physicians in that state.

CA had significantly more physicians receive payments (8081) than the runner-up state, NY (5981), and thus the physicians that worked in CA received a lot more money from industry, in aggregate.

Though drilling into state by state differences in the data (e.g. the dominant “purpose” CA physicians vs. physicians in other states get paid for) is an exercise for another time, we get a hint for why this phenomenon might exist by looking at the teaching hospitals that were affiliated with the physicians who got paid by industry the most.

Click to show interactive chart:

Physicians affiliated with the City of Hope National Medical Center in Los Angeles received the most industry payments, by far, and almost all of it from royalties or license fees (green). Genentech has been known to pay massive royalties for the drugs developed at City of Hope, including the crazy expensive cancer treatments Herceptin and Avastin.

### Do physicians get rewarded with fancy dinners and extravagant trips?

By looking at the data, we can find which physicians got paid the most for “Entertainment”, “Food and Beverage”, and “Travel and Lodging”. But we won’t know for sure, because remember, all this payment data is reported by the healthcare industry itself, and while there are some financial penalties for inaccurate reports, I don’t see an easy way for the government to verify the validity of the data.

The “worst offenders” were essentially given, by industry, three \$60 meals a day for every day of the year, went on \$590 per day trips, and spent \$43 a day (about \$300 a week) on entertainment and fun. Sounds like the life (except a little more on the entertainment and fun please).

### Conclusion

There’s a lot of money being transferred from the healthcare industry to physicians, which means a ton of data since all of this has to be reported now. In fact, I didn’t even touch another part of the dataset, how much ownership each physician has in a particular drug or device manufacturer, which could give even more color on misaligned incentives. Also, without aggregation of some of the data fields, the raw, transaction/payment level data took up close to 6GB of space, and I didn’t want to spin up a Spark cluster or something. Luckily, the Open Payments site provides a web service that allowed me to aggregate and filter the raw data, dramatically reducing the dataset’s size.
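For a sense of what the aggregations behind the charts above look like, here is a toy pandas sketch. The rows are invented and the column names are hypothetical stand-ins, not the actual Open Payments schema or the web service’s interface.

```python
import pandas as pd

# Invented payment-level rows; column names are illustrative only
payments = pd.DataFrame({
    "physician_id": [1, 1, 2, 2, 3],
    "state": ["CA", "CA", "NY", "NY", "CA"],
    "nature_of_payment": [
        "Royalty or License", "Consulting", "Research",
        "Food and Beverage", "Research",
    ],
    "amount_usd": [1000.0, 250.0, 500.0, 60.0, 400.0],
})

# Total dollars per payment category, largest first
by_category = (
    payments.groupby("nature_of_payment")["amount_usd"]
    .sum()
    .sort_values(ascending=False)
)

# Total dollars and number of distinct physicians per state
by_state = payments.groupby("state").agg(
    total_usd=("amount_usd", "sum"),
    physicians=("physician_id", "nunique"),
)

# One row per physician, one column per payment category (for stacked bar charts)
per_physician = payments.pivot_table(
    index="physician_id",
    columns="nature_of_payment",
    values="amount_usd",
    aggfunc="sum",
    fill_value=0,
)

print(by_category, by_state, per_physician, sep="\n\n")
```

Doing this kind of grouping on the server side before downloading is what keeps the data small enough to handle on a laptop.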

With the Sunshine Act being first introduced in 2007, then shot down, then enacted as part of the ACA in 2010, and with the Centers for Medicare and Medicaid Services (CMS) now responsible for collecting this data on top of everything else it does, hopefully we find some useful applications for the Open Payments dataset.

This analysis and post were done pretty quickly. Many thanks to Carol for giving me some immediate ideas and feedback! And to IPython Notebook, and the pandas and plotly libraries.

## Cancer clinical trials and the problem of low patient accrual

Inspired by this contest to come up with ideas to increase the low amount of patient accrual for cancer clinical trials, I decided to look more into the data. Bold, by the way, is one of my all time favorite books, and was co-authored by the creator of the herox.com website, the xprize Foundation, and co-founder of Planetary Resources: Peter Diamandis. Truly someone to look up to.

Anyways, the premise of the contest is that over 20% of cancer clinical trials don’t complete, so the time and effort spent is wasted. The most common reason for this termination is the clinical trial not being able to recruit enough patients. Just how common is the low accrual reason though? And are there obvious characteristics of clinical trials that can help us better predict which ones will complete successfully, and what does that suggest about building better clinical trial protocols? I saw this as an opportunity to explore an interesting topic, while playing around with the trove of data at clinicaltrials.gov and various Python data analysis libraries: seaborn for graphing, scikit-learn for machine learning, and the trusty pandas for data wrangling.

### Basic data characteristics

I pulled the trials for a handful of the cancers with the most clinical trials (completed, terminated, and in progress), got around 27,000 trials, and observed the following:

• close to 60% of the studies are based in the US*
• almost 25% of all US based trials ever (finished and in progress) are still recruiting patients

• of those trials that are finished and have results, close to 20% terminated early, and 80% completed successfully (which matches the numbers the contest cited)

• almost 50% of all US based trials are in Phase II, almost 25% are in Phase I

• and interestingly, the termination rate does not differ very significantly across studies in different phases

### Termination reasons

Next, I was interested in finding out just how common insufficient patient accrual was as a trial termination reason vs. other reasons. This was a little tricky, as clinicaltrials.gov gives principal investigators a free-form text field to enter their termination reason. So “insufficient patient accrual” could be described as “Study closed by PI due to lower than expected accrual” or “The study was stopped due to lack of enrollment”. So I used k-means clustering (after term frequency-inverse document frequency feature extraction) of the termination reasons to find groups of reasons that meant similar things, and then manually de-duped the groups (e.g. combining the “lack of enrollment” and “low accrual” groups into the same group because they meant the same thing).
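A small sketch of that tf-idf plus k-means step with scikit-learn. Apart from the two reasons quoted above, the termination reasons are made up, and the number of clusters and vectorizer settings are arbitrary choices rather than the ones used in the analysis.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Free-form termination reasons (first and third are quoted above, the rest are invented)
reasons = [
    "Study closed by PI due to lower than expected accrual",
    "Terminated due to slow accrual",
    "The study was stopped due to lack of enrollment",
    "Low enrollment",
    "Sponsor decision: funding withdrawn",
    "Funding was withdrawn by the sponsor",
]

# Turn each reason into a tf-idf weighted bag-of-words vector
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(reasons)

# Group reasons with similar wording
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Collect the reasons in each cluster for manual review
clusters = {}
for reason, label in zip(reasons, labels):
    clusters.setdefault(int(label), []).append(reason)

for label, members in sorted(clusters.items()):
    print(label, members)
```

Because “accrual” and “enrollment” share no words, they will tend to land in separate clusters, which is exactly why the manual merge afterwards is needed.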

I found that about 52% of terminated clinical trials end because of insufficient patient accrual. This implies that about 10% of clinical trials that end (either successfully, or because they’re terminated early) do so because they can’t recruit enough patients for the study.
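The 10% figure is just the product of two shares. A quick check, using the 80.6% completion rate cited later in this post:

```python
completed_share = 0.806                      # finished trials that completed
terminated_share = 1 - completed_share       # finished trials that terminated early
low_accrual_given_terminated = 0.52          # terminated trials citing low accrual

# Share of all finished trials that ended because of low accrual
low_accrual_share = terminated_share * low_accrual_given_terminated
print(f"{low_accrual_share:.1%}")
```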

### Predicting clinical trial termination?

Clinicaltrials.gov provides a bunch of information on each clinical trial–trial description, recruitment locations, eligibility criteria, phase, sponsor type (industry, institutional, other) to name a few–which raises the question: can this information be used to predict whether a trial will terminate early, specifically because of low patient accrual? Are there visible aspects of a clinical trial that are related to a higher or lower probability that it fails to recruit enough patients? One might think that the complexity of trial eligibility criteria and the number of hospitals from which the trial can recruit could be related to sufficient patient accrual.

Here was my attempt to get at a solution to this question analytically: fitting/training a multi-class logit (logistic regression) classifier–whether a trial would be “completed”, “terminated because of insufficient accrual”, or “terminated for other reasons”–on a random partition of clinical trial data, and measuring its accuracy at classifying out-of-sample clinical trials. The predictors were of two types: characteristic (e.g. phase, number of locations, sponsor type, etc.) and “textual”, or features extracted from text based data like the study’s description and eligibility criteria. Some of these features came from a tf-idf vectorization process similar to the one described in the k-means section above; other features were the simple character lengths of these text blocks. Below is a plot showing the relationship between two of these features: length of the eligibility criteria block of text, and length of the study’s title, two metrics that perhaps get at the complexity of a clinical trial.
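As a rough illustration of the modeling setup described in this section, here is a sketch that combines tf-idf text features, a text-length feature, and trial characteristics in one scikit-learn pipeline. The trials, column names, and labels are all invented, and the rows are repeated just so the split has enough data, so the accuracy it prints means nothing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented trials; real data would come from clinicaltrials.gov
base = pd.DataFrame({
    "criteria": [
        "Age 18 or older with measurable disease",
        "Adults with confirmed diagnosis and adequate organ function",
        "Rare mutation required; no prior chemotherapy; ECOG 0 only",
        "Must carry rare biomarker and have failed three prior therapies",
        "Histologically confirmed tumor; prior surgery allowed",
        "Any stage disease; prior radiation allowed",
    ],
    "phase": ["Phase II", "Phase III", "Phase I", "Phase II", "Phase I", "Phase II"],
    "n_locations": [12, 40, 1, 2, 5, 8],
    "outcome": ["completed", "completed", "low_accrual", "low_accrual", "other", "other"],
})
trials = pd.concat([base] * 10, ignore_index=True)  # repeat rows: toy data only

# "Textual" feature: simple character length of the eligibility criteria
trials["criteria_length"] = trials["criteria"].str.len()

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "criteria"),                         # tf-idf of the text
    ("phase", OneHotEncoder(handle_unknown="ignore"), ["phase"]),    # categorical
    ("numeric", StandardScaler(), ["n_locations", "criteria_length"]),
])
model = make_pipeline(features, LogisticRegression(max_iter=1000))

# Random partition: fit on one part, score on the held-out part
X_train, X_test, y_train, y_test = train_test_split(
    trials.drop(columns="outcome"), trials["outcome"],
    test_size=0.3, random_state=0, stratify=trials["outcome"],
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"out-of-sample accuracy on toy data: {accuracy:.3f}")
```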

The result: the logit model could only predict correctly whether trials would complete successfully, terminate because of low accrual, or terminate for other reasons 83.6% of the time. This is a pretty small improvement over saying “I think this trial will complete successfully” to every trial you come across, in which case you would be correct 80.6% of the time (see the Completed vs. Terminated pie chart above). Cancer clinical trials are very diverse, so it makes sense that there don’t seem to be any apparent one-size-fits-all solutions to improving patient accrual.
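The comparison against the “always guess completed” baseline can be made explicit. The label counts below are hypothetical, chosen only to reproduce the 80.6% completion rate, and the 83.6% figure is the model accuracy reported above.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical outcomes for 1,000 finished trials (80.6% completed)
labels = ["completed"] * 806 + ["low_accrual"] * 101 + ["other"] * 93

baseline = majority_baseline_accuracy(labels)
model_accuracy = 0.836  # reported accuracy of the logit model
lift = model_accuracy - baseline

print(f"baseline {baseline:.1%}, model {model_accuracy:.1%}, lift {lift:.1%}")
```

When one class dominates like this, raw accuracy flatters any model, so checking against the majority-class baseline is a useful habit.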