All of the code used in this answer can be found in this notebook on GitHub
Producing a retention profile is a fairly common task for an analyst at a consumer tech company. A retention curve captures the rate at which cohorts degrade (users churn) over time -- like a survival function, the point of a retention curve is to depict the probability that a user will still be "alive" in the product at some point in time (eg. 30% of the users that joined in a cohort are still active at Day 21). The point of generating a retention curve at the cohort level is that it allows product managers and analysts to forward-project how many users from a given cohort will remain in the product over time.
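To make the idea concrete, here's a minimal sketch (with made-up numbers) of how a single cohort's retention curve is computed: divide the number of cohort members still active on each day by the cohort's starting size:

```python
# Hypothetical example: a cohort of 1,000 users that joined on Day 0,
# with the number of those users still active on selected later days
cohort_size = 1000
active_on_day = { 0: 1000, 1: 580, 7: 410, 14: 350, 21: 300, 30: 160 }

# Retention at Day X = share of the original cohort still active on Day X
retention_curve = { day: active / cohort_size
    for day, active in active_on_day.items() }
print( retention_curve[ 21 ] )
# 0.3, ie. 30% of the cohort retains at Day 21
```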
One approach to doing this is to simply measure the historical retention of past cohorts and apply it to new cohorts that will join the product (through paid advertising, organic discovery, etc.). The problem with this is that it requires a very long timeline in order to make judgments over the course of a cohort's life: if I want to use this historical approach to estimate how many users in a cohort that I'll acquire tomorrow will still be around 365 days after joining (ie. in 366 days), then I'd need at least one cohort that is 365 days old to have some historical guidance on what Day 365 will look like for this new cohort (and ideally I'd have at least a few of those cohorts). That means that I'd either need to wait at least a year and watch those cohorts evolve before I can make Day 365 estimates for new cohorts, or I'd simply only make estimates out to the point for which I have robust data on existing cohorts (eg. I only make Day X estimates for new cohorts based on historical data that includes at least 10 cohorts, so if I have 10 cohorts that have passed through Day 60, then I only make Day 60 estimates for new cohorts). This also isn't ideal if I need to have at least a rough estimate of long-term retention in my product in order to do paid user acquisition.
So the second approach to producing a retention curve is to use as much robust data as I have available to project a retention curve out to some point in the future and then revisit that curve as more historical data becomes available. The idea here is that I can probably understand the underlying shape that a retention curve takes after gathering enough retention data to a point -- say, Day 30 -- so once I feel comfortable with that shape, I can project it out across a timeline that is useful to me in doing product analysis and setting marketing strategy.
Here's an example that uses Python. Let's say that I have 31 days of cohorts in my product, meaning that my sample of Day 1 retention values (Day 1 being the second day of a cohort's life -- Day 0 is the day that they join) is 30 and my sample of Day 30 retention values is 1. Let's also say that, for some strategic reason, it's important for me to understand how many users in a cohort will remain in my product by Day 90.
We can create some sample data for the cohorts through Day 31:
import random
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from scipy.optimize import curve_fit
underlying = [ ( 0, [ 100, 100 ] ), ( 1, [ 55, 60 ] ), ( 7, [ 38, 44 ] ), ( 14, [ 26, 32 ] ),
    ( 21, [ 20, 25 ] ), ( 30, [ 14, 18 ] ) ]
cohorts = 31
cohorts_retention_data = []
for i in range( cohorts ):
    for k in underlying:
        x = k[ 0 ]
        if i >= x:
            y = random.randint( k[ 1 ][ 0 ], k[ 1 ][ 1 ] )
            cohorts_retention_data.append( ( x, y ) )
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.scatter( *zip( *cohorts_retention_data ) )
When graphed, this data looks like this:
Let's say that we only want to use retention days with at least 10 data points (ie. 10 cohorts have aged that long and reached that Day X point) in building the projection. We can probably pretty quickly count that in our heads, but we can also just use list comprehensions to get counts:
day_x = [ x[ 0 ] for x in cohorts_retention_data ]
day_x_counts = [ list( i ) for i in set( tuple( i ) for i in [ [ x, day_x.count( x ) ] for x in day_x ] ) ]
day_x_counts.sort( key=lambda x: x[ 0 ] )
print( day_x_counts )
Which produces this list of lists, where the first element of each inner list is the Day X retention waypoint and the second element is the number of cohorts that we have data for on that waypoint:
[[0, 31], [1, 30], [7, 24], [14, 17], [21, 10], [30, 1]]
So Day 21 is the last Day X retention waypoint that at least 10 cohorts have passed through, which means we can only use data through Day 21 to produce the projection.
Now, to make a projection, we need to make some assumptions about what kind of function best fits this type of data. We can experiment with two different functions: a log function and an exponential function. We'll use Scipy's curve_fit method to fit these functions to the data, and curve_fit expects these functions to be defined with some number of parameters, like this:
def log_func( x, a, b, c ):
    return -a * np.log2( b + x ) + c
def exp_func( x, a, b, c ):
    return a * np.exp( -b * x ) + c
In this case, I'm giving both functions three parameters (a, b, and c) for curve_fit to use when it tries to fit these functions to the underlying data (an in-depth explanation of how curve_fit works is beyond the scope of this answer, but more information can be found here).
Now that these functions are defined, we need to pull out the X and Y values into lists. We also need to limit the data that we use in the projections to that which fits our requirement of cohorts' Day X retention points having at least 10 values. We can accomplish this with the following code:
projection_endpoint = 90
#the day we want to project retention curve out to
max_cohort = max( [ x[ 0 ] for x in day_x_counts if x[ 1 ] >= 10 ] )
#get the max cohort X value with at least 10 data points
x_data = [ x[ 0 ] for x in cohorts_retention_data if x[ 0 ] <= max_cohort ]
#pull the x values out of cohorts_retention_data
y_data = [ x[ 1 ] for x in cohorts_retention_data if x[ 0 ] <= max_cohort ]
#pull the y values out of cohorts_retention_data
#note that the x and y data above isn't sorted, but each value in one list corresponds to the
#value in the other list. This means we can't sort one list and not the other: the value of
#x_data[ z ] matches the value of y_data[ z ].
log_popt, log_pcov = curve_fit( log_func, x_data, y_data )
#get the log function parameters using curve_fit on the x and y data
exp_popt, exp_pcov = curve_fit( exp_func, x_data, y_data )
#get the exp function parameters using curve_fit on the x and y data
log_y_projected = log_func( np.arange( projection_endpoint + 1 ), *log_popt )
#project out the values through the projection_endpoint (the + 1 makes the
#endpoint day inclusive) using the log function parameters
exp_y_projected = exp_func( np.arange( projection_endpoint + 1 ), *exp_popt )
#project out the values through the projection_endpoint using the exp function parameters
Now that we have projected out our log and exponential functions, we can plot them to see how they compare to our actual data using this code:
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.scatter( *zip( *cohorts_retention_data ), label="Original Data" )
plt.plot( log_y_projected, label="Log Function Projections" )
plt.plot( exp_y_projected, label="Exp Function Projections" )
plt.legend()
Which produces this graph:
The log function seems to fit the data better (albeit imperfectly) than the exponential function does, but we can validate that by calculating the sum of squared error for each function against the actual data.
We'll do that by defining a function called sumSquaredError that takes two arguments: a list of projected y values (from the log and exp functions) for each Day X waypoint, and a list of lists of the actuals, structured as actual_y[ x_date ][ values ], where x_date corresponds to the same index in the projected values list -- ie. we're comparing the set of actual values we have at each waypoint against the single value that our function predicted for it:
def sumSquaredError( projected_y, actual_y ):
    squared_error = sum( [ ( actual_y[ x ][ i ] - projected_y[ x ] )**2
        for x in range( len( projected_y ) )
        for i in range( len( actual_y[ x ] ) ) ] )
    return squared_error
This function iterates through the x values that are present in the projections (remember that our sample data only had a few different x values present; in your normal data, you'd probably have values for every retention day) and sums the squared difference between each actual value and the projected value. That sum is the sum of squared error, and it represents the aggregate distance between our projections and our actual values across each point in the curve.
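As a quick sanity check on the error calculation, consider two waypoints where the projection matches the first waypoint's actuals exactly and is off by 1 for one value at the second waypoint; the function is restated here so the snippet runs on its own:

```python
def sumSquaredError( projected_y, actual_y ):
    # Sum the squared difference between each actual value at a waypoint
    # and that waypoint's single projected value
    return sum( [ ( actual_y[ x ][ i ] - projected_y[ x ] )**2
        for x in range( len( projected_y ) )
        for i in range( len( actual_y[ x ] ) ) ] )

# Waypoint 0: projection of 1 vs. actuals [ 1, 1 ] -> error 0
# Waypoint 1: projection of 2 vs. actuals [ 2, 3 ] -> error 0 + 1
print( sumSquaredError( [ 1, 2 ], [ [ 1, 1 ], [ 2, 3 ] ] ) )
# 1
```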
Next we need to format our projected and actual data in a way that this function can use. We can do that using this code:
unique_days = sorted( set( x_data ) )
#sorting gives us a stable, ordered list of the Day X waypoints
log_ss_y_values = [ log_y_projected[ i ] for i in unique_days ]
exp_ss_y_values = [ exp_y_projected[ i ] for i in unique_days ]
actual_y_values = [ [] for i in range( len( unique_days ) ) ]
for z in [ i for i in cohorts_retention_data if i[ 0 ] in unique_days ]:
    this_index = unique_days.index( z[ 0 ] )
    actual_y_values[ this_index ].append( z[ 1 ] )
log_ss = sumSquaredError( log_ss_y_values, actual_y_values )
exp_ss = sumSquaredError( exp_ss_y_values, actual_y_values )
print( "Log Function squared error: " + str( log_ss ) )
print( "Exponential Function squared error: " + str( exp_ss ) )
When we run the above, we calculate the squared error for each function and output it. The actual values you see here will be different each time you run all of this code as it uses random numbers to generate the actual values, but here's the output from mine:
Log Function squared error: 200.03263751553877
Exponential Function squared error: 1042.8203626240693
As you can see, the Log function has a lower squared error value, meaning it fits the data better -- up through Day 21, which, as you might remember, was the date through which we used data to make the projections.
As a product manager or analyst, I might use the Log function projection to estimate future retention dates, but I'd be circumspect about its accuracy, especially on longer timelines, until I had more data to feed into these models. And I'd continuously update these models over time as cohorts aged and I brought more cohorts into the product, increasing the overall data set the models can draw from.
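Since the stated goal was an estimate of Day 90 retention, the last step is simply to evaluate the fitted log function at Day 90. The sketch below re-fits on stand-in data so it runs on its own (the retention values are made up, and the p0 starting guess and bounds -- which keep the log's argument positive during fitting -- are additions not present in the code above):

```python
import numpy as np
from scipy.optimize import curve_fit

def log_func( x, a, b, c ):
    return -a * np.log2( b + x ) + c

# Stand-in Day X retention observations through Day 21 (made-up values)
x_data = [ 0, 1, 7, 14, 21 ]
y_data = [ 100, 57, 40, 29, 22 ]

# Bounds keep a non-negative and b positive so np.log2 is always defined
log_popt, log_pcov = curve_fit( log_func, x_data, y_data,
    p0=( 10., 1., 100. ),
    bounds=( [ 0., 1e-3, -np.inf ], [ np.inf, np.inf, np.inf ] ) )

# Evaluate the fitted curve at the day we care about
day_90_estimate = log_func( 90, *log_popt )
print( round( float( day_90_estimate ), 1 ) )
```

Because the fitted log function is monotonically decreasing for positive a, the Day 90 estimate will always sit below the fitted Day 21 value; how much to trust it depends on how well the curve's shape has stabilized by Day 21.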
But the above is a fairly standard approach to projecting out retention rates and building projection curves. What's missing here is any sense of dimensionality: we have data grouped together into one large bucket of "retention data," but it's not split into segments based on the various meaningful factors that can influence cohort retention, such as source channel (eg. Facebook or Organic or Google), country, device type, etc. In order to use retention profiles for producing business insight, but especially for running paid marketing campaigns, we'd want to create these retention profiles for the combinations of dimensions that we use to analyze our user base. For instance, we might have one retention profile for cohorts of iPhone users in the US acquired on Facebook and another retention profile for cohorts of iPhone users in Canada acquired on Google. It is this segmentation that allows user acquisition teams to systematically set bid levels on different acquisition channels.
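As a sketch of what that segmentation might look like in code (the dimension names and values here are hypothetical), retention observations could be keyed by a ( channel, country, device ) tuple, with each segment's ( Day X, retention ) pairs then fed separately into the curve-fitting process above:

```python
from collections import defaultdict

# Hypothetical ( Day X, retention % ) observations tagged with acquisition
# dimensions: ( channel, country, device )
observations = [
    ( ( "Facebook", "US", "iPhone" ), 1, 58 ),
    ( ( "Facebook", "US", "iPhone" ), 7, 41 ),
    ( ( "Google", "CA", "iPhone" ), 1, 52 ),
    ( ( "Google", "CA", "iPhone" ), 7, 35 ),
]

# Group the ( Day X, retention ) pairs by segment; each segment's list would
# then get its own curve fit and its own projection
segment_data = defaultdict( list )
for segment, day, retention in observations:
    segment_data[ segment ].append( ( day, retention ) )

for segment, points in segment_data.items():
    print( segment, points )
```

Keeping one list of observations per segment keeps the per-segment fitting loop trivial; in practice the grouping would usually happen upstream in SQL or pandas rather than in a Python loop.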