I see that the number of new organic users that download my app increases when spend levels of paid campaigns increase, but how can I measure what that impact is?


All of the code used in this answer can be found on GitHub

Assuming there is a relationship between Paid and Organic installs, and that the relationship is linear (e.g. the number of Organic installs doesn't increase exponentially as Paid installs increase, which sometimes happens), you can measure that relationship fairly easily in Python with a simple linear regression.

As an example, we can start by seeding some sample Organic and Paid install data and plotting it. We'll start by creating a sample Paid install data set that increases over a 100-day period. We'll then use an arbitrary multiplier to create a sample of Organic data that is related to the Paid install data (i.e. the Organic installs are not totally unrelated to Paid installs):

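#imports needed by all of the code in this answer
import random
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from scipy.stats import linregress

x = [ 1, 100 ] #day 1 to day 100
y = [ 30, 500 ] #y here is paid installs, so start at 30 at day 1 and end at 500 (the end = day 100, as per x)

model = linregress( x, y ) #create a simple linear regression model from the data above

#we'll use the linear model parameters (model[ 0 ] = slope, model[ 1 ] = intercept)
#to project Paid Installs out to Day 100, but we'll also add some noise (-40% to +10%)
y = [ ( model[ 0 ] * day + model[ 1 ] ) * ( 1 + random.randint( -400, 100 ) / 1000 ) for day in range( x[ 0 ], x[ 1 ] + 1 ) ]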
#now we'll use some arbitrary multiplier (divisor, actually) to create a sample Organic installs sample
#that has a relationship to paid
y2 = [ (v/ random.randint( 3, 5 ) ) * ( 1 + random.randint( 100, 300 ) / 1000 ) for v in y ]

#Plot the Paid and Organic install data
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot( y, label='Paid Installs' )
plt.plot( y2, label='Organic Installs' )
plt.ylabel( 'Installs' )
plt.xlabel( 'Day' )
plt.legend()
plt.show()

The plot should look something like this (we used random data to seed the samples, so your graph won't look exactly the same):

The blue line here is Paid installs and the orange line is Organic installs. We can see that both graphs seem to be increasing linearly and that there is some relationship between Paid and Organic installs. This would be the starting point for undertaking this analysis if you are observing your own data: you see that there seems to be some relationship between Paid and Organic installs.

In order to quantify the relationship between Paid and Organic, we'll calculate the covariance between the two variables. Covariance is a measure of how variables move together: if covariance is positive and "high", the two variables tend to rise and fall together (and if it is negative and "high", one tends to fall as the other rises). This thread on StackExchange does a good job of explaining covariance. The purpose of calculating covariance is to validate our assumption that these two variables have some sort of relationship. We'll calculate it with:

#calculate the Population covariance between Paid and Organic.
cov = np.cov( y, y2, bias=True )[ 0 ][ 1 ]
print( "Covariance: " + str( cov ) )

The value of the covariance between Paid and Organic when I run this code is 5414, which is "high". I use "high" in quotations here because the value of the covariance is relative to the values of the underlying data, meaning that 5000 could be "high" for some set of data and "low" for another set with larger values. In our case, 5000 is high, but we can also calculate the normalized Pearson's coefficient to confirm:

#Calculate the Pearson's correlation coefficient between Paid and Organic installs
cor_coeff = np.corrcoef( y, y2 )
print( "Correlation Coefficient: " + str( cor_coeff[ 0 ][ 1 ] ) )

When I run this, I get 0.918. Pearson's coefficient is a measure of the linear relationship between two variables and takes a value from -1 (perfectly negatively linearly correlated) through 0 (no linear correlation) to +1 (perfectly positively linearly correlated). Since our value of 0.92 is very close to 1, it means our variables are highly linearly correlated.
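To make the link between the two measures concrete, here is a minimal sketch showing that Pearson's coefficient is the covariance divided by the product of the two standard deviations, which is why it does not depend on the scale of the data. The `paid` and `organic` arrays are small hypothetical stand-ins, not the random sample generated above:

```python
import numpy as np

# Hypothetical stand-in data (not the answer's random sample): a small
# paid series and an organic series that roughly tracks it.
paid = np.array([30.0, 45.0, 60.0, 80.0, 100.0, 130.0])
organic = np.array([9.0, 12.0, 19.0, 22.0, 31.0, 38.0])

# Population covariance, the same call used earlier in this answer.
cov = np.cov(paid, organic, bias=True)[0][1]

# Pearson's r = covariance / (std of paid * std of organic).
# np.std defaults to the population standard deviation (ddof=0),
# which matches bias=True above.
r_manual = cov / (paid.std() * organic.std())
r_numpy = np.corrcoef(paid, organic)[0][1]

# Rescaling both series changes the covariance but not the correlation.
cov_scaled = np.cov(paid * 10, organic * 10, bias=True)[0][1]
r_scaled = np.corrcoef(paid * 10, organic * 10)[0][1]

print("Covariance:", cov, "-> after scaling by 10:", cov_scaled)
print("Pearson r:", r_manual, "-> after scaling by 10:", r_scaled)
```

For the sample in this answer, the coefficient of about 0.92 is what tells us the linear relationship is strong.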

Since that's true, we should create another simple linear regression model to predict Organic installs as a function of Paid installs:

#Now we'll create a second simple linear regression model where Paid Installs is the independent variable (X)
#and Organic installs is the dependent variable (Y). This will express the relationship between Paid and Organic
model2 = linregress( y, y2 )

#Now we'll create a list of expected Organic installs at each level of Paid installs, from 0 to 100
#(model2[ 0 ] is the slope and model2[ 1 ] is the intercept of the fitted line)
organic_paid = [ ( model2[ 0 ] * paid + model2[ 1 ] ) for paid in range( 101 ) ]

This organic_paid variable is a list of Organic install values at various levels of Paid installs, up through 100. We can plot this to get a sense of the relationship (which is represented via the slope of the line):

#Plot the Paid (x) vs. Organic (y) projection
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot( organic_paid, label='X: Paid Installs, Y: Expected Organic Installs' )
plt.ylabel( 'Organic Installs' )
plt.xlabel( 'Paid Installs' )
plt.legend()
plt.show()

A rough reading of this chart gives about 5 Organic installs at a level of 20 Paid installs and about 10 Organic installs at a level of about 40 Paid installs. We can calculate the exact values like this:

#Some examples
print( "Paid Installs: 10, Organic Installs: " + str( int( model2[ 0 ] * 10 + model2[ 1 ] ) ) )
print( "Paid Installs: 50, Organic Installs: " + str( int( model2[ 0 ] * 50 + model2[ 1 ] ) ) )
print( "Paid Installs: 100, Organic Installs: " + str( int( model2[ 0 ] * 100 + model2[ 1 ] ) ) )
print( "Paid Installs: 1000, Organic Installs: " + str( int( model2[ 0 ] * 1000 + model2[ 1 ] ) ) )

Paid Installs: 10, Organic Installs: 1
Paid Installs: 50, Organic Installs: 14
Paid Installs: 100, Organic Installs: 31
Paid Installs: 1000, Organic Installs: 329
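The same numbers can be read from the regression result by name instead of by index, which also gives a quick measure of fit quality. The sketch below is only an illustration: the data is a hypothetical seeded sample (Organic set to roughly 30% of Paid plus noise), and `expected_organic` is a helper invented for this example.

```python
import random
from scipy.stats import linregress

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample: organic is roughly 30% of paid, plus noise.
paid = [30 + 5 * day for day in range(100)]
organic = [0.3 * p + random.uniform(-10, 10) for p in paid]

fit = linregress(paid, organic)

# slope: expected extra organic installs per additional paid install
# intercept: expected organic installs at zero paid installs
# rvalue ** 2: share of organic variance explained by the linear fit
print("slope:", round(fit.slope, 3))
print("intercept:", round(fit.intercept, 1))
print("r squared:", round(fit.rvalue ** 2, 3))

def expected_organic(paid_installs):
    """Point estimate from the fitted line (hypothetical helper)."""
    return fit.slope * paid_installs + fit.intercept

print("Expected organic at 250 paid:", round(expected_organic(250), 1))
```

The slope is the number to watch: it is the estimated Organic uplift per additional Paid install. Predictions for Paid levels well outside the range the model was fitted on (such as 1000 in the output above, when the sample tops out near 500) are extrapolations and should be treated with more caution.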

Note that this example was specifically designed with an underlying linear relationship between Paid and Organic installs -- you might find that no such relationship exists in your own data. You might also find that a relationship does exist at some level of Paid installs but doesn't exist at lower or higher levels -- you might need to reach some threshold of Paid installs before people find your product Organically. And the relationship can change at various levels, too: it's not hard to envision an app that sees Organic discovery skyrocket when Paid installs reach the level of tens of thousands per day. The exploration phase of this process is important: you should visualize your data sets and break them out into specific periods of time before trying to model a relationship.
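One way to explore whether the relationship changes with the level of Paid installs is to fit separate regressions on different Paid ranges and compare the slopes. This is a rough sketch under stated assumptions: the data is synthetic, the `THRESHOLD` of 200 is a made-up value chosen for illustration, and `fit_segment` is a hypothetical helper. With real data you would pick the break points by looking at a scatter plot first.

```python
import random
from scipy.stats import linregress

random.seed(1)  # fixed seed so the sketch is repeatable

THRESHOLD = 200  # hypothetical paid level where organic discovery kicks in

# Synthetic data: organic stays flat (about 20) below the threshold and
# grows by 0.5 per paid install above it, with some noise added.
paid = list(range(10, 510, 5))
organic = [
    20 + max(0, p - THRESHOLD) * 0.5 + random.uniform(-5, 5)
    for p in paid
]

def fit_segment(low, high):
    """Fit a linear regression on the days where low <= paid < high."""
    pairs = [(p, o) for p, o in zip(paid, organic) if low <= p < high]
    xs, ys = zip(*pairs)
    return linregress(xs, ys)

low_fit = fit_segment(0, THRESHOLD)
high_fit = fit_segment(THRESHOLD, float("inf"))
overall = linregress(paid, organic)

print("Slope below threshold:", round(low_fit.slope, 3))
print("Slope above threshold:", round(high_fit.slope, 3))
print("Slope over all data:", round(overall.slope, 3))
```

If the segment slopes differ a lot, as they do in this synthetic sample, a single line fitted over all of the data will understate the effect at high Paid levels and overstate it at low ones.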
