Predicting long-term user engagement from short-term behavior

Dan Quach
Published in Insight · 11 min read · Nov 17, 2020


Photo by abigail low on Unsplash

User engagement is one of the crucial elements for both well-established companies and nascent startups to study and understand deeply throughout the development of a product. A product’s health can be broadly measured by the delicate balance between new-user adoption, engagement, retention, and churn, all of which are closely related. With the fierce competition in the mobile app market today, many developers are vying for the attention of large populations of smartphone users around the world. It is not uncommon for the average person to use an app only a handful of times before quitting once the initial novelty wears off, leaving companies with the tough problem of finding new ways to increase engagement with their product.

User engagement with the startup’s platform, the subject of this article

As an Insight Fellow, I partnered with a company that provides a mobile app payment service allowing its users to make simple cash transfers to one another. In addition to its peer-to-peer social transaction features akin to Venmo or Cash App, one of its unique core functionalities is the ability for users to create ‘Pools’ in which many users across the platform may contribute to a central cash pot. These collections can be used for any general purpose: casual outings, event planning, roommate rent payments, betting pools, charities, etc. The platform is seeing monthly growth in registered users of 2–3%, with over 390k registered users today. About 38% (150k) of those users have been active in the last year and 8% (31k) within the last month.

The Problem

A problem that the company wanted to address was how to derive insights from data on already engaged users to identify any common behavior patterns that can be leveraged to promote the same level of engagement in new users.

In discussing the problem, they had identified two segments of their long-term engaged user population base that they wanted to understand:

  • “Day One” Users — consistent, regular users from day one
  • “Late Bloomer” Users — sporadic early users, with an increase in engagement at a later date

The behaviors of these two segments can be seen in the following figures.

Usage pattern of example ‘day one’ user — shows consistent, regular usage from day one
Usage pattern of a ‘late bloomer’ user — sporadic usage until sudden increase in engagement at a later date

In these two user segments, “day one” users readily adopt the service, showing regular and consistent usage upon registering on the platform. The second class, “late bloomers”, shows sparse, intermittent usage for many months until a pivotal moment, after which they show a significant increase in engagement.

This observation led to the question of whether a trigger for long-term user engagement can be deduced from actions taken near the moment of increased engagement. If so, it would be more easily deduced from the day one users, since the temporal edge of the engagement period for late bloomers is murky and less well-defined. We therefore focus on studying the early actions of the day one users that potentially led to their continued usage. Such actions may then be actively encouraged in new and existing intermittent users to promote more long-term engagement with the product.

Identifying Day One Users

The company provided access to their full PostgreSQL database, which consists largely of timestamped user activity throughout the platform across their years of operation including transactions between users/pools, banking deposit/withdrawals, friendships, comments, pool creation, and pool connections. The time-dependence of these data points is a critical and challenging component in engineering useful features for this analysis.

Since the users were not labeled, one non-trivial task was isolating the day one users by formulating a set of conditions under which their behavior is appropriately captured.

We classify a user as a day one user if they have had:

  1. At least one transaction within the first month of account activation, and
  2. At least five active months in the first year of usage.

This definition balances requiring a sufficient long-term average of active engagement, in the form of transactions, against the potential contamination of the class by late bloomer users. We do not impose any condition of consecutive months of usage, which provides flexibility and accommodates minimally sporadic early usage. Of the total population of users with at least one transaction (205,590 users), 13,084 satisfy these day one criteria (about 6.4%), revealing a fairly significant class imbalance.
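As a rough sketch of how these two conditions could be applied to timestamped transaction data, here is a minimal pandas implementation. The schema (a `user_id`/`ts` transactions table and a signup-timestamp series) is illustrative, not the company’s actual database layout, and the 30-day/365-day cutoffs are one reasonable reading of “first month” and “first year”.

```python
import pandas as pd

def label_day_one_users(tx: pd.DataFrame, signup: pd.Series) -> pd.Series:
    """Label each user as a 'day one' user per the two conditions above.

    tx: one row per transaction, columns ['user_id', 'ts'] (timestamps).
    signup: account-activation timestamp indexed by user_id.
    Schema and column names are illustrative.
    """
    tx = tx.merge(signup.rename("signup_ts"), left_on="user_id", right_index=True)
    age_days = (tx["ts"] - tx["signup_ts"]).dt.days

    # Condition 1: at least one transaction within the first month (30 days).
    cond1 = tx[age_days < 30].groupby("user_id").size() >= 1

    # Condition 2: at least five distinct active months in the first year.
    first_year = tx[age_days < 365]
    active_months = first_year.groupby("user_id")["ts"].apply(
        lambda s: s.dt.to_period("M").nunique()
    )
    cond2 = active_months >= 5

    return (cond1.reindex(signup.index, fill_value=False)
            & cond2.reindex(signup.index, fill_value=False))
```

Users who never transact simply fall out of both groupbys and are filled in as `False` by the final `reindex`.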

With the class label defined, we can use a supervised learning algorithm to try and discover underlying patterns within this class. A classification algorithm that can adequately discriminate these particular users from the general population will provide meaningful insight into the problem. In particular, we are interested in identifying key actions that may be occurring in the early stages of their engagement to assess if there are any usage patterns common among this cohort.

To this end, and to prevent data leakage in the modeling, we restrict our features to the brief window of one month after a user first signs up on the platform, and use this information to predict the user’s long-term, twelve-month engagement. It should be noted that predicting twelve months of aggregated out-of-sample behavior from only the actions within the first month is an ambitious goal.

As raw data, we have available various actions a user can make within the app including transactional payments or receipts to other users and pools or various social interactions. These time-stamped events can be engineered into interesting basic features of time-aggregated or activity frequency measurements to characterize the user behavior.
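As one way to picture this feature-engineering step, the sketch below aggregates a user’s first 30 days of timestamped events into per-type activity counts and an average inter-event gap. The event names (`pool_payment`, `p2p_payment`, `pool_join`) are hypothetical stand-ins for the real event taxonomy.

```python
import pandas as pd

def first_month_features(events: pd.DataFrame, signup: pd.Series) -> pd.DataFrame:
    """Aggregate each user's first 30 days of events into per-user features.

    events: columns ['user_id', 'ts', 'kind'], where 'kind' is an event type
    such as 'pool_payment' or 'p2p_payment' -- illustrative names only.
    """
    ev = events.merge(signup.rename("signup_ts"), left_on="user_id", right_index=True)
    ev = ev[(ev["ts"] - ev["signup_ts"]).dt.days < 30]

    # Activity-frequency features: count of each event type per user.
    counts = pd.crosstab(ev["user_id"], ev["kind"])

    # Time-aggregated feature: mean gap (seconds) between consecutive events.
    gaps = (ev.sort_values("ts").groupby("user_id")["ts"]
              .apply(lambda s: s.diff().dt.total_seconds().mean())
              .rename("mean_gap_s"))

    return counts.join(gaps).fillna(0)
```

A user with a single in-window event gets a `NaN` gap, which is filled with zero here; in practice that sentinel choice would itself deserve some thought.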

The class definition for the desired cohort uses conditions that appear closely related to the types of features being engineered, which may raise concerns of bias. However, it is important to emphasize that these features correspond only to the first month of a user’s behavior, i.e. 4 weeks, whereas the class definition corresponds to the aggregate behavior over 52 weeks.

Modeling

We have a standard binary classification problem: ‘day one’ user versus ‘non-day one’ user. A number of algorithms were considered, including logistic regression, support vector machines, random forests, and LightGBM. A balanced random forest (RF) classifier with class re-weighting, trained on an 80/20 train/test split, ultimately provided the best results. RF is a relatively simple but robust classifier that can handle non-normally distributed data as well as some of the non-linearities observed in the features.
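A minimal version of this modeling step might look like the following, using scikit-learn’s `class_weight="balanced"` as one way to re-weight the rare day-one class; the hyperparameters are illustrative, not the tuned values from the actual study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def fit_day_one_classifier(X, y, seed=0):
    """Fit a class-re-weighted random forest on an 80/20 train/test split.

    X: first-month feature matrix; y: binary day-one label.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    clf = RandomForestClassifier(
        n_estimators=300,
        class_weight="balanced",   # up-weight the rare day-one class
        random_state=seed,
    ).fit(X_tr, y_tr)
    # Report F1 on the held-out 20%, as quoted in the article.
    return clf, f1_score(y_te, clf.predict(X_te))
```

Stratifying the split keeps the ~6% positive rate consistent between train and test sets, which matters this far into the imbalanced regime.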

Normalized confusion matrix for RF model

The corresponding normalized confusion matrix for this classification is shown above: the diagonal elements convey the algorithm’s ability to correctly classify users into their classes, while the off-diagonal elements show the misclassification rates. The corresponding F1 score for this classifier is 0.343.

The precision-recall (PR) curve shown below is a useful metric for evaluating a binary classifier as the discrimination threshold between classes for the model is varied. The PR curve is particularly useful for evaluating the performance of a classifier with imbalanced classes, compared to the more traditional receiver operating characteristic (ROC) curve, due to its primary focus on the positive minority class. The area under the curve (AUC) for this model is 0.312 compared to the baseline random classifier AUC of 0.064, which indicates that there is indeed a weak signal that can be detected by the model.
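The baseline quoted above follows from a handy property of the PR curve: a no-skill classifier’s PR-AUC equals the positive-class prevalence, which here is 13,084 / 205,590 ≈ 0.064. A small sketch of the evaluation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

def pr_summary(y_true, scores):
    """PR curve, PR-AUC, and the no-skill baseline for a set of scores."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return {
        "curve": (recall, precision),            # points for plotting
        "pr_auc": average_precision_score(y_true, scores),
        "baseline": float(np.mean(y_true)),      # no-skill AUC = prevalence
    }
```

Comparing `pr_auc` (0.312 here) against `baseline` (0.064) is what justifies the claim of a weak but real signal.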

Precision-recall curve for the RF model

The different metric evaluations indicate that there is a weak but discernible pattern in the actions taken within the first month of account sign-up that can predict a user’s twelve-month aggregate engagement. We can examine the feature importances for this classification to see which aspects of a user’s behavior give the model the most information when making its classification decisions.

Feature importances from RF classifier

From these importances, we see that five of the top seven pertain to the user’s frequency and quantity of pool payments made or pools joined. The other two relate to the transactional network connections established within the first month: the average number of users in the pools joined by the user, and the number of unique individuals the user is connected to through these pools. These are all intuitive signals of increasing user engagement on the platform, although it may be difficult to extract actionable insight from them directly. However, we can step back and note that pool payments carry more importance than, say, direct peer-to-peer transactions; additionally, the rate at which they occur (i.e. the average time elapsed between transactions) is of high importance as well. Perhaps we can use these pieces of information to create an ‘Aha!’ moment for the company.
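For completeness, ranking the importances from a fitted forest is a one-liner; the feature names below are illustrative placeholders rather than the study’s actual feature set.

```python
import pandas as pd

def top_importances(clf, feature_names, k=7):
    """Rank features by the forest's impurity-based importance scores.

    clf: a fitted sklearn RandomForestClassifier; names are illustrative.
    """
    imp = pd.Series(clf.feature_importances_, index=feature_names)
    return imp.sort_values(ascending=False).head(k)
```

Impurity-based importances can over-credit high-cardinality features, so permutation importance is a worthwhile cross-check in a study like this.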

The company was in search of a direction in which to guide their ship: a simple ‘North Star’ metric, akin to Facebook’s ‘7 friends in 10 days’. Such a metric is not a mathematically precise value, and it is not meant to be a rigid, uncompromising rule, but rather a beacon toward which the company can direct its efforts: a common goal to promote among its users, encouraging actions that may lead to long-term engagement on the platform. We can derive such a metric by comparing the set of users who satisfy the engaged criteria with the set of users who performed a specific action, and evaluating their overlap, as illustrated in the following Venn diagram. Motivated by the feature importances, we use the number of pool payments within the first month as the action criterion to satisfy.

The result of scanning over N, the number of payments, can be seen in the following bar chart, which plots the overlap percentage (the number of users who satisfy both criteria divided by the union of the two sets) as a function of the minimum number of pool payments made in the first month.

Percent overlap of [engaged users] and [users who have made at least N payments in the first month] as a function of the [number of (minimum) pool payments made in the first month]

We choose the action criterion with the highest overlap percentage: in this case, at least two pool payments within the first month. With only 15% overlap, this is not a powerful indicator for converting a user from unengaged to engaged, but it provides a simple and clear direction on which to focus efforts. The company can now strive to guide new users toward making at least two pool payments within their first month on the platform. With, on average, 10k new registered users monthly, this could grow the long-term user base by around 1,500 users each month who will use the app with some consistency within their first year.
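The scan behind the bar chart is simple set arithmetic: for each candidate threshold N, compute the Jaccard overlap (intersection over union) between the engaged cohort and the users with at least N first-month pool payments. A minimal sketch, with hypothetical inputs:

```python
def overlap_scan(engaged: set, payments_by_user: dict, n_max: int = 10) -> dict:
    """Jaccard overlap of {engaged users} with {users making >= N
    pool payments in their first month}, for N = 1..n_max.

    engaged: user ids satisfying the day-one criteria.
    payments_by_user: user id -> first-month pool-payment count.
    """
    results = {}
    for n in range(1, n_max + 1):
        action = {u for u, c in payments_by_user.items() if c >= n}
        union = engaged | action
        results[n] = len(engaged & action) / len(union) if union else 0.0
    return results

# The North Star threshold is the N maximizing the overlap:
#   best_n = max(results, key=results.get)
```

For the study’s data this scan peaks at N = 2, the ‘at least two pool payments’ metric chosen above.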

There are a number of potential reasons why this action may be correlated with long-term retention. For example, multiple uses within a short time frame let the user repeatedly experience the interface, familiarize themselves with the features, and acclimate to the platform, making it easier to incorporate the service into their lifestyle. Multiple early uses could also indicate multiple social circles on the platform, which lends itself to long-term engagement when there are more parties available with whom to transact. Whatever the causal explanation, which could be studied more deeply with further experiments, analyses, and data collection, the company can work toward incentivizing new users to meet this threshold and, in the long run, hopefully improve long-term retention on their platform.

Assumptions and Limitations

It is encouraging to see that even when restricting information to the limited time frame immediately following the user sign up date, the potential for long-term engagement can be determined. However, this is not without its caveats, and it is important to be mindful of the assumptions and limitations of the study presented.

This study does not provide much insight beyond the twelve-month period, and it does not require users to be presently engaged. That restriction came from an attempt to avoid including late bloomers in the day one sample. However, it would be interesting to extend the analysis to incorporate late bloomers if they can be neatly captured by conditions similar to those defining day one users. Perhaps a ‘rolling window’ study, in one-month intervals, could determine the subsequent year’s behavior as an extension of the analysis described here. Nonetheless, additional studies will be required to prevent churn and improve retention beyond the one-year scope.

“Day one” and “late bloomers” may be a useful heuristic description of two types of observed long-term engaged users, but internal variations in the usage patterns between these cohorts and the wide spectrum of possible user behavior led to the difficult challenge of determining the ‘perfect’ definition for such users in order to perform a supervised learning study. Future studies could potentially use unsupervised clustering techniques to more naturally and precisely identify groups of users that better reflect their innate behavior.

It was assumed that “all social networks are created equal.” A more intricate social network analysis is needed; for example, the length of time a user’s friends/connections have themselves been on the platform when the user joins. Essentially, it would also be interesting to quantify the level of engagement of a particular user’s peers to assess the strength of this effect.

Correlations with app version history were not well-established. While the core functionality of direct peer-to-peer payments and pool payments has existed throughout the lifetime of the product, changes in user interface/experience may also hold insights into usage and discovery of features in the app. A dedicated A/B test and further analysis of the resulting long-term user behavior may help disentangle these effects.

Reflections

This article presents a strategy to determine the long-term aggregate behavior of an app user based on limited short-term information about the actions taken in their first weeks of use. This posed an interesting but challenging task, particularly in the areas of feature engineering and the class definition of an engaged user on the platform. Nonetheless, a binary classifier was built and trained to identify a select cohort of users who were engaged early on, which yielded a set of early-action features that correlate with this engaged class. We used this information to define a ‘magic’ moment for the client company as a guide for further development of their platform as they continue to grow a user base that will hopefully be more engaged in the long term.


