Overview#

Our objective was to develop a user recommendation system that fosters network growth by connecting users with similar interests. To achieve this, we designed a system that uses each user's on-chain data. It combines four distinct recommendation algorithms, as illustrated in the architectural overview below:

Top trending users
Friends of friends
Tag similarity
Post similarity

System#

This section describes the overall architecture of the recommender system, including the flow of data and the different recommendation algorithms utilized.

Recommendation system logic:

If the user is new (< 1 week, < 3 days): appends trending users
If the user is not active: appends trending users
If the user is active: appends friends-of-friends
If the user has a tag: appends tag similarity
If the user has posted: appends post similarity
If the user was inactive for some period: appends trending users
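
The rules above can be sketched as a small dispatch function. This is a minimal illustration, not the project's implementation: the `UserState` fields, the source names and the `new_user_days` parameter are all hypothetical, and because the text gives two thresholds for a new user (< 1 week, < 3 days), the threshold is left configurable.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserState:
    """Hypothetical summary of a user's on-chain activity."""
    account_age_days: int
    is_active: bool
    has_tags: bool
    has_posted: bool
    was_recently_inactive: bool = False


def select_sources(user: UserState, new_user_days: int = 7) -> List[str]:
    """Return the recommendation sources to append, in rule order."""
    sources: List[str] = []

    def add(name: str) -> None:
        # Several rules map to trending users; append each source only once.
        if name not in sources:
            sources.append(name)

    if user.account_age_days < new_user_days:
        add("trending_users")
    if not user.is_active:
        add("trending_users")
    if user.is_active:
        add("friends_of_friends")
    if user.has_tags:
        add("tag_similarity")
    if user.has_posted:
        add("post_similarity")
    if user.was_recently_inactive:
        add("trending_users")
    return sources
```

A brand-new, inactive account with no tags or posts would only receive trending users, while an established active user with tags and posts would receive the three personalized sources.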

Figure: Recommender System Architectural Overview

Commands#

This section provides details about the functions the models use for their recommendations.

Along with the main recommender functions, we expose commands that can be scheduled to run recurring updates.

get_recommendations_per_user()

See near_recommender/src/process_recommendations.py for an example of how to run the main logic, which pushes an updated JSON object for each user to an S3 bucket.
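
As a rough sketch of the "one JSON object per user" step, the snippet below serializes each user's recommendations and hands them to an uploader. The function name, the key layout and the JSON shape are assumptions made for illustration; the actual format is defined in process_recommendations.py. The uploader is injected so the example runs without AWS credentials.

```python
import json
from typing import Callable, Dict, List


def publish_recommendations(
    recommendations: Dict[str, List[str]],
    put_object: Callable[[str, bytes], None],
    prefix: str = "recommendations",  # hypothetical key prefix
) -> List[str]:
    """Serialize one JSON object per user and pass it to `put_object`."""
    keys = []
    for user, recommended in sorted(recommendations.items()):
        key = f"{prefix}/{user}.json"
        body = json.dumps(
            {"user": user, "recommendations": recommended}
        ).encode("utf-8")
        put_object(key, body)
        keys.append(key)
    return keys


# Usage with an in-memory stand-in for the bucket.
bucket: Dict[str, bytes] = {}
written = publish_recommendations({"alice.near": ["bob.near"]}, bucket.__setitem__)
```

With boto3, `put_object` could be a small wrapper around the S3 client's `put_object(Bucket=..., Key=key, Body=body)` call.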

update_corpus()

The update_corpus command updates the locally stored Large Language Model, whose location is defined by the path variable in the similar_posts module.

Each update to the pooled word embeddings model increments its version number.

The load_pretrained_model function, which is used in similar_posts, loads the latest version.

Machine Learning Model Management

Feel free to optimize this with your own version management for machine learning models.
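
One simple way to manage versions is to encode the version number in the file name and always load the highest one. The sketch below assumes a hypothetical `model_v<N>.bin` naming scheme; it does not reflect how load_pretrained_model actually resolves versions.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: model_v<N>.bin inside a model directory.
VERSION_PATTERN = re.compile(r"^model_v(\d+)\.bin$")


def latest_version(model_dir: Path) -> Optional[int]:
    """Return the highest version number found, or None if there is none."""
    versions = []
    for path in model_dir.iterdir():
        match = VERSION_PATTERN.match(path.name)
        if match:
            versions.append(int(match.group(1)))
    return max(versions) if versions else None


def next_model_path(model_dir: Path) -> Path:
    """Return the path a newly trained model should be saved to."""
    current = latest_version(model_dir)
    return model_dir / f"model_v{(current or 0) + 1}.bin"
```

Comparing versions as integers rather than as strings keeps `model_v10.bin` ahead of `model_v9.bin`.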


Models#

This section provides details about the machine learning models and algorithms used in the recommender system.

Friends-of-Friends#

The friends-of-friends algorithm appends users who are friends-of-friends to the recommendation list. This model is applied when the user is active.

near_recommender.src.models.friends_friends.get_friends_of_friends(spark_df_path)#

Reads a CSV file as a Spark DataFrame and trains an XGBoost model to predict user connections.

Parameters:

spark_df_path (str) – The path to the CSV file containing the input data for the Spark DataFrame.

Returns:

A dictionary containing the predicted users as a NumPy array.

Return type:

Dict
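
Independently of the XGBoost model that the function trains, the underlying friends-of-friends idea can be illustrated as a two-hop traversal of the follow graph. The sketch below is a plain-Python illustration with a hypothetical function name and edge format; it is not the code in friends_friends.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def friends_of_friends(
    edges: Iterable[Tuple[str, str]], user: str, top_k: int = 5
) -> List[str]:
    """Rank accounts followed by the accounts that `user` follows."""
    following: Dict[str, Set[str]] = defaultdict(set)
    for follower, followed in edges:
        following[follower].add(followed)

    direct = following.get(user, set())
    counts: Counter = Counter()
    for friend in direct:
        for candidate in following.get(friend, set()):
            # Skip the user themselves and accounts they already follow.
            if candidate != user and candidate not in direct:
                counts[candidate] += 1

    # Most shared connections first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_k]]
```

Candidates reached through more mutual connections rank higher, which is a common heuristic for this kind of recommendation.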

Similar Profile Tags#

The similar profile tags algorithm appends users who have similar profile tags to the recommendation list. This algorithm is applied when the user has specified tags.

near_recommender.src.models.similar_tags.get_similar_tags_users(user, top_k=5)#

Returns the top-k users whose tags are most similar to those of the specified user.

Parameters:

user (str) – The name of the user for whom similar users are to be found.
top_k (int, optional) – The number of similar users to be returned. Defaults to 5.

Returns:

A dictionary containing the top-k similar users and their similarity scores.

Return type:

Dict[str, List[Dict[str, str]]]

Raises:

ValueError – If the input dataframe is empty or contains NaN values.
TypeError – If the input top_k value is not an integer.
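
The documentation does not state which similarity measure is used. As an illustration only, the sketch below ranks users by Jaccard similarity between tag sets and returns a dictionary in the documented shape. The function name, the `tags_by_user` input and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Set


def similar_tag_users(
    user: str, tags_by_user: Dict[str, Set[str]], top_k: int = 5
) -> Dict[str, List[Dict[str, str]]]:
    """Rank other users by Jaccard similarity of their tag sets."""
    if not isinstance(top_k, int):
        raise TypeError("top_k must be an integer")
    if not tags_by_user:
        raise ValueError("tags_by_user is empty")

    own = tags_by_user[user]
    scored = []
    for other, tags in tags_by_user.items():
        if other == user:
            continue
        union = own | tags
        # Jaccard similarity: shared tags divided by all distinct tags.
        score = len(own & tags) / len(union) if union else 0.0
        scored.append((other, score))

    scored.sort(key=lambda item: (-item[1], item[0]))
    return {
        "similar_tags": [
            {"similar_profile": name, "score": f"{score:.2f}"}
            for name, score in scored[:top_k]
        ]
    }
```

Scores are formatted as strings here only to match the documented return type of Dict[str, List[Dict[str, str]]].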


Data#

This section provides details about the queries used to feed the models.

SQL queries#

This query removes duplicates from the hive_metastore.mainnet.silver_near_social_txs_parsed table.

The query utilizes two Common Table Expressions (CTEs) to handle transactions. All transactions are aggregated, with a new column tx_count counting the number of times each transaction appears. Then, using this column, transactions are filtered:

duplicates: This CTE captures all transactions that are duplicated (tx_count > 1).
unique_txs: This CTE contains only the unique transactions (tx_count = 1).

By merging these two CTEs, we create a new table without any duplicates. The resulting table is saved as hive_metastore.sit.near_social_txs_clean.
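
The pattern described above can be tried out on a toy table. The sketch below runs it in an in-memory SQLite database from Python rather than on the Hive metastore; the table name `txs`, its two columns and the helper CTE `counted` are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txs (tx_hash TEXT, signer_id TEXT)")
conn.executemany(
    "INSERT INTO txs VALUES (?, ?)",
    [("h1", "alice.near"), ("h1", "alice.near"), ("h2", "bob.near")],
)

DEDUP_QUERY = """
WITH counted AS (
    -- Aggregate identical rows and count how often each one appears.
    SELECT tx_hash, signer_id, COUNT(*) AS tx_count
    FROM txs
    GROUP BY tx_hash, signer_id
),
duplicates AS (
    SELECT tx_hash, signer_id FROM counted WHERE tx_count > 1
),
unique_txs AS (
    SELECT tx_hash, signer_id FROM counted WHERE tx_count = 1
)
SELECT tx_hash, signer_id FROM duplicates
UNION ALL
SELECT tx_hash, signer_id FROM unique_txs
ORDER BY tx_hash
"""

clean_rows = conn.execute(DEDUP_QUERY).fetchall()
```

Because the aggregation already collapses repeated rows into one, the union of both CTEs contains each transaction exactly once.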

This query is used to create an edge table that represents the social network in a graph format.

Each row of this table represents a follow transaction with:

signer_id
user followed
type (FOLLOW or UNFOLLOW)
date

The first step is to parse the graph:follow argument from the hive_metastore.mainnet.silver_near_social_txs_parsed table and keep only non-null, non-empty, non-failed transactions, as expressed by the WHERE clause of the first CTE.

Then, two different CTEs extract the followed user name in two different scenarios:

single_user_follows: extracts the substring that contains the user name. In this case, each transaction represents following a single user.
batch_user_follows: uses the same logic to extract the user name, but first applies an explode to handle multiple follows in a single transaction. These transactions were made possible by a batch-following widget.

By merging the two CTEs above, we obtain a table with the edges of the graph representing the following connections in the social network. This table is saved as hive_metastore.sit.graph_follows.
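
To make the single versus batch distinction concrete, the sketch below turns one parsed follow argument into edge rows, one per followed account, which is what the explode step achieves in SQL. The function name and input shape are hypothetical, and treating a null value as an UNFOLLOW is an assumption that should be checked against the actual query.

```python
from typing import Dict, Iterator, Optional, Tuple

# One edge row: signer_id, user followed, type, date.
Edge = Tuple[str, str, str, str]


def follow_edges(
    signer_id: str, follow_arg: Dict[str, Optional[str]], date: str
) -> Iterator[Edge]:
    """Yield one edge per account named in a parsed follow argument."""
    for followed, value in follow_arg.items():
        # Assumption: a null value marks an unfollow, anything else a follow.
        edge_type = "UNFOLLOW" if value is None else "FOLLOW"
        yield (signer_id, followed, edge_type, date)


# A batch transaction touching two accounts produces two edge rows.
rows = list(
    follow_edges("alice.near", {"bob.near": "", "carol.near": None}, "2023-05-01")
)
```

A single-user follow is simply the case where the argument contains one account, so both scenarios end up with the same row layout.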

This query creates a table of user metrics and calculates a trending metric from them.

Metrics can be divided into 3 categories:

Metrics aggregated directly from the hive_metastore.sit.near_social_txs_clean table. These include 10 all-time metrics and 4 metrics for the past 30 days. All of them are calculated in the CTEs metrics_raw and last_month_activity.
Metrics that need additional CTEs for their calculation:
- Likes are calculated by parsing the index:like argument from the hive_metastore.mainnet.silver_near_social_txs_parsed table and aggregating separately by signer_id and by likee. Both metrics are also calculated for the past 30 days.
- Follows are calculated with the same structure as the graph table query (see the section above). Total followers and total following for each user are calculated with separate methods, each tested empirically against data from the social network. There is a slight mismatch, thought to be caused by missing transactions, but the final result is accurate enough (a maximum error of 3% on one user among the top 10 most-followed accounts).
- Comments for the past 30 days are calculated by parsing the post:comment:item argument from the hive_metastore.mainnet.silver_near_social_txs_parsed table.
Engagement, activity and trending metrics. These are calculated from the previous categories, with the following definitions:
- Engagement is the sum of comments, likers and followers in the past 30 days. A weighted variant uses weights of 1, 0.5 and 0.1 respectively.
- Activity is the sum of posts, likes and followings in the past 30 days. A weighted variant uses weights of 1, 0.5 and 0.1 respectively.
- The trending metric is the ratio of weighted engagement to weighted activity. It is intended to select the users who create the most appealing content for the network.
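
The three definitions above translate directly into a few lines of arithmetic. The sketch below uses the stated weights (1, 0.5 and 0.1); the function name is hypothetical, and returning 0.0 for a user with no activity is an assumption, since the text does not say how a zero denominator is handled.

```python
def trending_score(
    comments: int, likers: int, followers: int,
    posts: int, likes: int, followings: int,
) -> float:
    """Ratio of weighted engagement to weighted activity (past 30 days)."""
    weighted_engagement = 1.0 * comments + 0.5 * likers + 0.1 * followers
    weighted_activity = 1.0 * posts + 0.5 * likes + 0.1 * followings
    if weighted_activity == 0:
        # Assumption: users with no activity get a score of 0.
        return 0.0
    return weighted_engagement / weighted_activity
```

For example, a user with 10 comments, 20 likers and 30 new followers has a weighted engagement of 23. With 5 posts, 10 likes and 10 followings, the weighted activity is 11, so the trending score is 23 / 11, or roughly 2.09.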

