Rocket League is what you get when you mix a racing game with soccer. The idea is simple: you control an RC car and your objective is to shoot the ball into the opponent's net. What makes the game interesting is the addition of boost, which enables the car to soar through the air and provides plenty of opportunity for outplays.
While most of us are quite bad at controlling the car in the air, the pro players have honed their abilities to perfection. They compete in the Rocket League Championship Series (RLCS), in which teams from all around the world try to qualify throughout the year for the main event: the World Championship.
Rules and structure aside (see here if you are interested), something caught my attention while following the tournament for the past few years. Every now and then, teams make changes to their rosters, especially at the turn of RLCS seasons, which leaves everyone trying to predict which teams will come out on top once all the movement settles. This gave me an idea: I will build an Elo system to rate the teams' performance, and then train a model to predict the new ratings after all the roster changes. This is the first of a series of 3 blog posts following my journey through database handling, Elo rating system creation and performance modelling. The code is freely available here.
So that this post can stand on its own, I will perform an ETL process from https://zsr.octane.gg into a local database. The first step is to read the documentation and decide which data we want to extract.
The API has 8 major sections, 3 of which hold calls related to records and stats, data that is already included in the other 5. The main database diagram should look a little like this:
A single line at the end of a connection indicates a "one" relationship, while the three-pronged end indicates a "many" relationship. Each team consists of 3 players (plus a coach and substitutes), and many teams can participate in many events. Each event is made up of several matches, each of which is played between 2 opposing teams. Finally, each match consists of at least one game (depending on the format) between the same 2 teams.
I plan to implement the Elo score at the game level, and therefore need to know which specific players composed each team in a given game. Unfortunately, the team method does not include the history of the different players that have been part of the roster; it only includes those active at the time of the call. We will have to get this data directly from the "Games" call and normalize it into a separate table. There are several other data groups that, due to the JSON formatting, will need to be split into different tables to fit the normalized relational model.
As can be seen in the diagram, I decided to also save the match data by player, as this was conveniently already aggregated for us to use. Each event can be further subdivided into stages (e.g. qualifier, bracket, finals) and thus required an extra table. Finally, each player can have several accounts to play online (depending on whether they play on Steam, Epic Games or both). The full structure and variables of the tables can be seen in this file, which creates them in a PostgreSQL schema called rocket_league. I added the columns "created_at" and "updated_at" to keep control of the ETL process, as they are not tracked within the API.
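To give an idea of the shape of these tables, here is a minimal sketch of what the teams table could look like. The column names (apart from created_at and updated_at), the data types and the connection string are illustrative; the real definitions live in the linked file.

```python
# Sketch: creating the schema and one illustrative table with psycopg2.
# Column names other than created_at/updated_at are placeholders.
import psycopg2

DDL = """
CREATE SCHEMA IF NOT EXISTS rocket_league;

CREATE TABLE IF NOT EXISTS rocket_league.teams (
    id          TEXT PRIMARY KEY,           -- team id as returned by the API
    name        TEXT NOT NULL,
    region      TEXT,
    created_at  TIMESTAMP DEFAULT now(),    -- ETL bookkeeping, not in the API
    updated_at  TIMESTAMP DEFAULT now()
);
"""

conn = psycopg2.connect("dbname=rocket_league user=postgres")
with conn, conn.cursor() as cur:  # the with-block commits the transaction
    cur.execute(DDL)
conn.close()
```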
As this was a small enough ETL process, it could all fit neatly into a single Python script without having to separate API calls, transformations and loading. All calls were done with the requests library.
I built a general function called load_pages() which controls the calls made to the API. It serves two main purposes:
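To illustrate what such a function could look like, here is a minimal sketch of a paginated loader. The page/perPage parameters and the assumption that each response lists its records under a key named after the resource come from my reading of the ZSR docs, so double-check them against the real payload.

```python
# Minimal sketch of a paginated loader for the ZSR API.
# Parameter names and response layout are assumptions; adjust to the real docs.
import requests

BASE_URL = "https://zsr.octane.gg"

def load_pages(resource, params=None, per_page=500):
    """Yield every record of `resource` (e.g. 'games'), page by page."""
    params = dict(params or {}, perPage=per_page)
    page = 1
    while True:
        params["page"] = page
        response = requests.get(f"{BASE_URL}/{resource}", params=params, timeout=30)
        response.raise_for_status()
        records = response.json().get(resource, [])
        if not records:
            break  # no more pages left
        yield from records
        page += 1

# Example usage (the "group" filter name is also an assumption):
# games = list(load_pages("games", params={"group": "rlcs2122"}))
```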
The responses were encoded in JSON, so we need to do something about that. I created this file to store the simplified names of each element of the call. This, along with the format_df() function, will help me rename the columns to a nice snake case and prune the JSON tree of undesired elements.
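As a rough idea of what format_df() could do, here is a sketch built on pandas' json_normalize(); the rename mapping and the kept columns stand in for the ones stored in the names file.

```python
# Sketch of a formatting helper: flatten nested JSON, snake_case the columns,
# apply the simplified names and drop everything else.
import re
import pandas as pd

def to_snake_case(name):
    """Turn 'event.startDate' style keys into 'event_start_date' columns."""
    name = name.replace(".", "_")
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def format_df(records, rename_map=None, keep=None):
    """Flatten a list of JSON records, rename columns and prune the rest."""
    df = pd.json_normalize(records)                  # flatten the nested JSON
    df.columns = [to_snake_case(c) for c in df.columns]
    if rename_map:                                   # simplified names from the mapping file
        df = df.rename(columns=rename_map)
    if keep:                                         # prune undesired elements
        df = df[[c for c in keep if c in df.columns]]
    return df
```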
Finally, each API call has its own handling function:
- load_teams(): formats variable names and loads to the database.
- load_players(): formats variable names, extracts the players' accounts and loads to the database.
- load_events(): formats variable names, extracts the event's stages and loads to the database.
- load_matches(): formats variable names, extracts the match's players (each match has 2 teams, so they must be separated) and loads to the database.
- load_games(): formats variable names, extracts the game's players (each game has 2 teams, so they must be separated, as sketched below) and loads to the database.
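To make that splitting step concrete, here is a hedged sketch of how the per-game player rows could be extracted. The blue/orange keys and the nested player/stats fields reflect how I read the ZSR games payload, so treat the exact paths as assumptions.

```python
# Hypothetical extraction of one row per player per game, keeping the team colour.
# Field names like "_id", "team" and "stats.core.score" are assumptions.
import pandas as pd

def extract_game_players(game):
    """Return a data frame with one row per player for a single game."""
    rows = []
    for colour in ("blue", "orange"):
        side = game.get(colour, {})
        for entry in side.get("players", []):
            rows.append({
                "game_id": game.get("_id"),
                "team_id": side.get("team", {}).get("team", {}).get("_id"),
                "player_id": entry.get("player", {}).get("_id"),
                "colour": colour,
                "score": entry.get("stats", {}).get("core", {}).get("score"),
            })
    return pd.DataFrame(rows)
```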
As an extra precaution, the get_new_records() function scans the database for existing records and compares them to the API calls. It returns a new data frame with only new records to prevent id duplication, especially for the "teams" and "players" tables as they can't be filtered.
This script can then be scheduled to run periodically to keep the database up to date! All the necessary files to recreate this post can be found here 🙂.