How we’ve reworked our listings at OLX
When we started as the Personalisation and Relevance team at OLX, we faced a big challenge: migrating all the endpoints related to listings (Home Feed, Search and Recommendations) from a monolithic PHP application to a microservices architecture (mainly Java), so that we could take ownership of our listings and start improving them. More than 2 years later, we’ve released our platform in many countries, where it handles ~800K requests per minute. This is what we did.
Analysis
First, we talked with all the stakeholders to understand what we had and what we needed. In summary, the solution should be:
- Easy to release new algorithms to retrieve ads from any datasource.
- Easy to A/B test those algorithms from the backend side.
- Fault tolerant, so that no algorithm is critical and any of them can easily be replaced by another.
We read papers published by other companies (such as Pinterest, eBay and Amazon) to understand what the big players were doing, discussed a lot and, after many drafts, arrived at our own solution that fits our needs.
Overview
The platform is mainly composed of three parts:
- Indexing: have our own search engine where ads are indexed.
- Retrieval: code the algorithms to get and rank the ads from a datasource.
- Blending: combine the retrieval algorithms in different ways.
Indexing
One service is in charge of consuming the events that the posting flow of the core publishes to Kafka every time a new ad is posted on the platform, and of indexing those ads into Solr with the data we need to perform the searches.
Retrieval
Another service is where we code the retrieval algorithms (mainly Solr queries), which retrieve and rank the ads based on specific criteria. All of them have the same API, receiving specific parameters and returning a ranked list of ads (only the ids and the scores).
We named these algorithms spells. Basically, a spell is a piece of code that does the work needed to retrieve ads from a datasource and rank them using specific criteria, which are what identify the spell. So, different spells rank the ads in different ways (and could use a different datasource as well).
Even though we have only one service for the retrieval part, it could be split into several services, for example if multiple datasources are used or if the ranking part is too complex or too different to be placed in the same component. In fact, we could have spells implemented in different programming languages, as long as they expose the same API.
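To make the shared spell API more concrete, here is a minimal sketch in Python (the platform itself is mainly Java). The names `ScoredAd`, `Spell` and `newest_first` are hypothetical, and the in-memory list only stands in for a real datasource such as a Solr query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ScoredAd:
    """What every spell returns per ad: only the id and the score."""
    ad_id: str
    score: float


# Hypothetical shared contract: request parameters in, ranked ads out.
Spell = Callable[[Dict[str, str]], List[ScoredAd]]


def newest_first(params: Dict[str, str]) -> List[ScoredAd]:
    """Toy spell that ranks ads by posting time, newest first."""
    # Assumption: (ad id, posting timestamp) pairs stand in for a datasource.
    ads = [("a1", 100), ("a2", 300), ("a3", 200)]
    limit = int(params.get("limit", "10"))
    ranked = sorted(ads, key=lambda ad: ad[1], reverse=True)
    return [ScoredAd(ad_id, float(posted_at)) for ad_id, posted_at in ranked[:limit]]
```

Any other spell would keep the same signature and change only the datasource and the ranking criteria, which is what lets the caller treat all spells alike.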
Blending
A third service is needed to combine all the results returned by the spells into one single result-set, which is the final listing returned to the user.
Layout
The first component of the blending part provides a way to configure which spells must be combined and how their results are presented to the user. The first and simplest approach is to have fixed positions (or slots) and define which spell fills each one.
We started with 3 different spells: the first one returns featured ads, which users pay to boost and which are shown at the top positions of each page; the second finds recommendations based on the recent activity of the user; and the last one retrieves items that were posted in locations close to the user’s position.
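A fixed-slot layout can be sketched as a list of spell names, one per position. The function below is an illustration in Python, not the actual implementation: the name `fill_slots`, the spell names and the choice to skip ads already placed by another spell are all assumptions.

```python
from typing import Dict, List


def fill_slots(layout: List[str], results: Dict[str, List[str]]) -> List[str]:
    """Fill each slot with the next unused ad id of the spell assigned to it."""
    cursors = {name: 0 for name in results}
    seen = set()
    page = []
    for spell_name in layout:
        ranked = results.get(spell_name, [])
        i = cursors.get(spell_name, 0)
        # Assumption: an ad already placed by another spell is skipped.
        while i < len(ranked) and ranked[i] in seen:
            i += 1
        if i < len(ranked):
            page.append(ranked[i])
            seen.add(ranked[i])
            i += 1
        cursors[spell_name] = i
    return page


# Hypothetical layout: two featured slots on top, then a mix of the others.
layout = ["featured", "featured", "recommended", "nearby", "recommended"]
results = {
    "featured": ["f1", "f2"],
    "recommended": ["r1", "f1", "r2"],
    "nearby": ["r1", "n1"],
}
page = fill_slots(layout, results)
```

In this sketch a slot whose spell has run out of results is simply left out of the page.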
Blender
Then comes the blender itself, which gets from the layout which spells have to be executed, pulls the results of each one in parallel, and combines them based on the configuration defined in the layout. The blender doesn’t care about what each spell actually does, so all the parameters received from the client are forwarded to every spell in the same way when the pulling is performed.
Fault tolerance is required in this process: some spells could fail or simply return no results for a specific request, and the blender needs to handle these cases. One thing the blender does is use a fallback, which is basically another spell (usually simpler and more reliable) from which it can get results when the original one fails or returns nothing. Another is using circuit breakers, so that when a spell is failing continuously, it is removed from the blending until it recovers.
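The parallel pull could look roughly like the following Python sketch, which assumes each spell is a plain callable and uses a thread pool. The function name `pull_all` and the example spells are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Spell = Callable[[Dict[str, str]], List[str]]


def pull_all(spells: Dict[str, Spell], params: Dict[str, str]) -> Dict[str, List[str]]:
    """Run every spell in parallel, forwarding the same client parameters."""
    with ThreadPoolExecutor(max_workers=max(1, len(spells))) as pool:
        futures = {name: pool.submit(spell, params) for name, spell in spells.items()}
        # result() blocks until each spell finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical spells that just echo the forwarded query parameter.
spells = {
    "recommended": lambda p: [p["q"] + "-rec"],
    "nearby": lambda p: [p["q"] + "-near"],
}
pulled = pull_all(spells, {"q": "bike"})
```

Failures are deliberately not handled here: an exception raised by any spell propagates to the caller.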
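The two mechanisms can be illustrated with a small Python sketch. The thresholds, the class and function names, and the rule that a spell gets a trial call once the reset period has passed are assumptions, not details of the real service.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a call through to check for recovery.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def run_with_fallback(spell, fallback, breaker, params):
    """Use the spell unless its breaker is open, it fails, or it returns nothing."""
    if breaker.allow():
        try:
            results = spell(params)
            breaker.record_success()
            if results:
                return results
        except Exception:
            breaker.record_failure()
    return fallback(params)
```

In this sketch an empty result triggers the fallback but does not count as a failure for the breaker, since the spell itself answered correctly.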
User Pool
After the blender generates the final result-set, we save those results for the user, so pagination is done on this structure instead of doing the retrieval and blending every time the user requests a new page. Basically, we generate the full listing (all the pages the user could ask for) when the first page is requested, and for the following pages we get the results directly from the user pool, until the client requests the first page again or the user pool expires (typically after 1 hour).
Behind the scenes, user pools are Sorted Sets in Redis, where we only put the ad ids and scores. Since this is held in memory and we have one pool for each user, setting an expiration is critical so memory is not wasted.
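A user pool along these lines could be sketched in Python as follows. The `client` argument is assumed to behave like a redis-py client (`delete`, `zadd`, `expire`, `zrevrange`); the key format and the function names are hypothetical.

```python
POOL_TTL_SECONDS = 3600  # pools typically expire after 1 hour


def pool_key(user_id):
    # Hypothetical key format: one sorted set per user.
    return f"user_pool:{user_id}"


def save_pool(client, user_id, scored_ads):
    """Replace the user's pool with the freshly blended (ad_id, score) pairs."""
    key = pool_key(user_id)
    client.delete(key)
    if scored_ads:
        client.zadd(key, dict(scored_ads))  # mapping of member -> score
    client.expire(key, POOL_TTL_SECONDS)


def get_page(client, user_id, page, page_size):
    """Read one page of ad ids, highest score first, without re-blending."""
    start = page * page_size
    # ZREVRANGE bounds are inclusive on both ends.
    return client.zrevrange(pool_key(user_id), start, start + page_size - 1)
```

With a real redis-py client the returned members are bytes unless the client was created with `decode_responses=True`. Storing the blended position as the score would be one way to preserve the layout order, though the article does not say how scores are assigned.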
Hydration
Finally, since we have been using only the ad ids up to this point, we need to get the full data of each ad from another service and return it to the frontends (PWA, Android and iOS apps) so they can do the presentation part. These ads are persisted in a MySQL database by the core, so a cache layer (Redis) was added in order to have faster access to this data.
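One common way to put a cache in front of a database is the cache-aside pattern, sketched below in Python. Whether the real hydration service works exactly this way is an assumption; the dict stands in for Redis, and `load_from_db` is a hypothetical function standing in for the MySQL lookup.

```python
def hydrate(ad_ids, cache, load_from_db):
    """Return the full ads for the given ids, in order, reading the cache first."""
    ads = {ad_id: cache[ad_id] for ad_id in ad_ids if ad_id in cache}
    missing = [ad_id for ad_id in ad_ids if ad_id not in ads]
    if missing:
        # Only the cache misses reach the database; results are cached for next time.
        loaded = load_from_db(missing)
        cache.update(loaded)
        ads.update(loaded)
    # Ads that no longer exist in the database are dropped from the listing.
    return [ads[ad_id] for ad_id in ad_ids if ad_id in ads]
```

A real cache would also need an expiration or invalidation policy so that edited or removed ads do not stay stale, which this sketch leaves out.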
Embedded A/B Testing tool
From the beginning, we designed the platform to easily deliver and measure new features for our users. That’s why our blender component is fully A/B testable, so we can try different spells or blending strategies without changing the API, which keeps it transparent to our frontends.
We split users based on their ids and assign each group a layout, then measure which configuration performs better.
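A deterministic split by user id can be sketched by hashing the id into a bucket. The hashing scheme, the experiment name and the function name `assign_layout` below are assumptions made for illustration in Python; the article only states that the split is based on user ids.

```python
import hashlib


def assign_layout(user_id, experiment, variants):
    """Map a user to one layout variant, always the same one for that user."""
    # Including the experiment name keeps splits independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Hypothetical experiment comparing two layouts.
variants = ["layout_a", "layout_b"]
chosen = assign_layout("user-42", "home_feed_test", variants)
```

Because the assignment depends only on the user id and the experiment name, no per-user state has to be stored and every service instance reaches the same decision.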
We have tested and combined more than 30 spells in our listings. In fact, one of the most successful in terms of conversion was developed by another team for a different platform and market, but was easily integrated into our platform thanks to this architecture.
Conclusion
We are still adding new spells and improving the blending strategies, but we have built an architecture that allows us to validate hypotheses and deliver value to our users really fast, working together with people around the globe to accomplish the same objective: show the users what they want to see.