How we Brought Down Storage for the Ads Platform (Part 2)
In the earlier blog, we explored different solutions we went through to solve the following problem:
our code needs to compare the number of times a user has already seen an ad to the configured frequency in real-time and for that, we need storage of how many times a user has seen an ad. On the face of it, this problem may look simple enough but when we are solving at the scale of 16 crore MAU, it gets challenging
In this part, we’ll explore two major solutions, one of which we have implemented to address the problem.
Specifications our end solution needs to address
- Low Latency is of prime importance because we need very fast reads
- A minimum of 500kops should be supported by the system
- A minimum of 10k active ads at any given time
- Close to 100% accuracy required, given that 100% accuracy is not a compulsion
- Storage costs should be as minimum as possible.
Solution 1: Using count-min sketch
The count-min sketch is a probabilistic data structure whose purpose is to store frequency information for a stream of data. Below is how it works:
- You choose x hash functions that can give you a value between 0 to y, and then you have a table created with x rows and (y+1) columns. Eg: if you choose 4 hash functions that can give a value between 0 to 6, you’ll create a table with 4 rows and 7 columns.
The input data into the system is passed through all the hash functions with each function returning a numeric value. The corresponding cell value of the hash function row(x) and the numeric value column(y) is then incremented by 1. Eg: Let’s say you get an input “A” into the system and you pass it through all the 4 hash functions, with the outputs as 0, 4, 3, 5 respectively. Below is how the table will look now:
The same process is repeated for each input that our system receives and after a few inputs, our count-min sketch could look like below:
Now let’s say we want to know how many times our system has seen the value “A” from the table above, we would pass “A” through our 4 hash functions again which would give the same values of 0, 4, 3, 5 respectively. We then pick up the corresponding cell values and the minimum of the values is the output. Eg: here our values would be 4, 1, 1, 12. This would mean our system has seen “A” 1 time.
- We take the minimum value to decrease the impact of collisions as the same cell value could be incremented by some other inputs into the system.
- To decrease the impact of collisions further, we could increase the length and breadth of the table.
- As you can see, this is a probabilistic data structure because it gives you an estimate of the frequency of the input while using a minimal amount of space to store large sets of data.
Now that we know how count-min sketch works, the next part was the research and implementation. We decided to use Redis as it offers the count-min sketch data structure as a part of the RedisBloom module. Our plan was to use a new count-min sketch for each ad campaign in order to store the frequency information of the users who’ve seen the campaign.
Our findings after experimenting with live data:
- It works perfectly for small use-cases where the stream of input data is small and the number of configured campaigns is also small.
- At ShareChat’s scale, the count-min sketch doesn’t use as little space as we thought where there could be 1000s of active campaigns at a given time and over 16 crore MAU. Eg: if a campaign is shown to 1 crore unique users with an error rate of 5.5%, we would need a sketch of (50 Lakh * 20) size, which ends up using 8GB of space. That’s 8GB for each campaign at a considerably high error rate. It would have used much more space for a lower error rate.
- It is not possible to increase the width and height of the count-min sketch at runtime after creation. This became a problem because the frequency caps configured inside an ad campaign can change anytime.
Along with the issues encountered, the initial requirement also changed to allow configuring frequency caps per user per x minutes/x hours/xdays/x weeks/x months per campaign. Eg: we should now be able to have a campaign that needs to be shown to a user once every 5 minutes, twice every 2 hours, 5 times a day, 10 times every 2 weeks, 20 times every 4 months.
This new requirement could definitely not be satisfied by the count-min sketch because this meant storing actual historical data of the timestamps when a user has viewed a particular campaign.
Solution 2: Using Cloud Bigtable to store the historical data of a user’s views on a campaign
Cloud Bigtable is a managed, NoSQL database from GCP for analytical data. It can scale seamlessly while giving sub-10ms latencies.
Concepts of Bigtable schema design:
Bigtable is a key/value store, not a relational store. It does not support joins, and transactions are supported only within a single row.
Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
Rows are sorted lexicographically by row key, from the lowest to the highest byte string. Row keys are sorted in big-endian byte order (sometimes called network byte order), the binary equivalent of alphabetical order.
Column families are not stored in any specific order.
Columns are grouped by column family and sorted in lexicographic order within the column family.
The intersection of a row and column can contain multiple timestamped cells. Each cell contains a unique, timestamped version of the data for that row and column.
All operations are atomic at the row level. This means that an operation affects either an entire row or none of the row.
Ideally, both reads and writes should be distributed evenly across the row space of a table.
Bigtable tables are sparse. A column doesn’t take up any space in a row that doesn’t use the column.
Now that we understand the basic concepts of Bigtable, let’s visualize how we are storing the data in Bigtable with a sample data flow by assuming we have campaigns “A”, “B”, “C”, “D” and “E” in our system along with 3 users “Joe”, “John”, and “James”.
The following is the flow of view events into our system:
- timestamp 1(ts1) → Joe views campaign A
- timestamp 2(ts2) → John views campaign E
- timestamp 3(ts3) → Joe views campaign A
- timestamp 4(ts4) → John views campaign C
- timestamp 5(ts5) → James views campaign A
- timestamp 6(ts6) → James views campaign B
- timestamp 7(ts7) → James views campaign B
This is how the Bigtable data will look at the end of the flow above:
Assuming we have this data when the system gets a request to check the frequency caps for a campaign-user combination, we have to pick up the particular cell value and compare it with the configured frequency caps. We’ll have the result of whether a user satisfies the configured frequency caps or not.
Pros of using this solution:
- Bigtable gives 5–10ms latencies while fetching data per user per campaign, which is quite good.
- The total storage used by this solution is approximately 1TB and has been live for around 3 months now. This is better compared to the storage usage which we were getting from earlier solutions.
Finishing up
As I mentioned at the end of part 1, all of these solutions are valid but each solution suits a different use case and scale. You should choose the solution which is relevant to you.
WRITTEN BY Pulkit Chaudhary
The article was originally published on ShareChat TechByte.