Caching Data using Python

We all know that caching is basically keeping the important/most popular data in memory rather than in a disk for faster execution. But someone has to be sensible enough to decide what needs to be cached and how? There is a common tendency to start keeping any data in memory without realizing the fact that memory is expensive, also, more unwanted data means compromising with execution time.

There may be different reasons of caching data, but below are the 2 most important facts -

1. To reduce I/O (network calls) as much as possible.

2. Avoid re-computation.

Once we decide what needs to be cached then we may need to think of

1. How long data needs to be hold in cache.

2. When new data will be onboarded how we are going to manage the memory.

For this we need to follow any caching policy. There are different policies, but LRU (Least Recently Used) is the most popular one. LRU policy will keep the latest data on the top of the stack and it will remove data from the bottom of the stack which is not been used since long.

Now, when we think about caching at a large scale, we may be talking about distributed caching mechanism. The distributed cache has capability to maintain the data consistency and it’s scalable too. As the required data is available in distributed cache hence if one server fails (as shown below) same data can be pulled from the cache.

Let’s talk about consistency. We are pulling data from cache that’s alright, but there is another copy in disk too. So, while reading and writing make sure both the versions (by referring version mean modified data) are in sync. There are several ways to maintain the data consistency and it depends on the requirement. The below table describes different consistency methods based on the situations.

Below I’ve explained the same three techniques Write-through, Write-back and Write-around.

Write-through:

If we are looking for an I/O confirmation only after writing data back to Memory and Disk no matter in which sequence, then we should consider Write-through.

Pros: 1) There will be no data loss in case cache is disrupted. 2) Reading is fast.

Cons: As we are writing data in two places every time hence this process will be slow.

Use Case: If our application is performing frequent read operation and less write. So, it’s okay to spend little bit more time to write data once and then read as many times required.

Write-back:

If there is a need of I/O confirmation only after writing data back to Memory then we should consider Write-through. Writing data on Disk will take place later based on the schedule/backend process. So, there will always be a possibility of having different versions of data.

Pros: Both Reading and Writing is fast as it’s in-memory.

Cons: 1) Data inconsistency. 2) Also, there will be a risk of data loss if memory fails and the data was not persisted to disk before that.

Use Case: Response time of reading and writing data is almost same. If there is a demand of frequent write and read on the same data then we can consider this approach.

Write-around:

This is little different. Here data is written only to Disk and then I/O completion is confirmed. Data will not be written back to Memory unless there is query looking for the same data in memory. Once there is an in-memory request the requested data will be copied from disk to memory based on the defined process.

Pros: No unnecessary write in memory.

Cons: High possibility of encountering cache miss for recently written data and hence reading will be time consuming.

Use Case: If we are not reading very recently written data then this approach is good to consider.

Hope this article gives you a very high level overview of different techniques to store data based on the situation.

Big Data Engineer and pySpark Developer