I would say you're right about the differences between MyISAM and InnoDB.
The "feature configuration" system at Facebook is pretty good. That said, it's not perfect: the fact that there are two different caches depending on whether you're running in CLI or Apache SAPI mode has been a problem for me in the past; I still haven't figured out a way to run scripts from a crontab while getting access to the data cached by web pages. Running curl might be an option...
For a CDN, it really depends on the number of files you have and how much data you're sending out. If you're paying for a hosting plan with more bandwidth just to serve all your static images, then it might be worth it. You can calculate the costs yourself, actually: http://calculator.s3.amazonaws.com/calc5.html
There are alternatives to S3 (I compared Akamai and S3 in a study at the web company I work for; S3 came out cheaper for our traffic patterns). This is not to say that cheaper is better; you obviously have to look at the whole picture.
In general, the main problem to examine before moving your files to a CDN is the cache policy. S3 has datacenters in both Europe and the US, and there is certainly a propagation delay when you write a file.
There are techniques to avoid this, such as naming your images with a version number, but you have to make sure your clients are going to find the files when your pages start linking to them.
Another important issue is downtime. S3 recently had a several-hour outage... something that has never happened with our local hosting provider. What will your plan be when this happens? If you can detect that your CDN is down, can you redirect the traffic to your own host? And will that risk bringing you down too, if you downsized your equipment after the CDN migration?
It's certainly not an easy decision to make, but if properly planned and done right, a CDN can save you a lot of money.
Anyway, have a look at Memcached. We added a "Cacheable" attribute to class definitions in our in-house ORM so that cacheable data is taken from Memcache instead of querying the DB (that is, if it is available in the cache; otherwise there is a miss + Memcache SET).
For our relatively large website (~1000 people connected now, 1.7M subscribers), Memcache is saving 1 billion queries a month. I would suspect that APC is saving at least that many also.
Having this feature in the ORM makes it transparent for the developers. This is important, because having to wrap hundreds of DB GETs in cache-handling code by hand is pretty tedious and leads to unmaintainable code.
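To make the pattern concrete, here is a minimal cache-aside sketch in Python, with a plain dict standing in for the Memcache client. The names (`cached_query`, `run_query`) are illustrative only, not our actual ORM API, and a real client would honor the TTL:

```python
import hashlib

# A plain dict stands in for the Memcache client.
cache = {}

def cached_query(sql, run_query, ttl=86400):
    """Cache-aside read: the SQL text itself becomes the cache key."""
    key = "sql:" + hashlib.md5(sql.encode()).hexdigest()
    if key in cache:
        return cache[key]        # hit: no DB round-trip
    rows = run_query(sql)        # miss: query MySQL...
    cache[key] = rows            # ...then SET (a real client would pass ttl)
    return rows

# Demo: the second call never reaches the "database".
calls = []
def fake_db(sql):
    calls.append(sql)
    return [{"id": 1, "name": "General"}]

first = cached_query("SELECT * FROM Forum;", fake_db)
second = cached_query("SELECT * FROM Forum;", fake_db)
assert first == second and len(calls) == 1
```

Because the wrapping lives in one place, every model marked cacheable gets this behavior for free instead of repeating the hit/miss dance at every call site.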
All good sense... we'll be looking at Memcached soon, and our use of Propel should make it easy to integrate caching into the ORM like you suggest.
Question, though: I admit I've not given this much thought, but isn't the hard part working out when to clear the cache for individual rows/objects when the DB is updated? I'd be grateful for any tips on how you've dealt with this, since it sounds like we'll be implementing something very similar...
This creates a table for a forum definition, with its entries in the shared cache (Memcache) for a day - CACHE(local...) is in APC's user cache.
This means that when a developer writes ForumModule::getAll(), it builds the SQL query:
SELECT * FROM Forum;
and uses it as the key to query the cache. In case of a cache miss, the query is sent to MySQL and the data added to the cache with a lifetime of 86400 seconds.
When we add a new forum in the DB, it has no immediate impact on the website, since the data is served from the cache 99.9%+ of the time. We then have to erase the cache manually, using a maintenance script that runs ForumModule::removeAllCache(), which generates the same key and removes the item.
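The invalidation side works only because the key derivation is deterministic. A small sketch (again with a dict standing in for Memcache; the helper names are made up for illustration):

```python
import hashlib

cache = {}  # dict standing in for Memcache

def query_key(sql):
    # Deterministic key derivation: any code path that can build the
    # query can also rebuild the key, which is what makes a manual
    # removeAllCache-style script possible at all.
    return "sql:" + hashlib.md5(sql.encode()).hexdigest()

# A web request populated the cache earlier:
cache[query_key("SELECT * FROM Forum;")] = [{"id": 1}, {"id": 2}]

# The maintenance script regenerates the same key and drops the entry:
def remove_all_cache():
    cache.pop(query_key("SELECT * FROM Forum;"), None)

remove_all_cache()
assert query_key("SELECT * FROM Forum;") not in cache
```

The next page view gets a miss, re-runs the query against MySQL, and repopulates the entry with the fresh forum list.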
Note that this is not always possible... For instance, if you're caching the list of forums but have dozens of SQL results stored in cache, such as:
SELECT topic FROM Forum WHERE id = 42;
SELECT id FROM Forum WHERE category = 1729;
...
Then these will be used as keys to get results from the cache, and we won't be able to remove them. Last I checked, you couldn't enumerate the keys from Memcache - you can with APC.
We usually solve this by setting low TTL values. A query per day or a query every hour is not going to make a difference, but guarantees that all your changes will be visible within a short time. In some rare cases (large upgrade of the code base), we have flushed the caches completely. If some people have to wait an hour to have the new forum displayed, it's not an issue. Certainly we tune these values depending on the user experience that will follow.
There is something that I didn't mention but is quite important: do not always rely on user requests to rebuild your caches. If the cached value is a list of forums, fine. If it is a large tag cloud built from hundreds of queries, update it from a cron job instead. Otherwise, when 50 clients hit the tag cloud page just after the cache expires, they all get a miss and all run the rebuild at the same time, and the DB takes a big hit.
On a related note, don't rely on Memcache, the data might not be there. Always have a fallback data source in case the cache is not giving you what you expected.
This can mean using other caches for datasets that are complex to compute - for a tag cloud, I would store the compiled data in MySQL, not in Memcache; I don't want a client to have to regenerate it, and I don't want users to see a message saying the data is unavailable at the moment.