I would say you're right about the differences between MyISAM and InnoDB.
The "feature configuration" system at Facebook is pretty good. That said, it's not perfect: the fact that there are two different caches depending on whether you're running in CLI or Apache SAPI mode has been a problem for me in the past; I still haven't figured out a way to run scripts from a crontab while getting access to the data cached by web pages. Running curl might be an option...
For a CDN, it really depends on the number of files you have and how much data you're sending out. If you're paying for a hosting plan with more bandwidth just to serve all your static images, then it might be worth it. You can calculate the costs yourself, actually: http://calculator.s3.amazonaws.com/calc5.html
There are alternatives to S3 (I compared Akamai and S3 in a study at the web company I work for; S3 came out cheaper for our traffic patterns). This is not to say that cheaper is better; you obviously have to look at the whole picture.
In general, the main problem to examine before moving your files to a CDN is the cache policy. S3 has datacenters in both Europe and the US, and there is certainly a propagation delay when you write a file.
There are techniques to avoid this, such as naming your images with a version number, but you have to make sure your clients are going to find the files when your pages start linking to them.
Another important issue is downtime. S3 recently had a several-hour outage... something that has never happened with our local hosting provider. What will your plan be when this happens? If you can detect that your CDN is down, can you redirect the traffic to your own host? And will that risk bringing you down too, if you downsized your equipment after the CDN migration?
It's certainly not an easy decision to make, but if properly planned and done right, a CDN can save you a lot of money.
Anyway, have a look at Memcached. We added a "Cacheable" attribute to class definitions in our in-house ORM so that cacheable data is taken from Memcache instead of querying the DB (that is, if it is available in the cache; otherwise there is a miss + Memcache SET).
For our relatively large website (~1000 people connected now, 1.7M subscribers), Memcache is saving 1 billion queries a month. I would suspect that APC is saving at least that many also.
Having this feature in the ORM makes it transparent for the developers. This is important, because having to wrap hundreds of DB GETs in cache-handling code by hand is pretty tedious and leads to unmaintainable code.
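To make the pattern concrete, here is a minimal cache-aside sketch in Python, with a plain dict standing in for the Memcache client. The names (`cached_query`, `run_query`) are illustrative only, not our actual ORM API, and a real client would honor the TTL:

```python
import hashlib

# A plain dict stands in for the Memcache client.
cache = {}

def cached_query(sql, run_query, ttl=86400):
    """Cache-aside read: the SQL text itself becomes the cache key."""
    key = "sql:" + hashlib.md5(sql.encode()).hexdigest()
    if key in cache:
        return cache[key]        # hit: no DB round-trip
    rows = run_query(sql)        # miss: query MySQL...
    cache[key] = rows            # ...then SET (a real client would pass ttl)
    return rows

# Demo: the second call never reaches the "database".
calls = []
def fake_db(sql):
    calls.append(sql)
    return [{"id": 1, "name": "General"}]

first = cached_query("SELECT * FROM Forum;", fake_db)
second = cached_query("SELECT * FROM Forum;", fake_db)
assert first == second and len(calls) == 1
```

Because the wrapping lives in one place, every model marked cacheable gets this behavior for free instead of repeating the hit/miss dance at every call site.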
All good sense... we'll be looking at Memcached soon, and our use of Propel should make it easy to integrate caching into the ORM like you suggest.
Question, though: I admit I've not given this much thought, but isn't the hard part working out when to clear the cache for individual rows/objects when the DB is updated? I'd be grateful for any tips on how you've dealt with this, since it sounds like we'll be implementing something very similar...
This creates a table for a forum definition, with its entries in the shared cache (Memcache) for a day - CACHE(local...) is in APC's user cache.
This means that when a developer writes ForumModule::getAll(), it builds the SQL query:
SELECT * FROM Forum;
and uses it as the key to query the cache. In case of a cache miss, the query is sent to MySQL and the data added to the cache with a lifetime of 86400 seconds.
When we add a new forum in the DB, it has no immediate impact on the website, since the data is served from the cache 99.9%+ of the time. We then have to erase the cache manually, using a maintenance script that runs ForumModule::removeAllCache(), which generates the same key and removes the item.
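The invalidation side works only because the key derivation is deterministic. A small sketch (again with a dict standing in for Memcache; the helper names are made up for illustration):

```python
import hashlib

cache = {}  # dict standing in for Memcache

def query_key(sql):
    # Deterministic key derivation: any code path that can build the
    # query can also rebuild the key, which is what makes a manual
    # removeAllCache-style script possible at all.
    return "sql:" + hashlib.md5(sql.encode()).hexdigest()

# A web request populated the cache earlier:
cache[query_key("SELECT * FROM Forum;")] = [{"id": 1}, {"id": 2}]

# The maintenance script regenerates the same key and drops the entry:
def remove_all_cache():
    cache.pop(query_key("SELECT * FROM Forum;"), None)

remove_all_cache()
assert query_key("SELECT * FROM Forum;") not in cache
```

The next page view gets a miss, re-runs the query against MySQL, and repopulates the entry with the fresh forum list.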
Note that this is not always possible... For instance, if you're caching the list of forums but have dozens of SQL results stored in cache, such as:
SELECT topic FROM Forum WHERE id = 42;
SELECT id FROM Forum WHERE category = 1729;
...
Then these will be used as keys to get results from the cache, and we won't be able to remove them. Last I checked, you couldn't enumerate the keys from Memcache - you can with APC.
We usually solve this by setting low TTL values. A query per day or a query every hour is not going to make a difference, but guarantees that all your changes will be visible within a short time. In some rare cases (large upgrade of the code base), we have flushed the caches completely. If some people have to wait an hour to have the new forum displayed, it's not an issue. Certainly we tune these values depending on the user experience that will follow.
There is something that I didn't mention but is quite important: do not always rely on user requests to rebuild your caches. If the cached value is a list of forums, fine. If it is a large tag cloud built from hundreds of queries, update it from a cron job instead. Otherwise, when 50 clients hit the tag cloud page just after the cache expires, they all get a miss and all run the rebuild at the same time, and the DB takes a big hit.
On a related note, don't rely on Memcache, the data might not be there. Always have a fallback data source in case the cache is not giving you what you expected.
This can mean using other caches for datasets that are complex to compute - for a tag cloud, I would store the compiled data in MySQL, not in Memcache; I don't want a client to have to regenerate it, and I don't want users to see a message saying the data is unavailable at the moment.