How to Dramatically Speed Up Your Web Application: An Introduction to memcached

Six Apart often speaks of how its technology and contributions to open source help many of the most popular Internet applications scale to unprecedented levels. One of these tools is memcached, yet as important and ubiquitous as it is, it surprisingly lacks the documentation necessary to help beginners realize its power and ease of use. Memcached is not for priests and gurus, despite how technical its homepage may appear. Memcached is a tool that can be easily installed and used by almost any developer on almost any platform.

This guide was written in an attempt to give developers out there an introduction to this incredibly powerful tool, and hopefully equip them with enough information to actually get started in building applications on the Internet that are faster and more reliable.

What is memcached?

Despite its fairly self-explanatory name, it is surprising how many people don’t know the answer to this question. So let’s dissect it. A “cache” is a mechanism used by computers to store frequently used information in a readily accessible place to reduce the need to retrieve that information repeatedly and needlessly. A common cache that everyone who uses the Internet is probably familiar with is your web browser’s cache. This cache works by storing the HTML, CSS, Javascript and media files for every web page you visit, so that the next time you access the same page it can be loaded and displayed much more quickly. Because there is no limit to the number of web sites you might visit, and therefore no limit to what the browser might cache, caches are almost always constrained by size. When a cache exceeds its size limit, it decides what to keep around so that the information it stores is the most likely information to be needed again.

The “mem” in memcached refers to memory of course. Therefore, memcached is a memory cache; but it is so much more than that. Memcached can be deployed anywhere and be accessed from anywhere over a network. Additionally, the cache can span as many machines as you need, but no matter how many machines make up your memcached cluster, as far as your code is concerned memcached acts as a single cache, so you never have to worry about which machine your information is stored on. You just say to the cache, “give me the object named ‘foo’” and memcached knows where to go to get ‘foo’ for you.
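To make that concrete, here is a minimal sketch using the PHP Memcache extension (the same extension used in the code samples later in this article). The server names are hypothetical; the point is simply that however many servers you add, your code talks to one logical cache:

$memcache = new Memcache;
// Add as many cache servers as you like; the client hashes keys across them.
$memcache->addServer('cache1.example.com', 11211);
$memcache->addServer('cache2.example.com', 11211);

// The client works out which server holds 'foo'; your code never needs to know.
$foo = $memcache->get('foo');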

No doubt if you took Computer Science in school you were cautioned against the temptation to abuse caches, because there is a law of diminishing returns with regard to the size of your cache: the larger your cache gets, the more costly it is to retrieve and store information within it. Memcached, however, is not heavily constrained in this way, because the cache at large is made up of lots of little caches. This allows memcached to remain responsive even when the cache itself reaches sizes that would be really inefficient in other circumstances.

Finally, memcached is fast. It utilizes highly efficient, non-blocking networking libraries to ensure that it stays fast even under heavy load. In other words, in circumstances where your database might be falling over, memcached won’t be. That is precisely what memcached was designed to do: take the load off of your database, which for the majority of popular web applications is the biggest performance bottleneck and risk to scalability.

How do I use memcached?

Ok, so now that you get the gist of what memcached is, how do you use it? From the 10,000-foot view, here is the basic sequence of events you want to create in your code:

  1. Right before you query the database to retrieve data, check to see if that information is stored in the cache. If the data is found in memcached, then use the cached data as opposed to querying the database for the information.
  2. If the information is not found in memcached, then go ahead and call the database. Once you load the data you queried for, don’t forget to put it in the cache. Then proceed normally. In subsequent calls to fetch this information you won’t need to call the database at all.
  3. Now, if the information changes for some reason, a user updates the data for example, then delete the data from the cache. That way if someone tries to load it again, they will be forced to go back to the database to get it again. This keeps the cache fresh and accurate.

Prose is a poor way to describe code, so here is some basic pseudo code that says the same thing:

Class Foo {
    public static findById(id) {
        // Check the cache first.
        if (obj = memcached.get(id)) return obj;
        // Cache miss: load from the database and store the result in the cache.
        obj = loadFromDatabase(id);
        memcached.put(id, obj);
        return obj;
    }
    public update() {
        // The data is changing, so remove the stale copy from the cache.
        memcached.delete(this.id);
        updateDatabase(this);
    }
    …
}

Does it seem too simple? It is. But that’s a good thing, right?

Choosing a Key

When working with memcached, it is easiest to think of the cache as one big associative array in which each item in the array is indexed by some arbitrary string. Therefore to manipulate data stored in the cache you need to define something called a “key.” A key uniquely identifies data stored in the cache, and is used when storing, retrieving and removing data from the cache.
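To make the associative array analogy concrete, here is a minimal sketch of the three basic operations using the PHP Memcache extension; the key and value are made up for illustration:

$memcache = new Memcache;
$memcache->connect('127.0.0.1', 11211) or die("Could not connect to memcache server.");

// Store a value under a key for 60 seconds (the third argument disables compression).
$memcache->set('greeting', 'Hello, world', false, 60);

// Retrieve it by the same key; get() returns false on a cache miss.
$greeting = $memcache->get('greeting');

// Remove it when it is no longer valid.
$memcache->delete('greeting');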

Technically a key can be any value you choose. I personally try to keep key names consistent within an application. The pattern I usually opt for is something like this:

key ::= data type name “:” primary key

Where “data type name” is the name of the class/object being stored and “primary key” is the unique identifier for that record for that data type. For example, in my test case management software Test Run, I use keys that look like this:

  • “TestCase:3283”
  • “TestRunAccount:93”
  • “TestPlan:283”
  • “User:8922”
  • And so on and so forth

But at the end of the day, the key is not important provided that you know what it is, how to generate it, and that it uniquely identifies the information being stored.
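If it helps, here is a tiny hypothetical helper in PHP (the function name is mine, not part of any library) that follows that pattern:

// Build a "data type name : primary key" cache key,
// e.g. cache_key('TestCase', 3283) returns "TestCase:3283".
function cache_key($type, $id) {
    return $type . ':' . $id;
}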

Is memcached secure?

Access to memcached is not protected by a username and password, nor is the data within it. So while access control does not exist natively for memcached, simple things can be done to harden your instance of memcached:

  • Prevent external access – Deploy memcached behind your firewall and only allow machines within a specific network to access the cache.
  • Encrypt the data you store in the cache – I personally feel this is overkill because for most applications it adds an extra hoop to jump through every single time you visit the cache. But for the hyper-paranoid who work in shared environments, I suppose it is something worth considering.
  • Choose obscure keys – There is no way for a user to query memcached for a list of keys, so the only way for someone to retrieve information stored there is if they know the key for the corresponding information. Therefore, if your keys have predictable names it would be relatively easy for someone to guess them. So make your keys somewhat obscure. Consider creating a simple key like “object:10032” and then generating a sha1 hash of it, as sketched below. This gives you a very obscure key name while still using a standard, easy to remember key naming scheme of your choosing.
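Here is a quick sketch of that last suggestion; the key and the cached data are made up for illustration:

// Build a readable key using the naming scheme from earlier, then hash it so the
// key actually stored in memcached cannot be guessed.
$plain_key = 'TestCase:3283';
$cache_key = sha1($plain_key);               // a 40-character hex string

$data = array('title' => 'My test case');    // whatever you are caching
$memcache->set($cache_key, $data, false, 3600);
$value = $memcache->get($cache_key);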

So is memcached secure? Well, while it does not have built-in security features, it can easily be made secure.

What should I use memcached for?

Memcached was designed to alleviate database load. Try to think of the data that is likely to be requested over and over again in close succession. The most common thing I can think of is user and account data. In my applications I tend to load the current user and all their account data with every request. Because the same user information is loaded over and over again as they navigate my application, I stash their information in memcached so that from request to request I don’t have to hit the database to retrieve it.

I also use memcached to store data that is costly to compute; a good example is counts. Whenever you have a SQL statement that computes a count of the number of items in a group, you are probably making a relatively expensive database query. I try to identify these “expensive queries” and cache their results so that I minimize the number of times I have to make them.
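For illustration, here is a rough sketch of that idea in PHP; the table and column names are invented, $db is assumed to be a PDO handle, and the five-minute expiry is an arbitrary choice:

function cached_member_count($memcache, $db, $group_id) {
    $key   = 'GroupMemberCount:' . $group_id;
    $count = $memcache->get($key);
    if ($count === false) {
        // Cache miss: run the relatively expensive COUNT query.
        $stmt = $db->prepare('SELECT COUNT(*) FROM members WHERE group_id = ?');
        $stmt->execute(array($group_id));
        $count = (int) $stmt->fetchColumn();
        // Cache the result for five minutes.
        $memcache->set($key, $count, false, 300);
    }
    return $count;
}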

To help me in crunch times I have gotten into the habit of inserting debug logging around database queries to time them and output how long they took to execute. That way, when things start to melt down, I can flip a switch to turn on debug logging and quickly identify the query that is the most likely culprit for my performance problem. From that point it is easy for me to devise a caching scheme for its data.
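Here is a minimal sketch of that kind of instrumentation in PHP; the DEBUG_SQL constant and the call to error_log() are placeholders for whatever switch and log destination your application already has, and $db is again assumed to be a PDO handle:

define('DEBUG_SQL', true);  // the "switch" you flip when things start to melt down

function timed_query($db, $sql, $params = array()) {
    $start = microtime(true);
    $stmt  = $db->prepare($sql);
    $stmt->execute($params);
    if (DEBUG_SQL) {
        // Queries that show up here repeatedly with large times are good caching candidates.
        error_log(sprintf('[sql] %.4fs %s', microtime(true) - $start, $sql));
    }
    return $stmt;
}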

Sample Code

By now you should have more than a good idea of how to deploy memcached. So let’s get our hands dirty with some code. There are many memcached client libraries to choose from, covering all of the most popular programming languages. The code samples below are not complete, but they will give you an idea of what the memcached API looks like in each of three languages: Perl, PHP and Java.

Perl

For Perl users I recommend using Data::ObjectDriver. D::OD solves two problems for you: it eliminates the need for you to write SQL ever again (hallelujah) and automatically caches stuff in memcached for you. Praise the lord. Working with memcached has never been so seamless and effortless.

## Wire up a memcached-backed driver that falls back to the database on a cache miss.
use Cache::Memcached;
use Data::ObjectDriver::Driver::DBI;
use Data::ObjectDriver::Driver::Cache::Memcached;

my $driver = Data::ObjectDriver::Driver::Cache::Memcached->new(
    cache    => Cache::Memcached->new({
                    servers => [ '127.0.0.1:11211' ],
                }),
    fallback => Data::ObjectDriver::Driver::DBI->new(
                    dsn      => 'dbi:mysql:dbname',
                    username => 'username',
                    password => 'password',
                ),
);

## Set up the class for your recipe objects.
package Recipe;
use base qw( Data::ObjectDriver::BaseObject );
__PACKAGE__->install_properties({
    columns     => [ 'recipe_id', 'title' ],
    datasource  => 'recipe',
    primary_key => 'recipe_id',
    driver      => $driver,
});

## The first lookup hits the database and primes the cache...
my $recipe1 = Recipe->lookup(10);

## ...so this one is served from memcached.
my $recipe2 = Recipe->lookup(10);

PHP

To use memcached in PHP you must install the memcached PHP extension. This is relatively easy with newer versions of PHP that come prepackaged with pear and pecl. Once the extension is installed, this is what the code looks like:

$memcache = new Memcache;
$memcache->connect(MEMCACHESERVER, MEMCACHEPORT)
    or die("Could not connect to memcache server.");

class Foo {
    public static function findById($id) {
        global $memcache;
        $key = 'Foo:' . $id;
        // Check the cache first.
        $f = $memcache->get($key);
        if (is_a($f, 'Foo')) {
            return $f;
        }
        // Cache miss: load from the database and cache the result for 45 seconds.
        $f = new Foo();
        $f->id = $id;
        $f->select();
        $memcache->set($key, $f, false, 45);
        return $f;
    }
    // More code here
}

Java

Here is a simple code sample using Java:

import com.danga.MemCached.MemCachedClient;

// Assumes a SockIOPool has already been initialized with your memcached server list.
MemCachedClient mc = new MemCachedClient();
String key   = "cacheKey1";
Object value = SomeClass.getObject();
mc.set(key, value);

Summary

Whatever language you are using, memcached is simple and easy to deploy. It does not require a lot of technical knowledge to use, or to use effectively; it just does what it is supposed to. And if you build in support for memcached early in your product’s development lifecycle, you are making a worthwhile investment should you build something that doesn’t suck and actually gets used by people, perhaps even a lot of people. If you are so lucky, your application won’t fall over due to its own success.


10 Comments

Nice post, Byrne. Very informative and interesting.

Do we happen to have an MT::ObjectDriver subclass that supports memcached? Seems like that would be a no-brainer and great not only for in-app performance but also for MT-Search.

Jay - not that I am aware of. It seems simple enough to create though. Are you volunteering buddy?

finally found the original from which you copied : http://www.infinitywage.com/memcaching

and to think you'd be original... sheesh...

It's possible that you might have functions that modify something that is part of several larger, more complex objects, and you need to clear out entire sections of your cache to be sure.

Using a pattern like:
function getcomplex1($id)
{
    $version = $memcache->get('getcomplex1.version');
    $result  = $memcache->get('getcomplex1-' . $version . '-' . $id);
    // ... load from the database and cache it if not found ...
    return $result;
}

function getcomplex2($id)
{
    $version = $memcache->get('getcomplex2.version');
    $result  = $memcache->get('getcomplex2-' . $version . '-' . $id);
    // ... load from the database and cache it if not found ...
    return $result;
}

function updatepart($id, $value)
{
    $database->updatepart($id, $value);
    // Bumping the version numbers invalidates every key built with the old version.
    $memcache->set('getcomplex1.version', $memcache->get('getcomplex1.version') + 1);
    $memcache->set('getcomplex2.version', $memcache->get('getcomplex2.version') + 1);
    $memcache->set('part-' . $id, $value);
}

This makes it really easy to invalidate multiple sections of your cache as necessary.

Correct me if I'm wrong, but I think there is a typo in the sample PHP code.

if (is_a($c, 'Foo')) { return $c; } should read: if (is_a($f, 'Foo')) { return $f; }

This is a great tutorial for memcached.

Hashing the key? That can't be a good thing, can it? Then there is a chance of key collision, that is, !s1.equals(s2) && sha1(s1).equals(sha1(s2)). This can happen because hashes map a larger domain onto a smaller domain, so distinct inputs can share a hash.

This means that if I do cache.get(sha1(user1)) I may actually get the content for user2 back if the sha1 hashes of both keys are equal.

Instead consider adding the sha1 to the original unique key.

Dear Anonymous, SHA1 hashes are strong crypto hashes; they are collision-resistant for practical purposes despite mapping a larger domain to a smaller one. For example, the Git version control system depends on this, storing every object only by its sha1 hash. There is no need to use the original unique key.

If I have a complex structure that I wish to cache, for example a nested associative array in PHP, I could store it in two ways:

1) serialize the structure and store the whole thing under one key, or
2) break the structure down into multiple keys and store each primitive data type in the structure under its own key.

It seems obvious that option 1 is faster and less of a resource hog. But are there any limits to the size of the structure I can store in the value?

Excellent introduction to MemCached
