Most end-user services that use LDAP for authentication or authorization share the following characteristics:

  • The LDAP queries target a single entry. The most common query filter is uid=<username>, which is easily served by an equality index and should normally yield a single candidate entry, so the LDAP server can locate the entry in the database using the index alone.
  • Only a small percentage (as small as 10-20%) of the total user population is active at any given moment. That means your cache memory need not be as large as your database, only as large as your active working set. The user entries in the working set are the ones that get accessed a lot and raise your cache hit ratio, while the rest of the entries might never be accessed, or only very infrequently, so keeping a cache large enough to hold them is just a waste of memory. For example, in the Greek School Network we’ve got around 200,000 LDAP entries, but in a single day only 40,000-50,000 entries will be accessed. That’s 20-25%, which means that you can get great performance by tuning the entry/directory cache to hold 40-50% of the database. In our case we get a 98% (!) entry cache hit ratio with an entry cache size of 60,000 entries.
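
To make the working-set argument concrete, here is a minimal Python sketch that replays an access trace through a simple LRU cache and reports the hit ratio. The trace, the user IDs and the strict LRU policy are assumptions made for illustration; a real directory server's entry cache may evict entries differently.

```python
from collections import OrderedDict


def lru_hit_ratio(accesses, capacity):
    """Replay a sequence of entry IDs through an LRU cache and return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for entry_id in accesses:
        total += 1
        if entry_id in cache:
            hits += 1
            cache.move_to_end(entry_id)  # mark as most recently used
        else:
            cache[entry_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0


# Hypothetical trace: a working set of 4 active users, each authenticating 25 times.
trace = [uid for _ in range(25) for uid in ("u1", "u2", "u3", "u4")]

print(lru_hit_ratio(trace, capacity=6))  # cache covers the working set -> 0.96
print(lru_hit_ratio(trace, capacity=3))  # cache smaller than the working set -> 0.0
```

Once the cache is at least as large as the working set, the only misses are the first access to each entry. Making the cache bigger than that buys nothing, while making it smaller than the working set can collapse the hit ratio.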

Although the above is the usual case, a few services involve much heavier LDAP queries. One common example is mailing lists. We’ve got 70,000 teachers, so an LDAP query that builds a mailing list holding all their email addresses will hit the All IDs threshold (and thus will not use an index) and will need to look through the whole entry database. Imagine having to run the same type of query for various other user categories (students, users belonging to specific organizational units and so on). Both the load on the LDAP server and the total execution time of all the queries start to grow.
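
The difference between the two kinds of query can be illustrated with a toy model of an equality index that has an All IDs threshold. This is only a sketch of the general idea, not the server's actual implementation: the class, the function names and the tiny threshold are all made up for the example.

```python
ALL_IDS = object()  # sentinel: this index key matches too many entries to be useful


class ToyEqualityIndex:
    """Hypothetical equality index that gives up on keys with too many matches."""

    def __init__(self, attribute, all_ids_threshold):
        self.attribute = attribute
        self.all_ids_threshold = all_ids_threshold
        self.keys = {}

    def add(self, entry_id, entry):
        value = entry[self.attribute]
        ids = self.keys.setdefault(value, set())
        if ids is ALL_IDS:
            return  # key already degraded, nothing more to record
        ids.add(entry_id)
        if len(ids) > self.all_ids_threshold:
            self.keys[value] = ALL_IDS  # stop tracking individual IDs


def build_index(entries, attribute, all_ids_threshold):
    index = ToyEqualityIndex(attribute, all_ids_threshold)
    for entry_id, entry in entries.items():
        index.add(entry_id, entry)
    return index


def search(entries, index, value):
    """Return (matching entry IDs, number of entries that had to be examined)."""
    candidates = index.keys.get(value, set())
    if candidates is ALL_IDS:
        examined = sorted(entries)  # index is useless: scan every entry
    else:
        examined = sorted(candidates)  # fetch only the candidate entries
    matches = [i for i in examined if entries[i][index.attribute] == value]
    return matches, len(examined)


# Hypothetical 10-entry directory: 7 teachers and 3 students.
entries = {
    i: {"uid": f"user{i}", "affiliation": "teacher" if i < 7 else "student"}
    for i in range(10)
}
uid_index = build_index(entries, "uid", all_ids_threshold=3)
affiliation_index = build_index(entries, "affiliation", all_ids_threshold=3)

print(search(entries, uid_index, "user4"))            # one entry examined
print(search(entries, affiliation_index, "teacher"))  # all ten entries examined
```

In this model the uid lookup touches a single entry, while the teacher query exceeds the threshold and degrades to a scan of the whole directory, which is what happens at a much larger scale with the 70,000 teachers.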

In our 200K user database the id2entry file is 1.3GB, which translates to roughly 6.5KB/entry, so scanning the whole database from disk can take quite some time. If you also take into account the total size of the indexes (500MB) and a minimum entry cache of around 60,000 entries (~900MB), that adds up to about 2.7GB, so it’s obvious that you need at least 4GB of RAM to keep the database in memory and still be able to handle other LDAP queries and return processed entries quickly.
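
The sizing arithmetic can be written out as a short Python calculation. The figures are the ones quoted above; treating 1GB as 1000MB is an assumption, and the variable names are only illustrative.

```python
MB = 1000 * 1000  # decimal megabytes (assumption; the figures above are approximate)

entries = 200_000          # total entries in the directory
id2entry_mb = 1300         # size of the id2entry file
index_mb = 500             # total size of the indexes
entry_cache_mb = 900       # memory used by the entry cache
cached_entries = 60_000    # entries held in the entry cache

bytes_per_entry_on_disk = id2entry_mb * MB / entries
bytes_per_cached_entry = entry_cache_mb * MB / cached_entries
total_mb = id2entry_mb + index_mb + entry_cache_mb

print(f"on-disk size per entry: {bytes_per_entry_on_disk:.0f} bytes")
print(f"in-memory size per cached entry: {bytes_per_cached_entry:.0f} bytes")
print(f"database + indexes + entry cache: {total_mb} MB")
```

The total comes to about 2.7GB before counting the operating system and the server process itself, which is why 4GB is a sensible minimum. Note also that a cached entry takes noticeably more memory than the same entry does on disk.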

Since in this case you’ll hit the All IDs threshold no matter how you run the LDAP query, the only alternative is to find a way to lower the database size. The way to do that is Fractional Replication. Fractional Replication allows you to set up a replica that will only request a specific set of attributes for all entries. You obviously have to be careful to always request all attributes required by the objectclasses on your LDAP entries, but other than that you can keep the attribute set to the bare minimum needed for your heavy LDAP queries to work. For the mailing list example you only need to keep the attributes used for entry selection (edupersonaffiliation, edupersonorgunitdn) and the mail attributes (mail, mailalternateaddress), and you can drop the rest. That way you can lower your database size to a fraction of the original and keep it in memory for all required queries, lowering your query processing time dramatically.
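
The effect of Fractional Replication on a single entry can be sketched in Python as a simple attribute projection. This only illustrates what the replica ends up storing; it is not how replication is configured on any particular server. The sample entry, the helper names and the choice of objectclass, uid, cn and sn as the required attributes are assumptions.

```python
# Attributes the hypothetical mailing-list replica keeps (compared case-insensitively).
KEEP = {
    "objectclass", "uid", "cn", "sn",               # assumed to be required by the schema
    "edupersonaffiliation", "edupersonorgunitdn",   # used for entry selection
    "mail", "mailalternateaddress",                 # the data the mailing list needs
}


def fractional_view(entry, keep=KEEP):
    """Return a copy of the entry holding only the attributes in `keep`."""
    return {attr: values for attr, values in entry.items() if attr.lower() in keep}


def approx_size(entry):
    """Rough size estimate: attribute names plus the length of every value."""
    return sum(len(attr) + sum(len(v) for v in values) for attr, values in entry.items())


# Made-up entry; the long description stands in for bulky attributes.
full_entry = {
    "objectClass": ["inetOrgPerson", "eduPerson"],
    "uid": ["jdoe"],
    "cn": ["John Doe"],
    "sn": ["Doe"],
    "eduPersonAffiliation": ["teacher"],
    "mail": ["jdoe@example.org"],
    "userPassword": ["{SSHA}placeholder"],
    "description": ["x" * 500],
}

replica_entry = fractional_view(full_entry)
print(sorted(replica_entry))
print(approx_size(full_entry), "->", approx_size(replica_entry))
```

Applied to every entry in the directory, the same projection is what shrinks the replica's database to a fraction of the original.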

Fractional Replication is a feature provided by modern LDAP servers such as OpenLDAP and Sun Java DS. Sun DS also provides Class of Service, which can be used to generate attribute values dynamically and so lower the size of your original database.