We've set up a DS6.x farm for the Greek School Network. It includes two datacenters (one for northern and one for southern Greece), each with one write master and two read-only replicas. The write masters are in a multi-master topology, while the read-only replicas are updated from both masters. The idea is to be able to lose an entire datacenter without complete LDAP service failure. We've been playing with the servers for about a month now, and my impressions so far are:

  • The web console is far superior to the previous Java console: much faster, accessible from any web browser, and nicely laid out.
  • The installation is much harder now and requires far too many commands to get things going. Instead of just a zip with an installer, you now have to install the directory server, play with cacao and DSCC (Directory Service Control Center), deploy the console in an application server, and learn a lot of <whatever>adm commands (a rough sketch of the steps we followed appears after this list). We also found the installation guide to be quite lacking and a bit unclear on a few issues.
  • Security needs more attention with the new server, since you now have to worry about a dozen ports instead of just the LDAP(S) and console ports: a few ports for cacao, two ports (HTTP/HTTPS) for the console, and the usual directory server ports.
  • The documentation quality is a bit degraded compared to previous products. It seems documentation writing is starting to be outsourced to India as well 🙂 I got the impression that the 5.x documentation was written by the engineers themselves, while the 6.x documentation was written by an outside partner.
  • Up until 6.1 we faced quite a lot of stability issues (the servers run Solaris on 64-bit x86). The LDAP server would crash for no reason, and sometimes replication would fail and replicas would have to be reinitialized. We installed 6.2 today and so far so good, but I have to say I'm rather disappointed with the software quality compared to what I had gotten used to with the 5.x versions.
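
For the record, here is roughly what getting a working installation entailed on our systems. The instance path and ports are from our setup and the exact flags are quoted from memory, so treat this as an approximation rather than a copy-paste recipe:

    # create the DSCC registry and generate the console WAR file
    dsccsetup ads-create
    dsccsetup war-file-create
    # make sure the common agent container is up
    cacaoadm status
    cacaoadm start
    # create and start an actual directory server instance
    dsadm create -p 389 -P 636 /var/opt/ds/instance1
    dsadm start /var/opt/ds/instance1

After that, the WAR file still has to be deployed into an application server by hand before the web console works.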

We haven't done any stress testing yet to see how the servers behave performance-wise. Online replica initialization (over a LAN), on the other hand, can reach speeds of at least 500 entries/sec (with a 256MB import cache), which is quite impressive. I'll try to update this post as things move on.
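
For reference, the initialization was done along these lines. The hostnames are placeholders, and both the property name and the argument order of init-repl are from memory, so double-check them against the dsconf man page:

    # raise the import cache on the consumer before initializing
    dsconf set-server-prop -h replica1.sch.gr -p 389 import-cache-size:256M
    # push a full online initialization from the master to the consumer
    dsconf init-repl -h master1.sch.gr -p 389 dc=sch,dc=gr replica1.sch.gr:389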

Ludo posted a nice link to some filesystem tuning guidelines for a Directory Server.

Updates…

9/10

Today slapd on one of my read-only replicas would not accept replication updates from the master. The replica had died with a kernel panic a day earlier (although it did not complain about needing an fsck). It seems the database got corrupted, and I had to reinitialize it. I hadn't seen database corruption in years, and I can't say I'm pleased to see it again. Here are the error logs:

[09/Oct/2007:13:34:06 +0300] - INFORMATION - conn=-1 op=-1 msgId=-1 - get key @99613:=nic.mes.sch.gr failed, error -30987 DB_PAGE_NOTFOUND: Requested page not found
[09/Oct/2007:13:34:06 +0300] - DEBUG - conn=-1 op=-1 msgId=-1 - database index operation failed BAD 1130, err=-30987 DB_PAGE_NOTFOUND: Requested page not found
[09/Oct/2007:13:34:06 +0300] - DEBUG - conn=-1 op=-1 msgId=-1 - database index operation failed BAD 1140, err=-30987 DB_PAGE_NOTFOUND: Requested page not found
[09/Oct/2007:13:34:06 +0300] - DEBUG - conn=0 op=0 msgId=8 - index_add_mods failed, err=-30987 DB_PAGE_NOTFOUND: Requested page not found
[09/Oct/2007:13:34:06 +0300] - WARNING<10249> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Internal error Failed to replay LDAP operation with csn 470dd51e000000030000 for replica dc=sch,dc=gr, got error: Operations error (1).
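
The reinitialization itself boiled down to something like the following (the replica hostname is a placeholder and the subcommand names are hedged from memory):

    # check what the replication agreement thinks is going on
    dsconf show-repl-agmt-status -h master1.sch.gr -p 389 dc=sch,dc=gr replica1.sch.gr:389
    # then push a fresh full initialization to the broken consumer
    dsconf init-repl -h master1.sch.gr -p 389 dc=sch,dc=gr replica1.sch.gr:389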

10/10

After a reboot on one of the write masters (using the Solaris reboot command) I got a corrupt database again (at startup it first logged 'Detected Disorderly Shutdown last time Directory Server was running, recovering database.'). I'm starting to get a little worried…
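
Worth noting: on Solaris, plain reboot does not run the rc shutdown scripts, so slapd gets killed rather than stopped cleanly, which is presumably what triggered the disorderly-shutdown recovery. Something along these lines should be safer (the instance path is a placeholder for our actual one):

    # stop the instance cleanly first, then reboot through init so the rc scripts run
    dsadm stop /var/opt/ds/instance1
    init 6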

3/1

I got back from the Christmas holidays and found one datacenter in ruins. Both read-only replicas had corrupt databases. The master directory server dies on its knees every minute, right after trying to contact the corrupt replicas. As things stand right now, there's no way I'm using 6.2, and I'm seriously thinking of moving back to 5.2 in order to move services to the new server farm. As an added bonus, GRNET reported that they also faced database corruption in their own in-house 6.2 installation.

Error logs:

Read-Only Replica:

[03/Jan/2008:13:04:45 +0200] - DEBUG - conn=-1 op=-1 msgId=-1 - libdb: PANIC: fatal region error detected; run recovery
[03/Jan/2008:13:04:45 +0200] - ERROR<20501> - Backend Database - conn=-1 op=-1 msgId=-1 - Serious failure in dblayer_txn_begin, err=-30974 (DB_RUNRECOVERY: Fatal error, run database recovery)
[03/Jan/2008:13:04:45 +0200] - DEBUG - conn=-1 op=-1 msgId=-1 - Modify BAD 81, err=-30974 DB_RUNRECOVERY: Fatal error, run database recovery
[03/Jan/2008:13:04:45 +0200] - WARNING<10249> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Internal error Failed to replay LDAP operation with csn 47702878000000030000 for replica dc=sch,dc=gr, got error: Operations error (1).
[03/Jan/2008:13:04:45 +0200] - ERROR<8312> - Repl. Transport - conn=-1 op=-1 msgId=-1 - Internal error [C] Replay of pending changes failed, returning 1
[03/Jan/2008:13:04:45 +0200] - ERROR<8313> - Repl. Transport - conn=57 op=4 msgId=5 - Internal error [C] Decoding of group of changes failed, returning 1
[03/Jan/2008:13:04:45 +0200] - DEBUG - conn=-1 op=-1 msgId=-1 - libdb: PANIC: fatal region error detected; run recovery

Master:

[03/Jan/2008:13:15:08 +0200] - Database recovery is 0% complete.
[03/Jan/2008:13:15:08 +0200] - Database recovery is 100% complete.
[03/Jan/2008:13:15:09 +0200] - WARNING<20805> - Backend Database - conn=-1 op=0 msgId=-1 - search is not indexed base='dc=sch,dc=gr' filter='(|(objectclass=groupOfNames)(objectclass=groupOfUniqueNames))' scope='sub'
[03/Jan/2008:13:15:09 +0200] - WARNING<10249> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Internal error Could not replay operation with csn 47702853000000030000 to consumer dsr1.att.sch.gr:389/dc=sch,dc=gr, got error: Operations error (1).
[03/Jan/2008:13:15:09 +0200] - ERROR<8293> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Unable to replay update A fatal problem occurred on the consumer side : cn=dsr1.att.sch.gr:389,cn=replica,cn=dc=sch,dc=gr,cn=mapping tree,cn=config, with error 1
[03/Jan/2008:13:15:09 +0200] - WARNING<10253> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Unable to replay update Could not replay operation with csn 47702878000000030000 to consumer dsr1.att.sch.gr:389/dc=sch,dc=gr
[03/Jan/2008:13:15:09 +0200] - WARNING<10253> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Unable to replay update Failed to replay change (uniqueid 04719382-1dd211b2-8007c278-5e5f43e6, CSN 47702878000000030000) to replica [dsr1.att.sch.gr:389/dc=sch,dc=gr]: Error 901.
[03/Jan/2008:13:15:09 +0200] - ERROR<8221> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Failed and requires administrator action [dsr1.att.sch.gr:389]
[03/Jan/2008:13:15:09 +0200] - WARNING<10249> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Internal error Could not replay operation with csn 47702878000000030000 to consumer dsr2.att.sch.gr:389/dc=sch,dc=gr, got error: Operations error (1).
[03/Jan/2008:13:15:09 +0200] - ERROR<8293> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Unable to replay update A fatal problem occurred on the consumer side : cn=dsr2.att.sch.gr:389,cn=replica,cn=dc=sch,dc=gr,cn=mapping tree,cn=config, with error 1
[03/Jan/2008:13:15:09 +0200] - WARNING<10253> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Unable to replay update Could not replay operation with csn 477028a7000000030000 to consumer dsr2.att.sch.gr:389/dc=sch,dc=gr
[03/Jan/2008:13:15:09 +0200] - WARNING<10253> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Unable to replay update Failed to replay change (uniqueid 8e15cc82-a8f811da-80bab85b-9a69096a, CSN 477028a7000000030000) to replica [dsr2.att.sch.gr:389/dc=sch,dc=gr]: Error 901.
[03/Jan/2008:13:15:09 +0200] - ERROR<8221> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Failed and requires administrator action [dsr2.att.sch.gr:389]
[03/Jan/2008:13:15:09 +0200] - Listening on all interfaces port 389 for LDAP requests
[03/Jan/2008:13:15:09 +0200] - Listening on all interfaces port 686 for LDAPS requests
[03/Jan/2008:13:15:09 +0200] - slapd started.
[03/Jan/2008:13:15:09 +0200] - INFO: 183295 entries in the directory database.
[03/Jan/2008:13:15:09 +0200] - INFO: add:0, modify:0, modrdn:0, search:0, delete:0, compare:0, bind:0 since startup.
[03/Jan/2008:13:15:56 +0200] - Sun-Java(tm)-System-Directory/6.2 B2007.226.2248 (64-bit) starting up
[03/Jan/2008:13:15:56 +0200] - WARNING<20488> - Backend Database - conn=-1 op=-1 msgId=-1 - Detected Disorderly Shutdown last time Directory Server was running, recovering database

7/1

Installed the hot fix for the bug pointed out by Ludo. After a few runs of dsadm repack on the master server, I managed to get it to a point where it would complain about the database being messed up instead of just dying on start-up. I had to recover the databases from backups to get the masters running, and then initialized all the replicas again. Funnily enough, I tried to initialize the servers by importing only the root DN, but even then the server would first try to recover the database before initializing it. Strange behaviour; I would assume that since I'm doing an init, it would discard everything it had in the database. Let's see how it goes this time.
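
For completeness, the recovery went roughly like this. The instance path and archive directory are placeholders, the backup is assumed to have been taken earlier with dsadm backup, and the exact arguments are from memory:

    # try to compact/repair the database files in place
    dsadm repack /var/opt/ds/instance1 dc=sch,dc=gr
    # when that isn't enough, restore a binary backup and restart
    dsadm stop /var/opt/ds/instance1
    dsadm restore /var/opt/ds/instance1 /backups/instance1-latest
    dsadm start /var/opt/ds/instance1
    # finally, push a fresh initialization to every replica
    dsconf init-repl -h master1.sch.gr -p 389 dc=sch,dc=gr dsr1.att.sch.gr:389
    dsconf init-repl -h master1.sch.gr -p 389 dc=sch,dc=gr dsr2.att.sch.gr:389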
