We've set up a DS 6.X farm for the Greek School Network. It includes two datacenters (one for northern and one for southern Greece), each with one write master and two read-only replicas. The write masters are in a multi-master topology, while the read-only replicas all get updated from both masters. The idea is to be able to lose one datacenter without a complete LDAP service failure. We've been playing with the servers for about a month now and my impressions so far are:
- The web console is far superior to the previous Java console: much faster, easily accessible from just a web browser and nicely laid out.
- The installation is much harder now and requires far too many commands to get things going. Instead of just a zip with an installer, you now have to install the directory server, play with cacao and dscc (Directory Service Control Center), add the console to the application server and learn a lot of <whatever>adm commands. We also found the installation guide to be quite lacking and a bit unclear on a few points.
- Security has to be tighter with the new server, since you now have to worry about a dozen ports instead of just the LDAP(S) and console ports: a few ports for cacao, two ports (HTTP/HTTPS) for the console, and the usual directory server ports.
- The documentation quality is a bit degraded compared to previous products. It seems like documentation writing is starting to be outsourced to India as well. I got the impression that the 5.X documentation was written by the engineers themselves, while the 6.X documentation was written by an outside partner.
- Until 6.1 we faced quite a lot of stability issues (the servers run Solaris on 64-bit x86). The LDAP server would crash for no apparent reason, and sometimes replication would fail and replicas would have to be reinitialized. We installed 6.2 today and so far so good, but I must say I'm rather disappointed with the software quality compared to what I had gotten used to with the 5.X versions.
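As a rough illustration of the topology described at the top of this post, here is a minimal Python sketch. The host names are invented for the example (they are not our real servers), and the model only captures who replicates to whom and what survives the loss of one datacenter.

```python
# Hypothetical model of the farm: two datacenters, each with one write
# master and two read-only replicas. Host names are made up.
TOPOLOGY = {
    "north": {"master": "north-master", "replicas": ["north-ro1", "north-ro2"]},
    "south": {"master": "south-master", "replicas": ["south-ro1", "south-ro2"]},
}


def replication_agreements(topology):
    """Return (supplier, consumer) pairs.

    Masters replicate to each other (multi-master) and every master
    feeds every read-only replica, in both datacenters.
    """
    masters = [dc["master"] for dc in topology.values()]
    replicas = [r for dc in topology.values() for r in dc["replicas"]]
    pairs = [(a, b) for a in masters for b in masters if a != b]
    pairs += [(m, r) for m in masters for r in replicas]
    return pairs


def surviving_service(topology, lost_datacenter):
    """Describe what is still available after losing one datacenter."""
    alive = {name: dc for name, dc in topology.items() if name != lost_datacenter}
    masters = [dc["master"] for dc in alive.values()]
    replicas = [r for dc in alive.values() for r in dc["replicas"]]
    return {
        "writes": bool(masters),
        "reads": bool(masters or replicas),
        "masters": masters,
        "replicas": replicas,
    }


if __name__ == "__main__":
    print(len(replication_agreements(TOPOLOGY)), "replication agreements")
    print(surviving_service(TOPOLOGY, "north"))
```

With this layout, losing either datacenter still leaves one write master and two read-only replicas, which is the whole point of the design.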
We haven't done any stress testing yet to see how the servers behave from a performance point of view. Online replica initialization (over a LAN), on the other hand, can reach speeds of at least 500 entries/sec (with a 256MB import cache), which is quite impressive. I'll try to update this post as things move on.
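To put that rate into perspective, here is a back-of-the-envelope estimate in Python. The entry count is taken from the "entries in the directory database" startup line in the master log further down in this post; since 500 entries/sec is a lower bound on the rate, the result should be read as a rough upper bound on the initialization time, not a measurement.

```python
def init_time_seconds(entries, entries_per_sec):
    """Estimate how long a full replica initialization takes."""
    if entries_per_sec <= 0:
        raise ValueError("entries_per_sec must be positive")
    return entries / entries_per_sec


ENTRIES = 183295  # entry count reported by the master at startup
RATE = 500        # entries/sec, the minimum rate we observed over a LAN

seconds = init_time_seconds(ENTRIES, RATE)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")  # 367 s (~6.1 min)
```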
Ludo posted a nice link to some filesystem tuning guidelines for DS.
Updates…
9/10
Today slapd on one of my read-only replicas would not accept replication updates from the master. The replica had died with a kernel panic a day earlier (although it did not complain about needing an fsck). It seems that the database got corrupted and I had to reinitialize it. It had been a few years since I last saw database corruption and I can't say I'm very pleased with that. Here are the error logs as well:
[09/Oct/2007:13:34:06 +0300] – INFORMATION – conn=-1 op=-1 msgId=-1 – get key @99613:=nic.mes.sch.gr failed, error -30987 DB_PAGE_NOTFOUND: Requested page not found
[09/Oct/2007:13:34:06 +0300] – DEBUG – conn=-1 op=-1 msgId=-1 – database index operation failed BAD 1130, err=-30987 DB_PAGE_NOTFOUND: Requested page not found
[09/Oct/2007:13:34:06 +0300] – DEBUG – conn=-1 op=-1 msgId=-1 – database index operation failed BAD 1140, err=-30987 DB_PAGE_NOTFOUND: Requested page not found
[09/Oct/2007:13:34:06 +0300] – DEBUG – conn=0 op=0 msgId=8 – index_add_mods failed, err=-30987 DB_PAGE_NOTFOUND: Requested page not found
[09/Oct/2007:13:34:06 +0300] – WARNING<10249> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Internal error Failed to replay LDAP operation with csn 470dd51e000000030000 for replica dc=sch,dc=gr, got error: Operations error (1).
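Since these messages tend to get buried in the error log, a small scanner can help spot them early. The following is only a sketch: the signature strings are copied from the log excerpts in this post, while the function name and the idea of feeding it the lines of an errors file are my own assumptions, not part of the product.

```python
from collections import Counter

# Substrings taken from the error log excerpts in this post.
SIGNATURES = (
    "DB_PAGE_NOTFOUND",
    "DB_RUNRECOVERY",
    "PANIC: fatal region error",
    "Detected Disorderly Shutdown",
)


def count_corruption_signatures(lines):
    """Count how many lines mention each known corruption signature."""
    counts = Counter()
    for line in lines:
        for signature in SIGNATURES:
            if signature in line:
                counts[signature] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python scan_errors.py /path/to/errors
    # (the path is whatever your instance uses for its errors log)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for signature, hits in count_corruption_signatures(log).items():
            print(f"{hits:6d}  {signature}")
```

Plain substring matching is used on purpose: the log line layout may differ between versions, but the Berkeley DB error names are what matter here.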
10/10
After rebooting one of the write masters (using the Solaris reboot command) I got a corrupt database again (first I got the message 'Detected Disorderly Shutdown last time Directory Server was running, recovering database.'). I am starting to get a little worried…
3/1
I got back from the Christmas holidays and found one datacenter in ruins. Both read-only replicas had corrupt databases. The master directory server falls to its knees every minute, just after trying to contact the corrupt replicas. As things are right now, there's no way I'm using 6.2, and I'm seriously thinking of moving back to 5.2 in order to move services to the new server farm. As an added bonus, GRNET reported that they also faced database corruption in their own in-house 6.2 installation.
Error logs:
Read-Only Replica:
[03/Jan/2008:13:04:45 +0200] – DEBUG – conn=-1 op=-1 msgId=-1 – libdb: PANIC: fatal region error detected; run recovery
[03/Jan/2008:13:04:45 +0200] – ERROR<20501> – Backend Database – conn=-1 op=-1 msgId=-1 – Serious failure in dblayer_txn_begin, err=-30974 (DB_RUNRECOVERY: Fatal error, run database recovery)
[03/Jan/2008:13:04:45 +0200] – DEBUG – conn=-1 op=-1 msgId=-1 – Modify BAD 81, err=-30974 DB_RUNRECOVERY: Fatal error, run database recovery
[03/Jan/2008:13:04:45 +0200] – WARNING<10249> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Internal error Failed to replay LDAP operation with csn 47702878000000030000 for replica dc=sch,dc=gr, got error: Operations error (1).
[03/Jan/2008:13:04:45 +0200] – ERROR<8312> – Repl. Transport – conn=-1 op=-1 msgId=-1 – Internal error [C] Replay of pending changes failed, returning 1
[03/Jan/2008:13:04:45 +0200] – ERROR<8313> – Repl. Transport – conn=57 op=4 msgId=5 – Internal error [C] Decoding of group of changes failed, returning 1
[03/Jan/2008:13:04:45 +0200] – DEBUG – conn=-1 op=-1 msgId=-1 – libdb: PANIC: fatal region error detected; run recovery
Master:
[03/Jan/2008:13:15:08 +0200] – Database recovery is 0% complete.
[03/Jan/2008:13:15:08 +0200] – Database recovery is 100% complete.
[03/Jan/2008:13:15:09 +0200] – WARNING<20805> – Backend Database – conn=-1 op=0 msgId=-1 – search is not indexed base=’dc=sch,dc=gr’ filter=’(|(objectclass=groupOfNames)(objectclass=groupOfUniqueNames))’ scope=’sub’
[03/Jan/2008:13:15:09 +0200] – WARNING<10249> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Internal error Could not replay operation with csn 47702853000000030000 to consumer dsr1.att.sch.gr:389/dc=sch,dc=gr, got error: Operations error (1).
[03/Jan/2008:13:15:09 +0200] – ERROR<8293> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Unable to replay update A fatal problem occurred on the consumer side : cn=dsr1.att.sch.gr:389,cn=replica,cn=dc=sch,dc=gr,cn=mapping tree,cn=config, with error 1
[03/Jan/2008:13:15:09 +0200] – WARNING<10253> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Unable to replay update Could not replay operation with csn 47702878000000030000 to consumer dsr1.att.sch.gr:389/dc=sch,dc=gr
[03/Jan/2008:13:15:09 +0200] – WARNING<10253> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Unable to replay update Failed to replay change (uniqueid 04719382-1dd211b2-8007c278-5e5f43e6, CSN 47702878000000030000) to replica [dsr1.att.sch.gr:389/dc=sch,dc=gr]: Error 901.
[03/Jan/2008:13:15:09 +0200] – ERROR<8221> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Failed and requires administrator action [dsr1.att.sch.gr:389]
[03/Jan/2008:13:15:09 +0200] – WARNING<10249> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Internal error Could not replay operation with csn 47702878000000030000 to consumer dsr2.att.sch.gr:389/dc=sch,dc=gr, got error: Operations error (1).
[03/Jan/2008:13:15:09 +0200] – ERROR<8293> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Unable to replay update A fatal problem occurred on the consumer side : cn=dsr2.att.sch.gr:389,cn=replica,cn=dc=sch,dc=gr,cn=mapping tree,cn=config, with error 1
[03/Jan/2008:13:15:09 +0200] – WARNING<10253> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Unable to replay update Could not replay operation with csn 477028a7000000030000 to consumer dsr2.att.sch.gr:389/dc=sch,dc=gr
[03/Jan/2008:13:15:09 +0200] – WARNING<10253> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Unable to replay update Failed to replay change (uniqueid 8e15cc82-a8f811da-80bab85b-9a69096a, CSN 477028a7000000030000) to replica [dsr2.att.sch.gr:389/dc=sch,dc=gr]: Error 901.
[03/Jan/2008:13:15:09 +0200] – ERROR<8221> – Incremental Protocol – conn=-1 op=-1 msgId=-1 – Failed and requires administrator action [dsr2.att.sch.gr:389]
[03/Jan/2008:13:15:09 +0200] – Listening on all interfaces port 389 for LDAP requests
[03/Jan/2008:13:15:09 +0200] – Listening on all interfaces port 686 for LDAPS requests
[03/Jan/2008:13:15:09 +0200] – slapd started.
[03/Jan/2008:13:15:09 +0200] – INFO: 183295 entries in the directory database.
[03/Jan/2008:13:15:09 +0200] – INFO: add:0, modify:0, modrdn:0, search:0, delete:0, compare:0, bind:0 since startup.
[03/Jan/2008:13:15:56 +0200] – Sun-Java(tm)-System-Directory/6.2 B2007.226.2248 (64-bit) starting up
[03/Jan/2008:13:15:56 +0200] – WARNING<20488> – Backend Database – conn=-1 op=-1 msgId=-1 – Detected Disorderly Shutdown last time Directory Server was running, recovering database
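When a master log looks like the one above, the first thing I want to know is which consumers need attention. A minimal sketch for pulling that out follows; the regular expression is based only on the "Failed and requires administrator action [host:port]" lines shown above, and the function name is made up for this example.

```python
import re

# Matches lines such as:
#   ... Failed and requires administrator action [dsr1.att.sch.gr:389]
ACTION_RE = re.compile(r"Failed and requires administrator action \[([^\]]+)\]")


def consumers_needing_action(lines):
    """Return the sorted, de-duplicated consumers flagged by the master."""
    found = set()
    for line in lines:
        match = ACTION_RE.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


if __name__ == "__main__":
    import sys

    # Usage: python flagged_consumers.py /path/to/errors
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for consumer in consumers_needing_action(log):
            print(consumer)
```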
7/1
Installed the hot fix for the bug pointed out by Ludo. After a few runs of dsadm repack on the master server I managed to get it to a point where it would complain about the database being messed up instead of just dying on start-up. I had to recover the databases from backups in order to get them running, and then initialized all replicas again. It's funny: I tried to initialize the servers by importing only the root DN, but even then the server would first try to recover the database before initializing it. Strange behaviour. I would assume that since I'm doing an init, it would discard everything it had in the database. Let's see how it goes this time.




7 comments
October 9, 2007 at 9:01 am
Karl H. Beckers
Hmm,
I don't see how the installation is harder. There is still a zip-packaged offering if you prefer that. And that can be easily installed (even on an unsupported platform like Ubuntu) in 6 easy steps: http://charly4711.wordpress.com/2007/09/18/directory-server-enterprise-edition-dsee-62-on-ubuntu/
About security issues around ports, I also don’t feel like the situation has degraded in any way. In previous versions you had the admin server with similar issues, now it’s the cacao agent.
Karl
October 9, 2007 at 9:35 am
kkalev
I installed using the zip package. And I'm sorry, but I personally find it easier to just run the installer and answer its questions (like in 5.2) and then just launch the Java console, instead of running dsee_deploy and numerous *adm commands and deploying the dscc with the Sun console. Maybe it's just a matter of personal taste; that's my taste anyway.
I am not trying to say that 6.X poses security risks. I am just stating the fact that we needed to block more than a few ports in order to stay safe from outside threats.
October 20, 2007 at 12:53 am
wyang
6.2 has an installer like 5.2 as well. You need to download JES 5 update 1 to use that installer (you might also just be able to download the identity management subset only). Also, the interactive installer uses sun web console instead of requiring sun app server (which I think the ZIP dist requires, but I may be wrong). However, the JES5u1 dist doesn’t include identity sync or directory editor components.
November 14, 2007 at 6:49 pm
eldapo - Sun JSDS 6.1 Review
[...] a short review of Sun’s Java Systems Directory Server 6.1 by another working directory admin here. In it Kostas Kalevras talks about his recent experience deploying the product in his org’s [...]
November 19, 2007 at 6:52 pm
Quanah
Interesting, I used to get those same corruption errors when Stanford had Netscape DS servers. And those were back from 1996 or so. Sad to see that the underlying database reliability problems of the Netscape/Fedora/Sun DS products still exist. That routine data corruption was part of what spurred Stanford to move to something with reliability (i.e., OpenLDAP).
January 3, 2008 at 11:35 am
Ludovic Poitou
Directory Server 6.2 makes use of an updated version of the Sleepycat Berkeley DB that contains some specific performance improvements. It turns out that these changes introduced a bug where some pages are incorrectly zeroed, resulting in a database panic.
The corresponding defect for the Sun Directory Server is 6642430: DB corruption (zero’d pages) when performing db2ldif against large 20GB ldif file
The defect has been fixed and will be integrated in DS 6.3 available in a few months. Meanwhile a hot-fix is available from the Sun Support organization. To get it, open a support call.
April 3, 2008 at 4:49 pm
EHF
How have things gone since the recommended hot fix was installed?
I’m running into the same issues with v6.2.
Things run OK and no DB corruptions occur if I *do not* have multi-master replication enabled. For now, I kept one master and demoted all my other directory servers to consumers. That seems to have kept things stable… but I'm still keeping my fingers crossed.