Backup failure on CCMDB

This is one of the most common backup issues: the backup gets stuck on the CCMDB component for an hour or so, and then suddenly fails with an error message.

Where to look for the error-
It is good to start in RTMT: check the recent alerts, or the application logs in SysLog Viewer, for the error message. You will see a message like the one below-

Error message: CiscoDRFFailure Reason : DRF was unable to backup component CCMDB.Error : Please check the component logs for further details.

Why the backup was stuck at CCMDB-
We get this error mainly because of accumulated CDR entries, which make the database too large to back up. If you take a look at the CCMDB component logs, you will see messages similar to the ones below-

/var/log/active/cm/cdr_repository/preserve/20091230/cmr_Cluster-C_03_200912300317_1559
/var/log/active/cm/cdr_repository/preserve/20091230/cdr_Cluster-C_09_200912300317_1562
/bin/tar: -: Wrote only 4096 of 10240 bytes
/bin/tar: Error is not recoverable: exiting now
CCMDB Backup failed, unable to tar data to master agent
Restoring CAR services...


Here, we might be hitting a classic bug where DRS gets stuck because of the size of the Preserve folder:
DRS stuck in CCMDB backup if many CDR files in Preserve folder

Symptom:
CallManager 5.1.3 is able to back up the components to the SFTP server successfully at first, but as the system keeps running and CDR records build up, the backup starts to fail at some stage.

Condition:
A huge set of CDR/CMR files may accumulate in the Preserve folders (e.g. 110,000 files), which produces a backup.log file that exceeds 1 MB (e.g. 20 MB).

Typical Scenario:
The customer does not want to use CAR and uses a third-party billing server instead, so the CAR Loader is disabled. In CallManager 5.1.3, this can result in a large number of CDR/CMR files accumulating in the Preserve folder.
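You can confirm this condition with standard shell tools before touching anything. A minimal sketch (the paths match the ones in the component log above; adjust them for your system):

```shell
# check_cdr_backlog PRESERVE_DIR DRF_LOG
# Prints how many flat CDR/CMR files sit under PRESERVE_DIR and warns
# when backup.log has grown past 1 MB, the size at which the defect
# described above is triggered.
check_cdr_backlog() {
    preserve_dir="$1"
    drf_log="$2"

    file_count=$(find "$preserve_dir" -type f 2>/dev/null | wc -l | tr -d ' ')
    echo "preserve files: $file_count"

    if [ -f "$drf_log" ]; then
        log_kb=$(du -k "$drf_log" | cut -f1)
        echo "backup.log size: ${log_kb} KB"
        if [ "$log_kb" -gt 1024 ]; then
            echo "WARNING: backup.log exceeds 1 MB"
        fi
    fi
    return 0
}

# Paths as they appear in the component log above:
check_cdr_backlog /var/log/active/cm/cdr_repository/preserve /common/drf/backup.log
```

If the file count is in the tens of thousands and backup.log is well over 1 MB, you are in the scenario this defect describes.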

Workaround:

In 5.1.3 and later, stopping the CAR service alone does not stop flat CDR files from accumulating.

To clean up the CDR files temporarily so that DRS can proceed:

1. Stop the CDR Agent service on all servers in the cluster, so that no new CDR files are pushed to the publisher.

2. Check that all the files have been pushed to the billing server(s) by running the following command

ls -R /var/log/active/cm/cdr_repository/destination*

and making sure there is no symbolic link in any of the subfolders.
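Rather than eyeballing the `ls -R` output, you can have `find` report leftover symbolic links directly. A small sketch:

```shell
# find_leftover_links DIR...
# Lists any symbolic links still present under the given destination
# folders. Empty output means every CDR/CMR file has been delivered to
# the billing server(s) and the cleanup can proceed safely.
find_leftover_links() {
    find "$@" -type l 2>/dev/null
    return 0
}

# Same destination tree the ls -R above walks:
find_leftover_links /var/log/active/cm/cdr_repository/destination*
```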

3. Stop the CDR Repository Manager, CAR Scheduler, and CAR Web Service on the publisher.

4. Remove all the accumulated files under /var/log/active/cm/cdr_repository/preserve/<date> and all the symbolic links under /var/log/active/cm/cdr_repository/car/<date> with the following commands:

rm -rf /var/log/active/cm/cdr_repository/preserve/*
rm -rf /var/log/active/cm/cdr_repository/car/*


5. Restart the CDR Repository Manager, CAR Scheduler, and CAR Web Service on the publisher.
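On CUCM you normally only have the admin CLI, not a root shell, so in practice the steps above map onto `utils service` and `file delete` commands. A sketch of the equivalent session; the exact service names and the activelog paths are my assumptions (verify the names with `utils service list` before running anything):

```
admin:utils service stop Cisco CDR Agent
admin:utils service stop Cisco CDR Repository Manager
admin:utils service stop Cisco CAR Scheduler
admin:utils service stop Cisco CAR Web Service
admin:file delete activelog cm/cdr_repository/preserve/*
admin:file delete activelog cm/cdr_repository/car/*
admin:utils service start Cisco CDR Repository Manager
admin:utils service start Cisco CAR Scheduler
admin:utils service start Cisco CAR Web Service
```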

To stop further accumulation of CDR files, start the CAR Scheduler service and set the loader to load continuously, CDR only:

1. Create a CAR admin account in user group management on the CCMAdmin page, if one does not exist yet.
2. Log in to CAR and go to System->Scheduler->CDR Load.
3. Check the "Continuous Loading 24/7" and "Load CDR only" boxes.
4. Click "Update".
5. Go to System->Database->Configure Automatic Database Purge.
6. Set both "Min Age of Call Detail Records" and "Max Age of Call Detail Records" to 1.
7. Click "Update".
8. Go to Report Config->Automatic Generation/Alert.
9. For each report, select the "Disabled" status and click "Update".

Finally, restart the CDR Agent service on all the servers in the cluster.

Workaround for DRS:
1. Restart the DRS Master Agent and Local Agent.
2. Delete backup.log from /common/drf.
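For step 2, rather than deleting backup.log unconditionally, you can remove it only when it has actually grown past the size that trips the defect. A sketch (the /common/drf path is the one from the workaround above):

```shell
# reset_backup_log LOGFILE
# Deletes backup.log only when it has grown past 1 MB, leaving a
# healthy log in place for later troubleshooting.
reset_backup_log() {
    logfile="$1"
    if [ ! -f "$logfile" ]; then
        echo "no backup.log found at $logfile"
        return 0
    fi
    log_kb=$(du -k "$logfile" | cut -f1)
    if [ "$log_kb" -gt 1024 ]; then
        rm -f "$logfile"
        echo "removed oversized backup.log (${log_kb} KB)"
    else
        echo "backup.log is only ${log_kb} KB; leaving it in place"
    fi
    return 0
}

reset_backup_log /common/drf/backup.log
```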


For 7.x and later releases, this error is mainly due to an SFTP limitation: some SFTP servers cannot accept files larger than 2 GB, while others cannot handle heavy data transfers.
You can confirm that it is an SFTP issue by looking at the DRS MA/LA logs. The snippet below shows the SFTP failure:

2010-01-02 02:07:06,884 INFO [drfJobThread] -
drfJobThread:executeDirectSftpCommand: Done Executing the script.
Expected value is 0 for success. Command returned with a value 1
2010-01-02 02:07:06,884 DEBUG [drfJobThread] -
drfJobThread:executeDirectSftpCommand: Calling os.close()
2010-01-02 02:07:06,885 ERROR [drfJobThread] -
drfJobThread:executeDirectSftpCommand: Caught
com.maverick.ssh.SshIOException
2010-01-02 02:07:06,885 ERROR [drfJobThread] -
drfJobThread:executeDirectSftpCommand: SshIOException :
null.Message:Unexpected termination: Socket closed [Unknown cause]
2010-01-02 02:07:06,885 ERROR [drfJobThread] -
drfJobThread:executeDirectSftpCommand: SshIOException Reason Code : 1
2010-01-02 02:07:06,886 DEBUG [drfJobThread] -
drfJobThread:closeSFTPConnections: Closing SFTP Client...
2010-01-02 02:07:06,886 DEBUG [drfJobThread] -
drfJobThread:closeSFTPConnections: Closing Ssh2 Client...
2010-01-02 02:07:06,886 DEBUG [drfJobThread] -
drfJobThread:closeSFTPConnections: Closing Ssh Client...

In such a scenario, it is advisable to use a Linux-based SFTP server.
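Before switching servers, it is worth checking whether the backup set actually crosses the 2 GB mark. A sketch, under the assumption that the staged DRF archives are plain `.tar` files you can point the function at:

```shell
# warn_if_over_sftp_limit DIR LIMIT_MB
# Flags any backup archive under DIR larger than LIMIT_MB, the point at
# which some SFTP servers start dropping the transfer.
warn_if_over_sftp_limit() {
    dir="$1"
    limit_mb="${2:-2048}"   # 2 GB default, per the limitation above
    find "$dir" -name '*.tar' -type f 2>/dev/null | while read -r f; do
        size_mb=$(du -m "$f" | cut -f1)
        if [ "$size_mb" -gt "$limit_mb" ]; then
            echo "OVER LIMIT: $f (${size_mb} MB)"
        fi
    done
    return 0
}
```

If any archive is flagged, the SFTP server's file-size limit is a likely culprit even before you read the MA/LA logs.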

Comments

  1. Having the same problem in a totally different application that uses the com.maverick.ssh package. In my case I found it was a timeout set to a low value.
    Hope it helps.
