EMC Avamar SSL Cert Generation

After completing a successful root-to-root Avamar migration, I noticed that the old SSL certs were still being used. After some digging I finally found a very simple command to update them.

The gen-ssl-cert command installs a temporary Apache web server SSL cert and restarts the web server.

gen-ssl-cert [--debug] [--help] [--verbose]

Note: You must run gen-ssl-cert as root. The original files are backed up and saved as the following (a rollback example follows this list):
• /etc/httpd/conf/ssl.crt/server.crt.orig
• /etc/httpd/conf/ssl.key/server.key.orig
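If you ever need to fall back to the previous certificate, a minimal rollback sketch is to copy the .orig files back over the live ones and restart Apache. This is my own assumption rather than an official procedure, and the paths and restart command vary by build (SLES-based grids keep the live files under /etc/apache2/ssl.crt and /etc/apache2/ssl.key and use rcapache2; older builds use /etc/httpd/conf and service httpd):

cp /etc/httpd/conf/ssl.crt/server.crt.orig /etc/httpd/conf/ssl.crt/server.crt    # adjust paths to your build
cp /etc/httpd/conf/ssl.key/server.key.orig /etc/httpd/conf/ssl.key/server.key
service httpd restart    # or: rcapache2 restart on SLES-based nodes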

In order to view your current certificate you can use the following command:

root@avamarnew:/etc/apache2/ssl.crt/#: openssl x509 -noout -text -in server.crt
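If you only care about the validity window or the subject rather than the full dump, openssl can print just those fields:

root@avamarnew:/etc/apache2/ssl.crt/#: openssl x509 -noout -dates -in server.crt
root@avamarnew:/etc/apache2/ssl.crt/#: openssl x509 -noout -subject -in server.crt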

Here is a sample of what running the script looks like:

root@avamarnew:/srv/www/#: gen-ssl-cert
Generating RSA private key, 3072 bit long modulus
.........................................................................................++
...................................................................................................................................................++
e is 65537 (0x10001)
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:
State or Province Name (full name) [Some-State]:
Locality Name (eg, city) []:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (eg, YOUR name) []:
Email Address []:
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
Signature ok
subject=/C=US/ST=SomeState/L=SomeLocale/O=SomeOrganization/OU=SomeOrganizationalUnit/CN=avamarhq.sccu.local/emailAddress=root
Getting Private key
gen-ssl-cert: INFO: installed these web server SSL temporary certificate files:
-rw------- 1 root root 1708 Mar 2 12:29 /etc/apache2/ssl.crt/server.crt
-rw------- 1 root root 2455 Mar 2 12:29 /etc/apache2/ssl.key/server.key

Checking for httpd2: running
Shutting down httpd2 (waiting for all children to terminate) done
Starting httpd2 (prefork)

Avamar Backup of a Windows VM fails with the error: Protocol error from VMX

Recently, on a new install of Avamar version 6.1, I had a VMware image-based backup fail with error 10007. Upon further investigation of the backup job log, I noticed that the snapshot failed with: A general system error occurred: Protocol error from VMX. This is a good example of two very vague errors on both the backup system and the virtual infrastructure, so to investigate further we needed to look deeper into the vmware.log file…
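As a rough starting point, the vmware.log sits in the VM's folder on its datastore; from an ESXi shell you can pull out the snapshot-related lines with something like the following (the datastore and VM folder names are placeholders for your own):

grep -iE "snapshot|error|vmx" /vmfs/volumes/<datastore>/<vm-folder>/vmware.log | tail -n 50    # <datastore> and <vm-folder> are placeholders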

Avamar 6.1 Unified Proxy Appliance for VMware

I’m going back over some of the new differences in Avamar 6.1 and am very impressed with the enhancements that Avamar now has for VMware image-based backups. Before version 6.1 you needed a separate image proxy for Linux and a separate proxy for Windows; with the new proxy design both operating systems have been integrated into one proxy. Not only does the proxy support both OSes, but it now also supports File Level Recovery to both, whereas previously only Windows was supported. To be more specific, the Unified Proxy now supports Windows NTFS, and Linux ext2, ext3, and LVM. The proxy does NOT support the following: Windows GPT Partitions, Windows Dynamic disks, Extended Partitions, Encrypted Partitions, Compressed Partitions, or XFS.

Warning 6698 VSS exception code 0x800706be thrown freezing volumes – The remote procedure call failed

When you try to create shadow copies on large volumes that have a small cluster size (less than 4 kilobytes), or if you take snapshots of several very large volumes at the same time, the VSS software provider may use a larger paged pool memory allocation during the shadow copy creation than is required. If there is not sufficient paged pool memory available for the allocation, the shadow copy cannot complete and may cause the loss of all previous shadow copy tasks. Follow this article and apply the fix:

http://support.microsoft.com/kb/833167
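Before and after applying the hotfix it is also worth confirming the state of the VSS writers and shadow storage on the affected Windows server. These are standard Windows commands rather than anything Avamar-specific:

vssadmin list writers
vssadmin list shadowstorage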

Troubleshooting: A checkpoint validation (hfscheck) of server checkpoint data is overdue.

Event ID: 114113

Description:
Checkpoint validations (hfschecks) of server checkpoints are performed to ensure that checkpoints are valid for disaster recovery needs. A regularly scheduled hfscheck did not take place as scheduled. If hfschecks are not being performed, disaster recovery may not be possible.

Remedy:
Check to make sure that checkpoints are configured, and enable them if they are not. Check to make sure checkpoint validation is configured, and enable it if it is not. If checkpoints and validation are enabled and scheduled, and either no checkpoint was taken or no validation occurred, contact your support center.
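A quick sanity check, assuming you are in an admin shell with the SSH keys loaded (as in the shutdown procedure further down), is to list the checkpoints and the overall status; cplist shows each checkpoint and whether it has been validated, and dpnctl status shows whether maintenance operations are suspended:

cplist
dpnctl status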

Avamar Backup Job fails with error code 10007

I noticed a recent NDMP backup failed with error code 10007; here is the job log information:

2010-12-01 08:00:45 avndmp Error : Snapup of "ndmp-volume-name" aborted due to 'Error during NDMP session'.
2010-12-01 08:00:45 avndmp Info : NDMP session result: avtar returned:176 'Fatal signal' ndmp returned:157 'Miscellaneous error'
2010-12-01 08:00:45 avndmp Info : Final summary generated subwork 1, cancelled/aborted 1, snapview 0, exitcode 157
2010-12-01 08:00:45 avndmp FATAL : Fatal signal 11 in pid 21946
2010/12/01-08:00:45.23164 [avndmp_ctl_sup] FATAL ERROR: Fatal signal 11

I’m still investigating and will update the post when I find out the root cause.

How to shutdown Avamar

The following is the procedure to shut down the Avamar GSAN:
1. Log on to the system as user admin.

2. Load the ssh keys
ssh-agent bash
ssh-add .ssh/admin_key

3. Verify hfscheck and garbage collect are not running.
ps -eaf|egrep "gc_cron|cp_cron|hfscheck_cron"

If hfscheck is still running, run “hfscheck_kill” as user admin to kill it off.
If GC is still running, you will need to let it finish before continuing.
If CP is running, you will need to let it finish running.

4. Take a checkpoint (as dpn)
su - dpn
ssh-agent bash
ssh-add .ssh/dpnid
cp_cron --duplog
exit
exit (Note: you should now be back to user admin)

5. Stop the EMS and MCS

suspend_crons

dpnctl stop ems

dpnctl stop mcs

6. Stop the GSAN
shutdown.dpn

7. Verify Avamar is shut down

dpnctl status

The following output shows that Avamar is down:

dpnctl: INFO: gsan status: down
dpnctl: INFO: MCS status: down.
dpnctl: INFO: EMS status: down.
dpnctl: INFO: Scheduler status: down.
dpnctl: INFO: Maintenance operations status: suspended.
dpnctl: INFO: Unattended startup status: disabled.
dpnctl: INFO: [see log file "/usr/local/avamar/var/log/dpnctl.log"]

Now you can safely power off the hardware.
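For completeness, bringing the grid back up once the hardware is powered on is roughly the reverse, again from an admin session with the keys loaded. This is only a sketch, and the resume_crons name is my assumption as the counterpart of suspend_crons, so check dpnctl.log and your support documentation if anything stays down:

dpnctl start
resume_crons    # assumption: counterpart of suspend_crons above; name may differ by release
dpnctl status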

How to check node capacity across an EMC Avamar grid

I recently came across an issue where the Avamar garbage collect was not running. When I ran status.dpn on the grid, I got the following message for the garbage collect status:

Last GC: finished Wed Aug 25 01:01:04 2010 after 00m 50s >> recovered 0.00 KB (MSG_ERR_DISKFULL)

The total grid utilization is currently at 87%, and I also saw some unacknowledged events with the following information:

Code: 4202 Message: failed garbage collection with error MSG_ERR_DISKFULL

This error is a direct result of the garbage collect run limit being reached or exceeded due to excessive checkpoint overhead. To verify and check all of the node capacities, use the following commands (if this is a single node you will not need the mapall command):

su - admin
ssh-agent bash
ssh-add ~admin/.ssh/admin_key

Enter the passphrase for the admin keys. (If you don't know what it is, then you should not be doing this.) Then run:

mapall --noerror 'df -h'

This should give you the filesystem for each node, including the size, space used, and space available. Then run:

avmaint nodelist | grep percent-full

This will give you a cleaner output of the numbers that really matter. Pay attention to each node's "abs-percent-full".

In most cases you should contact EMC support to resolve this issue; however, in some cases running an HFS check or a checkpoint validation on your oldest checkpoint might free up enough overhead to get you back on track.
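After running an HFS check or validating the oldest checkpoint, re-run status.dpn and confirm that the Last GC line no longer reports MSG_ERR_DISKFULL (this just filters the same output shown above):

status.dpn | grep -i "last gc"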