Showing posts with label technology. Show all posts

Saturday, April 13, 2013

Backing up my HostGator Account using Rsync

Once you exceed a certain amount of disk space usage on HostGator, they will stop backing up your website. So if you want to ensure that a failure on their end doesn't take your site down permanently, you'll want to start backing it up somewhere else.

I don't want to spend any more money on a hosted backup solution, so I'll be making the backup on my local machine.

 


Install cygwin

When selecting packages, search for "ssh".
Under the "Net" category, select the openssh and libssh packages (you don't need any "sources" or "development" packages).

Then do another search for "rsync", and make sure it's selected.

Then click through the installation steps to install Cygwin with those packages.


Setup SSH on HostGator
If you have a shared account (i.e. the cheapest plan), SSH access will be disabled by default. You'll have to open a support case with them in cPanel. I recommend using the Live Chat Support system to get the fastest turnaround time.


Setting up passwordless SSH
After support enables SSH for you, you'll want to be able to log in without having to type in a password.
First fire up your cygwin terminal and enter the following command:

ssh-keygen -t dsa -b 1024
 
Keep pressing enter to get through all the prompts (you don't want to add a passphrase). This will create your keys in your <cygwin>/home/<username>/.ssh directory. The program will tell you which file is your private "identification" key and which is your public key. Remember where these are.

Next we need to add our public key to your account on HostGator.
Again from your cygwin terminal run:

ssh -p 2222 <username>@<hostname>

You will be prompted to log in. Log in and you will be inside your HostGator account.
Then cd to the .ssh/ directory.
Here you will want to add the public key that you generated earlier. To do that, first open your public key file in a text editor and copy the entire line (Ctrl+C). Make sure you get the entire line; it may look like multiple lines if you have word wrap turned on.
Next run these commands in the cygwin terminal:

chmod +w authorized_keys2
vi authorized_keys2

The first command makes the file writable, and the second opens it in the vi text editor.
Next, press 'i', which puts the vi editor into "insert" mode.
Use the arrow keys and the "End" key to get to the end of the last line. Press enter to create a new line at the very end of the file.
With the text cursor on the empty last line, right-click with your mouse to paste the public key into the file.
Then press 'Esc', ':', 'w', 'q', 'Enter' (ignore the quotes and commas). 'Esc' leaves "insert" mode, ':' switches to command mode, 'w' tells vi to write the file, and 'q' tells it to quit.
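
If pasting through vi feels fiddly, another option is to copy the public key file up to the server and append it with a small shell helper. This is only a sketch under my own assumptions: the function name append_key_once and the file names in the usage comment are placeholders I made up, not anything HostGator provides.

```shell
# append_key_once PUBKEY_FILE AUTH_FILE
# Appends the key stored in PUBKEY_FILE to AUTH_FILE unless that exact
# line is already present. Hypothetical helper, meant to run on the server.
append_key_once() {
    pubkey_file=$1
    auth_file=$2

    if [ ! -r "$pubkey_file" ]; then
        echo "cannot read $pubkey_file" >&2
        return 1
    fi
    key=$(cat "$pubkey_file")

    # Create the target file if it does not exist yet.
    touch "$auth_file" || return 1

    # -x matches whole lines only, -F treats the key as a plain string.
    if ! grep -qxF -- "$key" "$auth_file"; then
        # If the file is non-empty and its last line lacks a newline,
        # add one so the new key starts on a line of its own.
        if [ -s "$auth_file" ] && [ -n "$(tail -c 1 "$auth_file")" ]; then
            printf '\n' >> "$auth_file"
        fi
        printf '%s\n' "$key" >> "$auth_file"
    fi

    # Keep the file readable and writable by its owner only.
    chmod 600 "$auth_file"
}

# Example (assumes the key was uploaded to the server as ~/id_dsa.pub):
#   append_key_once ~/id_dsa.pub ~/.ssh/authorized_keys2
```

Running it twice with the same key leaves a single copy in the file, so it is safe to re-run.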

Now we want to test our passwordless ssh so enter:

exit

This should disconnect you from HostGator.
Next run:

ssh -p 2222 <username>@<hostname>

and this time, it shouldn't prompt you for a password. If that's the case, enter

exit

again to disconnect. Otherwise, you'll have to go back over your steps to figure out what went wrong.
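
Once the manual test passes, the same check can be scripted, which comes in handy before handing the backup over to a scheduler. A minimal sketch, assuming the host key is already in your known_hosts file from the manual login above; check_key_login is just a name I picked. The BatchMode=yes option tells ssh never to prompt, so a missing or rejected key makes the command fail instead of sitting at a password prompt.

```shell
# check_key_login USER@HOST [PORT]
# Reports whether key-based login works without any prompt.
# Hypothetical helper; the port defaults to 2222 as used in this post.
check_key_login() {
    target=$1
    port=${2:-2222}

    # BatchMode=yes disables password prompts, so ssh fails fast when
    # the key is not accepted. "true" is a no-op command run remotely.
    if ssh -o BatchMode=yes -o ConnectTimeout=10 -p "$port" "$target" true; then
        echo "key login works"
    else
        echo "key login failed"
        return 1
    fi
}

# Example (replace with your own account and host):
#   check_key_login myuser@example.com 2222
```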

Testing Rsync
Now, I'd recommend you start rsync on a small test directory just to make sure all the settings are right. I used the "tmp" dir on HostGator in my example, but you can use whatever you like as long as it's not important. You also need a folder on your local machine to serve as the backup location.

Again from your cygwin terminal, run:

rsync -avz --rsh='ssh -p2222' <username>@<hostname>:/home/<username>/tmp /cygdrive/c/<backup dir>

This runs rsync over an ssh connection on port 2222. The /home/<username>/tmp is the directory on HostGator you want to back up. The /cygdrive/c/ is cygwin's way of saying "C:\".
Once this finishes, look in your backup directory to see if you're satisfied with the results. If you are, you can change the directories to match what you really want to back up.
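
To make the /cygdrive mapping concrete, here is a tiny sketch that converts a Windows-style path into the form rsync expects. Cygwin ships a real tool for this (cygpath -u), so the function below, win_to_cygdrive, is only a made-up illustration of the rule and handles nothing beyond simple drive-letter paths.

```shell
# win_to_cygdrive 'C:\some\dir'
# Prints the /cygdrive form of a simple drive-letter path.
# Hypothetical helper; on a real Cygwin install, cygpath -u does this job.
win_to_cygdrive() {
    path=$1
    # The first character is the drive letter; cygwin uses it in lower case.
    drive=$(printf '%s' "$path" | cut -c1 | tr 'A-Z' 'a-z')
    # Skip the "C:" prefix and turn backslashes into forward slashes.
    rest=$(printf '%s' "$path" | cut -c3- | tr '\\' '/')
    printf '/cygdrive/%s%s\n' "$drive" "$rest"
}

# Example:
#   win_to_cygdrive 'C:\backups\site'
```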

Automating Rsync
Since we're on Windows, we can use the Task Scheduler to run rsync.
Define a new task with the following settings:

Executable: <cygwin>\bin\bash.exe
Arguments: -l -c "rsync -avz --rsh='ssh -p2222' <username>@<hostname>:/home/<username>/tmp /cygdrive/c/<backup dir> > /cygdrive/c/<log directory>/rsync.log 2>&1"

Set the schedule to whatever backup schedule you want. Make sure that you select the option that prevents the job from starting if the previous job is still running. The <log directory>/rsync.log file will contain the output of the most recent run (the > redirect overwrites it each time), so you can always come back to view it if there are problems.

Once you save the task, you can manually run it in the task scheduler. I would go ahead and do that now and examine the log file to see if everything's okay.
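
If you'd rather keep the logic in a script than in one long Arguments line, something like the following could be saved in a file and called from bash -l -c instead. Treat it as a sketch under my own assumptions: the function name run_backup, the lock directory next to the log file, and the choice to append to the log with timestamps are all mine. The lock is a second guard against overlapping runs, on top of the Task Scheduler setting.

```shell
# run_backup SRC DEST LOG
# Runs one rsync backup, appending timestamped output to LOG.
# Skips the run if a lock directory from another run still exists.
# Hypothetical helper; SRC is something like user@host:/home/user/tmp.
run_backup() {
    src=$1
    dest=$2
    log=$3
    lock="$log.lock"

    # mkdir either creates the lock or fails in one step, so two runs
    # cannot both believe they hold it.
    if ! mkdir "$lock" 2>/dev/null; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') skipped: lock $lock exists" >> "$log"
        return 1
    fi

    echo "$(date '+%Y-%m-%d %H:%M:%S') starting backup of $src" >> "$log"
    status=0
    rsync -avz --rsh='ssh -p2222' "$src" "$dest" >> "$log" 2>&1 || status=$?
    echo "$(date '+%Y-%m-%d %H:%M:%S') rsync exit status $status" >> "$log"

    # Release the lock so the next scheduled run can start.
    rmdir "$lock"
    return "$status"
}

# Example (placeholders, adjust to your own account and paths):
#   run_backup myuser@example.com:/home/myuser/tmp /cygdrive/c/backup /cygdrive/c/logs/rsync.log
```

If a run is killed halfway, the lock directory stays behind and has to be removed by hand before the next run will start.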

Optionally, add the --delete flag
Now that everything's working automatically, you may want to add the --delete flag to the rsync command to delete files from your backup that were deleted from the website. This keeps the backup in sync after you delete something from HostGator.
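
Since --delete removes files from the backup, it can be worth previewing what it would do first. rsync's --dry-run flag reports changes without making them, and in verbose mode rsync normally prefixes removals with "deleting". The helper below is a sketch with a made-up name, preview_deletions, that filters the report down to those lines.

```shell
# preview_deletions SRC DEST
# Lists what rsync --delete would remove from DEST, without changing anything.
# Hypothetical helper; SRC is something like user@host:/home/user/tmp.
preview_deletions() {
    # --dry-run makes rsync report what it would do without doing it.
    out=$(rsync -avz --dry-run --delete --rsh='ssh -p2222' "$1" "$2") || return 1
    # Keep only the removal lines; print a note if there are none.
    printf '%s\n' "$out" | grep '^deleting ' || echo "nothing would be deleted"
}

# Example (placeholders, adjust to your own account and paths):
#   preview_deletions myuser@example.com:/home/myuser/tmp /cygdrive/c/backup
```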

Update - Changing Host Key

Once in a while HostGator may update the hardware your site is on, and that may cause a change in the IP address assigned to your domain. This causes the ssh calls through cygwin from your local machine to fail. You will get a:
 
WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!

message in your rsync logs when this happens. The easy fix is to delete the lines corresponding to the old IP address/hostname from your known_hosts file (<cygwin>/home/<username>/.ssh/known_hosts). Then from the cygwin terminal, do a:

ssh -p2222 <username>@<hostname>

and log in to your account manually. This will repopulate the known_hosts file with the new key.
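
Instead of editing known_hosts by hand, ssh-keygen -R can remove the stale entry for you. One wrinkle: for a non-default port, OpenSSH stores the host as [hostname]:port rather than just the hostname. Here is a small sketch of that; the function names known_hosts_entry and forget_host are my own, and if your known_hosts also has an entry under the old IP address, that one would need removing the same way.

```shell
# known_hosts_entry HOST PORT
# Prints the name under which OpenSSH files HOST in known_hosts:
# plain HOST for port 22, [HOST]:PORT for any other port.
known_hosts_entry() {
    if [ "$2" = "22" ]; then
        printf '%s\n' "$1"
    else
        printf '[%s]:%s\n' "$1" "$2"
    fi
}

# forget_host HOST PORT
# Removes the stored key for HOST:PORT from ~/.ssh/known_hosts.
# Hypothetical wrapper around the real "ssh-keygen -R" option.
forget_host() {
    ssh-keygen -R "$(known_hosts_entry "$1" "$2")"
}

# Example (replace with your own hostname):
#   forget_host example.com 2222
```

After that, the manual ssh login adds the new key as usual.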

Friday, September 16, 2011

Hadoop for Everyone

Apache Hadoop is an open source platform for running really huge jobs on a whole cluster of machines. Some of the most interesting problems in the world can only be practically solved using the power that Hadoop can harness. The biggest problem for developers like me is that we don't have the time, space, or money to install a cluster of hundreds or even thousands of machines. Even if I did, it would be a big pain to maintain all those machines. And even though I work for a big company with all the resources to build a Hadoop cluster, asking for 100 machines so I can "just try something" is never gonna fly.

One of the major pain points of Hadoop is the fact that not all machines are treated the same. Some machines have to function in roles like NameNode, which requires large amounts of RAM and some level of high availability/redundancy. On the other hand, the vast majority of machines can function as workers and can be simple commodity machines for cost savings. This means any efficient Hadoop cluster is going to be a heterogeneous environment, which further increases maintenance costs.

So what can we do? What we need is an on-demand Hadoop cluster where you pay only for what you use. Amazon's EC2 and S3 have typically been used to provide metered web services and data storage. However, deploying a Hadoop cluster with heterogeneous server instances on a remote cloud still requires you to do a lot of the setup. Amazon realized this and created their Elastic MapReduce service. Now you can run your Hadoop jobs on demand with very little configuration and yet still have complete control over what class of machine you want to assign to the different roles in your cluster.

This makes a lot of sense for software developers like me. During development, I may use a small cluster to prove that my stuff is working. During QA, we might vary the size/configuration to see how our solution scales and do performance tuning. Our marketing and sales teams can have demos ready on the cloud anytime, anywhere in the world to show prospective customers. We would also have a good solution for customers who do not want a huge IT outlay before they are convinced of the value of our product. This might even absolve us of any legal/privacy concerns by letting the customer make their own agreements with Amazon while we provide just the software.