HDInsight – Provision a Hadoop Cluster on Azure

Creating a Hadoop cluster on Azure is straightforward, but it does take several steps. In this blog entry we provide a skeleton PowerShell script to get you started.

Requirements

To run this script, you need the Azure PowerShell SDK installed on your machine. To check whether it is already installed, run the following command:

PS C:\> Get-Module -ListAvailable Azure

 

    Directory: C:\Program Files (x86)\Microsoft SDKs\Azure\PowerShell\ServiceManagement
 
ModuleType Version    Name                                ExportedCommands
---------- -------    ----                                ----------------
Manifest   1.0.1      Azure                               {Disable-AzureServiceProjectRemoteDesktop, Enable-AzureSer...
 
PS C:\>

If the above command returns no data, you need to install the SDK before proceeding. You can install it in either of two ways: via the PowerShell Gallery or via Web PI. Follow this link for more details on how to install the required components.

Starting the provisioning

Now that you have the Azure PowerShell SDK installed, all you need to do is tweak the parameters below and then execute the script that follows.

Parameters

To make it easier to manipulate the script, we will use a set of variables to adjust how we want to create our cluster. These are as follows:

    # This is the prefix that we use to make our cluster elements unique
    # Change this to suit your taste
    $myPrefix = "clounce"

    $clusterName = $myPrefix + "cluster"

    # Adjust the number of nodes that you want
    $clusterNodes = 2

    $clusterVersion = "3.3"

    # Change the names below if you need to use pre-existing resources
    $resourceGroupName = $myPrefix + "rg"
    $storageAccountName = $myPrefix + "sa"
    $storageContainerName = $myPrefix + "cnt"

    # Change this to your nearest Azure data centre
    $location = "West Europe"

    # Change the subscription name to yours.  You can read this by running the command
    # Get-AzureSubscription. You may be asked for your Azure credentials.
    $subscriptionName = "Azure Pass"
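Because the resource names above are all derived from one prefix, it can help to check them locally before calling Azure. The following is a minimal sketch, not part of the original script: the two function names are hypothetical helpers, and the patterns encode the Azure naming rules as we understand them (storage account names: 3 to 24 lowercase letters and digits; blob container names: 3 to 63 lowercase letters, digits and single hyphens, starting and ending with a letter or digit).

```powershell
# Hypothetical helpers (not part of the original script) that check name
# formats locally. They do NOT check whether a name is already taken.

function Test-StorageAccountNameFormat {
    param([string]$Name)
    # 3-24 characters, lowercase letters and digits only (case-sensitive match).
    return ($Name -cmatch '^[a-z0-9]{3,24}$')
}

function Test-ContainerNameFormat {
    param([string]$Name)
    # 3-63 characters; lowercase letters, digits and hyphens; must start and
    # end with a letter or digit; no two hyphens in a row.
    return ($Name -cmatch '^[a-z0-9](?:[a-z0-9]|-(?!-)){1,61}[a-z0-9]$')
}

# Example: validate the names derived from the prefix used in this article.
$myPrefix = "clounce"
if (-not (Test-StorageAccountNameFormat ($myPrefix + "sa"))) {
    Write-Warning "The storage account name derived from '$myPrefix' is not valid."
}
if (-not (Test-ContainerNameFormat ($myPrefix + "cnt"))) {
    Write-Warning "The container name derived from '$myPrefix' is not valid."
}
```

A long or mixed-case prefix is the most likely cause of a failure here, since the storage account name is the most restrictive of the derived names.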

Login to your subscription

Before we can start running Azure PowerShell commands, we need to authenticate and select the subscription that we want to use.

    Login-AzureRmAccount
    Select-AzureRmSubscription -SubscriptionName $subscriptionName

Setting up resources

The next step is to set up the resources needed to run a Hadoop cluster. The code that follows checks whether each resource already exists and, if so, reuses it instead of creating a new one.

    # Create resource group
    # The -Force parameter suppresses any warning and uses the old resource group if it already exists
    New-AzureRmResourceGroup -Name $resourceGroupName -Location $location -Force

    # Create storage account
    # Test-AzureName belongs to the classic (Service Management) Azure module
    # and returns $true when the name is already in use
    if (!(Test-AzureName -Storage $storageAccountName))
    {
        New-AzureRmStorageAccount -ResourceGroupName $resourceGroupName -Name $storageAccountName -Location $location -Type Standard_RAGRS
    }

    # Get storage key
    # Note: newer AzureRM.Storage releases return a list of keys instead of a
    # Key1 property; there, read the Value of the first list entry instead
    $storageAccountKey = Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -Name $storageAccountName | %{ $_.Key1 }

    # Create a storage context object
    $storageContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

    # Create a Blob storage container if it does not exist yet
    $blobContainer = Get-AzureStorageContainer -Name $storageContainerName -Context $storageContext -Verbose:$false -ErrorAction SilentlyContinue
    if ($null -eq $blobContainer)
    {
        New-AzureStorageContainer -Name $storageContainerName -Context $storageContext
    }

Create the Cluster

Now that the required resources are in place, we are ready to provision our cluster. Note that provisioning takes some time to complete. Please be patient!

    # Get user credentials to use when provisioning the cluster.
    Write-Verbose "Prompt user for ssh credentials to set during provisioning."
    $credentials = Get-Credential
    Write-Verbose "Use these credentials to login to the cluster via ssh when the script is complete."

    # Create a new HDInsight cluster
    New-AzureRmHDInsightCluster -ResourceGroupName $resourceGroupName `
        -ClusterName $clusterName `
        -Location $location `
        -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageAccountKey `
        -DefaultStorageContainer $storageContainerName `
        -ClusterType Hadoop `
        -OSType Linux `
        -Version $clusterVersion `
        -ClusterSizeInNodes $clusterNodes `
        -SshCredential $credentials

Viewing Cluster Details

Once the cluster is provisioned, you can get its details by running the Get-AzureRmHDInsightCluster cmdlet as follows:

PS> Get-AzureRmHDInsightCluster -ClusterName $clusterName

Connecting to your Cluster

You have now created your cluster and can connect to it using ssh or any other terminal software such as PuTTY. The host name to use can be obtained from the Azure Portal (look for Secure Connect under the cluster configuration) or derived using the following pattern.

First view your cluster information:

PS C:\Windows\System32> Get-AzureRmHDInsightCluster -ClusterName $clusterName
Location                  : West Europe
ClusterVersion            : 3.3.1000.0
OperatingSystemType       : Linux
ClusterState              : Running
ClusterType               : Hadoop
CoresUsed                 : 16
HttpEndpoint              : myclustername.azurehdinsight.net

Next, take the HttpEndpoint, and re-format it as follows:

myclustername-ssh.azurehdinsight.net

Finally, ssh to the host name above, logging in with the credentials you supplied during provisioning.
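The re-formatting step can also be scripted. The sketch below assumes the HttpEndpoint value always has the form name.domain, as in the sample output above; ConvertTo-SshHostName is a hypothetical helper name, not an Azure cmdlet.

```powershell
# Hypothetical helper (not an Azure cmdlet): derive the SSH host name from
# the HttpEndpoint value by appending "-ssh" to the cluster name part.
function ConvertTo-SshHostName {
    param([string]$HttpEndpoint)

    # Split at the first dot only: cluster name, then the rest of the domain.
    $clusterPart, $domainPart = $HttpEndpoint -split '\.', 2

    if ([string]::IsNullOrEmpty($clusterPart) -or [string]::IsNullOrEmpty($domainPart)) {
        throw "Unexpected endpoint format: '$HttpEndpoint'"
    }

    return ('{0}-ssh.{1}' -f $clusterPart, $domainPart)
}

# Example with the endpoint from the sample output in this article.
ConvertTo-SshHostName "myclustername.azurehdinsight.net"
```

With a live cluster, you could pass the HttpEndpoint property of the object returned by Get-AzureRmHDInsightCluster to the helper rather than typing the endpoint by hand.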

Note of caution: A running cluster can consume a large share of your subscription credit. We suggest that you remove the cluster when you are finished with it. You can keep the storage for later use. To delete the cluster, run the following command:

Remove-AzureRmHDInsightCluster -ClusterName $clusterName

You can download the script from here.

Enjoy!