How to Set Up an Apache Drill Cluster to Query Data from GCS | Replacement for GCP BigQuery


Google Cloud Platform is one of the most popular and fastest-growing cloud platforms. In this article, we will discuss an alternative to Google BigQuery that can help reduce GCP billing.

Introduction:

BigQuery is a scalable, serverless data warehouse with a SQL interface, blazing-fast queries, and per-query pricing. It helps you get real-time insights from streaming data and adopt advanced and predictive analytics. BigQuery can also read data from external storage systems such as GCS buckets or Google Sheets.

Even though BigQuery is a nice solution for such use cases, it is a bit on the pricier side. When we have terabytes of data in Google Cloud Storage, even the test queries alone become very expensive. This is where we can use Apache Drill, an open-source solution, to replace BigQuery.

Uses of Apache Drill:

  • Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores.
  • Users can query the data using standard SQL and BI tools without having to create and manage schemas.
  • Drill can query data across multiple data stores such as MongoDB, Kafka, and object storage (GCS, S3) within the same query.
  • Drill allows us to join data from across different systems without creating a unified data storage layer.
  • Drill also supports client connections via industry-standard interfaces such as ANSI SQL, ODBC/JDBC, and RESTful APIs.
  • Users can simply issue a REST call (for example, from Postman or curl) to query data from storage, as sketched below.
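
For example, a minimal sketch of querying Drill over its REST API with curl (assuming a drillbit reachable on port 8047; the gcs storage plugin and sample file are configured later in this article):

curl -X POST -H "Content-Type: application/json" \
  -d '{"queryType": "SQL", "query": "SELECT * FROM gcs.`root`.`SacramentocrimeJanuary2006.csv` LIMIT 10"}' \
  http://<drill-hostname-or-ip-address>:8047/query.json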

How to set up Apache Drill:

Apache Drill can be installed in two different modes:

  • Embedded Mode (Standalone Single Node installation)
  • Distributed Mode (Drill Cluster with Zookeeper)
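
For reference, embedded mode needs no ZooKeeper and can be started on a single machine with the drill-embedded script shipped in the Drill tarball (path assumes Drill 1.17):

/apache-drill-1.17.0/bin/drill-embedded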

In this article, we will deploy the Apache Drill cluster using Terraform.

Prerequisites:

  • A Google Cloud Platform account.
  • A Google Cloud service account with the Storage Owner role on the bucket where the data is stored (example commands follow this list).
  • The shaded version of the GCS Hadoop connector JAR file, uploaded to a GCS bucket location that the Drill VMs can access with their instance service account.
  • Terraform and Git installed on the server where you run the Terraform template.
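
A minimal sketch of the service-account prerequisite using gcloud and gsutil (the account name drill-gcs-sa is an example; project and bucket names match the ones used later in this article, and your organization may prefer a narrower role than storage.admin):

gcloud iam service-accounts create drill-gcs-sa --display-name="Apache Drill GCS access"
gsutil iam ch serviceAccount:drill-gcs-sa@project-nuance-2720XX.iam.gserviceaccount.com:roles/storage.admin gs://devopstree-gcs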

Store the GCS Hadoop Connector JAR in a Bucket:

The Cloud Storage connector is an open-source Java library that lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage. Download the shaded version of the GCS Hadoop connector JAR file and upload it to a GCS bucket location using the following command,

gsutil cp ~/Downloads/gcs-connector-hadoop2-2.1.5-shaded.jar gs://devopstree-gcs/gcs-connecter/

Clone Apache Drill Terraform Template Git repo:

Clone the repo using the following command,

git clone https://github.com/kaysush/apache-drill-terraform.git

Note: this is a simple starter template; adapt it to your organization's standards.

Update the placeholders in the Terraform template:

  1. If you are not applying this Terraform script from an IaC VM that already has a service account with the cloud-platform access scope, replace the PATH-TO-SERVICE-ACCOUNT-JSON-FILE placeholder in provider.tf with the path to your GCP Terraform service account JSON key. (If your VM has permission to create the required resources, remove the credentials line from provider.tf.) I have put a zone in the region option because I am going to create resources in only one zone:
provider "google" {
    region = "us-west1-c"
    project = "project-nuance-2720XX"
}
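
If Terraform itself does need key-file authentication, the provider block would include a credentials line; a sketch (the key path is an example):

provider "google" {
    credentials = file("/home/user/keys/terraform-sa.json")
    region = "us-west1-c"
    project = "project-nuance-2720XX"
}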
  2. Replace the BUCKET-NAME placeholder in main.tf with your bucket name.
  3. Replace the JSON-KEY-SUBPATH placeholder with the path to the JSON key that has access to the bucket where the data is stored.
  4. Replace the CONNECTOR-JAR-SUBPATH placeholder with the GCS connector JAR path:
    data "template_file" "core_site_xml" {
        template = file("${path.module}/templates/core-site.xml.tpl")
        vars = {
            gcp_project = "project-nuance-2720XX"
        }
    }
    
    data "template_file" "start_drill" {
        template = file("${path.module}/templates/start_drill.sh.tpl")
        vars = {
            zookeeper_hostname = module.zookeeper.zookeeper_hostname
            core_site_content = data.template_file.core_site_xml.rendered
            bucket_name = "devopstree-gcs"
            sa_key_subpath = "<JSON-KEY-SUBPATH>" #Optional
            connector_jar_subpath = "gcs-connecter/gcs-connector-hadoop2-2.1.5-shaded.jar"
        }
    }

    Note: keeping the service account JSON in a GCS bucket is not a good idea, so the sa_key_subpath line can be removed; instead, grant bucket access to the instance's default service account (see the example command after this list).

  5. In the Terraform templates folder we have the core-site.xml template, which is the Drill config file. We can change the fs.gs.auth.service.account.json.keyfile property to fs.gs.auth.service.account.json.email if we don't want to use key-file authentication:
   <property>
      <name>fs.gs.auth.service.account.json.email</name>
      <value><SERVICE-ACCOUNT-EMAIL></value>
   </property>
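
    As noted above, instead of shipping a key file you can grant the instances' default service account read access to the data bucket; a sketch with gsutil (replace the placeholder with your compute service account's email):

    gsutil iam ch serviceAccount:<INSTANCE-SERVICE-ACCOUNT-EMAIL>:roles/storage.objectViewer gs://devopstree-gcs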

Apply the Terraform Template:

  • Initialize Terraform,
terraform init

Once initialized, run the plan command and check whether it reports any syntax errors or permission issues.

terraform plan
  • Apply the Terraform template using the following command,
terraform apply --auto-approve

This will spin up three Apache Drill nodes and one ZooKeeper node, and configure Drill in distributed mode.

Apply complete! Resources: 5 added, 0 changed, 0 destroyed.

How to access Apache Drill UI:

To access the Drill UI, hit the IP address of any Drill VM in the browser on port 8047, as shown below,

http://<drill-hostname-or-ip-address>:8047

GCS Storage Plugin Configuration:

We can configure the plugin in two different ways:

  • Edit the /apache-drill-1.17.0/conf/storage-plugins-override.conf file directly and restart the drillbit.
  • From the UI, navigate to Storage on the toolbar and press Create under Plugin Management, which allows us to add a new plugin configuration. Give it a name in the Storage Name field and paste in the GCS storage plugin config (a sketch of a typical config follows this list).

  • Save the configuration and we are done with the plugin setup. Now we will be able to query GCS using SQL from the Drill UI.
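
For reference, a minimal sketch of what such a GCS storage plugin configuration typically looks like: a file-type plugin pointing at the bucket via the gs:// scheme (the bucket name, root workspace, and formats below are examples matching this article):

{
  "type": "file",
  "connection": "gs://devopstree-gcs",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ","
    },
    "json": {
      "type": "json",
      "extensions": ["json"]
    },
    "parquet": {
      "type": "parquet"
    }
  }
}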

Query Data From GCS using Drill UI:

We can also query data from GCS via a REST API call to the Drill machine, but in this article we are looking only at the Web UI.

Here, you can see the CSV data uploaded to the bucket. We can also upload other file types such as PSV, TSV, JSON, Parquet, Avro, etc.

Run the following query from the UI to get the result,

SELECT * from gcs.`root`.`SacramentocrimeJanuary2006.csv` LIMIT 100;

Click the Submit button, and the results will be displayed as below.
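
If the csv format in your plugin is not configured to extract headers, Drill exposes each line as a columns array instead of named columns; a sketch:

SELECT columns[0], columns[1] FROM gcs.`root`.`SacramentocrimeJanuary2006.csv` LIMIT 10;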

Troubleshooting:

Error: ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

==> This error indicates that the GCS connector JAR is not properly configured in Drill. Check whether the JAR is placed correctly, as sketched below.
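
Drill loads third-party JARs from its jars/3rdparty directory, so a quick check and fix on each Drill node might look like this (paths assume Drill 1.17 installed under /apache-drill-1.17.0, as elsewhere in this article):

ls /apache-drill-1.17.0/jars/3rdparty/ | grep gcs-connector
# If missing, copy the JAR in and restart the drillbit:
cp gcs-connector-hadoop2-2.1.5-shaded.jar /apache-drill-1.17.0/jars/3rdparty/
/apache-drill-1.17.0/bin/drillbit.sh restart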

Error: EXECUTION_ERROR ERROR: Unable to allocate buffer of size 262144 (rounded from 204800) due to memory limit

==> Increase planner.memory_limit to a value large enough for the query.

For example, set it to 512MB:

alter session set `planner.memory_limit`=536870912;

Helpful Reference Link:

https://drill.apache.org/docs/distributed-mode-prerequisites/

