Skip to main content

BigQuery

The BigQuery module is more complete than others, it allows the common transformation of the library, which is the transformation from other types into BigQuery, and BigQuery types into others.

This module also includes a complete integration with a BigQuery environment, meaning that if BigQuery credentials are given, the library include methods to create tables in BigQuery, directly using other types.

Create BigQuery Tables

import org.datatools.bigdatatypes.bigquery.BigQueryTable
import org.datatools.bigdatatypes.formats.Formats.implicitDefaultFormats

case class MyTable(field1: Int, field2: String)
BigQueryTable.createTable[MyTable]("dataset_name", "table_name")

This also works with Structs, Lists and Options. See more examples in Tests

Transform field names

There is a Format object that allows us to decide how to transform field names, for example, changing CamelCase for snake case

import org.datatools.bigdatatypes.bigquery.BigQueryTable
import org.datatools.bigdatatypes.formats.Formats.implicitSnakifyFormats

case class MyTable(myIntField: Int, myStringField: String)
BigQueryTable.createTable[MyTable]("dataset_name", "table_name")
//This table will have my_int_field and my_string_field fields
tip

Tables can be created using directly an instance of any other type of the library. Example from Spark:

val df: Dataframe = ???
val schema: StructType = df.schema
BigQueryTable.createTable[StructType](schema, "dataset_name", "table_name")

Time Partitioned tables

Using a Timestamp or Date field, tables can be partitioned in BigQuery using a Time Partition Column

import org.datatools.bigdatatypes.bigquery.BigQueryTable
import org.datatools.bigdatatypes.formats.Formats.implicitSnakifyFormats

case class MyTable(field1: Int, field2: String, myPartitionField: java.sql.Timestamp)
BigQueryTable.createTable[MyTable]("dataset_name", "table_name", "my_partition_field")

Create a table with more than one Case Class

In many cases we work with a Case Class that represents our data but we also want to add some metadata fields like updated_at, received_at, version and so on. In these cases we can work with multiple Case Classes and fields will be concatenated:

import org.datatools.bigdatatypes.bigquery.BigQueryTable
import org.datatools.bigdatatypes.formats.Formats.implicitDefaultFormats

case class MyData(field1: Int, field2: String)
case class MyMetadata(updatedAt: Long, version: Int)
BigQueryTable.createTable[MyData, MyMetadata]("dataset_name", "table_name")

This can be done up to 5 concatenated classes

Create BigQuery schema from a Case Class

import com.google.cloud.bigquery.{Field, Schema}
import org.datatools.bigdatatypes.formats.Formats.implicitDefaultFormats
import org.datatools.bigdatatypes.bigquery.BigQueryTypes

case class MyTable(field1: Int, field2: String)
//List of BigQuery Fields, it can be used to construct an Schema
val fields: List[Field] = BigQueryTypes[MyTable].bigQueryFields
//BigQuery Schema, it can be used to create a table
val schema: Schema = Schema.of(fields.asJava)

Or use the provided extension method for the creation of the Schema

val fields: List[Field] = BigQueryTypes[MyTable].bigQueryFields
val schema: Schema = fields.schema

From a Case Class instance

import com.google.cloud.bigquery.Field
import org.datatools.bigdatatypes.formats.Formats.implicitDefaultFormats
import org.datatools.bigdatatypes.bigquery.BigQueryTypes._

case class MyTable(field1: Int, field2: String)
val data = MyTable(1, "test")
val fields: List[Field] = data.asBigQuery

From any other type

e.g: Spark Schema

val myDataframe: Dataframe = ???
val bqFields: List[Field] = myDataframe.schema.asBigQuery
val bqSchema: Schema = myDataframe.schema.asBigQuery.schema

See more info about creating tables on BigQuery in the official documentation

Connecting to your BigQuery environment

If you want to create tables using the library you will need to connect to your BigQuery environment through any of the GCloud options. Probably the most common will be to specify a service account and a project id. It can be added on environment variables. The library expects:

  • PROJECT_ID: <your_project_id>
  • GOOGLE_APPLICATION_CREDENTIALS: <path_to_your_service_account_json_file>