Data Source Configuration
Here you will find more details on how to set up a new Spark job as a workflow step.
Data Source Types:
- Google Storage (`gs`) - Reads all subfolders of a Google Storage folder; if any subfolder contains one or more data files (JSON, XML, CSV, XLS, TXT, XLSX, AVRO, PARQUET), the component automatically creates a new table with the folder's name.
- MongoDB (`mongodb`) - Reads all collections from the given database and creates internal threads to ingest the data.
- JDBC (`jdbc`) - Reads all tables contained in the given database, creates a thread for each table, and ingests the data into the corresponding Google Storage table (see the sketch after this list).
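To make the per-table threading of the `jdbc` type more concrete, the sketch below shows one way such a component could be structured in Scala: it lists the source tables through JDBC metadata, then ingests each table on its own worker thread with Spark's JDBC reader. This is a minimal illustration only, not the actual `br.com.bruno.data.ingestion.Starter` implementation; the argument order and the `gs://<BUCKET>/staging/` sink are assumptions made for the example.

```scala
import java.sql.DriverManager
import java.util.concurrent.Executors

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

import org.apache.spark.sql.SparkSession

// Hypothetical sketch only; not the actual br.com.bruno.data.ingestion.Starter code.
object JdbcIngestionSketch {

  def main(args: Array[String]): Unit = {
    // Assumed argument order, mirroring the workflow example further below:
    // type, JDBC URL, user, password, target (ignored here), timeout (minutes), thread count.
    val Array(_, jdbcUrl, user, password, _, timeoutMin, threads) = args

    val spark = SparkSession.builder().appName("jdbc-ingestion-sketch").getOrCreate()

    // List every table of the source database through JDBC metadata.
    val conn = DriverManager.getConnection(jdbcUrl, user, password)
    val rs   = conn.getMetaData.getTables(null, null, "%", Array("TABLE"))
    val tables =
      Iterator.continually(rs).takeWhile(_.next()).map(_.getString("TABLE_NAME")).toList
    conn.close()

    // One worker thread per table, capped at the configured pool size
    // (the "Number of Component internal Threads" argument in the workflow example).
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(threads.toInt))

    val ingestions = tables.map { table =>
      Future {
        spark.read
          .format("jdbc")
          .option("url", jdbcUrl)
          .option("dbtable", table)
          .option("user", user)
          .option("password", password)
          .load()
          .write
          .mode("overwrite")
          // Placeholder sink: the real component loads the data into the configured target.
          .parquet(s"gs://<BUCKET>/staging/$table")
      }
    }

    Await.result(Future.sequence(ingestions), timeoutMin.toInt.minutes)
    spark.stop()
  }
}
```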
Examples:
File: ingestion-workflow.yml
The workflow file below defines one Spark job step per data source type (`jdbc`, `gs`, and `mongodb`). Values in angle brackets, such as `<HOST>` or `<BUCKET>`, are placeholders to replace with your own project values.
```yaml
# Read more: https://cloud.google.com/dataproc/docs/concepts/workflows/using-yamls
jobs:
- sparkJob:
    args:
    - jdbc                                    # Data Source Type
    - jdbc:postgresql://<HOST>:<PORT>/        # Source DB JDBC URL
    - <USERNAME>                              # Source DB username
    - <PASSWORD>                              # Source DB Password
    - <PROJECT>:<DATASET>                     # Target BigQuery Database
    - '30'                                    # Timeout in minutes to execute the Ingestion
    - '25'                                    # Number of Component internal Threads, default 20
    jarFileUris:
    - gs://<BUCKET>/lib/igt_database.jar      # Project Jar file, saved in Google Storage
    - gs://<BUCKET>/lib/postgresql-42.2.5.jar # JDBC Jar file, saved in Google Storage
    mainClass: br.com.bruno.data.ingestion.Starter # Scala Main Class
    properties: # Read more: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties
      spark.executor.cores: '5'
      spark.executor.memory: 4g
  stepId: jdbc_step
- sparkJob:
    args:
    - gs                                      # Data Source Type (Google Storage)
    - gs://<BUCKET>/stg/cna/data/landing/     # Source root folder: every sub-folder with one or more data files (json, xml, csv) becomes a new table loaded with its content
    - "*"                                     # Username, use *
    - "*"                                     # Password, use *
    - <PROJECT>:<DATASET>                     # Target
    - '30'                                    # Timeout to execute the data Load
    - '1'                                     # Number of Component internal Threads, default 20
    jarFileUris:
    - gs://<BUCKET>/lib/igt_database.jar      # Project Jar file, saved in Google Storage
    mainClass: br.com.bruno.data.ingestion.Starter # Scala Main Class
    properties: # Read more: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties
      spark.executor.cores: '4'
      spark.executor.memory: 4g
  stepId: google_storage
- sparkJob:
    args:
    - mongodb                                 # Data Source Type
    - mongodb://<USER>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>?authSource=admin&authMechanism=SCRAM-SHA-1 # Source DB URL
    - <USERNAME>                              # Source DB username
    - <PASSWORD>                              # Source DB Password
    - <PROJECT>:<DATASET>                     # Target BigQuery Database
    - '30'                                    # Timeout in minutes to execute the Ingestion
    - '25'                                    # Number of Component internal Threads, default 20
    jarFileUris:
    - gs://<BUCKET>/lib/igt_database.jar      # Project Jar file, saved in Google Storage
    mainClass: br.com.bruno.data.ingestion.Starter # Scala Main Class
    properties: # Read more: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties
      spark.executor.cores: '5'
      spark.executor.memory: 4g
  stepId: mongodb_step
placement: # Read more: https://cloud.google.com/dataproc/docs/concepts/workflows/using-yamls
  managedCluster:
    clusterName: db-ingestion-cluster-temp
    config:
      gceClusterConfig:
        zoneUri: us-east1-b
        networkUri: projects/<PROJECT>/global/networks/default
      masterConfig:
        diskConfig:
          bootDiskSizeGb: 30
          bootDiskType: pd-standard
        machineTypeUri: n1-standard-8
      softwareConfig:
        properties:
          dataproc:dataproc.allow.zero.workers: 'true'
          dataproc:dataproc.monitoring.stackdriver.enable: 'true'
```
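Once the placeholders are filled in, the file can be submitted as a Dataproc workflow template, for example with `gcloud dataproc workflow-templates instantiate-from-file --file=ingestion-workflow.yml --region=<REGION>`; the region value here is illustrative and should match the zone configured under `placement`.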