spark-bigquery-ingestion

A Spark-based Data Warehouse Ingestion Tool: load an entire database into Google BigQuery with a single job, without writing any code.


Data Sources:

Stack

Runtime

It runs on top of any Spark cluster. To make things cheaper and easier, it can run on Google Dataproc using dynamic allocation; please read the workflow section.
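As a hedged sketch (the cluster name, region, and worker count below are placeholders, not taken from this project), a Dataproc cluster with Spark dynamic allocation enabled can be created like this:

    gcloud dataproc clusters create ingestion-cluster \
        --region=us-central1 \
        --num-workers=2 \
        --properties=spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true

Dynamic allocation lets Spark grow and shrink the executor count with the load of each table, which is what keeps the Dataproc run cheap.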

Requirements

How to run

  1. To run on any Spark cluster, first generate the jar file using the command:

sbt clean assembly
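
The assembled jar can then be submitted with spark-submit. The main class and jar path below are assumptions for illustration; substitute the project's actual entry point and the path produced by sbt assembly:

    spark-submit \
        --class ingestion.IngestionApp \
        --master yarn \
        target/scala-2.11/spark-bigquery-ingestion.jar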

Run on Google Dataproc
  1. Upload the jar file to the Google Storage folder gs://<your-googlecloud-project-name>/lib/spark-bigquery-ingestion.jar (see the command sketch after this list).

  2. Edit the file ingestion-workflow.yml, replacing <your-googlecloud-project-name> with your Google Cloud project name.

  3. Configure all your data sources; please read the setup datasources section.

  4. Create all databases referenced in your data sources, in the same region as the Dataproc cluster.

  5. Run the command gcloud auth login (first time only).

  6. Execute the script execute-load.sh.
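
Taken together, steps 1, 5, and 6 look roughly like the shell sketch below; the local jar path is the one produced by sbt assembly and is an assumption, and the bucket placeholder is the one from step 1:

    # Upload the assembled jar (step 1); the local path is an assumption
    gsutil cp target/scala-2.11/spark-bigquery-ingestion.jar \
        gs://<your-googlecloud-project-name>/lib/spark-bigquery-ingestion.jar

    # Authenticate against Google Cloud (step 5, first time only)
    gcloud auth login

    # Launch the ingestion workflow (step 6)
    ./execute-load.sh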

Extend the Jdbc IoLoader

If you need to include a new RDBMS such as Oracle or MSSQL, add the query and JDBC driver for the RDBMS you need to the createQueryMap and createDriverMap methods in the Jdbc class, as sketched below.
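
A minimal sketch of such an extension; the exact shapes of these maps inside the Jdbc class are assumptions, and only the driver class names and catalog queries are standard for each RDBMS:

    // Hedged sketch: the real Jdbc class may shape these maps differently.
    object Jdbc {
      // Maps an RDBMS key to the catalog query used to list tables for ingestion
      def createQueryMap: Map[String, String] = Map(
        "mysql"  -> "SELECT table_name FROM information_schema.tables WHERE table_schema = ?",
        // New entries for Oracle and MSSQL
        "oracle" -> "SELECT table_name FROM all_tables WHERE owner = ?",
        "mssql"  -> "SELECT table_name FROM information_schema.tables WHERE table_catalog = ?"
      )

      // Maps an RDBMS key to its JDBC driver class
      def createDriverMap: Map[String, String] = Map(
        "mysql"  -> "com.mysql.jdbc.Driver",
        // New entries for Oracle and MSSQL
        "oracle" -> "oracle.jdbc.OracleDriver",
        "mssql"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"
      )
    }

Remember to also add the matching driver dependency (e.g. the Oracle or MSSQL JDBC jar) to build.sbt so it gets bundled by sbt assembly.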

Add new Data Source

If you need to add a new Data Source, such as a new API or any other direct approach, follow the steps below (a sketch follows the list):

  1. Implement a new IoLoader (see the existing implementations in the package for examples).

  2. Register the new IoLoader in the factory class IoLoaderFactory.
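
A hedged sketch of both steps follows. The IoLoader trait signature, the JdbcIoLoader name, and the factory's match shape are assumptions about this codebase, shown only to illustrate the pattern:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Assumed shape of the IoLoader contract
    trait IoLoader {
      def load(spark: SparkSession, config: Map[String, String]): DataFrame
    }

    // Existing loader (stubbed here; the real one lives in this project)
    class JdbcIoLoader extends IoLoader {
      override def load(spark: SparkSession, config: Map[String, String]): DataFrame =
        spark.read.format("jdbc").options(config).load()
    }

    // Step 1: a new loader, e.g. for a REST API source (names are illustrative)
    class RestApiIoLoader extends IoLoader {
      override def load(spark: SparkSession, config: Map[String, String]): DataFrame = {
        // Fetch the JSON payload from the configured endpoint
        val payload = scala.io.Source.fromURL(config("endpoint")).mkString
        // Parse it into a DataFrame via a Dataset[String]
        import spark.implicits._
        spark.read.json(Seq(payload).toDS())
      }
    }

    // Step 2: register the new loader in IoLoaderFactory
    object IoLoaderFactory {
      def apply(sourceType: String): IoLoader = sourceType match {
        case "jdbc"     => new JdbcIoLoader()    // existing loader
        case "rest-api" => new RestApiIoLoader() // new loader
        case other      => throw new IllegalArgumentException(s"Unknown source type: $other")
      }
    }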

Future features

Incremental ingestion

External Resources

Data Storage on Google BigQuery, by Spotify: https://github.com/spotify/spark-bigquery

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0