COPY INTO in Databricks

The COPY INTO command loads files from a volume into a Databricks table. It is idempotent: files that have already been loaded are skipped on subsequent runs, so data is not duplicated in the table.

Steps

  1. Create a volume
CREATE VOLUME dev.bronze.landing
  2. Create a folder inside the volume
dbutils.fs.mkdirs("/Volumes/dev/bronze/landing/input")
  3. Copy sample datasets into the volume
dbutils.fs.cp("/databricks-datasets/definitive-guide/data/retail-data/by-day/2010-12-01.csv","/Volumes/dev/bronze/landing/input")

dbutils.fs.cp("/databricks-datasets/definitive-guide/data/retail-data/by-day/2010-12-02.csv","/Volumes/dev/bronze/landing/input")
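To confirm that both files landed in the volume, the folder can be listed (an optional check, not part of the original steps):

dbutils.fs.ls("/Volumes/dev/bronze/landing/input")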
  4. Create the bronze table and load it with COPY INTO
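COPY INTO requires the target table to exist. If it has not been created yet, an empty Delta table can be declared first and its schema will be inferred during the load (a minimal sketch; the original notebook may have created the table another way):

CREATE TABLE IF NOT EXISTS dev.bronze.invoice_cp;

The COPY INTO statement below then loads the CSV files from the volume into this table.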
COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ( -- file-level options: merge schemas when the source files differ (e.g. one file has 3 columns, another has 5)
  'mergeSchema' = 'true',
  'header' = 'true'
)
COPY_OPTIONS ( -- table-level option: mergeSchema lets the target table schema evolve
  'mergeSchema' = 'true'
)
  5. Select from the table
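For example, counting the rows (any query against the table works here):

SELECT count(*) FROM dev.bronze.invoice_cp;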

We can see 5217 rows.


  6. Run COPY INTO again


No rows are affected, so COPY INTO does not duplicate the data: it is idempotent.

How does COPY INTO keep track of which files have already been ingested?

The Delta transaction log of the target table records this: each COPY INTO run is committed as a JSON version file under _delta_log, and that commit stores metadata about the load, including the paths of the files that were processed. Files already listed there are skipped on later runs.
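One way to see this state (an optional check, not part of the original steps) is the table history, where each COPY INTO run appears as its own version with operation metrics:

DESCRIBE HISTORY dev.bronze.invoice_cp;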


Custom Transformations while loading
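As before, COPY INTO requires the target table to exist; assuming it has not been created yet, an empty Delta table can be declared first (a minimal sketch):

CREATE TABLE IF NOT EXISTS dev.bronze.invoice_cp_alt;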

COPY INTO dev.bronze.invoice_cp_alt
FROM
(
  SELECT InvoiceNo, StockCode, cast(Quantity AS DOUBLE) AS Quantity, current_timestamp() AS _insert_date
  FROM '/Volumes/dev/bronze/landing/input'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ( -- file-level options: merge schemas when the source files differ (e.g. one file has 3 columns, another has 5)
  'mergeSchema' = 'true',
  'header' = 'true'
)
COPY_OPTIONS ( -- table-level option: mergeSchema lets the target table schema evolve
  'mergeSchema' = 'true'
)
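As a quick check (a hypothetical verification query), the transformed table should show Quantity as a DOUBLE and the _insert_date audit column populated:

SELECT InvoiceNo, StockCode, Quantity, _insert_date
FROM dev.bronze.invoice_cp_alt
LIMIT 5;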
