Copy Into
COPY INTO
in Databricks¶
COPY INTO feature is used to load files from volumes to Databricks tables and has feature of idempotency ie the data does not get duplicated in the table.
Steps¶
- Create Volume
- Create folder inside volume
- Copy sample dataset into volume
dbutils.fs.cp("/databricks-datasets/definitive-guide/data/retail-data/by-day/2010-12-01.csv","/Volumes/dev/bronze/landing/input")
dbutils.fs.cp("/databricks-datasets/definitive-guide/data/retail-data/by-day/2010-12-02.csv","/Volumes/dev/bronze/landing/input")
- Create bronze table
COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ( -- for the files, if they are different format one has 3 cols other has 5 then merge them
'mergeSchema' = 'true',
'header' = 'true'
)
COPY_OPTIONS ( -- at table level meerge Schema
'mergeSchema' = 'true'
)
- Select from table
We can see 5217 rows.
- Run COPY INTO again
No affected rows so copy into does not duplicate. Its idempotent.
How does copy into maintain the log of data files ingested?¶
The delta log maintains json version tracking that has information and path of files processed.
Custom Transformations while loading¶
COPY INTO dev.bronze.invoice_cp_alt
FROM
(
SELECT InvoiceNo,StockCode,cast(Quantity as DOUBLE),current_timestamp() as _insert_date
FROM
'/Volumes/dev/bronze/landing/input'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ( -- for the files, if they are different format one has 3 cols other has 5 then merge them
'mergeSchema' = 'true',
'header' = 'true'
)
COPY_OPTIONS ( -- at table level meerge Schema
'mergeSchema' = 'true'
)