ETL ๊ณผ ELT (+Amazon Redshift)

ETL ๊ณผ ELT

ELT(Extract, Load, Transform)์™€ ETL(Extract, Transform, Load)์€ ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค์— ๋ฐ์ดํ„ฐ๋ฅผ ์ด๋™์‹œํ‚ค๋Š” ๋‘ ๊ฐ€์ง€ ์ผ๋ฐ˜์ ์ธ ๋””์ž์ธ ํŒจํ„ด์œผ๋กœ, ์ด ๋‘ ํŒจํ„ด์˜ ์ฃผ์š” ์ฐจ์ด์ ์€ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ์—์„œ ๋ณ€ํ™˜(transformation)์ด ์–ธ์ œ ๋ฐœ์ƒํ•˜๋Š”์ง€์— ์žˆ์Šต๋‹ˆ๋‹ค.

  1. ETL(Extract, Transform, Load): ์ด ๋ฐฉ์‹์—์„œ๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์›จ์–ดํ•˜์šฐ์Šค์— ๋กœ๋“œ๋˜๊ธฐ ์ „์— ๋ณ€ํ™˜์ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ETL์€ ์†Œ์Šค ์‹œ์Šคํ…œ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•˜๊ณ , ๊ทธ ๋ฐ์ดํ„ฐ๋ฅผ ์ •์ œ ๋ฐ ๋ณ€ํ˜•ํ•œ ํ›„, ๋งˆ์ง€๋ง‰์œผ๋กœ ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค์— ๋กœ๋“œํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๋ฏธ๋ฆฌ ์ •์˜๋œ ๋น„์ฆˆ๋‹ˆ์Šค ๊ทœ์น™์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ๋ฅผ ์ •์ œํ•˜๊ณ  ๋ณ€ํ˜•ํ•  ํ•„์š”๊ฐ€ ์žˆ์„ ๋•Œ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

  2. ELT(Extract, Load, Transform): ์ด ๋ฐฉ์‹์—์„œ๋Š” ์›จ์–ดํ•˜์šฐ์Šค์— ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•œ ํ›„์— ๋ณ€ํ™˜์ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ELT๋Š” ์†Œ์Šค ์‹œ์Šคํ…œ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•˜๊ณ , ๊ทธ ๋ฐ์ดํ„ฐ๋ฅผ ๋จผ์ € ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค์— ๋กœ๋“œํ•œ ํ›„, ํ•„์š”์— ๋”ฐ๋ผ ์›จ์–ดํ•˜์šฐ์Šค ๋‚ด์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ˜•ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๋Œ€์šฉ๋Ÿ‰์˜ ๋น„์ •ํ˜• ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•˜๋Š” ๋น… ๋ฐ์ดํ„ฐ ํ™˜๊ฒฝ์—์„œ ์œ ์šฉํ•˜๋ฉฐ, ๋ฐ์ดํ„ฐ์˜ ํ˜•์‹์ด๋‚˜ ๊ตฌ์กฐ๊ฐ€ ๋ฏธ๋ฆฌ ์˜ˆ์ธก๋˜์ง€ ์•Š๋Š” ์ƒํ™ฉ์—์„œ ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ, ์–ด๋–ค ์ƒํ™ฉ์—์„œ ELT์™€ ETL์„ ๊ตฌ๋ถ„ํ•˜์—ฌ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š”์ง€๋Š” ์ฃผ๋กœ ๋ฐ์ดํ„ฐ์˜ ์–‘, ๋ณต์žก์„ฑ, ๋ณ€ํ™˜์˜ ํ•„์š”์„ฑ ๋“ฑ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ, ETL์€ ์ž‘๊ฑฐ๋‚˜ ์ค‘๊ฐ„ ๊ทœ๋ชจ์˜ ์ •ํ˜• ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•œ ๋ณต์žกํ•œ ๋ณ€ํ™˜์„ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•  ๋•Œ, ELT๋Š” ํฐ ๊ทœ๋ชจ์˜ ๋ฐ์ดํ„ฐ์…‹์ด๋‚˜ ๋œ ๊ตฌ์กฐํ™”๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ์–ด์•ผ ํ•  ๋•Œ ๋”์šฑ ์œ ์šฉํ•˜๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฒซ ๋ฒˆ์งธ ํŒจํ„ด์ธ ETL์€ ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•˜๊ธฐ ์ „์— ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ ํŒจํ„ด์ธ ELT๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค์— ๋กœ๋“œํ•˜๊ณ , ์ต์ˆ™ํ•œ SQL ๊ตฌ๋ฌธ๊ณผ ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ(MPP; Massively Parallel Processing) ์•„ํ‚คํ…์ฒ˜์˜ ์„ฑ๋Šฅ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค ๋‚ด์—์„œ ๋ณ€ํ™˜์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค

Redshift Spectrum

Amazon Redshift๋Š” AWS์—์„œ ์™„์ „ํžˆ ๊ด€๋ฆฌ๋˜๋Š” ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค ์„œ๋น„์Šค์ž…๋‹ˆ๋‹ค. ๋ถ„์‚ฐํ˜•, ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ(MPP), ๊ณต์œ  ์—†๋Š” ์•„ํ‚คํ…์ฒ˜(?)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. Redshift Spectrum์€ Amazon Redshift์˜ ๊ธฐ๋ณธ ๊ธฐ๋Šฅ์œผ๋กœ, Amazon S3์˜ ๋ฐ์ดํ„ฐ ๋ ˆ์ดํฌ์— ์ €์žฅ๋œ ๋ชจ๋“  ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ํ˜„์žฌ ์‚ฌ์šฉ ์ค‘์ธ BI ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜๊ณผ SQL ํด๋ผ์ด์–ธํŠธ ๋„๊ตฌ๋ฅผ ์ด์šฉํ•ด Amazon Redshift์˜ SQL์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค๋‹ˆ๋‹ค.

์ผ๋ฐ˜์ ์ธ ํŒจํ„ด์€ Amazon Redshift์— ๋กœ์ปฌ๋กœ ์ €์žฅ๋œ ์ž์ฃผ ์•ก์„ธ์Šคํ•˜๋Š” ํ•ซ ๋ฐ์ดํ„ฐ์™€ Amazon S3์— ๋น„์šฉ ํšจ์œจ์ ์œผ๋กœ ์ €์žฅ๋œ ์›œ ๋˜๋Š” ์ฝœ๋“œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋‘ ํฌํ•จํ•˜๋Š” ์ฟผ๋ฆฌ๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์™ธ๋ถ€ ํ…Œ์ด๋ธ”์— ๋Œ€ํ•ด ์Šคํ‚ค๋งˆ ๋ฐ”์ธ๋”ฉ์ด ์—†๋Š” ๋ทฐ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด, ์—ฌ๋Ÿฌ ์œ ์Šค ์ผ€์ด์Šค์— ๋Œ€ํ•ด ํด๋Ÿฌ์Šคํ„ฐ์™€ S3์—์„œ ์ปดํ“จํŠธ ๋ฆฌ์†Œ์Šค์™€ ์ €์žฅ์†Œ๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ ํ™•์žฅํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Redshift Spectrum์€ Apache Parquet, Avro, CSV, ORC, JSON ๋“ฑ๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ๊ตฌ์กฐํ™”๋œ ๋ฐ ๋น„๊ตฌ์กฐํ™”๋œ ํŒŒ์ผ ํ˜•์‹์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. Amazon Athena, Amazon EMR, Amazon SageMaker์™€ ๊ฐ™์€ ๋‹ค๋ฅธ ์„œ๋น„์Šค๋“ค์ด S3 ๋ฐ์ดํ„ฐ ๋ ˆ์ดํฌ์—์„œ ์ง์ ‘ ์•ก์„ธ์Šคํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Concurrency Scaling

๋™์‹œ์„ฑ ์Šค์ผ€์ผ๋ง์„ ์‚ฌ์šฉํ•˜๋ฉด, Amazon Redshift๋Š” ์ž๋™์  ๋ฐ ํƒ„๋ ฅ์ ์œผ๋กœ ์ฟผ๋ฆฌ ์ฒ˜๋ฆฌ๋ ฅ์„ ํ™•์žฅํ•˜์—ฌ ์ˆ˜๋ฐฑ ๊ฐœ์˜ ๋™์‹œ ์ฟผ๋ฆฌ์— ๋Œ€ํ•ด ์ผ๊ด€๋˜๊ฒŒ ๋น ๋ฅธ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋™์‹œ์„ฑ์ด ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ, ๋Œ€๊ธฐ ์‹œ๊ฐ„ ์—†์ด ๋น ๋ฅด๊ฒŒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ์Šค์ผ€์ผ๋ง ๋ฆฌ์†Œ์Šค๊ฐ€ Amazon Redshift ํด๋Ÿฌ์Šคํ„ฐ์— ์ถ”๊ฐ€๋ฉ๋‹ˆ๋‹ค. ์ž‘์—… ๋ถ€ํ•˜ ์ˆ˜์š”๊ฐ€ ์ค„์–ด๋“ค๋ฉด, Amazon Redshift๋Š” ๋น„์šฉ์„ ์ ˆ์•ฝํ•˜๊ธฐ ์œ„ํ•ด ์ž๋™์œผ๋กœ ๋™์‹œ์„ฑ ์Šค์ผ€์ผ๋ง ๋ฆฌ์†Œ์Šค๋ฅผ ์ข…๋ฃŒํ•ฉ๋‹ˆ๋‹ค.

Data Lake Export

Amazon Redshift๋Š” ์ด์ œ ๋ถ„์„์„ ์œ„ํ•œ ํšจ์œจ์ ์ธ ์—ด ๊ธฐ๋ฐ˜ ์ €์žฅ ํ˜•์‹์ธ Apache Parquet ํ˜•์‹์œผ๋กœ S3์˜ ๋ฐ์ดํ„ฐ ๋ ˆ์ดํฌ์— ์ฟผ๋ฆฌ ๊ฒฐ๊ณผ๋ฅผ ์–ธ๋กœ๋“œํ•˜๋Š” ๊ฒƒ์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. Parquet ํ˜•์‹์€ ํ…์ŠคํŠธ ํ˜•์‹์— ๋น„ํ•ด ์–ธ๋กœ๋“œํ•˜๋Š” ๋ฐ ์ตœ๋Œ€ ๋‘ ๋ฐฐ ๋” ๋น ๋ฅด๋ฉฐ, S3์—์„œ ์ตœ๋Œ€ ์—ฌ์„ฏ ๋ฐฐ ๋” ์ ์€ ์ €์žฅ ๊ณต๊ฐ„์„ ์†Œ๋น„ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ํ•˜๋‚˜ ์ด์ƒ์˜ ํŒŒํ‹ฐ์…˜ ์—ด์„ ์ง€์ •ํ•˜์—ฌ ์–ธ๋กœ๋“œ๋œ ๋ฐ์ดํ„ฐ๊ฐ€ S3 ๋ฒ„ํ‚ท์˜ ํด๋”์— ์ž๋™์œผ๋กœ ํŒŒํ‹ฐ์…˜๋˜๋„๋ก ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋กœ ์ธํ•ด ์ฟผ๋ฆฌ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜๊ณ  ์–ธ๋กœ๋“œ๋œ ๋ฐ์ดํ„ฐ์˜ ๋‹ค์šด์ŠคํŠธ๋ฆผ ์†Œ๋น„ ๋น„์šฉ์ด ๊ฐ์†Œํ•ฉ๋‹ˆ๋‹ค.

Use Cases: ELT using Amazon Redshift

The following diagram shows how Redshift Spectrum allows you to simplify and accelerate your data processing pipeline from a four-step to a one-step process with the CTAS (Create Table As) command.

Use Cases: ETL using Amazon Redshift

As shown in the following diagram, once the transformed results are unloaded in S3, you then query the unloaded data from your data lake either using Redshift Spectrum if you have an existing Amazon Redshift cluster, Athena with its pay-per-use and serverless ad hoc and on-demand query model, AWS Glue and Amazon EMR for performing ETL operations on the unloaded data and data integration with your other datasets (such as ERP, finance, and third-party data) stored in your data lake, and Amazon SageMaker for machine learning.

Analyze requirements to decide ELT versus ETL

Amazon Redshift์˜ MPP ์•„ํ‚คํ…์ฒ˜์™€ Spectrum ๊ธฐ๋Šฅ์€ ๋Œ€๊ทœ๋ชจ ๊ด€๊ณ„ํ˜• ๋ฐ SQL ๊ธฐ๋ฐ˜ ELT ์›Œํฌ๋กœ๋“œ(์กฐ์ธ, ์ง‘๊ณ„)์— ๋Œ€ํ•ด ํšจ์œจ์ ์ด๋ฉฐ ์ž˜ ์„ค๊ณ„๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. Amazon Redshift๋ฅผ ์ด์šฉํ•˜์—ฌ ํšจ์œจ์ ์ธ ELT ์†”๋ฃจ์…˜์„ ์„ค๊ณ„ํ•˜๋ ค๋ฉด ๋‹ค์Œ์„ ์‹ ์ค‘ํ•˜๊ฒŒ ๊ณ ๋ คํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

  • Type of data from source systems (structured, semi-structured, and unstructured)

  • Nature of the transformations required (usually encompassing cleansing, enrichment, harmonization, transformations, and aggregations)

  • Row-by-row, cursor-based processing needs versus batch SQL

  • Performance SLA and scalability requirements considering the data volume growth over time

  • Cost of the solution

Reference (Original Content)

Last updated