✅DataHub
The Metadata Platform for the Modern Data Stack

Setting up DataHub with Quickstart
DataHub runs in Docker, so Docker must be installed beforehand.
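A quick preflight check can confirm the prerequisites before installing the CLI. The sketch below assumes a POSIX shell; `require_cmd` is a hypothetical helper name used for illustration, not part of Docker or the DataHub CLI.

```shell
# Hypothetical helper: succeeds only if the named command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing prerequisite: $1" >&2
  return 1
}

# Example preflight (docker for the containers, python3 for the pip install):
#   require_cmd docker && require_cmd python3 && echo "prerequisites found"
```

This only checks that the binaries exist. The Docker daemon must also be running; `docker info` should succeed before you start the quickstart.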
Installing the DataHub CLI
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub version
The exact output depends on your installation and runtime environment, but the version information should print as follows.
DataHub CLI version: 0.10.1
Python version: 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:52:10)
[Clang 14.0.6 ]
Running a DataHub instance locally
Use the datahub command installed above to set up a DataHub instance in your local environment.
datahub docker quickstart
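datahub docker quickstart --arch m1 # Apple silicon
Changing the DataHub settings configured by Quickstart
As the official documentation confirms, the DataHub quickstart deployment uses docker-compose, as shown below.
Once the datahub CLI has been installed and quickstart has been run, a docker-compose.yaml file exists under ~/.datahub and contains the settings for the DataHub components. You could change the settings by editing that file and then running the quickstart command, but where possible it is better to use the methods the official documentation provides.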
# ~/.datahub
networks:
  default:
    name: datahub_network
services:
  broker:
    container_name: broker
    depends_on:
      - zookeeper
    environment:
      - KAFKA_BROKER_ID=1
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
      - KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0
      - KAFKA_HEAP_OPTS=-Xms256m -Xmx256m
      - KAFKA_CONFLUENT_SUPPORT_METRICS_ENABLE=false
    hostname: broker
    image: confluentinc/cp-kafka:7.2.2
    ports:
      - ${DATAHUB_MAPPED_KAFKA_BROKER_PORT:-9092}:9092
  datahub-actions:
    depends_on:
      - datahub-gms
    environment:
      - DATAHUB_GMS_HOST=datahub-gms
      - DATAHUB_GMS_PORT=8080
      - DATAHUB_GMS_PROTOCOL=http
      - DATAHUB_SYSTEM_CLIENT_ID=__datahub_syste
...
...
...
How to change the web dashboard port (9002)
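The datahub docker quickstart command has options for changing the ports of several containers, but it has no option for the web dashboard port. You can change that port by editing the docker-compose.yml file and redeploying. Modify the port mapping in the datahub-frontend-react service definition as shown below.
Host port change: 9002 --> 8895
# Edit ~/.datahub/quickstart/docker-compose.yml directly, or make a copy and edit the copy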
datahub-frontend-react:
  container_name: datahub-frontend-react
  depends_on:
    - datahub-gms
  environment:
    - DATAHUB_GMS_HOST=datahub-gms
    - DATAHUB_GMS_PORT=8080
    - DATAHUB_SECRET=YouKnowNothing
    - DATAHUB_APP_VERSION=1.0
    - DATAHUB_PLAY_MEM_BUFFER_SIZE=10MB
    - JAVA_OPTS=-Xms512m -Xmx512m -Dhttp.port=9002 -Dconfig.file=datahub-frontend/conf/application.conf
      -Djava.security.auth.login.config=datahub-frontend/conf/jaas.conf -Dlogback.configurationFile=datahub-frontend/conf/logback.xml
      -Dlogback.debug=false -Dpidfile.path=/dev/null
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
    - DATAHUB_TRACKING_TOPIC=DataHubUsageEvent_v1
    - ELASTIC_CLIENT_HOST=elasticsearch
    - ELASTIC_CLIENT_PORT=9200
  hostname: datahub-frontend-react
  image: ${DATAHUB_FRONTEND_IMAGE:-linkedin/datahub-frontend-react}:${DATAHUB_VERSION:-head}
  ports:
    # - ${DATAHUB_MAPPED_FRONTEND_PORT:-9002}:9002 # before change
    - ${DATAHUB_MAPPED_FRONTEND_PORT:-8895}:9002 # after change
  volumes:
    - ${HOME}/.datahub/plugins:/etc/datahub/plugins
If DataHub is already running in your Docker environment, stop its containers with the following command.
datahub docker quickstart --stop
Run quickstart again with the modified yml file.
datahub docker quickstart -f ~/.datahub/quickstart/docker-compose-custom.yml
Check the port mapping of the frontend container.
0.0.0.0:8895->9002/tcp
(base) ➜ ~ docker ps
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS                    PORTS                                        NAMES
c5242ec43ac8   linkedin/datahub-frontend-react:v0.10.0   "/bin/sh -c ./start.…"   18 minutes ago   Up 18 minutes (healthy)   0.0.0.0:8895->9002/tcp                       datahub-frontend-react
1e22b7a66002   acryldata/datahub-actions:head            "/bin/sh -c 'dockeri…"   38 minutes ago   Up 18 minutes                                                          datahub-datahub-actions-1
13f11a194bd9   confluentinc/cp-schema-registry:7.2.2     "/etc/confluent/dock…"   57 minutes ago   Up 18 minutes             0.0.0.0:8081->8081/tcp                       schema-registry
dd9637420312   confluentinc/cp-kafka:7.2.2               "/etc/confluent/dock…"   57 minutes ago   Up 18 minutes             0.0.0.0:9092->9092/tcp                       broker
920d1ab8147e   linkedin/datahub-gms:v0.10.0              "/bin/sh -c /datahub…"   57 minutes ago   Up 18 minutes (healthy)   0.0.0.0:8080->8080/tcp                       datahub-gms
c6935472793a   elasticsearch:7.10.1                      "/tini -- /usr/local…"   57 minutes ago   Up 18 minutes (healthy)   0.0.0.0:9200->9200/tcp, 9300/tcp             elasticsearch
6f6e387edeb5   confluentinc/cp-zookeeper:7.2.2           "/etc/confluent/dock…"   57 minutes ago   Up 18 minutes             2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp   zookeeper
ada557edf61d   mariadb:10.5.8                            "docker-entrypoint.s…"   57 minutes ago   Up 18 minutes             0.0.0.0:3306->3306/tcp                       mysql
462c72836930   kindest/node:v1.25.3                      "/usr/local/bin/entr…"   2 hours ago      Up 2 hours                127.0.0.1:54067->6443/tcp                    kind-control-plane
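The steps above (copy the compose file, change the default host port, restart, verify) can be sketched as a small shell helper. This is a minimal sketch rather than an official DataHub procedure: `make_custom_compose` is a hypothetical function name, and it assumes the frontend port mapping in your compose file is written as `${DATAHUB_MAPPED_FRONTEND_PORT:-9002}:9002`, as in the listing above.

```shell
# Hypothetical helper: write a copy of a compose file in which the default
# host port of the frontend mapping (9002) is replaced by a new one.
# Usage: make_custom_compose SOURCE_FILE DEST_FILE NEW_PORT
make_custom_compose() {
  # Only the default after ":-" changes; the container port stays 9002.
  sed "s/DATAHUB_MAPPED_FRONTEND_PORT:-9002/DATAHUB_MAPPED_FRONTEND_PORT:-$3/" "$1" > "$2"
}

# Example run, using the paths and commands from this page:
#   cd ~/.datahub/quickstart
#   make_custom_compose docker-compose.yml docker-compose-custom.yml 8895
#   datahub docker quickstart --stop
#   datahub docker quickstart -f ~/.datahub/quickstart/docker-compose-custom.yml
#   docker ps    # the frontend row should show 0.0.0.0:8895->9002/tcp
```

Writing the result to a separate file leaves the original docker-compose.yml untouched, so the stock configuration stays available for comparison.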
Handling port conflicts
According to the official documentation, the quickstart version of DataHub uses the following ports by default.
3306 for MySQL
9200 for Elasticsearch
9092 for the Kafka broker
8081 for Schema Registry
2181 for ZooKeeper
9002 for the DataHub Web Application (datahub-frontend)
8080 for the DataHub Metadata Service (datahub-gms)
If you want to change the ports of these containers when setting up quickstart, you can pass a flag such as the following.
datahub docker quickstart --mysql-port 53306
See the help output of the datahub docker quickstart command for the full list of options.
(base) ➜ quickstart datahub docker quickstart --help
Usage: datahub docker quickstart [OPTIONS]

  Start an instance of DataHub locally using docker-compose.

  This command will automatically download the latest docker-compose configuration from GitHub, pull the latest
  images, and bring up the DataHub system. There are options to override the docker-compose config file, build the
  containers locally, and dump logs to the console or to a file if something goes wrong.

Options:
  --version TEXT                  Datahub version to be deployed. If not set, deploy using the defaults from the
                                  quickstart compose. Use 'stable' to start the latest stable version.
  --build-locally                 Attempt to build the containers locally before starting
  --pull-images / --no-pull-images
                                  Attempt to pull the containers from Docker Hub before starting
  -f, --quickstart-compose-file FILE
                                  Use a local docker-compose file instead of pulling from GitHub
  --dump-logs-on-failure          If true, the docker-compose logs will be printed to console if something fails
  --graph-service-impl TEXT       If set, forces docker-compose to use that graph service implementation
  --mysql-port POSITIVEINT        If there is an existing mysql instance running on port 3306, set this to a free port
                                  to avoid port conflicts on startup
  --zk-port POSITIVEINT           If there is an existing zookeeper instance running on port 2181, set this to a free
                                  port to avoid port conflicts on startup
  --kafka-broker-port POSITIVEINT
                                  If there is an existing Kafka broker running on port 9092, set this to a free port
                                  to avoid port conflicts on startup
  --schema-registry-port POSITIVEINT
                                  If there is an existing process running on port 8081, set this to a free port to
                                  avoid port conflicts with Kafka schema registry on startup
  --elastic-port POSITIVEINT      If there is an existing Elasticsearch instance running on port 9092, set this to a
                                  free port to avoid port conflicts on startup
  --stop                          Use this flag to stop the running containers
  --backup                        Run this to backup a running quickstart instance
  --backup-file FILE              Run this to backup data from a running quickstart instance [default:
                                  /Users/ghlee/.datahub/quickstart/backup.sql]
  --restore                       Run this to restore a running quickstart instance from a previous backup (see
                                  --backup)
  --restore-file TEXT             Set this to provide a custom restore file
  --restore-indices               Enable the restoration of indices of a running quickstart instance. Note: Using
                                  --restore will automatically restore-indices unless you use the --no-restore-indices
                                  flag.
  --no-restore-indices            Disables the restoration of indices of a running quickstart instance when used in
                                  conjunction with --restore.
  --standalone_consumers          Launches MAE & MCE consumers as stand alone docker containers
  --kafka-setup                   Launches Kafka setup job as part of the compose deployment
  --arch TEXT                     Specify the architecture for the quickstart images to use. Options are x86, arm64,
                                  m1 etc.
  --help                          Show this message and exit.
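When several default ports are already taken, the port flags listed in the help output can be combined in one invocation. The sketch below only prints such a command. The function name `quickstart_cmd_with_offset` and the idea of shifting every default by a fixed offset are assumptions made for illustration, not DataHub conventions. With an offset of 50000, MySQL lands on 53306, the same value used in the single-flag example earlier.

```shell
# Hypothetical helper: print a quickstart command that shifts each remappable
# default port (MySQL, ZooKeeper, Kafka, Schema Registry, Elasticsearch) by OFFSET.
# Keep OFFSET small enough that every result stays at or below 65535.
# Usage: quickstart_cmd_with_offset OFFSET
quickstart_cmd_with_offset() {
  offset=$1
  printf 'datahub docker quickstart --mysql-port %d --zk-port %d --kafka-broker-port %d --schema-registry-port %d --elastic-port %d\n' \
    "$((3306 + offset))" "$((2181 + offset))" "$((9092 + offset))" \
    "$((8081 + offset))" "$((9200 + offset))"
}

# Print the command for review; run it yourself once the ports look right.
quickstart_cmd_with_offset 50000
```

The web application (9002) and the metadata service (8080) are left out because the help output lists no port flag for them.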
Stopping DataHub
To stop DataHub's quickstart, you can issue the following command.
datahub docker quickstart --stop
Customization
If you would like to customize the DataHub installation further, please download the docker-compose.yaml used by the cli tool, modify it as necessary and deploy DataHub by passing the downloaded docker-compose file:
datahub docker quickstart --quickstart-compose-file <path to compose file>
Further reading
SOCAR describes how it built a data catalog application with DataHub on Kubernetes rather than on Docker.
Banksalad describes how it came to adopt DataHub as its Data Discovery Platform.