DataHub

The Metadata Platform for the Modern Data Stack

LinkedIn's generalized metadata search & discovery tool.

Setting up DataHub with Quickstart

DataHub runs in Docker, so Docker must be installed beforehand.
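A minimal sketch of a pre-flight check for this prerequisite. It only assumes that the docker and python3 executables should be found on PATH; the helper name missing_commands is illustrative and not part of DataHub.

```python
import shutil


def missing_commands(commands, which=shutil.which):
    """Return the commands from `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


# Report what still needs to be installed before running quickstart.
missing = missing_commands(["docker", "python3"])
print("missing:", missing or "nothing")
```

Finding docker on PATH does not prove that the Docker daemon is running, so treat this as a first check only.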

Installing the DataHub CLI

python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub version

The exact output depends on your installation and runtime environment, but the version information should be printed as follows.

DataHub CLI version: 0.10.1
Python version: 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:52:10) 
[Clang 14.0.6 ]
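If the installation is verified from a script, the version line above can be extracted with a small parser. This is a sketch that assumes the output keeps the "DataHub CLI version:" prefix shown above; parse_cli_version is a hypothetical helper name.

```python
import re


def parse_cli_version(output):
    """Extract the version string from `datahub version` output, or None."""
    match = re.search(r"^DataHub CLI version:\s*(\S+)", output, flags=re.MULTILINE)
    return match.group(1) if match else None


sample = "DataHub CLI version: 0.10.1\nPython version: 3.9.15 | packaged by conda-forge"
print(parse_cli_version(sample))  # prints 0.10.1
```

To use it against the real CLI, pass in the stdout of subprocess.run(["datahub", "version"], capture_output=True, text=True).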

Running a DataHub instance locally

Use the datahub command installed above to set up a DataHub instance in your local environment.

datahub docker quickstart
datahub docker quickstart --arch m1 # use this form instead on Apple silicon
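The choice between the two commands can be sketched as a small helper. It assumes that a native Python on Apple silicon reports the system as "Darwin" and the machine as "arm64"; under Rosetta the machine may be reported as x86_64 instead. The function name quickstart_command is illustrative.

```python
import platform


def quickstart_command(system=None, machine=None):
    """Build the quickstart argv, adding --arch m1 on Apple silicon."""
    system = system or platform.system()
    machine = machine or platform.machine()
    command = ["datahub", "docker", "quickstart"]
    if system == "Darwin" and machine == "arm64":
        # Flag taken from the commands above; only added for Apple silicon.
        command += ["--arch", "m1"]
    return command


print(" ".join(quickstart_command()))
```

The returned list can be passed to subprocess.run as-is.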

Changing the settings of a Quickstart deployment

As the official documentation quoted below states, the DataHub deployment uses docker-compose.

This will deploy a DataHub instance using docker-compose. If you are curious, the docker-compose.yaml file is downloaded to your home directory under the .datahub/quickstart/ directory.

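Once the quickstart command has run, the docker-compose.yaml file exists at the path above and contains the settings for each DataHub component. You could change the settings by editing that file and then running the quickstart command, but where possible it is better to use the method the official documentation provides.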

# ~/.datahub
networks:
  default:
    name: datahub_network
services:
  broker:
    container_name: broker
    depends_on:
      - zookeeper
    environment:
      - KAFKA_BROKER_ID=1
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
      - KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0
      - KAFKA_HEAP_OPTS=-Xms256m -Xmx256m
      - KAFKA_CONFLUENT_SUPPORT_METRICS_ENABLE=false
    hostname: broker
    image: confluentinc/cp-kafka:7.2.2
    ports:
      - ${DATAHUB_MAPPED_KAFKA_BROKER_PORT:-9092}:9092
  datahub-actions:
    depends_on:
      - datahub-gms
    environment:
      - DATAHUB_GMS_HOST=datahub-gms
      - DATAHUB_GMS_PORT=8080
      - DATAHUB_GMS_PROTOCOL=http
      - DATAHUB_SYSTEM_CLIENT_ID=__datahub_syste
      ...
      ...
      ...

How to change the web dashboard port (9002)

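The datahub docker quickstart command provides flags for changing the ports of some containers, but it has no flag for the web dashboard port. You can change that port by editing part of the docker-compose.yml file and redeploying. Edit the port in the datahub-frontend-react section of the YAML as shown below.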

  • 9002 --> 8895

# Edit ~/.datahub/quickstart/docker-compose.yml directly, or make a copy and edit that
datahub-frontend-react:
    container_name: datahub-frontend-react
    depends_on:
    - datahub-gms
    environment:
    - DATAHUB_GMS_HOST=datahub-gms
    - DATAHUB_GMS_PORT=8080
    - DATAHUB_SECRET=YouKnowNothing
    - DATAHUB_APP_VERSION=1.0
    - DATAHUB_PLAY_MEM_BUFFER_SIZE=10MB
    - JAVA_OPTS=-Xms512m -Xmx512m -Dhttp.port=9002 -Dconfig.file=datahub-frontend/conf/application.conf
      -Djava.security.auth.login.config=datahub-frontend/conf/jaas.conf -Dlogback.configurationFile=datahub-frontend/conf/logback.xml
      -Dlogback.debug=false -Dpidfile.path=/dev/null
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
    - DATAHUB_TRACKING_TOPIC=DataHubUsageEvent_v1
    - ELASTIC_CLIENT_HOST=elasticsearch
    - ELASTIC_CLIENT_PORT=9200
    hostname: datahub-frontend-react
    image: ${DATAHUB_FRONTEND_IMAGE:-linkedin/datahub-frontend-react}:${DATAHUB_VERSION:-head}
    ports:
    # - ${DATAHUB_MAPPED_FRONTEND_PORT:-9002}:9002  # before change
    - ${DATAHUB_MAPPED_FRONTEND_PORT:-8895}:9002  # after change
    volumes:
    - ${HOME}/.datahub/plugins:/etc/datahub/plugins
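The port line uses Docker Compose's ${NAME:-default} interpolation, so the number being edited is only the fallback used when DATAHUB_MAPPED_FRONTEND_PORT is unset or empty. The sketch below imitates that one rule to show how the mapping resolves; resolve_defaults is a hypothetical helper and does not cover the other interpolation forms Compose supports.

```python
import re

_DEFAULT_PATTERN = re.compile(r"\$\{(\w+):-([^}]*)\}")


def resolve_defaults(text, env):
    """Expand ${NAME:-default}: use env[NAME] if set and non-empty, else the default."""
    def replace(match):
        name, default = match.group(1), match.group(2)
        return env.get(name) or default
    return _DEFAULT_PATTERN.sub(replace, text)


mapping = "${DATAHUB_MAPPED_FRONTEND_PORT:-8895}:9002"
print(resolve_defaults(mapping, {}))  # prints 8895:9002
print(resolve_defaults(mapping, {"DATAHUB_MAPPED_FRONTEND_PORT": "9010"}))  # prints 9010:9002
```

This suggests that exporting DATAHUB_MAPPED_FRONTEND_PORT before running quickstart may also change the host port without editing the file, although that path is not verified here.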

If DataHub is already deployed in Docker, stop its containers with the following command.

datahub docker quickstart --stop

Run quickstart with the modified yml file.

datahub docker quickstart -f ~/.datahub/quickstart/docker-compose-custom.yml
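The custom file passed to -f could also be generated rather than edited by hand. The following is a rough sketch that rewrites only the default host port of the frontend mapping; it assumes the line has the ${DATAHUB_MAPPED_FRONTEND_PORT:-9002}:9002 form shown earlier, and with_frontend_host_port is an illustrative name.

```python
import re


def with_frontend_host_port(compose_text, new_port):
    """Return compose_text with the frontend's default host port replaced."""
    pattern = r"(\$\{DATAHUB_MAPPED_FRONTEND_PORT:-)\d+(\}:9002)"
    updated, count = re.subn(pattern, rf"\g<1>{new_port}\g<2>", compose_text)
    if count == 0:
        # Fail loudly instead of silently writing an unchanged copy.
        raise ValueError("frontend port mapping not found")
    return updated


line = "- ${DATAHUB_MAPPED_FRONTEND_PORT:-9002}:9002"
print(with_frontend_host_port(line, 8895))  # prints - ${DATAHUB_MAPPED_FRONTEND_PORT:-8895}:9002
```

In practice you would read docker-compose.yml with pathlib.Path.read_text, pass the text through this function, and write the result to docker-compose-custom.yml so the downloaded original stays untouched.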

Check the port.

  • 0.0.0.0:8895->9002/tcp

(base) ➜  ~ docker ps                                                                   
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS                    PORTS                                        NAMES
c5242ec43ac8   linkedin/datahub-frontend-react:v0.10.0   "/bin/sh -c ./start.…"   18 minutes ago   Up 18 minutes (healthy)   0.0.0.0:8895->9002/tcp                       datahub-frontend-react
1e22b7a66002   acryldata/datahub-actions:head            "/bin/sh -c 'dockeri…"   38 minutes ago   Up 18 minutes                                                          datahub-datahub-actions-1
13f11a194bd9   confluentinc/cp-schema-registry:7.2.2     "/etc/confluent/dock…"   57 minutes ago   Up 18 minutes             0.0.0.0:8081->8081/tcp                       schema-registry
dd9637420312   confluentinc/cp-kafka:7.2.2               "/etc/confluent/dock…"   57 minutes ago   Up 18 minutes             0.0.0.0:9092->9092/tcp                       broker
920d1ab8147e   linkedin/datahub-gms:v0.10.0              "/bin/sh -c /datahub…"   57 minutes ago   Up 18 minutes (healthy)   0.0.0.0:8080->8080/tcp                       datahub-gms
c6935472793a   elasticsearch:7.10.1                      "/tini -- /usr/local…"   57 minutes ago   Up 18 minutes (healthy)   0.0.0.0:9200->9200/tcp, 9300/tcp             elasticsearch
6f6e387edeb5   confluentinc/cp-zookeeper:7.2.2           "/etc/confluent/dock…"   57 minutes ago   Up 18 minutes             2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp   zookeeper
ada557edf61d   mariadb:10.5.8                            "docker-entrypoint.s…"   57 minutes ago   Up 18 minutes             0.0.0.0:3306->3306/tcp                       mysql
462c72836930   kindest/node:v1.25.3                      "/usr/local/bin/entr…"   2 hours ago      Up 2 hours                127.0.0.1:54067->6443/tcp                    kind-control-plane
(base) ➜  ~ 
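To check the mapping from a script instead of reading the table, the PORTS column can be parsed. This sketch assumes the single-port "host:port->port/proto" form seen in the output above and ignores port ranges; published_ports is a hypothetical helper.

```python
import re


def published_ports(ports_column):
    """Return (host_port, container_port) pairs from a `docker ps` PORTS value."""
    pairs = re.findall(r":(\d+)->(\d+)/(?:tcp|udp)", ports_column)
    return [(int(host), int(container)) for host, container in pairs]


print(published_ports("0.0.0.0:8895->9002/tcp"))  # prints [(8895, 9002)]
print(published_ports("2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp"))  # prints [(2181, 2181)]
```

Entries such as 2888/tcp are exposed inside the Docker network but not published to the host, so they produce no pair.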

Handling port conflicts

According to the official documentation, the quickstart version of DataHub uses the following ports by default.

  • 3306 for MySQL

  • 9200 for Elasticsearch

  • 9092 for the Kafka broker

  • 8081 for Schema Registry

  • 2181 for ZooKeeper

  • 9002 for the DataHub Web Application (datahub-frontend)

  • 8080 for the DataHub Metadata Service (datahub-gms)
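Before starting quickstart, you can check which of these ports are already taken. The sketch below simply tries a TCP connection to each port on 127.0.0.1; the names in DEFAULT_PORTS are labels chosen for this example, and the helper names are illustrative.

```python
import socket

# Default ports from the list above.
DEFAULT_PORTS = {
    "MySQL": 3306,
    "Elasticsearch": 9200,
    "Kafka broker": 9092,
    "Schema Registry": 8081,
    "ZooKeeper": 2181,
    "datahub-frontend": 9002,
    "datahub-gms": 8080,
}


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def conflicting_ports(ports=DEFAULT_PORTS, check=port_in_use):
    """Return the subset of `ports` that is already taken on this machine."""
    return {name: port for name, port in ports.items() if check(port)}


print(conflicting_ports())
```

If DataHub is already running, its own containers will show up in the result, so run the check after stopping them.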

To change the port of MySQL, ZooKeeper, the Kafka broker, Schema Registry, or Elasticsearch when running quickstart, pass the corresponding flag as follows.

datahub docker quickstart --mysql-port 53306
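When several ports need to change at once, the command line can be assembled from a mapping. The flag names below are the port flags listed in the command's help output; the service keys and the function name quickstart_with_ports are made up for this example.

```python
# Port flags as listed by `datahub docker quickstart --help`.
PORT_FLAGS = {
    "mysql": "--mysql-port",
    "zookeeper": "--zk-port",
    "kafka": "--kafka-broker-port",
    "schema-registry": "--schema-registry-port",
    "elasticsearch": "--elastic-port",
}


def quickstart_with_ports(overrides):
    """Build the quickstart argv with one port flag per overridden service."""
    command = ["datahub", "docker", "quickstart"]
    for service, port in overrides.items():
        if service not in PORT_FLAGS:
            raise KeyError(f"no port flag for {service!r}")
        command += [PORT_FLAGS[service], str(port)]
    return command


print(" ".join(quickstart_with_ports({"mysql": 53306})))  # prints datahub docker quickstart --mysql-port 53306
```

Services without a flag, such as the web dashboard, raise a KeyError here because they have to be changed through the compose file instead.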

For the full list of options, see the help output of the datahub docker quickstart command.

(base) ➜  quickstart datahub docker quickstart --help
Usage: datahub docker quickstart [OPTIONS]

  Start an instance of DataHub locally using docker-compose.

  This command will automatically download the latest docker-compose configuration from GitHub, pull the latest
  images, and bring up the DataHub system. There are options to override the docker-compose config file, build the
  containers locally, and dump logs to the console or to a file if something goes wrong.

Options:
  --version TEXT                  Datahub version to be deployed. If not set, deploy using the defaults from the
                                  quickstart compose. Use 'stable' to start the latest stable version.
  --build-locally                 Attempt to build the containers locally before starting
  --pull-images / --no-pull-images
                                  Attempt to pull the containers from Docker Hub before starting
  -f, --quickstart-compose-file FILE
                                  Use a local docker-compose file instead of pulling from GitHub
  --dump-logs-on-failure          If true, the docker-compose logs will be printed to console if something fails
  --graph-service-impl TEXT       If set, forces docker-compose to use that graph service implementation
  --mysql-port POSITIVEINT        If there is an existing mysql instance running on port 3306, set this to a free port
                                  to avoid port conflicts on startup
  --zk-port POSITIVEINT           If there is an existing zookeeper instance running on port 2181, set this to a free
                                  port to avoid port conflicts on startup
  --kafka-broker-port POSITIVEINT
                                  If there is an existing Kafka broker running on port 9092, set this to a free port
                                  to avoid port conflicts on startup
  --schema-registry-port POSITIVEINT
                                  If there is an existing process running on port 8081, set this to a free port to
                                  avoid port conflicts with Kafka schema registry on startup
  --elastic-port POSITIVEINT      If there is an existing Elasticsearch instance running on port 9092, set this to a
                                  free port to avoid port conflicts on startup
  --stop                          Use this flag to stop the running containers
  --backup                        Run this to backup a running quickstart instance
  --backup-file FILE              Run this to backup data from a running quickstart instance  [default:
                                  /Users/ghlee/.datahub/quickstart/backup.sql]
  --restore                       Run this to restore a running quickstart instance from a previous backup (see
                                  --backup)
  --restore-file TEXT             Set this to provide a custom restore file
  --restore-indices               Enable the restoration of indices of a running quickstart instance. Note: Using
                                  --restore will automatically restore-indices unless you use the --no-restore-indices
                                  flag.
  --no-restore-indices            Disables the restoration of indices of a running quickstart instance when used in
                                  conjunction with --restore.
  --standalone_consumers          Launches MAE & MCE consumers as stand alone docker containers
  --kafka-setup                   Launches Kafka setup job as part of the compose deployment
  --arch TEXT                     Specify the architecture for the quickstart images to use. Options are x86, arm64,
                                  m1 etc.
  --help                          Show this message and exit.

Stopping DataHub

To stop DataHub's quickstart, you can issue the following command.

datahub docker quickstart --stop

Customization

If you would like to customize the DataHub installation further, please download the docker-compose.yaml used by the cli tool, modify it as necessary and deploy DataHub by passing the downloaded docker-compose file:

datahub docker quickstart --quickstart-compose-file <path to compose file>

Further reading

SOCAR describes how it built a data catalog application with DataHub on Kubernetes (K8s) rather than on Docker.

Banksalad describes how it came to adopt DataHub as its Data Discovery Platform.
