Databricks Runtime 6.0 for ML (unsupported)

03/01/2024

Databricks released this image in October 2019.

Databricks Runtime 6.0 for Machine Learning provides a ready-to-use environment for machine learning and data science based on Databricks Runtime 6.0 (unsupported). Databricks Runtime ML contains many popular machine learning libraries, including TensorFlow, PyTorch, Keras, and XGBoost. It also supports distributed deep learning training with Horovod.

For more information, including instructions for creating a Databricks Runtime ML cluster, see AI and machine learning on Databricks.

New features

Databricks Runtime 6.0 ML is built on top of Databricks Runtime 6.0. For information on what's new in Databricks Runtime 6.0, see the Databricks Runtime 6.0 (unsupported) release notes.

Query MLflow experiment data at scale using the new MLflow Spark data source

The Spark data source for MLflow experiments now provides a standard API for loading MLflow experiment run data. This enables large-scale querying and analysis of MLflow experiment data using DataFrame APIs. For a given experiment, the DataFrame contains run_ids, metrics, parameters, tags, start_time, end_time, status, and the artifact_uri for artifacts. See MLflow experiment.
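
As a minimal sketch of the data source described above, the helper below loads the runs of one experiment into a DataFrame. The format name "mlflow-experiment" is the one used in the Databricks documentation for this data source; the helper name load_experiment_runs and its arguments are illustrative assumptions, and the code only works on a cluster where the data source is installed.

```python
def load_experiment_runs(spark, experiment_id):
    """Load the runs of one MLflow experiment as a Spark DataFrame.

    `spark` is an active SparkSession (predefined in Databricks notebooks).
    `experiment_id` is the ID of the MLflow experiment; it is passed to the
    data source as a string.
    """
    return spark.read.format("mlflow-experiment").load(str(experiment_id))


# Usage in a notebook (the experiment ID is a placeholder, not a real one):
#   runs = load_experiment_runs(spark, "<experiment-id>")
#   runs.filter("status = 'FINISHED'") \
#       .select("status", "start_time", "end_time", "artifact_uri") \
#       .show()
```

Because the result is an ordinary DataFrame, the usual filter, select, and aggregation operations apply to the run data.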

Improvements

Hyperopt general availability

Hyperopt on Azure Databricks is now generally available. Notable improvements since the public preview include support for MLflow logging on Spark workers, correct handling of PySpark broadcast variables, and a new guide on model selection using Hyperopt. We also fixed small bugs in log messages, error handling, and the UI, and made the documentation easier to read. For details, see the Hyperopt documentation.

We updated how Azure Databricks logs Hyperopt experiments so that you can now log a custom metric during a Hyperopt run by passing the metric to the mlflow.log_metric function (see log_metric). This is useful if you want to log custom metrics in addition to the loss, which is logged by default when the hyperopt.fmin function is called.
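
The following sketch illustrates the pattern: an objective function logs one custom metric next to the loss that it returns. The toy objective, the metric name abs_error, and the helper run_search are assumptions made for illustration. The logging function is passed in as a parameter so that the objective can be exercised without MLflow installed; inside run_search it is bound to mlflow.log_metric.

```python
def objective(x, log_metric=None):
    """Toy objective: squared distance from 3.0, which Hyperopt minimizes."""
    loss = (x - 3.0) ** 2
    if log_metric is not None:
        # Custom metric, logged in addition to the loss.
        log_metric("abs_error", abs(x - 3.0))
    # "ok" is the value of the constant hyperopt.STATUS_OK.
    return {"loss": loss, "status": "ok"}


def run_search(max_evals=16):
    """Run the search; needs hyperopt and mlflow, which are imported lazily."""
    import mlflow
    from hyperopt import fmin, hp, tpe

    return fmin(
        fn=lambda x: objective(x, log_metric=mlflow.log_metric),
        space=hp.uniform("x", -10.0, 10.0),
        algo=tpe.suggest,
        max_evals=max_evals,
    )
```

To distribute the trials across Spark workers, a hyperopt.SparkTrials instance can additionally be passed to fmin through its trials argument.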

MLflow

- Added MLflow Java client 1.2.0.
- MLflow is now promoted to a top-tier library.

Updated machine learning libraries

- Upgraded Horovod from 0.16.4 to 0.18.1.
- Upgraded MLflow from 1.0.0 to 1.2.0.

Upgraded the Anaconda distribution from 5.2.0 to 2019.03.

Removals

Databricks ML Model Export is removed. Use MLeap to import and export models instead.

In the Hyperopt library, the following properties of hyperopt.SparkTrials are removed:
- SparkTrials.successful_trials_count
- SparkTrials.failed_trials_count
- SparkTrials.cancelled_trials_count
- SparkTrials.total_trials_count
They are replaced by the following functions:
- SparkTrials.count_successful_trials()
- SparkTrials.count_failed_trials()
- SparkTrials.count_cancelled_trials()
- SparkTrials.count_total_trials()
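
Code that read the removed properties needs to call the replacement functions instead. The helper below is a small sketch of that migration; the name summarize_trials and the shape of the returned dictionary are illustrative assumptions, while the four method names are the ones listed above.

```python
def summarize_trials(trials):
    """Collect trial counts from a SparkTrials object.

    Uses the count_*_trials() functions that replace the removed
    *_trials_count properties.
    """
    return {
        "successful": trials.count_successful_trials(),  # was successful_trials_count
        "failed": trials.count_failed_trials(),          # was failed_trials_count
        "cancelled": trials.count_cancelled_trials(),    # was cancelled_trials_count
        "total": trials.count_total_trials(),            # was total_trials_count
    }
```

After a call to hyperopt.fmin with trials=spark_trials, summarize_trials(spark_trials) returns the four counts in one dictionary.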

System environment

The system environment in Databricks Runtime 6.0 ML differs from Databricks Runtime 6.0 as follows:

DBUtils: Does not contain the Library utility (dbutils.library) (legacy).

Libraries

The following sections list the libraries included in Databricks Runtime 6.0 ML that differ from those included in Databricks Runtime 6.0.

Top-tier libraries

Databricks Runtime 6.0 ML includes the following top-tier libraries:

Python libraries

Databricks Runtime 6.0 ML uses Conda for Python package management and includes many popular ML packages. The following section describes the Conda environment for Databricks Runtime 6.0 ML.

Python 3 on CPU clusters

name: databricks-ml
channels:
  - pytorch
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _py-xgboost-mutex=2.0=cpu_0
  - _tflow_select=2.3.0=mkl
  - absl-py=0.7.1=py37_0
  - asn1crypto=0.24.0=py37_0
  - astor=0.8.0=py37_0
  - backcall=0.1.0=py37_0
  - backports=1.0=py_2
  - bcrypt=3.1.6=py37h7b6447c_0
  - blas=1.0=mkl
  - boto=2.49.0=py37_0
  - boto3=1.9.162=py_0
  - botocore=1.12.163=py_0
  - c-ares=1.15.0=h7b6447c_1001
  - ca-certificates=2019.1.23=0
  - certifi=2019.3.9=py37_0
  - cffi=1.12.2=py37h2e261b9_1
  - chardet=3.0.4=py37_1003
  - click=7.0=py37_0
  - cloudpickle=0.8.0=py37_0
  - colorama=0.4.1=py37_0
  - configparser=3.7.4=py37_0
  - cryptography=2.6.1=py37h1ba5d50_0
  - cycler=0.10.0=py37_0
  - cython=0.29.6=py37he6710b0_0
  - decorator=4.4.0=py37_1
  - docutils=0.14=py37_0
  - entrypoints=0.3=py37_0
  - et_xmlfile=1.0.1=py37_0
  - flask=1.0.2=py37_1
  - freetype=2.9.1=h8a8886c_1
  - future=0.17.1=py37_0
  - gast=0.2.2=py37_0
  - gitdb2=2.0.5=py37_0
  - gitpython=2.1.11=py37_0
  - grpcio=1.16.1=py37hf8bcb03_1
  - gunicorn=19.9.0=py37_0
  - h5py=2.9.0=py37h7918eee_0
  - hdf5=1.10.4=hb1b8bf9_0
  - html5lib=1.0.1=py_0
  - icu=58.2=h9c2bf20_1
  - idna=2.8=py37_0
  - intel-openmp=2019.3=199
  - ipython=7.4.0=py37h39e3cac_0
  - ipython_genutils=0.2.0=py37_0
  - itsdangerous=1.1.0=py37_0
  - jdcal=1.4=py37_0
  - jedi=0.13.3=py37_0
  - jinja2=2.10=py37_0
  - jmespath=0.9.4=py_0
  - jpeg=9b=h024ee3a_2
  - keras=2.2.4=0
  - keras-applications=1.0.8=py_0
  - keras-base=2.2.4=py37_0
  - keras-preprocessing=1.1.0=py_1
  - kiwisolver=1.0.1=py37hf484d3e_0
  - krb5=1.16.1=h173b8e3_7
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=8.2.0=hdf63c60_1
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libpng=1.6.36=hbc83047_0
  - libpq=11.2=h20c2e04_0
  - libprotobuf=3.8.0=hd408876_0
  - libsodium=1.0.16=h1bed415_0
  - libstdcxx-ng=8.2.0=hdf63c60_1
  - libtiff=4.0.10=h2733197_2
  - libxgboost=0.90=he6710b0_0
  - libxml2=2.9.9=hea5a465_1
  - libxslt=1.1.33=h7d1a2b0_0
  - llvmlite=0.28.0=py37hd408876_0
  - lxml=4.3.2=py37hefd8a0e_0
  - mako=1.0.10=py_0
  - markdown=3.1.1=py37_0
  - markupsafe=1.1.1=py37h7b6447c_0
  - mkl=2019.3=199
  - mkl_fft=1.0.10=py37ha843d7b_0
  - mkl_random=1.0.2=py37hd81dba3_0
  - mock=3.0.5=py37_0
  - ncurses=6.1=he6710b0_1
  - networkx=2.2=py37_1
  - ninja=1.9.0=py37hfd86e86_0
  - nose=1.3.7=py37_2
  - numba=0.43.1=py37h962f231_0
  - numpy=1.16.2=py37h7e9f1db_0
  - numpy-base=1.16.2=py37hde5b4d6_0
  - olefile=0.46=py37_0
  - openpyxl=2.6.1=py37_1
  - openssl=1.1.1b=h7b6447c_1
  - pandas=0.24.2=py37he6710b0_0
  - paramiko=2.4.2=py37_0
  - parso=0.3.4=py37_0
  - pathlib2=2.3.3=py37_0
  - patsy=0.5.1=py37_0
  - pexpect=4.6.0=py37_0
  - pickleshare=0.7.5=py37_0
  - pillow=5.4.1=py37h34e0f95_0
  - pip=19.0.3=py37_0
  - ply=3.11=py37_0
  - prompt_toolkit=2.0.9=py37_0
  - protobuf=3.8.0=py37he6710b0_0
  - psutil=5.6.1=py37h7b6447c_0
  - psycopg2=2.7.6.1=py37h1ba5d50_0
  - ptyprocess=0.6.0=py37_0
  - py-xgboost=0.90=py37he6710b0_0
  - py-xgboost-cpu=0.90=py37_0
  - pyasn1=0.4.6=py_0
  - pycparser=2.19=py37_0
  - pygments=2.3.1=py37_0
  - pymongo=3.8.0=py37he6710b0_1
  - pynacl=1.3.0=py37h7b6447c_0
  - pyopenssl=19.0.0=py37_0
  - pyparsing=2.3.1=py37_0
  - pysocks=1.6.8=py37_0
  - python=3.7.3=h0371630_0
  - python-dateutil=2.8.0=py37_0
  - python-editor=1.0.4=py_0
  - pytorch-cpu=1.1.0=py3.7_cpu_0
  - pytz=2018.9=py37_0
  - pyyaml=5.1=py37h7b6447c_0
  - readline=7.0=h7b6447c_5
  - requests=2.21.0=py37_0
  - s3transfer=0.2.1=py37_0
  - scikit-learn=0.20.3=py37hd81dba3_0
  - scipy=1.2.1=py37h7c811a0_0
  - setuptools=40.8.0=py37_0
  - simplejson=3.16.0=py37h14c3975_0
  - singledispatch=3.4.0.3=py37_0
  - six=1.12.0=py37_0
  - smmap2=2.0.5=py37_0
  - sqlite=3.27.2=h7b6447c_0
  - sqlparse=0.3.0=py_0
  - statsmodels=0.9.0=py37h035aef0_0
  - tabulate=0.8.3=py37_0
  - tensorboard=1.13.1=py37hf484d3e_0
  - tensorflow=1.13.1=mkl_py37h54b294f_0
  - tensorflow-base=1.13.1=mkl_py37h7ce6ba3_0
  - tensorflow-estimator=1.13.0=py_0
  - tensorflow-mkl=1.13.1=h4fcabd2_0
  - termcolor=1.1.0=py37_1
  - tk=8.6.8=hbc83047_0
  - torchvision-cpu=0.3.0=py37_cuNone_1
  - tqdm=4.31.1=py37_1
  - traitlets=4.3.2=py37_0
  - urllib3=1.24.1=py37_0
  - virtualenv=16.0.0=py37_0
  - wcwidth=0.1.7=py37_0
  - webencodings=0.5.1=py37_1
  - websocket-client=0.56.0=py37_0
  - werkzeug=0.14.1=py37_0
  - wheel=0.33.1=py37_0
  - wrapt=1.11.1=py37h7b6447c_0
  - xz=5.2.4=h14c3975_4
  - yaml=0.1.7=had09818_2
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.3.7=h0b5b093_0
  - pip:
    - argparse==1.4.0
    - databricks-cli==0.9.0
    - docker==4.0.2
    - fusepy==2.0.4
    - gorilla==0.3.0
    - horovod==0.18.1
    - hyperopt==0.1.2.db8
    - matplotlib==3.0.3
    - mleap==0.8.1
    - mlflow==1.2.0
    - nose-exclude==0.5.0
    - pyarrow==0.13.0
    - querystring-parser==1.2.4
    - seaborn==0.9.0
    - tensorboardx==1.8
prefix: /databricks/conda/envs/databricks-ml
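
Each Conda entry in the listing above has the form name=version=build, and each pip entry has the form name==version. When comparing the listing against another environment, it can help to split the pins into their parts. The helpers below are a sketch for that purpose only; the function names are assumptions and are not part of Conda or Databricks.

```python
def parse_conda_pin(spec):
    """Split a Conda pin 'name=version=build' into (name, version, build).

    A leading list marker ('- ') and surrounding whitespace are ignored.
    Missing parts are returned as None.
    """
    parts = spec.strip().lstrip("-").strip().split("=")
    name = parts[0]
    version = parts[1] if len(parts) > 1 else None
    build = parts[2] if len(parts) > 2 else None
    return name, version, build


def parse_pip_pin(spec):
    """Split a pip pin 'name==version' into (name, version)."""
    name, _, version = spec.strip().lstrip("-").strip().partition("==")
    return name, version or None


print(parse_conda_pin("  - numpy=1.16.2=py37h7e9f1db_0"))
print(parse_pip_pin("    - hyperopt==0.1.2.db8"))
```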

Spark packages containing Python modules

Spark package	Python module	Version
graphframes	graphframes	0.7.0-db1-spark2.4
spark-deep-learning	sparkdl	1.5.0-db5-spark2.4
tensorframes	tensorframes	0.7.0-s_2.11

R libraries

The R libraries are identical to the R libraries in Databricks Runtime 6.0.

Java and Scala libraries (Scala 2.11 cluster)

In addition to the Java and Scala libraries in Databricks Runtime 6.0, Databricks Runtime 6.0 ML contains the following JARs:

Group ID	Artifact ID	Version
com.databricks	spark-deep-learning	1.5.0-db5-spark2.4
com.typesafe.akka	akka-actor_2.11	2.3.11
ml.combust.mleap	mleap-databricks-runtime_2.11	0.14.0
ml.dmlc	xgboost4j	0.90
ml.dmlc	xgboost4j-spark	0.90
org.graphframes	graphframes_2.11	0.7.0-db1-spark2.4
org.mlflow	mlflow-client	1.2.0
org.tensorflow	libtensorflow	1.13.1
org.tensorflow	libtensorflow_jni	1.13.1
org.tensorflow	spark-tensorflow-connector_2.11	1.13.1
org.tensorflow	tensorflow	1.13.1
org.tensorframes	tensorframes	0.7.0-s_2.11