Monitor Your Databricks Workspace with Audit Logs

Cloud computing has fundamentally changed how companies operate: users are no longer subject to the restrictions of on-premises hardware deployments, such as hard limits on resources and onerous environment upgrade processes. With the convenience and flexibility of cloud services come challenges in properly monitoring how your users utilize these conveniently available resources. Failure to do so could result in problematic and costly anti-patterns (with both cloud provider core resources and a PaaS like Databricks). Databricks is cloud-native by design and thus tightly coupled with the public cloud providers, such as Microsoft Azure and Amazon Web Services, fully taking advantage of this new paradigm, and the audit logs capability provides administrators a centralized way to understand and govern activity happening on the platform. Administrators could use Databricks audit logs to monitor patterns like the number of clusters or jobs created in a given day, the users who performed those actions, and any users who were denied authorization into the workspace.

In the first blog post of the series, Trust but Verify with Databricks, we covered how Databricks admins could use Databricks audit logs and other cloud provider logs as complementary solutions for their cloud monitoring scenarios. The main purpose of Databricks audit logs is to allow enterprise security teams and platform administrators to track access to data and workspace resources using the various interfaces available in the Databricks platform. In this article, we will cover, in detail, how those personas could process and analyze the audit logs to track resource usage and identify potentially costly anti-patterns.

Audit Logs ETL Pattern

Databricks Audit Log ETL Design and Workflow.

Databricks delivers audit logs for all enabled workspaces, as per the delivery SLA, in JSON format to a customer-owned AWS S3 bucket. These audit logs contain events for specific actions related to primary resources like clusters, jobs, and the workspace. To simplify delivery and further analysis by the customers, Databricks logs each event for every action as a separate record and stores all the relevant parameters into a sparse StructType called requestParams.
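To make the sparse requestParams idea concrete, here is an illustrative record shape only, not a real audit log event; the field names are limited to the ones referenced later in this post, and actual records carry additional fields.

# Illustrative only -- not an actual audit log record. Field names are restricted
# to those referenced elsewhere in this post; values are made up.
example_event = {
    "timestamp": 1577500800000,                      # UNIX epoch in milliseconds
    "date": "2019-12-28",
    "serviceName": "clusters",
    "actionName": "create",
    "userIdentity": {"email": "someone@example.com"},
    "requestParams": {                               # sparse: only keys relevant to this action
        "cluster_name": "job-31303-run-1",
        "cluster_creator": "JOB_LAUNCHER",
        "autotermination_minutes": 0,
    },
}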

In order to make this information more accessible, we recommend an ETL process based on Structured Streaming and Delta Lake.

Databricks ETL process utilizing structured streaming and Delta Lake.

  • Utilizing Structured Streaming allows us to:
    • Leave state management to a construct that's purpose-built for state management. Rather than having to reason about how much time has elapsed since our previous run to ensure that we're only adding the proper records, we can utilize Structured Streaming's checkpoints and write-ahead log to ensure that we're only processing the newly added audit log files. We can design our streaming queries as triggerOnce daily jobs, which are like pseudo-batch jobs
  • Utilizing Delta Lake allows us to do the following:
    • Gracefully handle schema evolution, specifically with regards to the requestParams field, which may have new StructFields based on new actions tracked in the audit logs
    • Easily utilize table-to-table streams
    • Take advantage of specific performance optimizations like OPTIMIZE to maximize read performance

For reference, this is the medallion reference architecture that Databricks recommends:

Databricks' medallion architecture for the ETL process

Bronze: the initial landing zone for the pipeline. We recommend copying data that's as close to its raw form as possible to easily replay the whole pipeline from the beginning, if needed

Silver: the raw data gets cleaned (think data quality checks), transformed and potentially enriched with external data sets

Gold: production-grade data that your entire company can rely on for business intelligence, descriptive statistics, and data science / machine learning

Following our own medallion architecture, we break it out as follows for our audit logs ETL design:

Raw Data to Bronze Table

Stream from the raw JSON files that Databricks delivers using a file-based Structured Stream to a bronze Delta Lake table. This creates a durable copy of the raw data that allows us to replay our ETL, should we find any issues in downstream tables.

Databricks raw data to Bronze Table ETL process, which creates a durable copy of the raw data and allows ETL replay to troubleshoot table issues downstream.

Databricks delivers audit logs to a customer-specified AWS S3 bucket in the form of JSON. Rather than writing logic to determine the state of our Delta Lake tables, we're going to utilize Structured Streaming's write-ahead logs and checkpoints to maintain the state of our tables. In this case, we've designed our ETL to run once per day, so we're using a file source with triggerOnce to simulate a batch workload with a streaming framework. Since Structured Streaming requires that we explicitly define the schema, we'll read the raw JSON files once to build it.
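The PySpark snippets that follow reference a few imports and two bucket paths without showing them. A minimal, hypothetical setup might look like this; the sourceBucket and sinkBucket values are placeholders you would replace with your own paths, not values from the original post.

# Hypothetical setup assumed by the snippets below; replace the bucket paths with your own.
import json

from pyspark.sql.functions import col, from_unixtime, from_utc_timestamp, udf
from pyspark.sql.types import StringType

sourceBucket = "s3a://<your-audit-log-delivery-bucket>"   # where Databricks delivers the JSON logs
sinkBucket = "s3a://<your-etl-output-bucket>"             # where the bronze/silver/gold tables will live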

            
streamSchema = spark.read.json(sourceBucket).schema          

We'll then instantiate our StreamReader using the schema we inferred and the path to the raw audit logs.

            
streamDF = (
    spark
    .readStream
    .format("json")
    .schema(streamSchema)
    .load(sourceBucket)
)

We then instantiate our StreamWriter and write out the raw audit logs into a bronze Delta Lake table that's partitioned by date.

            
(streamDF
 .writeStream
 .format("delta")
 .partitionBy("date")
 .outputMode("append")
 .option("checkpointLocation", "{}/checkpoints/bronze".format(sinkBucket))
 .option("path", "{}/streaming/bronze".format(sinkBucket))
 .option("mergeSchema", True)
 .trigger(once=True)
 .start()
)

Now that we've created the table on an AWS S3 bucket, we'll need to register the table to the Databricks Hive metastore to make access to the data easier for end users. We'll create the logical database audit_logs, before creating the Bronze table.

            
spark.sql("CREATE DATABASE IF NOT EXISTS audit_logs")

spark.sql("""
CREATE TABLE IF NOT EXISTS audit_logs.bronze
USING DELTA
LOCATION '{}/streaming/bronze'
""".format(sinkBucket))

If you update your Delta Lake tables in batch or pseudo-batch fashion, it's best practice to run OPTIMIZE immediately following an update.

            
OPTIMIZE audit_logs.bronze          

Bronze to Silver Table

Stream from the bronze Delta Lake table to a silver Delta Lake table, such that it takes the sparse requestParams StructType and strips out all empty keys for every record, along with performing some other basic transformations like parsing email addresses from a nested field and parsing UNIX epoch to UTC timestamp.

Databricks Bronze to Silver Table ETL process, which strips out empty record keys and performs basic transformations, such as parsing email addresses.

Since we send audit logs for all Databricks resource types in a common JSON format, we've defined a canonical struct called requestParams which contains a union of the keys for all resource types. Eventually, we're going to create individual tables for each service, so we want to strip down the requestParams field for each table so that it contains just the relevant keys for the resource type. To achieve this, we define a user-defined function (UDF) to strip away all such keys in requestParams that have null values.

            
def stripNulls(raw):
    return json.dumps({i: raw.asDict()[i] for i in raw.asDict() if raw.asDict()[i] != None})

strip_udf = udf(stripNulls, StringType())

We instantiate a StreamReader from our bronze Delta Lake table:

            
bronzeDF = (
    spark
    .readStream
    .load("{}/streaming/bronze".format(sinkBucket))
)

We then apply the following transformations to the streaming data from the bronze Delta Lake table:

  1. strip the null keys from requestParams and store the output as a string
  2. parse email from userIdentity
  3. parse an actual timestamp / timestamp datatype from the timestamp field and store it in date_time
  4. drop the raw requestParams and userIdentity
            
query = (
    bronzeDF
    .withColumn("flattened", strip_udf("requestParams"))
    .withColumn("email", col("userIdentity.email"))
    .withColumn("date_time", from_utc_timestamp(from_unixtime(col("timestamp")/1000), "UTC"))
    .drop("requestParams")
    .drop("userIdentity")
)

We then stream those transformed records into the silver Delta Lake table:

            
(query
 .writeStream
 .format("delta")
 .partitionBy("date")
 .outputMode("append")
 .option("checkpointLocation", "{}/checkpoints/silver".format(sinkBucket))
 .option("path", "{}/streaming/silver".format(sinkBucket))
 .option("mergeSchema", True)
 .trigger(once=True)
 .start()
)

Again, since we've created a table based on an AWS S3 bucket, we'll want to register it with the Hive Metastore for easier access.

            
spark.sql(""" CREATE TABLE IF NOT EXISTS audit_logs.silver USING DELTA LOCATION '{}/streaming/silver' """.format(sinkBucket))          

Although Structured Streaming guarantees exactly once processing, we can still add an assertion to check the counts of the Bronze Delta Lake table against the Silver Delta Lake table.

            
assert(spark.table("audit_logs.bronze").count() == spark.table("audit_logs.silver").count())

As for the bronze table earlier, we'll run OPTIMIZE after this update for the silver table as well.

            
OPTIMIZE audit_logs.silver          

Silver to Gold Tables

Stream to individual gold Delta Lake tables for each Databricks service tracked in the audit logs.

Databricks Silver to Gold Table ETL process, which provides the pared-down analysis required by the workspace administrators.

The gold audit log tables are what the Databricks administrators will utilize for their analyses. With the requestParams field pared down at the service level, it's now much easier to get a handle on the analysis and what's pertinent. With Delta Lake's ability to handle schema evolution gracefully, as Databricks tracks additional actions for each resource type, the gold tables will seamlessly change, eliminating the need to hardcode schemas or babysit for errors.

In the final step of our ETL process, we first define a UDF to parse the keys from the stripped-down version of the original requestParams field.

            
def justKeys(string):
    return [i for i in json.loads(string).keys()]

just_keys_udf = udf(justKeys, StringType())

For the next big chunk of our ETL, we'll define a function which accomplishes the following:

  1. gathers the keys for each record for a given serviceName (resource type)
  2. creates a set of those keys (to remove duplicates)
  3. creates a schema from those keys to apply to a given serviceName (if the serviceName does not have any keys in requestParams, we give it a one-key schema called placeholder)
  4. writes out to individual gold Delta Lake tables for each serviceName in the silver Delta Lake table
            
def flattenTable(serviceName, bucketName):
    flattenedStream = spark.readStream.load("{}/streaming/silver".format(bucketName))
    flattened = spark.table("audit_logs.silver")
    ...
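The function body above is truncated in this post. As a rough sketch only, not the accompanying notebook's exact code, the four steps could be completed along the following lines, assuming the silver table's flattened column holds the null-stripped requestParams JSON string; for simplicity it derives the key set directly with json.loads rather than via just_keys_udf.

# A possible completion of flattenTable, sketched under the assumptions above.
import json
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

def flattenTable(serviceName, bucketName):
    flattenedStream = spark.readStream.load("{}/streaming/silver".format(bucketName))
    flattened = spark.table("audit_logs.silver")

    # Steps 1 & 2: gather the requestParams keys seen for this service and deduplicate them
    keySet = set()
    for row in flattened.filter(col("serviceName") == serviceName).select("flattened").collect():
        keySet.update(json.loads(row["flattened"]).keys())

    # Step 3: build a schema from those keys (a single "placeholder" key if the service has none)
    schema = StructType()
    if not keySet:
        schema.add(StructField("placeholder", StringType()))
    else:
        for key in sorted(keySet):
            schema.add(StructField(key, StringType()))

    # Step 4: write this service's records to its own gold Delta Lake table
    (flattenedStream
     .filter(col("serviceName") == serviceName)
     .withColumn("requestParams", from_json(col("flattened"), schema))
     .drop("flattened")
     .writeStream
     .format("delta")
     .partitionBy("date")
     .outputMode("append")
     .option("checkpointLocation", "{}/checkpoints/gold/{}".format(bucketName, serviceName))
     .option("path", "{}/streaming/gold/{}".format(bucketName, serviceName))
     .option("mergeSchema", True)
     .trigger(once=True)
     .start())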

We extract a list of all unique values in serviceName to use for iteration and run the above function for each value of serviceName:

            
serviceNameList = [i['serviceName'] for i in spark.table("audit_logs.silver").select("serviceName").distinct().collect()]

for serviceName in serviceNameList:
    flattenTable(serviceName, sinkBucket)

As before, register each gold Delta Lake table to the Hive Metastore:

            
for serviceName in serviceNameList: spark.sql(""" CREATE Tabular array IF NOT EXISTS audit_logs.{0} USING DELTA LOCATION '{ane}/streaming/gilded/{two}' """.format(serviceName,sinkBucket,i))          

Then run OPTIMIZE on each table:

            
for serviceName in serviceNameList:
    spark.sql("OPTIMIZE audit_logs.{}".format(serviceName))

Again, as before, asserting that the counts are equal is not necessary, but we do it nonetheless:

            
flattened_count = spark.table("audit_logs.silver").count()

total_count = 0
for serviceName in serviceNameList:
    total_count += spark.table("audit_logs.{}".format(serviceName)).count()

assert(flattened_count == total_count)

We now have a gold Delta Lake table for each serviceName (resource type) that Databricks tracks in the audit logs, which we can now use for monitoring and analysis.

Audit Log Analysis

In the above section, we process the raw audit logs using ETL and include some tips on how to make data access easier and more performant for your end users. The first notebook included in this article pertains to that ETL process.

The second notebook we've included goes into more detailed analysis on the audit log events themselves. For the purpose of this blog post, we'll focus on just one of the resource types, clusters, but we've included analysis on logins as another example of what administrators could do with the data stored in the audit logs.

It may be obvious to some as to why a Databricks administrator may want to monitor clusters, but it bears repeating: cluster uptime is the biggest driver of cost, and we want to ensure that our customers get maximum value while they're utilizing Databricks clusters.

A major portion of the cluster uptime equation is the number of clusters created on the platform, and we can use audit logs to determine the number of Databricks clusters created on a given day.

By querying the clusters' gold Delta Lake table, we can filter where actionName is create and perform a count by date.

            
SELECT date, count(*) AS num_clusters
FROM clusters
WHERE actionName = 'create'
GROUP BY 1
ORDER BY 1 ASC

Graphical view of cluster usage provided by the Databricks Audit Log solution.

There's not much context in the above chart because we don't have data from other days. But for the sake of simplicity, let's assume that the number of clusters more than tripled compared to normal usage patterns and that the number of users did not change meaningfully during that time period. If this were truly the case, then one of the reasonable explanations would be that the clusters were created programmatically using jobs. Additionally, 12/28/19 was a Saturday, so we don't expect there to be many interactive clusters created anyway.

Inspecting the requestParams StructType for the clusters table, we see that there's a cluster_creator field, which should tell us who created them.

            
SELECT requestParams.cluster_creator, actionName, count(*)
FROM clusters
WHERE date = '2019-12-28'
GROUP BY 1,2
ORDER BY 3 DESC

Cluster usage table report generated by the Databricks Audit Log solution.

Based on the results above, we notice that JOB_LAUNCHER created 709 clusters, out of 714 total clusters created on 12/28/19, which confirms our intuition.

Our next step is to figure out which particular jobs created these clusters, which we could extract from the cluster names. Databricks job clusters follow the naming convention job-<jobId>-run-<runId>, so we can parse the jobId from the cluster name.

            
SELECT split(requestParams.cluster_name, "-")[1] AS jobId, count(*)
FROM clusters
WHERE actionName = 'create' AND date = '2019-12-28'
GROUP BY 1
ORDER BY 2 DESC

The Databricks Audit Log solution enables you to surface the job id responsible for cluster creation, to assist with root cause analysis.

Here we see that jobId "31303" is the culprit for the vast majority of clusters created on 12/28/19. Another piece of information that the audit logs store in requestParams is the user_id of the user who created the job. Since the creator of a job is immutable, we can simply take the first record.

            
SELECT requestParams.user_id
FROM clusters
WHERE actionName = 'create'
  AND date = '2019-12-28'
  AND split(requestParams.cluster_name, "-")[1] = '31303'
LIMIT 1

Now that we have the user_id of the user who created the job, we can utilize the SCIM API to get the user's identity and ask them directly about what may have happened here.
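As a hedged sketch of that lookup using the SCIM (preview) Users endpoint: the workspace URL, token, and user_id values below are placeholders, and the endpoint path and response shape should be verified against your workspace's REST API documentation.

# Sketch: resolve a user_id to an identity via the SCIM API (verify endpoint for your workspace).
import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = "<personal-access-token>"                                # placeholder
user_id = "<user_id from the query above>"                       # placeholder

resp = requests.get(
    "{}/api/2.0/preview/scim/v2/Users/{}".format(workspace_url, user_id),
    headers={"Authorization": "Bearer {}".format(token)},
)
resp.raise_for_status()
print(resp.json().get("userName"))  # typically the user's email address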

In addition to monitoring the total number of clusters overall, we encourage Databricks administrators to pay special attention to all-purpose compute clusters that do not have autotermination enabled. The reason is that such clusters will continue running until manually terminated, regardless of whether they're idle or not. You can identify these clusters using the following query:

            
SELECT date, count(*) AS num_clusters
FROM clusters
WHERE actionName = 'create'
  AND requestParams.autotermination_minutes = 0
  AND requestParams.cluster_creator IS NULL
GROUP BY 1
ORDER BY 1 ASC

If you're utilizing our example data, you'll notice that there are 5 clusters whose cluster_creator is null, which means that they were created by users and not by jobs.

The cluster usage by cluster creator email table report provided by the Databricks Audit Log solution.

By selecting the creator's email address and the cluster's name, we can identify which clusters we need to terminate and which users we need to talk to about best practices for Databricks resource management.

How to start processing Databricks Audit Logs

With a flexible ETL process that follows the best practice medallion architecture with Structured Streaming and Delta Lake, we've simplified Databricks audit logs analysis by creating individual tables for each Databricks resource type. Our cluster analysis example is just one of the many ways that analyzing audit logs helps to identify a problematic anti-pattern that could lead to unnecessary costs. Please use the following notebooks for the exact steps we've included in this post to try it out at your end:

  • ETL Notebook
  • Analysis Notebook

For more information, you can also watch the recent tech talk: Best Practices on How to Process and Analyze Audit Logs with Delta Lake and Structured Streaming.

An example of the operational metrics now available for review in the Spark UI through Delta Lake 0.6.0

For a slightly different architecture that processes the audit logs as soon as they're available, consider evaluating the new Auto Loader capability that we discuss in detail in this blog post.
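As a rough illustration only, not the linked post's code, the bronze ingest above could be swapped to Auto Loader along these lines, assuming the cloudFiles source is available in your Databricks Runtime and reusing the streamSchema, sourceBucket, and sinkBucket names from earlier.

# Sketch: continuous bronze ingest via Auto Loader instead of the daily triggerOnce file source.
autoLoaderDF = (
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(streamSchema)
    .load(sourceBucket)
)

(autoLoaderDF
 .writeStream
 .format("delta")
 .partitionBy("date")
 .outputMode("append")
 .option("checkpointLocation", "{}/checkpoints/bronze_autoloader".format(sinkBucket))
 .option("path", "{}/streaming/bronze".format(sinkBucket))
 .option("mergeSchema", True)
 .start()
)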

We want our customers to maximize the value they get from our platform, so please reach out to your Databricks account team if you have any questions.

Source: https://databricks.com/blog/2020/06/02/monitor-your-databricks-workspace-with-audit-logs.html
