# PySpark-LogAnalysis

#### Cosmos DB (DocumentDB) Insights Log Processor

The **pyspark-loganalysis** Python script helps analyze large volumes of operational log data written by Cosmos DB (DocumentDB) and Application Insights. When a large number of transactions is executed in a short timeframe, the log file (PT5M.json) can exceed 1 GB, which prevents quick analysis of data collected over several days, weeks, or months.

To prepare these files, the `doTransformations.sh` script uses `jq` to explode the array so that we can import the JSON files on a line-by-line basis.
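
As a minimal sketch of that transformation (assuming the export wraps its entries in a top-level `records` array, and with an illustrative output file name; adjust the filter to match your files), `jq` can emit one compact JSON object per line:

```
$ jq -c '.records[]' PT5M.json > PT5M.transformed.json
```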

### Prerequisites

From the command line, verify/install [jq](https://stedolan.github.io/jq/):

```
$ jq

jq - commandline JSON processor [version 1.5-1-a5b5cbe]
Usage: jq [options] <jq filter> [file...]
```
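
If `jq` is not present, it can typically be installed on an Ubuntu-based head node with:

```
$ sudo apt-get install jq
```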

## Usage

### Run the JSON transformation on `insights-logs-dataplanerequests` files

These files are already available on the HDInsight cluster, since the Azure Storage Blob was made accessible via a scripted command on each head node. Before executing, update the `fileCount` and `filePaths` variables in the `doTransformations.sh` script; as a hypothetical illustration (placeholder values only, not the script's actual contents), they might be set like this:
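
```
# Hypothetical placeholder values -- substitute your own count and path.
fileCount=1402
filePaths="/mnt/insights-logs-dataplanerequests"
```

Then execute: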

```
$ sudo sh doTransformations.sh
```

The output will be similar to:

```
Enumerating the objects, please wait...

There are 1402 JSON files to be transformed

Press any key to continue...
```

### (Optional) Delete the untransformed files

Update `fileCount` and `filePaths` in the `doDelete.sh` script to match the paths used in `doTransformations.sh`, then run:

```
$ sudo sh doDelete.sh
```

The output will be similar to:

```
Enumerating the objects, please wait...

There are 1402 unparsed JSON files to be deleted

Press any key to continue...
```
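
As a quick sanity check (the path below is a placeholder; substitute your `filePaths` value), you can confirm that no untransformed `PT5M.json` files remain:

```
$ find /mnt/insights-logs-dataplanerequests -name 'PT5M.json' | wc -l   # expect 0
```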

### Process data on Spark Cluster

In the Jupyter Notebook (.ipynb), there are 18 cells that separate the execution stages. To run from the terminal:

```
$ ./bin/spark-submit LogAnalysis-Master.py
```

or

```
$ ./bin/pyspark LogAnalysis-Master.py
```
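
As a rough sketch of the loading stage (not the notebook's exact code; the path is a placeholder and the Spark 2.x `SparkSession` API is assumed), the transformed line-delimited files can be read directly into a DataFrame:

```
# Minimal sketch: load the jq-transformed, line-delimited JSON logs.
# The path below is a placeholder, not taken from LogAnalysis-Master.py.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LogAnalysis").getOrCreate()

# After doTransformations.sh, each line is a standalone JSON object,
# so Spark's default (line-delimited) JSON reader can parse it directly.
logs = spark.read.json("/mnt/insights-logs-dataplanerequests/*.json")

logs.printSchema()
print(logs.count())
```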
