From 3d8429fe3074407deab54500c12dcdad00c76228 Mon Sep 17 00:00:00 2001 From: AlanCoding Date: Fri, 18 Nov 2016 11:20:44 -0500 Subject: [PATCH] Logging Integration feature documentation --- docs/logging_integration.md | 199 ++++++++++++++++++++++++++++++++++++ 1 file changed, 199 insertions(+) create mode 100644 docs/logging_integration.md diff --git a/docs/logging_integration.md b/docs/logging_integration.md new file mode 100644 index 0000000000..6981e7ac75 --- /dev/null +++ b/docs/logging_integration.md @@ -0,0 +1,199 @@
# Integration with Third-Party Log Aggregators

This feature adds the capability to send detailed logs to several kinds of
third-party log aggregation services. Services connected to this data feed
can be used to gain insight into Tower usage and technical trends. The data
is sent in JSON format over an HTTP connection, with minimal service-specific
tweaks engineered in a custom handler or via an imported library.

## Loggers

This feature introduces several new loggers, which are intended to deliver
a large amount of information in a predictable, structured format, following
the same structure one would expect when obtaining the data from the API.
These data loggers are the following:

 - awx.analytics.job_status
   - Summaries of status changes for jobs, project updates, inventory
     updates, and others
 - awx.analytics.job_events
   - Data returned from the Ansible callback module
 - awx.analytics.activity_stream
   - Record of changes to the objects within the Ansible Tower app
 - awx.analytics.system_tracking
   - Data gathered by Ansible scan modules run by scan job templates

These loggers only use the INFO log level.

Additionally, the standard Tower logs are deliverable through this same
mechanism. It should be obvious to the user how to enable or disable each of
these five sources of data without manipulating a complex dictionary in their
local settings file, as well as how to adjust the log level consumed from the
standard Tower logs.

## Supported Services

Currently committed to support:

 - Splunk
 - Elastic Stack / ELK Stack / Elastic Cloud

Under consideration for testing:

 - Sumo Logic
 - Datadog
 - Loggly
 - Red Hat Common Logging via logstash connector

### Elasticsearch Instructions

In the development environment, the server can be started with the log
aggregation services attached via the Makefile targets. This starts the
three associated services, Logstash, Elasticsearch, and Kibana, each in its
own separate container.

In addition to running these services, it establishes connections to the
tower_tools containers as needed. This setup is derived from the docker-elk
project (https://github.com/deviantony/docker-elk).

```bash
# Start a single server with links
make docker-compose-elk
# Start the HA cluster with links
make docker-compose-cluster-elk
```

Kibana is the visualization service, and it can be accessed in a web browser
by going to `{server address}:5601`.

If you stand up your own instance of the Elastic Stack from scratch, the only
change you should need is to add the following lines to the Logstash
`logstash.conf` file.

```
filter {
  json {
    source => "message"
  }
}
```

#### Debugging and Pitfalls

Backward-incompatible changes were introduced with Elastic 5.0.0, and
customers may need different configurations depending on which versions they
are using.
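While debugging, it can help to hand-craft a message, send it to Logstash
directly, and confirm that it shows up in Kibana. Below is a minimal Python
sketch of that check; it assumes Logstash has an `http` input plugin listening
on port 8080, which is an assumption rather than the docker-elk default (that
configuration may use a TCP input instead), so the address must be adjusted to
match your `logstash.conf`.

```python
import json

import requests  # assumes the requests library is installed

# A hand-crafted record mimicking the common log message schema
# documented below; all values here are illustrative.
record = {
    "cluster_host_id": "towerhost",
    "level": "INFO",
    "logger_name": "awx.analytics.activity_stream",
    "@timestamp": "2016-11-18T16:20:44Z",
    "path": "manual_test.py",
}

# POST the serialized record as plain text so the body lands in the
# "message" field, which the json filter shown above then parses
# into structured fields.
response = requests.post(
    "http://localhost:8080",  # assumed Logstash http input address
    data=json.dumps(record),
    headers={"Content-Type": "text/plain"},
    timeout=5,
)
print(response.status_code)
```

If the record does not appear in Kibana afterward, the input plugin or the
filter configuration is the first place to look.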
# Log Message Schema

Common schema for all loggers:

| Field           | Information |
|-----------------|-------------|
| cluster_host_id | (string) Unique identifier of the host within the Tower cluster |
| level           | (choice of DEBUG, INFO, WARNING, ERROR, etc.) Standard Python log level, roughly reflecting the significance of the event. All of the data loggers that are part of this feature use the INFO level, but the other Tower logs will use different levels as appropriate |
| logger_name     | (string) Name of the logger used in the settings, for example, "awx.analytics.activity_stream" |
| @timestamp      | (datetime) Time of the log |
| path            | (string) File path in the code where the log was generated |

## Activity Stream Schema

| Field     | Information |
|-----------|-------------|
| (common)  | Uses all the fields common to all loggers listed above |
| actor     | (string) Username of the user who took the action documented in the log |
| changes   | (string) Summary of the fields changed, with their old and new values |
| operation | (choice of several options) The basic category of the change logged in the activity stream, for instance, "associate" |
| object1   | (string) Information about the primary object being operated on, consistent with what is shown in the activity stream |
| object2   | (string) If applicable, the second object involved in the action |

## Job Event Schema

This logger echoes the data being saved into job events, except where those
fields would conflict with expected standard fields from the logger, in which
case the conflicting fields are renamed. Notably, the field `host` on the
job_event model is given as `event_host`. There is also a sub-dictionary
field `event_data` within the payload, which contains different fields
depending on the specifics of the Ansible event.

This logger also includes the common fields.

## Scan / Fact / System Tracking Data Schema

These messages contain one detailed dictionary-type field: either `services`,
`packages`, or `files`.

| Field        | Information |
|--------------|-------------|
| (common)     | Uses all the fields common to all loggers listed above |
| services     | (dict, optional) For services scans, this field is included and has keys based on the name of the service. NOTE: Elasticsearch disallows periods in names, so they are replaced with "_" by our log formatter |
| packages     | (dict, optional) Included for log messages from package scans |
| files        | (dict, optional) Included for log messages from file scans |
| host         | (string) Name of the host the scan applies to |
| inventory_id | (int) ID of the inventory that the host is inside of |

## Job Status Changes

This is intended to be a lower-volume source of information about changes in
job states compared to job events, and is also intended to capture changes to
types of unified jobs other than job-template-based jobs.

In addition to the common fields, these logs include fields present on the
job model.
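As a concrete illustration of the period-replacement rule noted in the system
tracking schema above, the sketch below shows one way such a key-sanitizing
step could work. This is an illustrative stand-in rather than Tower's actual
formatter code, and the example service name is made up.

```python
def sanitize_keys(data):
    """Recursively replace '.' with '_' in dictionary keys, since
    Elasticsearch disallows periods in field names."""
    if isinstance(data, dict):
        return {
            key.replace(".", "_"): sanitize_keys(value)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [sanitize_keys(item) for item in data]
    return data

# Example: a services scan result keyed by service name.
payload = {"services": {"network.service": {"state": "running"}}}
print(sanitize_keys(payload))
# {'services': {'network_service': {'state': 'running'}}}
```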
## Tower Logs

In addition to the common fields, these messages contain a `msg` field with
the log message. Errors contain a separate `traceback` field.

# Configuring Inside of Tower

Configuring the connection to the log aggregation service requires most of
the following parameters for all supported services:

 - Host
 - Port
 - Some kind of token
 - A flag to enable sending logs, and a selection of which loggers to send
 - A flag to use HTTPS or not

Some settings for the log handler will not be exposed to the user via this
mechanism. In particular, threading (which is enabled) and the connection
type (designed for HTTP/HTTPS) are fixed.

Parameters for the items listed above should be configurable through the
Configure-Tower-in-Tower interface.

# Acceptance Criteria Notes

Connection: Testers need to replicate the documented steps for setting up and
connecting with a destination log aggregation service, if that is an
officially supported service. That will involve 1) configuring the settings
as documented, 2) taking some action in Tower that causes a log message from
each type of data logger to be sent, and 3) verifying that the content is
present in the log aggregation service.

Schema: After the connection steps are completed, a tester will need to
create an index. We need to confirm that no errors are thrown in this
process. It also needs to be confirmed that the schema is consistent with
this documentation. In the case of Splunk, we need basic confirmation that
the data is compatible with the existing app schema.

Tower logs: Formatting of the traceback message is a known issue in several
open-source log handlers, so we should confirm that server errors result in
the log aggregator receiving a well-formatted multi-line string with the
traceback message.

Log messages should be sent outside of the request-response cycle. For
example, Loggly's examples use `requests_futures.sessions.FuturesSession`,
which does some threading work to fire the message without interfering with
other operations. A timeout on the part of the log aggregation service should
not cause Tower operations to hang.
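A minimal sketch of that non-blocking pattern, assuming the `requests-futures`
package is installed; the URL and payload here are placeholders, not Tower's
actual endpoint or handler code.

```python
import json

from requests_futures.sessions import FuturesSession

session = FuturesSession()


def emit(record, url="http://logs.example.com/inputs"):  # placeholder URL
    """Fire a log message on a background thread so a slow or
    unresponsive aggregator cannot block the request-response cycle."""
    return session.post(
        url,
        data=json.dumps(record),
        headers={"Content-Type": "application/json"},
        timeout=5,  # bounds how long the background worker waits
    )


# The returned future can be ignored entirely, or inspected later
# without blocking the caller.
future = emit({"logger_name": "awx.analytics.job_status", "level": "INFO"})
```

Because the POST happens on the session's worker thread, the timeout bounds
only that thread, and the calling code returns immediately regardless of how
the aggregation service behaves.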