OpenTelemetry Trace
Background
The current solution is based on application logs and a sales checker service that triggers alerts on a Slack channel. It offers some insight into the application but falls short when it comes to a deep understanding of exceptions and the general business flow.
Due to the sheer volume of logs and the complex integrations between the application microservices, the development team can spend hours tracing a problem back to its origin. The main shortcomings are:
- Non-standard logs
  - The log format and content depend on the implementation, and log uniformity is a challenge in itself.
- Logs might be outdated or absent
  - As services shift and evolve, so should their logs. This can be a source of inconsistencies, as logs may not be updated or may be wrongly removed during feature implementation.
- Logs are difficult to query and trace
  - Even with the help of tools such as Elasticsearch, querying and tracing logs can be difficult due to the complex interactions among the application's services.
  - For example, when quote-engine returns an error response, we may have only the policy number available to find the root cause. That data is not enough for a complete investigation, which may lead to a long search.
- Not able to trace all exception occurrences
  - In the case described in the previous item, the error was not logged, or at least was not easily accessible.
- Sales triggers are not flexible
  - Triggers are configured as code in the sales checker service. Any change to the logic must be implemented, tested and deployed.
- Sales alerts do not provide insight into root causes
  - The alert contains only the basic information needed to start an investigation: the lead, partner, vertical and a brief description of the problem.
- Limited business process visibility
  - As the alerts are based on sale statuses, we have limited insight into the process. With such a limited view we cannot verify which steps are bottlenecks or are more prone to errors, which is useful information for maintaining and improving the system.
- No concise view of the business flow
  - Currently there are no dashboards or easily accessible views that provide correlated information about the many steps that compose the Bolttech business flow.
Detailed Design
To fill the gaps in the current approach, the proposed solution must fulfill the following requirements:
- Must require minimal code and service changes
- Must be flexible and configurable
- Must be capable of querying multiple data sources
- Must be a tool that can evolve and help multiple types of users
- Must have a dashboard that shows useful information about the business flow
Taking these requirements into consideration, the proposed solution employs an application monitoring tool named OpenTelemetry, a microservice responsible for correlating the data captured by the tool, and a comprehensive dashboard for the final user.
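To make the correlation step more concrete, here is a minimal sketch of grouping captured spans by trace id so that one request can be followed across services. The record shape (`traceId`, `service`, `name`, `startMs`, `error`) and the helper names are assumptions made for illustration; they are not the actual data model of the correlation microservice.

```javascript
// Hypothetical span records; the field names are illustrative only.
const spans = [
  { traceId: 't1', service: 'frontend-v2', name: 'POST /quote', startMs: 0, error: false },
  { traceId: 't1', service: 'quote-engine', name: 'calculate', startMs: 12, error: true },
  { traceId: 't2', service: 'frontend-v2', name: 'GET /health', startMs: 5, error: false },
];

// Group spans by trace id and order each group by start time, so a single
// request can be followed from service to service.
function correlateByTrace(records) {
  const traces = new Map();
  for (const span of records) {
    if (!traces.has(span.traceId)) traces.set(span.traceId, []);
    traces.get(span.traceId).push(span);
  }
  for (const group of traces.values()) {
    group.sort((a, b) => a.startMs - b.startMs);
  }
  return traces;
}

// Return the first failing span of a trace, or undefined if none failed.
function firstError(traces, traceId) {
  return (traces.get(traceId) || []).find((span) => span.error);
}

const traces = correlateByTrace(spans);
console.log(firstError(traces, 't1').service);
```

With a view like this, an error surfaced by one service can be traced back to the span where it first occurred instead of being searched for by policy number alone.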
Usage
- Tracing Apps
- Datadog
- Services and Observability
Sample application
NestJS Applications
NestJS applications should be upgraded to version 9.
- Simply upgrading nest-app to version 9 enables the service to ship all logs to service visibility.
- Because @edirect/trace is applied inside nest-app, applications on nest-app version 9 do not need any trace code of their own.
- If an upgrade is not possible, follow the "Other" instructions and use @edirect/trace directly.
- Integration Service could not be upgraded because it uses a react-ssr that is not compatible with NestJS 9, so, even though it is a NestJS application, we implemented the "Other" way.
- Upgrade nest-app, all edirect dependencies and all NestJS dependencies.
- Initialize the application with otel as a requirement in the node startup command (dist/main or your application entrypoint):

      node --require ./node_modules/@edirect/trace/dist/otel.js dist/main
- Remove the old client initialization, since it has moved to the node require. This applies only if the service was using the old trace version:

      const config = require('config');
      const {
        services: { APM },
        middlewares: traceMiddlewares,
      } = require('@edirect/trace');

      const apmConfig = config.get('apm');

      if (apmConfig.enabled) {
        const apm = new APM(apmConfig.service, apmConfig.url, apmConfig.traceServer);
        apm.startTrace();
      }
- Remove the middleware usage. If the service was using the old trace version, remove the following completely, since it has moved inside @edirect/trace:

      if (apmConfig.enabled) {
        app.use(new traceMiddlewares.Body().body);
      }
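Node loads every module passed with `--require` before it runs the entrypoint, which is why the in-code client initialization is no longer needed. As a minimal sketch, the startup arguments could be assembled as follows; the helper name `buildNodeArgs` is hypothetical, and only the preload path comes from the command shown in this document.

```javascript
// Preload path taken from the startup command in this document.
const OTEL_PRELOAD = './node_modules/@edirect/trace/dist/otel.js';

// Build the argument list for `node`: every preload becomes a
// `--require <path>` pair, followed by the application entrypoint.
function buildNodeArgs(entrypoint, preloads = [OTEL_PRELOAD]) {
  const args = [];
  for (const preload of preloads) {
    args.push('--require', preload);
  }
  args.push(entrypoint);
  return args;
}

console.log(['node', ...buildNodeArgs('dist/main')].join(' '));
```

The resulting array can be handed to a process launcher or copied into the `start` script of package.json.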
NextJS Applications
NextJS applications should be upgraded to version 12 or above.
- Install the trace package:

      npm install --save @edirect/trace
- Create a middleware. Once the NextJS application is upgraded to version 12 (or above), you can use Middlewares. Create a file called middleware.ts (or .js) inside the src folder:

      import { NextResponse } from 'next/server';
      import type { NextRequest } from 'next/server';
      import NextJsTracer from '@edirect/trace/dist/middlewares/nextjs';

      // This function can be marked `async` if using `await` inside
      export function middleware(request: NextRequest) {
        // @edirect/trace: record the incoming request
        new NextJsTracer().body(request as unknown as Request & Record<string, unknown>);
        // Return the response so the request continues through the chain
        return NextResponse.next();
      }
- Install dotenv and create a dotenv.config.js file:

      npm install --save dotenv

  dotenv.config.js:

      const dotenv = require('dotenv');

      const nodeEnv = process.env.NODE_ENV || 'development';
      dotenv.config({ path: `.${nodeEnv}.env` });
- Add the environment variables to your NextJS app. Example .development.env:

      ## OPEN TELEMETRY
      TRACE_TAG_OWNER=OWNER
      TRACE_TAG_SCOPE=ie
      TRACE_TAG_SERVICENAME=frontend-v2
      TRACE_TAG_TENANT={STAGE}-{INSTANCE}
      TRACE_SERVER_URL={URL}
      TRACE_TAG_SERVICE=frontend-v2-{STAGE}-{INSTANCE}
      TRACE_TAG_VERSION=1.0.0
      TRACE_TAG_CLUSTER=stag-ie
      TRACE_TAG_ENV=ie-stag-broker
- Serve your production build in standalone mode. Add this line to your next.config.js file:

      module.exports = {
        // ...myOtherSettings
        output: 'standalone',
      };
- Change the start and dev commands:

      {
        "scripts": {
          "dev": "cross-env NODE_ENV=development node --require ./dotenv.config.js --require ./node_modules/@edirect/trace/dist/otel.js ./node_modules/next/dist/bin/next -p 3100",
          "start": "cross-env NODE_ENV=production node --require ./dotenv.config.js --require ./node_modules/@edirect/trace/dist/otel.js ./build/client/standalone/server.js -p 3100"
        }
      }
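Because the trace variables are loaded from an env file before otel.js runs, a missing entry is easy to overlook. The sketch below reports the variables from the example file that are absent or empty. The helper name `missingTraceVars` is hypothetical, and treating all nine variables as required is an assumption, which is why the check only warns instead of failing.

```javascript
// Variable names taken from the example .development.env in this document.
const TRACE_VARS = [
  'TRACE_TAG_OWNER',
  'TRACE_TAG_SCOPE',
  'TRACE_TAG_SERVICENAME',
  'TRACE_TAG_TENANT',
  'TRACE_SERVER_URL',
  'TRACE_TAG_SERVICE',
  'TRACE_TAG_VERSION',
  'TRACE_TAG_CLUSTER',
  'TRACE_TAG_ENV',
];

// Return the names of trace variables that are unset or blank.
// Assumption: every listed variable is expected to be present.
function missingTraceVars(env = process.env) {
  return TRACE_VARS.filter((name) => !env[name] || env[name].trim() === '');
}

const missing = missingTraceVars();
if (missing.length > 0) {
  console.warn(`Missing trace variables: ${missing.join(', ')}`);
}
```

A check like this could sit at the end of dotenv.config.js, after `dotenv.config(...)` has populated `process.env`.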
Other (Not nest-app:9)
Use these instructions if you cannot upgrade to nest-app 9, or if the application is not a NestJS application (for example, Express). When using @edirect/trace directly, you must inject our middleware yourself so that all requests from the app are logged.
- Install the trace package:

      npm install --save @edirect/trace
- Import the trace package:

      import { middlewares as traceMiddlewares } from '@edirect/trace';
- Add the trace middleware to your application routes.

  Express:

      app.use(new traceMiddlewares.Body().body);

  NestJS with nest-app:

      consumer
        .apply(new traceMiddlewares.Body().body)
        .forRoutes({ path: '/**', method: RequestMethod.ALL });
- Initialize the application with otel as a requirement in the node startup command (dist/main or your application entrypoint):

      node --require ./node_modules/@edirect/trace/dist/otel.js dist/main
- Remove the old client initialization, since it has moved to the node require. This applies only if the service was using the old trace version:

      const config = require('config');
      const {
        services: { APM },
        middlewares: traceMiddlewares,
      } = require('@edirect/trace');

      const apmConfig = config.get('apm');

      if (apmConfig.enabled) {
        const apm = new APM(apmConfig.service, apmConfig.url, apmConfig.traceServer);
        apm.startTrace();
      }
- Clean up the middleware usage. If the service was using the old trace version, replace:

      if (apmConfig.enabled) {
        app.use(new traceMiddlewares.Body().body);
      }

  with:

      app.use(new traceMiddlewares.Body().body);
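Both registration styles above pass the handler as a detached method reference (`new traceMiddlewares.Body().body`). The sketch below shows one way a class can support that usage without losing `this`. It is a hypothetical stand-in written for illustration; the internals of @edirect/trace are not shown in this document, and `BodyLogger` is an invented name.

```javascript
// Hypothetical stand-in for a middleware class; NOT the @edirect/trace code.
class BodyLogger {
  constructor(log = console.log) {
    this.log = log;
    // An arrow function keeps `this` bound even when the handler is passed
    // around detached, as in app.use(new BodyLogger().body).
    this.body = (req, res, next) => {
      this.log({ method: req.method, url: req.url, body: req.body });
      next();
    };
  }
}

// Simulate a single request without Express: an Express-style handler only
// needs (req, res, next).
const seen = [];
const handler = new BodyLogger((entry) => seen.push(entry)).body;
let nextCalled = false;
handler(
  { method: 'POST', url: '/quote', body: { policy: 'P-1' } },
  {},
  () => { nextCalled = true; },
);
console.log(seen.length, nextCalled);
```

The handler records the request and then calls `next()`, so the request continues to the application routes unchanged.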
Applying k8s configurations (Required)
We created a standard way to deploy application configurations and keep them simple:
- Sometimes this has already been done by another team, so you do not need to edit the configuration yourself; just generate the configurations and apply them.
- bolttech-broker-asia\staging\config-map.yaml (or your environment config-map):

      auth: # or your service
        trace:
          tags:
            scope: ie # your tech center
            env: ie-stag-broker # current k8s cluster environment
            cluster: stag-ie # current k8s cluster
            tenant: ${namespace}
            service: ${name}-${namespace}
            servicename: ${name}
            version: "1.0.0"
            owner: OWNER
          server: opentelemetry-collector.stag.bolttechbroker.net # cluster opentelemetry server
Then you should generate and apply your brand new configurations:
- Generate. This command generates the new configuration files for the services.

      # Staging Clusters
      edi infra bolttech-broker-asia k8s generate <ENVIRONMENT> <SERVICE>
      edi infra bolttech-broker-asia k8s generate stage1-vnbroker rules-engine # VN Example

      # Production Clusters
      edi infra <PRODUCTION_BROKER> k8s generate <ENVIRONMENT> <SERVICE>
      edi infra bolttech-broker-asia-hk k8s generate live-hkbroker-a policy-issuing-service --prod # HK Example
- Apply. This command applies the new configurations and redeploys the service on the cluster.

      # Staging Clusters
      edi infra bolttech-broker-asia k8s apply staging <STAGE/RC> <SERVICE> --redeploy
      edi infra bolttech-broker-asia k8s apply staging rc-vnbroker frontend-service --redeploy # VN Example

      # Production Clusters
      edi infra <PRODUCTION_BROKER> k8s apply <CLUSTER> <ENVIRONMENT> <SERVICE>
      edi infra bolttech-broker-asia-hk k8s apply cluster-a live-hkbroker-a plan-service --prod --redeploy # HK Example
- Key notes when using the infrastructure repositories:
  - Before running any command, make sure you have completed the setup of the edi-cli tool and that you have all the related projects (infrastructure and k8s-templates) inside the same root folder.
  - Make sure to have the most recent version of the master branch before running any commands.
  - Remember to ALWAYS COMMIT AND PUSH YOUR CHANGES TO THE REPOSITORIES, OTHERWISE THE NEW CONFIGURATIONS WILL NOT BE APPLIED TO SERVICES WHEN THEY ARE DEPLOYED USING THE JENKINS PIPELINES.
  - To make things easier, check the helper script in the edi-cli project, located in the edi-cli-cli folder.
  - Consider hiding the folders of the brokers you will not need to work with. This will greatly improve your navigation inside the project.
Debugging
- How to debug in VS Code: since we do not debug with the node command directly, we suggest adding this to your .vscode/launch.json:

      {
        "version": "0.2.0",
        "configurations": [
          {
            "type": "node",
            "request": "launch",
            "name": "Debug with OTEL",
            "args": ["${workspaceFolder}/src/main.ts"],
            "runtimeArgs": [
              "--inspect",
              "-r",
              "./node_modules/@edirect/trace/dist/otel.js",
              "-r",
              "tsconfig-paths/register",
              "-r",
              "ts-node/register"
            ],
            "console": "integratedTerminal",
            "envFile": "${workspaceFolder}/.development.env"
          }
        ]
      }
  Then append the OTEL environment variables to your local environment file:

      TRACE_TAG_OWNER=OWNER
      TRACE_TAG_SCOPE=pgw
      TRACE_TAG_SERVICENAME=payment-gateway
      TRACE_TAG_TENANT=local-paythbroker
      TRACE_SERVER_URL=opentelemetry-collector.stag.bolttechpay.net
      TRACE_TAG_SERVICE=payment-gateway-local-paythbroker
      TRACE_TAG_VERSION=1.0.0
      TRACE_TAG_CLUSTER=localhost
      TRACE_TAG_ENV=development
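In every example in this document, TRACE_TAG_SERVICE equals the service name followed by the tenant (for instance `${name}-${namespace}` in the config-map). The sketch below checks a local environment against that pattern. Whether the convention is actually enforced anywhere is an assumption, and `checkServiceTag` is a hypothetical helper name.

```javascript
// Compare TRACE_TAG_SERVICE with the "<servicename>-<tenant>" pattern seen
// in this document's examples. Assumption: the pattern is a convention
// rather than a rule enforced by the tooling.
function checkServiceTag(env) {
  const expected = `${env.TRACE_TAG_SERVICENAME}-${env.TRACE_TAG_TENANT}`;
  return { ok: env.TRACE_TAG_SERVICE === expected, expected };
}

// Values copied from the local environment example above.
const localEnv = {
  TRACE_TAG_SERVICENAME: 'payment-gateway',
  TRACE_TAG_TENANT: 'local-paythbroker',
  TRACE_TAG_SERVICE: 'payment-gateway-local-paythbroker',
};

console.log(checkServiceTag(localEnv).ok);
```

Running the check before a debug session can catch a copied env file whose service tag still points at another tenant.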