Running Node.js apps in production
Frederic Hemberger
@fhemberger
Who has written a Node.js app so far? (Web app, API, etc.)
Who runs this app in production?
Topics I'll talk about today:
Deployment
Run Node.js (and keep it running)
Metrics
This talk is just supposed to give a brief overview.
All tools and resources mentioned are linked at the end of the presentation.
Deployment
Different popular deployment techniques:
Git Hooks
GitHub Webhooks
Capistrano, Fabric, deploy.sh, et al.
Git Hooks
Pushing to Git remote on your server
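# .git/hooks/post-receive
# Git sets GIT_DIR while the hook runs; unset it so that `git pull`
# operates on the checkout we cd into.
unset GIT_DIR
cd /var/www/myapp.com || exit 1
git pull
npm install --production
service myapp restart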
...
Done.
Git Hooks
Pro:
Easy for the developer: Just push to production (aka fire and forget)
Hosting platforms like Heroku use this method as well
Con:
But what happens on the server?
Deployment knowledge is stored separately from code
When deploying on multiple servers, post-receive hooks must be in sync
Solution:
Add the deploy script to your repository and symlink it to the post-receive hook.
GitHub Webhooks
Pro:
When the rest of your development work already revolves around GitHub, it integrates nicely into the workflow
Con:
All hooks run independently and in parallel:
E.g. if the CI hook fails, the webhook for deployment still gets triggered.
Some CI services like Travis CI offer their own hooks to trigger a deployment afterwards.
Critical dependency for your deployment:
Remember, even GitHub is down or gets DDoS'ed from time to time
Requires a server component that runs the update script.
It must be secured so that it rejects fake payloads and cannot mess up the deployment.
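One way to reject fake payloads is to check a signature over the request body. When a secret is configured for a webhook, GitHub sends an HMAC of the raw body in the X-Hub-Signature-256 header (format: sha256=<hex>). A minimal sketch using only Node's built-in crypto module; the function names are illustrative, and the secret is assumed to come from your server's configuration:

```javascript
var crypto = require('crypto');

// Compute the signature string for a raw request body.
// `secret` is the shared webhook secret (assumed to be loaded from
// config or the environment, never hard-coded).
function sign(secret, rawBody) {
  return 'sha256=' +
    crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Return true only if the header value matches the HMAC of the raw body.
function isValidSignature(secret, rawBody, signatureHeader) {
  if (typeof signatureHeader !== 'string') return false;
  var expected = Buffer.from(sign(secret, rawBody));
  var received = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}
```

The check has to run on the raw bytes of the body, before any JSON parsing; only trigger the update script when it returns true.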
Capistrano, Fabric, deploy.sh, et al.
Remotely checks out your code from a repository
Directory is named after current date and/or revision
Symlinks it to current
deploy_directory
├─┬ releases
│ ├── 20140319001122
│ └── ...
├─┬ shared
│ ├── log
│ ├── pids
│ └── system
└── current ⇨ releases/20140319001122
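The release/symlink layout above can be sketched in a few lines of Node.js. This is not how Capistrano or Fabric implement it, just an illustration of the idea on a POSIX file system; releaseName and activateRelease are hypothetical helper names, and the actual checkout step is left out:

```javascript
var fs = require('fs');
var path = require('path');

// Format a Date as YYYYMMDDhhmmss (UTC), like the directory names above.
function releaseName(date) {
  return date.toISOString().replace(/[-:T]/g, '').slice(0, 14);
}

// Create releases/<name> and repoint `current` to it.
function activateRelease(deployDir, name) {
  var releaseDir = path.join(deployDir, 'releases', name);
  fs.mkdirSync(releaseDir, { recursive: true });
  // ... check out the code into releaseDir here ...

  // Create the new link under a temporary name first, then rename it
  // over `current`, so `current` never points to nothing.
  var tmpLink = path.join(deployDir, 'current.tmp');
  try {
    fs.unlinkSync(tmpLink); // remove a leftover from an aborted run
  } catch (err) {
    if (err.code !== 'ENOENT') throw err;
  }
  fs.symlinkSync(path.join('releases', name), tmpLink);
  fs.renameSync(tmpLink, path.join(deployDir, 'current'));
  return releaseDir;
}
```

Rolling back is the same operation with the name of an older release directory.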
Capistrano, Fabric, deploy.sh, et al.
Additionally triggers scripts that can:
restart the web server
create a database and its schema
install/update your app's dependencies
Capistrano, Fabric, deploy.sh, et al.
Pro:
Clean server side application structure (including logs, shared files, etc.)
Trigger arbitrary scripts before/after the deployment
Quickly roll back to the previous deployment on error
Con:
Introduces another language as additional dependency (Capistrano: Ruby; Fabric: Python)
Run Node.js (and keep it running)
Start the script as a daemon:
Nodemon/node-forever (written in Node.js)
supervise (UNIX daemontools)
Upstart (Ubuntu)
Example Upstart script
start on runlevel [2345]
stop on runlevel [06]
respawn
respawn limit 5 60
env NODE_SCRIPT=/var/www/myapp/server.js
env LOGFILE=/var/log/myapp.log
exec start-stop-daemon --start --chuid node \
--exec /usr/local/bin/node -- \
$NODE_SCRIPT >> $LOGFILE 2>&1
More elaborate: PM2
Process manager with built-in load balancer
PM2
Monitor processes
Question: Who should be responsible for process management (creation, restarting, monitoring, clustering)? The OS? The startup script? The application itself?
Whatever method you use to run your applications:
Startup scripts should …
… be as general as possible (only path, environment, main JS file)
… not contain configuration settings for your application
… be included alongside your deployment (symlink if necessary)
… be kept under version control as well
Starting an app is like starting a car: The starter (keys, remote, button) doesn't need to know anything about the car. It only connects the wires which start the car.
However, the controlling hardware must know the car's systems (engine type and performance, ABS, ESP) to act accordingly (maximum speed, braking effect, handling).
There are at least two occasions where your app will not be available:
While deploying a new version
On application errors/exceptions
Deployment
Downtime during deployment should be kept to a minimum:
Only deploy tested code to production
Automate the entire deployment process
Use a cluster to reload workers (complete app restart is only needed if the master changes)
e.g. between requests or at the end of a user session
recluster
wrapper around Node.js's own cluster module
// cluster.js
var recluster = require('recluster'),
    path = require('path'),
    cluster = recluster(path.join(__dirname, 'server.js'));

// Reload all workers when the master process receives SIGUSR2
process.on('SIGUSR2', function() {
  console.log('Got SIGUSR2, reloading cluster ...');
  cluster.reload();
});

cluster.run();
Reload cluster workers: kill -s SIGUSR2 <cluster_pid>
recluster
// server.js
server.on('close', function() {
// cleanup
});
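To make the 'close' handler above concrete: a worker can stop accepting new connections and run its cleanup once the open ones have ended. A minimal sketch with Node's built-in http module; createServer and shutdown are illustrative names, and what "cleanup" means (closing database handles, flushing logs) depends on your app:

```javascript
var http = require('http');

// Build a server whose 'close' event runs the given cleanup callback.
function createServer(onCleanup) {
  var server = http.createServer(function (req, res) {
    res.end('ok\n');
  });
  server.on('close', onCleanup);
  return server;
}

// Stop accepting new connections. 'close' fires (and `done` is called)
// once all open connections have ended.
function shutdown(server, done) {
  server.close(done);
}
```

In a worker you would typically call shutdown() from a signal or disconnect handler and exit the process in the `done` callback.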
Errors/Exceptions
Different categories of errors:
Hardware/network errors:
You're screwed, can't do much about it.
Component errors:
Database not responding, files missing, wrong access privileges
Throw an exception, exit application (check your restart script!)
Programming errors:
Testing your code is great, but some bugs will eventually slip through.
Impact is hard to assess; try to fail gracefully
Usage errors:
Validate inputs, inform the user and offer guidance
Ideally, a simple error (request timeout, processing invalid/missing inputs) should never bring down the entire application.
Errors/Exceptions
Bind error handling to individual parts of your application
Those parts may differ in error handling: e.g. request errors, input parsing, external APIs/services
Try to resolve errors with minimum impact to the overall application:
Unable to connect? => Notify the user, log error, try again
Invalid input? => Notify the user, stop processing
Try to get focused stack traces: Easier for debugging
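The two cases above ("Unable to connect?" and "Invalid input?") call for different handling, which is easier when each part of the app owns its errors. A minimal sketch; withRetry and parseAge are hypothetical helpers, and `operation` stands for whatever connects to your database or external service:

```javascript
// Retry a failing async operation a limited number of times and log
// each failure. `operation` returns a Promise; `log` takes a message.
function withRetry(operation, attempts, delayMs, log) {
  return operation().catch(function (err) {
    log('attempt failed: ' + err.message);
    if (attempts <= 1) throw err; // give up, let the caller notify the user
    return new Promise(function (resolve) {
      setTimeout(resolve, delayMs);
    }).then(function () {
      return withRetry(operation, attempts - 1, delayMs, log);
    });
  });
}

// Invalid input is a usage error: report it and stop, do not retry.
function parseAge(input) {
  var text = String(input).trim();
  var age = Number(text);
  if (text === '' || !Number.isInteger(age) || age < 0) {
    return { ok: false, error: 'Please enter your age as a whole number.' };
  }
  return { ok: true, value: age };
}
```

Neither path throws into the rest of the application: a connection problem is retried and logged, a bad input is returned as a message for the user.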
Metrics help you to see:
What are people really doing? How do they use the application?
Which errors occur?
Where are bottlenecks?
Is someone messing with your app?
Metrics: Monitoring
What is going on?
CPU load, memory usage, Node.js heap size
HTTP requests, response times
Database monitoring, CPU/memory profiling, alerts
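Some of these numbers are available from Node.js itself, without any monitoring product. A minimal sketch using only built-ins (os.loadavg, process.memoryUsage, process.hrtime); sampleMetrics and timed are illustrative names, and where you send the values (log file, statsd, ...) is up to you:

```javascript
var os = require('os');

// Collect one snapshot of basic process and system metrics.
function sampleMetrics() {
  var mem = process.memoryUsage();
  return {
    timestamp: new Date().toISOString(),
    loadAvg1m: os.loadavg()[0],     // 1-minute system load average
    rssBytes: mem.rss,              // resident set size of this process
    heapUsedBytes: mem.heapUsed,    // V8 heap currently in use
    heapTotalBytes: mem.heapTotal   // V8 heap currently allocated
  };
}

// Wrap an HTTP request handler and report the response time in ms.
function timed(handler, report) {
  return function (req, res) {
    var start = process.hrtime();
    res.on('finish', function () {
      var diff = process.hrtime(start);
      report(req.method, req.url, diff[0] * 1e3 + diff[1] / 1e6);
    });
    handler(req, res);
  };
}
```

Calling sampleMetrics() on an interval gives a simple time series; timed() wraps the handler you pass to http.createServer.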
Monitoring: look
Pro:
Con:
Older fork of Nodetime (two years old)
Monitoring: Nodetime, New Relic, etc.
(Commercial Products)
Pro:
Many different metrics
Free tier
Con:
Free tiers are very limited:
Nodetime: Only one process(!),
New Relic: Only 24h data retention
May not be suitable for smaller or low-traffic projects
Smallest plans:
Nodetime: $99/month, New Relic: $149/month per host
Metrics: Logging
Keep your logs in one place, either at application level or in /var/log.
Use log levels: Separate debug information from warnings and errors
Use a coherent log format (timestamp, level, message, payload)
Separate your access logs (e.g. in Express) from your application logs
Track your deployments with your analytics tools
Not everyone combs through log files all the time to find changes
Reveals different kinds of metrics, e.g. "After our last deployment, mobile conversion rate increased by N%"
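Log levels and a coherent format (timestamp, level, message, payload) need very little code. A minimal sketch of a JSON-lines logger, not a replacement for a real logging library; createLogger and the numeric level values are assumptions made for this example:

```javascript
var LEVELS = { debug: 20, info: 30, warn: 40, error: 50 };

// Create a logger that writes one JSON object per line to `stream`
// and drops records below `minLevel`.
function createLogger(stream, minLevel) {
  var logger = {};
  Object.keys(LEVELS).forEach(function (level) {
    logger[level] = function (message, payload) {
      if (LEVELS[level] < LEVELS[minLevel]) return;
      stream.write(JSON.stringify({
        timestamp: new Date().toISOString(),
        level: level,
        message: message,
        payload: payload || {}
      }) + '\n');
    };
  });
  return logger;
}
```

With createLogger(process.stdout, 'info'), a call like log.info('deployment finished', { release: '20140319001122' }) also records deployments in the same stream, so they can be correlated with other metrics later.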
Metrics: Logging
One possible solution: Bunyan
All logs are stored in JSON format (timestamp, app, message, payload)
Uses streams, offers different targets out of the box: File, rotating file, database, etc.
Metrics: Logging
But …
Uncaught exceptions are still logged to stderr
Other components may still use console.log statements
node app.js >> /var/log/myapp.log 2>&1
Again, multiple logs in different formats.
I still haven't found a 100% satisfying solution myself.
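One partial workaround, under the assumption that you have a logger object with info() and error() methods: route stray console.log calls through it and log uncaught exceptions before exiting. captureConsole and logFatal are hypothetical helpers; this does not catch output that modules write directly to stdout/stderr, and a log record written right before exit may be lost if the stream is asynchronous:

```javascript
var util = require('util');

// Replace console.log so stray calls end up as structured records.
// Returns a function that restores the original console.log.
function captureConsole(log) {
  var original = console.log;
  console.log = function () {
    log.info(util.format.apply(null, arguments));
  };
  return function restore() {
    console.log = original;
  };
}

// Build an 'uncaughtException' handler that logs the error and then
// exits, so the process manager can restart the app.
function logFatal(log, exit) {
  return function (err) {
    log.error('uncaught exception', { stack: err.stack });
    exit(1);
  };
}

// Wiring (assumes `log` exists):
// process.on('uncaughtException', logFatal(log, process.exit));
```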
Analysis of gathered metrics
Different log formats and destinations make data analysis difficult:
# Apache access log
10.0.1.22 - - [15/Oct/2010:11:46:46 -0700] "GET /favicon.ico HTTP/1.1" 404 209
fe80::6233:4bff:fe29:3173 - - [15/Oct/2010:11:46:58 -0700] "GET / HTTP/1.1" 200 44
# Apache error log
[Fri Oct 15 11:46:46 2010] [error] [client 10.0.1.22] File does not exist: /Library/WebServer/Documents/favicon.ico
[Fri Oct 15 11:46:58 2010] [error] [client fe80::6233:4bff:fe29:3173] File does not exist: /Library/WebServer/Documents/favicon.ico
# typical Express.js log output
[Mon, 21 Nov 2011 20:52:11 GMT] 200 GET /foo (1ms)
Blah, some other unstructured output from a console.log call.
»ELK« stack
Elasticsearch (Storage/Search)
Logstash (Logfile processor)
Kibana (Logfile viewer)
»ELK« stack
Pro:
Very powerful and extendable log analysis
Parse logs for Squid, Apache, Nginx, Syslog, MySQL, …
Feed logs directly to statsd/Graphite
Easy querying and visualization
Realtime search
Open Source
Con:
Slightly more complex setup (Java, JRuby, etc.)
Thus it might not be a good fit for smaller projects/hosting solutions
Logstash
Turns messy data in different log formats …
# Apache access log
10.0.1.22 - - [15/Oct/2010:11:46:46 -0700] "GET /favicon.ico HTTP/1.1" 404 209
fe80::6233:4bff:fe29:3173 - - [15/Oct/2010:11:46:58 -0700] "GET / HTTP/1.1" 200 44
# Apache error log
[Fri Oct 15 11:46:46 2010] [error] [client 10.0.1.22] File does not exist: /Library/WebServer/Documents/favicon.ico
[Fri Oct 15 11:46:58 2010] [error] [client fe80::6233:4bff:fe29:3173] File does not exist: /Library/WebServer/Documents/favicon.ico
# typical Express.js log output
[Mon, 21 Nov 2011 20:52:11 GMT] 200 GET /foo (1ms)
Blah, some other unstructured output from a console.log call.
Logstash
… into structured output
{
"message" => "127.0.0.1 - - [11/Dec/2013:00:01:45 -0800…
"@timestamp" => "2013-12-11T08:01:45.000Z",
"@version" => "1",
"host" => "cadenza",
"clientip" => "127.0.0.1",
"timestamp" => "11/Dec/2013:00:01:45 -0800",
"verb" => "GET",
"request" => "/xampp/status.php",
"httpversion" => "1.1",
"response" => "200",
"bytes" => "3891",
"referrer" => "\"http://cadenza/xampp/navi.php\"",
"agent" => "\"Mozilla/5.0 (Macintosh; Intel Mac OS X…
}
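To illustrate what "turning a log line into structured output" means: Logstash does this with its own pattern library, but the principle can be shown with a plain regular expression. A minimal sketch for Apache access log lines in Common Log Format; parseAccessLine is a hypothetical helper, and the field names simply mirror the output above:

```javascript
// host ident user [timestamp] "verb request HTTP/version" status bytes
var CLF = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) HTTP\/([\d.]+)" (\d{3}) (\d+|-)/;

// Parse one access log line into an object, or return null if the line
// does not match (e.g. stray console.log output).
function parseAccessLine(line) {
  var m = CLF.exec(line);
  if (!m) return null;
  return {
    message: line,
    clientip: m[1],
    timestamp: m[4],
    verb: m[5],
    request: m[6],
    httpversion: m[7],
    response: m[8],
    bytes: m[9]
  };
}
```

Lines that return null are exactly the unstructured ones that make analysis hard; they need their own pattern or a fix at the source.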
Logstash
Easily extendable to custom log formats
Read log information from file, Heroku, Redis, RabbitMQ, stdin, syslog, TCP, UDP, XMPP, ZeroMQ, …
Output to file, Ganglia, Graphite, IRC, Loggly, MongoDB, Nagios, RabbitMQ, Redis, Riak, S3, Statsd, Syslog, TCP, UDP, Websocket, XMPP, ZeroMQ, …
Kibana