Hi everyone! My name is Eugene Obrezkov, and today I want to talk about one of the “scariest” platforms — Node.js. I will answer one of the most complicated questions about Node.js — “How does Node.js work?”.
I will present this article as if Node.js didn’t exist at all. This way, it should be easier for you to understand what’s going on under the hood.
The code found in this post is taken from existing Node.js sources, so after reading this article, you should be more comfortable with Node.js.
What do we need this for
The first question that may come to your mind — “What do we need this for?”.
Here, I’d like to quote Vyacheslav Egorov:
The more people stop seeing JS VM as a mysterious black box that converts JavaScript source into some zeros-and-ones the better.
The same idea applies to Node.js: “The more people stop seeing Node.js as a mysterious black box that runs JavaScript with low-level API the better”.
Just Do It
Let’s go back to 2009, when Node.js was just getting started.
We’d like to run JavaScript on the back end and get access to low-level APIs. We also want to run our JavaScript from the CLI and a REPL. We want JavaScript to do everything!
How would we do this? The first thing that comes to my mind is…
Browser
A browser can execute JavaScript. So we can take a browser, integrate it into our application, and that’s it.
Not really! Here are the questions that need to be answered:
- Does a browser expose low-level APIs to JavaScript? No!
- Does it allow running JavaScript from somewhere else? Both yes and no, it’s complicated!
- Do we need all the DOM stuff that the browser gives us? No! It’s overhead.
- Do we need a browser at all? No!
We don’t need a browser. JavaScript can be executed without one.
If the browser is not a requirement for executing JavaScript, what executes JavaScript then?
Virtual Machine (VM)
A virtual machine executes JavaScript!
A VM provides a high-level abstraction, that of a high-level programming language (compared to the low-level ISA abstraction of the system).
A VM of this kind is designed to execute a single computer program by providing an abstracted and platform-independent program execution environment.
There are lots of virtual machines that can execute JavaScript, including V8 from Google, Chakra from Microsoft, SpiderMonkey from Mozilla, JavaScriptCore from Apple and more. Choose wisely, because it is a decision you may regret for the rest of your life.
I suggest that we choose Google’s V8. Why? Because it was generally regarded as faster than the other VMs at the time, and I think you’ll agree that execution speed is important for the back end.
Let’s look at V8 and how it can help to build Node.js.
V8 VM
V8 can be integrated into any C++ project. Just take the V8 sources and include them as a library. You can then use the V8 API, which allows you to compile and run JavaScript code.
V8 can expose C++ to JavaScript. This is very important, as we want to make low-level APIs available within JavaScript.
Those two points are enough to imagine a rough implementation of our idea: how we can run JavaScript with access to low-level APIs.
Let’s sum up everything above, because in the next chapter we will start with C++ code. You take a virtual machine, in our case V8 -> integrate it into your C++ project -> expose C++ to JavaScript with V8’s help.
But how can we write C++ code and make it available within JavaScript?
V8 Templates
Via V8 Templates!
A template is a blueprint for JavaScript functions and objects. You can use a template to wrap C++ functions and data structures within JavaScript objects.
For example, Google Chrome uses templates to wrap C++ DOM nodes as JavaScript objects and to install functions in the global scope.
You can create a set of templates and then use them. Accordingly, you can have as many templates as you want.
V8 has two types of templates: Function Templates and Object Templates.
A Function Template is the blueprint for a single function. You create a JavaScript instance of the template by calling the template’s GetFunction method from within the context in which you wish to instantiate the JavaScript function. You can also associate a C++ callback with a function template; it is called when the JavaScript function instance is invoked.
An Object Template is used to configure objects created with a function template as their constructor. You can associate two types of C++ callbacks with object templates: accessor callbacks and interceptor callbacks. An accessor callback is invoked when a specific object property is accessed by a script. An interceptor callback is invoked when any object property is accessed by a script. In a nutshell, you can wrap C++ objects and structures within JavaScript objects.
Consider a simple example whose only job is to expose the C++ method LogCallback to the global JavaScript context.
First, we create a new ObjectTemplate. Then we create a new FunctionTemplate, associate the C++ method LogCallback with it, and set this FunctionTemplate instance on the ObjectTemplate instance under the name log. Finally, we pass our ObjectTemplate instance to a new JavaScript context, so that when you run JavaScript in this context, you can call the log method from the global scope. As a result, LogCallback, the C++ method associated with our FunctionTemplate instance, will be triggered.
As you can see, it’s like defining objects in JavaScript, only in C++.
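To get a feel for the idea without compiling V8, here is a rough JavaScript analogy built on Node's own vm module. It is not the V8 C++ Template API: a plain object stands in for the ObjectTemplate, and the host function logCallback (a name chosen here for illustration) stands in for the C++ LogCallback.

```javascript
const vm = require('vm');

// Host-side function, playing the role of the C++ LogCallback.
const logged = [];
function logCallback(...args) {
  logged.push(args.join(' '));
}

// Plain object playing the role of the ObjectTemplate used as the global.
const globalTemplate = { log: logCallback };

// Create a context whose global object exposes `log`.
const context = vm.createContext(globalTemplate);

// Code running in the context sees `log` as a global and can call it.
vm.runInContext('log("Hello from", "the context")', context);

console.log(logged);
```

The script never sees logCallback directly; it only sees the name log on its global object, which is the same separation the templates give you on the C++ side.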
By now, we have learned how to expose C++ methods and structures to JavaScript. Next, we will learn how to run JavaScript code in those modified contexts. It’s simple: just compile and run.
V8 Compile && Run JavaScript
If you want to run your JavaScript in the created context, you need just two simple V8 API calls: Compile and Run.
Consider an example in which we create a new Context and run JavaScript inside it.
First, we create a JavaScript context (we can customize it with the templates described above). Then we make this context active for compiling and running JavaScript code. Next, we create a new string from the JavaScript source; it can be hard-coded, read from a file, or obtained in any other way. After that, we compile our JavaScript source. Finally, we run it and get the result. That’s all.
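The same compile-then-run sequence can be sketched from JavaScript with Node's vm module, which exposes a similar split between compiling a script and running it in a context. This is an analogy to the V8 C++ calls, not the calls themselves; the file name hello.js is only a label used in stack traces.

```javascript
const vm = require('vm');

// 1. Create a context (its global object could carry exposed functions).
const context = vm.createContext({});

// 2. The JavaScript source: hard-coded here, but it could come from a file.
const source = '"Hello" + ", World!"';

// 3. Compile the source into a reusable Script object.
const script = new vm.Script(source, { filename: 'hello.js' });

// 4. Run the compiled script inside the context and collect the result.
const result = script.runInContext(context);

console.log(result); // prints: Hello, World!
```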
Finally, we can create a simple Node.js by combining all the techniques described above :)
C++ -> V8 Templates -> Run JavaScript -> X
You can create a VM instance (known as an Isolate in V8) -> create as many FunctionTemplate instances, with assigned C++ callbacks, as you want -> create an ObjectTemplate instance and assign all the created FunctionTemplate instances to it -> create a JavaScript context with our ObjectTemplate instance as its global object -> run JavaScript in this context and voilà -> Node.js. Sweet!
But what is the “X” after “Run JavaScript” in the chapter’s title? There is a small problem with the implementation above. We missed one very important thing.
Imagine that you wrote a lot of C++ methods (around 10k LOC) that work with fs, http, crypto, etc. We have assigned these C++ callbacks to FunctionTemplate instances and added those FunctionTemplate instances to an ObjectTemplate. After getting a JavaScript instance of this ObjectTemplate, we have access to all the FunctionTemplate instances from JavaScript via the global scope. Looks like everything works great, but…
What if we don’t need fs right now? What if we don’t need crypto features at all? What about not getting modules from the global scope, but requiring them on demand? What about not writing C++ code in one big file with all the C++ callbacks in there? So the “X” means…
Modularity!
All those C++ methods should be split into modules, each in its own file (this simplifies development), so that each C++ module corresponds to one feature, such as fs or http. The same logic applies to the JavaScript context: JavaScript modules must not be accessible from the global scope, but available on demand.
Based on these best practices, we need to implement our own module loader. It should handle both C++ modules and JavaScript modules, so that we can grab a C++ module on demand from C++ code and, likewise, grab a JavaScript module on demand from JavaScript code.
Let’s start with the C++ module loader.
C++ Module Loader
Disclaimer: There will be a lot of C++ code here, so try not to lose your mind :)
Let’s start with the basics of all module loaders. Each module loader must have a variable that contains all modules (or information on how to get them). Let’s declare a C++ structure to store information about C++ modules and name it node_module.
We can store information about existing modules in this structure. As a result, we have a simple dictionary of all available C++ modules.
I will not explain every field of this structure, but I want you to pay attention to a few. In nm_filename, we can store the filename of our module, so we know where to load it from. In nm_register_func and nm_context_register_func, we can store the functions we need to call when the module is required; these functions will instantiate the Template instance. And nm_modname can store the module name (not the filename).
Next, we need to implement helper methods that work with this structure. We can write a simple method that saves information to our node_module structure and then use this method in our module definitions. Let’s call it node_module_register.
All this method does is save the new module’s information into our node_module structure.
Now we can simplify the registration process using a macro. Let’s declare a macro you can use in your C++ module.
In fact, we declare two macros. The first is a wrapper for the node_module_register method. The second is a wrapper for the first macro with some predefined arguments. As a result, we have a macro that accepts two arguments: modname and regfunc. When it’s called, we save the new module’s information in our node_module structure. What do modname and regfunc mean? Well… modname is just our module name, like fs, for instance. regfunc is the module’s initialization function we talked about earlier. This function is responsible for initializing the V8 Template and assigning it to the ObjectTemplate.
So we can declare each C++ module with a macro that accepts the module name (modname) and the initialization function (regfunc) that will be called when the module is required. All we need to do now is create C++ methods that read that information from the node_module structure and call regfunc.
Let’s write a simple method that searches for a module in the node_module structure by its name. We’ll call it get_builtin_module.
It returns the declared module whose nm_modname matches the given name.
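The registry described so far can be modelled in a few lines of JavaScript. This is only a model of the C++ code: the nm_* field names come from the text, while the linked list through nm_link, the helper registerBuiltin (standing in for the macro) and the .cc file names are assumptions made for illustration.

```javascript
// Head of a singly linked list of module records (an assumed layout).
let modlistBuiltin = null;

// Model of node_module_register: prepend the record to the list.
function nodeModuleRegister(mod) {
  mod.nm_link = modlistBuiltin;
  modlistBuiltin = mod;
}

// Model of the registration macro: build a record from modname and regfunc.
function registerBuiltin(modname, regfunc) {
  nodeModuleRegister({
    nm_filename: modname + '.cc', // hypothetical source file name
    nm_register_func: null,
    nm_context_register_func: regfunc,
    nm_modname: modname,
    nm_link: null,
  });
}

// Model of get_builtin_module: walk the list and compare names.
function getBuiltinModule(name) {
  for (let mp = modlistBuiltin; mp !== null; mp = mp.nm_link) {
    if (mp.nm_modname === name) return mp;
  }
  return null;
}

registerBuiltin('fs', (exports) => { exports.kind = 'fs bindings'; });
registerBuiltin('http', (exports) => { exports.kind = 'http bindings'; });

console.log(getBuiltinModule('fs').nm_modname); // prints: fs
console.log(getBuiltinModule('crypto'));        // prints: null
```

Note that registering a module only records how to initialize it. Nothing is initialized until someone looks the module up and calls its regfunc.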
Based on the information in the node_module structure, we can write a simple method that loads a C++ module and assigns its V8 Template instance to our ObjectTemplate. As a result, this ObjectTemplate will be sent to the JavaScript context as a JavaScript instance.
A few notes on this method, which we call Binding. Binding takes a module name as an argument: the name you supplied via the macro. We look up information about this module with the get_builtin_module method. If we find it, we call the module’s initialization function, passing some useful arguments such as exports. exports is an ObjectTemplate instance, so we can use the V8 Template API on it. After all these operations, Binding returns the exports object as its result. As you remember, an ObjectTemplate instance can produce a JavaScript instance, and that’s what Binding does.
The last thing we should do is make this method available from the JavaScript context. We do this by wrapping the Binding method in a FunctionTemplate and assigning it to the binding property of the global process object.
At this stage, you can call process.binding('fs'), for instance, and get the native bindings for it.
As an example, consider a built-in module with its logic omitted for simplicity.
Such a module creates a binding named “v8” that exports a JavaScript object, so that calling process.binding('v8') from the JavaScript context returns this object.
Hopefully, you are still following along.
Now we should build a JavaScript module loader that will let us do all the neat stuff like require('fs').
JavaScript Module Loader
Great, thanks to our latest improvements, we can call process.binding() and get access to C++ bindings from JavaScript context. But this still does not resolve the issue with JavaScript modules. How can we write JavaScript modules and require them on demand?
First, we need to understand that there are two different types of modules. One is the JavaScript modules we write alongside the C++ callbacks. In a nutshell, these are the Node.js built-in modules, like fs, http, etc. Let’s call these modules NativeModule. The other type is the modules in your working directory. Let’s call them just Module.
We need to require both types. That means we need to know how to grab NativeModule from Node.js and Module from your working directory.
Let’s start with NativeModule first.
All native JavaScript modules are located within our C++ project in a separate folder. That means all their JavaScript sources are accessible at compile time. This allows us to wrap the JavaScript sources into a C++ header file that we can use.
There’s a Python tool for this called js2c.py (located at tools/js2c.py). It generates the node_natives.h header file with the wrapped JavaScript code. node_natives.h can be included in any C++ code to get the JavaScript sources within C++.
Now we can use JavaScript sources in the C++ context, so let’s try it out. We can implement a simple method, DefineJavaScript, that gets the JavaScript sources from node_natives.h and assigns them to an ObjectTemplate instance.
DefineJavaScript iterates through each native JavaScript module and sets it on the ObjectTemplate instance, with the module name as the key and the module source as the value. The last thing we need to do is call DefineJavaScript with an ObjectTemplate instance as the target.
The Binding method comes in handy here. Our C++ implementation of Binding (see the C++ Module Loader section) hard-codes two bindings: constants and natives. Thus, if the binding’s name is natives, the DefineJavaScript method is called with the environment and exports objects. As a result, the native JavaScript modules are returned when you call process.binding('natives').
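As a rough JavaScript model of this step, assume the generated header boils down to a table of name and source pairs. The natives array, its two entries, and this simplified binding function are all illustrative stand-ins for the C++ code.

```javascript
// Stand-in for the table that node_natives.h would provide.
const natives = [
  { name: 'events', source: 'exports.name = "events";' },
  { name: 'util', source: 'exports.name = "util";' },
];

// Model of DefineJavaScript: copy every source onto the target object,
// keyed by module name.
function defineJavaScript(target) {
  for (const native of natives) {
    target[native.name] = native.source;
  }
}

// Model of the hard-coded "natives" branch inside Binding.
function binding(name) {
  const exports = {};
  if (name === 'natives') defineJavaScript(exports);
  return exports;
}

const sources = binding('natives');
console.log(Object.keys(sources)); // prints: [ 'events', 'util' ]
```

What comes back is source text, not loaded modules. Turning that text into a working module is the job of the JavaScript-side loader.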
So, that’s cool. But another improvement can be made here by defining a GYP task in the node.gyp file that calls the js2c.py tool. This way, whenever Node.js is compiled, the JavaScript sources are also wrapped into the node_natives.h header file.
By now, we have the JavaScript sources of our native modules available via process.binding('natives'). Let’s write a simple JavaScript wrapper for NativeModule now.
Now, to load a module, you call the NativeModule.require() method with the name of the module you want to load. It first checks whether the module already exists in the cache. If so, it is taken from the cache; otherwise, the module is compiled, cached, and returned as an exports object.
Let’s inspect the cache and compile methods now.
All cache does is store the NativeModule instance in the static _cache object on NativeModule.
More interesting is the compile method. First, we get the sources of the required module from _source (we set this static property with process.binding('natives')). We then wrap them in a function with the wrap method. The resulting function accepts the exports, require, module, __filename and __dirname arguments. Afterwards, we call this function with the required arguments. As a result, our JavaScript module is wrapped in a scope that has exports as a reference to the instance’s exports object, require as a reference to NativeModule.require, module as a reference to the NativeModule instance itself, and __filename as a string with the current file name. Now you know where all the stuff like module and require comes from in your JavaScript code. They are just references to the NativeModule instance and its members :)
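Putting require, cache, wrap and compile together, a minimal runnable sketch might look like this. It follows the shape described in the text, but the two module sources (greeter and app) are made up, _source is filled by hand instead of through process.binding('natives'), and vm.runInThisContext is used as the compile step.

```javascript
const vm = require('vm');

function NativeModule(id) {
  this.filename = id + '.js';
  this.id = id;
  this.exports = {};
  this.loaded = false;
}

// Stand-in for process.binding('natives'): module name -> source text.
NativeModule._source = {
  greeter: 'exports.greet = function (name) { return "Hello, " + name; };',
  app: 'module.exports = require("greeter").greet("Node");',
};
NativeModule._cache = {};

NativeModule.require = function (id) {
  const cached = NativeModule._cache[id];
  if (cached) return cached.exports;
  if (!Object.prototype.hasOwnProperty.call(NativeModule._source, id)) {
    throw new Error('No such native module ' + id);
  }
  const nativeModule = new NativeModule(id);
  // Caching before compiling lets circular requires find the instance.
  nativeModule.cache();
  nativeModule.compile();
  return nativeModule.exports;
};

NativeModule.wrap = function (script) {
  return '(function (exports, require, module, __filename, __dirname) { ' +
    script + '\n});';
};

NativeModule.prototype.compile = function () {
  const source = NativeModule.wrap(NativeModule._source[this.id]);
  // Compiling the wrapper yields a function; calling it runs the module body.
  const fn = vm.runInThisContext(source, { filename: this.filename });
  fn(this.exports, NativeModule.require, this, this.filename);
  this.loaded = true;
};

NativeModule.prototype.cache = function () {
  NativeModule._cache[this.id] = this;
};

console.log(NativeModule.require('app')); // prints: Hello, Node
```

The wrapper function is the whole trick: module code never declares exports, require or module, it simply receives them as parameters.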
The other piece is the Module loader implementation.
The Module loader works the same way as NativeModule. The difference is that sources are not taken from the node_natives.h header file, but from files we read with the fs native module. So we do all the same things (wrap, cache and compile), only with sources read from a file.
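A file-based variant can be sketched the same way. The function loadFile and the temporary demo files below are hypothetical, and real module resolution (node_modules lookup, extensions, and so on) is deliberately left out.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const vm = require('vm');

const moduleCache = {};

function wrap(script) {
  return '(function (exports, require, module, __filename, __dirname) { ' +
    script + '\n});';
}

// Load a module from disk: read, wrap, compile, run, cache.
function loadFile(filename) {
  const resolved = path.resolve(filename);
  if (moduleCache[resolved]) return moduleCache[resolved].exports;

  const mod = { id: resolved, exports: {}, loaded: false };
  moduleCache[resolved] = mod;

  const source = fs.readFileSync(resolved, 'utf8');
  const fn = vm.runInThisContext(wrap(source), { filename: resolved });
  // Relative requests are resolved against the requiring file's directory.
  const localRequire = (request) =>
    loadFile(path.resolve(path.dirname(resolved), request));
  fn(mod.exports, localRequire, mod, resolved, path.dirname(resolved));
  mod.loaded = true;
  return mod.exports;
}

// Demo: write two small modules to a temporary directory and load one.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'module-demo-'));
fs.writeFileSync(path.join(dir, 'math.js'), 'exports.double = (n) => n * 2;');
fs.writeFileSync(path.join(dir, 'main.js'),
  'module.exports = require("./math.js").double(21);');

console.log(loadFile(path.join(dir, 'main.js'))); // prints: 42
```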
Great, now we know how to require native modules or modules from your working directory.
Finally, we can write a simple JavaScript module that will run each time we run Node.js and prepare the Node.js environment using all the stuff above.
Node.js Runtime Library
What is a runtime library? It’s a library that prepares the environment by setting global variables such as process, console and Buffer, and then runs the main script you pass to the Node.js CLI as an argument. We can achieve this with a simple JavaScript file that executes at Node.js startup, before all other JavaScript code.
We can start by proxying all our native modules to the global scope and setting up the other global variables. It’s just a lot of code that does something like global.Buffer = NativeModule.require('buffer') or global.process = process.
The second step is running the main script that you pass to the Node.js CLI as an argument. The logic is simple here: it reads process.argv[1] and creates a Module instance with that value as its constructor argument. Module can then read the sources from the file -> cache and compile them, just as NativeModule does with the JavaScript sources embedded at build time.
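The two steps can be illustrated with a toy startup function. This is a heavily simplified sketch, not the contents of the real bootstrap file: the native modules, the in-memory files table and the fake process object are all placeholders, and a vm context plays the role of the fresh global scope.

```javascript
const vm = require('vm');

const output = [];

// Placeholder native modules (hypothetical, for illustration only).
const nativeModules = {
  buffer: { isPlaceholder: true },
  console: { log: (...args) => output.push(args.join(' ')) },
};
const nativeRequire = (id) => nativeModules[id];

// Stand-in for the file system: path -> source of the main script.
const files = {
  '/app/main.js': 'console.log("argv[1] is", process.argv[1]);',
};

function startup(sandbox, process) {
  // Step 1: set up the globals every script expects.
  sandbox.global = sandbox;
  sandbox.process = process;
  sandbox.Buffer = nativeRequire('buffer');
  sandbox.console = nativeRequire('console');

  // Step 2: run the main script named on the command line, if any.
  const mainPath = process.argv[1];
  if (mainPath !== undefined) {
    vm.runInContext(files[mainPath], sandbox, { filename: mainPath });
  }
}

const sandbox = vm.createContext({});
startup(sandbox, { argv: ['node', '/app/main.js'] });

console.log(output); // prints: [ 'argv[1] is /app/main.js' ]
```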
There’s not much I can add here; it’s simple. If you want more details, though, you can look at the src/node.js file in the node repository. This file is executed at Node.js startup and uses all the techniques described in this article.
This is how Node.js can run your JavaScript code with access to low-level API. Cool, isn’t it?
But all the above can’t do any asynchronous stuff yet. All the operations like fs.readFile() are synchronous at this point.
What do we need for asynchronous operations? An event loop…
Event Loop
An event loop is a message dispatcher that waits for and dispatches events or messages in a program. It works by making a request to some internal or external event provider (which blocks the request until an event has arrived), and then it calls the relevant event handler (dispatches the event). The event loop may be used with a reactor if the event provider follows the file interface, which can be selected or polled. The event loop almost always operates asynchronously with the message originator.
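The dispatching part of that definition fits in a small toy class. ToyEventLoop and its event names are invented for this sketch; unlike a real loop, it never blocks waiting for the operating system and simply stops when its queue is empty.

```javascript
// A toy event loop: a queue of events plus a table of handlers.
class ToyEventLoop {
  constructor() {
    this.queue = [];
    this.handlers = new Map();
  }

  // Register the handler for one event type.
  on(type, handler) {
    this.handlers.set(type, handler);
  }

  // Called by an "event provider" to announce that something happened.
  post(type, payload) {
    this.queue.push({ type, payload });
  }

  // Take events one by one and dispatch each to its handler.
  run() {
    while (this.queue.length > 0) {
      const event = this.queue.shift();
      const handler = this.handlers.get(event.type);
      if (handler) handler(event.payload, this);
    }
  }
}

const seen = [];
const loop = new ToyEventLoop();
loop.on('read-done', (data, l) => {
  seen.push('read: ' + data);
  l.post('write-done', data.length); // handlers may schedule more events
});
loop.on('write-done', (bytes) => seen.push('wrote ' + bytes + ' bytes'));

loop.post('read-done', 'hello');
loop.run();

console.log(seen); // prints: [ 'read: hello', 'wrote 5 bytes' ]
```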
The Environment object that Node.js builds around a V8 context can accept an event loop as an argument when it is created. But before handing an event loop to it, we need to implement one…
Luckily, such an implementation already exists, and it’s called libuv. It’s responsible for all asynchronous operations, such as reading a file. Without libuv, Node.js is just synchronous JavaScript/C++ execution.
So, we can include the libuv sources in Node.js and create the Environment with libuv’s default event loop.
The CreateEnvironment method accepts a libuv event loop as its loop argument. It calls Environment::New, which belongs to the node namespace rather than to V8, passing it the libuv event loop, and then configures the resulting Environment. That’s how Node.js became asynchronous.
I’d like to talk about libuv more and tell you how it works, but that’s another story for another time :)
Thanks
Thanks to everyone who has read this post to the end. I hope you enjoyed it and learned something new. If you found any issues, leave a comment and I’ll reply as soon as possible.
Eugene Obrezkov, Technical Leader at Onix-Systems, Kirovohrad, Ukraine.