Set up a Neo4j server with Docker importing huge CSV datasheets

From both a work and a research perspective, it is clear how fundamental graph data is - from disease detection, genetics, and healthcare to banking and engineering, graphs are emerging as a powerful analysis paradigm for hard problems. Neo4j gives developers and data scientists the most trusted and advanced tools to quickly build today’s intelligent applications and machine learning workflows. And the great thing is that it is (almost) free.

Neo4j relies on Cypher query language (open source). Cypher allows users to store and retrieve data from the graph database. Cypher’s LOAD CSV is great for importing small-data sets into our running database. Nevertheless, dealing with big data can be pretty tedious (if not ludicrously slow). To initialize an unused database with large amounts of data from CSV files we need to use the neo4j-admin command.

Neo4j Admin is the primary tool for managing your Neo4j instance. It is a command-line tool that is installed as part of the product and can be executed with several commands. Here, we show how to set up a Neo4j server via a Docker image on a local machine, importing first big data on a brand new graph database, and then running the server in a self-hosted fashion. And again, all this for free.

Contents

The NorthWind dataset

↑ back to contents ↑

Before loading our CSV data into our server, we have to convert it in a Neo4j-compliant fashion. Here, we will be using the NorthWind dataset, an often-used SQL dataset. This data depicts a product sale system - storing and tracking customers, products, customer orders, warehouse stock, shipping, suppliers, and even employees and their sales territories:

Northwind dataser ER diagram

Although the NorthWind dataset is often used to demonstrate SQL and relational databases, the data also can be structured as a graph:

Northwind dataser ER diagram

The main differences between the Graph Model and the Relational Model are the following:

Relational Model Graph Model
null values allowed No null value allowed
Less detailed and clear (e.g. to model sales, we need an Orders-to-Employees foreign key relationship) More detailed and clear (e.g we know that an employee SOLD an order)
Faster for unconnected data Faster for connected data
Query latency proportional to the amount of data stored (“join bomb”) Query latency proportional to how much of the graph you choose to explore in a query

We have the orders.csv file:

OrderID CustomerID EmployeeID OrderDate RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry OrderID ProductID UnitPrice Quantity Discount
10248 VINET 5 1996-07-04 1996-08-01 1996-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l’Abbaye Reims   51100 France 10248 11 14 12 0
10248 VINET 5 1996-07-04 1996-08-01 1996-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l’Abbaye Reims   51100 France 10248 42 9.8 10 0

Then we have the employees.csv file:

EmployeeID LastName FirstName Title TitleOfCourtesy BirthDate HireDate Address City Region PostalCode Country HomePhone Extension Photo Notes ReportsTo PhotoPath
1 Davolio Nancy Sales Representative Ms. 1948-12-08 1992-05-01 507 - 20th Ave. E.\nApt. 2A Seattle WA 98122 USA (206) 555-9857 5467 \x “Education includes a BA in (…)” 2 http://accweb/emmployees/davolio.bmp
2 Fuller Andrew “Vice President, Sales” Dr. 1952-02-19 1992-08-14 908 W. Capital Way Tacoma WA 98401 USA (206) 555-9482 3457 \x “Andrew received his BTS commercial (…)”   http://accweb/emmployees/fuller.bmp

We will limit this example to just these two entities () and the relationship between them.

Prepare your CSV data

↑ back to contents ↑

In order to import it into Neo4j, we have to prepare the headers in a Neo4j-compliant fashion. From the two CSV files, we will need three separate ones.

We have the orders_prepared.csv file:

OrderID:ID OrderDate:DATE RequiredDate:DATE ShippedDate:DATE ShipVia:INTEGER Freight:FLOAT ShipName:STRING ShipAddress:STRING ShipCity:STRING ShipRegion:STRING ShipPostalCode:INTEGER ShipCountry:STRING UnitPrice:FLOAT Quantity:INTEGER Discount:FLOAT
10248 1996-07-04 1996-08-01 1996-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l’Abbaye Reims   51100 France 14 12 0
10248 1996-07-04 1996-08-01 1996-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l’Abbaye Reims   51100 France 9.8 10 0

Then we have the employees_prepared.csv file:

EmployeeID:ID LastName:STRING FirstName:STRING Title:STRING TitleOfCourtesy:STRING BirthDate:DATE HireDate:DATE Address:STRING City:STRING Region:STRING PostalCode:INTEGER Country:STRING HomePhone:STRING Extension:IGNORE Photo:IGNORE Notes:IGNORE PhotoPath:IGNORE
1 Davolio Nancy Sales Representative Ms. 1948-12-08 1992-05-01 507 - 20th Ave. E.\nApt. 2A Seattle WA 98122 USA (206) 555-9857 5467 \x “Education includes a BA in (…)” http://accweb/emmployees/davolio.bmp
2 Fuller Andrew “Vice President, Sales” Dr. 1952-02-19 1992-08-14 908 W. Capital Way Tacoma WA 98401 USA (206) 555-9482 3457 \x “Andrew received his BTS commercial (…)” http://accweb/emmployees/fuller.bmp

Finally, we have the brand new sold_prepared.csv file:

:START_ID(Order) :START_ID(Employee)
10248 1
10248 2

What do these new headers mean? Briefly, the column names are used for property names of the nodes (Order and Employee) and relationship (SOLD) we want to create. There is specific markup on specific columns:

Note - These are the timestamp formats for Cypher:

RETURN DATE("2019-06-01")
RETURN TIME("18:40:32.142+0100")
RETURN DATETIME("2019-06-01T18:40:32.142+0100")

The Neo4j Docker Image

↑ back to contents ↑

Neo4j provides and maintains official Neo4j Docker images on DockerHub for both Neo4j Community and Enterprise editions (we’re interested in the first one). There are several ways to leverage Docker for your Neo4j development and deployment. You can create throw-away Neo4j instances of many different versions for testing and running your applications. You can also pre-seed containers with datasets, extensions, and configurations for interaction and processing. Here, we want to perform two steps on our local machine:

  1. Import data on the default database called neo4j (the only one provided by the Community edition) with an interactive run of the container executing neo4j-admin import, storing database files on persistent Docker volumes
  2. Launch the server in a detached run, exposing both the server and the web client on two ports of the local machine

To download the image, we just need to execute docker pull neo4j with the desired tag:

docker pull:4.3.1-community

We will use three Docker persistent volumes:

Import data

↑ back to contents ↑

In your docker volume folder /neo4j-import/_data, create a csv_files folder where to store your data:

neo4j-import/_data/
└── csv_files/
    ├── orders_prepared.csv
    ├── employees_prepared.csv
    └── sold_prepared.csv

Then, import data on the default database with a interactive run of the container executing neo4j-admin import, storing database files on the docker volume neo4j-data:

docker run --interactive --tty --rm \
    --env=NEO4J_AUTH=neo4j/<YOUR_PASSWORD> \
    --env=NEO4JLABS_PLUGINS='["apoc", "graph-data-science", "n10s"]' \
    --volume=neo4j-data:/data \
    --volume=neo4j-import:/var/lib/neo4j/import \
    --volume=neo4j-plugins:/plugins \
    --name=neo4j-server \
    neo4j:4.3.1-community \
bin/neo4j-admin import \
--database=neo4j \
--skip-bad-relationships \
--nodes=Order=import/csv_files/orders_prepared.csv \
--nodes=Employee=import/csv_files/employees_prepared.csv \
--relationships=SOLD=import/csv_files/sold_prepared.csv \

What all these parameters mean?

Launch the server

↑ back to contents ↑

Now we finally have our data imported in our Dockdr persistent volume neo4j-data. We can now start the container with the following command:

docker run -d --restart always \
    --env=NEO4J_AUTH=neo4j/<YOUR_PASSWORD> \
    --env=NEO4JLABS_PLUGINS='["apoc","graph-data-science", "n10s"]' \
    --env=NEO4J_dbms_connector_http_listen__address=:6476 \
    --env=NEO4J_dbms_connector_https_listen__address=:6477 \
    --env=NEO4J_dbms_connector_bolt_listen__address=:7687 \
    --env=NEO4J_dbms_connector_http_advertised__address=:6476 \
    --env=NEO4J_dbms_connector_https_advertised__address=:6477 \
    --env=NEO4J_dbms_connector_bolt_advertised__address=:7687 \
    --env=NEO4J_dbms_security_procedures_unrestricted=gds.*,apoc.* \
    --env=NEO4J_dbms_security_procedures_allowlist=gds.*,apoc.* \
    --publish=<HTTP_PORT>:6476 \
    --publish=<HTTPS_PORT>:6477 \
    --publish=<BOLT_PORT>:7687 \
    --volume=neo4j-data:/data \
    --volume=neo4j-import:/var/lib/neo4j/import \
    --volume=neo4j-plugins:/plugins \
    --name=neo4j-server \
    neo4j:4.3.1-community 

What all these parameters mean?

Now, at machine.local/<HTTP_PORT>/browser we can access the Neo4j Browser WebUI:

Neo4j Browser WebUI

Upgrade to Neo4J Enterprise & Bloom

↑ back to contents ↑

If you want to use the enterprise version with Neo4j Bloom (both NOT available for free), you need the neo4j:4.3.1-enterprise docker image and a valid activation key for the Bloom server. Then, proceed with the following steps:

  1. download the Bloom server package
  2. unzip the downloaded package and place the 4.x .jar file into the neo4j-plugins docker volume
  3. create a new neo4j-licenses volume and place in it your ```bloom.license`` file
  4. now you are ready to start the server:
     docker run -d --restart always \
       --env NEO4J_ACCEPT_LICENSE_AGREEMENT=yes \
       --env=NEO4J_AUTH=neo4j/<YOUR_PASSWORD> \
       --env=NEO4JLABS_PLUGINS='["apoc","bloom","graph-data-science","n10s"]' \
       --env=NEO4J_dbms_connector_http_listen__address=:6476 \
       --env=NEO4J_dbms_connector_https_listen__address=:6477 \
       --env=NEO4J_dbms_connector_bolt_listen__address=:7687 \
       --env=NEO4J_dbms_connector_http_advertised__address=:6476 \
       --env=NEO4J_dbms_connector_https_advertised__address=:6477 \
       --env=NEO4J_dbms_connector_bolt_advertised__address=:7687 \
       --env=NEO4J_dbms_security_procedures_unrestricted=gds.*,apoc.*,bloom.* \
       --env=NEO4J_dbms_security_procedures_allowlist=gds.*,apoc.*,bloom \
       --env=NEO4J_dbms_unmanaged__extension__classes=com.neo4j.bloom.server=/browser/bloom \
       --env=NEO4J_neo4j_bloom_license__file=/licenses/bloom.license \
       --publish=<HTTP_PORT>:6476 \
       --publish=<HTTPS_PORT>:6477 \
       --publish=<BOLT_PORT>:7687 \
       --volume=neo4j-data:/data \
       --volume=neo4j-import:/var/lib/neo4j/import \
       --volume=neo4j-licenses:/licenses \
       --volume=neo4j-plugins:/plugins \
       --name=neo4j-server \
       neo4j:4.3.0-enterprise