Nebula seems like a decent fit, but getting started with it is a bit tricky, and there are a lot of outdated docs floating around.

While it has a proper large-scale architecture, I want to just use it on my workstation for now, so I'll use the docker-compose method.

Start the database

I don't think this has persistence, so expect to redo all your work if the workstation is shut down. I don't know for sure, because I've never used docker-compose before.

These docs are old but they work fine to get it up and running:

Connect to the console

docker run --rm -it --network nebula-docker-compose_nebula-net --entrypoint=/bin/sh vesoft/nebula-console:v2-nightly

/ # nebula-console -addr graphd -port 9669 -u root -p nebula

This is like a mysql/psql shell now. :QUIT to exit.

Make some structure

I'm using this series of videos:

The syntax has changed a little since then though, here's the reference:

(root@nebula) [(none)]> CREATE SPACE mneme (vid_type = int64);
Execution succeeded (time spent 1229/1435 us)

(root@nebula) [(none)]> USE mneme;

(root@nebula) [mneme]> CREATE TAG person (displayname string NOT NULL, dob_year INT, dob_month INT, dob_day INT) comment = 'A single identifiable person, who may have multiple handles/names';

(root@nebula) [mneme]> CREATE TAG handle (name STRING NOT NULL, context STRING, url STRING) comment = 'A name that a Person is known by, in a specific context or situation';

(root@nebula) [mneme]> CREATE TAG photoshoot(shoot_date DATE NOT NULL, shoot_name STRING);

(root@nebula) [mneme]> CREATE EDGE did_shoot(role STRING);

(root@nebula) [mneme]> CREATE TAG location(name STRING, coords GEOGRAPHY) comment = 'Usually a LatLong of a point, but could be a linestring or polygon';

(root@nebula) [mneme]> CREATE EDGE shot_at();

(root@nebula) [mneme]> CREATE EDGE known_as();

Create some data

INSERT VERTEX person(displayname, dob_year, dob_month, dob_day) VALUES 1:("Barney Desmond", 1989, 12, 13);
INSERT VERTEX handle(name) VALUES 2:("furinkan");
INSERT EDGE known_as() VALUES 1 -> 2:();

INSERT VERTEX person(displayname) VALUES 3:("Victoria Ho");
INSERT VERTEX handle(name) VALUES 4:("dboomer");
INSERT EDGE known_as() VALUES 3 -> 4:();

INSERT VERTEX person(displayname) VALUES 5:("Jessica Li");
INSERT VERTEX handle(name, context, url) VALUES 6:("Kisara Shimada", "Facebook", "");
INSERT VERTEX handle(name, context, url) VALUES 7:("aerithxzack", "Instagram", "");
INSERT VERTEX handle(name, context, url) VALUES 8:("Kisa_9225", "Twitter", "");

// these could be a single INSERT, that's fine too
INSERT EDGE known_as() VALUES 5 -> 6:() ;
INSERT EDGE known_as() VALUES 5 -> 7:() ;
INSERT EDGE known_as() VALUES 5 -> 8:() ;

Now with some masking

INSERT VERTEX handle(name) VALUES 9:("Valerious");
// Yes it really is in longitude-latitude order >:(
INSERT VERTEX location(name, coords) VALUES 10:("International Convention Centre Sydney", ST_GeogFromText("POINT(151.199 -33.8734)"));
INSERT VERTEX photoshoot(shoot_date, shoot_name) VALUES 11:(date("2020-03-08"), "Madman Anime Festival");

INSERT VERTEX handle(name) VALUES 12:("");
INSERT EDGE known_as() VALUES 1 -> 12:();

INSERT EDGE shot_at() VALUES 11 -> 10:() ;
INSERT EDGE did_shoot(role) VALUES 9 -> 11:("cosplayer") ;
INSERT EDGE did_shoot(role) VALUES 12 -> 11:("photographer") ;
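
Since the longitude-first order is easy to get wrong, a small helper on the Python side can build the WKT string from the more familiar latitude/longitude pair. This is only a sketch: wkt_point is a name I made up for illustration, not part of any Nebula client, and the range checks are just there to catch swapped arguments.

```python
def wkt_point(lat: float, lon: float) -> str:
    """Build a WKT POINT from a latitude/longitude pair.

    WKT wants x (longitude) first, then y (latitude).
    wkt_point is a hypothetical helper, not a Nebula API.
    """
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    return f"POINT({lon} {lat})"


# ICC Sydney, given the way most maps show it: latitude, then longitude
print(wkt_point(-33.8734, 151.199))  # POINT(151.199 -33.8734)
```

The result is the string you'd pass to ST_GeogFromText in the INSERT above.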

Valerious is a cosplayer, but she doesn't have a Person associated with her. As demonstrated here:

(root@nebula) [mneme]> GO FROM 11 OVER did_shoot REVERSELY YIELD properties($$).name AS participant;
| participant      |
| "Valerious"      |
| "" |

// Now try and map that back to a real person
> GO FROM 11     OVER did_shoot REVERSELY YIELD did_shoot._dst as hid | \
  GO FROM $-.hid OVER known_as REVERSELY YIELD $^.handle.name AS handle, properties($$).displayname AS person_name;
| handle           | person_name      |
| "" | "Barney Desmond" |

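Those special variables ($$ and $- and $^) are reference operators: $^ is the source vertex of an edge, $$ is the destination vertex, and $- is the output of the previous statement in a pipe.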

Inspect it with the studio

The Studio is basically a nice web-enabled console.

docker pull vesoft/nebula-graph-studio:v3.4.0

docker run -d -it -p 7001:7001 vesoft/nebula-graph-studio:v3.4.0

Hit it on http://localhost:7001/

You can connect with:

Python client

Activate your venv and then install the client library. I'm using Nebula 3.4, but the latest client is 3.1.0

pip install nebula3-python==3.1.0

Then this is the easiest way to use it.

>>> from nebula3.gclient.net import ConnectionPool
>>> from nebula3.Config import Config
>>> # define a config
>>> config = Config()
>>> config.max_connection_pool_size = 10
>>> # init connection pool
>>> connection_pool = ConnectionPool()
>>> # if the given servers are ok, return true, else return false
>>> ok = connection_pool.init([('', 9669)], config)
>>> ok
>>> session = connection_pool.get_session('root', 'nebula')
>>> session.execute('USE mneme')
>>> session.execute('SHOW TAGS')
ResultSet(keys: ['Name'], values: ["handle"],["location"],["person"],["photoshoot"])

>>> from pprint import pprint as PP
>>> r = session.execute('''GO FROM 5 OVER known_as YIELD $^.person.displayname AS person, $$.handle.name AS handle;''')
>>> ks = r.keys()

>>> PP([ dict(zip(ks, [x.value.decode('utf8') for x in row.values])) for row in r.rows() ])
[{'handle': 'Kisara Shimada', 'person': 'Jessica Li'},
 {'handle': 'aerithxzack', 'person': 'Jessica Li'},
 {'handle': 'Kisa_9225', 'person': 'Jessica Li'}]
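
That list comprehension is worth wrapping up as a function if you run more than a couple of queries. The sketch below deliberately takes plain Python inputs (a list of column names, and rows of bytes cells) rather than a real result object, so it assumes nothing about the client's API beyond what's shown above. rows_to_dicts is a hypothetical name, and the sample data is just two of the rows from the query above.

```python
from pprint import pprint


def rows_to_dicts(keys, rows, encoding="utf8"):
    """Pair each row's cells with the column names.

    keys: list of column names, e.g. from r.keys()
    rows: iterable of rows, each a list of bytes cells
    (hypothetical helper; pulling the raw bytes out of the
    client's row objects is left to the caller)
    """
    return [
        dict(zip(keys, (cell.decode(encoding) for cell in row)))
        for row in rows
    ]


keys = ["person", "handle"]
rows = [
    [b"Jessica Li", b"Kisara Shimada"],
    [b"Jessica Li", b"aerithxzack"],
]
pprint(rows_to_dicts(keys, rows))
```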

Generating vertex IDs

Nebula forces you to come up with your own Vertex ID (VID) for each vertex, and they can either be fixed-width strings, or 64-bit integers. It's clear that they're designed to be unique lookup keys for vertices, hence the fixed sizes.

While I could just generate random numbers, I think I'd like the vertex IDs to be predictable or calculable. That means you really want something that's a hash of the data in the vertex, a sort of primary key. UUID seems like an obvious choice here.

UUID4 for over 120 bits of pure entropy, or UUID5 for something derivable.

Really this is a bit too long, and I'd be happy to throw away a few bits. In practical terms, 64 bits would be plenty (and it's all Nebula could originally handle for VIDs anyway). But let's ignore that for now.

In python this is trivial:

import uuid
uuid.uuid4()

Boom, there's your uuid.

Let's say you want something deterministic. UUID5 takes a namespace (also a UUID) and a "name" in that namespace; it basically just glues them together and calculates the SHA1 sum of that. It's deterministic, which is great. We can define ourselves a namespace, and crank out names to our heart's content.
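
To make the "glue and hash" description concrete, here's a sketch that rebuilds a UUID5 by hand and checks it against the standard library. One detail the description skips: the SHA1 digest is truncated to 128 bits, and a few of those bits are overwritten with the version and variant markers. manual_uuid5 is a name made up for this example, and the namespace and name are arbitrary.

```python
import hashlib
import uuid


def manual_uuid5(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Rebuild a version-5 UUID by hand (illustrative only; use uuid.uuid5)."""
    # Glue the namespace bytes and the name together, then SHA1 the lot
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    raw = bytearray(digest[:16])       # SHA1 is 160 bits, keep the first 128
    raw[6] = (raw[6] & 0x0F) | 0x50    # top nibble of byte 6 = version 5
    raw[8] = (raw[8] & 0x3F) | 0x80    # top two bits of byte 8 = RFC 4122 variant
    return uuid.UUID(bytes=bytes(raw))


namespace = uuid.UUID(int=0)
print(manual_uuid5(namespace, "mneme") == uuid.uuid5(namespace, "mneme"))  # True
```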

Here's how I'm thinking of doing it, it's UUIDs all the way down, baby!

import uuid

# Arbitrary starting point, but you could generate a fresh uuid4 for your installation and use that as the root of your scheme.
# This is kind of like the secret key or seed for your tripcode generator on *chan boards.
# If you don't want VIDs to be guessable by someone outside the system, you keep this a secret.
u_root = uuid.UUID(int=0)  # UUID('00000000-0000-0000-0000-000000000000')

# Create a space for the app itself, kinda optional I guess. Output depends on your root seed.
u_mneme = uuid.uuid5(u_root, 'mneme')  # UUID('2d1794fd-3437-5b2f-b24e-3deeb700a8b7')

# Define a UUID that'll be the input for all Person vertices. This ensures that name collisions between different tags (vertex types) don't produce VID collisions
u_person = uuid.uuid5(u_mneme, 'person')  # UUID('d0fbb4ce-4137-5798-b090-27c010158483')

# Now we finally generate the VID for a single Person
u_joe = uuid.uuid5(u_person, 'Joe Bloggs')  # UUID('7fb08cbe-7679-5d43-b542-de1fb158d9a4')
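
The same chain can be folded into one function so that every tag gets its own namespace automatically. This is just a sketch of how I might package it: vertex_uuid, U_ROOT and U_APP are names invented here, and the all-zero root is the same arbitrary starting point as above.

```python
import uuid

U_ROOT = uuid.UUID(int=0)             # same arbitrary root as above
U_APP = uuid.uuid5(U_ROOT, "mneme")   # namespace for the app


def vertex_uuid(tag: str, name: str) -> uuid.UUID:
    """Derive a deterministic UUID for a vertex: root -> app -> tag -> name."""
    tag_ns = uuid.uuid5(U_APP, tag)
    return uuid.uuid5(tag_ns, name)


joe = vertex_uuid("person", "Joe Bloggs")
assert joe == vertex_uuid("person", "Joe Bloggs")   # repeatable
assert joe != vertex_uuid("handle", "Joe Bloggs")   # same name, different tag
print(joe)
```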

Encoding vertex IDs

I would like to encode this better though. We're not even going to be using them that much, so it makes sense to ensure they're easy to generate and read. UUIDs are long and have way too much entropy for me to care about, so let's chop them down to 64-bit as well.

We'd like something shorter, without too many symbols, and of a fixed textual width. Base32 is good, base64 is fine as well. Base64 contains a couple of symbols to round out the alphabet, so that's less preferable.

Also worth considering is whether we need them to be suitable as labels anywhere else. One particular issue in certain markup languages is that labels must start with an alphabetic character. This rules out pure base32 or base64 because there's a high likelihood that a label will start with a digit, and thus cause confusion.

Here is one proposal that aims to encode UUIDs such that they're usable as "NCNames" in XML/HTML/RDF:

That would do the job very nicely, though the bookend values do waste some space. The fact that they're identifiable could be of benefit to humans though.

We don't need NCNames though, so we can be happy with a representation that simply avoids symbols.

Based on the format of UUIDs, specifically of a UUID5, it's clear that we can pluck out 64 bits however it suits us. We might as well skip the chunks containing the version and variant bits, so let's grab the 2nd and 5th chunk and base58-encode that.

# Joe Bloggs
xxxxxxxx-oooo-xxxx-xxxx-oooooooooooo  # keep the 'o', discard the 'x'

u_joe_id = u_joe.bytes[4:6] + u_joe.bytes[10:16]

import base58
u_joe_id_b58 = base58.b58encode(u_joe.bytes[4:6] + u_joe.bytes[10:16])

Now we have a nice short identifier (11 chars at most) for use as a VID. This will be a FIXED_STRING(11), and that needs to be supplied when you first create a graph-space. I'll need to go back and recreate the graph-space like so:

CREATE SPACE mneme (vid_type = FIXED_STRING(11));
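
For reference, here's the whole UUID-to-VID step as one sketch that runs without the third-party base58 package. The b58encode below is a minimal stand-in I wrote for illustration, assuming the Bitcoin-style alphabet; it returns str where base58.b58encode returns bytes. short_vid is likewise a made-up name.

```python
import uuid

# Bitcoin-style base58 alphabet: no 0, O, I or l
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Minimal base58 encoder (stand-in for the base58 package)."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = ALPHABET[rem] + out
    # each leading zero byte is written as a leading '1'
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def short_vid(u: uuid.UUID) -> str:
    """Keep the 2nd and 5th chunks of the UUID (64 bits) and base58-encode them."""
    return b58encode(u.bytes[4:6] + u.bytes[10:16])


u_person = uuid.uuid5(uuid.uuid5(uuid.UUID(int=0), "mneme"), "person")
vid = short_vid(uuid.uuid5(u_person, "Joe Bloggs"))
print(vid, len(vid))
```

Eight bytes never need more than 11 base58 characters, which is why FIXED_STRING(11) is enough.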

This is advice about how you should generate VIDs, something I should've read earlier:

I'm not expecting to hit 1 billion vertices soon, so I'm not worried. If I were, I'd use the whole 128-bit UUID then encode that to 22 chars (sometimes 21) of base58.
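
The character counts are easy to check with a bit of integer arithmetic. This sketch assumes the values are spread evenly over the whole range (only roughly true for a UUID, given its fixed version and variant bits) and ignores the special case of leading zero bytes; max_base58_len is a made-up name.

```python
def max_base58_len(bits: int) -> int:
    """Base58 digits needed for the largest value of the given bit width."""
    n = 2**bits - 1
    length = 0
    while n > 0:
        n //= 58
        length += 1
    return length


for bits in (64, 128):
    longest = max_base58_len(bits)
    # fraction of values small enough to fit in one digit fewer
    shorter = 58 ** (longest - 1) / 2**bits
    print(f"{bits} bits: up to {longest} chars, about {shorter:.1%} come out shorter")
```

So the "sometimes 21" case is the small fraction of UUIDs whose value happens to fit in one fewer digit.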

MeidokonWiki: furinkan/Mneme/NebulaGraph (last edited 2022-08-01 06:06:28 by furinkan)