Description
- Neo4j Mode: Single instance
- Python version: 3.10
- Driver version: Python driver 4.4.5 (also applies to the current state of the repository)
- Operating system: Windows 10
TLDR at the bottom.
Steps to reproduce
The following is a minimal working example to showcase the issue. It merges the same relationship between the same two nodes over and over again. The relationship has 3 properties of type neo4j.time.DateTime.
from datetime import datetime
import cProfile

from neo4j import GraphDatabase
from neo4j.time import DateTime

property_date = datetime.strptime("2021-12-01", "%Y-%m-%d")
neo4j_date = DateTime.from_native(property_date)

NUM_RELATIONSHIPS = 100000
BATCH_SIZE = 1000
DBMS_IP = "127.0.0.1"
DBMS_PORT = 7687
DBMS_AUTH_TUP = ("neo4j", "admin")

# Every relationship carries three properties of type neo4j.time.DateTime.
relationships = [{"start_node_labels": ["Node1"], "start_node_properties": {"prop": "yes"},
                  "end_node_labels": ["Node2"], "end_node_properties": {"prop": "yes"},
                  "label": "relationship",
                  "properties": {f"date{i}": neo4j_date for i in range(3)}}
                 for _ in range(NUM_RELATIONSHIPS)]
def create_relationships(tx, batch):
    tx.run("""WITH $batch AS batch
              UNWIND batch AS ind
              CALL apoc.merge.node(ind.start_node_labels,
                                   ind.start_node_properties)
              YIELD node AS startNode
              CALL apoc.merge.node(ind.end_node_labels,
                                   ind.end_node_properties)
              YIELD node AS endNode
              CALL apoc.merge.relationship(startNode, ind.label,
                                           ind.properties, {}, endNode, {})
              YIELD rel RETURN null""", batch=batch).consume()
connect_string = f"bolt://{DBMS_IP}:{DBMS_PORT}"
driver = GraphDatabase.driver(connect_string, auth=DBMS_AUTH_TUP)
session = driver.session()

# Profile the entire write workload.
with cProfile.Profile() as pr:
    for i in range(0, len(relationships), BATCH_SIZE):
        batch = relationships[i:i + BATCH_SIZE]
        session.write_transaction(create_relationships, batch)
    session.close()
    driver.close()
pr.print_stats("time")
From the first line of the profiler output, it is evident that the vast majority of the time is spent in to_clock_time.
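(Not part of the original script, but for completeness: the sorted report can also be narrowed to the suspected hotspot using the standard pstats module, e.g.:

import pstats

# Print only the entries whose name matches "to_clock_time".
pstats.Stats(pr).sort_stats("time").print_stats("to_clock_time")

)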
Expected behavior
Having multiple properties of type neo4j.time.DateTime per relationship does not cause a significant overhead when compared to other types such as str, int, or float.
Actual behavior
The function to_clock_time has a severe impact on the running time, to the point that it constitutes a bottleneck, i.e. most of the total running time is spent in to_clock_time.
Suggested Improvement
Since the slow running time is caused by the two for-loops in to_clock_time, I propose precomputing the cumulative days for each year/month, similar to what is already done with the global variables DAYS_IN_YEAR and DAYS_IN_MONTH. More specifically, I added the following two global variables:
# CUM_DAYS_IN_YEAR[y] holds the total number of days in years MIN_YEAR..y.
CUM_DAYS_IN_YEAR = [0]
for year in range(MIN_YEAR, MAX_YEAR + 1):
    CUM_DAYS_IN_YEAR.append(DAYS_IN_YEAR[year] + CUM_DAYS_IN_YEAR[year - 1])

# YEAR_TO_CUM_DAYS_IN_MONTH[y][m] holds the number of days in months 1..m of year y.
YEAR_TO_CUM_DAYS_IN_MONTH = {}
for year in range(MIN_YEAR, MAX_YEAR + 1):
    cum_days_in_month = [0]
    for month in range(1, 13):
        cum_days_in_month.append(DAYS_IN_MONTH[year, month] + cum_days_in_month[month - 1])
    YEAR_TO_CUM_DAYS_IN_MONTH[year] = cum_days_in_month
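As a quick, illustrative sanity check (not part of the patch, and assuming DAYS_IN_YEAR and DAYS_IN_MONTH keep their current dict layout), the tables can be compared against the sums the original loops would compute:

# Illustrative check: the precomputed entries must match direct summation.
year, month = 2021, 12
assert CUM_DAYS_IN_YEAR[year - 1] == sum(DAYS_IN_YEAR[y] for y in range(MIN_YEAR, year))
assert YEAR_TO_CUM_DAYS_IN_MONTH[year][month - 1] == sum(DAYS_IN_MONTH[year, m] for m in range(1, month))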
Subsequently, to_clock_time is changed to:
def to_clock_time(self):
    """Convert to :class:`.ClockTime`.

    :rtype: ClockTime
    """
    # Look up precomputed day counts instead of looping over every year/month.
    total_seconds = (CUM_DAYS_IN_YEAR[self.year - 1]
                     + YEAR_TO_CUM_DAYS_IN_MONTH[self.year][self.month - 1]) * 86400
    total_seconds += 86400 * (self.day - 1)
    seconds, nanoseconds = divmod(self.__time.ticks_ns, NANO_SECONDS)
    return ClockTime(total_seconds + seconds, nanoseconds)
This change improves the worst-case running time (i.e., a date with year == 9999 and month == 12) by a factor of roughly 2671. More importantly, after this improvement, to_clock_time is no longer a performance bottleneck when writing datetime data to the DBMS.
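For reference, the worst case can be timed with a sketch along these lines (the exact date and iteration count are illustrative):

import timeit

from neo4j.time import DateTime

# Year 9999 / month 12 makes the original year and month loops run their
# maximum number of iterations.
dt = DateTime(9999, 12, 31)
print(timeit.timeit(dt.to_clock_time, number=10_000))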
The two global variables require an additional 3.9 MB of memory. If that overhead is a concern, note that you could reduce the memory overhead by about 10.3 MB by converting DAYS_IN_YEAR to a simple list and DAYS_IN_MONTH to a dictionary of lists. Because these lists are relatively short, this will likely also be faster than retrieving the value from a dictionary. I have tested this hypothesis with the year 2022 and found a simple list faster than DAYS_IN_YEAR.
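A sketch of that optional change could look as follows (the names are placeholders, and it assumes MIN_YEAR == 1 so that year values can be used as list indices directly):

# Hypothetical list-based replacements, built here from the existing dicts.
# Index 0 is a padding entry so that e.g. DAYS_IN_YEAR_LIST[2022] works.
DAYS_IN_YEAR_LIST = [0] + [DAYS_IN_YEAR[y] for y in range(MIN_YEAR, MAX_YEAR + 1)]
DAYS_IN_MONTH_LISTS = {
    y: [0] + [DAYS_IN_MONTH[y, m] for m in range(1, 13)]
    for y in range(MIN_YEAR, MAX_YEAR + 1)
}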
TLDR:
- The function to_clock_time is so slow that it becomes a bottleneck when writing moderate amounts of relationships with neo4j.time.DateTime properties.
- I propose precomputing the cumulative days for each year/month.
- The proposed change yields a hundred- to thousandfold decrease in the running time of to_clock_time while requiring an additional 3.9 MB of memory.
- I propose an additional and optional change to reduce the overall memory overhead by 10.3 MB by replacing the global variables DAYS_IN_YEAR and DAYS_IN_MONTH with lists, which in most realistic cases (year <= 2022) also improves running time.