
to_clock_time Bottleneck When Writing Relationships With neo4j.time.DateTime Properties to Database  #772

@mamei16

Description

  • Neo4j Mode: Single instance
  • Python version: 3.10
  • Driver version: python driver 4.4.5 (Also applies to current state of repository)
  • Operating system: Windows 10

TLDR at the bottom.

Steps to reproduce

The following is a minimal working example that showcases the issue: it merges the same relationship between the same two nodes over and over again, where the relationship has 3 properties of type neo4j.time.DateTime. Note that the Cypher query uses the APOC procedures apoc.merge.node and apoc.merge.relationship, so APOC must be installed.

from datetime import datetime
import cProfile

from neo4j import GraphDatabase
from neo4j.time import DateTime 


property_date = datetime.strptime("2021-12-01", "%Y-%m-%d")
neo4j_date = DateTime.from_native(property_date)

NUM_RELATIONSHIPS = 100000
BATCH_SIZE = 1000

DBMS_IP = "127.0.0.1"
DBMS_PORT = 7687
DBMS_AUTH_TUP = ("neo4j", "admin")

relationships = [{"start_node_labels":  ["Node1"], "start_node_properties": {"prop": "yes"},
                  "end_node_labels": ["Node2"], "end_node_properties": {"prop": "yes"},
                  "label": "relationship", 
                  "properties": {f"date{i}": neo4j_date for i in range(3)}} for _ in range(NUM_RELATIONSHIPS)]
                  
def create_relationships(tx, batch):
    tx.run("""WITH $batch AS batch
              UNWIND batch AS ind
              CALL apoc.merge.node(ind.start_node_labels,
                                   ind.start_node_properties)
              YIELD node AS startNode
              CALL apoc.merge.node(ind.end_node_labels,
                                   ind.end_node_properties)
              YIELD node AS endNode
              CALL apoc.merge.relationship(startNode, ind.label,
                                           ind.properties, {}, endNode, {})
              YIELD rel RETURN null""", batch=batch).consume()

connect_string = f"bolt://{DBMS_IP}:{DBMS_PORT}"
driver = GraphDatabase.driver(connect_string, auth=DBMS_AUTH_TUP)
session = driver.session()
                        
with cProfile.Profile() as pr:
    for i in range(0, len(relationships), BATCH_SIZE):
        batch = relationships[i:i + BATCH_SIZE]
        session.write_transaction(create_relationships, batch)
        
session.close()
driver.close()

pr.print_stats("time")

From the first line of the profiler output (print_stats("time") sorts functions by internal time, so the most expensive one is listed first), it is evident that the vast majority of the time is spent in to_clock_time.

Expected behavior

Having multiple properties of type neo4j.time.DateTime per relationship should not cause significant overhead compared to other property types such as str, int or float.

Actual behavior

The function to_clock_time has such a severe impact on the running time that it constitutes a bottleneck, i.e. most of the total running time is spent inside to_clock_time.
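For context, to_clock_time accumulates the number of whole days by looping over every year before self.year and every month before self.month. The following is a rough, illustrative sketch of that loop structure (not the driver's verbatim code), written against the existing module-level DAYS_IN_YEAR and DAYS_IN_MONTH lookup tables in neo4j.time; its cost grows linearly with the year value:

# Illustrative sketch of the loop-based day accumulation behind the slowdown
# (not the driver's exact code). DAYS_IN_YEAR and DAYS_IN_MONTH are the
# existing module-level lookup tables; the first loop runs up to ~9998 times
# for dates near year 9999.
def days_before_date_sketch(year, month, day):
    total_days = 0
    for y in range(1, year):       # one iteration per preceding year
        total_days += DAYS_IN_YEAR[y]
    for m in range(1, month):      # one iteration per preceding month
        total_days += DAYS_IN_MONTH[year, m]
    return total_days + (day - 1)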

Suggested Improvement

Since the slow running time is caused by the two for-loops in to_clock_time, I propose precomputing the cumulative days for each year/month, similar to what is already done with the global variables DAYS_IN_YEAR and DAYS_IN_MONTH. More specifically, I added the following two global variables:

CUM_DAYS_IN_YEAR = [0]
for year in range(MIN_YEAR, MAX_YEAR + 1):
    CUM_DAYS_IN_YEAR.append(DAYS_IN_YEAR[year] + CUM_DAYS_IN_YEAR[year-1])

YEAR_TO_CUM_DAYS_IN_MONTH = {}
for year in range(MIN_YEAR, MAX_YEAR + 1):
    cum_days_in_month = [0]
    for month in range(1, 13):
        cum_days_in_month.append(DAYS_IN_MONTH[year, month] + cum_days_in_month[month-1])
    YEAR_TO_CUM_DAYS_IN_MONTH[year] = cum_days_in_month

Subsequently, to_clock_time is changed to:

def to_clock_time(self):
    """Convert to :class:`.ClockTime`.
    :rtype: ClockTime
    """
    total_seconds = (CUM_DAYS_IN_YEAR[self.year - 1]
                     + YEAR_TO_CUM_DAYS_IN_MONTH[self.year][self.month - 1]) * 86400
    total_seconds += 86400 * (self.day - 1)
    seconds, nanoseconds = divmod(self.__time.ticks_ns, NANO_SECONDS)
    return ClockTime(total_seconds + seconds, nanoseconds)
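As a quick, illustrative sanity check of the precomputed tables (assuming the driver's tables follow the proleptic Gregorian calendar, i.e. the same calendar as Python's datetime.date), the whole-day count derived from them can be compared against date.toordinal():

from datetime import date
from random import randint

# Illustrative sanity check (not part of the proposed patch): the number of
# whole days before (year, month, day) derived from the cumulative tables
# should equal date.toordinal() - 1, assuming a proleptic Gregorian calendar.
for _ in range(1000):
    year = randint(MIN_YEAR, MAX_YEAR)
    month = randint(1, 12)
    day = randint(1, DAYS_IN_MONTH[year, month])
    days_before = (CUM_DAYS_IN_YEAR[year - 1]
                   + YEAR_TO_CUM_DAYS_IN_MONTH[year][month - 1]
                   + (day - 1))
    assert days_before == date(year, month, day).toordinal() - 1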

This change improves the worst-case running time (i.e., a date with year == 9999 and month == 12) by a factor of roughly 2671. More importantly, after this improvement, to_clock_time is no longer a performance bottleneck when writing datetime data to the DBMS.
The two global variables require an additional 3.9 MB of memory. If that is a concern, note that the memory overhead could be reduced by about 10.3 MB by converting DAYS_IN_YEAR to a simple list and DAYS_IN_MONTH to a dictionary of lists (a rough sketch of this conversion follows below). Because these lists are relatively short, lookups will likely also be faster than retrieving the value from a dictionary; I tested this hypothesis with the year 2022 and found a simple list faster than DAYS_IN_YEAR.
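A rough sketch of that optional conversion (the names below are placeholders for illustration; in the driver, DAYS_IN_YEAR and DAYS_IN_MONTH themselves would simply be built in this shape instead of as dicts):

# Illustrative sketch of the optional memory/speed tweak: list-based lookup
# tables instead of dicts. Index 0 is a dummy entry so that year and month
# values can be used directly as indices.
DAYS_IN_YEAR_LIST = [0] + [DAYS_IN_YEAR[year]
                           for year in range(MIN_YEAR, MAX_YEAR + 1)]
DAYS_IN_MONTH_BY_YEAR = {
    year: [0] + [DAYS_IN_MONTH[year, month] for month in range(1, 13)]
    for year in range(MIN_YEAR, MAX_YEAR + 1)
}

# Lookups then read e.g. DAYS_IN_YEAR_LIST[2022] and DAYS_IN_MONTH_BY_YEAR[2022][3].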

TLDR:

  • The function to_clock_time is so slow that it becomes a bottleneck when writing even moderate numbers of relationships with neo4j.time.DateTime properties.
  • I propose precomputing the cumulative days for each year/month.
  • The proposed change reduces the running time of to_clock_time by a factor of several hundred to several thousand while requiring an additional 3.9 MB of memory.
  • I propose an additional, optional change that reduces the overall memory overhead by 10.3 MB by replacing the global variables DAYS_IN_YEAR and DAYS_IN_MONTH with lists, which in most realistic cases (year <= 2022) also improves running time.
