Skip to content

Study definition hacks #252

@wjchulme

Description

@wjchulme

study definition haacks

This issue is for documenting some useful study definition workarounds that allow users to extract information in a way that is not formally built into the existing study definition API.

They are collected here to help users do more with study definitions, and to help users and developers think about how the study definition API might be improved in future.

Create a variable for each code in a codelist

Say we have a codelist and want to return the number of times each code in that codelist was used, then a variable is needed for each code within the codelist. The standard way to create these variables would be as follows:

study = StudyDefinition(

    ...

    count_code1 = patients.with_these_clinical_events(
        codelist(code1, system="snomed"),
        on_or_after="index_date",
        returning = "number_of_matches_in_period",
        return_expectations={
            "incidence": 0.1,
            "int": {"distribution": "normal", "mean": 3, "stddev": 1}
        }
    ),

    count_code2 = patients.with_these_clinical_events(
        codelist(code2, system="snomed"),
        on_or_after="index_date",
        returning = "number_of_matches_in_period",
        return_expectations={
            "incidence": 0.1,
            "int": {"distribution": "normal", "mean": 3, "stddev": 1},
        }
    ),

    ...
	
)

A neater way of doing this is to wrap the variable signature in a function and loop that function over each code in the codelist:

def loop_over_codes(code_list):

    def make_variable(code):
        return {
            f"count_{code}": (
                patients.with_these_clinical_events(
                    codelist([code], system="snomed"),
                    on_or_after="index_date",
                    returning="number_of_matches_in_period",
                    return_expectations={
                         "incidence": 0.1,
                         "int": {"distribution": "normal", "mean": 3, "stddev": 1},
                    },
                )
            )
        }

    variables = {}
    for code in code_list:
        variables.update(make_variable(code))
    return variables

This function is then invoked inside the StudyDefinition call for example as follows:

study = StudyDefinition(

    ...
	
    **loop_over_codes(diabetes_codes),
	
    ...
)

This pattern can be used for any type of variable that uses codelists. For instance to extract the first admission date for a set of diagnoses in a codelist, use the following variable signature function:

def make_variable(code):
    return {
        f"date_of_first_{code}": (
            patients.admitted_to_hospital(
                with_these_primary_diagnoses = codelist([code], system="icd10"),
                on_or_before="index_date",
		find_first_match_in_period=True,
                returning="date_admitted",
		return_expectations={
		    "incidence": 0.1,
                    "date": {"2015-01-01": "index_date"}}
                },
            )
        )
    }

Example code: https://github.com/opensafely/long-covid/blob/master/analysis/study_definition_cohort.py

Extracting a series of consecutive events

Say we wanted to extract the first 5 coding events that occurred after a study start date. We could do this using:

admission_date1 = patients.with_these_clinical_events(
                    codelist,
                    returning="date",
                    on_or_after="index_date",
                    date_format="YYYY-MM-DD",
                    find_first_match_in_period=True,
		    return_expectations = {...}
                ),
				
admission_date2 = patients.with_these_clinical_events(
                    codelist,
                    returning="date",
                    on_or_after="admission_date1 + 1 day",
                    date_format="YYYY-MM-DD",
                    find_first_match_in_period=True,
                    return_expectations = {...}
                ),
				
...

admission_date5 = patients.with_these_clinical_events(
                    codelist,
                    returning="date",
                    on_or_after="admission_date4 + 1 day",
                    date_format="YYYY-MM-DD",
                    find_first_match_in_period=True,
                    return_expectations = {...}
                ),

This is ok for 5 events, but if you need more it becomes quite cumbersome. An alternative is to wrap the variable signature in a function and loop that signature over the number of events required:

def with_these_clinical_events_date_X(name, codelist, index_date, n, return_expectations):
    
    def var_signature(name, codelist, on_or_after, return_expectations):
        return {
            name: patients.with_these_clinical_events(
                    codelist,
                    returning="date",
                    on_or_after=on_or_after,
                    date_format="YYYY-MM-DD",
                    find_first_match_in_period=True,
                    return_expectations=return_expectations
	    ),
        }
    variables = var_signature(f"{name}_1", codelist, index_date, return_expectations)
    for i in range(2, n+1):
        variables.update(var_signature(f"{name}_{i}", codelist, f"{name}_{i-1} + 1 day", return_expectations))
    return variables

This can then be used inside a study definition for example as follows:

study = StudyDefinition(

	...
	
	**with_these_clinical_events_date_X(
		name = "probable_covid_date", 
		codelist = probable_codes, 
		returning="date",
		on_or_after = "index_date",
		n = 5, 
		return_expectations = {
			"date": {"earliest": "2020-05-01", "latest": "2021-06-01"},
			"rate": "uniform",
			"incidence": 0.05,
		},
	),
	
	...
)

This can be used for any variable where you want to return information for a series of consecutive events.

The will create n columns in the dataset. A few things to be aware of:

Better dummy data for categorical variables

For categorical variables with a small number of categories, the standard way of creating dummy data is easy to use, for example:

sex=patients.sex(
        return_expectations={
            "rate": "universal",
            "category": {"ratios": {"M": 0.49, "F": 0.51}},
        }
    )

However, if the number of categories is large, for instance small geographical regions like MSOAs, then it's a pain to have to write out all these by hand. An alternative is to create an external file containing the category levels, import this as a python dict, and pass this dict to the return_expectations argument. For example:

import pandas as pd

STPs = pd.read_csv(
    filepath_or_buffer = './lib/STPs.csv'
)

dict_stp = { stp : 1/len(STPs.index) for stp in STPs['stp_id'].tolist() }

study = StudyDefinition(
 
    ...

    stp = patients.registered_practice_as_of(
        index_date,
        returning="stp_code",
        return_expectations={
            "category": {"ratios": dict_stp},
        },
    ),
    
    ....

)

If the category levels are named with a predictable structure for example (["stage 1", "stage 2", ...], or [100, 200, 300, ...]) then a dict comprehension can be used to create the levels, without having to import them. For example for IMD ranks rounded to the nearest 100:

 imd=patients.address_as_of(
        "index_date",
        returning="index_of_multiple_deprivation",
        round_to_nearest=100,
        return_expectations={
            "category": {"ratios": {c: 1/320 for c in range(100,32100, 100)}}
        }
    )

See this issue for further discussion about how this functionality might be improved in future.

Common variables

For using the same set of variables across different study definitions, see the common variables section of the documentation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    cohort extractorCohort extractor-related documentation issues

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions