Evidence generation plan for artificial intelligence (AI) technologies for assessing and triaging skin lesions referred to the urgent suspected skin cancer pathway

3 Approach to research

3.1 Evidence gaps and ongoing studies

Table 1 summarises the evidence gaps and ongoing studies that might address them. Information about evidence status is derived from the external assessment group's report; evidence not meeting the scope and inclusion criteria is not included. Table 1 shows the evidence available to the committee when the guidance was published. No ongoing studies were identified in the external assessment group's report.

Table 1 Evidence gaps and ongoing studies

Evidence gap: How accurate DERM used in teledermatology services is at detecting cancer and non-cancer skin lesions compared with teledermatology services alone
Evidence status, Deep Ensemble for Recognition of Malignancy (DERM): Limited evidence

Evidence gap: Accuracy of DERM in people with black or brown skin
Evidence status, DERM: Limited evidence

Evidence gap: The effect of using AI technologies in teledermatology services on the number of face-to-face dermatology appointments compared with a well-established teledermatology service alone
Evidence status, DERM: Limited evidence

3.2 Data sources

NICE's real-world evidence framework provides detailed guidance on assessing the suitability of a real-world data source to answer a specific research question.

Some data will be generated through the technology itself, such as the number of referrals assessed by the technology and the diagnostic outcomes it predicted. These data can be integrated with other collected data.

The NHS England Secure Data Environment service could potentially support this research. This platform provides access to high-standard NHS health and social care data that can be used for research and analysis. Secure data environments are data storage and access platforms that bring together many sources of data, such as from primary and secondary care, to enable research and analysis. Local or regional data collections, such as NHS England's sub-national secure data environments (see the blog on 'Investing in the future of health research') and databases like NHS England's National Cancer Registration and Analysis Service, already measure outcomes specified in this plan and could be used to collect data to address the evidence gaps. The sub-national secure data environments are designed to be agile and can be modified to suit the needs of new projects.

Datasets drawn from general-practice electronic health records with broad coverage, such as the Clinical Practice Research Datalink and The Health Improvement Network, could be used to provide individual patient-level data. These could provide useful information on referrals, diagnostic outcomes and patient characteristics.

The quality and coverage of real-world data collections are of key importance when they are used in research. Active monitoring and follow-up through a central coordinating point is an effective and viable approach to ensuring good-quality data with broad coverage.

3.3 Evidence collection plan

Diagnostic accuracy study

A diagnostic accuracy study is used to assess the agreement between 2 or more methods. The study would assess the agreement between the diagnosis decision reached for each included case of suspected cancer by:

  • AI technology alone (intervention)

  • teledermatology unassisted by AI technology (comparator)

  • a reference standard.

Potential approaches that can be taken to collect the reference standard include:

  • A panel of experts or expert opinion: Ideally a consensus assessment by an expert panel or, if resources are limited, assessment by an experienced healthcare professional unassisted by AI technology, but ideally with access to clinical information that would be available at the time of intended AI use. This is the ideal approach for a comprehensive assessment of both the AI technology and teledermatology.

  • Follow up: Monitoring of clinical progression to identify and assess any false negatives or false positives, ensuring that the accuracy of the initial diagnosis can be confirmed or corrected over time. This approach can be affected by differential verification bias and may require a considerable follow-up period.

Representative image sets would be generated prospectively. These would then be processed by the AI technology within teledermatology services. Cases that the technology was unable to assess would be recorded. It is important to consider variation in skin colour as part of the study design, for example, ensuring a sufficient sample size to assess different skin colours (ideally measured using skin spectrophotometry).

A comparison between the AI technology alone (intervention), the teledermatology unassisted by AI (comparator) and the reference standard would allow an assessment of the diagnostic accuracy of the AI technology compared with teledermatology services. Cases with disagreements in the diagnosis between each method could be further explored to identify common characteristics, and reasons for disagreements could be considered.
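As an illustration only (not part of this plan), the case-level classifications collected in such a study could be summarised into sensitivity and specificity against the reference standard. The function name, labels and counts below are hypothetical:

```python
# Illustrative sketch: summarising an index test's classifications
# against the reference standard. All names and data are hypothetical.

def accuracy_summary(index_calls, reference_calls, positive="cancer"):
    """Return the 2x2 counts, sensitivity and specificity of an index
    test (e.g. the AI technology alone) versus the reference standard."""
    tp = fp = tn = fn = 0
    for test, ref in zip(index_calls, reference_calls):
        if ref == positive:
            if test == positive:
                tp += 1
            else:
                fn += 1
        else:
            if test == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "sensitivity": sensitivity, "specificity": specificity}

# Hypothetical case-level classifications for six lesions
ai        = ["cancer", "cancer", "benign", "benign", "cancer", "benign"]
reference = ["cancer", "benign", "benign", "benign", "cancer", "cancer"]

summary = accuracy_summary(ai, reference)
```

The same function could be run with the unassisted teledermatology classifications as the index calls, and the discordant cases (the false positives and false negatives) extracted for the review of disagreements described above.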

For pragmatism, this study could be done as part of the 'before' phase of the before-and-after study. Care would need to be taken to control for confounders and to ensure appropriate blinding. This approach could reduce the time needed to collect evidence and would ensure that the accuracy values are representative of the setting.

Real-world before-and-after implementation study

The results of the accuracy study should inform the population for a before-and-after implementation study, so that the technology is implemented in populations in which it has been shown to be effective.

A before-and-after study design allows for comparisons when there is considerable variation between services in the standards and mode of delivery of teledermatology. It also allows assessment of implementation costs, changes in referral rates, and the proportion of cases that are eligible for assessment by the AI technology.

Before the AI technology is implemented in a teledermatology service, data should be collected on the:

  • total number of referrals to that service

  • number of those referrals that resulted in a face-to-face appointment with a dermatologist

  • number of biopsies

  • number of referrals that resulted in a cancer lesion diagnosis.

If teledermatology is already established, the number of lesions that are not eligible for assessment by this service should also be recorded, along with the reasons why. The AI technology should then be implemented into the service and all implementation and training costs should be collected. After leaving a period of time to account for learning effects, the outcomes on referral rates, appointments and biopsies should be collected again in a period after implementation. The number of lesions that are not eligible for assessment by the AI technology, and the reasons why, should also be collected in the after-implementation period.

In a phased approach, a comparison between the AI technology's diagnosis and a dermatologist's opinion and, ideally, the final clinical outcome can help to predict the likely impact of autonomous use of the AI technology before moving to, and testing, fully autonomous use.

This study could be done at a single centre with an established teledermatology service or, ideally, replicated across multiple centres. This could show how the AI technology can be implemented across a range of services, representative of the variety in the NHS. Outcomes may reflect other changes that occur over time in the population, unrelated to the interventions. Additional robustness can be achieved by collecting data in a centre that has not implemented an AI technology but is as similar as possible (in terms of clinical practice and patient characteristics) to a service where an AI technology is being used, or, ideally, by using a stepped-wedge design. This could help control for changes in referral rates over time that might have occurred anyway.
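As a hypothetical sketch (the counts below are invented, not taken from any study), the core before-and-after comparison amounts to comparing the proportion of urgent referrals that led to a face-to-face appointment in each period:

```python
# Hypothetical before-and-after counts for one teledermatology service;
# real values would come from the data collection described in this plan.
before = {"referrals": 1200, "face_to_face": 720}
after = {"referrals": 1250, "face_to_face": 500}

def face_to_face_rate(referrals, face_to_face):
    """Proportion of urgent referrals that led to a face-to-face
    dermatology appointment in a given period."""
    return face_to_face / referrals

rate_before = face_to_face_rate(**before)
rate_after = face_to_face_rate(**after)

# Absolute change in the face-to-face rate after implementation
absolute_change = rate_before - rate_after
```

In practice, any observed change would also need adjusting for secular trends in referrals, which is one reason for the comparator centre or stepped-wedge design mentioned above.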

3.4 Data to be collected

The following information has been identified for collection:

Diagnostic accuracy study

  • Classifications made using teledermatology unassisted by AI technology, and by AI technology alone, and by the reference standard.

  • Proportion of lesions discharged from the urgent suspected skin cancer pathway onto the non-urgent pathway.

  • Information on lesions that are not eligible for assessment by teledermatology and not eligible for assessment by AI technology, and the reasons.

  • Whether or not the AI technology was able to process each image.

  • Performance of the AI technology and teledermatology compared with the reference standard.

  • Accuracy in people with black or brown skin (ideally measured using skin spectrophotometry).

  • Cases of diagnostic disagreement and the likely reason for disagreement (given by reference standard).

Real-world before-and-after implementation study

  • Patient information, for example, age, sex and ethnicity.

  • Number and proportion of suspected skin cancer cases that are not eligible for teledermatology before implementation of the technology and the reasons why.

  • Total number of referrals through the urgent suspected skin cancer pathway.

  • Number and proportion of referrals that had appointments with a dermatologist.

  • Number and proportion of appointments with a dermatologist that resulted in a biopsy or diagnosis of a cancer lesion.

  • For the after-implementation period, the number and proportion of suspected cancer cases that are not eligible for assessment by the technology.

  • The number and proportion of suspected cancer cases that are judged to be 'indeterminate' or cannot be processed by the AI technology (technical failure and rejection rate).
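A minimal sketch of how the last two measures might be computed from hypothetical counts (the definition of the rejection rate used here, combining technical failures with 'indeterminate' results, is an assumption rather than something specified in this plan):

```python
# Hypothetical counts for one after-implementation period
cases_submitted = 1000   # suspected cancer cases sent to the AI technology
technical_failures = 30  # images the technology could not process at all
indeterminate = 70       # processed but judged 'indeterminate'

# Share of cases the technology failed to process (technical failure rate)
technical_failure_rate = technical_failures / cases_submitted

# Share of cases without a usable AI classification (rejection rate,
# assumed here to combine technical failures and indeterminate results)
rejection_rate = (technical_failures + indeterminate) / cases_submitted
```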

Data collection should follow a predefined protocol and quality assurance processes should be put in place to ensure the integrity and consistency of data collection. See NICE's real-world evidence framework, which provides guidance on the planning, conduct and reporting of real-world evidence studies.

Information about the technology

Information about how the technology was developed, the update version tested, and how the effect of future updates will be monitored should also be reported. See the NICE evidence standards framework for digital health technologies.

Evidence generation period

This will be 3 years to allow for setting up and implementing the AI technologies, and for data collection, analysis and reporting.
