Evaluating artificial intelligence

Intelligent systems have undergone major developments since 2016 and are finding more and more applications in our economy. They represent a particularly innovative and probably strategic industrial, commercial, social and societal challenge. Confidence in ethically irreproachable artificial intelligence (AI) systems with known performance is an essential prerequisite for propelling France into this new era. This confidence rests essentially on a rigorous evaluation of these systems.

A rapidly changing context

Autonomous vehicles, conversational agents, domestic, agricultural or industrial robots... Intelligent systems are entering every field and occupy a prime position in many sectors of activity: Industry 4.0, health, defence, transport, ecology, education, customer relations, finance, energy...

France wishes to position itself as soon as possible in order to preserve an advantage in international competition, and to avoid falling behind and being exposed to new dependencies. However, it singularly lacks digital giants (such as Google or Microsoft) capable of easily mobilising skills and capital, a definite handicap in the race for artificial intelligence. These concerns are shared by the highest national authorities, whose initiatives have multiplied in recent months: a status report by institutional actors around the #FranceIA community, an investigation by the Office parlementaire d'évaluation des choix scientifiques et techniques (OPECST), and an in-depth planning mission led by C. Villani, MP for Essonne, whose report was published on 28 March 2018.

Qualification of these future intelligent systems is imperative for the end user (selection, acceptance, assessment of functionality and performance, reliability and safety) as well as for the developer (development and certification). Beyond the effectiveness of the systems themselves, it is necessary to address security issues, which are essential conditions for collective acceptability. Behind the popular fantasy fuelled by science fiction writers or speculation about the apocalyptic consequences of a "technological singularity", it is important to be transparent and continually able to prove the safety and ethical soundness of new systems, and to estimate their legal implications.

Villani Report on Artificial Intelligence

LNE's general and historical positioning, its vocation as a "trusted third party" organisation, its long practical experience in this role and its statutory neutrality make it a legitimate and appropriate player to meet this demand and to promote both the development of AI in France and the acceptability of future systems. The Villani mission report therefore proposes that LNE's responsibilities be extended and that it be recognised as the "competent authority in the evaluation of artificial intelligence systems". LNE will thus be called upon to participate in the major transversal challenges of AI by developing reference systems to structure, guarantee and certify intelligent systems and to enable the development of standards and regulations, thus contributing to the confidence that companies and citizens will place in these new technologies.

Read the report (in French)

Issues in AI Evaluation

Evaluation: a vector for innovation and a decision-making tool

Evaluation is an indispensable element for AI system developers. It allows:

  • to identify the origin of underperformance and guide future developments;
  • to estimate the amount and nature of effort required before commercial launch;
  • to evaluate the effectiveness of investments made to advance the technology;
  • to characterise the scope of use of the system.

Moreover, faced with the multitude of artificial intelligence solutions proposed by a growing number of players, a company wishing to acquire an artificial intelligence system, be it a chatbot for its customer relationship or an automated agricultural robot, must be able to rely on concrete metrics attesting to the system's efficiency and robustness. Evaluation thus encourages pragmatic and reasoned decision-making when choosing an AI.

The challenge: one approach to evaluation

Challenges are multi-year projects that provide a common framework in which teams developing different approaches compete. These campaigns are an essential means of organisation and motivation, sustaining exchanges between the various participants, generating a significant training effect and making it possible to remove scientific or technological obstacles, to improve performance and to support the rise in TRL (Technology Readiness Level) of the systems concerned.

To assess the reliability of competing technical solutions rigorously, comparably and without bias, evaluation campaigns must be organised by a trusted third party competent in metrology applied to these systems. LNE is a suitable candidate for this role. It has business expertise in the evaluation of intelligent systems, including the selection, qualification and annotation of data, the definition of evaluation protocols (ensuring reproducible experiments and repeatable measurements) and metrics, and the analysis of results.

ROSE, a challenge to evaluate agricultural robots

Challenge ROSE

From 2018 to 2022, for example, LNE is responsible, in partnership with IRSTEA, for organising the ROSE Challenge funded by ANR and AFB, as part of the Ecophyto II plan supported by the Ministries of Agriculture and of Ecological and Solidarity Transition. The objective is to evaluate agricultural robots intended for intra-row weeding, in order to limit herbicide use in wide-spaced crops (maize, sunflower) and vegetable crops. LNE and IRSTEA's main missions are to define the operating procedures for the challenge's evaluation campaigns and to compare the participants' solutions.

ROSE Challenge website (in French)

Examples of AI System Evaluation

Automatic speech transcription

Transcription systems convert an audio file into a text file. The task can be more or less complex depending on the number of speakers involved: it is easier to transcribe a speech than a group conversation. The most common measure of a transcription system's performance is the word error rate (WER), which penalises words incorrectly recognised, omitted or unnecessarily added compared with a reference transcription produced by a human annotator.
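To illustrate, the WER described above can be computed as the word-level edit distance (substitutions, deletions, insertions) between a system's output and the reference, divided by the number of reference words. The sketch below is a minimal illustration, not LNE's actual evaluation tooling; the function name and example sentences are invented for the purpose of the example:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, via dynamic programming:
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word ("cat" -> "hat") out of a 6-word reference: WER = 1/6
print(word_error_rate("the cat sat on the mat", "the hat sat on the mat"))
```

Real evaluation toolkits add word alignment reports, normalisation (case, punctuation, disfluencies) and confidence intervals on top of this core computation.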

Evaluation tool for automatic speech processing systems

Since 2008, LNE has been conducting research projects on the evaluation of automatic information processing systems and has forged public and private partnerships in this field. Whether for civil applications in speech comprehension and transcription, translation, diarization and named-entity detection, or dual applications such as forensic voice comparison, translation, handwriting recognition and person recognition in television documents, LNE has acquired in-depth expertise covering the various areas of the field.
Thanks to this expertise, it developed two complementary tools in 2017 to evaluate automatic speech transcription systems: Datomatic, which organises data to create a reference database, and Evalomatic, which tests the relevance of software by comparing its results with Datomatic's references. Gathered in a single interface, they are intended to be freely distributed to industry and researchers.

Evaluation of robots: tests in real and/or virtual environments

A robot is a system with at least partial autonomy of action, driven by artificial intelligence; the specificity of a robot compared with other mechatronic systems is precisely its use of AI. Its evaluation is therefore based on the same principles as for artificial intelligence: its aptitude is to be measured mainly at the functional level and in its adaptability, which is specific to the notion of intelligence. It is therefore not only a question of quantifying its functions and performance, but also of validating and characterising the operating environments in which it will behave robustly.

Evaluation in a virtual environment

To determine the robot's operating perimeter, tests must be carried out in controlled environments. Physical tests in climatic chambers, thermal chambers, salt-spray and sun-exposure test rigs make it possible, for example, to analyse the influence of environmental conditions on the performance of intelligent systems. Vibration, shock and constant-acceleration tests should also be performed to evaluate the performance of systems under extreme conditions and to determine operating boundary conditions accurately.

Robot evaluation
Evaluation of the LAAS HRP-2 robot in a climatic chamber

As part of the European Robocom++ project, which aims to organise the service robotics sector and to set up a consortium to respond to the next Flagship call, LNE has carried out tests on a humanoid robot from LAAS in order to test the robustness of its performance under changing climatic conditions.

For the evaluation of autonomous robots moving in open and changing environments, given the almost infinite number of configurations they may encounter, LNE participates in the development of virtual test environments allowing validation by simulation. As part of the "New Industrial France" programme (ecological mobility sector), whose main objectives are to develop tools to qualify the performance and safety of autonomous vehicles through virtual tests, LNE is involved in the SVA project (simulation for the safety of autonomous vehicles). In particular, it is working on methods and tools to characterise on-board sensors and to evaluate the performance of the algorithms underlying decision-making in autonomous vehicles. This work is likely to respond to the political will to give the European automotive industry sustainable and safe development, as well as to the new needs of design offices seeking to develop increasingly robust and efficient systems.