Machine Learning: Recently Published Documents

An explainable machine learning model for identifying geographical origins of sea cucumber Apostichopus japonicus based on multi-element profile

A comparison of machine learning- and regression-based models for predicting ductility ratio of RC beam-column joints

Alexa, is this a historical record?

Digital transformation in government has brought an increase in the scale, variety, and complexity of records and greater levels of disorganised data. Current practices for selecting records for transfer to The National Archives (TNA) were developed to deal with paper records and are struggling to deal with this shift. This article examines the background to the problem and outlines a project that TNA undertook to research the feasibility of using commercially available artificial intelligence tools to aid selection. The project AI for Selection evaluated a range of commercial solutions varying from off-the-shelf products to cloud-hosted machine learning platforms, as well as a benchmarking tool developed in-house. Suitability of tools depended on several factors, including requirements and skills of transferring bodies as well as the tools’ usability and configurability. This article also explores questions around trust and explainability of decisions made when using AI for sensitive tasks such as selection.

Automated Text Classification of Maintenance Data of Higher Education Buildings Using Text Mining and Machine Learning Techniques

Data-driven analysis and machine learning for energy prediction in distributed photovoltaic generation plants: a case study in Queensland, Australia

Modeling nutrient removal by membrane bioreactor at a sewage treatment plant using machine learning models

Big five personality prediction based in Indonesian tweets using machine learning methods

The popularity of social media has drawn the attention of researchers who have conducted cross-disciplinary studies examining the relationship between personality traits and behavior on social media. Most current work focuses on personality prediction analysis of English texts, but Indonesian has received scant attention. Therefore, this research aims to predict users’ personalities based on Indonesian text from social media using machine learning techniques. This paper evaluates several machine learning techniques, including naive Bayes (NB), K-nearest neighbors (KNN), and support vector machine (SVM), based on semantic features including emotion, sentiment, and publicly available Twitter profile information. We predict personality based on the Big Five personality model, the most appropriate model for predicting user personality in social media. We examine the relationships between the semantic features and the Big Five personality dimensions. The experimental results indicate that the Big Five personality dimensions exhibit distinct emotional, sentimental, and social characteristics and that SVM outperformed NB and KNN for Indonesian. In addition, we observe several terms in Indonesian that specifically refer to each personality type, each of which has distinct emotional, sentimental, and social features.

Compressive strength of concrete with recycled aggregate: a machine learning-based evaluation

Temperature prediction of flat steel box girders of long-span bridges utilizing in situ environmental parameters and machine learning

Computer-assisted cohort identification in practice

The standard approach to expert-in-the-loop machine learning is active learning, where, repeatedly, an expert is asked to annotate one or more records and the machine finds a classifier that respects all annotations made up to that point. We propose an alternative approach, IQRef, in which the expert iteratively designs a classifier and the machine helps him or her determine how well it is performing and, importantly, when to stop, by reporting statistics on a fixed, hold-out sample of annotated records. We justify our approach based on prior work giving a theoretical model of how to re-use hold-out data. We compare the two approaches in the context of identifying a cohort of electronic health records (EHRs) and examine their strengths and weaknesses through a case study arising from an optometric research problem. We conclude that the two approaches are complementary, and we recommend that they be employed in conjunction to address the problem of cohort identification in health research.

Trending Research

World Model on Million-Length Video and Language with RingAttention

This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.

UFO: A UI-Focused Agent for Windows OS Interaction

microsoft/UFO • 8 Feb 2024

We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision.

YOLO-World: Real-Time Open-Vocabulary Object Detection

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools.

DoRA: Weight-Decomposed Low-Rank Adaptation

By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead.

Generative Representational Instruction Tuning

Notably, we find that GRIT matches training on only generative or embedding data, thus we can unify both at no performance loss.

Scalable Diffusion Models with Transformers

We explore a new class of diffusion models based on the transformer architecture.

GraphCast: Learning skillful medium-range global weather forecasting

Global medium-range weather forecasting is critical to decision-making across many social and economic domains.

BitDelta: Your Fine-Tune May Only Be Worth One Bit

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.

Revisiting Feature Prediction for Learning Visual Representations from Video

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision.

Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

Over the past years, foundation models have caused a paradigm shift in machine learning due to their unprecedented capabilities for zero-shot and few-shot generalization.

Machine Learning and Deep Learning

Abstract: Today, intelligent systems that offer artificial intelligence capabilities often rely on machine learning. Machine learning describes the capacity of systems to learn from problem-specific training data to automate the process of analytical model building and solve associated tasks. Deep learning is a machine learning concept based on artificial neural networks. For many applications, deep learning models outperform shallow machine learning models and traditional data analysis approaches. In this article, we summarize the fundamentals of machine learning and deep learning to generate a broader understanding of the methodical underpinning of current intelligent systems. In particular, we provide a conceptual distinction between relevant terms and concepts, explain the process of automated analytical model building through machine learning and deep learning, and discuss the challenges that arise when implementing such intelligent systems in the field of electronic markets and networked business. These naturally go beyond technological aspects and highlight issues in human-machine interaction and artificial intelligence servitization.


Machine Learning: Algorithms, Real-World Applications and Research Directions

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122, Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chattogram 4349, Bangladesh

In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), particularly machine learning (ML), is the key. Various types of machine learning algorithms, such as supervised, unsupervised, semi-supervised, and reinforcement learning, exist in the area. Besides, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, this study’s key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for academia and industry professionals as well as for decision-makers in various real-world situations and application areas, particularly from the technical point of view.

Introduction

We live in the age of data, where everything around us is connected to a data source and everything in our lives is digitally recorded [ 21 , 103 ]. For instance, the current electronic world has a wealth of various kinds of data, such as Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. The data can be structured, semi-structured, or unstructured, discussed briefly in Sect. “ Types of Real-World Data and Machine Learning Techniques ”, and it is increasing day by day. Insights extracted from these data can be used to build various intelligent applications in the relevant domains. For instance, to build a data-driven automated and intelligent cybersecurity system, the relevant cybersecurity data can be used [ 105 ]; to build personalized context-aware smart mobile applications, the relevant mobile data can be used [ 103 ], and so on. Thus, data management tools and techniques capable of extracting insights or useful knowledge from data in a timely and intelligent way, on which real-world applications are based, are urgently needed.

Artificial intelligence (AI), particularly machine learning (ML), has grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [ 95 ]. ML usually provides systems with the ability to learn and improve from experience automatically without being explicitly programmed and is generally regarded as one of the most popular technologies of the fourth industrial revolution (4IR or Industry 4.0) [ 103 , 105 ]. “Industry 4.0” [ 114 ] is typically the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and to develop the corresponding real-world applications, machine learning algorithms are the key. The learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [ 75 ], discussed briefly in Sect. “ Types of Real-World Data and Machine Learning Techniques ”. The popularity of these approaches to learning is increasing day by day, as shown in Fig. 1, based on data collected from Google Trends [ 4 ] over the last five years. The x-axis of the figure indicates the specific dates, and the y-axis shows the corresponding popularity score in the range of 0 (minimum) to 100 (maximum). According to Fig. 1, the popularity scores for these learning types were low in 2015 and have been increasing since. These statistics motivate us to study machine learning in this paper, which can play an important role in real-world applications through Industry 4.0 automation.

Fig. 1: The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) in a range of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp information and the y-axis represents the corresponding score.

In general, the effectiveness and the efficiency of a machine learning solution depend on the nature and characteristics of the data and the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [ 41 , 125 ]. Besides, deep learning, which originated from the artificial neural network and is part of a wider family of machine learning approaches, can be used to intelligently analyze data [ 96 ]. Thus, selecting a proper learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that different learning algorithms serve different purposes, and even the outcomes of different learning algorithms in the same category may vary depending on the data characteristics [ 106 ]. Thus, it is important to understand the principles of various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, which are explained briefly in Sect. “ Applications of Machine Learning ”.

Based on the importance and potential of “Machine Learning” to analyze the data mentioned above, in this paper we provide a comprehensive view of various types of machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, the key contribution of this study is explaining the principles and potential of different machine learning techniques, and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for academia and industry professionals who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

  • To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.
  • To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.
  • To discuss the applicability of machine learning-based solutions in various real-world application domains.
  • To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain different machine learning algorithms in the subsequent section followed by which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered the key to constructing a machine learning model or data-driven real-world system [ 103 , 105 ]. Data can be of various forms, such as structured, semi-structured, or unstructured [ 41 , 72 ]. Besides, “metadata” is another type that typically represents data about the data. In the following, we briefly discuss these types of data.

  • Structured: Structured data have a well-defined structure and conform to a data model following a standard order; they are highly organized and easily accessed and used by an entity or a computer program. Structured data are typically stored in well-defined schemas such as relational databases, i.e., in a tabular format. For instance, names, dates, addresses, credit card numbers, stock information, geolocation, etc. are examples of structured data.
  • Unstructured: Unstructured data, on the other hand, have no pre-defined format or organization, making them much more difficult to capture, process, and analyze; they mostly consist of text and multimedia material. For example, sensor data, emails, blog entries, wikis, word-processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.
  • Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they do have certain organizational properties that make them easier to analyze. HTML, XML, JSON documents, NoSQL databases, etc. are some examples of semi-structured data.
  • Metadata: Metadata are not a normal form of data, but “data about data”. The primary difference between “data” and “metadata” is that data are simply the content that can classify, measure, or document something about an organization’s data properties, whereas metadata describes the relevant information about those data, giving them more significance for data users. Basic examples of a document’s metadata are the author, file size, the date the document was generated, and keywords that define the document.

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 119 ], UNSW-NB15 [ 76 ], ISCX’12 [ 1 ], CIC-DDoS2019 [ 2 ], Bot-IoT [ 59 ], etc., smartphone datasets such as phone call logs [ 84 , 101 ], SMS logs [ 29 ], mobile application usage logs [ 117 , 137 ], and mobile phone notification logs [ 73 ], IoT data [ 16 , 57 , 62 ], agriculture and e-commerce data [ 120 , 138 ], health data such as heart disease [ 92 ], diabetes mellitus [ 83 , 134 ], and COVID-19 [ 43 , 74 ], and many more in various application domains. The data can be of the different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain and to extract insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, as discussed in the following.

Types of Machine Learning Techniques

Machine Learning algorithms are mainly divided into four categories: Supervised learning, Unsupervised learning, Semi-supervised learning, and Reinforcement learning [ 75 ], as shown in Fig. 2. In the following, we briefly discuss each type of learning technique with the scope of their applicability to solve real-world problems.

Fig. 2: Various types of machine learning techniques.

  • Supervised: Supervised learning is typically the task of machine learning to learn a function that maps an input to an output based on sample input-output pairs [ 41 ]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs [ 105 ], i.e., a task-driven approach . The most common supervised tasks are “classification” that separates the data, and “regression” that fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.
  • Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [ 41 ]. This is widely used for extracting generative features, identifying meaningful trends and structures, groupings in results, and exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.
  • Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [ 41 , 105 ]. Thus, it falls between learning “without supervision” and learning “with supervision”. In the real world, labeled data could be rare in several contexts, and unlabeled data are numerous, where semi-supervised learning is useful [ 75 ]. The ultimate goal of a semi-supervised learning model is to provide a better outcome for prediction than that produced using the labeled data alone from the model. Some application areas where semi-supervised learning is used include machine translation, fraud detection, labeling data and text classification.
  • Reinforcement: Reinforcement learning is a type of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [ 52 ], i.e., an environment-driven approach . This type of learning is based on reward or penalty, and its ultimate goal is to use insights obtained from the environment to take action that increases the reward or minimizes the risk [ 75 ]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier and the target outcome. In Table 1, we summarize various types of machine learning techniques with examples, and a short code sketch below illustrates the supervised/unsupervised contrast. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Table 1: Various types of machine learning techniques with examples
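To make the task-driven versus data-driven distinction concrete, the following minimal sketch (not from the paper) fits a supervised classifier and an unsupervised clusterer on the same toy data using scikit-learn; the dataset, the choice of logistic regression and k-means, and all hyperparameter values are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy data: 200 two-dimensional points drawn around 3 centers.
X, y = make_blobs(n_samples=200, centers=3, random_state=42)

# Supervised (task-driven): learn a mapping from inputs X to the known labels y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised predictions:", clf.predict(X[:5]))

# Unsupervised (data-driven): group the same inputs without ever seeing y.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("unsupervised cluster ids:", km.labels_[:5])
```

The essential difference is visible at fit time: the classifier consumes the labels y, while the clusterer never sees them.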

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms that include classification analysis, regression analysis, data clustering, association rule learning, feature engineering for dimensionality reduction, as well as deep learning methods. A general structure of a machine learning-based predictive model has been shown in Fig. 3, where the model is trained from historical data in phase 1 and the outcome is generated in phase 2 for the new test data.

Fig. 3: A general structure of a machine learning-based predictive model considering both the training and testing phases.

Classification Analysis

Classification is regarded as a supervised learning method in machine learning and refers to a problem of predictive modeling, where a class label is predicted for a given example [ 41 ]. Mathematically, it maps a function ( f ) from input variables ( X ) to output variables ( Y ) as targets, labels, or categories. It can be carried out on structured or unstructured data to predict the class of given data points. For example, spam detection, with the classes “spam” and “not spam”, in email service providers is a classification problem. In the following, we summarize the common classification problems.

  • Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.
  • Multiclass classification: Traditionally, this refers to classification tasks having more than two class labels [ 41 ]. Unlike binary classification tasks, multiclass classification does not have the principle of normal and abnormal outcomes. Instead, examples are classified as belonging to one class among a range of specified classes. For example, classifying the various types of network attacks in the NSL-KDD [ 119 ] dataset is a multiclass classification task, where the attack categories are classified into four class labels: DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.
  • Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, where the classes involved in the problem are hierarchically structured, and each example may simultaneously belong to more than one class in each hierarchical level, e.g., multi-level text classification. For instance, Google News can be presented under the categories of a “city name”, “technology”, or “latest news”, etc. Multi-label classification includes advanced machine learning algorithms that support predicting various mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ]. The three settings are contrasted in the code sketch after this list.
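The following hedged sketch, again with scikit-learn, shows how the three problem types differ mainly in the shape of the label array; the tiny feature matrix, the labels, and the one-vs-rest strategy for the multi-label case are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[0.1, 1.0], [0.9, 0.2], [0.4, 0.8], [0.7, 0.3]])

# Binary: one label per example, two classes (e.g., "spam" vs. "not spam").
y_binary = np.array([0, 1, 0, 1])
LogisticRegression().fit(X, y_binary)

# Multiclass: one label per example, more than two classes
# (e.g., DoS / U2R / R2L / Probing in NSL-KDD).
y_multiclass = np.array([0, 1, 2, 1])
LogisticRegression().fit(X, y_multiclass)

# Multi-label: each example may carry several labels at once, encoded as a
# binary indicator matrix with one column per label.
Y_multilabel = np.array([[1, 0, 1],
                         [0, 1, 0],
                         [1, 1, 0],
                         [0, 0, 1]])
OneVsRestClassifier(LogisticRegression()).fit(X, Y_multilabel)
print("all three classifiers fitted; only the label shapes differ")
```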

Many classification algorithms have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

  • Naive Bayes (NB): The naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between each pair of features [ 51 ]. It works well and can be used for both binary and multiclass categories in many real-world situations, such as document or text classification, spam filtering, etc. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [ 94 ]. The key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters quickly [ 82 ]. However, its performance may be affected by its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [ 82 ].
  • Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is a linear decision boundary classifier created by fitting class-conditional densities to data and applying Bayes’ rule [ 51 , 82 ]. This method is also known as a generalization of Fisher’s linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a reduction of dimensionality that minimizes the complexity of the model or reduces the resulting model’s computational costs. The standard LDA model fits each class with a Gaussian density, assuming that all classes share the same covariance matrix [ 82 ]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which seek to express one dependent variable as a linear combination of other features or measurements.
  • Logistic regression (LR): Another common probabilistic statistical model used to solve classification problems in machine learning is logistic regression (LR) [ 64 ]. Logistic regression typically estimates the probabilities using a logistic function, the mathematically defined sigmoid function $g(z) = \frac{1}{1 + \exp(-z)}$ (Eq. 1). It may overfit on high-dimensional datasets and works well when the dataset can be separated linearly. The regularization (L1 and L2) techniques [ 82 ] can be used to avoid overfitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification.
  • K-nearest neighbors (KNN): K-Nearest Neighbors (KNN) [ 9 ] is an “instance-based” or non-generalizing learning algorithm, also known as a “lazy learning” algorithm. It does not focus on constructing a general internal model; instead, it stores all instances corresponding to the training data in n-dimensional space. KNN classifies new data points based on similarity measures (e.g., the Euclidean distance function) [ 82 ]. The classification is computed from a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and its accuracy depends on the data quality. The biggest issue with KNN is choosing the optimal number of neighbors to consider. KNN can be used for classification as well as regression.
  • Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is the support vector machine (SVM) [ 56 ]. In high- or infinite-dimensional space, a support vector machine constructs a hyperplane or set of hyperplanes. Intuitively, the hyperplane that has the greatest distance from the nearest training data points of any class achieves a strong separation since, in general, the larger the margin, the lower the classifier’s generalization error. SVM is effective in high-dimensional spaces and can behave differently based on different mathematical functions known as kernels. Linear, polynomial, radial basis function (RBF), sigmoid, etc., are popular kernel functions used in the SVM classifier [ 82 ]. However, when the dataset contains more noise, such as overlapping target classes, SVM does not perform well.

Fig. 4: An example of a decision tree structure.

Fig. 5: An example of a random forest structure considering multiple decision trees.

  • Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve poor classifiers by learning from their errors. It was developed by Yoav Freund et al. [ 35 ] and is also known as “meta-learning”. Unlike the random forest, which uses parallel ensembling, AdaBoost uses “sequential ensembling”. It creates a powerful classifier by combining many poorly performing classifiers to obtain a good classifier of high accuracy. In that sense, AdaBoost is called an adaptive classifier, as it significantly improves the efficiency of the classifier, but in some instances it can trigger overfitting. AdaBoost is best used to boost the performance of decision trees, its base estimator [ 82 ], on binary classification problems; however, it is sensitive to noisy data and outliers.
  • Extreme gradient boosting (XGBoost): Gradient boosting, like random forests [ 19 ] above, is an ensemble learning algorithm that generates a final model based on a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [ 41 ] use gradient descent to optimize weights. Extreme gradient boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [ 82 ]. It computes second-order gradients of the loss function to minimize loss and uses advanced regularization (L1 and L2) [ 82 ], which reduces overfitting and improves model generalization and performance. XGBoost is fast to interpret and can handle large-sized datasets well.
  • Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [ 41 ] is an iterative method for optimizing an objective function with suitable smoothness properties, where the word ‘stochastic’ refers to random probability. This reduces the computational burden, particularly in high-dimensional optimization problems, allowing for faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function that calculates a variable’s degree of change in response to another variable’s changes. Mathematically, gradient descent operates on a convex function, and its update uses the partial derivatives of the objective with respect to the input parameters. Let $\alpha$ be the learning rate and $J_i$ the cost of the $i$th training example; then Eq. (4), $w_j := w_j - \alpha \frac{\partial J_i}{\partial w_j}$, represents the stochastic gradient descent weight update at the $j$th iteration (a from-scratch sketch of this update is given after this list). In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [ 82 ]. However, SGD is sensitive to feature scaling and needs a range of hyperparameters, such as the regularization parameter and the number of iterations.
  • Rule-based classification: The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms with the ability to generate rules exist, such as Zero-R [ 125 ], One-R [ 47 ], decision trees [ 87 , 88 ], DTNB [ 110 ], Ripple Down Rule learner (RIDOR) [ 125 ], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [ 126 ]. The decision tree is one of the most common rule-based classification algorithms among these techniques because it has several advantages, such as being easier to interpret; the ability to handle high-dimensional data; simplicity and speed; good accuracy; and the capability to produce rules that are clear and understandable to humans [ 127 , 128 ]. The decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [ 106 ]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system, including its entities and their relationships.
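As a worked example of Eqs. (1) and (4), the sketch below trains a logistic regression classifier from scratch with stochastic gradient descent; the synthetic data, the learning rate, and the number of epochs are assumptions, and the bias term and regularization are omitted to keep the update identical to Eq. (4).

```python
import numpy as np

def sigmoid(z):
    # Eq. (1): g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # 100 examples, 2 features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # labels from a known linear rule

w = np.zeros(2)   # weights to learn; bias omitted for brevity
alpha = 0.1       # learning rate (an assumed value, not tuned)
for epoch in range(20):
    for i in rng.permutation(len(X)):      # one example at a time: "stochastic"
        p = sigmoid(X[i] @ w)              # predicted probability
        grad = (p - y[i]) * X[i]           # dJ_i/dw for the log loss
        w -= alpha * grad                  # Eq. (4): w_j := w_j - alpha * dJ_i/dw_j

accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
print("learned weights:", w, "training accuracy:", accuracy)
```

A production implementation would add a bias term, regularization, and feature scaling, to which SGD is sensitive as noted above.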

Regression Analysis

Regression analysis includes several machine learning methods that allow predicting a continuous ( y ) result variable based on the value of one or more ( x ) predictor variables [ 41 ]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression facilitates the prediction of a continuous quantity. Figure 6 shows an example of how classification differs from regression models. Some overlaps are often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting or prediction, cost estimation, trend analysis, marketing, time series estimation, drug response modeling, and many more. Some of the familiar types of regression algorithms are linear, polynomial, LASSO, and ridge regression, which are explained briefly in the following.

  • Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable ( Y ) and one or more independent variables ( X ), also known as the regression line, using the best-fit straight line [ 41 ]. It is defined by the equations $y = a + bx + e$ (5) and $y = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + e$ (6), where $a$ is the intercept, $b$ (or $b_1, \dots, b_n$) the slope coefficient(s), and $e$ the error term. These equations can be used to predict the value of the target variable based on the given predictor variable(s). Multiple linear regression (Eq. 6) is an extension of simple linear regression (Eq. 5) that allows two or more predictor variables to model a response variable y as a linear function [ 41 ], whereas simple linear regression has only one independent variable.
  • Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is not linear but is modeled as an $n$th-degree polynomial in x [ 82 ]. The equation for polynomial regression is derived from the linear regression (polynomial regression of degree 1) equation and is defined as $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n + e$ (7). Here, y is the predicted/target output, $b_0, b_1, \dots, b_n$ are the regression coefficients, and x is an independent/input variable. In simple words, if the data are not distributed linearly but follow an $n$th-degree polynomial, then we use polynomial regression to get the desired output.
  • LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques typically used for building learning models in the presence of a large number of features, due to their capability to prevent overfitting and reduce the complexity of the model. The LASSO (least absolute shrinkage and selection operator) regression model uses the $L_1$ regularization technique [ 82 ], which applies shrinkage by penalizing the “absolute value of the magnitude of coefficients” ($L_1$ penalty). As a result, LASSO can drive coefficients to exactly zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. On the other hand, ridge regression uses $L_2$ regularization [ 82 ], which penalizes the “squared magnitude of coefficients” ($L_2$ penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient value to zero, yielding a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, and ridge regression is useful when a dataset has “multicollinearity”, i.e., predictors that are correlated with other predictors. These variants are illustrated in the code sketch after this list.
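The regression variants above can be sketched in a few lines of scikit-learn; the synthetic cubic data, the polynomial degree, and the alpha values below are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=1.0, size=100)  # cubic + noise

# Simple linear regression, Eq. (5): y = a + b*x.
linear = LinearRegression().fit(x, y)

# Polynomial regression, Eq. (7): expand x into [x, x^2, x^3], then fit linearly.
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(x, y)

# LASSO (L1 penalty) can shrink some coefficients exactly to zero;
# ridge (L2 penalty) only shrinks them toward zero (non-sparse solution).
lasso = make_pipeline(PolynomialFeatures(degree=3),
                      Lasso(alpha=0.1, max_iter=10000)).fit(x, y)
ridge = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1.0)).fit(x, y)

print("linear R^2:", round(linear.score(x, y), 3))   # expected to underfit the cubic
print("poly   R^2:", round(poly.score(x, y), 3))
print("lasso coefficients:", lasso.named_steps["lasso"].coef_.round(2))
```

The plain linear fit is expected to underfit the cubic signal, while the degree-3 pipelines capture the curvature; printing the LASSO coefficients shows its sparsity-inducing behavior.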

Fig. 6: Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables.

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for a specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than objects in other groups [ 41 ]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, user modeling, and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

  • Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. Data scientists or analysts typically determine the number of clusters to produce, either dynamically or statically, depending on the nature of the target application. The most common clustering algorithms based on partitioning methods are K-means [ 69 ], K-Medoids [ 80 ], CLARA [ 55 ], etc.
  • Density-based methods: To identify distinct groups or clusters, it uses the concept that a cluster in the data space is a contiguous region of high point density isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered as noise. The typical clustering algorithms based on density are DBSCAN [ 32 ], OPTICS [ 12 ] etc. The density-based methods typically struggle with clusters of similar density and high dimensionality data.

Fig. 7: A graphical interpretation of the widely used hierarchical clustering (bottom-up and top-down) technique.

  • Grid-based methods: To deal with massive datasets, grid-based clustering is especially suitable. To obtain clusters, the principle is first to summarize the dataset with a grid representation and then to combine grid cells. STING [ 122 ], CLIQUE [ 6 ], etc. are the standard algorithms of grid-based clustering.
  • Model-based methods: There are mainly two types of model-based clustering algorithms: one that uses statistical learning, and the other based on a method of neural network learning [ 130 ]. For instance, GMM [ 89 ] is an example of a statistical learning method, and SOM [ 22 ] [ 96 ] is an example of a neural network learning method.
  • Constraint-based methods: Constrained-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application or user-oriented constraints are incorporated to perform the clustering. The typical algorithms of this kind of clustering are COP K-means [ 121 ], CMWK-Means [ 27 ], etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

  • K-means clustering: K-means clustering [ 69 ] is a fast, robust, and simple algorithm that provides reliable results when datasets are well separated from each other. The data points are allocated to a cluster in such a way that the sum of the squared distances between the data points and the centroid is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest cluster while keeping the centroids as small as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Since extreme values can easily affect a mean, the K-means clustering algorithm is sensitive to outliers. K-medoids clustering [ 91 ] is a variant of K-means that is more robust to noise and outliers.
  • Mean-shift clustering: Mean-shift clustering [ 37 ] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. Mean-shift clustering aims to discover “blobs” in a smooth distribution or density of samples [ 82 ]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points in a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing are examples of application domains. Mean Shift has the disadvantage of being computationally expensive. Moreover, in cases of high dimension, where the number of clusters shifts abruptly, the mean-shift algorithm does not work well.
  • DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [ 32 ] is a base algorithm for density-based clustering that is widely used in data mining and machine learning. It is a non-parametric density-based clustering technique for separating high-density clusters from low-density clusters. DBSCAN’s main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in a vast volume of data that is noisy and contains outliers. Unlike k-means, DBSCAN does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster than DBSCAN, DBSCAN is efficient at finding high-density regions and outliers, i.e., it is robust to outliers (k-means, DBSCAN, and GMM are contrasted in the code sketch after this list).
  • GMM clustering: Gaussian mixture models (GMMs) are often used for data clustering, which is a distribution-based clustering algorithm. A Gaussian mixture model is a probabilistic model in which all the data points are produced by a mixture of a finite number of Gaussian distributions with unknown parameters [ 82 ]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [ 82 ] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the likelihood that a data point belongs to one of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.
  • Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [ 115 ], Complete linkage [ 116 ], BOTS [ 102 ] etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structure hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help to make better decisions in the relevant application areas.
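A minimal comparison of three of the algorithms above on toy blob data is sketched below; the number of clusters, DBSCAN’s eps and min_samples, and the data itself are assumptions chosen for illustration rather than tuned values.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=7)

# K-means: must be told k in advance; sensitive to outliers.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

# DBSCAN: infers the number of clusters itself and labels noise points as -1.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# GMM: soft (probabilistic) assignments rather than hard ones.
gmm = GaussianMixture(n_components=3, random_state=7).fit(X)

print("k-means labels:", kmeans_labels[:10])
print("DBSCAN clusters found:", len(set(dbscan_labels) - {-1}))
print("GMM membership probabilities of first point:", gmm.predict_proba(X[:1]).round(2))
```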

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation, lowers computational costs, and avoids overfitting and redundancy by simplifying models. Both feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between the selection and extraction of features is that “feature selection” keeps a subset of the original features [ 97 ], while “feature extraction” creates brand new ones [ 98 ]. In the following, we briefly discuss these techniques.

  • Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of unique features (variables, predictors) to use in building a machine learning and data science model. It decreases a model’s complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A proper and optimal subset of selected features in a problem domain can minimize the overfitting problem by simplifying and generalizing the model, and it also increases the model’s accuracy [ 97 ]. Thus, “feature selection” [ 66 , 99 ] is considered one of the primary concepts in machine learning, greatly affecting the effectiveness and efficiency of the target machine learning model. The chi-squared test, analysis of variance (ANOVA) test, Pearson’s correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.
  • Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and a way to reduce computational cost or training time. The aim of “feature extraction” [ 66 , 99 ] is to reduce the number of features in a dataset by generating new ones from the existing ones and then discarding the original features. The majority of the information found in the original set of features can then be summarized using this new, reduced set of features. For instance, principal component analysis (PCA) is often used as a dimensionality-reduction technique to extract a lower-dimensional space, creating brand-new components from the existing features in a dataset [ 98 ] (a PCA sketch follows this list).
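A minimal PCA sketch, assuming the Iris dataset and two retained components purely for illustration, shows the extraction workflow described above: the original features are replaced by new components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)          # 150 samples, 4 original features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                  # keep 2 new components, drop the rest
X_reduced = pca.fit_transform(X_scaled)

print("shape before/after:", X.shape, "->", X_reduced.shape)
print("variance explained by PC1, PC2:", pca.explained_variance_ratio_.round(2))
```

Scaling before PCA is a common design choice, since components are otherwise dominated by whichever feature happens to have the largest variance.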

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

  • Variance threshold: A simple, basic approach to feature selection is the variance threshold [ 82 ]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. By default it eliminates all zero-variance features, i.e., features that have the same value in all samples. This feature selection algorithm looks only at the features ( X ), not at the desired outputs ( y ), and can therefore be used for unsupervised learning.
  • Pearson correlation: Pearson’s correlation is another method to understand a feature’s relation to the response variable and can be used for feature selection [ 99 ]. This method is also used for finding the association between features in a dataset. The resulting value lies in $[-1, 1]$, where $-1$ means perfect negative correlation, $+1$ means perfect positive correlation, and 0 means that the two variables do not have a linear correlation. If two random variables are represented by X and Y, then the correlation coefficient between X and Y is defined as [ 41 ] $r(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$ (8).
  • ANOVA: Analysis of variance (ANOVA) is a statistical tool used to verify whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, as well as a normal distribution of the variables. To statistically test the equality of means, the ANOVA method utilizes F-tests. For feature selection, the resulting ‘ANOVA F value’ [ 82 ] of this test can be used to omit features that are independent of the target variable.
  • Chi square: The chi-square ($\chi^2$) [ 82 ] statistic is an estimate of the difference between the observed and expected frequencies of a series of events or variables. The value of $\chi^2$ depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square ($\chi^2$) test is commonly used for testing relationships between categorical variables. If $O_i$ represents an observed value and $E_i$ an expected value, then $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$ (9).
  • Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a brute-force approach to feature selection. RFE [ 82 ] repeatedly fits the model and removes the weakest feature until the specified number of features is reached. Features are ranked by the model’s coefficients or feature importances. By recursively removing a small number of features per iteration, RFE aims to eliminate dependencies and collinearity in the model.
  • Model-based selection: To reduce the dimensionality of the data, linear models penalized with $L_1$ regularization can be used. Least absolute shrinkage and selection operator (LASSO) regression is a type of linear regression that has the property of shrinking some of the coefficients to zero [ 82 ]; such features can then be removed from the model. Thus, the penalized LASSO regression method is often used in machine learning to select a subset of variables. The Extra-Trees classifier [ 82 ] is an example of a tree-based estimator that can be used to compute impurity-based feature importances, which can then be used to discard irrelevant features. Several of the selection methods above are illustrated in the code sketch after this list.
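Three of the selection methods above are sketched below on the same dataset; the threshold, the choice of k, and the estimator inside RFE are illustrative assumptions (note that the chi-square test requires non-negative features, which holds for the Iris measurements).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # 4 non-negative features, 3 classes

# Variance threshold: unsupervised, looks only at X.
X_vt = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-square (Eq. 9): scores each feature against the target, keeps the best k.
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# RFE: repeatedly fits a model and drops the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)

print("variance threshold kept:", X_vt.shape[1], "of", X.shape[1], "features")
print("chi2 kept:", X_chi2.shape[1], "features")
print("RFE selected feature mask:", rfe.support_)
```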

Fig. 8: An example of principal component analysis (PCA) and the created principal components PC1 and PC2 in a different dimension space.

Association Rule Learning

Association rule learning is a rule-based machine learning approach to discover interesting relationships between variables in large datasets, expressed as “IF-THEN” statements [ 7 ]. One example is that “if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time”. Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of items within or across transactions. A common way of measuring the usefulness of association rules is to use their parameters ‘support’ and ‘confidence’, introduced in [ 7 ].

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

  • AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [ 7 ] for association rule mining. The AIS algorithm’s main downside is that too many candidate itemsets are generated, requiring more space and wasting a lot of effort. The algorithm also requires too many passes over the entire dataset to produce the rules. Another approach, SETM [ 49 ], exhibits good performance and stable behavior with execution time; however, it suffers from the same flaw as the AIS algorithm.
  • Apriori: For generating association rules for a given dataset, Agrawal et al. [ 8 ] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These later algorithms outperform the AIS and SETM mentioned above due to the Apriori property of frequent itemsets [ 8 ]. The term ‘Apriori’ usually refers to having prior knowledge of frequent itemset properties. Apriori uses a “bottom-up” approach to generate the candidate itemsets. To reduce the search space, Apriori uses the property that “all subsets of a frequent itemset must be frequent; and if an itemset is infrequent, then all its supersets must also be infrequent”. Another approach, predictive Apriori [ 108 ], can also generate rules; however, it may produce unexpected results because it combines support and confidence into a single measure. Apriori [ 8 ] is the most widely applied technique in mining association rules (a usage sketch follows this list).
  • ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.
  • FP-Growth: Another common association rule learning technique, based on the frequent-pattern tree (FP-tree) and proposed by Han et al. [ 42 ], is Frequent Pattern Growth, known as FP-Growth. The key difference from Apriori is that while generating rules, the Apriori algorithm [ 8 ] generates frequent candidate itemsets, whereas the FP-Growth algorithm [ 42 ] avoids candidate generation and instead builds a tree using a ‘divide and conquer’ strategy. Due to its sophistication, however, the FP-tree is challenging to use in an interactive mining environment [ 133 ]. Moreover, the FP-tree may not fit into memory for massive datasets, making big data challenging to process as well. Another solution is RARM (Rapid Association Rule Mining), proposed by Das et al. [ 26 ], but it faces a related FP-tree issue [ 133 ].
  • ABC-RuleMiner: ABC-RuleMiner is a rule-based machine learning method, recently proposed in our earlier paper, Sarker et al. [ 104 ], to discover interesting non-redundant rules for providing real-world intelligent services. This algorithm effectively identifies redundancy in associations by taking into account the impact or precedence of the related contextual features and discovers a set of non-redundant association rules. It first constructs an association generation tree (AGT) in a top-down manner and then extracts the association rules by traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment where human or user preferences are involved.
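As a usage illustration for Apriori, here is a hedged sketch based on the third-party mlxtend library (an assumption of this example, not a tool used in the paper); the transactions are hypothetical.

```python
# Hedged sketch: frequent itemsets and rules via Apriori with mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["computer", "antivirus"],
                ["computer", "antivirus", "mouse"],
                ["laptop", "antivirus"],
                ["computer", "keyboard"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Itemsets meeting the user-specified minimum support
frequent = apriori(onehot, min_support=0.5, use_colnames=True)

# Rules meeting the minimum confidence threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```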

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment using input from its actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method is based on interacting with the environment. The problem to be solved in reinforcement learning (RL) is defined as a Markov Decision Process (MDP) [ 86 ], i.e., all about sequentially making decisions. An RL problem typically includes four elements such as Agent, Environment, Rewards, and Policy.

RL can be split roughly into model-based and model-free techniques. Model-based RL is the process of inferring optimal behavior from a model of the environment by performing actions and observing the results, which include the next state and the immediate reward [ 85 ]. AlphaZero and AlphaGo [ 113 ] are examples of model-based approaches. On the other hand, a model-free approach does not use the transition probability distribution and the reward function associated with the MDP. Q-learning, Deep Q Network, Monte Carlo Control, SARSA (State–Action–Reward–State–Action), etc. are some examples of model-free algorithms [ 52 ]. The key difference between the two is that a model-based method learns and exploits a model of the environment, whereas a model-free method learns directly from interaction without one. In the following, we discuss the popular RL algorithms.

  • Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a wide category of computational algorithms that rely on repeated random sampling to obtain numerical results [ 52 ]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and making drawings from the probability distribution are the three problem classes where Monte Carlo techniques are most commonly used.
  • Q-learning: Q-learning is a model-free reinforcement learning algorithm for learning the quality of actions, telling an agent what action to take under what conditions [ 52 ]. It does not need a model of the environment (hence the term “model-free”), and it can deal with stochastic transitions and rewards without the need for adaptations. The ‘Q’ in Q-learning usually stands for quality, as the algorithm calculates the maximum expected reward for a given action in a given state (a minimal update-rule sketch follows this list).
  • Deep Q-learning: In Deep Q-Learning [ 52 ], the current state is fed into a neural network, which returns the Q-value of all possible actions as an output. Q-learning works well when the setting is reasonably simple; however, when the number of states and actions becomes large, a deep neural network can be used as a function approximator.
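The following is a minimal sketch of tabular Q-learning on a toy, made-up environment; the update rule is the standard one, while the environment, constants, and state space are illustrative assumptions.

```python
# Tabular Q-learning sketch:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

def step(state, action):
    """Toy environment (hypothetical): deterministic transition and reward."""
    next_state = (state + action) % n_states
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

state = 0
for _ in range(1000):
    # Epsilon-greedy action selection: explore with probability epsilon
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    # Q-learning update toward the bootstrapped target
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max()
                                 - Q[state, action])
    state = next_state

print(Q)
```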

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations analysis, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of artificial neural network (ANN)-based machine learning approaches with representation learning. Deep learning provides a computational architecture by combining several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [ 105 , 129 ]. Figure 9 shows the general performance of deep learning relative to machine learning as the amount of data increases. However, the result may vary depending on the data characteristics and the experimental setup.

[Figure 9] Machine learning and deep learning performance in general with the amount of data.

The most common deep learning algorithms are the Multi-layer Perceptron (MLP), the Convolutional Neural Network (CNN, or ConvNet), and the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [ 96 ]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

[Figure 10] A structure of an artificial neural network model with multiple processing layers.

[Figure 11] An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers.

  • LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. Unlike normal feed-forward neural networks, LSTM has feedback connections. LSTM networks are well-suited for analyzing and learning sequential data, such as classifying, processing, and predicting data based on time series, which differentiates them from other conventional networks. Thus, LSTM can be used whenever the data are in a sequential format, such as time series or sentences, and is commonly applied in time-series analysis, natural language processing, speech recognition, and similar areas (a minimal Keras-style sketch follows).
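Below is a minimal sketch of an LSTM sequence classifier, assuming TensorFlow/Keras is available; the shapes and random data are placeholders rather than a real time-series task.

```python
# Hedged sketch: a small LSTM for binary sequence classification (Keras).
import numpy as np
import tensorflow as tf

# 100 hypothetical sequences, each with 20 time steps of 8 features
X = np.random.rand(100, 20, 8).astype("float32")
y = np.random.randint(0, 2, size=(100,))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20, 8)),
    tf.keras.layers.LSTM(32),                       # recurrent layer
    tf.keras.layers.Dense(1, activation="sigmoid")  # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
```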

In addition to the most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent high-dimensional data on a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique widely used for dimensionality reduction as well as feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBM) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a deep learning network that can generate data with characteristics close to the actual input data. Transfer learning, typically the re-use of a pre-trained model on a new problem, is currently very common because it can train deep neural networks with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is provided in our earlier paper, Sarker et al. [ 96 ].

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its capability to learn from past data and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

  • Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making by data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict the unknown outcome [ 41 ]. Examples include identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms such as decision trees, support vector machines, artificial neural networks, etc. [ 106 , 125 ] are commonly used in the area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.
  • Cybersecurity and threat intelligence: Cybersecurity is one of the most essential areas of Industry 4.0 [ 114 ], which is typically the practice of protecting networks, systems, hardware, and data from digital attacks [ 114 ]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where bad neighborhoods are online, keep people safe while browsing, or secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [ 97 ]. Various deep learning-based security models can also be used on large-scale security datasets [ 96 , 129 ]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building a rule-based security system [ 105 ]. Thus, we can say that the various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.
  • Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [ 114 ], which turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered to be the big frontier that can enhance almost all activities in our lives, such as smart governance, smart home, education, communication, transportation, retail, agriculture, health care, business, and many more [ 70 ]. The smart city is one of IoT’s core fields of application, using technologies to enhance city services and residents’ living experiences [ 132 , 135 ]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [ 103 ]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens’ total energy usage for a particular period, and making context-aware and timely decisions for people are some tasks that can be solved using machine learning techniques according to current needs.
  • Traffic prediction and transportation: Transportation systems have become a crucial component of every country’s economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO 2 pollution, accidents, emergencies, and a decline in modern society’s quality of life [ 40 ]. Thus, an intelligent transportation system through predicting future traffic is important, which is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize the issues [ 17 , 30 , 31 ]. For example, based on the travel history and trend of traveling through various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and recommending their customers to take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.
  • Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [ 33 , 77 , 112 ]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [ 3 ]. Recently, learning techniques have become popular in the battle against COVID-19 [ 61 , 63 ]. For the COVID-19 pandemic, learning techniques are used to classify patients at high risk, estimate their mortality rate, and detect other anomalies [ 61 ]. Machine learning can also be used to better understand the virus’s origin, predict the COVID-19 outbreak, and support disease diagnosis and treatment [ 14 , 50 ]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread and notify those regions to make the required arrangements. Deep learning also provides exciting solutions to the problems of medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [ 10 , 78 , 111 ]. Overall, machine and deep learning techniques can help fight the COVID-19 virus and the pandemic, as well as support intelligent clinical decision-making in the healthcare domain.
  • E-commerce and product recommendations: Product recommendation is one of the most well known and widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers’ purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing is the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content that are tailored to the needs of their customers, allowing them to maintain existing customers while attracting new ones.
  • NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [ 79 , 103 ]. Thus, NLP helps computers, for instance, to read a text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, where machine learning techniques can be used. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [ 90 ] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text through blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment of their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more intense emotions like very happy, happy, sad, very sad, angry, interested, or not interested.
  • Image, speech and pattern recognition: Image recognition [ 36 ] is a well-known and widespread example of machine learning in the real world, which can identify an object as a digital image. For instance, to label an x-ray as cancerous or not, character recognition, or face detection in an image, tagging suggestions on social media, e.g., Facebook, are common examples of image recognition. Speech recognition [ 23 ] is also very popular that typically uses sound and linguistic models, e.g., Google Assistant, Cortana, Siri, Alexa, etc. [ 67 ], where machine learning methods are used. Pattern recognition [ 13 ] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques such as classification, feature selection, clustering, or sequence labeling methods are used in the area.
  • Sustainable agriculture: Agriculture is essential to the survival of all human activities [ 109 ]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [ 5 , 25 , 109 ]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to enhance their decisions to adopt sustainable agriculture practices, utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT), mobile technologies and devices, etc. [ 5 , 53 , 54 ]. Machine learning can be applied in various phases of sustainable agriculture: in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.
  • User behavior analytics and context-aware smartphone applications: Context-awareness is a system’s ability to capture knowledge about its surroundings at any moment and modify behaviors accordingly [ 28 , 93 ]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has been changed greatly with the power of AI, particularly, machine learning techniques through their learning capabilities from contextual data [ 103 , 136 ]. Thus, the developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior, support, and entertain users [ 107 , 137 , 140 ]. To build various personalized data-driven context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, decision-making that intelligently assist end mobile phone users in a pervasive computing environment, machine learning techniques are applicable. For example, context-aware association rules can be used to build an intelligent phone call application [ 104 ]. Clustering approaches are useful in capturing users’ diverse behavioral activities by taking into account data in time series [ 102 ]. To predict the future events in various contexts, the classification methods can be used [ 106 , 139 ]. Thus, various learning techniques discussed in Sect. “ Machine Learning Tasks and Algorithms ” can help to build context-aware adaptive and smart applications according to the preferences of the mobile phone users.

In addition to these application areas, machine learning-based models can also apply to several other domains such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and the efficiency of a machine learning-based solution depend on the nature and characteristics of the data and the performance of the learning algorithms. Collecting data in relevant domains, such as cybersecurity, IoT, healthcare and agriculture discussed in Sect. “Applications of Machine Learning”, is not straightforward, although the current cyberspace enables the production of a huge amount of data with very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing that data are important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed while working on real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless data. The machine learning algorithms discussed in Sect. “Machine Learning Tasks and Algorithms” are highly impacted by data quality and availability for training, and consequently so is the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to use the learning algorithms effectively in the associated application domain.

To analyze the data and extract insights, there exist many machine learning algorithms, summarized in Sect. “Machine Learning Tasks and Algorithms”. Thus, selecting a learning algorithm that is suitable for the target application is challenging, because the outcome of different learning algorithms may vary depending on the data characteristics [ 106 ]. Selecting the wrong learning algorithm would produce unexpected outcomes, which may lead to wasted effort as well as reduced model effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can directly be used to solve many real-world issues in diverse domains, such as cybersecurity, smart cities and healthcare, summarized in Sect. “Applications of Machine Learning”. However, hybrid learning models, e.g., ensembles of methods, modifications or enhancements of existing learning techniques, or the design of new learning methods, could be potential future work in the area.

Thus, the ultimate success of a machine learning-based solution and its corresponding applications mainly depends on both the data and the learning algorithms. If the data are unsuitable for learning, such as non-representative, of poor quality, containing irrelevant features, or insufficient in quantity for training, then the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

In this paper, we have conducted a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. According to our goal, we have briefly discussed how various types of machine learning methods can be used to develop solutions to various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. Sophisticated learning algorithms must be trained on collected real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The identified challenges create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can be used as a reference guide for potential research and applications for both academia and industry professionals, as well as for decision-makers, from a technical point of view.

Declaration

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open access | Published: 09 September 2022

Machine learning in project analytics: a data-driven framework and case study

Shahadat Uddin, Stephen Ong & Haohui Lu
Scientific Reports, volume 12, Article number: 15252 (2022)


Subjects: Applied mathematics, Computational science

The analytic procedures incorporated to facilitate the delivery of projects are often referred to as project analytics. Existing techniques focus on retrospective reporting and understanding the underlying relationships to make informed decisions. Although machine learning algorithms have been widely used in addressing problems within various contexts (e.g., streamlining the design of construction projects), limited studies have evaluated pre-existing machine learning methods within the delivery of construction projects. Due to this, the current research aims to contribute further to this convergence between artificial intelligence and the execution of construction projects through the evaluation of a specific set of machine learning algorithms. This study proposes a machine learning-based data-driven research framework for addressing problems related to project analytics. It then illustrates an example of the application of this framework. In this illustration, existing data from an open-source data repository on construction projects and cost overrun frequencies was studied, and several machine learning models (Python’s Scikit-learn package) were tested and evaluated. The data consisted of 44 independent variables (from materials to labour and contracting) and one dependent variable (project cost overrun frequency), which was categorised for processing under several machine learning models. These models include support vector machine, logistic regression, k-nearest neighbour, random forest, stacking (ensemble) model and artificial neural network. Feature selection and evaluation methods, including univariate feature selection, recursive feature elimination, SelectFromModel and the confusion matrix, were applied to determine the most accurate prediction model. This study also discusses the generalisability of using the proposed research framework in other research contexts within the field of project management. The proposed framework, its illustration in the context of construction projects and its potential to be adopted in different contexts will significantly contribute to project practitioners, stakeholders and academics in addressing many project-related issues.

Introduction

Successful projects require the presence of appropriate information and technology 1 . Project analytics provides an avenue for informed decisions to be made through the lifecycle of a project. Project analytics applies various statistical methods (e.g., earned value analysis or Monte Carlo simulation), among other models, to make evidence-based decisions. They are used to manage risks as well as project execution 2 . Project analytics is often employed for its additional benefits, including the ability to forecast and make predictions, benchmark against other projects, and determine trends such as those that are time-dependent 3 , 4 , 5 . There has been increasing interest in project analytics and how current technology applications can be incorporated and utilised 6 . Broadly, project analytics can be understood on five levels 4 . The first is descriptive analytics, which incorporates retrospective reporting. The second is diagnostic analytics, which aims to understand the interrelationships and underlying causes and effects. The third is predictive analytics, which seeks to make predictions. Subsequent to this is prescriptive analytics, which prescribes steps following predictions. Finally, cognitive analytics aims to predict future problems. The first three levels can be applied with ease with the help of technology. The fourth and fifth levels require data that is generally more difficult to obtain, as it may be less accessible or unstructured. Further, although project key performance indicators can be challenging to define 2 , identifying common measurable features facilitates this 7 . It is anticipated that project analytics will continue to experience development due to its direct benefits to the major baseline measures focused on productivity, profitability, cost, and time 8 . The nature of project management itself is fluid and flexible, and project analytics provides an avenue for which machine learning algorithms can be applied 9 .

Machine learning within the field of project analytics falls into the category of cognitive analytics, which deals with problem prediction. Generally, machine learning explores the possibilities of computers to improve processes through training or experience 10 . It can also build on the pre-existing capabilities and techniques prevalent within management to accomplish complex tasks 11 . Due to its practical use and broad applicability, recent developments have led to the invention and introduction of newer and more innovative machine learning algorithms and techniques. Artificial intelligence, for instance, allows for software to develop computer vision, speech recognition, natural language processing, robot control, and other applications 10 . Specific to the construction industry, it is now used to monitor construction environments through a virtual reality and building information modelling replication 12 or risk prediction 13 . Within other industries, such as consumer services and transport, machine learning is being applied to improve consumer experiences and satisfaction 10 , 14 and reduce the human errors of traffic controllers 15 . Recent applications and development of machine learning broadly fall into the categories of classification, regression, ranking, clustering, dimensionality reduction and manifold learning 16 . Current learning models include linear predictors, boosting, stochastic gradient descent, kernel methods, and nearest neighbour, among others 11 . Newer and more applications and learning models are continuously being introduced to improve accessibility and effectiveness.

Specific to the management of construction projects, other studies have also been made to understand how copious amounts of project data can be used 17 , the importance of ontology and semantics throughout the nexus between artificial intelligence and construction projects 18 , 19 as well as novel approaches to the challenges within this integration of fields 20 , 21 , 22 . There have been limited applications of pre-existing machine learning models on construction cost overruns. They have predominantly focussed on applications to streamline the design processes within construction 23 , 24 , 25 , 26 , and those which have investigated project profitability have not incorporated the types and combinations of algorithms used within this study 6 , 27 . Furthermore, existing applications have largely been skewed towards one type or another 28 , 29 .

In addition to the frequently used earned value method (EVM), researchers have been applying many other powerful quantitative methods to address a diverse range of project analytics research problems over time. Examples of those methods include time series analysis, fuzzy logic, simulation, network analytics, and network correlation and regression. Time series analysis uses longitudinal data to forecast an underlying project's future needs, such as the time and cost 30 , 31 , 32 . Few other methods are combined with EVM to find a better solution for the underlying research problems. For example, Narbaev and De Marco 33 integrated growth models and EVM for forecasting project cost at completion using data from construction projects. For analysing the ongoing progress of projects having ambiguous or linguistic outcomes, fuzzy logic is often combined with EVM 34 , 35 , 36 . Yu et al. 36 applied fuzzy theory and EVM for schedule management. Ponz-Tienda et al. 35 found that using fuzzy arithmetic on EVM provided more objective results in uncertain environments than the traditional methodology. Bonato et al. 37 integrated EVM with Monte Carlo simulation to predict the final cost of three engineering projects. Batselier and Vanhoucke 38 compared the accuracy of the project time and cost forecasting using EVM and simulation. They found that the simulation results supported findings from the EVM. Network methods are primarily used to analyse project stakeholder networks. Yang and Zou 39 developed a social network theory-based model to explore stakeholder-associated risks and their interactions in complex green building projects. Uddin 40 proposed a social network analytics-based framework for analysing stakeholder networks. Ong and Uddin 41 further applied network correlation and regression to examine the co-evolution of stakeholder networks in collaborative healthcare projects. Although many other methods have already been used, as evident in the current literature, machine learning methods or models are yet to be adopted for addressing research problems related to project analytics. The current investigation is derived from the cognitive analytics component of project analytics. It proposes an approach for determining hidden information and patterns to assist with project delivery. Figure  1 illustrates a tree diagram showing different levels of project analytics and their associated methods from the literature. It also illustrates existing methods within the cognitive component of project analytics to where the application of machine learning is situated contextually.

[Figure 1] A tree diagram of different project analytics methods, also showing where the current study belongs. Although earned value analysis is commonly used in project analytics, it is not included in this figure since it is used in the first three levels of project analytics.

Machine learning models have several notable advantages over traditional statistical methods that play a significant role in project analytics 42 . First, machine learning algorithms can quickly identify trends and patterns by simultaneously analysing a large volume of data. Second, they are more capable of continuous improvement: machine learning algorithms can improve their accuracy and efficiency for decision-making through subsequent training on new data. Third, machine learning algorithms efficiently handle multi-dimensional and multi-variety data in dynamic or uncertain environments. Fourth, they are compelling for automating various decision-making tasks; for example, machine learning-based sentiment analysis can easily detect a negative tweet and automatically take further necessary steps. Last but not least, machine learning has been helpful across various industries, from defence to education 43 . Current research has seen the development of several different branches of artificial intelligence (including robotics, automated planning and scheduling, and optimisation) within safety monitoring, risk prediction, cost estimation and so on 44 . This has progressed from the application of regression to project cost overruns 45 to the current deep-learning implementations within the construction industry 46 . Despite this, the uses remain largely limited and are still in a developmental state. The benefits of applications are noted, such as optimising and streamlining existing processes; however, high initial costs form a barrier to accessibility 44 .

The primary goal of this study is to demonstrate the applicability of different machine learning algorithms in addressing problems related to project analytics. Limitations in applying machine learning algorithms within the context of construction projects have been explored previously. However, preceding research has mainly been conducted to improve the design processes specific to construction 23 , 24 , and those investigating project profitabilities have not incorporated the types and combinations of algorithms used within this study 6 , 27 . For instance, preceding research has incorporated a different combination of machine-learning algorithms in research of predicting construction delays 47 . This study first proposed a machine learning-based data-driven research framework for project analytics to contribute to the proposed study direction. It then applied this framework to a case study of construction projects. Although there are three different machine learning algorithms (supervised, unsupervised and semi-supervised), the supervised machine learning models are most commonly used due to their efficiency and effectiveness in addressing many real-world problems 48 . Therefore, we will use machine learning to represent supervised machine learning throughout the rest of this article. The contribution of this study is significant in that it considers the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult 9 , 49 . Further to this, existing implementations have largely been limited to safety monitoring, risk prediction, cost estimation and so on 44 . Through the evaluation of machine-learning applications, this study further demonstrates a case study for which algorithms can be used to consider and model the relationship between project attributes and a project performance measure (i.e., cost overrun frequency).

Machine learning-based framework for project analytics

When and why machine learning for project analytics.

Machine learning models are typically used for research problems that involve predicting the classification outcome of a categorical dependent variable. Therefore, they can be applied in the context of project analytics if the underlying objective variable is a categorical one. If that objective variable is non-categorical, it must first be converted into a categorical variable. For example, if the objective or target variable is the project cost, we can convert this variable into a categorical variable by taking only two possible values. The first value would be 0 to indicate a low-cost project, and the second could be 1 for showing a high-cost project. The average or median cost value for all projects under consideration can be considered for splitting project costs into low-cost and high-cost categories.
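A minimal pandas sketch of the median-split conversion just described; the DataFrame and column names are hypothetical.

```python
# Convert a continuous objective (project cost) into a binary category
# using the median, as described above. Data and names are hypothetical.
import pandas as pd

projects = pd.DataFrame({"cost": [1.2, 3.4, 0.9, 5.1, 2.2, 4.8]})
median_cost = projects["cost"].median()

# 0 = low-cost project, 1 = high-cost project
projects["cost_class"] = (projects["cost"] > median_cost).astype(int)
print(projects)
```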

For data-driven decision-making, machine learning models are advantageous. This is because traditional statistical methods (e.g., ordinary least square (OLS) regression) make assumptions about the underlying research data to produce explicit formulae for the objective target measures. Unlike these statistical methods, machine learning algorithms figure out patterns on their own directly from the data. For instance, for a non-linear but separable dataset, an OLS regression model will not be the right choice due to its assumption that the underlying data must be linear. However, a machine learning model can easily separate the dataset into the underlying classes. Figure  2 (a) presents a situation where machine learning models perform better than traditional statistical methods.

[Figure 2] (a) An illustration of the superior performance of machine learning models compared with traditional statistical models on an abstract dataset with two attributes (X1 and X2). The data points consist of two classes, one shown as transparent circles and the other as black-filled circles; they are non-linear but separable. Traditional statistical models (e.g., ordinary least square regression) will not accurately separate these data points, whereas a machine learning model can easily separate them without errors. (b) Traditional programming versus machine learning.

Similarly, machine learning models are compelling if the underlying research dataset has many attributes or independent measures. Such models can identify features that significantly contribute to the corresponding classification performance regardless of their distributions or collinearity. Traditional statistical methods are prone to biased results when there is correlation between independent variables. Current machine learning-based studies specific to project analytics remain largely limited. Despite this, there have been tangential studies on the use of artificial intelligence to improve cost estimation as well as risk prediction 44 . Additionally, models have been implemented in the optimisation of existing processes 50 .

Machine learning versus traditional programming

Machine learning can be thought of as a process of teaching a machine (i.e., computers) to learn from data and adjust or apply its present knowledge when exposed to new data 42 . It is a type of artificial intelligence that enables computers to learn from examples or experiences. Traditional programming requires some input data and some logic in the form of code (program) to generate the output. Unlike traditional programming, the input data and their corresponding output are fed to an algorithm to create a program in machine learning. This resultant program can capture powerful insights into the data pattern and can be used to predict future outcomes. Figure  2 (b) shows the difference between machine learning and traditional programming.

Proposed machine learning-based framework

Figure  3 illustrates the proposed machine learning-based research framework of this study. The framework starts with breaking the project research dataset into the training and test components. As mentioned in the previous section, the research dataset may have many categorical and/or nominal independent variables, but its single dependent variable must be categorical. Although there is no strict rule for this split, the training data size is generally more than or equal to 50% of the original dataset 48 .
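A minimal scikit-learn sketch of this first step, using a synthetic stand-in for the project dataset and an illustrative 70/30 split.

```python
# Hedged sketch: splitting a dataset into training and test components.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in shaped like the case-study data (139 rows, 44 features)
X, y = make_classification(n_samples=139, n_features=44, random_state=42)

# 70/30 split, stratified to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```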

[Figure 3] The proposed machine learning-based data-driven framework.

Machine learning algorithms can handle variables that have only numerical outcomes. So, when one or more of the underlying categorical variables have a textual or string outcome, we must first convert them into the corresponding numerical values. Suppose a variable can take only three textual outcomes (low, medium and high). In that case, we could consider, for example, 1 to represent low , 2 to represent medium , and 3 to represent high . Other statistical techniques, such as the RIDIT (relative to an identified distribution) scoring 51 , can also be used to convert ordered categorical measurements into quantitative ones. RIDIT is a parametric approach that uses probabilistic comparison to determine the statistical differences between ordered categorical groups. The remaining components of the proposed framework have been briefly described in the following subsections.
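A minimal sketch of the low/medium/high mapping described above; the column name and values are hypothetical.

```python
# Map textual categorical outcomes to numerical codes, as described above.
import pandas as pd

df = pd.DataFrame({"risk": ["low", "high", "medium", "low"]})
df["risk_num"] = df["risk"].map({"low": 1, "medium": 2, "high": 3})
print(df)
```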

Model-building procedure

The next step of the framework is to follow the model-building procedure to develop the desired machine learning models using the training data. The first step of this procedure is to select suitable machine learning algorithms or models. Among the available machine learning algorithms, the commonly used ones are support vector machine, logistic regression, k-nearest neighbours, artificial neural network, decision tree and random forest 52 . One can also select an ensemble machine learning model as the desired algorithm. An ensemble machine learning method uses multiple algorithms, or the same algorithm multiple times, to achieve better predictive performance than could be obtained from any of the constituent learning models alone 52 . Three widely used ensemble approaches are bagging, boosting and stacking. In bagging, the research dataset is divided into different equal-sized subsets, and the underlying machine learning algorithm is then applied to these subsets for classification. In boosting, a random sample of the dataset is selected and then fitted and trained sequentially with different models to compensate for the weaknesses observed in the immediately preceding model. Stacking combines different weak machine learning models in a heterogeneous way to improve predictive performance. For example, the random forest algorithm is an ensemble of different decision tree models 42 .

Second, each selected machine learning model will be processed through the k-fold cross-validation approach to improve predictive efficiency. In k-fold cross-validation, the training data is divided into k folds. In each iteration, (k-1) folds are used to train the selected machine learning models, and the remaining fold is used for validation purposes. This process continues until each of the k folds has been used for validation. The final predictive efficiency of the trained models is based on the average values from the outcomes of these iterations. In addition to this average value, researchers use the standard deviation of the results from different iterations as the predictive training efficiency. Supplementary Fig 1 shows an illustration of k-fold cross-validation.
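A hedged sketch of k-fold cross-validation (here k = 5) with scikit-learn, again on a synthetic stand-in dataset.

```python
# Hedged sketch: 5-fold cross-validation reporting mean and std accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=139, n_features=44, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)

# Average value and standard deviation across the k validation folds
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```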

Third, most machine learning algorithms require pre-defined values for several of their parameters (hyperparameters), and finding suitable values is known as hyperparameter tuning. The settings of these parameters play a vital role in the achieved performance of the underlying algorithm. For a given machine learning algorithm, the optimal values for these parameters can differ from one dataset to another. The same algorithm needs to run multiple times with different parameter values to find its optimal parameter values for a given dataset. Many algorithms are available in the literature, such as Grid search 53 , to find the optimal parameter values. In Grid search, hyperparameters are divided into discrete grids, where each grid point represents a specific combination of the underlying model parameters. The parameter values of the point that results in the best performance are the optimal parameter values 53 .
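A hedged sketch of a grid search with scikit-learn's GridSearchCV; the grid values are illustrative, not the study's actual settings.

```python
# Hedged sketch: hyperparameter tuning via an exhaustive grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=139, n_features=44, random_state=42)

# Each grid point is one combination of model hyperparameters
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```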

Testing of the developed models and reporting results

Once the desired machine learning models have been developed using the training data, they need to be tested using the test data. The underlying trained model is then applied to predict its dependent variable for each data instance. Therefore, for each data instance, two categorical outcomes will be available for its dependent variable: one predicted using the underlying trained model, and the other is the actual category. These predicted and actual categorical outcome values are used to report the results of the underlying machine learning model.

The fundamental tool to report results from machine learning models is the confusion matrix, which consists of four integer values 48 . The first value represents the number of positive cases correctly identified as positive by the underlying trained model (true-positive). The second value indicates the number of positive instances incorrectly identified as negative (false-negative). The third value represents the number of negative cases incorrectly identified as positive (false-positive). Finally, the fourth value indicates the number of negative instances correctly identified as negative (true-negative). Researchers also use a few performance measures based on the four values of the confusion matrix to report machine learning results. The most used measure is accuracy, which is the ratio of the number of correct predictions (true-positive + true-negative) to the total number of data instances (the sum of all four values of the confusion matrix). Other measures commonly used to report machine learning results are precision, recall and F1-score. Precision refers to the ratio between true-positives and the total number of positive predictions (i.e., true-positive + false-positive), often used to indicate the quality of a positive prediction made by a model 48 . Recall, also known as the true-positive rate, is calculated by dividing true-positive by the number of data instances that should have been predicted as positive (i.e., true-positive + false-negative). F1-score is the harmonic mean of the last two measures, i.e., [(2 × Precision × Recall)/(Precision + Recall)], and the error rate equals (1 − Accuracy).
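A minimal sketch computing the confusion matrix and the derived measures described above; the true and predicted labels are made up.

```python
# Confusion matrix and derived measures on illustrative labels.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# scikit-learn orders the matrix as [[tn, fp], [fn, tp]] for labels {0, 1}
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"tp={tp}, fn={fn}, fp={fp}, tn={tn}")
print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```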

Another essential tool for reporting machine learning results is variable or feature importance, which identifies a list of independent variables (features) contributing most to the classification performance. The importance of a variable refers to how much a given machine learning algorithm uses that variable in making accurate predictions 54 . The widely used technique for identifying variable importance is the principal component analysis. It reduces the dimensionality of the data while minimising information loss, which eventually increases the interpretability of the underlying machine learning outcome. It further helps in finding the important features in a dataset as well as plotting them in 2D and 3D 54 .
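A hedged sketch of the two tools mentioned here, impurity-based feature importances and PCA, on a synthetic stand-in dataset.

```python
# Hedged sketch: inspecting variable importance and reducing dimensionality.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=139, n_features=44, random_state=42)

# Impurity-based feature importances from a random forest
rf = RandomForestClassifier(random_state=42).fit(X, y)
print(rf.feature_importances_[:5])

# PCA: project the 44 features onto 2 components, e.g., for a 2D plot
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (139, 2)
```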

Ethical approval

Ethical approval is not required for this study since this study used publicly available data for research investigation purposes. All research was performed in accordance with relevant guidelines/regulations.

Informed consent

Due to the nature of the data sources, informed consent was not required for this study.

Case study: an application of the proposed framework

This section illustrates an application of this study’s proposed framework (Fig. 3) in a construction project context. We will apply this framework to classify projects into two classes based on their cost overrun experience. Projects that rarely experience a cost overrun belong to the first class (Rare class); the second class contains those projects that often experience a cost overrun (Often class). In doing so, we consider a list of independent variables or features.

Data source

The research dataset is taken from an open-source data repository, Kaggle 55 . This survey-based research dataset was collected to explore the causes of the project cost overrun in Indian construction projects 45 , consisting of 44 independent variables or features and one dependent variable. The independent variables cover a wide range of cost overrun factors, from materials and labour to contractual issues and the scope of the work. The dependent variable is the frequency of experiencing project cost overrun (rare or often). The dataset size is 139; 65 belong to the rare class, and the remaining 74 are from the often class. We converted each categorical variable with a textual or string outcome into an appropriate numerical value range to prepare the dataset for machine learning analysis. For example, we used 1 and 2 to represent rare and often class, respectively. The correlation matrix among the 44 features is presented in Supplementary Fig 2 .

Machine learning algorithms

This study considered six machine learning approaches to explore the causes of project cost overrun using the research dataset mentioned above: four base algorithms (support vector machine, logistic regression, k-nearest neighbours and random forest), an artificial neural network, and a stacking ensemble that combines the four base algorithms.

Support vector machine (SVM) is a supervised learning method for classifying data. For instance, if one wants to determine which projects are programmatically successful by processing precedent data, SVM provides a practical approach for prediction. SVM works by assigning labels to objects 56 . The comparison attributes are used to separate these objects into different groups or classes by maximising their marginal distances and minimising the classification errors. The attributes are plotted in a multi-dimensional space, and a separating boundary, known as a hyperplane (see Supplementary Fig 3 (a)), distinguishes between the underlying classes or groups 52 . Support vectors are the data points that lie closest to the decision boundary on both sides; in Supplementary Fig 3 (a), they are the circles (both transparent and shaded) close to the hyperplane. Support vectors play an essential role in deciding the position and orientation of the hyperplane. Various computational methods, including kernel functions that create derived attributes, are applied to support this process 56 . Support vector machines are not limited to binary classes; they can be generalised to a larger variety of classifications by training separate SVMs 56 .
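
A minimal SVM sketch in scikit-learn; the data is synthetic, shaped roughly like the case study dataset, and the kernel choice is illustrative.

```python
# Fit an SVM classifier and inspect its accuracy and support vectors.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=139, n_features=44, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

svm = SVC(kernel="rbf", C=1.0)      # the kernel implicitly creates derived attributes
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))    # mean accuracy on the test split
print(len(svm.support_vectors_))    # number of support vectors found in training
```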

Logistic regression (LR) builds on the linear regression model to predict the outcome of a dichotomous variable 57 , for example, the presence or absence of an event. It models the connection between one or more independent variables and a binary dependent variable (see Supplementary Fig 3 (b)). The LR model fits the data to a sigmoidal curve instead of a straight line, using the natural logarithm in developing the model. It produces a value between 0 and 1 that is interpreted as the probability of class membership. Best estimates are obtained by refining approximate estimates until a level of stability is reached 58 . Generally, LR offers a straightforward approach for determining and observing interrelationships and is more efficient than ordinary regressions 59 .
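
A minimal logistic-regression sketch on synthetic data; the sigmoid output is read as the probability of class membership, as described above.

```python
# Fit LR and compare probabilistic and hard predictions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=139, n_features=10, random_state=0)
lr = LogisticRegression(max_iter=1000).fit(X, y)

print(lr.predict_proba(X[:3]))  # per-class probabilities between 0 and 1
print(lr.predict(X[:3]))        # hard labels obtained by thresholding at 0.5
```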

The k-nearest neighbours (KNN) algorithm plots prior information and uses a specified sample size ( k ) to determine the most likely class of a new instance 52 . It finds the nearest training examples using a distance measure, and the final classification is made by counting the most common class, or votes, within the specified sample. As illustrated in Supplementary Fig 3 (c), the four nearest neighbours in the small circle are three grey squares and one white square; the majority class is grey, so KNN predicts the instance (i.e., Χ ) as grey. If we instead consider the larger circle in the same figure, the nearest neighbours consist of ten white squares and four grey squares; the majority class is white, so KNN classifies the instance as white. KNN’s advantages lie in its ability to produce a simple, interpretable result and to handle missing data 60 . In summary, KNN uses similarities (as well as differences) and distances when developing models.
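
A short sketch showing how the choice of k changes the vote, analogous to the small and large circles in the figure; the data is synthetic.

```python
# Classify the same instance with a smaller and a larger neighbourhood.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=139, n_features=10, random_state=0)

for k in (4, 14):  # a smaller and a larger neighbourhood
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.predict(X[:1]))  # majority class among the k nearest neighbours
```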

Random forest (RF) is a machine learning method that consists of many decision trees. A decision tree is a tree-like structure in which each internal node represents a test on an input attribute; a tree may have multiple internal nodes at different levels, and the leaf or terminal nodes represent the decision outcomes. Each tree produces a classification outcome for its part of the input space; for numerical outcomes the forest takes the average value, and for discrete outcomes it takes the majority of votes 52 . Supplementary Fig 3 (d) shows three decision trees to illustrate how a random forest works: the outcomes from trees 1, 2 and 3 are class B, class A and class A, respectively, so by majority vote the final prediction is class A. Because the method samples specific attributes, it can tend to emphasise some attributes over others, which may result in attributes being unevenly weighted 52 . Advantages of the random forest include its ability to handle multidimensionality and multicollinearity in data, despite its sensitivity to sampling design.
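
A minimal random-forest sketch on synthetic data; each tree votes and the majority class wins.

```python
# Compare the forest's majority vote with a single tree's vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=139, n_features=44, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(rf.predict(X[:1]))                  # majority vote across the 100 trees
print(rf.estimators_[0].predict(X[:1]))   # the vote of a single tree in the forest
```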

Artificial neural network (ANN) simulates the way human brains work. This is accomplished by modelling logical propositions with weighted inputs, a transfer function and one output 61 (Supplementary Fig 3 (e)). ANN is advantageous because it can model non-linear relationships and handle multivariate data 62 . It learns through three major avenues: error back-propagation (supervised), the Kohonen network (unsupervised) and the counter-propagation ANN (supervised) 62 . ANN has been used in a myriad of applications ranging from pharmaceuticals 61 to electronic devices 63 . It also possesses great fault tolerance 64 and learns by example and through self-organisation 65 .
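
A hedged sketch of a small feed-forward network trained by error back-propagation (the supervised avenue above), using scikit-learn's MLPClassifier; the layer size is illustrative.

```python
# Train a one-hidden-layer network on scaled synthetic data.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=139, n_features=44, random_state=0)
X = StandardScaler().fit_transform(X)   # neural networks train better on scaled inputs

ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
ann.fit(X, y)                           # weights adjusted by back-propagation
print(ann.predict_proba(X[:2]))         # outputs of the final transfer function
```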

Ensemble techniques are a machine learning methodology in which several base classifiers are combined to generate an optimal model 66 . An ensemble considers many models and combines them into a single model that offsets the weaknesses of each individual learner, thereby improving overall performance. The stacking model is a general architecture comprising two classifier levels: base classifiers and a meta-learner 67 . The base classifiers are trained on the training dataset, and their predictions form a new dataset for the meta-learner. This new dataset is then used to train the meta-classifier. This study uses four models (SVM, LR, KNN and RF) as base classifiers and LR as the meta-learner, as illustrated in Supplementary Fig 3 (f).
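
The stacking architecture just described maps directly onto scikit-learn's StackingClassifier; the sketch below mirrors the four base classifiers and LR meta-learner, with default (illustrative) hyperparameters.

```python
# Stack SVM, LR, KNN and RF under an LR meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=139, n_features=44, random_state=0)

base_classifiers = [
    ("svm", SVC()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=0)),
]
# Base predictions form the new dataset on which the meta-learner is trained.
stack = StackingClassifier(estimators=base_classifiers,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print(stack.score(X, y))
```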

Feature selection

Feature selection is the process of choosing the optimal feature subset that significantly influences the predicted outcomes; it can improve model performance and reduce running time. This study considers three feature selection approaches: univariate feature selection (UFS), recursive feature elimination (RFE) and SelectFromModel (SFM). UFS examines each feature separately to determine the strength of its relationship with the response variable 68 . This method is straightforward to use and understand and helps in acquiring a deeper understanding of the data; in this study, we calculate the chi-square statistic between each feature and the response variable. RFE is a type of backwards feature elimination in which the model is first fit using all features in the given dataset and the least important features are then removed one by one 69 ; the model is refit until the desired number of features, set by a parameter, is left. SFM chooses effective features based on the feature importance of the best-performing model 70 . It selects features by establishing a threshold on the feature importance indicated by the model on the training set: features whose importance exceeds the threshold are kept, while the rest are discarded. In this study, we apply SFM after comparing the performance of the candidate machine learning methods, and we then retrain the best-performing model using the features selected by SFM.
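
A hedged sketch of the three approaches with scikit-learn; chi-square (UFS) requires non-negative inputs, so synthetic ordinal ratings are used, and the choice of 19 features mirrors the case study but is otherwise arbitrary here.

```python
# UFS, RFE and SFM side by side on synthetic rating data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(139, 44)).astype(float)  # non-negative ratings
y = rng.integers(0, 2, size=139)

ufs = SelectKBest(chi2, k=19).fit(X, y)                      # univariate selection
rfe = RFE(RandomForestClassifier(random_state=0),
          n_features_to_select=19).fit(X, y)                 # backwards elimination
sfm = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)  # importance threshold

print(ufs.get_support().sum(), rfe.get_support().sum(), sfm.get_support().sum())
```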

Findings from the case study

We split the dataset 70:30 into training and test sets for the selected machine learning algorithms. We used Python’s Scikit-learn package to implement these algorithms 70 . Using the training data, we first developed six models based on the six algorithms described above, using fivefold cross-validation with accuracy as the target measure. We then applied these models to the test data. We also executed all required hyperparameter tuning for each algorithm to obtain the best possible classification outcome. Table 1 shows the performance outcomes for each algorithm during the training and test phases. The hyperparameter settings for each algorithm are listed in Supplementary Table 1 .
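
A hedged sketch of this setup (70:30 split, fivefold cross-validation, grid-search tuning); the parameter grid is illustrative, not the settings in the study's Supplementary Table 1.

```python
# Tune a random forest with five-fold CV on the training split, then test.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=139, n_features=44, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,                   # fivefold cross-validation on the training data
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```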

As revealed in Table 1 , random forest outperformed the other algorithms in terms of accuracy in both the training and test phases, with accuracies of 78.14% and 77.50%, respectively. The second-best performer in the training phase is k-nearest neighbours (76.98%); in the test phase, the support vector machine, k-nearest neighbours and artificial neural network tie for second place (72.50%).

Since random forest showed the best performance, we explored it further. We applied the three feature-optimisation approaches (UFS, RFE and SFM) to the random forest. The results are presented in Table 2 . SFM shows the best outcome among the three approaches: its accuracy is 85.00%, whereas the accuracies of UFS and RFE are 77.50% and 72.50%, respectively. As can be seen in Table 2 , the accuracy for the testing phase increases from 77.50% in Table 1 (b) to 85.00% with SFM feature optimisation. Table 3 shows the 19 features selected by SFM; out of 44 features, SFM found that these 19 play a significant role in predicting the outcomes.

Further, Fig.  4 illustrates the confusion matrix when the random forest model with the SFM feature optimiser was applied to the test data. There are 18 true-positive, five false-negative, one false-positive and 16 true-negative cases. Therefore, the accuracy for the test phase is (18 + 16)/(18 + 5 + 1 + 16) = 85.00%.

Figure 4

Confusion matrix results based on the random forest model with the SFM feature optimiser (1 for the rare class and 2 for the often class).

Figure  5 illustrates the top-10 most important features or variables based on the random forest algorithm with the SFM optimiser. We used feature importance based on the mean decrease in impurity to identify this list; the mean decrease in impurity computes each feature’s importance as the total reduction in node impurity brought by splits on that feature, weighted by the proportion of samples reaching those splits and summed over all trees 71 . According to this figure, the delays in decision making attribute contributed most to the classification performance of the random forest algorithm, followed by the cash flow problem and construction cost underestimation attributes. The current construction project literature also highlights these top-10 factors as significant contributors to project cost overrun. For example, using construction project data from Jordan, Al-Hazim et al. 72 ranked 20 causes of cost overrun, including several similar to those identified here.
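
Impurity-based importances are exposed directly by scikit-learn's random forest; a minimal sketch follows, with hypothetical feature names rather than the study's attributes.

```python
# Rank features by mean decrease in impurity and print the top ten.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=139, n_features=19, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

importances = rf.feature_importances_       # impurity-based importance per feature
top10 = np.argsort(importances)[::-1][:10]  # indices of the ten largest values
for i in top10:
    print(f"feature_{i}: {importances[i]:.3f}")
```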

Figure 5

Feature importance (top-10 out of 19) based on the random forest model with the SFM feature optimiser.

Further, we conducted a sensitivity analysis of the model’s ten most important features (from Fig.  5 ) to explore how a change in each feature affects cost overrun. We used the partial dependence plot (PDP), a typical visualisation tool for non-parametric models 73 , to display the outcomes of this analysis. A PDP can demonstrate whether the relation between the target and a feature is linear, monotonic or more complex. The result of the sensitivity analysis is presented in Fig.  6 . For the ‘delays in decision making’ attribute, the PDP shows that the probability stays below 0.4 until the rating value reaches three and increases thereafter; a higher value for this attribute indicates a higher risk of cost overrun. In contrast, no significant differences can be seen for the remaining nine features as their values change.
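
A minimal partial-dependence sketch with scikit-learn; the feature indices stand in for the ten most important attributes, and matplotlib is assumed for display.

```python
# Draw one partial-dependence panel per feature for a fitted random forest.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=139, n_features=19, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# Each curve shows how the predicted probability of the positive class changes
# as that feature varies, averaging out the other features.
PartialDependenceDisplay.from_estimator(rf, X, features=[0, 1, 2])
plt.show()
```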

Figure 6

The result of the sensitivity analysis from the partial dependency plot tool for the ten most important features.

Summary of the case study

We illustrated an application of the proposed machine learning-based research framework in classifying construction projects. RF showed the highest accuracy in predicting the test dataset. For a new data instance with values for the 19 selected features but no information on its classification, RF can identify its class ( rare or often ) correctly with an expected accuracy of 85.00%. If more data is provided to the machine learning algorithms, in addition to the 139 instances of the case study, their accuracy and efficiency in classifying projects will improve with subsequent training. For example, if we provide 100 more data instances, the algorithms will have an additional 70 instances for training under a 70:30 split. This capacity for continuous improvement puts machine learning algorithms in a superior position over traditional methods. In the current literature, some studies explore the factors contributing to project delay or cost overrun; in most cases, they applied factor analysis or other related statistical methods for research data analysis 72 , 74 , 75 . In addition to identifying important attributes, the proposed machine learning-based framework, when applied to this case study, identified the ranking of factors and showed how eliminating less important factors affects prediction accuracy.

We shared the Python software developed to implement the machine learning algorithms considered in this case study on GitHub 76 , a software hosting website. A user-friendly version of this software can be accessed at https://share.streamlit.io/haohuilu/pa/main/app.py . The accuracy findings from this link could differ slightly from one run to another due to the hyperparameter settings of the corresponding machine learning algorithms.

Due to their robust prediction ability, machine learning methods have already gained wide acceptance across a range of research domains. On the other hand, EVM remains the most commonly used method in project analytics due to its simplicity and ease of interpretation 77 . Substantial research efforts have been made to improve its generalisability over time. For example, Naeni et al. 34 developed a fuzzy approach for earned value analysis to make it suitable for project scenarios with ambiguous or linguistic outcomes. Acebes 78 integrated Monte Carlo simulation with EVM for project monitoring and control for a similar purpose. Another prominent method frequently used in project analytics is time series analysis, which is compelling for the longitudinal prediction of project time and cost 30 . As evident in the current literature, however, little effort has been made to bring machine learning into project analytics for addressing project management research problems. This research makes a significant attempt to fill this gap.

Our proposed data-driven framework includes only the fundamental model development and application components for machine learning algorithms. It does not include some advanced machine learning methods; this study intentionally excluded them since they are required only in particular designs of machine learning analysis. For example, the framework does not contain any methods or tools to handle the data imbalance issue. Data imbalance refers to a situation in which the research dataset has an uneven distribution of the target class 79 . For example, a binary target variable will cause a data imbalance issue if one of its class labels has a very high number of observations compared with the other. Commonly used techniques to address this issue are undersampling and oversampling: undersampling decreases the size of the majority class, while oversampling randomly duplicates the minority class until the class distribution becomes balanced 79 . The class distribution of the case study did not produce any data imbalance issues.
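
A hedged sketch of both resampling techniques using the imbalanced-learn package (our choice of library, not the study's), on synthetic imbalanced data.

```python
# Balance a skewed class distribution by oversampling or undersampling.
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # skewed class distribution, roughly 9:1

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)     # duplicate minority
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # shrink majority
print(Counter(y_over), Counter(y_under))
```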

This study considered only six fundamental machine learning algorithms for the case study, although many others are available in the literature. For example, it did not consider the extreme gradient boosting (XGBoost) algorithm. XGBoost is based on the decision tree algorithm, similar to random forest 80 , and has become dominant in applied machine learning due to its performance and speed. Naïve Bayes and convolutional neural networks are other popular machine learning algorithms that were not considered when applying the proposed framework to the case study. In addition to the three feature selection methods, multi-view learning could also be adopted; it is another direction in machine learning that considers learning with multiple views of the existing data with the aim of improving predictive performance 81 , 82 . Similarly, although we considered five performance measures, there are other potential candidates, such as the area under the receiver operating characteristic curve, which measures the ability of the underlying classifier to distinguish between classes 48 . We leave these as potential extensions when applying our proposed framework in other project contexts in future studies.

Although this study used only one case study for illustration, our proposed research framework can be used in other project analytics contexts. In such an application context, the underlying research goal should be to predict the outcome classes and to find the attributes playing a significant role in making correct predictions. For example, by considering two types of projects based on the time required to complete them (e.g., on-time and delayed ), the proposed framework can develop machine learning models that predict the class of a new data instance and find the attributes contributing most to this prediction performance. The framework can also be used at any stage of a project. For example, its results allow project stakeholders to screen projects for excessive cost overruns and forecast budget loss at bidding, before contracts are signed. In addition, the various factors that contribute to project cost overruns can be identified at an earlier stage; these elements emerge at each stage of a project’s life cycle. The framework’s feature importance helps project managers locate the critical contributors to cost overrun.

This study has made an important contribution to the current project analytics literature by considering applications of machine learning within project management. Project management is often thought of as fluid in nature, which makes applications of machine learning more difficult. Further, existing implementations have largely been limited to safety monitoring, risk prediction and cost estimation. Through the evaluation of machine learning applications, this study further demonstrates how algorithms can be used to model the relationship between project attributes and cost overrun frequency.

The applications of machine learning in project analytics are still under constant development. Within construction projects, applications have largely been limited to profitability or the design of the structures themselves. In this regard, our study makes a substantial contribution by proposing a machine learning-based framework to address research problems related to project analytics, and we illustrated an example of this framework’s application in the context of construction project management.

Like any other research, this study has a few limitations that provide scope for future research. First, the framework does not include some advanced machine learning techniques, such as methods for handling data imbalance or kernel density estimation. Second, we considered only one case study to illustrate the application of the proposed framework; illustrations using case studies from different project contexts would confirm its robustness. Finally, this study did not consider all machine learning models and performance measures available in the literature; for example, we did not consider the Naïve Bayes model or the precision measure when applying the proposed research framework to the case study.

Data availability

This study obtained research data from publicly available online repositories, with sources cited appropriately. The data can be accessed at https://www.kaggle.com/datasets/amansaxena/survey-on-road-construction-delay .

Venkrbec, V. & Klanšek, U. In: Advances and Trends in Engineering Sciences and Technologies II 685–690 (CRC Press, 2016).

Damnjanovic, I. & Reinschmidt, K. Data Analytics for Engineering and Construction Project Risk Management (Springer, 2020).

Singh, H. Project Management Analytics: A Data-driven Approach to Making Rational and Effective Project Decisions (FT Press, 2015).

Frame, J. D. & Chen, Y. Why Data Analytics in Project Management? (Auerbach Publications, 2018).

Ong, S. & Uddin, S. Data Science and Artificial Intelligence in Project Management: The Past, Present and Future. J. Mod. Proj. Manag. 7 , 26–33 (2020).

Bilal, M. et al. Investigating profitability performance of construction projects using big data: A project analytics approach. J. Build. Eng. 26 , 100850 (2019).

Radziszewska-Zielina, E. & Sroka, B. Planning repetitive construction projects considering technological constraints. Open Eng. 8 , 500–505 (2018).

Neely, A. D., Adams, C. & Kennerley, M. The Performance Prism: The Scorecard for Measuring and Managing Business Success (Prentice Hall Financial Times, 2002).

Kanakaris, N., Karacapilidis, N., Kournetas, G. & Lazanas, A. In: International Conference on Operations Research and Enterprise Systems 135–155 (Springer).

Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349 , 255–260 (2015).

Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).

Rahimian, F. P., Seyedzadeh, S., Oliver, S., Rodriguez, S. & Dawood, N. On-demand monitoring of construction projects through a game-like hybrid application of BIM and machine learning. Autom. Constr. 110 , 103012 (2020).

Sanni-Anibire, M. O., Zin, R. M. & Olatunji, S. O. Machine learning model for delay risk assessment in tall building projects. Int. J. Constr. Manag. 22 , 1–10 (2020).

Cong, J. et al. A machine learning-based iterative design approach to automate user satisfaction degree prediction in smart product-service system. Comput. Ind. Eng. 165 , 107939 (2022).

Li, F., Chen, C.-H., Lee, C.-H. & Feng, S. Artificial intelligence-enabled non-intrusive vigilance assessment approach to reducing traffic controller’s human errors. Knowl. Based Syst. 239 , 108047 (2021).

Mohri, M., Rostamizadeh, A. & Talwalkar, A. Foundations of Machine Learning (MIT press, 2018).

Whyte, J., Stasis, A. & Lindkvist, C. Managing change in the delivery of complex projects: Configuration management, asset information and ‘big data’. Int. J. Proj. Manag. 34 , 339–351 (2016).

Zangeneh, P. & McCabe, B. Ontology-based knowledge representation for industrial megaprojects analytics using linked data and the semantic web. Adv. Eng. Inform. 46 , 101164 (2020).

Akinosho, T. D. et al. Deep learning in the construction industry: A review of present status and future innovations. J. Build. Eng. 32 , 101827 (2020).

Soman, R. K., Molina-Solana, M. & Whyte, J. K. Linked-Data based constraint-checking (LDCC) to support look-ahead planning in construction. Autom. Constr. 120 , 103369 (2020).

Soman, R. K. & Whyte, J. K. Codification challenges for data science in construction. J. Constr. Eng. Manag. 146 , 04020072 (2020).

Soman, R. K. & Molina-Solana, M. Automating look-ahead schedule generation for construction using linked-data based constraint checking and reinforcement learning. Autom. Constr. 134 , 104069 (2022).

Shi, F., Soman, R. K., Han, J. & Whyte, J. K. Addressing adjacency constraints in rectangular floor plans using Monte-Carlo tree search. Autom. Constr. 115 , 103187 (2020).

Chen, L. & Whyte, J. Understanding design change propagation in complex engineering systems using a digital twin and design structure matrix. Eng. Constr. Archit. Manag. (2021).

Allison, J. T. et al. Artificial intelligence and engineering design. J. Mech. Des. 144 , 020301 (2022).

Dutta, D. & Bose, I. Managing a big data project: The case of ramco cements limited. Int. J. Prod. Econ. 165 , 293–306 (2015).

Bilal, M. & Oyedele, L. O. Guidelines for applied machine learning in construction industry—A case of profit margins estimation. Adv. Eng. Inform. 43 , 101013 (2020).

Tayefeh Hashemi, S., Ebadati, O. M. & Kaur, H. Cost estimation and prediction in construction projects: A systematic review on machine learning techniques. SN Appl. Sci. 2 , 1–27 (2020).

Arage, S. S. & Dharwadkar, N. V. In: International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC). 594–599 (IEEE, 2017).

Cheng, C.-H., Chang, J.-R. & Yeh, C.-A. Entropy-based and trapezoid fuzzification-based fuzzy time series approaches for forecasting IT project cost. Technol. Forecast. Soc. Chang. 73 , 524–542 (2006).

Joukar, A. & Nahmens, I. Volatility forecast of construction cost index using general autoregressive conditional heteroskedastic method. J. Constr. Eng. Manag. 142 , 04015051 (2016).

Xu, J.-W. & Moon, S. Stochastic forecast of construction cost index using a cointegrated vector autoregression model. J. Manag. Eng. 29 , 10–18 (2013).

Narbaev, T. & De Marco, A. Combination of growth model and earned schedule to forecast project cost at completion. J. Constr. Eng. Manag. 140 , 04013038 (2014).

Naeni, L. M., Shadrokh, S. & Salehipour, A. A fuzzy approach for the earned value management. Int. J. Proj. Manag. 29 , 764–772 (2011).

Ponz-Tienda, J. L., Pellicer, E. & Yepes, V. Complete fuzzy scheduling and fuzzy earned value management in construction projects. J. Zhejiang Univ. Sci. A 13 , 56–68 (2012).

Yu, F., Chen, X., Cory, C. A., Yang, Z. & Hu, Y. An active construction dynamic schedule management model: Using the fuzzy earned value management and BP neural network. KSCE J. Civ. Eng. 25 , 2335–2349 (2021).

Bonato, F. K., Albuquerque, A. A. & Paixão, M. A. S. An application of earned value management (EVM) with Monte Carlo simulation in engineering project management. Gest. Produção 26 , e4641 (2019).

Batselier, J. & Vanhoucke, M. Empirical evaluation of earned value management forecasting accuracy for time and cost. J. Constr. Eng. Manag. 141 , 05015010 (2015).

Yang, R. J. & Zou, P. X. Stakeholder-associated risks and their interactions in complex green building projects: A social network model. Build. Environ. 73 , 208–222 (2014).

Uddin, S. Social network analysis in project management–A case study of analysing stakeholder networks. J. Mod. Proj. Manag. 5 , 106–113 (2017).

Ong, S. & Uddin, S. Co-evolution of project stakeholder networks. J. Mod. Proj. Manag. 8 , 96–115 (2020).

Khanzode, K. C. A. & Sarode, R. D. Advantages and disadvantages of artificial intelligence and machine learning: A literature review. Int. J. Libr. Inf. Sci. (IJLIS) 9 , 30–36 (2020).

Loyola-Gonzalez, O. Black-box vs. white-box: Understanding their advantages and weaknesses from a practical point of view. IEEE Access 7 , 154096–154113 (2019).

Abioye, S. O. et al. Artificial intelligence in the construction industry: A review of present status, opportunities and future challenges. J. Build. Eng. 44 , 103299 (2021).

Doloi, H., Sawhney, A., Iyer, K. & Rentala, S. Analysing factors affecting delays in Indian construction projects. Int. J. Proj. Manag. 30 , 479–489 (2012).

Alkhaddar, R., Wooder, T., Sertyesilisik, B. & Tunstall, A. Deep learning approach’s effectiveness on sustainability improvement in the UK construction industry. Manag. Environ. Qual. Int. J. 23 , 126–139 (2012).

Gondia, A., Siam, A., El-Dakhakhni, W. & Nassar, A. H. Machine learning algorithms for construction projects delay risk prediction. J. Constr. Eng. Manag. 146 , 04019085 (2020).

Witten, I. H. & Frank, E. Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann, 2005).

Kanakaris, N., Karacapilidis, N. I. & Lazanas, A. In: ICORES. 362–369.

Heo, S., Han, S., Shin, Y. & Na, S. Challenges of data refining process during the artificial intelligence development projects in the architecture engineering and construction industry. Appl. Sci. 11 , 10919 (2021).

Bross, I. D. How to use ridit analysis. Biometrics 14 , 18–38 (1958).

Uddin, S., Khan, A., Hossain, M. E. & Moni, M. A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 19 , 1–16 (2019).

LaValle, S. M., Branicky, M. S. & Lindemann, S. R. On the relationship between classical grid search and probabilistic roadmaps. Int. J. Robot. Res. 23 , 673–692 (2004).

Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2 , 433–459 (2010).

Saxena, A. Survey on Road Construction Delay, https://www.kaggle.com/amansaxena/survey-on-road-construction-delay (2021).

Noble, W. S. What is a support vector machine?. Nat. Biotechnol. 24 , 1565–1567 (2006).

Hosmer, D. W. Jr., Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression Vol. 398 (John Wiley & Sons, 2013).

LaValley, M. P. Logistic regression. Circulation 117 , 2395–2399 (2008).

Menard, S. Applied Logistic Regression Analysis Vol. 106 (Sage, 2002).

Batista, G. E. & Monard, M. C. A study of K-nearest neighbour as an imputation method. His 87 , 48 (2002).

Agatonovic-Kustrin, S. & Beresford, R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 22 , 717–727 (2000).

Zupan, J. Introduction to artificial neural network (ANN) methods: What they are and how to use them. Acta Chim. Slov. 41 , 327–327 (1994).

Hopfield, J. J. Artificial neural networks. IEEE Circuits Devices Mag. 4 , 3–10 (1988).

Zou, J., Han, Y. & So, S.-S. Overview of artificial neural networks. Artificial Neural Networks . 14–22 (2008).

Maind, S. B. & Wankar, P. Research paper on basic of artificial neural network. Int. J. Recent Innov. Trends Comput. Commun. 2 , 96–100 (2014).

Wolpert, D. H. Stacked generalization. Neural Netw. 5 , 241–259 (1992).

Pavlyshenko, B. In: IEEE Second International Conference on Data Stream Mining & Processing (DSMP). 255–258 (IEEE).

Jović, A., Brkić, K. & Bogunović, N. In: 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) 1200–1205 (IEEE, 2015).

Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46 , 389–422 (2002).

Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).

Louppe, G., Wehenkel, L., Sutera, A. & Geurts, P. Understanding variable importances in forests of randomized trees. Adv. Neural. Inf. Process. Syst. 26 , 431–439 (2013).

Al-Hazim, N., Salem, Z. A. & Ahmad, H. Delay and cost overrun in infrastructure projects in Jordan. Procedia Eng. 182 , 18–24 (2017).

Breiman, L. Random forests. Mach. Learn. 45 , 5–32. https://doi.org/10.1023/A:1010933404324 (2001).

Shehu, Z., Endut, I. R. & Akintoye, A. Factors contributing to project time and hence cost overrun in the Malaysian construction industry. J. Financ. Manag. Prop. Constr. 19 , 55–75 (2014).

Akomah, B. B. & Jackson, E. N. Contractors’ perception of factors contributing to road project delay. Int. J. Constr. Eng. Manag. 5 , 79–85 (2016).

GitHub: Where the world builds software, https://github.com/.

Anbari, F. T. Earned value project management method and extensions. Proj. Manag. J. 34 , 12–23 (2003).

Acebes, F., Pereda, M., Poza, D., Pajares, J. & Galán, J. M. Stochastic earned value analysis using Monte Carlo simulation and statistical learning techniques. Int. J. Proj. Manag. 33 , 1597–1609 (2015).

Japkowicz, N. & Stephen, S. The class imbalance problem: A systematic study. Intell. data anal. 6 , 429–449 (2002).

Chen, T. et al. Xgboost: extreme gradient boosting. R Packag. Version 0.4–2.1 1 , 1–4 (2015).

Guarino, A., Lettieri, N., Malandrino, D., Zaccagnino, R. & Capo, C. Adam or Eve? Automatic users’ gender classification via gestures analysis on touch devices. Neural Comput. Appl. 1–23 (2022).

Zaccagnino, R., Capo, C., Guarino, A., Lettieri, N. & Malandrino, D. Techno-regulation and intelligent safeguards. Multimed. Tools Appl. 80 , 15803–15824 (2021).

Acknowledgements

The authors acknowledge the insightful comments from Prof Jennifer Whyte on an earlier version of this article.

Author information

Authors and affiliations

School of Project Management, The University of Sydney, Level 2, 21 Ross St, Forest Lodge, NSW, 2037, Australia

Shahadat Uddin, Stephen Ong & Haohui Lu

Contributions

S.U.: Conceptualisation; Data curation; Formal analysis; Methodology; Supervision; and Writing (original draft, review and editing). S.O.: Data curation; and Writing (original draft, review and editing). H.L.: Methodology; and Writing (original draft, review and editing). All authors reviewed the manuscript.

Corresponding author

Correspondence to Shahadat Uddin .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Uddin, S., Ong, S. & Lu, H. Machine learning in project analytics: a data-driven framework and case study. Sci Rep 12 , 15252 (2022). https://doi.org/10.1038/s41598-022-19728-x

Received : 13 April 2022

Accepted : 02 September 2022

Published : 09 September 2022

DOI : https://doi.org/10.1038/s41598-022-19728-x

Top 10 Machine Learning Research Papers of 2021

Machine learning research papers showcasing the transformation of the technology

  • Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies
  • Solving High-Dimensional Parabolic PDEs Using the Tensor Train Format
  • Oops I Took a Gradient: Scalable Sampling for Discrete Distributions
  • Optimal Complexity in Decentralized Training
  • Understanding Self-Supervised Learning Dynamics without Contrastive Pairs
  • How Transferable Are Features in Deep Neural Networks?
  • Do We Need Hundreds of Classifiers to Solve Real-World Classification Problems?
  • Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion
  • Scalable Nearest Neighbor Algorithms for High Dimensional Data
  • Trends in Extreme Learning Machines

2020’s Top AI & Machine Learning Research Papers

November 24, 2020 by Mariya Yao

Despite the challenges of 2020, the AI research community produced a number of meaningful technical breakthroughs. GPT-3 by OpenAI may be the most famous, but there are definitely many other research papers worth your attention. 

For example, teams from Google introduced a revolutionary chatbot, Meena, and EfficientDet object detectors in image recognition. Researchers from Yale introduced a novel AdaBelief optimizer that combines many benefits of existing optimization methods. OpenAI researchers demonstrated how deep reinforcement learning techniques can achieve superhuman performance in Dota 2.

To help you catch up on essential reading, we’ve summarized 10 important machine learning research papers from 2020. These papers will give you a broad overview of AI research advancements this year. Of course, there are many more breakthrough papers worth reading as well.

We have also published the top 10 lists of key research papers in natural language processing and computer vision. In addition, you can read our premium research summaries, where we feature the top 25 conversational AI research papers introduced recently.

Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.

If you’d like to skip around, here are the papers we featured:

  • A Distributed Multi-Sensor Machine Learning Approach to Earthquake Early Warning
  • Efficiently Sampling Functions from Gaussian Process Posteriors
  • Dota 2 with Large Scale Deep Reinforcement Learning
  • Towards a Human-like Open-Domain Chatbot
  • Language Models are Few-Shot Learners
  • Beyond Accuracy: Behavioral Testing of NLP models with CheckList
  • EfficientDet: Scalable and Efficient Object Detection
  • Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild
  • An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
  • AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Best AI & ML Research Papers 2020

1. A Distributed Multi-Sensor Machine Learning Approach to Earthquake Early Warning, by Kévin Fauvel, Daniel Balouek-Thomert, Diego Melgar, Pedro Silva, Anthony Simonet, Gabriel Antoniu, Alexandru Costan, Véronique Masson, Manish Parashar, Ivan Rodero, and Alexandre Termier

Original Abstract

Our research aims to improve the accuracy of Earthquake Early Warning (EEW) systems by means of machine learning. EEW systems are designed to detect and characterize medium and large earthquakes before their damaging effects reach a certain location. Traditional EEW methods based on seismometers fail to accurately identify large earthquakes due to their sensitivity to the ground motion velocity. The recently introduced high-precision GPS stations, on the other hand, are ineffective to identify medium earthquakes due to their propensity to produce noisy data. In addition, GPS stations and seismometers may be deployed in large numbers across different locations and may produce a significant volume of data, consequently affecting the response time and the robustness of EEW systems. 

In practice, EEW can be seen as a typical classification problem in the machine learning field: multi-sensor data are given in input, and earthquake severity is the classification result. In this paper, we introduce the Distributed Multi-Sensor Earthquake Early Warning (DMSEEW) system, a novel machine learning-based approach that combines data from both types of sensors (GPS stations and seismometers) to detect medium and large earthquakes. DMSEEW is based on a new stacking ensemble method which has been evaluated on a real-world dataset validated with geoscientists. The system builds on a geographically distributed infrastructure, ensuring an efficient computation in terms of response time and robustness to partial infrastructure failures. Our experiments show that DMSEEW is more accurate than the traditional seismometer-only approach and the combined-sensors (GPS and seismometers) approach that adopts the rule of relative strength.

Our Summary 

The authors claim that traditional Earthquake Early Warning (EEW) systems that are based on seismometers, as well as recently introduced GPS systems, have their disadvantages with regards to predicting large and medium earthquakes respectively. Thus, the researchers suggest approaching an early earthquake prediction problem with machine learning by using the data from seismometers and GPS stations as input data. In particular, they introduce the Distributed Multi-Sensor Earthquake Early Warning (DMSEEW) system, which is specifically tailored for efficient computation on large-scale distributed cyberinfrastructures. The evaluation demonstrates that the DMSEEW system is more accurate than other baseline approaches with regard to real-time earthquake detection.

What’s the core idea of this paper?

  • Seismometers have difficulty detecting large earthquakes because of their sensitivity to ground motion velocity.
  • GPS stations are ineffective in detecting medium earthquakes, as they are prone to producing lots of noisy data.
  • The introduced DMSEEW system combines both sensor types: it takes sensor-level class predictions from seismometers and GPS stations (i.e., normal activity, medium earthquake, large earthquake) and then aggregates these predictions using a bag-of-words representation to define a final prediction for the earthquake category.
  • Furthermore, the authors introduce a distributed cyberinfrastructure that can support the processing of high volumes of data in real time and allows the redirection of data to other processing data centers in case of disaster situations.

What’s the key achievement?

  • Demonstrating that DMSEEW is more accurate than the baseline approaches. In one reported comparison: precision – 100% vs. 63.2%; recall – 100% vs. 85.7%; F1 score – 100% vs. 72.7%.
  • In a second reported comparison: precision – 76.7% vs. 70.7%; recall – 38.8% vs. 34.1%; F1 score – 51.6% vs. 45.0%.

What does the AI community think?

  • The paper received an Outstanding Paper award at AAAI 2020 (special track on AI for Social Impact).

What are future research areas?

  • Evaluating DMSEEW response time and robustness via simulation of different scenarios in an existing EEW execution platform. 
  • Evaluating the DMSEEW system on another seismic network.

2. Efficiently Sampling Functions from Gaussian Process Posteriors, by James T. Wilson, Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, Marc Peter Deisenroth

Gaussian processes are the gold standard for many real-world modeling problems, especially in cases where a model’s success hinges upon its ability to faithfully represent predictive uncertainty. These problems typically exist as parts of larger frameworks, wherein quantities of interest are ultimately defined by integrating over posterior distributions. These quantities are frequently intractable, motivating the use of Monte Carlo methods. Despite substantial progress in scaling up Gaussian processes to large training sets, methods for accurately generating draws from their posterior distributions still scale cubically in the number of test locations. We identify a decomposition of Gaussian processes that naturally lends itself to scalable sampling by separating out the prior from the data. Building off of this factorization, we propose an easy-to-use and general-purpose approach for fast posterior sampling, which seamlessly pairs with sparse approximations to afford scalability both during training and at test time. In a series of experiments designed to test competing sampling schemes’ statistical properties and practical ramifications, we demonstrate how decoupled sample paths accurately represent Gaussian process posteriors at a fraction of the usual cost.

In this paper, the authors explore techniques for efficiently sampling from Gaussian process (GP) posteriors. After investigating the behaviors of naive approaches to sampling and fast approximation strategies using Fourier features, they find that many of these strategies are complementary. They, therefore, introduce an approach that incorporates the best of different sampling approaches. First, they suggest decomposing the posterior as the sum of a prior and an update. Then they combine this idea with techniques from literature on approximate GPs and obtain an easy-to-use general-purpose approach for fast posterior sampling. The experiments demonstrate that decoupled sample paths accurately represent GP posteriors at a much lower cost.
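
The prior-plus-update decomposition referred to here is commonly known as Matheron's rule; the sketch below states it in generic notation (our own, not copied from the paper).

```latex
% Matheron's rule, sketched in generic GP notation. Let f ~ GP(0, k) be a prior
% draw, X the training inputs, y the noisy observations, K = k(X, X), and
% eps ~ N(0, sigma^2 I) a fresh noise draw. Then
\[
  (f \mid \mathbf{y})(\cdot)
  \;=\; \underbrace{f(\cdot)}_{\text{prior sample}}
  \;+\; \underbrace{k(\cdot, \mathbf{X})\,
        \bigl(\mathbf{K} + \sigma^{2}\mathbf{I}\bigr)^{-1}
        \bigl(\mathbf{y} - f(\mathbf{X}) - \boldsymbol{\varepsilon}\bigr)}_{\text{data-driven update}}
\]
% is an exact draw from the posterior. The prior term can be approximated
% cheaply (e.g., with random Fourier features) while the update stays exact,
% which is what lets decoupled sampling scale.
```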

What’s the core idea of this paper?

  • The introduced approach to sampling functions from GP posteriors centers on the observation that it is possible to implicitly condition Gaussian random variables by combining them with an explicit corrective term.
  • The authors translate this intuition to Gaussian processes and suggest decomposing the posterior as the sum of a prior and an update.
  • Building on this factorization, the researchers suggest an efficient approach for fast posterior sampling that seamlessly pairs with sparse approximations to achieve scalability both during training and at test time.

What’s the key achievement?

  • Introducing an easy-to-use and general-purpose approach to sampling from GP posteriors.
  • Demonstrating that decoupled sample paths avoid many shortcomings of the alternative sampling strategies and accurately represent GP posteriors at a much lower cost; for example, simulation of a well-known model of a biological neuron required only 20 seconds using decoupled sampling, while the iterative approach required 10 hours.

What does the AI community think?

  • The paper received an Honorable Mention at ICML 2020.

Where can you get implementation code?

  • The authors released the implementation of this paper on GitHub .

3. Dota 2 with Large Scale Deep Reinforcement Learning, by Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław “Psyho” Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, Susan Zhang

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.

The OpenAI research team demonstrates that modern reinforcement learning techniques can achieve superhuman performance in such a challenging esports game as Dota 2. The challenges of this particular task for the AI system lies in the long time horizons, partial observability, and high dimensionality of observation and action spaces. To tackle this game, the researchers scaled existing RL systems to unprecedented levels with thousands of GPUs utilized for 10 months. The resulting OpenAI Five model was able to defeat the Dota 2 world champions and won 99.4% of over 7000 games played during the multi-day showcase.

What’s the core idea of this paper?

  • The goal of the introduced OpenAI Five model is to find the policy that maximizes the probability of winning the game against professional human players, which in practice implies maximizing the reward function with some additional signals like characters dying, resources collected, etc.
  • While the Dota 2 engine runs at 30 frames per second, the OpenAI Five only acts on every 4th frame.
  • At each timestep, the model receives an observation with all the information available to human players (approximated in a set of data arrays) and returns a discrete action , which encodes the desired movement, attack, etc.
  • A policy is defined as a function from the history of observations to a probability distribution over actions that are parameterized as an LSTM with ~159M parameters.
  • The policy is trained using a variant of advantage actor critic, Proximal Policy Optimization.
  • The OpenAI Five model was trained for 180 days spread over 10 months of real time.

What’s the key achievement?

  • The resulting OpenAI Five model defeated the Dota 2 world champions in a best-of-three match (2–0).
  • It won 99.4% of over 7000 games during a multi-day online showcase.

What are future research areas?

  • Applying introduced methods to other zero-sum two-team continuous environments.

What are possible business applications?

  • Tackling challenging esports games like Dota 2 can be a promising step towards solving advanced real-world problems using reinforcement learning techniques.

4. Towards a Human-like Open-Domain Chatbot, by Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le

We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated. 

In contrast to most modern conversational agents, which are highly specialized, the Google research team introduces a chatbot Meena that can chat about virtually anything. It’s built on a large neural network with 2.6B parameters trained on 341 GB of text. The researchers also propose a new human evaluation metric for open-domain chatbots, called Sensibleness and Specificity Average (SSA), which can capture important attributes for human conversation. They demonstrate that this metric correlates highly with perplexity, an automatic metric that is readily available. Thus, the Meena chatbot, which is trained to minimize perplexity, can conduct conversations that are more sensible and specific compared to other chatbots. Particularly, the experiments demonstrate that Meena outperforms existing state-of-the-art chatbots by a large margin in terms of the SSA score (79% vs. 56%) and is closing the gap with human performance (86%).

What’s the core idea of this paper?

  • Despite recent progress, open-domain chatbots still have significant weaknesses: their responses often do not make sense or are too vague or generic.
  • Meena is built on a seq2seq model with Evolved Transformer (ET) that includes 1 ET encoder block and 13 ET decoder blocks.
  • The model is trained on multi-turn conversations with the input sequence including all turns of the context (up to 7) and the output sequence being the response.
  • The SSA metric combines two fundamental aspects of a human-like conversation: making sense and being specific.
  • The research team discovered that the SSA metric shows high negative correlation (R2 = 0.93) with perplexity, a readily available automatic metric that Meena is trained to minimize.

What’s the key achievement?

  • Proposing a simple human-evaluation metric for open-domain chatbots.
  • The best end-to-end trained Meena model outperforms existing state-of-the-art open-domain chatbots by a large margin, achieving an SSA score of 72% (vs. 56%).
  • Furthermore, the full version of Meena, with a filtering mechanism and tuned decoding, further advances the SSA score to 79%, which is not far from the 86% SSA achieved by the average human.

What does the AI community think?

  • “Google’s “Meena” chatbot was trained on a full TPUv3 pod (2048 TPU cores) for 30 full days – that’s more than $1,400,000 of compute time to train this chatbot model.” – Elliot Turner, CEO and founder of Hyperia.
  • “So I was browsing the results for the new Google chatbot Meena, and they look pretty OK (if boring sometimes). However, every once in a while it enters ‘scary sociopath mode,’ which is, shall we say, sub-optimal” – Graham Neubig, Associate professor at Carnegie Mellon University .
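
To make the SSA and perplexity metrics discussed above concrete, here is a minimal sketch (not the authors' code) of how they can be computed, assuming binary human labels for sensibleness and specificity and next-token probabilities taken from the model:

```python
import math

def ssa(labels):
    """Sensibleness and Specificity Average (SSA).

    `labels` holds one (sensible, specific) pair of binary human judgments
    per model response; SSA is the mean of the two per-corpus rates.
    """
    sensibleness = sum(s for s, _ in labels) / len(labels)
    specificity = sum(sp for _, sp in labels) / len(labels)
    return (sensibleness + specificity) / 2

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the next token)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(ssa([(1, 1), (1, 0), (0, 0)]))     # 0.5
print(perplexity([0.2, 0.5, 0.1, 0.4]))  # ~3.98
```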

Meena chatbot

What are future research areas?

  • Lowering the perplexity through improvements in algorithms, architectures, data, and compute.
  • Considering other aspects of conversations beyond sensibleness and specificity, such as, for example, personality and factuality.
  • Tackling safety and bias in the models.

What are possible business applications?

  • further humanizing computer interactions;
  • improving foreign language practice;
  • making interactive movie and videogame characters relatable.
  • Considering the challenges related to safety and bias, the authors haven’t released the Meena model yet. They are still evaluating the risks and benefits and may decide otherwise in the coming months.

5. Language Models are Few-Shot Learners , by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

The OpenAI research team draws attention to the fact that the need for a labeled dataset for every new language task limits the applicability of language models. Considering that there is a wide range of possible tasks and it’s often difficult to collect a large labeled training dataset, the researchers suggest an alternative solution, which is scaling up language models to improve task-agnostic few-shot performance. They test their solution by training a 175B-parameter autoregressive language model, called GPT-3 , and evaluating its performance on over two dozen NLP tasks. The evaluation under few-shot learning, one-shot learning, and zero-shot learning demonstrates that GPT-3 achieves promising results and even occasionally outperforms the state of the art achieved by fine-tuned models.

GPT-3

  • The GPT-3 model uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization.
  • However, in contrast to GPT-2, it uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, as in the Sparse Transformer .
  • GPT-3 is evaluated in three settings, with tasks and demonstrations specified purely via text (see the prompt-construction sketch after this list):
  • Few-shot learning, when the model is given a few demonstrations of the task (typically, 10 to 100) at inference time but with no weight updates allowed.
  • One-shot learning, when only one demonstration is allowed, together with a natural language description of the task.
  • Zero-shot learning, when no demonstrations are allowed and the model has access only to a natural language description of the task.
  • On the CoQA benchmark, 81.5 F1 in the zero-shot setting, 84.0 F1 in the one-shot setting, and 85.0 F1 in the few-shot setting, compared to the 90.7 F1 score achieved by fine-tuned SOTA.
  • On the TriviaQA benchmark, 64.3% accuracy in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, surpassing the state of the art (68%) by 3.2%.
  • On the LAMBADA dataset, 76.2% accuracy in the zero-shot setting, 72.5% in the one-shot setting, and 86.4% in the few-shot setting, surpassing the state of the art (68%) by 18%.
  • The news articles generated by the 175B-parameter GPT-3 model are hard to distinguish from real ones, according to human evaluations (with accuracy barely above the chance level at ~52%).
  • “The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.” – Sam Altman, CEO and co-founder of OpenAI .
  • “I’m shocked how hard it is to generate text about Muslims from GPT-3 that has nothing to do with violence… or being killed…” – Abubakar Abid, CEO and founder of Gradio .
  • “No. GPT-3 fundamentally does not understand the world that it talks about. Increasing corpus further will allow it to generate a more credible pastiche but not fix its fundamental lack of comprehension of the world. Demos of GPT-4 will still require human cherry picking.” – Gary Marcus, CEO and founder of Robust.ai .
  • “Extrapolating the spectacular performance of GPT3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.” – Geoffrey Hinton, Turing Award winner .

What are future research areas?

  • Improving pre-training sample efficiency.
  • Exploring how few-shot learning works.
  • Distillation of large models down to a manageable size for real-world applications.

What are possible business applications?

  • The model with 175B parameters is hard to apply to real business problems due to its impractical resource requirements, but if the researchers manage to distill this model down to a workable size, it could be applied to a wide range of language tasks, including question answering, dialog agents, and ad copy generation.
  • The code itself is not available, but some dataset statistics together with unconditional, unfiltered 2048-token samples from GPT-3 are released on GitHub .
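
As an illustration of the three evaluation settings listed above, the sketch below assembles zero-, one-, and few-shot prompts purely as text; the demonstrations are hypothetical, and GPT-3 simply completes such a prompt without any gradient updates:

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble a GPT-3-style prompt: description + K demonstrations + query.

    K = 0 gives zero-shot, K = 1 one-shot, K >= 2 few-shot.
    """
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

# Hypothetical English-to-French demonstrations.
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

print(build_prompt("Translate English to French:", [], "peppermint"))         # zero-shot
print(build_prompt("Translate English to French:", demos[:1], "peppermint"))  # one-shot
print(build_prompt("Translate English to French:", demos, "peppermint"))      # few-shot
```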

6. Beyond Accuracy: Behavioral Testing of NLP models with CheckList , by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.

The authors point out the shortcomings of existing approaches to evaluating performance of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate where the model is failing and how to fix it. The alternative evaluation approaches usually focus on individual tasks or specific capabilities. To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList , a new evaluation methodology for testing of NLP models. The approach is inspired by principles of behavioral testing in software engineering. Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test ideation. Multiple user studies demonstrate that CheckList is very effective at discovering actionable bugs, even in extensively tested NLP models.

CheckList

  • The primary approach to the evaluation of models’ generalization capabilities, which is accuracy on held-out data, may lead to performance overestimation, as the held-out data often contains the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much in figuring out where the NLP model is failing and how to fix these bugs.
  • The alternative approaches are usually designed for evaluation of specific behaviors on individual tasks and thus, lack comprehensiveness.
  • CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named entity recognition, and negation.
  • Then, to break down potential capability failures into specific behaviors, CheckList suggests different test types, such as prediction invariance or directional expectation tests under certain perturbations (a simplified example follows this list).
  • Potential tests are structured as a matrix, with capabilities as rows and test types as columns.
  • The suggested implementation of CheckList also introduces a variety of abstractions to help users generate large numbers of test cases easily.
  • Evaluation of state-of-the-art models with CheckList demonstrated that even though some NLP tasks are considered “solved” based on accuracy results, the behavioral testing highlights many areas for improvement.
  • The user studies demonstrate that CheckList:
  • helps to identify and test for capabilities not previously considered;
  • results in more thorough and comprehensive testing for previously considered capabilities;
  • helps to discover many more actionable bugs.
  • The paper received the Best Paper Award at ACL 2020, the leading conference in natural language processing.
  • CheckList can be used to create more exhaustive testing for a variety of NLP tasks.
  • Such comprehensive testing that helps in identifying many actionable bugs is likely to lead to more robust NLP systems.
  • The code for testing NLP models with CheckList is available on GitHub .
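
To illustrate the test types mentioned above, here is a simplified, library-free sketch of an invariance (INV) test, not the CheckList tool's actual API: a label-preserving perturbation (swapping one person's name for another) should leave the model's prediction unchanged.

```python
NAMES = ["Alice", "Bob", "Maria", "Chen"]

def perturb_name(text):
    """Label-preserving perturbation: swap a known name for the next one."""
    for i, name in enumerate(NAMES):
        if name in text:
            return text.replace(name, NAMES[(i + 1) % len(NAMES)])
    return text

def invariance_test(model, examples):
    """INV test: flag inputs whose prediction changes under perturbation."""
    return [t for t in examples if model(t) != model(perturb_name(t))]

# Toy sentiment "model" with a name-specific bug that the test exposes.
toy_model = lambda t: "negative" if "Bob" in t else "positive"
print(invariance_test(toy_model, ["Alice is a great teammate.", "I had lunch."]))
# ['Alice is a great teammate.']  (prediction flips when Alice -> Bob)
```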

7. EfficientDet: Scalable and Efficient Object Detection , by Mingxing Tan, Ruoming Pang, Quoc V. Le

Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations and EfficientNet backbones, we have developed a new family of object detectors, called EfficientDet, which consistently achieve much better efficiency than prior art across a wide spectrum of resource constraints. In particular, with single-model and single-scale, our EfficientDet-D7 achieves state-of-the-art 52.2 AP on COCO test-dev with 52M parameters and 325B FLOPs, being 4×–9× smaller and using 13×–42× fewer FLOPs than previous detectors. Code is available on https://github.com/google/automl/tree/master/efficientdet .

The large size of object detection models deters their deployment in real-world applications such as self-driving cars and robotics. To address this problem, the Google Research team introduces two optimizations, namely (1) a weighted bi-directional feature pyramid network (BiFPN) for efficient multi-scale feature fusion and (2) a novel compound scaling method. By combining these optimizations with the EfficientNet backbones, the authors develop a family of object detectors, called EfficientDet . The experiments demonstrate that these object detectors consistently achieve higher accuracy with far fewer parameters and multiply-adds (FLOPs).

EfficientDet

  • A weighted bi-directional feature pyramid network (BiFPN) for easy and fast multi-scale feature fusion. It learns the importance of different input features and repeatedly applies top-down and bottom-up multi-scale feature fusion.
  • A new compound scaling method for simultaneous scaling of the resolution, depth, and width of the backbone, feature network, and box/class prediction networks (sketched after this list).
  • These optimizations, together with the EfficientNet backbones, allow the development of a new family of object detectors, called EfficientDet .
  • The evaluation demonstrates that:
  • the EfficientDet model with 52M parameters achieves state-of-the-art 52.2 AP on the COCO test-dev dataset, outperforming the previous best detector by 1.5 AP while being 4× smaller and using 13× fewer FLOPs;
  • with simple modifications, the EfficientDet model achieves 81.74% mIOU accuracy, outperforming DeepLabV3+ by 1.7% on Pascal VOC 2012 semantic segmentation with 9.8× fewer FLOPs;
  • the EfficientDet models are up to 3×–8× faster on GPU/CPU than previous detectors.
  • The paper was accepted to CVPR 2020, the leading conference in computer vision.
  • The high level of interest in the code implementations of this paper makes this research one of the highest-trending papers introduced recently.
  • The high accuracy and efficiency of the EfficientDet detectors may enable their application for real-world tasks, including self-driving cars and robotics.
  • The authors released the official TensorFlow implementation of EfficientDet.
  • The PyTorch implementation of this paper can be found here and here .
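
A minimal sketch of the compound-scaling method mentioned above follows; the constants mirror the heuristics reported in the paper, but exact widths are rounded differently there, so treat the numbers as illustrative rather than authoritative.

```python
def efficientdet_config(phi):
    """Jointly scale resolution, BiFPN width/depth, and head depth with one
    compound coefficient phi (EfficientDet-D0 corresponds to phi = 0).

    Constants follow the paper's reported heuristics; treat as illustrative.
    """
    return {
        "input_resolution": 512 + 128 * phi,        # grows linearly
        "bifpn_channels": int(64 * (1.35 ** phi)),  # grows exponentially
        "bifpn_layers": 3 + phi,
        "box_class_layers": 3 + phi // 3,
        "backbone": f"EfficientNet-B{phi}",         # backbone scales in step
    }

for phi in range(4):
    print(f"D{phi}:", efficientdet_config(phi))
```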

8. Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild , by Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi

We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision. The method is based on an autoencoder that factors each input image into depth, albedo, viewpoint and illumination. In order to disentangle these components without supervision, we use the fact that many object categories have, at least in principle, a symmetric structure. We show that reasoning about illumination allows us to exploit the underlying object symmetry even if the appearance is not symmetric due to shading. Furthermore, we model objects that are probably, but not certainly, symmetric by predicting a symmetry probability map, learned end-to-end with the other components of the model. Our experiments show that this method can recover very accurately the 3D shape of human faces, cat faces and cars from single-view images, without any supervision or a prior shape model. On benchmarks, we demonstrate superior accuracy compared to another method that uses supervision at the level of 2D image correspondences.

The research group from the University of Oxford studies the problem of learning 3D deformable object categories from single-view RGB images without additional supervision. To decompose the image into depth, albedo, illumination, and viewpoint without direct supervision for these factors, they suggest starting by assuming objects to be symmetric. Then, considering that real-world objects are never fully symmetrical, at least due to variations in pose and illumination, the researchers augment the model by explicitly modeling illumination and predicting a dense map with probabilities that any given pixel has a symmetric counterpart. The experiments demonstrate that the introduced approach achieves better reconstruction results than other unsupervised methods. Moreover, it outperforms the recent state-of-the-art method that leverages keypoint supervision.

deformable 3D

  • The learning setting is strictly unsupervised:
  • no access to 2D or 3D ground-truth information such as keypoints, segmentation, depth maps, or prior knowledge of a 3D model;
  • using an unconstrained collection of single-view images without having multiple views of the same instance.
  • The key ideas of the approach are:
  • leveraging symmetry as a geometric cue to constrain the decomposition;
  • explicitly modeling illumination and using it as an additional cue for recovering the shape;
  • augmenting the model to account for a potential lack of symmetry – particularly, predicting a dense map with the probability that a given pixel has a symmetric counterpart in the image (see the sketch after this list).
  • Qualitative evaluation of the suggested approach demonstrates that it reconstructs 3D faces of humans and cats with high fidelity, containing fine details of the nose, eyes, and mouth.
  • The method reconstructs higher-quality shapes compared to other state-of-the-art unsupervised methods, and even outperforms the DepthNet model, which uses 2D keypoint annotations for depth prediction.
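
To make the symmetry-probability idea concrete, here is a minimal sketch assuming a reconstruction has already been produced by the photo-geometric pipeline; the function below is a hypothetical placeholder in the spirit of the paper's confidence-weighted reconstruction loss, not the authors' code.

```python
import numpy as np

def confidence_weighted_l1(image, reconstruction, sigma):
    """L1 reconstruction error modulated by a predicted per-pixel confidence
    map `sigma`, in the spirit of a Laplacian negative log-likelihood:
    a large sigma downweights the error but pays a log-penalty."""
    return float(np.mean(np.abs(image - reconstruction) / sigma + np.log(sigma)))

# Toy 2x2 grayscale example: a mirrored reconstruction and two sigma maps.
image = np.array([[0.9, 0.1], [0.8, 0.2]])
mirrored = image[:, ::-1]                # flipped "symmetric" guess
confident = np.full((2, 2), 0.5)         # high trust in symmetry everywhere
uncertain = np.full((2, 2), 2.0)         # low trust in symmetry everywhere
print(confidence_weighted_l1(image, mirrored, confident))  # ~0.71
print(confidence_weighted_l1(image, mirrored, uncertain))  # ~1.04
# The loss is minimised when sigma matches the actual error, so the network
# is encouraged to predict low confidence exactly where symmetry fails.
```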

deformable 3D reconstruction

  • The paper received the Best Paper Award at CVPR 2020, the leading conference in computer vision.
  • Reconstructing more complex objects by extending the model to use either multiple canonical views or a different 3D representation, such as a mesh or a voxel map.
  • Improving model performance under extreme lighting conditions and for extreme poses.
  • The implementation code and demo are available on GitHub .

9. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale , by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

The authors of this paper show that a pure Transformer can perform very well on image classification tasks. They introduce Vision Transformer (ViT) , which is applied directly to sequences of image patches by analogy with tokens (words) in NLP. When trained on large datasets of 14M–300M images, Vision Transformer approaches or beats state-of-the-art CNN-based models on image recognition tasks. In particular, it achieves an accuracy of 88.36% on ImageNet, 90.77% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.16% on the VTAB suite of 19 tasks.

Visual Transformer

  • When applying the Transformer architecture to images, the authors follow the design of the original NLP Transformer as closely as possible. The model processes an image by (see the sketch after this list):
  • splitting the image into fixed-size patches;
  • linearly embedding each of them;
  • prepending an extra learnable ‘classification token’ to the sequence;
  • adding position embeddings to the resulting sequence of vectors;
  • feeding the sequence to a standard Transformer encoder.
  • Similarly to Transformers in NLP, Vision Transformer is typically pre-trained on large datasets and fine-tuned to downstream tasks.
  • Pre-trained on large datasets, Vision Transformer achieves the following accuracy results:
  • 88.36% on ImageNet;
  • 90.77% on ImageNet-ReaL; 
  • 94.55% on CIFAR-100; 
  • 97.56% on Oxford-IIIT Pets;
  • 99.74% on Oxford Flowers-102;
  • 77.16% on the VTAB suite of 19 tasks.
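
The patch-embedding pipeline described above can be sketched in a few lines of NumPy; this is a simplified illustration with randomly initialised weights, not the released model.

```python
import numpy as np

def patchify_and_embed(image, patch=16, dim=768):
    """Split an image into patches, linearly embed them, prepend a learnable
    [class] token, and add position embeddings: the input to the encoder."""
    H, W, C = image.shape
    # 1) Split into non-overlapping patch x patch squares and flatten each.
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    # 2) Linear projection of flattened patches (random weights for the sketch).
    rng = np.random.default_rng(0)
    E = rng.normal(0, 0.02, (patch * patch * C, dim))
    tokens = patches @ E
    # 3) Prepend the learnable classification token.
    cls = rng.normal(0, 0.02, (1, dim))
    tokens = np.concatenate([cls, tokens], axis=0)
    # 4) Add learned position embeddings; the sequence goes to the encoder.
    pos = rng.normal(0, 0.02, tokens.shape)
    return tokens + pos

seq = patchify_and_embed(np.zeros((224, 224, 3)))
print(seq.shape)  # (197, 768): 14*14 patches + 1 class token
```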

Visual Transformer

  • The paper is trending in the AI research community, as evident from the repository stats on GitHub .
  • It is also under review for ICLR 2021 , one of the key conferences in deep learning.
  • Applying Vision Transformer to other computer vision tasks, such as detection and segmentation.
  • Exploring self-supervised pre-training methods.
  • Analyzing the few-shot properties of Vision Transformer.
  • Exploring contrastive pre-training.
  • Further scaling ViT.
  • Thanks to their efficient pre-training and high performance, Transformers may substitute convolutional networks in many computer vision applications, including navigation, automatic inspection, and visual surveillance.
  • The PyTorch implementation of Vision Transformer is available on GitHub .

10. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients , by Juntang Zhuang, Tommy Tang, Sekhar Tatikonda, Nicha Dvornek, Yifan Ding, Xenophon Papademetris, James S. Duncan

Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) or accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the step size according to the “belief” in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at https://github.com/juntang-zhuang/Adabelief-Optimizer .

The researchers introduce AdaBelief , a new optimizer, which combines the high convergence speed of adaptive optimization methods and good generalization capabilities of accelerated stochastic gradient descent (SGD) schemes. The core idea behind the AdaBelief optimizer is to adapt step size based on the difference between predicted gradient and observed gradient: the step is small if the observed gradient deviates significantly from the prediction, making us distrust this observation, and the step is large when the current observation is close to the prediction, making us believe in this observation. The experiments confirm that AdaBelief combines fast convergence of adaptive methods, good generalizability of the SGD family, and high stability in the training of GANs.
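
The update rule just described can be written down directly; below is a minimal NumPy sketch of the core idea from the paper (bias correction omitted for brevity), where the only change relative to Adam is that the second moment tracks the squared deviation of the gradient from its EMA prediction instead of the raw squared gradient.

```python
import numpy as np

def adabelief_step(theta, grad, m, s, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaBelief update (bias correction omitted for brevity).

    m: EMA of gradients, the 'prediction' of the next gradient.
    s: EMA of (grad - m)^2, small when observations match the prediction
       (strong belief, large step) and large when they deviate (small step).
    """
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * (grad - m) ** 2 + eps
    theta = theta - lr * m / (np.sqrt(s) + eps)
    return theta, m, s

# Toy usage on f(x) = x^2, whose gradient is 2x.
theta, m, s = np.array([1.0]), np.zeros(1), np.zeros(1)
for step in range(3):
    theta, m, s = adabelief_step(theta, 2 * theta, m, s, lr=0.01)
    print(step, theta)
```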

  • The idea of the AdaBelief optimizer is to combine the advantages of adaptive optimization methods (e.g., Adam) and accelerated SGD optimizers. Adaptive methods typically converge faster, while SGD optimizers demonstrate better generalization performance.
  • If the observed gradient deviates greatly from the prediction, we have a weak belief in this observation and take a small step.
  • If the observed gradient is close to the prediction, we have a strong belief in this observation and take a large step.
  • As a result, AdaBelief simultaneously achieves:
  • fast convergence, like adaptive optimization methods;
  • good generalization, like the SGD family;
  • training stability in complex settings such as GAN.
  • In image classification tasks on CIFAR and ImageNet, AdaBelief demonstrates as fast convergence as Adam and as good generalization as SGD.
  • It outperforms other methods in language modeling.
  • In the training of a WGAN , AdaBelief significantly improves the quality of generated images compared to Adam.
  • The paper was accepted to NeurIPS 2020, the top conference in artificial intelligence.
  • It is also trending in the AI research community, as evident from the repository stats on GitHub .
  • AdaBelief can boost the development and application of deep learning models, as it can be applied to the training of any model that numerically estimates parameter gradients.
  • Both PyTorch and TensorFlow implementations are released on GitHub.

If you like these research summaries, you might be also interested in the following articles:

  • GPT-3 & Beyond: 10 NLP Research Papers You Should Read
  • Novel Computer Vision Research Papers From 2020
  • AAAI 2021: Top Research Papers With Business Applications
  • ICLR 2021: Key Research Papers



Top Machine Learning Research Papers Released In 2021

by Dr. Nivash Jeevanandam | Last updated November 18, 2021

Advances in machine learning and deep learning research are reshaping our technology. Machine learning and deep learning accomplished various astounding feats in 2021, and key research articles resulted in technical advances used by billions of people. Research in this field is advancing at a breakneck pace and can be hard to keep up with. Here is a collection of the most important recent research papers.

Rebooting ACGAN: Auxiliary Classifier GANs with Stable Training

The authors of this work examined why ACGAN training becomes unstable as the number of classes in the dataset grows. They revealed that the instability arises from a gradient explosion problem caused by the unboundedness of the input feature vectors and the classifier’s poor classification capabilities during the early training stage. To alleviate the instability and reinforce ACGAN, the researchers presented the Data-to-Data Cross-Entropy loss (D2D-CE) and the Rebooted Auxiliary Classifier Generative Adversarial Network (ReACGAN). Additionally, extensive tests demonstrate that ReACGAN is robust to hyperparameter selection and compatible with a variety of architectures and differentiable augmentations.

This article is ranked #1 on CIFAR-10 for Conditional Image Generation.

For the research paper, read here .

For code, see here .

Dense Unsupervised Learning for Video Segmentation

The authors presented a straightforward and computationally fast unsupervised strategy for learning dense space-time representations from unlabeled videos. The approach demonstrates rapid training convergence and a high degree of data efficiency. Furthermore, the researchers obtain VOS accuracy superior to previous results despite employing a fraction of the previously necessary training data. The researchers acknowledge that the findings may be utilised maliciously, such as for unlawful surveillance. They are also keen to investigate how this capability might be used to learn a broader spectrum of invariances by exploiting larger temporal windows in videos with complex (ego-)motion, which are more prone to disocclusions.

This study is ranked #1 on DAVIS 2017 for Unsupervised Video Object Segmentation (val).

Temporally-Consistent Surface Reconstruction using Metrically-Consistent Atlases

The authors offer an atlas-based technique for producing unsupervised, temporally consistent surface reconstructions by requiring a point on the canonical shape representation to map to metrically consistent 3D locations on the reconstructed surfaces. The researchers envisage a plethora of potential applications for the method. For example, by substituting an image-based loss for the Chamfer distance, one may apply the method to RGB video sequences, which the researchers believe will spur development in video-based 3D reconstruction.

This article is ranked #1 on ANIM in the category of Surface Reconstruction. 

EdgeFlow: Achieving Practical Interactive Segmentation with Edge-Guided Flow

The researchers propose a novel interactive architecture called EdgeFlow that fully utilises user interaction data without resorting to post-processing or iterative optimisation. Thanks to its coarse-to-fine network design, the suggested technique achieves state-of-the-art performance on common benchmarks. Additionally, the researchers have built an effective interactive segmentation tool that enables the user to incrementally improve the segmentation result through flexible options.

This paper is ranked #1 on Interactive Segmentation on PASCAL VOC.

Learning Transferable Visual Models From Natural Language Supervision

The authors of this work examined whether the success of task-agnostic web-scale pre-training in natural language processing can be transferred to another domain. The findings indicate that adopting this formula leads to the emergence of similar behaviours in computer vision, and the authors examine the social ramifications of this line of research. CLIP models learn to accomplish a range of tasks during pre-training in order to optimise their training objective. Using natural-language prompting, CLIP can then leverage this task learning for zero-shot transfer to many existing datasets. When applied at a large scale, this technique can compete with task-specific supervised models, although there is still much room for improvement.

This research is ranked #1 on Zero-Shot Transfer Image Classification on SUN.
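
The zero-shot transfer recipe can be sketched as follows; this is a schematic with placeholder encoders (the names below are hypothetical), not the released CLIP code. Class names are turned into natural-language prompts, both modalities are embedded into a shared space, and the class whose prompt embedding is most similar to the image embedding wins.

```python
import zlib
import numpy as np

def zero_shot_classify(image_embedding, class_names, text_encoder):
    """CLIP-style zero-shot classification: embed a natural-language prompt
    per class and pick the class most similar to the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([text_encoder(p) for p in prompts])
    image_embedding = image_embedding / np.linalg.norm(image_embedding)
    scores = text_embs @ image_embedding  # cosine similarities
    return class_names[int(np.argmax(scores))], scores

def fake_encoder(text, dim=512):
    """Deterministic stand-in for the real text/image towers."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Pretend the image tower produced the same embedding as the "dog" prompt.
image_emb = fake_encoder("a photo of a dog")
print(zero_shot_classify(image_emb, ["dog", "cat", "car"], fake_encoder))
# ('dog', array([...]))
```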

CoAtNet: Marrying Convolution and Attention for All Data Sizes

The researchers in this article conduct a thorough examination of the properties of convolutions and transformers, resulting in a principled approach to combining them into a new family of models dubbed CoAtNet. Extensive experiments demonstrate that CoAtNet combines the advantages of ConvNets and Transformers, achieving state-of-the-art performance across a range of data sizes and compute budgets. Note that this article concentrates on ImageNet classification for model development; however, the researchers believe their approach is applicable to a broader range of tasks, such as object detection and semantic segmentation.

This paper is ranked #1 on Image Classification on ImageNet (using extra training data).

SwinIR: Image Restoration Using Swin Transformer

The authors of this article propose the SwinIR image restoration model, which is based on the Swin Transformer . The model comprises three modules: shallow feature extraction, deep feature extraction, and high-quality image reconstruction. For deep feature extraction, the researchers employ a stack of residual Swin Transformer blocks (RSTB), each composed of Swin Transformer layers, a convolution layer, and a residual connection.

This research article is ranked #1 on Image Super-Resolution on Manga109 – 4x upscaling.


Journal of Machine Learning Research

The Journal of Machine Learning Research (JMLR), established in 2000, provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online.

  • 2024.02.18: Volume 24 completed; Volume 25 began.
  • 2023.01.20: Volume 23 completed; Volume 24 began.
  • 2022.07.20: New special issue on climate change.
  • 2022.02.18: New blog post: Retrospectives from 20 Years of JMLR.
  • 2022.01.25: Volume 22 completed; Volume 23 began.
  • 2021.12.02: Message from outgoing co-EiC Bernhard Schölkopf.
  • 2021.02.10: Volume 21 completed; Volume 22 began.
  • More news ...

Latest papers

Deep Network Approximation: Beyond ReLU to Diverse Activation Functions Shijun Zhang, Jianfeng Lu, Hongkai Zhao , 2024. [ abs ][ pdf ][ bib ]

Effect-Invariant Mechanisms for Policy Generalization Sorawit Saengkyongam, Niklas Pfister, Predrag Klasnja, Susan Murphy, Jonas Peters , 2024. [ abs ][ pdf ][ bib ]

Pygmtools: A Python Graph Matching Toolkit Runzhong Wang, Ziao Guo, Wenzheng Pan, Jiale Ma, Yikai Zhang, Nan Yang, Qi Liu, Longxuan Wei, Hanxue Zhang, Chang Liu, Zetian Jiang, Xiaokang Yang, Junchi Yan , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Heterogeneous-Agent Reinforcement Learning Yifan Zhong, Jakub Grudzien Kuba, Xidong Feng, Siyi Hu, Jiaming Ji, Yaodong Yang , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Sample-efficient Adversarial Imitation Learning Dahuin Jung, Hyungyu Lee, Sungroh Yoon , 2024. [ abs ][ pdf ][ bib ]

Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent Benjamin Gess, Sebastian Kassing, Vitalii Konarovskyi , 2024. [ abs ][ pdf ][ bib ]

Rates of convergence for density estimation with generative adversarial networks Nikita Puchkin, Sergey Samsonov, Denis Belomestny, Eric Moulines, Alexey Naumov , 2024. [ abs ][ pdf ][ bib ]

Additive smoothing error in backward variational inference for general state-space models Mathis Chagneux, Elisabeth Gassiat, Pierre Gloaguen, Sylvain Le Corff , 2024. [ abs ][ pdf ][ bib ]

Optimal Bump Functions for Shallow ReLU networks: Weight Decay, Depth Separation, Curse of Dimensionality Stephan Wojtowytsch , 2024. [ abs ][ pdf ][ bib ]

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Tail Decay Rate Estimation of Loss Function Distributions Etrit Haxholli, Marco Lorenzi , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces Hao Liu, Haizhao Yang, Minshuo Chen, Tuo Zhao, Wenjing Liao , 2024. [ abs ][ pdf ][ bib ]

Post-Regularization Confidence Bands for Ordinary Differential Equations Xiaowu Dai, Lexin Li , 2024. [ abs ][ pdf ][ bib ]

On the Generalization of Stochastic Gradient Descent with Momentum Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang , 2024. [ abs ][ pdf ][ bib ]

Pursuit of the Cluster Structure of Network Lasso: Recovery Condition and Non-convex Extension Shotaro Yagishita, Jun-ya Gotoh , 2024. [ abs ][ pdf ][ bib ]

Iterate Averaging in the Quest for Best Test Error Diego Granziol, Nicholas P. Baskerville, Xingchen Wan, Samuel Albanie, Stephen Roberts , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Nonparametric Inference under B-bits Quantization Kexuan Li, Ruiqi Liu, Ganggang Xu, Zuofeng Shang , 2024. [ abs ][ pdf ][ bib ]

Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box Ryan Giordano, Martin Ingram, Tamara Broderick , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Sufficient Graphical Models Bing Li, Kyongwon Kim , 2024. [ abs ][ pdf ][ bib ]

Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond Nathan Kallus, Xiaojie Mao, Masatoshi Uehara , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks Sebastian Neumayer, Lénaïc Chizat, Michael Unser , 2024. [ abs ][ pdf ][ bib ]

Improving physics-informed neural networks with meta-learned optimization Alex Bihlo , 2024. [ abs ][ pdf ][ bib ]

A Comparison of Continuous-Time Approximations to Stochastic Gradient Descent Stefan Ankirchner, Stefan Perko , 2024. [ abs ][ pdf ][ bib ]

Critically Assessing the State of the Art in Neural Network Verification Matthias König, Annelot W. Bosman, Holger H. Hoos, Jan N. van Rijn , 2024. [ abs ][ pdf ][ bib ]

Estimating the Minimizer and the Minimum Value of a Regression Function Arya Akhavan, Davit Gogolashvili, Alexandre B. Tsybakov , 2024. [ abs ][ pdf ][ bib ]

Modeling Random Networks with Heterogeneous Reciprocity Daniel Cirkovic, Tiandong Wang , 2024. [ abs ][ pdf ][ bib ]

Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment Zixian Yang, Xin Liu, Lei Ying , 2024. [ abs ][ pdf ][ bib ]

On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models Yangjing Zhang, Ying Cui, Bodhisattva Sen, Kim-Chuan Toh , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Decorrelated Variable Importance Isabella Verdinelli, Larry Wasserman , 2024. [ abs ][ pdf ][ bib ]

Model-Free Representation Learning and Exploration in Low-Rank MDPs Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal , 2024. [ abs ][ pdf ][ bib ]

Seeded Graph Matching for the Correlated Gaussian Wigner Model via the Projected Power Method Ernesto Araya, Guillaume Braun, Hemant Tyagi , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization Shicong Cen, Yuting Wei, Yuejie Chi , 2024. [ abs ][ pdf ][ bib ]

Power of knockoff: The impact of ranking algorithm, augmented design, and symmetric statistic Zheng Tracy Ke, Jun S. Liu, Yucong Ma , 2024. [ abs ][ pdf ][ bib ]

Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction Yuze Han, Guangzeng Xie, Zhihua Zhang , 2024. [ abs ][ pdf ][ bib ]

On Truthing Issues in Supervised Classification Jonathan K. Su , 2024. [ abs ][ pdf ][ bib ]


The Top 17 ‘Must-Read’ AI Papers in 2022

We caught up with experts in the RE•WORK community to find out what the top 17 AI papers are for 2022 so far, to add to your summer must-reads. The papers cover a wide range of topics, from AI in social media to how AI can benefit humanity, and all are free to access.

Interested in learning more? Check out all the upcoming RE•WORK events to find out about the latest trends and industry updates in AI here .

Max Li, Staff Data Scientist – Tech Lead at Wish

Max is a Staff Data Scientist at Wish where he focuses on experimentation (A/B testing) and machine learning.  His passion is to empower data-driven decision-making through the rigorous use of data. View Max’s presentation, ‘Assign Experiment Variants at Scale in A/B Tests’, from our Deep Learning Summit in February 2022 here .

1. Bootstrapped Meta-Learning (2022) – Sebastian Flennerhag et al.

The first paper selected by Max proposes an algorithm that allows the meta-learner to teach itself, overcoming the meta-optimisation challenge. The algorithm focuses on meta-learning with gradients, which guarantees improvements in performance. The paper also looks at the possibilities that bootstrapping opens up. Read the full paper here .

2. Multi-Objective Bayesian Optimization over High-Dimensional Search Spaces (2022) – Samuel Daulton et al.

Another paper selected by Max proposes MORBO, a scalable method for multi-objective Bayesian optimisation over high-dimensional search spaces. MORBO significantly improves sample efficiency, providing gains in settings where existing BO algorithms fail. Read the full paper here .

3. Tabular Data: Deep Learning is Not All You Need (2021) – Ravid Shwartz-Ziv, Amitai Armon

To solve real-life data science problems, selecting the right model to use is crucial. This final paper selected by Max explores whether deep models should be recommended as an option for tabular data. Read the full paper here .

machine learning related research papers

Jigyasa Grover, Senior Machine Learning Engineer at Twitter

Jigyasa Grover is a Senior Machine Learning Engineer at Twitter working in the performance ads ranking domain. Recently, she was honoured with the 'Outstanding in AI: Young Role Model Award' by Women in AI across North America. She is one of the few ML Google Developer Experts globally. Jigyasa has previously presented at our Deep Learning Summit and MLOps event in San Francisco earlier this year.

4. Privacy for Free: How does Dataset Condensation Help Privacy? (2022) – Tian Dong et al.

Jigyasa’s first recommendation concentrates on privacy-preserving machine learning, specifically on mitigating the leakage of sensitive data in machine learning. The paper is one of the first to propose using dataset condensation techniques to preserve data efficiency during model training while furnishing membership privacy. It was published by Sony AI and won the Outstanding Paper Award at ICML 2022. Read the full paper here .

5. Affective Signals in a Social Media Recommender System (2022) – Jane Dwivedi-Yu et al.

The second paper recommended by Jigyasa talks about operationalising Affective Computing, also known as Emotional AI, for an improved personalised feed on social media. The paper discusses the design of an affective taxonomy customised to user needs on social media. It further lays out the curation of suitable training data by combining engagement data and data from a human-labelling task to enable the identification of the affective response a user might exhibit for a particular post. Read the full paper here .

6. ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest (2022) – Paul Baltescu et al.

Jigyasa’s last recommendation is a paper by Pinterest that illustrates the aggregation of both textual and visual information to build a unified set of product embeddings to enhance recommendation results on e-commerce websites. By applying multi-task learning, the proposed embeddings can optimise for multiple engagement types and ensure that the shopping recommendation stack is efficient with respect to all objectives. Read the full article here .

Asmita Poddar, Software Development Engineer at Amazon Alexa

Asmita is a Software Development Engineer at Amazon Alexa, where she works on developing and productionising natural language processing and speech models. Asmita also has prior experience in applying machine learning in diverse domains. Asmita will be presenting at our London AI Summit , in September, where she will discuss AI for Spoken Communication.

7. Competition-Level Code Generation with AlphaCode (2022) – Yujia Li et al.

Code-generation systems can help programmers become more productive. Asmita selected this paper, which addresses the challenges of incorporating recent AI innovations into such systems. AlphaCode is a system that creates solutions for problems that require deeper reasoning. Read the full paper here .

8. A Commonsense Knowledge Enhanced Network with Retrospective Loss for Emotion Recognition in Spoken Dialog (2022) – Yunhe Xie et al.

Existing emotion recognition in spoken dialog (ERSD) datasets place limits on models’ reasoning. The final paper selected by Asmita proposes a Commonsense Knowledge Enhanced Network with a retrospective loss to perform dialog modelling, external knowledge integration, and historical state retrospection. The model has been shown to outperform other models. Read the full paper here .

machine learning related research papers

Discover the speakers we have lined up and the topics we will cover at the London AI Summit.

Sergei Bobrovskyi, Expert in Anomaly Detection for Root Cause Analysis at Airbus

Dr. Sergei Bobrovskyi is a Data Scientist within the Analytics Accelerator team of the Airbus Digital Transformation Office. His work focuses on applications of AI for anomaly detection in time series, spanning various use-cases across Airbus. Sergei will be presenting at our Berlin AI Summit in October about Anomaly Detection, Root Cause Analysis and Explainability.

9. LaMDA: Language Models for Dialog Applications (2022) – Romal Thoppilan et al.

The paper chosen by Sergei describes the LaMDA system, which caused a furore this summer when a former Google engineer claimed it had shown signs of sentience. LaMDA is a family of large language models for dialog applications based on the Transformer architecture. An interesting feature of these models is their fine-tuning with human-annotated data and their ability to consult external sources. In any case, this is a very interesting model family, which we might encounter in many of the applications we use daily. Read the full paper here .

10. A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27 (2022) – Yann LeCun

The second paper chosen by Sergei provides a vision of how to progress towards general AI. The study combines a number of concepts, including a configurable predictive world model, behaviour driven by intrinsic motivation, and hierarchical joint embedding architectures. Read the full paper here .

11. Coordination Among Neural Modules Through a Shared Global Workspace (2022) – Anirudh Goyal et al.

This paper chosen by Sergei combines the Transformer architecture underlying most of the recent successes of deep learning with ideas from the Global Workspace Theory from cognitive sciences. This is an interesting read to broaden the understanding of why certain model architectures perform well and in which direction we might go in the future to further improve performance on challenging tasks. Read the full paper here .

12. Magnetic control of tokamak plasmas through deep reinforcement learning (2022) – Jonas Degrave et al.

Sergei chose the next paper, which asks how AI research can benefit humanity. Using AI to enable safe, reliable and scalable deployment of fusion energy could contribute to solving the pressing problem of climate change. Sergei notes that this is an extremely interesting application of AI technology to engineering. Read the full paper here .

13. TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data (2022) – Shreshth Tuli, Giuliano Casale and Nicholas R. Jennings

The final paper chosen by Sergei is a specialised paper applying the transformer architecture to the problem of unsupervised anomaly detection in multivariate time series. Many architectures that were successful in other fields are now also being applied to time series. The paper shows improved performance on several known datasets. Read the full paper here .

machine learning related research papers

Abdullahi Adamu, Senior Software Engineer at Sony

Abdullahi has worked in various industries including working at a market research start-up where he developed models that could extract insights from human conversations about products or services. He moved to Publicis, where he became Data Engineer and Data Scientist in 2018. Abdullahi will be part of our panel discussion at the London AI Summit in September, where he will discuss Harnessing the Power of Deep Learning.

14. Self-Supervision for Learning from the Bottom Up (2022) – Alexei Efros

This paper chosen by Abdullahi makes compelling arguments for why self-supervision is the next step in the evolution of AI/ML, and justifies why self-supervised learning is important on our journey towards more robust models that generalise better in the wild. Read the full paper here .

15. Neural Architecture Search Survey: A Hardware Perspective (2022) – Krishna Teja Chitty-Venkata and Arun K. Somani

Another paper chosen by Abdullahi argues that as we move towards edge computing and federated learning, neural architecture search that takes hardware constraints into account will be critical in ensuring leaner neural network models that balance latency and generalisation performance. This survey gives a bird's-eye view of the various neural architecture search algorithms that account for hardware constraints when designing artificial neural networks with the best trade-off between performance and accuracy. Read the full paper here .

16. What Should Not Be Contrastive In Contrastive Learning (2021) – Tete Xiao et al.

The paper chosen by Abdullahi highlights the underlying assumptions behind data augmentation methods and how these can be counterproductive in the context of contrastive learning; for example, colour augmentation when a downstream task is meant to differentiate the colours of objects. The reported results in the wild are promising. Overall, it presents an elegant solution to using data augmentation for contrastive learning. Read the full paper here .

17. Why do tree-based models still outperform deep learning on tabular data? (2022) – Leo Grinsztajn, Edouard Oyallon and Gael Varoquaux

The final paper selected by Abdullahi answers the question of why deep learning models still find it hard to compete with tree-based models on tabular data. It shows that MLP-like architectures are more sensitive to uninformative features in data than their tree-based counterparts. Read the full paper here .

Sign up to the RE•WORK monthly newsletter for the latest AI news, trends and events.

Join us at our upcoming events this year:

  • London AI Summit – 14-15 September 2022
  • Berlin AI Summit – 4-5 October 2022
  • AI in Healthcare Summit Boston – 13-14 October 2022
  • Sydney Deep Learning and Enterprise AI Summits – 17-18 October 2022
  • MLOps Summit – 9-10 November 2022
  • Toronto AI Summit – 9-10 November 2022
  • Nordics AI Summit – 7-8 December 2022


CORE MACHINE LEARNING

Revisiting Feature Prediction for Learning Visual Representations from Video

February 15, 2024

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model’s parameters; e.g., using a frozen backbone, our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

Adrien Bardes

Quentin Garrido

Xinlei Chen

Michael Rabbat

Mido Assran

Nicolas Ballas

Research Topics

Core Machine Learning

Related Publications

January 09, 2024

Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK Work Decomposition

Less Wright , Adnan Hoque

January 06, 2024

RANKING AND RECOMMENDATIONS, REINFORCEMENT LEARNING

Learning to Bid and Rank Together in Recommendation Systems

Geng Ji , Wentao Jiang , Jiang Li , Fahmid Morshed Fahid , Zhengxing Chen , Yinghua Li , Jun Xiao , Chongxi Bao , Zheqing (Bill) Zhu

November 13, 2023

Mechanic: A Learning Rate Tuner

Aaron Defazio , Ashok Cutkosky , Harsh Mehta

October 01, 2023

Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots

Wei Hung , Bo-Kai Huang , Ping-Chun Hsieh , Xi Liu


MIT News | Massachusetts Institute of Technology


MIT researchers remotely map crops, field by field

Image: Four Google Street View photos show rice, cassava, sugarcane, and maize fields.

Crop maps help scientists and policymakers track global food supplies and estimate how they might shift with climate change and growing populations. But getting accurate maps of the types of crops that are grown from farm to farm often requires on-the-ground surveys that only a handful of countries have the resources to maintain.

Now, MIT engineers have developed a method to quickly and accurately label and map crop types without requiring in-person assessments of every single farm. The team’s method uses a combination of Google Street View images, machine learning, and satellite data to automatically determine the crops grown throughout a region, from one fraction of an acre to the next. 

The researchers used the technique to automatically generate the first nationwide crop map of Thailand — a smallholder country where small, independent farms make up the predominant form of agriculture. The team created a border-to-border map of Thailand’s four major crops — rice, cassava, sugarcane, and maize — and determined which of the four types was grown, every 10 meters and without gaps, across the entire country. The resulting map achieved an accuracy of 93 percent, which the researchers say is comparable to on-the-ground mapping efforts in high-income, big-farm countries.

The team is applying their mapping technique to other countries such as India, where small farms sustain most of the population but the type of crops grown from farm to farm has historically been poorly recorded.

“It’s a longstanding gap in knowledge about what is grown around the world,” says Sherrie Wang, the d’Arbeloff Career Development Assistant Professor in MIT’s Department of Mechanical Engineering, and the Institute for Data, Systems, and Society (IDSS). “The final goal is to understand agricultural outcomes like yield, and how to farm more sustainably. One of the key preliminary steps is to map what is even being grown — the more granularly you can map, the more questions you can answer.”

Wang, along with MIT graduate student Jordi Laguarta Soler and Thomas Friedel of the agtech company PEAT GmbH, will present a paper detailing their mapping method later this month at the AAAI Conference on Artificial Intelligence.

Ground truth

Smallholder farms are often run by a single family or farmer, who subsist on the crops and livestock that they raise. It’s estimated that smallholder farms support two-thirds of the world’s rural population and produce 80 percent of the world’s food. Keeping tabs on what is grown and where is essential to tracking and forecasting food supplies around the world. But the majority of these small farms are in low to middle-income countries, where few resources are devoted to keeping track of individual farms’ crop types and yields.

Crop mapping efforts are mainly carried out in high-income regions such as the United States and Europe, where government agricultural agencies oversee crop surveys and send assessors to farms to label crops from field to field. These “ground truth” labels are then fed into machine-learning models that make connections between the ground labels of actual crops and satellite signals of the same fields. They then label and map wider swaths of farmland that assessors don’t cover but that satellites automatically do.

“What’s lacking in low- and middle-income countries is this ground label that we can associate with satellite signals,” Laguarta Soler says. “Getting these ground truths to train a model in the first place has been limited in most of the world.”

The team realized that, while many developing countries do not have the resources to maintain crop surveys, they could potentially use another source of ground data: roadside imagery, captured by services such as Google Street View and Mapillary, which send cars throughout a region to take continuous 360-degree images with dashcams and rooftop cameras.

In recent years, such services have been able to access low- and middle-income countries. While the goal of these services is not specifically to capture images of crops, the MIT team saw that they could search the roadside images to identify crops.

Cropped image

In their new study, the researchers worked with Google Street View (GSV) images taken throughout Thailand — a country that the service has recently imaged fairly thoroughly, and which consists predominantly of smallholder farms.

Starting with over 200,000 GSV images randomly sampled across Thailand, the team filtered out images that depicted buildings, trees, and general vegetation. About 81,000 images were crop-related. They set aside 2,000 of these, which they sent to an agronomist, who determined and labeled each crop type by eye. They then trained a convolutional neural network to automatically generate crop labels for the other 79,000 images, drawing on various labeling resources, including iNaturalist, a web-based crowdsourced biodiversity database, and GPT-4V, a multimodal large language model that lets a user input an image and ask the model what it depicts. For each of the 81,000 images, the model generated a label of one of four crops that the image was likely depicting — rice, maize, sugarcane, or cassava.

The researchers then paired each labeled image with the corresponding satellite data taken of the same location throughout a single growing season. These satellite data include measurements across multiple wavelengths, such as a location’s greenness and its reflectivity (which can be a sign of water). 

“Each type of crop has a certain signature across these different bands, which changes throughout a growing season,” Laguarta Soler notes.

The team trained a second model to make associations between a location’s satellite data and its corresponding crop label. They then used this model to process satellite data taken of the rest of the country, where crop labels were not generated or available. From the associations that the model learned, it then assigned crop labels across Thailand, generating a country-wide map of crop types, at a resolution of 10 square meters.

This first-of-its-kind crop map included locations corresponding to the 2,000 GSV images that the researchers originally set aside and that were labeled by the agronomist. These human-labeled images were used to validate the map’s labels, and when the team checked whether the map’s labels matched the expert, “gold standard” labels, they matched 93 percent of the time.

“In the U.S., we’re also looking at over 90 percent accuracy, whereas with previous work in India, we’ve only seen 75 percent because ground labels are limited,” Wang says. “Now we can create these labels in a cheap and automated way.”

The researchers are moving to map crops across India, where roadside images via Google Street View and other services have recently become available.

“There are over 150 million smallholder farmers in India,” Wang says. “India is covered in agriculture, almost wall-to-wall farms, but very small farms, and historically it’s been very difficult to create maps of India because there are very sparse ground labels.”

The team is working to generate crop maps in India, which could be used to inform policies having to do with assessing and bolstering yields, as global temperatures and populations rise.

“What would be interesting would be to create these maps over time,” Wang says. “Then you could start to see trends, and we can try to relate those things to anything like changes in climate and policies.”


Predicting building energy consumption in urban neighborhoods using machine learning algorithms

  • Research article
  • Open access
  • Published: 16 February 2024
  • Volume 2, article number 6 (2024)

Qingrui Jiang, Chenyu Huang, Zhiqiang Wu, Jiawei Yao, Jinyu Wang, Xiaochang Liu & Renlu Qiao

Assessing building energy consumption in urban neighborhoods at the early stages of urban planning assists decision-makers in developing detailed urban renewal plans and sustainable development strategies. At the city level, physical simulation-based urban building energy modeling (UBEM) is too costly, and data-driven approaches are often hampered by a lack of available building energy monitoring data. This paper combines a simulation-based approach with a data-driven approach, using UBEM to provide a dataset for machine learning and deploying the trained model for large-scale urban building energy consumption prediction. Firstly, we collected 18,689 neighborhoods containing 248,938 buildings in the Shanghai central area, of which 2,702 neighborhoods were used for UBEM. Simultaneously, building functions were defined by POI data and land use data. We used 14 impact factors related to land use and building morphology to define each neighborhood. Next, we compared the performance of six ensemble learning methods modeling the impact factors against building energy consumption and used SHAP to explain the best model; we then retained only the features that contributed most to the model output, to reduce model complexity. Finally, the balanced regressor that had the best prediction accuracy with the minimum number of features was used to predict the remaining urban neighborhoods in the Shanghai central area. The results show that XGBoost achieves the best performance. The balanced regressor, constructed with the 9 most contributing features, predicted the building rooftop photovoltaic potential, total load, cooling load, and heating load with test set accuracies of 0.956, 0.674, 0.608, and 0.762, respectively. Our method offers an 85.5% time saving over traditional methods, with a maximum error of 22.75%.


1 Introduction

During urbanization, we face the pressing challenge of climate change. To meet the goals of the Paris Agreement, the scientific community has united in efforts to limit global warming to 1.5 degrees Celsius (Mishra et al., 2022; Morfeldt & Johansson, 2022; Slameršak et al., 2022). The building and construction sector stands as one of the largest consumers of energy in the world, accounting for 25–40% of global CO2 emissions (Pomponi & Moncaster, 2017). In China, the construction sector is among the top three energy-consuming sectors, representing about 21.9% of carbon emissions from energy-related sectors (You et al., 2023). China has set forth ambitious carbon neutrality goals, aiming for carbon peaking by 2030 and carbon neutrality by 2060. Thus, curbing carbon emissions from the building sector is critical to China's strategy. Recent studies have raised concerns regarding the carbon neutrality of China's construction sector (Camarasa et al., 2022), suggesting that the construction sector might need more intensive efforts to align with carbon neutrality compared to others.

To achieve the goal of decarbonizing the building sector, a range of strategies are essential, including the construction of zero-carbon buildings, retrofitting existing energy-intensive buildings, developing new low-carbon building technologies, and promoting renewable energy sources in cities. Conducting carbon emission assessments at the decision-making stage of urban planning and urban regeneration is crucial (Dahlström et al., 2022; Heidelberger & Rakha, 2022). This approach aids decision-makers in developing detailed urban renewal plans and sustainable development strategies. However, current city-level assessments of building energy consumption present challenges. On the one hand, the top-down approach, which depends on monitoring and statistics (Abbasabadi & Ashayeri, 2019; Wu et al., 2022), is often unavailable in smaller cities or cities with insufficient economic development. On the other hand, the resources and time required for assessments that rely on bottom-up approaches with energy consumption simulation engines often prove prohibitive in the early stages of planning (W. Wang et al., 2021). While scholars have recently turned to artificial intelligence and machine learning to predict building energy consumption, most studies focus on predicting the dynamic loads of individual buildings (L. Zhang et al., 2021). However, for planning designers and policymakers, energy use intensity, rather than dynamic loads, is the primary evaluation indicator. Moreover, these models often demand detailed inputs to ensure prediction accuracy (Parhizkar et al., 2021), posing challenges for data collection in the early stages of urban planning.

This paper employs a combination of a physical simulation engine and data-driven techniques to predict city-level building energy consumption in a bottom-up manner. We divided urban neighborhoods into simulation and prediction datasets, performed urban building energy simulations on a small sample of simulation datasets, and trained machine learning models. The trained models were then deployed to the prediction set to generalize to the full urban neighborhoods. Through interpretability analysis, we identified and retained the features that contribute most to the model output, thereby reducing the model's complexity. The organization of this paper is as follows: Sect. 2 reviews related work; Sect. 3 describes the main methods, including data collection, impact factor calculation, urban building energy simulation, and interpretable machine learning modeling; Sect. 4 presents the results; Sect. 5 provides a discussion of the findings; and conclusions are presented in Sect. 6.

2 Related works

2.1 Methods for estimating energy consumption in urban buildings

The estimation methods for urban building energy consumption can be categorized into two main types: top-down and bottom-up approaches (Ma et al., 2017; Reinhart & Cerezo Davila, 2016). The top-down approach focuses less on the energy use of each end building and instead treats urban building energy consumption at a macro level. Top-down approaches rely on historical or statistical urban building energy consumption data, often correlating them with the level of economic, demographic, and technological development. This approach equips urban decision-makers with long-term or cross-city energy knowledge (Gan et al., 2022; Huo et al., 2022; Sun et al., 2022). Though top-down methods can provide a rapid assessment of large-scale building energy demand, they are ineffective in cities and regions that lack data. Furthermore, this approach often employs the grid as the smallest cell (Shi et al., 2019; J. Wang et al., 2022a, 2022b, 2022c; Y. Zhang et al., 2022a, 2022b), resulting in a misalignment between study outcomes and policy implementation boundaries.

In contrast to top-down approaches, bottom-up approaches emphasize the energy use of individual buildings or building complexes and can be categorized into physical simulation-based approaches and data-driven approaches. The physical simulation-based approach has a long-standing history. For a single building entity, building energy simulation models the thermodynamic energy processes by abstracting the building geometry into a network of connected nodes. Heat balance equations are then formulated and solved for each node, based on the provided non-geometric building parameters (Nutkiewicz et al., 2021). However, accurately modeling the energy consumption of individual buildings is resource-intensive and time-consuming due to the extensive number of nodes and associated equations. With the advent of artificial intelligence techniques, data-driven approaches have become a focal point in building energy consumption research (Bourdeau et al., 2019). Machine learning and deep learning techniques discern hidden patterns from vast energy consumption datasets, creating a predictive 'black box' for building energy consumption. This method significantly streamlines the building energy consumption assessment process. Yet, much of the current research is centered on the building O&M process, notably optimizing HVAC systems by predicting dynamic building loads (Ahmad et al., 2016; Zhu et al., 2022). Some studies have highlighted the use of data-driven methods for predicting building energy use intensity early in the design phase (M. Wang et al., 2022a, 2022b, 2022c). However, the limited data on the energy use intensity of buildings, compared to time series data, poses challenges in training robust and generalizable models (Fan et al., 2022).

2.2 Urban building energy consumption modeling

The urban neighborhood serves as the basic unit of urban planning (H. Zhang et al., 2022a, 2022b), and conducting building energy simulations for urban neighborhoods offers valuable insights for urban planning and architectural design. While urban building energy simulation is gaining traction, it remains a nascent field. Urban building energy modeling encompasses the computational modeling and simulation of a group of buildings within an urban context. This approach accounts for not only the dynamics of individual buildings but, more crucially, the interactions between them (Buckley et al., 2021; T. Hong et al., 2020a, 2020b). Zhou et al. (2022) utilized UBEM to simulate the energy use intensity of 9,000 residential buildings in Dublin, aiming to support the energy renovation process in the European housing sector. However, the heightened physical complexity of urban-scale building energy simulations, compared to modeling individual buildings, renders the calculations notably less efficient.

Distinct from the energy simulation of individual buildings, UBEM presents the challenge of sourcing input data. Securing accurate and comprehensive input parameters, such as geometric parameters (building geometry, window-to-wall ratio, number of floors, etc.) and non-geometric parameters (energy use patterns and HVAC systems), often proves difficult (C. Wang et al., 2022a, 2022b, 2022c). However, leveraging data collection methods from other disciplines offers potential solutions. Mapping platforms, notably OpenStreetMap, can supply the building footprint data essential for UBEM (Chen & Hong, 2018; Schiefelbein et al., 2019). Cell phone data help characterize building occupancy (Barbour et al., 2019; Pang et al., 2018), a key determinant of energy use. Given the challenges in accessing cell phone data, point-of-interest data can also serve to identify building functions, subsequently informing UBEM about building usage (C. Wang et al., n.d., 2020). These data sourcing strategies have led to UBEM often being integrated with GIS (Ali et al., 2020a, 2020b; Groppi et al., 2018). Of particular note recently is the growing interest in urban distributed photovoltaic power generation. Scholars combine GIS with UBEM to evaluate the PV potential of buildings (Boccalatte et al., 2022; Montealegre et al., 2022), paving the way for sustainable urban development.

2.3 Data-driven building energy prediction

While we have highlighted the advancements in UBEM, the significant consumption of computational resources remains a major challenge, particularly for city-level energy consumption assessments, where UBEM becomes almost impractical. To address this, scholars have turned to data-driven approaches. Existing studies can be broadly grouped into two categories: the first employs data-driven tools for building energy consumption and built environment assessment, aiming to expedite the design process of sustainable urban neighborhoods (Huang et al., 2022; Nutkiewicz et al., 2018; W. Wang et al., 2021); the second utilizes data-driven methods to identify building energy consumption across expansive urban neighborhoods, offering insights for energy retrofitting (Ali et al., 2020a, 2020b; Ye et al., 2021). The research presented in this paper aligns with the latter category.

Similar to UBEM based on physical simulation, early data-driven approaches often necessitated a plethora of input parameters to ensure model prediction accuracy. However, recent advancements have seen interpretable analysis employed to identify the most impactful features on building energy use, thereby reducing the required feature inputs for data-driven models (Seo et al., 2022; L. Zhang, 2021). Moreover, the adoption of interpretable methods has proven to enhance the generalization capability of these models (Jin et al., 2022; Manfren et al., 2022). Research has consistently demonstrated building function and morphology to be pivotal factors influencing building energy consumption (Abbasabadi et al., 2019). Quan and Li (2021) proposed a multi-scale data-driven energy use modeling framework, comparing various machine learning algorithms and emphasizing the pronounced impact of building size and height on building energy use intensity (EUI). In our study, we utilized land use data sourced from POI to determine building energy use and employed building data to compute building morphology factors. Subsequently, we developed a machine learning model to predict building energy use in urban neighborhoods, integrating land use and building morphology as inputs. To streamline the feature set, we employed interpretable analysis to discern the most influential features for the model's predictive objective and devised the balanced regressor, which optimizes prediction accuracy while minimizing input features.

3 Methods

3.1 Research workflow

In this study, we leveraged data from UBEM simulation to train machine learning models, subsequently employing them to predict urban building energy consumption over an expansive area. The research process is segmented into four stages. First, we collected land use and building morphology data for modeling urban building energy consumption. Concurrently, we processed these datasets to derive 14 impactful characteristics that define urban neighborhoods. Next, we randomly selected urban neighborhoods within our study domain, subjecting the sampled data to simulations for both urban building energy consumption and rooftop photovoltaic power generation. In the third stage, we partitioned the simulated samples into training and test sets, executed machine learning modeling, and evaluated the performance of various machine learning models. We then employed the SHAP value to interpret the optimal model. Lastly, we applied the trained machine learning models to predict building energy consumption for the unsimulated urban neighborhoods within our study domain. We also contrasted the data distributions of the impact factors between the simulated and predicted samples. The study's workflow is depicted in Fig. 1 .

Figure 1: Research workflow

3.2 Data collection

Shanghai (120°52′ E-122°12′ E, 30°40′ N-31°53′ N) is located on the west coast of the Pacific Ocean and has a subtropical monsoon climate with abundant light and rainfall (H. Zhang et al., 2022a, 2022b). Recognized as one of the world's preeminent mega-cities, Shanghai stands as a beacon of urbanization in China (Cao et al., 2021). This rapid urbanization has resulted in a pronounced heat island effect (Yang et al., 2022), subsequently driving up building energy consumption (Y. Hong et al., 2020a, 2020b). The central area of Shanghai comprises the Huangpu, Hongkou, Jing'an, Xuhui, Changning, Yangpu, and Putuo districts. This region is renowned for its thriving economy, showcasing top-tier commercial, entertainment, and culinary establishments, alongside state-of-the-art infrastructure. Moreover, the central area presents a diverse architectural timeline, with a notable disparity in building ages. Many of its older structures necessitate heightened energy consumption to sustain a comfortable indoor climate. Given its high energy consumption profile and architectural diversity, we opted for the central area of Shanghai as our study domain (see Fig. 2).

Figure 2: Study area. a China, b Shanghai, c Shanghai central area

The land use data (LU) for this study were sourced from point-of-interest (POI) data and from calculations based on urban planning unit data. The POI data, provided by OpenStreetMap (data source: https://www.openstreetmap.org/ ), were categorized into 18 types (e.g., restaurants, shopping malls, schools) based on the semantic phrases they contained. These categories were then converted into proportions representing building functions. The urban planning unit data offered geometric boundaries as well as land use types for each urban neighborhood (data source: https://www.shanghai.gov.cn/nw42806/ ). Building morphology data (BM) were derived from calculations based on urban building data (data source: https://lbsyun.baidu.com/ ).

We screened the urban neighborhoods in the Shanghai central area. The criteria for this screening were: 1) excluding lands devoid of buildings, such as landscapes, water bodies, and open spaces, and 2) eliminating lands with an area smaller than 12,000 m2, i.e., those falling in the lower quartile. This screening aimed to minimize errors in urban building energy simulation and enhance the stability of the machine learning model. Following this process, we secured a total of 18,689 samples. These samples encompassed only five land use categories: LU-1: urban residential land; LU-2: industrial and mining storage land; LU-3: public infrastructure land; LU-4: public building land; and LU-5: rural settlement land. We partitioned the complete sample set into simulated and predicted datasets (see Table 1). The simulated dataset served the dual purpose of urban building energy consumption modeling and machine learning modeling. The models, once trained, were then applied to the prediction dataset, enabling us to determine the urban building energy consumption across the entire sample in the Shanghai central area.

3.3 Calculation of impact factors

We selected fourteen impact factors as key characteristics for predicting urban building energy consumption. These data were classified into two categories: LU and BM. LU includes 7 factors: land use type (LUT), the proportion of restaurant buildings (REST), the proportion of medical buildings (HOSP), the proportion of educational buildings (SCH), the proportion of commercial buildings (MALL), the proportion of residential buildings (RES), and the proportion of office buildings (OFC). BM includes 7 factors: Site Area (SA), Number of Buildings (NoB), Building Coverage Ratio (BCR), Floor Area Ratio (FAR), Average Building Height (HAVE), Building Height Standard Deviation (HSTD), and Building Shape Coefficient (BSC) (see Fig. 3).

Figure 3: The variables in this study. a land use type, b different functions in LUTs, c Site Area, d Building Coverage Ratio, e Number of Buildings, f Floor Area Ratio, g Average Building Height, h Building Height Standard Deviation, i Building Shape Coefficient

Among the 7 impact factors of LU, LUT was directly sourced from urban planning unit data, while the remaining impact factors were derived from POI calculations. The approach involved counting the number of POIs within each urban neighborhood using a spatial join in GIS. Subsequently, we determined the proportion of different POI classifications, which were then translated into the proportion of building functions. The 7 impact factors of BM were calculated using GIS; the calculation formulas are presented in Table 4 in Appendix.
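To make the factor derivation concrete, the following minimal GeoPandas sketch computes both groups of factors. All layer and column names (neighborhoods, buildings, pois, nid, height, floors, function) are hypothetical, and the morphology formulas are the standard textbook definitions rather than the exact ones in Table 4 (which is not reproduced here).

```python
import geopandas as gpd

# Hypothetical input layers (all column names are assumptions):
#   neighborhoods: one polygon per urban neighborhood, keyed by "nid"
#   buildings:     footprint polygons with "height" (m) and "floors"
#   pois:          points whose "function" is already mapped to the six classes
# A projected CRS is assumed, so .area is in square metres.
FUNCTIONS = ["REST", "HOSP", "SCH", "MALL", "RES", "OFC"]

def land_use_proportions(neighborhoods, pois):
    """Share of each building function per neighborhood via a spatial join."""
    joined = gpd.sjoin(pois, neighborhoods[["nid", "geometry"]], predicate="within")
    counts = (joined.groupby(["nid", "function"]).size()
              .unstack(fill_value=0)
              .reindex(columns=FUNCTIONS, fill_value=0))
    return counts.div(counts.sum(axis=1), axis=0).fillna(0.0)

def building_morphology(neighborhoods, buildings):
    """Standard-definition BM factors; BSC is omitted because it needs
    3-D envelope area and volume, which footprints alone do not provide."""
    b = buildings.copy()
    b["footprint"] = b.geometry.area
    b["floor_area"] = b["footprint"] * b["floors"]
    b = b.set_geometry(b.geometry.centroid)  # assign each building by centroid
    joined = gpd.sjoin(b, neighborhoods[["nid", "geometry"]], predicate="within")

    g = joined.groupby("nid")
    out = neighborhoods.set_index("nid").copy()
    out["SA"] = out.geometry.area                   # Site Area
    out["NoB"] = g.size()                           # Number of Buildings
    out["BCR"] = g["footprint"].sum() / out["SA"]   # Building Coverage Ratio
    out["FAR"] = g["floor_area"].sum() / out["SA"]  # Floor Area Ratio
    out["HAVE"] = g["height"].mean()                # Average Building Height
    out["HSTD"] = g["height"].std()                 # Building Height Std. Dev.
    return out.drop(columns="geometry")
```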

To facilitate visualization, we introduced a cofactor: the functional mix degree (FMD). This cofactor uses Shannon's information entropy to represent the degree of mixing of different building functions in a plot. It is formulated as follows (Eq. (1)):

\[ \mathrm{FMD} = -\sum_{i=1}^{n} P_{i} \ln P_{i} \qquad (1) \]

where \(n\) is the number of building functions within an urban neighborhood and \(P_{i}\) is the proportion of the i-th function within the urban neighborhood.
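A minimal implementation of Eq. (1), useful as a sanity check (the proportions for one neighborhood are assumed to sum to 1):

```python
import numpy as np

def functional_mix_degree(proportions):
    """Shannon entropy of building-function proportions (Eq. (1)).
    Zero shares are dropped, since the limit of p*ln(p) at 0 is 0."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# An evenly mixed neighborhood attains the maximum, ln(6) ~ 1.79:
print(functional_mix_degree([1 / 6] * 6))
# A single-function neighborhood has zero mix:
print(functional_mix_degree([1, 0, 0, 0, 0, 0]))
```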

3.4 Urban building energy simulation

UBEM is a bottom-up, physics-based approach that calculates building energy consumption, accounting for heating, air conditioning, ventilation, lighting, and equipment use, as well as heat transfer through the envelope, during building operation. We utilized the Dragonfly plug-in of the Rhino/Grasshopper platform for urban building energy simulation, which requires around 3,000 parameters for a single simulation. Dragonfly simplifies EnergyPlus model input by pre-setting many parameters to ASHRAE's standard values. Additionally, Dragonfly offers a visual programming interface for urban building energy simulations. In this study, Dragonfly was used to batch-call EnergyPlus, enabling us to model urban building energy consumption for all samples of the simulation dataset.

The function of a building determines not only its configuration but also its energy use. In this research, we determined building functions in each urban neighborhood using six categories of building function proportions (REST, HOSP, SCH, MALL, RES, and OFC) derived from POI data. For each simulated neighborhood, building functions were randomly assigned based on their respective proportions. The urban building energy modeling process accounted for the impact of building shading on energy consumption within a 50 m radius. To expedite the simulation process, each unique room (or standard floor) was simulated once, and the results were then aggregated. The simulation spanned a full year with an hourly time step. The weather data were sourced from epw files specific to Shanghai (data source: https://www.ladybug.tools/epwmap/ ). The final UBEM outputs comprised the total load (TL), cooling load (CL), and heating load (HL) for each neighborhood. Detailed parameter settings pertaining to the building functions can be found in Table 5 in Appendix.

Furthermore, this study also simulated the solar power potential of building rooftops, given the rapid development of distributed photovoltaic power in Shanghai. The yearly acceptable solar irradiance of building rooftops was calculated using the Ladybug plug-in for Rhino/Grasshopper. Ladybug employs RADIANCE to run global and diffuse radiation simulations and is widely validated for its accuracy and efficiency in solar irradiance studies (Li et al., 2022). The weather file was the epw file for Shanghai, and the calculation accounted for the shading of surrounding buildings at an accuracy of 1 m. The results of the irradiance calculation were multiplied by the attenuation coefficient of the PV panels to derive the solar power potential (RPV) of the building roof. In this work, the attenuation coefficient was set to 0.2.
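The final conversion is a single multiplication; a small sketch, assuming the annual rooftop irradiance totals from the Ladybug run are given in kWh:

```python
ATTENUATION_COEFFICIENT = 0.2  # value used in this study

def rooftop_pv_potential(annual_rooftop_irradiance_kwh):
    """RPV (kWh/yr) = simulated annual rooftop irradiance * PV attenuation coefficient."""
    return annual_rooftop_irradiance_kwh * ATTENUATION_COEFFICIENT

print(rooftop_pv_potential(5_000_000))  # e.g. 5 GWh of irradiance -> 1 GWh of RPV
```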

The entire process of urban building energy simulation is illustrated in Fig. 4. The outputs of the urban building energy simulation encompass RPV, TL, CL, and HL. The 14 impact factors were combined with the simulation outputs to create the dataset. A Pearson correlation analysis explored the relationship between the impact factors and the simulation outputs. Subsequently, max-min normalization was applied to the dataset in preparation for machine learning model training.

Figure 4: Urban building energy simulation process. a random setting according to the proportion of building functions, b generating the EnergyPlus model, c calculation of building load, d calculation of building rooftop PV potential
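A minimal sketch of this dataset-assembly step, assuming the simulated samples sit in a pandas DataFrame df with one row per neighborhood and LUT label-encoded as an integer:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

FEATURES = ["LUT", "REST", "HOSP", "SCH", "MALL", "RES", "OFC",
            "SA", "NoB", "BCR", "FAR", "HAVE", "HSTD", "BSC"]
TARGETS = ["RPV", "TL", "CL", "HL"]

# Pearson correlation between impact factors and UBEM outputs (cf. Fig. 7).
corr = df[FEATURES + TARGETS].corr(method="pearson")
print(corr.loc[FEATURES, TARGETS].round(2))  # factor-vs-output block of the matrix

# Max-min (min-max) normalization of the whole dataset before training.
scaler = MinMaxScaler()
df[FEATURES + TARGETS] = scaler.fit_transform(df[FEATURES + TARGETS])
```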

3.5 Explainable machine learning modeling

3.5.1 Ensemble learning method

We employed machine learning to model the nonlinear relationship between the 14 impact factors and the outputs of urban building energy modeling. Ensemble learning is a machine learning paradigm in which multiple weak learners are combined to achieve better predictive performance than could be obtained from any of the constituent learners alone. The efficacy of ensemble learning methods in predicting building energy consumption has been well established. In this study, we focused on two prominent ensemble learning methods, the Bagging method and the Boosting method (see Fig. 5). The Bagging method trains weak learners in parallel using subsets of the data and aggregates their predictions through a deterministic averaging process. In contrast, the Boosting method trains weak learners sequentially. During this process, Boosting iteratively fits a weak learner, incorporates it into the ensemble model, and then "updates" the training dataset to emphasize the strengths and weaknesses of the current ensemble model when fitting the subsequent base model. While the primary objective of the Bagging approach is to produce an ensemble model with reduced variance (enhancing stability), the Boosting approach aims to yield a model with diminished bias (increasing accuracy).

Figure 5: Ensemble learning method. a Bagging method, b Boosting method

In this paper, we evaluated six ensemble models. For the Bagging method, we considered Bagging Regression, Extra Trees, and Random Forest; for the Boosting method, we looked at Gradient Boosting, AdaBoost, and XGBoost. We adopted the hold-out method to partition the simulation dataset, allocating 70% for training and 30% for testing. Training was conducted using the Scikit-learn machine learning library. Model performance was assessed using the coefficient of determination (R²) and the mean square error (MSE). The selection of the optimal model was based on both model comparison and hyperparameter optimization.
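A hedged sketch of this comparison for a single target (TL is used here), continuing from the normalized frame above; scikit-learn defaults stand in for the paper's hyperparameter optimization:

```python
import xgboost as xgb
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

MODELS = {
    "Bagging": BaggingRegressor(random_state=0),
    "Extra Trees": ExtraTreesRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "XGBoost": xgb.XGBRegressor(random_state=0),
}

X, y = df[FEATURES], df["TL"]  # one of RPV / TL / CL / HL
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 70/30 hold-out split

for name, model in MODELS.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name:>17}  R2={r2_score(y_test, pred):.3f}  "
          f"MSE={mean_squared_error(y_test, pred):.4f}")
```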

3.5.2 Model explanation method

We used the SHAP (SHapley Additive exPlanations) library to explain the performance of the best-trained model, aiming to discern the contribution of the 14 features to the model. SHAP is a model interpretation method rooted in cooperative game theory, which quantifies the marginal contribution of features to the model output by computing Shapley values. SHAP constructs an additive interpretation model in which all features are treated as "contributors". For each prediction sample, the model produces a prediction value, and the Shapley value is the value assigned to each feature in that sample, representing the feature's contribution or importance. We used SHAP to determine a feature importance ranking for the best model.
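Continuing the sketch, the SHAP step for the fitted XGBoost model might look as follows; the mean absolute Shapley value per feature gives the global ranking used in Fig. 9:

```python
import pandas as pd
import shap

best_model = MODELS["XGBoost"]  # fitted in the previous sketch
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)

# Global importance = mean |SHAP value| per feature across all samples.
importance = (pd.DataFrame(shap_values, columns=FEATURES)
              .abs().mean()
              .sort_values(ascending=False))
print(importance)

# Beeswarm summary of per-sample Shapley values (cf. Fig. 10).
shap.summary_plot(shap_values, X_test, feature_names=FEATURES)
```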

3.6 Model generalization

To improve the generalizability of the machine learning models, we aimed to simplify the model inputs while preserving prediction accuracy. Reducing the number of feature inputs simplifies the model and also reduces the complexity of data collection. However, a decrease in the number of features may compromise the model's accuracy. Thus, we sought to develop a model that strikes a balance between the number of features and model accuracy, which we termed "the balanced regressor".

Based on the ranking of the feature contributions, we identified the most influential impact factors on RPV, TL, CL, and HL. We then examined the effect of varying the number of feature inputs on model accuracy, using five feature selection schemes:

  • 14 features: all features;
  • 13 features: all features except LUT;
  • 9 features: the top 9 features for RPV, TL, CL, and HL, respectively;
  • 5 features: the top 5 features for RPV, TL, CL, and HL, respectively;
  • 3 features: the top 3 features for RPV, TL, CL, and HL, respectively.

Finally, the balanced regressor, which offers the best accuracy, was employed to predict the building energy consumption in the remaining urban neighborhoods (the prediction dataset) in the Shanghai central area; a sketch of this selection loop is given below.
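For brevity the sketch simply takes the top k features of the SHAP ranking from the previous snippet, whereas the paper's 13-feature variant specifically drops LUT:

```python
# 'importance.index' is the SHAP ranking from the previous sketch,
# ordered from most to least influential feature.
for k in (14, 13, 9, 5, 3):
    cols = list(importance.index[:k])
    model = xgb.XGBRegressor(random_state=0)
    model.fit(X_train[cols], y_train)
    r2 = r2_score(y_test, model.predict(X_test[cols]))
    print(f"top-{k:>2} features: test R2 = {r2:.3f}")
```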

To ensure the reliability of model generalization, it is essential to test whether the input features of the training data and the generalization data follow the same distribution. Only when the distributions of the two sets of features align or are similar can the machine learning model be reliably deployed.
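The paper performs this check visually (Fig. 11); a two-sample Kolmogorov-Smirnov test per feature is one way to make it explicit. Here sim_df and pred_df are assumed frames holding the simulated (training) and prediction neighborhoods:

```python
from scipy.stats import ks_2samp

balanced_features = list(importance.index[:9])  # the 9 retained features

for col in balanced_features:
    stat, p = ks_2samp(sim_df[col], pred_df[col])
    verdict = "similar" if p > 0.05 else "possible distribution shift"
    print(f"{col:>5}: KS={stat:.3f}  p={p:.3f}  -> {verdict}")
```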

4 Results and discussion

4.1 Description of simulation results

Figure 6 presents the simulation results for RPV, TL, CL, and HL. The overall distribution of RPV, TL, CL, and HL appears balanced across different LUTs. This suggests the feasibility of employing machine learning models with consistent weights to predict across various LUTs. The RPV for the majority of urban neighborhoods falls between 0 and 10,000,000 kWh, TL ranges from 0 to 800 kWh/m2, CL from 0 to 200 kWh/m2, and HL from 0 to 100 kWh/m2. The most pronounced variation across LUTs is observed in HL, as illustrated in Fig. 6(d). The median HL of LU-1 and LU-5, predominantly residential buildings, is considerably higher than that for LU-2 to LU-4, which are primarily public buildings. This discrepancy can be attributed to Shanghai's climate, characterized by hot summers and cold winters. The city lacks centralized heating during winter, yet most residential buildings employ heating equipment, leading to a surge in HL.

Figure 6: UBEM output results of urban neighborhoods on different LUTs. a RPV on different LUTs, b TL on different LUTs, c CL on different LUTs, d HL on different LUTs

Figure 7 illustrates the Pearson correlations for the variables within the simulated dataset. The correlation matrix for the complete dataset reveals a pronounced correlation between RPV and both SA (0.75) and NoB (0.73) (see Fig. 7(a)). TL (0.66), CL (0.70), and HL (0.43) all correlate positively with REST. This indicates that the prevalence of restaurants significantly influences the building load within urban neighborhoods. This observation is further corroborated by Table 5 in Appendix, which indicates that restaurants have greater equipment power, gas power, and hot water usage than other building functions. Both RES and OFC exhibit significant negative correlations with TL, CL, and HL, suggesting that neighborhoods with more residential and office buildings tend to have lower total energy consumption. BSC displays a notable correlation with HL (0.38), aligning with conclusions from previous research. Moreover, the observed covariance between certain impact factors suggests potential feature redundancy.

Figure 7: Pearson Correlation Coefficient (PCC) between UBEM outputs and impact factors. a PCC matrix of the total simulated data, b PCC matrix of simulated data on LU-1, c PCC matrix of simulated data on LU-2, d PCC matrix of simulated data on LU-3, e PCC matrix of simulated data on LU-4, f PCC matrix of simulated data on LU-5

Figure 7(b-f) shows the Pearson coefficients of the variables in the sub-datasets corresponding to different LUTs. Generally, they mirror the patterns seen in Fig. 7(a), albeit with variations in correlation strength. In the correlation matrices for LU-1 and LU-5 (see Fig. 7(b, f)), RES demonstrates negative correlations with TL, CL, and HL. However, this trend diminishes in the correlation matrices for LU-2 to LU-4 (see Fig. 7(c-e)). In contrast, OFC exhibits significant correlations with TL, CL, and HL solely in LU-4.

Figure 8: Regression performance of XGBoost on different LUTs. a trained model predicting RPV, b trained model predicting TL, c trained model predicting CL, d trained model predicting HL

While the correlation matrix offers insights into the significance of the impact factors, the correlations between the majority of these factors and the UBEM simulation outputs are negligible. This might be attributed to the Pearson method's inability to capture the nonlinear interactions inherent in real-world data combined with a physics-based simulation process. Consequently, there is a compelling need to further investigate the contribution of the impact factors to the UBEM simulation outputs using interpretable machine learning.

4.2 Results of machine learning modeling

4.2.1 Performance of ensemble models

We evaluated six ensemble models, leading to a total of 24 machine learning training sessions for predicting RPV, TL, CL, and HL. The detailed results are presented in Tables 6, 7, 8, and 9 in Appendix. A negative R² value indicates that the model's fit is worse than a simple mean model, highlighting its unsuitability for the given data. The results from the training and test sets were used jointly to evaluate the performance of the models. Overall, the three Boosting algorithms outperformed the Bagging methods in this study, with XGBoost achieving the best performance in predicting the four simulated outputs. The R² of RPV is the highest, reaching 0.987 and 0.914 for the training and test sets, respectively (Table 6 in Appendix), which is significantly higher than the prediction accuracy for TL, CL, and HL. This might be attributed to the low complexity of the RPV calculation, which solely involves irradiance calculation and is influenced only by BM. In contrast to RPV, energy consumption simulation involves highly complex physical models and is affected by both LU and BM. The test set R² of XGBoost for predicting TL (0.674) was slightly lower than for CL (0.685) and HL (0.749) (see Tables 6, 7, 8, 9 in Appendix), potentially due to the increased uncertainty in components of TL other than CL and HL, such as equipment load. The impact of equipment use on neighborhood energy consumption has been discussed above, and this also highlights the necessity of predicting the components of energy consumption separately.

As the correlation analysis reveals, the impact factors have varying effects on the four UBEM outputs across different LUTs. We therefore trained the best-performing XGBoost model separately on the LU-1 to LU-5 sub-datasets. The model performance is given in Table 2. The results indicate that the model achieves the best performance on the LU-1 sub-dataset. The training set R² for predicting RPV is 0.910 (Table 2), slightly lower than the training set R² for the full dataset (Table 6 in Appendix). When trained on the LU-1 subset, the accuracy of the models predicting TL, CL, and HL markedly outperforms those trained on the complete dataset. Specifically, for TL, the test set R² is 0.726 on the LU-1 subset, as opposed to 0.674 on the full dataset; for CL, it is 0.751 on the LU-1 subset versus 0.685 on the full dataset; and for HL, it is 0.792 on the LU-1 subset compared to 0.749 on the full dataset. This suggests that RPV's prediction is more influenced by data volume and is less sensitive to LUT variations than TL, CL, and HL. Notably, TL, CL, and HL predictions exhibit pronounced accuracy disparities across different LUTs, with enhanced accuracy particularly in LU-1 and LU-5. This could be attributed to LU-1 and LU-5 being predominantly residential, leading to more consistent building energy consumption. In contrast, LU-2 to LU-4, which comprise more public buildings, display greater energy consumption variability across different building functions.

4.2.2 Explainable analysis of the best model

Figure 9 illustrates the global feature importance of the optimal model. Here, the global importance of each feature is determined by the average absolute value of that feature's SHAP value across all samples. Figure 9(a-d) highlights the varying contributions of each feature to the XGBoost algorithm for different prediction objectives. For RPV, BM impact factors, including SA, BCR, and NoB, play a pivotal role in prediction outcomes, whereas LU impact factors exert minimal influence on the model output (Fig. 9(a)). For TL and CL, REST emerges as the most influential contributor to the model output (Fig. 9(b-c)), aligning with the insights from the Pearson correlation matrix. The SHAP analysis further elucidates the influence of BM on the UBEM outputs, in contrast with the correlation analysis. For instance, BSC significantly affects TL, CL, and HL (ranking in the top three), while HAVE notably influences both TL and HL (also ranking in the top three).

Figure 9: Feature importance of the best model. a feature importance for predicting RPV, b feature importance for predicting TL, c feature importance for predicting CL, d feature importance for predicting HL

Figure 10 displays the Shapley value for each feature across all samples, highlighting the significance of each feature and how the magnitude of a sample's feature value influences the model. The feature ranking further underscores the contribution of these features to the model, often referred to as feature importance. Each dot represents a sample, with the color denoting the magnitude of the feature value: red signifies a higher feature value, while blue indicates a lower one. These color variations help elucidate how shifts in feature values impact the model's output. Moreover, broader regions signify a clustering of numerous samples.

Figure 10: SHAP summary of the best model. a SHAP summary for predicting RPV, b SHAP summary for predicting TL, c SHAP summary for predicting CL, d SHAP summary for predicting HL

Regarding RPV (Fig. 10(a)), samples with larger SA values (represented by red dots) exert a pronounced positive influence on the model's output. Conversely, when the SA value is minimal (blue dots), its impact on the model is relatively muted. Additionally, BCR values exhibit a balanced effect on the model: larger BCR values amplify the positive effect on the model's output, while smaller BCR values enhance its negative effect. For TL and CL (Fig. 10(b-c)), samples with a higher REST value predominantly boost the model's output. Yet, the majority of samples possess modest REST values (the cluster of blue dots), consistently exerting a negative influence on the model's output. For HL (Fig. 10(d)), samples with a larger BSC value significantly influence the model's output, whereas those with a smaller BSC value have a diminished impact. Furthermore, samples with elevated HAVE values have a restrained positive effect on the model, while those with reduced HAVE values considerably dampen the model's output.

Drawing on the results of the best model's interpretive analysis, we identified and utilized the most contributive features for RPV, TL, CL, and HL. We then trained the XGBoost model, aiming to achieve the “balanced regressor”—a model that maximizes prediction accuracy using the fewest features. Table 3 presents the R² for both the training and test sets. The results indicate that the XGBoost model trained using 9 features delivers the best overall performance and is thus designated as the “balanced regressor”. Notably, it surpasses the accuracy of the best model (trained using 14 features) in predicting RPV and HL on the test set. This underscores the presence of redundant features in the initial dataset for various prediction objectives. Consequently, we employed the balanced regressor for model generalization.

4.3 Results of model generalization

4.3.1 Results of the same-distribution test

For effective model generalization in machine learning, it is imperative to ensure that the input features of the training data and the generalized data share the same distribution. We analyzed the data distributions of all input features of the balanced regressor, as derived in the previous section. In Fig. 11, we compare the distributions of the various indicators for the training and generalization sets. The features REST, HOSP, SCH, MALL, RES, and OFC are aggregated into a single indicator, FMD, while distribution comparisons for the remaining BM impact indicators are also presented. The results show only minor discrepancies in data distribution. The primary distinction lies in the volume of data; however, the domain of the training data fully encompasses that of the generalization set. This indicates that the model trained on the training set can be reliably applied to the generalization set.

Figure 11: Same-distribution test

4.4 Energy prediction results for Shanghai central area

We deployed the balanced regressor on the generalization set, aiming to swiftly estimate the spatial distribution of building energy consumption and PV generation potential in the Shanghai central area. Figure 12 shows all the input features for both the simulated dataset and the generalization set. Figure 13 presents the predictions for RPV, TL, CL, and HL in the Shanghai central area. The results offer valuable insights into urban decarbonization. For instance, distributed PV development projects in the city can be prioritized in the hotspots shown in Fig. 13(a); the hotspots in Fig. 13(b-d) identify energy-intensive urban neighborhoods that require immediate low-carbon retrofitting. The overlapping hotspots in Fig. 13(a) and (b-d) suggest that a large amount of PV energy can be consumed locally, forming a foundation for the development of PV infrastructure, including energy storage stations. In the early stages of urban planning, projections of energy consumption for buildings in expansive urban neighborhoods can be visualized against local baselines of energy consumption or carbon emissions, ensuring the continued relevance of this methodology across different cities.

Figure 12: Impact factors per urban neighborhood for model generalization. a LU, b FMD, representing the 6 building function ratios, c SA, d NoB, e BCR, f FAR, g HAVE, h HSTD, i BSC

Figure 13: Predicted urban energy use per urban neighborhood in the Shanghai central area. a RPV prediction, b TL prediction, c CL prediction, d HL prediction

5 Discussion

In this study, we introduce a method that integrates physics-based approaches with data-driven techniques, employing machine learning to predict energy consumption across large-scale urban neighborhoods. Our proposed method offers a substantial time benefit compared to the traditional UBEM approach. In this study, simulating RPV, TL, CL, and HL for a single neighborhood takes approximately 5 min (using an Intel 13th-gen Core i9 and an RTX 2080). With a simulation database comprising 2,702 samples, the total time amounts to roughly 225 h, or 14,400 core-hours (utilizing 64 cores). The time taken for model training and generalization is minimal. The simulated data account for 14.5% of all urban neighborhoods in the Shanghai central area, meaning the application of machine learning results in a time saving of 85.5%. Moreover, the interpretive outcomes enable the identification of the optimal prediction with the fewest features. The balanced regressor predicts RPV, TL, CL, and HL with test set accuracies of 0.956, 0.674, 0.608, and 0.762, yielding an average test set accuracy of 0.7725. This implies that our energy consumption assessment method has a maximum error margin of 22.75%. Future endeavors may further reduce this error by incorporating more simulation data and refining the model. Semi-supervised learning and few-shot learning may offer avenues for further enhancing workflow efficiency and model accuracy in future investigations.

To ensure clarity, it is imperative to elucidate the reliability and applicability of our model. The framework proposed in this study is apt for both planned and unplanned design communities, primarily because the predictors for energy consumption encompass architectural functions and morphological features. These relationships, rooted in thermodynamics, are embedded within the UBEM. Machine learning, with its prowess in fitting non-linear relationships, explicitly manifests these associations. When extending the application across regions, it becomes essential to rigorously assess the alignment between the distributions of the training and generalization sets, stemming from the inherent assumption of independent and identically distributed samples in machine learning algorithms. This signifies that if the architectural functions and morphology of the prediction region deviate significantly from the training set, the predictions might falter. At the algorithmic level, transfer learning could potentially mitigate the accuracy losses due to distribution disparities. Further enhancements can be introduced at the data level by augmenting the training samples. Energy consumption habits at end-use terminals vary across regions, and this variation might be reflected in the parameter settings of the UBEM during the preparation of the training set. For cross-regional applications, settings should be aligned with the local energy consumption simulation standards. Moreover, climatic factors play a pivotal role in energy consumption simulations; hence, when applying this method in diverse regions, it's crucial to incorporate local meteorological data.

Our methodology offers a viable approach to estimating building energy consumption at the urban scale, especially when data availability is limited. Within the scope of this study, the balanced regressor uses nine indicators for modeling; in practice, underdeveloped regions may have even fewer indicators available. Our method can be effectively combined with workflows that use remote sensing and deep learning to identify building footprints, enabling the estimation of building energy consumption from a minimal set of architectural features and thereby supporting sustainable energy development in less developed areas.

6 Conclusion

The aim of this paper is to predict urban building energy consumption in the Shanghai central area and to establish a robust method for predicting building energy consumption at the city scale. We collected a total of 18,689 urban neighborhoods, designating 14.5% as the simulation dataset and the remaining 85.5% as the prediction dataset. The simulation dataset served for urban building energy modeling and machine learning model training, while the prediction dataset was reserved for generalization of the machine learning models. The urban building energy consumption simulations were executed in batches using Dragonfly. We compiled 14 factors related to land use and building morphology of urban neighborhoods as input features for machine learning and compared six prevalent ensemble learning algorithms. The optimal model was analyzed with SHAP to derive a feature importance ranking of the model output, and the balanced regressor was then defined as the model achieving optimal performance with the fewest input features; a sketch of this ranking-and-retraining step follows. Applied to the prediction dataset, this balanced regressor enabled a rapid estimation of building energy consumption in the Shanghai central area.
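
To make the ranking-and-retraining step concrete, the sketch below trains an XGBoost regressor on all 14 factors, ranks features by mean absolute SHAP value, and retrains on the top nine, in the spirit of the balanced regressor. The function name, data variables, and the use of R² as the "accuracy" metric are our assumptions, not details confirmed by the text:

```python
# Sketch: SHAP-based feature ranking followed by retraining on the
# k most influential features. `X` is assumed to be a pandas DataFrame
# of the 14 factors and `y` one simulation target (e.g., RPV).
import numpy as np
import shap
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

def balanced_regressor(X, y, k=9, seed=42):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)

    # Full model on all 14 factors.
    full = XGBRegressor(random_state=seed).fit(X_tr, y_tr)

    # Mean |SHAP| per feature gives a global importance ranking.
    shap_values = shap.TreeExplainer(full).shap_values(X_tr)
    ranking = np.argsort(np.abs(shap_values).mean(axis=0))[::-1]
    top_k = X.columns[ranking[:k]]

    # Retrain using only the k most influential features.
    slim = XGBRegressor(random_state=seed).fit(X_tr[top_k], y_tr)
    return slim, top_k, slim.score(X_te[top_k], y_te)  # test-set R^2
```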

The findings indicate that the Boosting ensemble learning model, specifically XGBoost, delivers superior performance, with test set accuracies of 0.914, 0.674, 0.685, and 0.749 for predicting RPV, TL, CL, and HL, respectively. The feature importance ranking varied notably across prediction objectives. The test set accuracy of the balanced regressor, using the nine most influential features to predict RPV, TL, CL, and HL, stands at 0.956, 0.674, 0.608, and 0.762, for an average test set accuracy of 0.7725. Compared with traditional approaches, our methodology offers an 85.5% time saving with a maximum error of just 22.75%.

The present study has two primary limitations. First, the current urban energy simulation engine does not account for the effects of green spaces and water bodies. This omission may lead our model to underestimate the influence of the urban microclimate on building energy consumption in the Shanghai central area. Future work could couple the energy consumption simulation engine with tools that model hydrodynamics, mean radiant temperature, and the heat island effect. Second, the accuracy of our machine learning model needs improvement. Future work will explore the trade-off between simulation time and model prediction accuracy. Furthermore, while deep learning models might substantially boost accuracy, they require more stringent generalization assessments to prevent overfitting.

Availability of data and materials

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

The authors gratefully acknowledge the contributions of the editors and peer reviewers for their valuable feedback and suggestions. Their insights have been crucial in refining and strengthening the manuscript.

This work was supported by: Research on Multi-Modal Scenario Intelligent Simulation Information Platform for Sustainable Urban Planning and Construction under the National Key Research and Development Programme of the 14th Five-Year Plan (2022YFC3800205); the National Natural Science Foundation of China under Grant No. 52278041 and the Fundamental Research Funds for the Central Universities; and the International Knowledge Centre for Engineering Sciences and Technology (IKCEST) under the Auspices of UNESCO, Beijing 100088, China.

Author information

Authors and Affiliations

College of Architecture and Urban Planning, Tongji University, 1239 Siping Road, Shanghai, People’s Republic of China

Qingrui Jiang, Chenyu Huang, Zhiqiang Wu, Jiawei Yao, Jinyu Wang, Xiaochang Liu & Renlu Qiao

Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, 1239 Siping Road, Shanghai, People’s Republic of China

Qingrui Jiang, Zhiqiang Wu & Renlu Qiao

Contributions

Qingrui Jiang: Conceptualization, Visualization, Methodology, and Writing. Chenyu Huang: Conceptualization, Supervision, Writing, and Methodology. Zhiqiang Wu: Conceptualization, Investigation, Funding Acquisition, and Supervision. Jiawei Yao: Conceptualization, Investigation, Funding Acquisition, and Supervision. Jinyu Wang: Data curation. Xiaochang Liu: Review and editing. Renlu Qiao: Review and editing. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Zhiqiang Wu or Jiawei Yao.

Ethics declarations

Competing interests

In the interest of transparency, we disclose that Zhiqiang Wu is a co-author of this paper and also serves as the editor of Frontiers of Urban and Rural Planning. To ensure the integrity of the review process, Zhiqiang Wu will recuse themselves from any involvement in the editorial decision-making for this submission. An alternative editor has been designated to handle the peer review process for this paper. The journal's commitment to editorial independence and ethical standards will be upheld throughout the review process.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Jiang, Q., Huang, C., Wu, Z. et al. Predicting building energy consumption in urban neighborhoods using machine learning algorithms. FURP 2, 6 (2024). https://doi.org/10.1007/s44243-024-00032-3

Received: 24 May 2023

Revised: 11 January 2024

Accepted: 12 January 2024

Published: 16 February 2024

DOI: https://doi.org/10.1007/s44243-024-00032-3

Keywords

  • Urban building energy modeling (UBEM)
  • Interpretable machine learning
  • Ensemble learning
  • Shanghai central area
  • Energy consumption
