We are seeking candidates for a PhD fellowship in Computer Science, a collaboration between LS2N (France) and NII (Japan), on the following topics: Ontology Learning, Graph Embeddings and GNN, Semi-Supervised Learning, and Knowledge Graphs.
The application is available here and is open from May 2023 until a candidate is selected.
Combining knowledge graph embedding and prior-knowledge-based semi-supervised learning for ontology learning from large-scale data.
- Keywords: Ontology learning, Knowledge Graph Completion, Prior Knowledge, Clustering, Relation Prediction, Knowledge Graph Embedding, Graph Neural Network.
- Laboratory: DUKe, LS2N (Laboratory of Digital Sciences of Nantes, France), in collaboration with NII & AIST (Tokyo, Japan)
- Supervisors: Mounira Harzallah and Fabrice Guillet
- CNRS financial support: 2135 €/month (gross salary), plus NII financial support for the internship in Japan.
- Start date: 1st of October 2023
- Duration: 3 years
- Requirements:
  - Education Level: MSc
  - Field: Computer Science, Data Science, Web Science, Computational Linguistics, Artificial Intelligence
  - Candidate Profile: knowledge of data mining/machine learning; knowledge of the Semantic Web and NLP is strongly appreciated but not mandatory; knowledge of programming languages, mainly Python.
  - Language: English
- Applications will be evaluated on a rolling basis until the position is filled. Interested candidates should submit a CV, a cover letter, transcripts of records for the last three years, and the names and addresses of two references. Applications should be sent to mounira.harzallah@univ-nantes.fr and fabrice.guillet@univ-nantes.fr
PhD Description
Background. The popularity of ontologies and the easy access to a large number of textual resources have strongly motivated the automatic construction of ontologies using artificial intelligence techniques. Three types of construction approaches are distinguished: distributional approaches, knowledge graph-based approaches and pattern-based approaches [Xu et al., 2019; Chen et al., 2020]. In this thesis, we will focus on distributional approaches, and more specifically on clustering, and on graph-based approaches. Generally, clustering makes it possible to handle a large amount of data. However, it faces two main difficulties: labelling the clusters and forming semantically consistent clusters that are relevant to the ontology domain. In our previous work, we developed a prior-knowledge-driven LDA to tackle these two difficulties [Huang et al., 2021; Xu et al., 2020]. However, clustering-based approaches also suffer from the sparsity of the term representation space [Shwartz et al., 2016]. Graph-based approaches extract triples (subject, predicate, object) from texts, then align and link them to form knowledge graphs (e.g. Yago, DBpedia). They can process a large number of texts and build very large graphs, but they suffer from data heterogeneity: the same concept can be denoted by different terms in distinct triples, and the same term can have several meanings [Nguyen and Ichise, 2012; Kertkeidkachorn and Ichise, 2018].
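To make the heterogeneity issue concrete, the following is a minimal sketch in Python, not part of any existing system. The triples, the alignment table `canonical` and the helpers `align` and `build_graph` are all hypothetical and chosen only for illustration. It shows that aligning synonymous terms merges triples about the same concept, while an ambiguous term still collapses several meanings into one node.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object), as an extractor
# might produce them from text. Illustrative only.
triples = [
    ("car", "has_part", "engine"),
    ("automobile", "is_a", "vehicle"),  # same concept as "car", other term
    ("jaguar", "is_a", "car"),          # "jaguar" as a car brand
    ("jaguar", "is_a", "animal"),       # "jaguar" as an animal
]

# Hypothetical alignment table mapping surface terms to one canonical term.
canonical = {"automobile": "car"}


def align(term):
    """Return the canonical term if one is known, else the term itself."""
    return canonical.get(term, term)


def build_graph(triples):
    """Link aligned triples into a graph: subject -> {(predicate, object)}."""
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        graph[align(subject)].add((predicate, align(obj)))
    return graph


graph = build_graph(triples)

# Synonymy is resolved by the alignment: both triples now describe "car".
print(sorted(graph["car"]))
# Polysemy is not: the single node "jaguar" still mixes two meanings.
print(sorted(graph["jaguar"]))
```

In this toy setting the alignment table is given by hand; handling both cases automatically and at scale is the kind of difficulty the cited work addresses.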
PhD purpose. The purpose of this thesis is to develop a new approach for automatic ontology construction that combines semi-supervised clustering methods driven by prior knowledge (seed knowledge, local knowledge, domain knowledge, DBpedia, etc.) [Jagarlamudi et al., 2012; Xu et al., 2019; Huang et al., 2021] with knowledge graph embedding [Ebisu and Ichise, 2018]. This new approach will address the scientific challenges of data heterogeneity and data sparsity. By describing the terms to be clustered by subgraphs and their vector embeddings, the problem of text sparsity can be addressed and the quality of clusters can be improved. Graph embedding has grown rapidly in recent years [Zhang et al., 2020]. It aims to automatically learn a low-dimensional feature representation for each node in a graph. Graph embeddings are used to build machine learning models for various tasks, and our goal is to exploit them to improve ontology learning. The approach to be developed in this thesis will also infer hypernym relationships between terms within each cluster. The objective of this task is threefold: 1) to evaluate the quality of the clusters, 2) to refine their description space through an iterative loop of clustering, hypernym relation extraction and re-clustering, and 3) to evaluate and improve the quality of the knowledge graphs from which term subgraphs are extracted.
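As an illustration of the general idea only, the sketch below embeds the nodes of a toy graph in a low-dimensional space and then assigns each node to the nearest seed node. It makes strong simplifying assumptions: the embedding is a simple spectral one (eigenvectors of the graph Laplacian, scaled by the inverse square root of their eigenvalues) rather than the knowledge graph embedding models cited above, the seed-driven step is a plain nearest-seed assignment rather than the prior-knowledge-driven clustering to be developed, and the graph, the seeds and the function names `spectral_embedding` and `assign_to_seeds` are hypothetical.

```python
import numpy as np


def spectral_embedding(adjacency, dim):
    """Embed each node of a connected, undirected graph in `dim` dimensions.

    Uses the eigenvectors of the unnormalized Laplacian L = D - A for the
    `dim` smallest non-zero eigenvalues, each scaled by 1 / sqrt(eigenvalue).
    """
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    # Index 0 is the zero eigenvalue of a connected graph; skip it.
    vals = eigvals[1:dim + 1]
    vecs = eigvecs[:, 1:dim + 1]
    return vecs / np.sqrt(vals)


def assign_to_seeds(embedding, seeds):
    """Assign every node to the cluster whose seed node is nearest."""
    labels = {}
    for node, vec in enumerate(embedding):
        labels[node] = min(
            seeds,
            key=lambda name: np.linalg.norm(vec - embedding[seeds[name]]),
        )
    return labels


# Toy graph (hypothetical): two triangles {0, 1, 2} and {3, 4, 5}
# joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adjacency = np.zeros((6, 6))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

embedding = spectral_embedding(adjacency, dim=2)

# Prior knowledge (hypothetical): node 0 belongs to cluster "A", node 5 to "B".
labels = assign_to_seeds(embedding, {"A": 0, "B": 5})
print(labels)
```

Here the two seeds stand in for prior knowledge: they both name the clusters and anchor them in the embedding space, which is the role a core ontology is meant to play in the thesis.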
The positioning and significance of this research. Since ontologies are crucial for AI applications, many research studies address ontology learning. However, they investigate the sparsity and heterogeneity problems separately. The first originality of our research is to combine knowledge graph representation and prior-knowledge-driven clustering to solve the sparsity and heterogeneity problems simultaneously. Knowledge graphs and graph embedding deal with the sparsity problem, and prior-knowledge-driven clustering deals with the heterogeneity problem. The second originality of our research is to semantically enrich the graph embedding by integrating prior knowledge from the core ontology into the embedding process. Focusing on improving the embedding process itself, Sun et al. [2020] show that embedding-based approaches perform well when training is performed on the text corpus from which the graph is constructed. However, when this corpus is unavailable or small, the graph embedding is based exclusively on the graph structure, which weakens the performance of these approaches. In this case, taking into account the semantics of certain entities or properties of the graph could be a relevant way to semantically enrich the input of the graph embedding. This enrichment could be done using a domain ontology or its core ontology.
Therefore, we would like to develop an original approach benefiting on the one hand from the power of graph embedding techniques for the clustering of entities, and on the other hand from the semantic quality of ontology in order to drive and refine the learning. A core ontology will be used as a seed knowledge model to improve the quality of graph embedding as well as for clustering.