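Peking University Shenzhen Center for Internet Research and Engineering
Paper by Internet Center Postdoctoral Researcher Huang Lian'en and Student Wang Lei Accepted by a Top International Information Retrieval Conference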
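Recently, the research paper "Achieving both High Precision and High Recall in Near-duplicate Detection", by postdoctoral researcher Huang Lian'en and Wang Lei, a master's student of the class of 2006, both of the Center for Internet Research and Engineering at Shenzhen Graduate School, was accepted as a full paper by the international conference ACM CIKM 2008 (The Conference on Information and Knowledge Management). The acceptance rate of CIKM 2008 was 17% (131/772).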
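CIKM is sponsored by the Association for Computing Machinery (ACM). It aims to discuss key technologies and the latest advances in information retrieval, databases, and knowledge management, and to guide future research directions by examining high-quality theoretical findings and industrial applications. CIKM began in 1993 and has been held successfully 16 times so far. It is one of the highest-level academic conferences in information retrieval, second only to SIGIR.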
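The paper studies the detection of near-duplicate web pages, which has high application value in areas such as search engines (for example, Google). It takes the original approach of using the edit distance between web pages as the similarity measure and uses heuristic algorithms to solve the running-efficiency problem. Its detection results on article-type web pages are at an internationally leading level. All four reviewers recommended acceptance with the full score (5 points).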
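The conference will be held from October 25 to 30 in Napa Valley, California, USA. Wang Lei will register for the conference, travel to the United States to attend it, and give a presentation there. This is an important breakthrough for the Shenzhen Internet center in international academic exchange, and another significant achievement in the development of the Internet discipline at our school.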
Related information:
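The research behind the paper grew out of in-depth development work on Web Infomall (the Chinese Web information museum). Web Infomall is a system for storing and displaying the history of Chinese web pages, which the Peking University network laboratory began building in 2002 with support from the national 973 and 985 programs. It currently maintains 3 billion web pages, mostly in Chinese, and grows by an average of 45 million pages per month. The main body of the work was completed in Shenzhen after Web Infomall moved to Shenzhen Graduate School at the end of 2007.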
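The paper was supervised and strongly supported by Professor Li Xiaoming of the Peking University network laboratory and Lei Kai of our school's Internet center, who provided valuable comments and suggestions during the research. In the five years since its founding, the Internet center has achieved notable results in applied network engineering research. It independently developed the Tianwang Maze P2P file download system (a national 863 project), a carrier-grade P2P live video, video-on-demand, and download system (a sub-topic of a key supporting project under the national Eleventh Five-Year Plan), and next-generation Internet (IPv6) research projects. It also extended the research and development of the Tianwang search engine (a national 973 project) and completed the 2007 provincial-ministerial industry-academia-research demonstration project "Intelligent Search Engine for the Communications Field".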
Paper by Dr. Huang Lian’en and Master's Student Wang Lei (Class of 2006) of CIRE Accepted by a Top-Level International Information Retrieval Conference
A research paper from the Center for Internet Research and Engineering (CIRE) of the Shenzhen Graduate School of Peking University, “Achieving both High Precision and High Recall in Near-duplicate Detection”, was accepted as a full paper by the 17th ACM Conference on Information and Knowledge Management (CIKM 2008). The paper was written by postdoctoral researcher Lian’en Huang and master's student Lei Wang. The acceptance rate of CIKM 2008 was 17% (131/772).
CIKM is one of the top-level conferences in information retrieval, second only to SIGIR. It is organized by the Association for Computing Machinery (ACM). The conference aims to identify challenging problems in the development of future knowledge and information systems, and to shape future research directions by soliciting and reviewing high-quality applied and theoretical research.
The paper focuses on near-duplicate document detection, an active research area with great application value in search engines such as Google. Dr. Huang Lian’en and Wang Lei adopted edit distance as the similarity measure and used heuristic methods to solve the efficiency problem when processing article-type web pages. All four reviewers recommended acceptance with full marks.
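For readers unfamiliar with edit distance as a similarity measure, the following is a minimal illustrative sketch in Python. It is not the authors' algorithm: it computes the plain Levenshtein distance between two word sequences with the textbook dynamic program and does not include the efficiency heuristics the paper relies on. The function names `edit_distance` and `similarity`, the whitespace tokenization, and the normalization by the longer length are all assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    Illustrative only; not the method used in the paper.
    """
    # Keep the shorter sequence in the inner loop to reduce memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(text_a, text_b):
    """Similarity in [0, 1] based on word-level edit distance.

    Assumption: tokens are whitespace-separated words, and the
    distance is normalized by the longer token sequence.
    """
    tokens_a = text_a.split()
    tokens_b = text_b.split()
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - edit_distance(tokens_a, tokens_b) / longest
```

Under this sketch, two pages would be treated as near-duplicates when `similarity` exceeds a chosen threshold. The dynamic program takes time proportional to the product of the two lengths, which is the kind of cost that makes efficiency a concern at the scale of a web archive.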
The conference will take place from October 25 to 30 in Napa Valley, CA, USA. Wang Lei will attend and give an oral presentation in person. This is a significant milestone in the internationalization of CIRE’s academic activities and an important achievement in the development of Internet research at SZPKU.
Related:
The paper grew out of work on Web Infomall (the Chinese web information museum). Web Infomall is a system for archiving and displaying historical web pages, supported by the National “973 Project” and National “985 Project” and developed by the Computer Networks and Distributed System Laboratory (CNDS) of Peking University since 2002. It has stored over 3 billion web pages and expands by 45 million web pages per month. Most of the research was done in Shenzhen after Web Infomall moved to SZPKU at the end of 2007.
Professor Li Xiaoming and Deputy Director Lei Kai provided guidance and assistance throughout the research. CIRE has made substantial progress in Internet research and engineering since its establishment in 2003, including the most popular and advanced Chinese P2P community engine, “Tianwang Maze” (National “863” Project), a P2P VOD system for China Telecom (a key project supported by the National Eleventh Five-Year Plan), and Next Generation Network research (including IPv6). In addition, building on the National 973 Project “Tianwang Search Engine”, it completed “The Intelligent Search Engine Used in Communication Fields”, a provincial-ministerial (Sheng-Bu) industry-academia-research project of 2007.