Centre for Clinical Research and Biostatistics
MA, Zhijie, BSc in Statistics
I am very honored to have been selected to serve as a Junior Research Assistant at the Centre for Clinical Research and Biostatistics (CCRB) this summer. Working with Prof. Benny Chung Ying Zee, the director of the CCRB at the JC School of Public Health and Primary Care, along with his team, was a very meaningful and enjoyable experience.
During this internship programme, I was asked by Prof. Zee to construct a new search engine based on the Aims academic paper database of CUHK to improve the search performance of Aims. After a discussion with Mr Steven Yuk Fai Lau of the Research Association at the CCRB, I decided to build this new search engine in Python.
I first set up a MySQL database, then used the Scrapy crawler framework to crawl the pages of CUHK’s Aims database and saved them to the newly built MySQL database. At the same time, I used NLTK to tokenize both the search keywords and the content of each paper. The code had to ensure that phrases in double quotation marks were not split but searched as a whole, consistent with the rules of the Google search engine.
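The quoted-phrase rule can be illustrated with a minimal sketch. It uses only Python's standard library rather than NLTK, and the function name `tokenize_query` is hypothetical; it is not the code written during the internship.

```python
import re

# Matches either a double-quoted phrase (captured without its quotes)
# or a run of characters that are neither whitespace nor a quote.
_TOKEN_RE = re.compile(r'"([^"]+)"|([^\s"]+)')


def tokenize_query(query):
    """Split a query into lowercase terms, keeping quoted phrases whole."""
    tokens = []
    for phrase, word in _TOKEN_RE.findall(query):
        if phrase:
            # Collapse internal whitespace so '"clinical   trial"' and
            # '"clinical trial"' yield the same token.
            normalized = " ".join(phrase.lower().split())
            if normalized:
                tokens.append(normalized)
        else:
            tokens.append(word.lower())
    return tokens
```

For example, the query `survival "clinical trial" Cox` yields three terms, with `clinical trial` kept as a single phrase. An unmatched quotation mark is simply ignored in this sketch.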
I then chose the BM25 algorithm in the Gensim package to calculate a relevance score for each paper and sorted the scores in descending order, so that the papers most closely related to the user's search keywords were displayed first. Finally, I built the web interface with Flask, Python’s lightweight web application framework, and deployed the search engine to a cloud server. The search engine supports combined retrieval over paper title, abstract and author through the new relevance algorithm, and it paginates the search results.
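The ranking and pagination steps can be sketched as follows. This is a plain-Python version of one common Okapi BM25 variant, not Gensim's implementation; the function names, the parameter defaults `k1=1.5` and `b=0.75`, and the assumption that each paper is already a list of tokens (for example, title, abstract and author tokens concatenated) are all illustrative.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per tokenized document in docs."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def search(query_terms, docs, page=1, per_page=10):
    """Return the document indices for one result page, best match first."""
    scores = bm25_scores(query_terms, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Keep only documents that matched at least one query term.
    hits = [i for i in ranked if scores[i] > 0]
    start = (page - 1) * per_page
    return hits[start:start + per_page]
```

A web view (in Flask, for instance) could call `search` with the page number taken from the request and look up the returned indices in the database to render each page of results.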
In these two wonderful months, I gained a great deal of knowledge that I had not previously learned in class. My ability to learn independently improved greatly, especially in computer programming. What impressed me most was the friendliness of the CCRB staff, whose constructive suggestions, on my work and beyond it, benefited me greatly. In particular, Dr Jack Lee and Mr Steven Yuk Fai Lau gave me a clearer understanding of my current work and future study plan. I am very grateful for their help and advice.
In addition, I would like to thank Professor Lin Yuanyuan of the Department of Statistics, and Prof. Benny Zee and Ms. Maria Ming Po Lai of the CCRB, for their help and support during my internship programme, which enabled me to complete my work successfully despite the COVID-19 epidemic in Hong Kong.
Through this programme, I now know how to learn more effectively and have also greatly improved my learning ability in fields I was not familiar with or even aware of. I am very grateful to the Statistics Department and the CCRB for providing me with this excellent experience, which will be really helpful to my future studies and career.