====== CSCI5510 Big Data Analytics ======
==== Breaking News ====
* **September 3, 2013**. The course homepage has been permanently migrated to https://www.cse.cuhk.edu.hk/csci5510/wiki/.
* **September 2, 2013**. The new semester begins.
* **September 2, 2013**. Newsgroup address: cuhk.cse.csci5510
* **September 2, 2013**. The first tutorial will be conducted on Sept. 10. There is no tutorial in the first week.
* **September 3, 2013**. The tutorial classroom is YIA LT7.
===== 2013-14 Term 1 =====
| ^ Lecture ^ Tutorial ^
^ Time | M2-4, 9:30 am - 12:30 pm | T3 10:30 am - 11:15 am |
^ Venue | KKB101 | YIA LT7 |
The Golden Rule of CSCI5510: No member of the CSCI5510 community shall take unfair advantage of any other member of the CSCI5510 community.
====== Course Description ======
This course teaches students the state of the art in big data analytics, including techniques, software, applications, and perspectives on massive data. The class will cover, but is not limited to, the following topics: distributed file systems such as Google File System, Hadoop Distributed File System, and CloudStore, together with map-reduce technology; similarity search techniques for big data such as minhash and locality-sensitive hashing; specialized processing and algorithms for data streams; big data search and query technology; big graph analysis; and recommendation systems for Web applications. The applications may involve business applications such as online marketing, computational advertising, location-based services, social networks, recommender systems, and healthcare services. Also covered are scientific and astrophysics applications such as environmental sensor applications and nebula search and query.
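This course aims to teach students state-of-the-art analytics for big data, including techniques, software, applications, and perspectives. The course content will include, but is not limited to, the following: distributed file systems such as Google File System, the Hadoop file system, and CloudStore, together with Map-reduce technology; similarity search techniques for big data, such as minhash and locality-sensitive hashing; specialized processing methods and algorithms for data streams; big data search and query technology; and advertising management and recommender systems in Internet applications. The applications covered in this course may include business applications such as online marketing, computational advertising, location-based services, social networks, recommender systems, and healthcare services, as well as applications in science and astrophysics, such as environmental sensor applications and nebula search and query.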
===== Learning Objectives =====
- To understand the current key issues in big data and the associated business/scientific data applications
- To teach the fundamental techniques and principles for achieving big data analytics with scalability and streaming capability
- To interpret business models and scientific computing results
- To apply software tools for big data analytics
===== Learning Outcomes =====
At the end of the course of studies, students will have acquired the ability to:
- Understand the key issues in big data and the associated applications in intelligent business and scientific computing.
- Acquire fundamental enabling techniques and scalable algorithms in big data analytics.
- Interpret business models and scientific computing paradigms, and apply software tools for big data analytics.
- Gain an adequate perspective on big data analytics in marketing, financial services, health services, social networking, astrophysics exploration, and environmental sensor applications.
===== Learning Activities =====
- Lectures
- Tutorials
- Web resources
- Projects
- Presentations
- Lab Reports
- Examinations
====== Personnel ======
| ^ Lecturer ^ Lecturer ^ Tutor ^ Tutor ^
^ Name | [[https://www.cse.cuhk.edu.hk/irwin.king/home|Irwin King]] | [[http://www.cse.cuhk.edu.hk/~lyu|Michael R. Lyu]] | Guang Ling | Chen Cheng |
^ Email | king AT cse.cuhk.edu.hk | lyu AT cse.cuhk.edu.hk | gling AT cse.cuhk.edu.hk | ccheng AT cse.cuhk.edu.hk |
^ Office | Rm 908 | Rm 927 | Rm 1024 | Rm 1024 |
^ Telephone | 3943 8398 | 3943 8429 | 3943 4252 | 3943 4252 |
^ Office Hour(s) | TBA | 10:00-12:00 Tuesday | TBA | TBA |
Note: This class will be taught in English. Homework assignments and examinations will be conducted in English.
====== Syllabus ======
The PDF files were created with Acrobat 6.0. Please obtain the correct version of the [[http://www.adobe.com/prodindex/acrobat/readstep.html#reader | Acrobat Reader]] from Adobe.
^ Week ^ Date ^ Topics ^ Tutorials ^ Homework & Events ^ Resources ^
| 1 | 2/9 | Introduction and Motivation \\ \\ {{:teaching:csci5510:01.pptx|}} | No Tutorial | | [[http://infolab.stanford.edu/~ullman/mmds/ch1.pdf|Ch. 1 of MMDS]] |
| 2 | 9/9 | MapReduce\\ \\ [[|02-MapReduce.pdf]] | \\ \\ | \\ \\ | [[http://infolab.stanford.edu/~ullman/mmds/ch2.pdf|Ch. 2 of MMDS]] \\ [[http://infolab.stanford.edu/~ullman/mmds/ch6.pdf|Ch. 6 of MMDS]] |
| 3 | 16/9 | Locality Sensitive Hashing\\ \\ [[|03-lsh.pdf]] | \\ \\ | | [[http://infolab.stanford.edu/~ullman/mmds/ch3.pdf|Ch. 3 of MMDS]] |
| 4 | 23/9 | Mining Data Streams\\ \\ [[|04-stream.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch4.pdf|Ch. 4 of MMDS]] |
| 5 | 30/9 | Scalable Clustering \\ \\ [[|05-clustering.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch7.pdf|Ch. 7 of MMDS]] |
| 6 | 7/10 | Dimensionality Reduction \\ \\ [[|06-DR.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch11.pdf|Ch. 11 of MMDS]] |
| 7 | 14/10 | Public Holiday | | | |
| 8 | 21/10 | Recommender Systems/Matrix Factorization \\ \\ [[|07-mf.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch9.pdf|Ch. 9 of MMDS]] |
| 9 | 28/10 | Massive Link Analysis \\ \\ [[|08-link.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch5.pdf|Ch. 5 of MMDS]] |
| 10 | 4/11 | Mid-term | | | |
| 11 | 11/11 | Analysis of Massive Graphs \\ \\ [[|09-graph.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch10.pdf|Ch. 10 of MMDS]] |
| 12 | 18/11 | Large Scale SVM\\ \\ [[|10-svm.pdf]] | | \\ | [[http://www.svms.org/tutorials/Burges1998.pdf|SVM tutorial]] |
| 13 | 25/11 | Online Learning \\ \\ [[|11-ol.pdf]] | | | [[http://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf|Online learning survey]] |
====== Class Project ======
===== Class Project Presentation Schedule =====
* TBA
===== Class Project Presentation Requirements =====
====== Examination Matters ======
===== Examination Schedule =====
| ^ Time ^ Venue ^ Notes ^
^ Midterm Examination | Nov. 4, 9:30am-12:00 noon | TBA | TBA |
^ Final Examination | TBA | TBA | TBA |
* [[http://rgsntl.rgs.cuhk.edu.hk/rws_prd_life/main1.asp|CUHK Registration and Examination]]
===== Written Midterm Matters =====
- The midterm will test your knowledge of the materials.
- Answer all questions using the answer booklet. There will be more available at the venue if needed.
- Write legibly. Anything we cannot decipher will be considered incorrect.
- One A4-sized cheat-sheet page is allowed.
====== Grade Assessment Scheme ======
^ Homework\\ Assignments ^ Mid-term\\ Examination ^ Project ^
| 20% | 30% | 50% |
- Assignments (20%)
  - Written assignments
  - Coding
- Mid-term Examination (30%)
- Project (50%)
  - Proposal
  - Presentations
  - Report
====== Reference Books ======
====== FAQ ======
- **Q: What is the departmental guideline on plagiarism?**\\ A: If a student is found plagiarizing, his/her case will be reported to the Department Discipline Committee. If the case is proven after deliberation, the student will automatically fail the course in which he/she committed plagiarism. The definition of plagiarism includes copying the whole or parts of written assignments, programming exercises, reports, quiz papers, and mid-term examinations. The penalty will apply to both the one who copies the work and the one whose work is copied, unless the latter can prove that his/her work was copied without his/her knowledge. Furthermore, including others' works or results without citation in assignments and reports is also regarded as plagiarism, with a similar penalty for the offender. A student caught plagiarizing during tests or examinations will be reported to the Faculty Office and appropriate disciplinary authorities for further action, in addition to failing the course.
====== Resources ======
- [[http://pajek.imfm.si/doku.php|Pajek, a network analysis and visualization program.]]
- [[http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm|Package for Large Network Analysis]]
- [[http://www.analytictech.com/downloaduc6.htm|UCINET 6]]
- [[http://www.analytictech.com/Netdraw/netdraw.htm|Netdraw]]
- [[http://stat.gamma.rug.nl/stocnet/|StOCNET]]
===== Big Data Analytics =====
* http://infolab.stanford.edu/~ullman/mmds.html
* http://cs246.stanford.edu/
===== Graph Mining =====
* http://www.cs.cmu.edu/~deepay/mywww/papers/csur06.pdf
* http://cs.stanford.edu/people/jure/talks/www08tutorial/
* http://www.xifengyan.net/tutorial/KDD08_graph_partI.pdf
* http://www.xifengyan.net/tutorial/KDD08_graph_partII.pdf
===== Link Analysis =====
* http://analytics.ijs.si/events/Tutorial-TextMiningLinkAnalysis-KDD2007-SanJose-Aug2007/
* http://www.sigkdd.org/explorations/issues/7-2-2005-12/1-Getoor.pdf
* http://www.ncjrs.gov/pdffiles1/nij/grants/219552.pdf
* http://delab.csd.auth.gr/~dimitris/papers/ENVO07LARskm.pdf
===== Learning to Rank =====
* http://www2009.org/pdf/T7A-LEARNING%20TO%20RANK%20TUTORIAL.pdf
* http://radlinski.org/papers/LearningToRank_NESCAI08.pdf
* http://www.aclweb.org/anthology/P/P09/P09-5005.pdf
* http://www.cse.iitb.ac.in/~soumen/doc/www2007/TutorialSlides.pdf
===== Recommender Systems =====
* http://en.wikipedia.org/wiki/Recommender_system
* http://www.deitel.com/ResourceCenters/Web20/RecommenderSystems/RecommenderSystemsTutorialsandWebcasts/tabid/1313/Default.aspx
* http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.99
* http://www.springerlink.com/content/n881136032u8k111/
* http://www.csd.abdn.ac.uk/~jmasthof/Publications/WPRSIUI07.pdf
===== Human Computation/Social Games =====
* http://www.gwap.com/gwap/
* http://www.cs.cmu.edu/~biglou/
===== Opinion Mining/Sentiment Analysis =====
* http://www.cs.uic.edu/~liub/FBS/opinion-mining-sentiment-analysis.pdf
* http://www.cs.cornell.edu/home/llee/omsa/omsa-published.pdf
* http://www.cs.cmu.edu/~wcohen/10-802/sentiment-sep-4.ppt
===== Visualization =====
- [[http://manyeyes.alphaworks.ibm.com/manyeyes/|Many Eyes Visualization]]
===== Programming =====
- [[http://networkx.lanl.gov/|NetworkX, a Python package for complex networks]]
- [[http://www.wolfram.com/|Mathematica from Wolfram]]
- [[http://demonstrations.wolfram.com/|Wolfram Demonstrations]]