====== crawlingRetweet(retweet_url, weibo_id) ======

===== Description =====

Extracts the retweets of an original Weibo post and inserts them into the database.

===== Parameters =====

^ Parameter ^ Necessity ^ Type ^ Description ^
| retweet_url | required | string | URL of the retweet page of the original Weibo post |
| weibo_id | required | string | ID of the original Weibo post |

===== Output =====

^ Parameter ^ Type ^ Description ^
| status | string | Running status of the crawler |

===== Implementation =====

  - masterStart(). Creates multiple processes to begin crawling data.
  - wapLogIn(). Logs in to the Sina account.
  - weiBoWapSearch(searchStr, Sid). Uses searchStr (person name or company name) and the search ID (person ID or company ID) to search for related Weibo posts.
    * extractTopic(person_name or company_name, person_id or company_id). Extracts the Weibo text and inserts it into the database.
    * getRetweet(retweet_url, weibo_id). Extracts the retweets of the original Weibo text and inserts them into the database.

===== Related Work =====

None

===== Issues About The Crawler =====

  - The Sina Weibo API is not a practical option: it requires authorization, and the crawler would not pass Sina's review and verification.
  - The crawler logs in to the Sina account using the browser's cookies.
  - The crawler uses weibo.cn instead of www.weibo.com, because the latter embeds its post data in JavaScript, which makes it difficult to extract.
  - Multiple proxies are used to keep Sina from blocking our IP.
  - To speed up the crawler, multiple processes and accounts are used.
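
The combination of multiple processes, accounts, and proxies noted in the issues list could be organized roughly as in the sketch below. It is only an illustration under stated assumptions, not the crawler's actual code: ''assign_jobs'', ''crawl_one'', and ''crawl_all'' are hypothetical names, the URLs, account names, and proxy addresses are made-up placeholders, the ''"ok"'' status value is invented for the example, and the worker performs no network access. It shows retweet URLs being paired with accounts and proxies in round-robin order and then handed to a pool of worker processes.

```python
from itertools import cycle
from multiprocessing import Pool


def assign_jobs(retweet_urls, accounts, proxies):
    """Pair each URL with an account and a proxy in round-robin order."""
    if not accounts or not proxies:
        raise ValueError("at least one account and one proxy are required")
    account_cycle = cycle(accounts)
    proxy_cycle = cycle(proxies)
    return [(url, next(account_cycle), next(proxy_cycle)) for url in retweet_urls]


def crawl_one(job):
    """Hypothetical worker: a real crawler would fetch and parse the page here."""
    url, account, proxy = job
    # Placeholder only: no request is sent. The "status" field mirrors the
    # documented output, but the value "ok" is an assumption of this sketch.
    return {"url": url, "account": account, "proxy": proxy, "status": "ok"}


def crawl_all(retweet_urls, accounts, proxies, processes=2):
    """Distribute the jobs over several worker processes."""
    jobs = assign_jobs(retweet_urls, accounts, proxies)
    with Pool(processes=processes) as pool:
        return pool.map(crawl_one, jobs)


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or credentials.
    urls = ["https://weibo.cn/placeholder-1", "https://weibo.cn/placeholder-2"]
    for result in crawl_all(urls, ["account-a", "account-b"], ["proxy-1:8080"]):
        print(result)
```

Rotating accounts and proxies per job rather than per process keeps the load on each account even when the number of URLs is not a multiple of the number of workers; a real crawler might instead pin one account to one process so that login cookies are reused.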