crawlingRetweet(retweet_url, weibo_id)
Description
Extracts the retweets of an original Weibo post and inserts them into the database.
Parameters
Parameter | Necessity | Type | Description |
---|---|---|---|
retweet_url | required | string | URL of the retweet link of the original Weibo post |
weibo_id | required | string | ID of the original Weibo post |
Output
Parameter | Type | Description |
---|---|---|
status | string | running status of the crawler |
Implementation
- masterStart(). Creates multiple processes to begin crawling data.
- wapLogIn(). Logs in to a Sina account.
- weiBoWapSearch(searchStr, Sid). Uses searchStr (a person or company name) and Sid (the search ID, i.e. the person or company ID) to search for related Weibo posts.
- extractTopic(person_name or company_name, person_id or company_id). Extracts the Weibo text and inserts it into the database.
- getRetweet(retweet_url, weibo_id). Extracts the retweets of the original Weibo post and inserts them into the database.
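The multi-process start-up described for masterStart() could look roughly like the sketch below. It is not the crawler's actual code: `partition`, `crawl_batch`, and `master_start` are hypothetical names, and `crawl_batch` is only a placeholder for the log-in, search, and extraction steps listed above.

```python
from multiprocessing import Pool


def partition(targets, n_workers):
    """Split the target list round-robin into n_workers batches."""
    return [targets[i::n_workers] for i in range(n_workers)]


def crawl_batch(batch):
    """Placeholder worker (hypothetical).

    A real worker would log in, search for each target, and store the
    extracted posts and retweets. Here it only marks each target as done.
    """
    return [(name, search_id, "done") for name, search_id in batch]


def master_start(targets, n_workers=4):
    """Start one process per non-empty batch and collect the results."""
    batches = [b for b in partition(targets, n_workers) if b]
    if not batches:
        return []
    with Pool(processes=len(batches)) as pool:
        results = pool.map(crawl_batch, batches)
    # Flatten the per-process result lists into one list.
    return [item for batch in results for item in batch]
```

When running this as a script, call `master_start(...)` under an `if __name__ == "__main__":` guard so that worker processes can be started safely on every platform.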
Related Work
None
Issues About The Crawler
- The Sina Weibo API is not very effective for this purpose: it requires authorization, and the crawler would not pass Sina's review and verification.
- The crawler uses the browser's cookies to log in to a Sina account.
- The crawler uses weibo.cn instead of www.weibo.com, because the latter's post data is embedded in JavaScript and is difficult to extract.
- Multiple proxies are used to prevent Sina from blocking our IP.
- Multiple processes and accounts are used to speed up the crawler.
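As an illustration of the cookie and proxy points above, here is a minimal sketch using only the Python standard library. The helper names, the cookie string, and the proxy addresses are placeholders and assumptions, not values taken from the crawler.

```python
import itertools
import urllib.request


def build_request(url, cookie_header):
    """Create a request that carries a cookie string copied from a browser."""
    request = urllib.request.Request(url)
    request.add_header("Cookie", cookie_header)
    return request


def build_opener(proxy_url):
    """Create an opener that routes HTTP and HTTPS traffic through one proxy."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_handler)


def rotating_openers(proxy_urls):
    """Cycle endlessly over one opener per proxy (proxy_urls must be non-empty)."""
    return itertools.cycle([build_opener(url) for url in proxy_urls])
```

A caller would then pick the next opener for each page, for example `next(openers).open(request, timeout=10)`, so that consecutive requests leave through different proxies.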