HEXO自定义域名使用HTTPS

发表于 2017-03-26

首先说明原来的使用的DNSPod可以选择来源线路配置不同的解析规则，
从而可以解决Github屏蔽百度爬虫的问题（在DNSPost处配置来源百度爬虫的解析走国内的coding.net 配置服务）
而现在要选择HTTPS就得使用国外另外一个提供DNS解析的服务（无法选择线路配置解析规则），
所以就要从百度收录和HTTPS中二选一，而我选择了HTTPS，还好还有个Google可以收录

使用cloudflare提供的免费ssl证书，简述流程如下：

设置cloudflare为你的域名的DNS解析服务器
设置cloudflare的DNS至Github的A记录
设置SSL解析模式为Flexible
设置pageRule “Always Use Https”

使用如下命令查看DNS解析情况:

1	dig sohow.cc

查看DNS解析详情:

1	dig +trace sohow.cc

保存自己的hexo框架和主题

发表于 2017-03-24

当我们在家里的电脑上安装完hexo和主题后，如果换一台电脑就不能继续写作了，
因此我把hexo框架和使用next主题以及自定义的配置文件也放在github，
首先在hexo网站的github仓库下新建一个分支，如: hexo
然后fork一份你所使用的主题，如: next
接下来你应该知道怎么做了吧，
把hexo框架代码提交到hexo分支
使用git submodule 拉取你fork的next主题

注意：

建立.gitignore,
如我的内容如下： public/
```
.idea/
.deploy_git/
db.json
```
使用时只clone hexo分支， git clone -b branch_name
clone完hexo分支后，需要 git submodule update –init , 初始化子模块next
修改submodule next前需要切换到next的master
修改submodule next后，要在hexo分支下进行一次提交

PHP封装HTTP操作类支持POST和GET

发表于 2017-03-22

PHP封装HTTP操作类支持POST和GET

<?php
class Helper_Http
{
    public static function get($url, $header = array(),&$setcookie = '', $proxy = '')
    {
        $ch = curl_init();
        $needheader = empty($setcookie) ? 0 : 1;
        $opt = array(
            CURLOPT_URL => $url,
            CURLOPT_SSL_VERIFYPEER => false,
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_FOLLOWLOCATION => 0,
            CURLOPT_HEADER=>$needheader
        );
        !empty($proxy) && $opt[CURLOPT_PROXY] = $proxy;
        //$header=array('Content-type: text/plain', 'Content-length: 100')
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
        curl_setopt_array($ch, $opt);
        $result = curl_exec($ch);
        if ($needheader == 1) {
            list($header, $body) = explode("\r\n\r\n", $result, 2);
            preg_match_all('/Set-Cookie:(.*);/iU', $header, $str);
            if (!empty($str[1])) {
                $setcookie = implode('; ', $str[1]) . ";";
            }
            return $body;
        }
        return $result;
    }
    public static function post($url, $data, $header = array(),&$setcookie = '', $proxy = '')
    {
        $ch = curl_init();
        $needheader = empty($setcookie) ? 0 : 1;
        $opt = array(
            CURLOPT_SSL_VERIFYPEER => false,
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_POST => 1,
            CURLOPT_POSTFIELDS => $data,
            CURLOPT_URL => $url,
            CURLOPT_FOLLOWLOCATION=>0,
            CURLOPT_HEADER=>$needheader
        );
        !empty($proxy) && $opt[CURLOPT_PROXY] = $proxy;
        //$header=array('Content-type: text/plain', 'Content-length: 100')
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
        curl_setopt_array($ch, $opt);
        $result = curl_exec($ch);
        if ($needheader == 1) {
            list($header, $body) = explode("\r\n\r\n", $result, 2);
            preg_match_all('/Set-Cookie: (.*;)/iU', $header, $str);
            if (!empty($str[1])) {
                $setcookie = implode('; ', $str[1]) . ";";
            }
            return $body;
        }
        return $result;
    }
}

PHP实现爬虫爬取豆瓣数据

发表于 2017-03-22

PHP实现爬虫爬取豆瓣数据，由于豆瓣有单个IP频次限制，所以使用了代理IP，代理IP的来源于网络公开免费的代理，
另一篇文章介绍了使用Python爬取代理IP, 代码中的Helper_Http参见另外一篇文章

<?php
include_once "http.php";
class Douban {
    public static function get_list($tag)
    {
        $url = 'https://movie.douban.com/j/search_subjects';
        $page_limit = 500;
        $page_start = 0;
        $param = array(
            'type' => 'tv',
            'sort'=>'time',
            'tag'=> $tag,
            'page_limit'=>500,
            'page_start'=>0
        );
        $list = array();
        do {
            $result = self::http_get($url . '?' . http_build_query($param));
            $data = json_decode($result, true);
            $page_start += $page_limit;
            $param['page_start'] = $page_start;
            if (is_array($data['subjects'])) {
                foreach ($data['subjects'] as $subject) {
                    $list[] = $subject['id'];
                }
            }
        } while (!empty($data['subjects']));
        return $list;
    }
    public static function get_info($id)
    {
        $url = 'https://api.douban.com/v2/movie/subject/';
        $result = self::http_get($url . $id);
        return $result;
    }
    public static function http_get($url, $try=10)
    {
        static $proxy = '';
        if ($try <= 0) {
            return '';
        }
        $cookie = '';
        $result = Helper_Http::get($url, array(), $cookie, $proxy);
        $data = json_decode($result, true);
        if (isset($data['msg']) && $data['code'] == 112) {
            $proxy = self::get_proxy();
            return self::http_get($url, $try-1);
        }
        return $result;
    }
    public static function get_proxy()
    {
        //http://www.sslproxies.org/
        //http://www.xicidaili.com/
        static $proxys = array(
            '27.54.185.16:3128',
            '217.33.216.114:8080',
            '144.217.248.180:8080',
            '185.107.80.44:3128',
            '200.229.202.214:8080',
            '51.141.32.241:8080',
            '94.177.243.78:8080',
            '52.58.9.56:8083',
            '201.16.140.120:8080',
            '77.73.66.26:3128',
            '103.14.8.239:8080',
            '51.15.46.137:80',
            '192.34.56.155:80',
            '54.87.138.13:8083'
        );
        static $cur = 0;
        $proxy = $proxys[$cur];
        $cur = ($cur + 1) % count($proxys);
        return $proxy;
    }
	public static function run()
	{
		$tags = array(
			'国产剧',
			'综艺'
		);
		$ids = array();
		foreach ($tags as $tag) {
			$list = self::get_list($tag);
			$ids = array_merge($ids, $list);
		}
		$ids = array_flip($ids);
		foreach ($ids as $id=>$val) {
			$info = self::get_info($id);
			echo $info . "\n";
			sleep(2);
		}
	}
}
Douban::run();

Python实现简易爬虫

发表于 2017-03-22

使用Python爬取代理IP，爬取的代理IP来源网站一个是国内的一个是国外的，注意爬取国外的需要翻墙，
如无法翻墙可选择注释爬取国外的相关代码, 使用了两个第三方库requests和BeautifulSoup4，
简单说明一下，requests库可以更方便操作HTTP请求,
BeautifulSoup4库用来解析HTML，类似于JavaScript操作DOM一样方便，免去正则匹配的繁琐

#!/usr/bin/python
#coding=utf-8
import json
import requests
from bs4 import BeautifulSoup
class Spider:
    def get_ips(self, url, table_id):
        headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.90 Safari/537.36'}
        html = requests.get(url, headers=headers)
        soup = BeautifulSoup(html.text, 'html.parser')
        table = soup.find(id=table_id)
        if not table:
            return []
        tbody = table.find('tbody')
        if not tbody:
            tbody = table
        ips = []
        for tr in tbody.find_all('tr'):
            ip = []
            for td in tr.find_all('td'):
                ip.append(td.text)
            if len(ip):
                ips.append(ip)
        return ips
    def get_sslproxies(self):
        url = 'http://www.sslproxies.org/'
        table_id = "proxylisttable"
        return self.get_ips(url, table_id)
    def get_xicidaili(self):
        url = 'http://www.xicidaili.com/wn/'
        table_id = "ip_list"
        return self.get_ips(url, table_id)
    def write_file(self, str):
        fp = open('ips.txt', 'w+')
        fp.write(str)
        fp.close()
    def run(self):
        sslproxies = self.get_sslproxies()
        xicidaili = self.get_xicidaili()
        ips = {}
        ips['sslproxies'] = sslproxies
        ips['xicidaili'] = xicidaili
        ips = json.dumps(ips)
        self.write_file(ips)
spider = Spider()
spider.run()

转载入门教程

发表于 2017-03-20

MongoDB入门

C语言实现Trie字典树支持中文

发表于 2017-03-20

C语言实现Trie字典树，支持中文，支持删除节点

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
#include <memory.h>
#define DEFAULT_CHILD_SIZE 4
#define MAX_KEY_LENGTH 11
#define MAX_KEYS_LENGTH 100
typedef struct Node {
    wchar_t ch;
    _Bool is_end;
    struct Node **child;
    int child_cur_length;
    int child_total_length;
} Node;
typedef struct CHS {
    wchar_t chs[MAX_KEYS_LENGTH][MAX_KEY_LENGTH];
    int length;
} CHS;
Node* search_trie_child(Node *node, wchar_t ch)
{
    int i = 0;
    Node *pnode = NULL;
    Node *ptmp = NULL;
    for (i = 0; i < node->child_cur_length; i++) {
        ptmp = node->child[i];
        if (ptmp->ch == ch) {
            pnode = ptmp;
            break;
        }
    }
    return pnode;
}
Node* init_node(wchar_t ch)
{
    Node *pnode = (Node *)malloc(sizeof(Node));
    pnode->ch = ch;
    pnode->is_end = 0;
    pnode->child_cur_length = 0;
    pnode->child_total_length = DEFAULT_CHILD_SIZE;
    pnode->child = (Node**)malloc(pnode->child_total_length * sizeof(Node *));
    return pnode;
}
void free_node(Node *pnode)
{
    free(pnode->child);
    free(pnode);
}
Node* add_trie_child(Node *node, wchar_t ch)
{
    Node *pnode = init_node(ch);
    if (node->child_cur_length >= node->child_total_length) {
        node->child_total_length = node->child_total_length * 2;
        Node **tmp = (Node**)malloc(node->child_total_length * sizeof(Node *));
        memcpy(tmp, node->child, node->child_cur_length * sizeof(Node *));
        free(node->child);
        node->child = tmp;
    }
    node->child[node->child_cur_length++] = pnode;
    return pnode;
}
_Bool delete_trie_word(Node *trie, wchar_t *str, int cur)
{
    Node *pnode;
    if (str[cur] == '\0' && trie->is_end) {
        //匹配到了最后一个字符且是結束节点
        trie->is_end = 0;
        return 1;
    }
    if ( str[cur] == '\0' || ( pnode = search_trie_child(trie, str[cur]) ) == NULL ) {
        return 0;
    }
    if (delete_trie_word(pnode, str, cur + 1)) {
        if (pnode->child_cur_length == 0) {
            int i;
            //删除节点指针，并且把最后一个节点指针的位置调整到删除节点指针的位置
            for (i = 0; i < trie->child_cur_length; i++) {
                if (pnode == trie->child[i]) {
                    trie->child[i] = trie->child[trie->child_cur_length - 1];
                    trie->child[trie->child_cur_length - 1] = 0;
                    trie->child_cur_length--;
                    free_node(pnode);
                    break;
                }
            }
        }
        if (trie->child_cur_length == 0 && !trie->is_end) {
            return 1;
        }
    }
    return 0;
}
int add_trie(Node *trie, wchar_t *str)
{
    int len = (int)wcslen(str);
    int i = 0;
    Node *pnode = NULL;
    if (len >= MAX_KEY_LENGTH) {
        wprintf(L"too long str: %ls\n", str);
        return 0;
    }
    for (i = 0; i < len; i++) {
        if ( (pnode = search_trie_child(trie, str[i])) == NULL ) {
            pnode = add_trie_child(trie, str[i]);
        }
        trie = pnode;
    }
    trie->is_end = 1;
    return 0;
}
void delete_trie(Node *trie)
{
    int i;
    for (i = 0; i < trie->child_cur_length; i++) {
        delete_trie(trie->child[i]);
    }
    free_node(trie);
}
CHS* search_trie(Node *trie, wchar_t *str)
{
    int len = (int)wcslen(str);
    int i = 0;
    int j = 0;
    int k = 0;
    Node *pnode = NULL;
    Node *ptrie = NULL;
    CHS *pchs = (CHS *)malloc(sizeof(CHS));
    pchs->length = 0;
    memset(pchs, 0, sizeof(CHS));
    for (i = 0; i < len; i++) {
        ptrie = trie;
        for (j = i, k = 0; j < len; j++) {
            if ( (pnode = search_trie_child(ptrie, str[j])) == NULL ) {
                break;
            } else {
                ptrie = pnode;
                pchs->chs[pchs->length][k++] = str[j];
            }
            if (pnode->is_end) {
                pchs->chs[pchs->length][k] = '\0';
                pchs->length++;
                memcpy(&pchs->chs[pchs->length], pchs->chs[pchs->length-1], MAX_KEY_LENGTH);
            }
        }
    }
    return pchs;
}
void print_trie(CHS *pchs)
{
    int i;
    for (i = 0; i < pchs->length; i++) {
        wprintf(L"%ls\n", &pchs->chs[i]);
    }
}
int test()
{
    Node *trie = init_node(0);
    wchar_t *str1 = L"1你好";
    add_trie(trie, str1);
    wchar_t *str2 = L"2你好吗";
    add_trie(trie, str2);
    wchar_t *str3 = L"3你还好";
    add_trie(trie, str3);
    wchar_t *str4 = L"4你还好吗";
    add_trie(trie, str4);
    wchar_t *str5 = L"5你真的还好吗";
    add_trie(trie, str5);
    wchar_t *str6 = L"我好";
    add_trie(trie, str6);
    wchar_t *str7 = L"我好吗";
    add_trie(trie, str7);
    wchar_t *str8 = L"我还好";
    add_trie(trie, str8);
    wchar_t *str = L"cc你好我好你好吗";
    CHS *pchs = search_trie(trie, str);
    delete_trie_word(trie, str8, 0);
    delete_trie(trie);
    print_trie(pchs);
    free(pchs);
    return 0;
}
int main() 
{
    setlocale(LC_CTYPE, "");
    test();
    return 0;
}

使用coding.net解决百度爬虫无法抓取githubpage

发表于 2017-03-15

假设你已经在github上使用hexo搭建好了一个可以访问的page页，
我们使用百度的站长平台的抓取诊断功能，诊断如下链接：yourusername.github.io ,
会提示抓取失败，原因就是github pages屏蔽的百度的爬虫，毕竟自身的安全才是最重要的，具体原因大家都懂得。
coding.net是国内的一个类似github的，也支持pages服务，看到这可能也都清楚了，怎么做来解决百度爬虫无法抓取githubpage了，
简单来说就是，github page和coding.net page绑定到同一个域名A，
使用域名解析服务DNSPod把域名A根据不同的线路cname到github page或者coding.net page,
在此我们选择百度线路的cname到coding.net , 其它cname到github page,
域名可以自己购买或者使用免费的如tk顶级域名,
最后在hexo的配置文件里修改代码的部署，同时部署到github和coding.net, 如下
deploy:
type: git
repo:
github: [email protected]:yourname/yourname.github.io.git,master
coding: [email protected]:yourname/yourname.coding.me.git,master

解决hexo引用googleapis导致墙内页面加载慢

发表于 2017-03-10

以next主题为例，我们需要修改一处代码和配置文件，
在next主题目录下搜索“fonts.googleapis.com”，在_config.yaf下加上我们自己的字体host,
然后在external-fonts.swig使用这个配置的地方做一些修改，
最后就是下载fonts.googleapis.com对应的字体文件放到我们可以访问到的地方就可以了。