Joining source files with Ruby and adding line numbers

Published on:
Tags: Ruby

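Today I had to prepare a patent application, which requires submitting the related source code. I wrote a script that concatenates every Java file under a directory into one file and prefixes each line with a line number. The generated txt file turned out to be over 2 MB, a bit more than 70,000 lines in total.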

The code:

BASE_PATH = "/Users/boboism/Projects/Java/Web/com.gacmotor.eam/"

File.open('/Users/boboism/Desktop/output.txt', 'w') do |output_file|

  Dir["#{BASE_PATH}**/**.java"].each do |file_path|
    File.open(file_path, 'r') do |file|
      output_file.write("/#{'*'*80}\r\n")
      output_file.write(" * File:   #{file_path.gsub(BASE_PATH, '')}\r\n")
      output_file.write(" * Author: Jianbo Su <sujb@gacmotor.com>\r\n")
      output_file.write(" #{'*'*80}/\r\n")
      file.each_line.with_index(1) do |row, index|  # line numbers start at 1
        output_file.write("#{'%04d' % index} #{row}")
      end

      output_file.write("\r\n\r\n")
    end
  end

end
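
The numbering step on its own can be sketched as a small function. This is only a minimal sketch, not the script above: `number_lines` is a hypothetical helper, and both the starting number of 1 and the 4-digit zero-padded width are assumptions made for illustration.

```ruby
# Hypothetical helper: prefix each line with a zero-padded line number.
# `start` and `width` are illustrative parameters, not part of the original script.
def number_lines(lines, start: 1, width: 4)
  lines.each_with_index.map do |line, index|
    format("%0#{width}d %s", index + start, line)
  end
end

puts number_lines(["package demo;\n", "class Foo {}\n"]).join
```

Because the numbering restarts for every file, a width of 4 stays aligned as long as no single file exceeds 9,999 lines; a wider `width` could be passed otherwise.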

How to read large files with Ruby (log analysis)

Published on:
Tags: Ruby

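Recently I needed to read an application's logs to analyze some data. I opened the log from one of the nodes and found it was already close to 2 GB. Wow, it had grown that big. Analyzing files is tedious, especially logs of this size, and without a programmer's approach it would take forever. That is when I thought of Ruby's IO class and its subclass File, which offer several methods: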

  1. #read
  2. #each
  3. #readlines

Let's go through them one by one.

read([length [, buffer]]) → string, buffer, or nil

The official documentation describes this method as follows:

Reads length bytes from the I/O stream.

In other words, this one is better suited to reading binary files. OK, not a great fit for me. Next.
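
For reference, when #read is given a length it returns at most that many bytes per call and nil once the end of the stream is reached, so a file can be walked in fixed-size chunks. A minimal sketch, using StringIO as a stand-in for a real file; the `each_chunk` helper and the 4-byte chunk size are assumptions for illustration.

```ruby
require 'stringio'

# Hypothetical helper: yield successive chunks of at most `size` bytes.
def each_chunk(io, size)
  while (chunk = io.read(size)) # read(length) returns nil at end of stream
    yield chunk
  end
end

chunks = []
each_chunk(StringIO.new("abcdefghij"), 4) { |c| chunks << c }
p chunks
```

Chunks ignore line boundaries, which is why this style fits binary data better than line-oriented logs.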

each(sep=$/) {|line| block } → ios

The documentation says:

Executes the block for every line in ios, where lines are separated by sep. ios must be opened for reading or an IOError will be raised.

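Nice, it hands the file to the block one line at a time, which is great: my file is large, so this fits well. Let's see whether there is anything better.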

readlines(sep=$/) → array

The documentation says:

Reads all of the lines in ios, and returns them in anArray. 

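So this one lets you work with the file's lines as an array, which is nice. But there is a key catch: it reads the entire file into memory. If my file is 2 GB and my system only has 1 GB of RAM, it may well die halfway through the read.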

After weighing the options, #each it is.
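
The difference between the two approaches can be sketched on a small temporary file. This is an illustrative sketch only: `count_lines_streaming` is a hypothetical helper, and `File.foreach` is used here as the class-level counterpart of opening the file and calling #each.

```ruby
require 'tempfile'

# Hypothetical helper: count lines while holding only one line in memory at a time.
def count_lines_streaming(path)
  count = 0
  File.foreach(path) { |_line| count += 1 }
  count
end

Tempfile.create('sample-log') do |f|
  f.write("line 1\nline 2\nline 3\n")
  f.flush
  all_lines = File.readlines(f.path) # the whole file as one Array
  puts "readlines: #{all_lines.size} lines held in memory"
  puts "foreach:   #{count_lines_streaming(f.path)} lines visited"
end
```

Both report the same number of lines; only the peak memory use differs, which is what matters for a 2 GB log.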

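Here is the analysis. Suppose I need to find the top 10 requesting IPs in the log, and every line in my file starts with an IP. That leads to the following example: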

Example
# Create a Hash whose values default to 0 for every new key
ip_counter = Hash.new{|hash, key| hash[key] = 0}
# Open the file
File.open("/Users/boboism/Downloads/access.log") do |f|
  # Read one line at a time into memory
  f.each("\n") do |line|
    # Extract the IP, convert it to a Symbol to use as the key, and increment its counter
    line.scan(/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/).each{|ip| ip_counter[ip.to_sym] += 1}
  end
end
# Sort by request count in descending order and take the top 10
top_ten = Hash[*ip_counter.sort_by{|key,value| value*-1}.take(10).flatten]

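12 lines in total is all it takes to write a simple log analyzer. Imagine what that would look like in Java. As an aside, this kind of task can also be done with shell plus awk on Linux.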
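
The same counting logic can be exercised without a real log file. A minimal sketch, assuming every line starts with an IPv4 address; `top_ips` and `IP_PREFIX` are hypothetical names, the sample log lines are made up, and the sketch uses String keys with `max_by` rather than the Symbol keys and `sort_by` of the example above.

```ruby
require 'stringio'

# Matches a dotted-quad at the very start of a line.
IP_PREFIX = /\A\d{1,3}(?:\.\d{1,3}){3}/

# Hypothetical helper: return the `n` most frequent leading IPs as [ip, count] pairs.
def top_ips(io, n = 10)
  counts = Hash.new(0)
  io.each_line do |line|
    ip = line[IP_PREFIX]
    counts[ip] += 1 if ip
  end
  counts.max_by(n) { |_ip, count| count }
end

# Made-up sample data standing in for access.log.
log = StringIO.new(<<~LOG)
  10.0.0.1 - - "GET / HTTP/1.1" 200
  10.0.0.2 - - "GET /a HTTP/1.1" 200
  10.0.0.1 - - "GET /b HTTP/1.1" 404
LOG
p top_ips(log, 2)
```

Because the helper accepts any IO-like object, passing `File.open(path)` instead of a StringIO would keep the line-by-line streaming behaviour.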

How to crawl video links with Anemone

Published on:

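I was just looking back through some crawler scripts I wrote a while ago and wanted to share one of the more interesting ones. This script uses Anemone, a Ruby crawling library written by Chris Kite. It provides a very simple DSL for visiting each page and its URLs, and according to the official description it automatically works out the shortest path it needs. The library is also multithreaded. Thanks to Ruby's concise regular expression syntax, the script ends up looking very short.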

The main libraries used are the anemone gem, digest (part of Ruby's standard library), and the mongo gem for storing the links:

gem install anemone
gem install mongo

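First, define a few constants: ENTRY_PATTERN is the entry point, PAGE_PATTERN matches the pages to crawl, and ANY_PATTERN can additionally cover other links that need to be followed: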

ENTRY_PATTERN = "http://www.oabt.org/?cid=5"
PAGE_PATTERN  = %r[cid=(?:5|25|6|7|8|11)(?:&page=\d+)?$]
ANY_PATTERN   = PAGE_PATTERN
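
To see what PAGE_PATTERN accepts, it can be tried against a few URLs. A small sketch; apart from the entry URL, the sample URLs are made up for illustration and are not taken from the site.

```ruby
PAGE_PATTERN = %r[cid=(?:5|25|6|7|8|11)(?:&page=\d+)?$]

# Illustrative URLs only; everything except the entry URL is an assumption.
samples = [
  "http://www.oabt.org/?cid=5",         # listed category, no page number
  "http://www.oabt.org/?cid=25&page=3", # listed category with a page number
  "http://www.oabt.org/?cid=9",         # category not in the list
  "http://www.oabt.org/?cid=5&page="    # page parameter without digits
]

samples.each do |url|
  puts "#{url} => #{url.match?(PAGE_PATTERN) ? 'crawl' : 'skip'}"
end
```

The trailing `$` is what rejects URLs that carry extra parameters after the category or page number.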

Next, create a MongoDB database and collection. Since the target is http://www.oabt.org, I simply named them after it:

db = Mongo::Connection.new.db("oabt_org")
movies = db["movie"]

Next, define some options for Anemone:

options = {
  :threads              => 1,                # number of threads
  :verbose              => true,             # verbose output
  :discard_page_bodies  => true,
  :user_agent           => "Mozilla...",
  :delay                => 0,
  :obey_robots_txt      => true,
  :depth_limit          => 1,
  :redirect_limit       => 5,
  :storage              => nil,
  :cookies              => nil,
  :accept_cookies       => true,
  :skip_query_strings   => false,
  :proxy_host           => nil,
  :proxy_port           => false,
  :read_timeout         => 20
}
The default options in the official source code:
DEFAULT_OPTS = {
  # run 4 Tentacle threads to fetch pages
  :threads => 4,
  # disable verbose output
  :verbose => false,
  # don't throw away the page response body after scanning it for links
  :discard_page_bodies => false,
  # identify self as Anemone/VERSION
  :user_agent => "Anemone/#{Anemone::VERSION}",
  # no delay between requests
  :delay => 0,
  # don't obey the robots exclusion protocol
  :obey_robots_txt => false,
  # by default, don't limit the depth of the crawl
  :depth_limit => false,
  # number of times HTTP redirects will be followed
  :redirect_limit => 5,
  # storage engine defaults to Hash in +process_options+ if none specified
  :storage => nil,
  # Hash of cookie name => value to send with HTTP requests
  :cookies => nil,
  # accept cookies from the server and send them back?
  :accept_cookies => false,
  # skip any link with a query string? e.g. http://foo.com/?u=user
  :skip_query_strings => false,
  # proxy server hostname 
  :proxy_host => nil,
  # proxy server port number
  :proxy_port => false,
  # HTTP read timeout in seconds
  :read_timeout => nil
}

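Now for the crawl itself. #focus_crawl decides which URLs to follow, keeping only those that match ANY_PATTERN. Then, for pages that match PAGE_PATTERN, the script looks in the page for each entry's title, id, and reference URL, plus its download links: magnet/thunder/ed2k (this uses Nokogiri; I am comfortable with CSS selectors, so the code uses CSS selectors directly). It then checks whether the hash of the entry's id already exists and, if not, inserts the entry into mongo: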

Anemone.crawl(ENTRY_PATTERN, options) do |anemone|

  anemone.focus_crawl do |page|
    p "focus #{page.url}"
    page.links.keep_if{|link| link.to_s =~ ANY_PATTERN}
  end

  anemone.on_pages_like(PAGE_PATTERN) do |page|
    if page.doc
      p "crawl #{page.url}"
      p "crawl header:#{page.headers}"
      p "crawl code:#{page.code}"
      p "crawl body:#{page.body}"
      p "crawl links:#{(page.links||[]).collect(&:to_s).join('\n')}"
      page.doc.css('tr').each do |tr|
        p "crawl tr"
        title, id, ref_url = tr.css('td.name.magTitle a').collect{|a| [a.text, a['rel'], a['href']]}.first
        if id && (md5 = Digest::MD5.hexdigest(id)) && (movies.find({"md5" => md5}).first.nil?)
          cat = tr.css("a.sbule").collect{|a| [a['href'][/\d+/], a.text]}.first.join('|')
          ed2k_url, mag_url, thunder_url = (tr.css('td.dow a.ed2kDown').first||{})['ed2k'], (tr.css('td.dow a.magDown').first||{})['href'], (tr.css('td.dow a.thunder').first||{})['thunderhref']
          movie = {:md5 => md5, :cat => cat, :ref_url => ref_url, :title => title, :ed2k_url => ed2k_url, :mag_url => mag_url, :thunder_url => thunder_url}
          p "Inserting #{movie.inspect}"
          movies.insert movie
        end
      end
    end
  end

end
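
The duplicate check in the crawl block hashes each row's id and only inserts when that hash has not been stored yet. A minimal sketch of the same idea, with an in-memory Set standing in for the MongoDB collection; `SeenStore`, its method, and the sample id are hypothetical.

```ruby
require 'digest/md5'
require 'set'

# Hypothetical in-memory stand-in for the `movies` collection.
class SeenStore
  def initialize
    @digests = Set.new
  end

  # Returns true if the id was new and has now been recorded, false otherwise.
  def insert_unless_seen(id)
    md5 = Digest::MD5.hexdigest(id)
    !@digests.add?(md5).nil? # Set#add? returns nil when the element already exists
  end
end

store = SeenStore.new
p store.insert_unless_seen("movie-42")
p store.insert_unless_seen("movie-42")
```

Unlike this sketch, a database-backed check also survives across runs, which is what lets the crawler be restarted without inserting duplicates.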

The complete code, under 60 lines in total:

require 'anemone'
require 'digest/md5'
require 'mongo'

# Patterns
ENTRY_PATTERN = "http://www.oabt.org/?cid=5"
PAGE_PATTERN  = %r[cid=(?:5|25|6|7|8|11)(?:&page=\d+)?$]
ANY_PATTERN   = PAGE_PATTERN

db = Mongo::Connection.new.db("oabt_org")
movies = db["movie"]

options = {
  :threads              => 1,
  :verbose              => true,
  :discard_page_bodies  => true,
  :user_agent           => "Mozilla...",
  :delay                => 0,
  :obey_robots_txt      => true,
  :depth_limit          => 1,
  :redirect_limit       => 5,
  :storage              => nil,
  :cookies              => nil,
  :accept_cookies       => true,
  :skip_query_strings   => false,
  :proxy_host           => nil,
  :proxy_port           => false,
  :read_timeout         => 20
}
p "begin"
Anemone.crawl(ENTRY_PATTERN, options) do |anemone|

  anemone.focus_crawl do |page|
    p "focus #{page.url}"
    page.links.keep_if{|link| link.to_s =~ ANY_PATTERN}
  end

  anemone.on_pages_like(PAGE_PATTERN) do |page|
    if page.doc
      p "crawl #{page.url}"
      p "crawl header:#{page.headers}"
      p "crawl code:#{page.code}"
      p "crawl body:#{page.body}"
      p "crawl links:#{(page.links||[]).collect(&:to_s).join('\n')}"
      page.doc.css('tr').each do |tr|
        p "crawl tr"
        title, id, ref_url = tr.css('td.name.magTitle a').collect{|a| [a.text, a['rel'], a['href']]}.first
        if id && (md5 = Digest::MD5.hexdigest(id)) && (movies.find({"md5" => md5}).first.nil?)
          cat = tr.css("a.sbule").collect{|a| [a['href'][/\d+/], a.text]}.first.join('|')
          ed2k_url, mag_url, thunder_url = (tr.css('td.dow a.ed2kDown').first||{})['ed2k'], (tr.css('td.dow a.magDown').first||{})['href'], (tr.css('td.dow a.thunder').first||{})['thunderhref']
          movie = {:md5 => md5, :cat => cat, :ref_url => ref_url, :title => title, :ed2k_url => ed2k_url, :mag_url => mag_url, :thunder_url => thunder_url}
          p "Inserting #{movie.inspect}"
          movies.insert movie
        end
      end
    end
  end

end

How to write a WebService provider with Ruby

Published on:

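Lately I have been building a program that syncs fixed-asset data between my EAM (fixed asset management system) and the company's ERP. At first I wanted to reuse the interface system that the MES already uses and patch it up as needed. But that interface system turned out to be quite tightly coupled, and adapting it would probably take around half a month before it could go live. So I decided to just write it in Ruby, and that is how this article came about. Below is one of the XML formats (roughly), a specification defined by the ERP project team: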

Asset import
<?xml version="1.0" encoding="UTF-8"?>
<Interface Sender="EAM" Receiver="ERP" Billtype="F10">
  <Bill>
    <BillHeader>
      <AcceptanceDate>2006-08-28 00:00:00</AcceptanceDate>
      <AssetName>笔记本电脑</AssetName>
      <AssetNo>703021</AssetNo>
      <Brand>IBM X60</Brand>
      <Category>7</Category>
      <IsSpecialFund></IsSpecialFund>
      <IsTariff></IsTariff>
      <IsVat></IsVat>
      <Model></Model>
      <OriginalValue>100.0</OriginalValue>
      <Salvage>20.0</Salvage>
      <SerialNo>L3B6534</SerialNo>
      <ServiceDate>2006-08-28 00:00:00</ServiceDate>
      <SubCategory>03</SubCategory>
      <TaxPreferType>不属于</TaxPreferType>
    </BillHeader>
    <BillBody>
      <Entry>
        <AllocationQuantity>0.5</AllocationQuantity>
        <ConstructionPeriod>1</ConstructionPeriod>
        <CostCenter>00001</CostCenter>
        <ManagementDepartment>00002</ManagementDepartment>
        <SpecialPurpose>01</SpecialPurpose>
      </Entry>
      <Entry>
        <AllocationQuantity>0.5</AllocationQuantity>
        <ConstructionPeriod>1</ConstructionPeriod>
        <CostCenter>00008</CostCenter>
        <ManagementDepartment>00002</ManagementDepartment>
        <SpecialPurpose>01</SpecialPurpose>
      </Entry>
    </BillBody>
  </Bill>
</Interface>

I have been using Ruby on Rails for projects lately (personal and company), so this did not seem too difficult. Still, a few problems needed solving:

  • The somewhat non-standard SOAP XML content format.
  • How to call the web service?
  • How to run it on a schedule?

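I wanted to generate the XML straight from ActiveRecord, but the required XML does not look like what ActiveRecord produces out of the box. After reading the builder code behind ActiveRecord's XML support, I decided to customize the output with builder instead. That led to the following code: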

ActiveRecord::Base#to_erp_xml
module ActiveRecord
  class Base
    def to_erp_xml(options={})
      require 'builder' unless defined?(Builder)
      options = options.dup
      xml = ::Builder::XmlMarkup.new
      xml.instruct! :xml, :version => "1.0", :encoding => "UTF-8"
      # Collect only the interface attributes that were actually supplied
      interface_attributes = [:sender, :receiver, :billtype].each_with_object({}) do |im, acc|
        acc[im.to_s.camelize] = options[:interface][im] if options[:interface][im]
      end
      xml.Interface(interface_attributes) do |interface|
        if options[:include]
          interface.Bill do |bill|
            bill.BillHeader do |bill_header|
              self.class.column_names.sort.each do |col|
                unless options[:except] && Array(options[:except]).include?(col.to_sym)
                  bill_header.tag! col.to_s.camelize, self[col.to_sym]
                end
              end
            end
            bill.BillBody do |bill_body|
              Array(options[:include]).each do |attr|
                self.send(attr.to_sym).each do |association|
                  bill_body.Entry do |entity|
                    association.class.column_names.sort.each do |col|
                      unless options[:except] && Array(options[:except]).include?(col.to_sym)
                        entity.tag! col.to_s.camelize, association[col.to_sym]
                      end
                    end
                  end
                end
              end
            end
          end
        end
      end
      xml.target!.to_s
    end
  end
end
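
One detail worth isolating is how the root element's attributes are collected: only keys that are present in `options[:interface]` should end up on `<Interface>`. A minimal sketch in plain Ruby, without Rails; `simple_camelize` is a hypothetical stand-in for ActiveSupport's `camelize`, and `interface_attributes` is a hypothetical helper name.

```ruby
# Hypothetical stand-in for ActiveSupport's String#camelize (handles snake_case only).
def simple_camelize(name)
  name.to_s.split('_').map(&:capitalize).join
end

# Build the attribute Hash for the root element, skipping keys that are not set.
def interface_attributes(interface, keys = [:sender, :receiver, :billtype])
  keys.each_with_object({}) do |key, attrs|
    attrs[simple_camelize(key)] = interface[key] if interface[key]
  end
end

p interface_attributes({:sender => "EAM", :billtype => "F10"})
```

`each_with_object` is used because it always passes the same Hash along; with `inject`, the block would have to return the accumulator on every iteration, including the ones where a key is missing.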

Why is it written so awkwardly? Because I fetch the data directly through database views. Haha. If the requirements change later, I can simply modify the SQL:

AssetF10&AllocationF10
class AssetF10 < ActiveRecord::Base
  self.table_name = "vw_asset_erp"
  self.primary_key = "id"

  has_many :allocations, :class_name => "AllocationF10", :foreign_key => "asset_id"
end

class AllocationF10 < ActiveRecord::Base
  self.table_name = "vw_asset_allocation_erp"
  self.primary_key = "id"

  belongs_to :asset, :class_name => "AssetF10"#, :foreign_key => "asset_id"
end

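At this point the XML can basically be generated, but how do I send it? I checked the download counts on rubygems.org, and for SOAP, savon looked like the better choice, with thorough documentation. Compared with making SOA calls from Java with xfire, writing it this way feels very concise and convenient.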

Savon.client
client = Savon.client(:wsdl => "http://172.18.81.20/BOI/Service.asmx?WSDL")
ws_attrs = {
  :interface => {:sender => "EAM", :receiver => "ERP", :billtype => "F10"},
  :func_name => "FixedAssetImport",
  :handshake => "111111"
}
receivable_assets = []
unreceivable_assets = []
AssetF10.all.each do |asset|
  token = UUIDTools::UUID.random_create.to_s.gsub("-","").upcase
  parameters = asset.to_erp_xml(ws_attrs.merge({:include => :allocations, :except => [:id, :asset_sync_status, :asset_id]}))
  resp = client.call(:boi_invoke, :message => {
    :from       => "#{ws_attrs[:interface][:sender]}#{ws_attrs[:handshake]}",
    :to         => ws_attrs[:interface][:receiver],
    :token      => token,
    :func_name  => "#{ws_attrs[:func_name]}_#{ws_attrs[:interface][:billtype]}",
    :parameters => "#{parameters.to_s}" })
  p parameters
  if resp.body[:boi_invoke_response][:boi_invoke_result]
    receivable_assets << {asset.asset_no => "TOKEN=#{token}"}
    # update records
    asset.asset_sync_status = 1
    asset.save
  else
    unreceivable_assets << {asset.asset_no => "TOKEN=#{token} #{resp.body[:boi_invoke_response][:result]}"}
  end
end
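
The token sent with each call is a random UUID with the dashes removed and upper-cased. If the uuidtools gem is not available, Ruby's standard library can produce a value of the same shape. Note that this sketch swaps in SecureRandom for UUIDTools, which is an assumption and not what the script above uses; `new_token` is a hypothetical helper.

```ruby
require 'securerandom'

# Hypothetical helper: 32 upper-case hex characters derived from a random (version 4) UUID.
def new_token
  SecureRandom.uuid.delete('-').upcase
end

puts new_token
```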

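OK, the web service call is solved. That leaves the last problem: scheduled execution. From the start I planned to use the Linux crontab to trigger this interface, which keeps it independent of the sidekiq used by the fixed asset management system itself (after all, this interface is only called once a month and has no real-time requirement, so running it through sidekiq did not seem necessary).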

eam_soa_f10.sh
#!/bin/sh
/home/it/.rvm/rubies/ruby-1.9.3-p374/bin/ruby /home/it/projects/ruby/cli/eam_soa/eam_soa_f10.rb >> /home/it/projects/ruby/cli/eam_soa/eam_soa_f10.log 2>&1

Then add one line to the crontab to call it:

crontab
0 0 25 * * /home/it/projects/ruby/cli/eam_soa/eam_soa_f10.sh

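OK, all done. Compared with the Java version of the interface system, the same functionality took roughly the amount of code of just one of its classes.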