Google Desktop for Linux With Apache2 On LAN

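Preface:
    Two years ago I first tried combining Google Desktop with Apache for file search on a LAN. The original post is "First original article: building an enterprise search server with Google Desktop Search", http://blog.chinaunix.net/u/13472/showart.php?id=73880
    At that time Google Desktop for Linux had not been released yet, so my Samba file server had no integrated search service, and I waited eagerly. When the Linux version did come out, it turned out not to search Microsoft's proprietary document formats, which was disappointing for quite a while. Finally Google Desktop for Linux v1.1.1.0075 arrived with indexing support for DOC, XLS and PPT, so I set about putting it on my Samba server, so that the machine offers a simple search server alongside the Samba service.
Main text:
    The principle is the same as in the earlier post: Apache acts as a proxy for Google Desktop. The earlier post said a port mapper was needed; after further searching, the real cause turned out to be a missing reverse-proxy setting, i.e. adding a ProxyPassReverse directive after ProxyPass avoids the problem. So Google Desktop behind Apache no longer needs any client-side configuration. A browser is enough, and a file browser can fill that role.
    If this Apache has no other purpose, then as in the earlier post, assign the server a second IP address dedicated to this Google Desktop proxy. A simple configuration file follows: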

CODE:
NameVirtualHost 192.168.1.120:80
<VirtualHost 192.168.1.120:80>
        ServerAdmin webmaster@localhost
        ServerName 192.168.1.120

        # Note: port 30043 differs for every Linux user; record the
        # Google Desktop start page address on the desktop beforehand.
        ProxyPass / http://127.0.0.1:30043/
        ProxyPassReverse / http://127.0.0.1:30043/
        <Proxy http://127.0.0.1:30043>
                Allow from all
        </Proxy>

        <Directory />
                Options FollowSymLinks
                AllowOverride None
                Allow from all
        </Directory>

        <Location /redir>
                Deny from all
        </Location>

        <Location /openfolder>
                Deny from all
        </Location>

</VirtualHost>

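Before restarting Apache, you also need to change the user Apache runs as to the user that runs Google Desktop. Google Desktop's index files are readable only by that single Linux user and by no one else, so an Apache started as a different user cannot read Google Desktop's data and therefore cannot proxy it.
    With all of this in place and Apache restarted, Google Desktop can be reached over the LAN at http://192.168.1.120/XXXXXXXX (the elided part is the Google Desktop start address, which differs for every Linux desktop user).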
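
The relationship between the local start page address and the LAN address can be sketched in Python. This is only an illustration under assumptions: the helper names `local_port` and `lan_start_url` are hypothetical, and `XXXXXXXX` stands for the per-user suffix that is elided above.

```python
from urllib.parse import urlsplit, urlunsplit


def local_port(local_start_url: str) -> int:
    """Return the port Google Desktop listens on, read from its start page URL."""
    port = urlsplit(local_start_url).port
    if port is None:
        raise ValueError("start page URL carries no explicit port")
    return port


def lan_start_url(local_start_url: str, proxy_host: str) -> str:
    """Replace the local host:port with the proxy host, keeping path and query."""
    parts = urlsplit(local_start_url)
    return urlunsplit(("http", proxy_host, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    start = "http://127.0.0.1:30043/XXXXXXXX"  # placeholder suffix, as in the text
    print(local_port(start))                   # the port to put into ProxyPass
    print(lan_start_url(start, "192.168.1.120"))
```

The port returned by `local_port` is the value that goes into the ProxyPass and ProxyPassReverse lines, and the second helper yields the address LAN clients would open.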
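    Next, I tried integrating this LAN-facing Google Desktop into the file server. Remembering that address suffix is hard, so it makes sense to store the start page as a file on the file server; anyone who reaches the file through the file server can click it to open the search proxy server.
    Note that a start page simply saved to disk cannot run searches when opened from the file server, because its addresses are relative. So I made the following changes: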

CODE:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<meta http-equiv="content-type" content="text/html; charset=utf-8">

<meta http-equiv="cache-control" content="no-cache">

<meta http-equiv="pragma" content="no-cache">

<meta http-equiv="expires" content="-1">

<title>Google 桌面</title>

<style>

body,p,td{font-family:arial,sans-serif;color:#000}body{background-color:#fff;margin:4px}img{border:0}table,td{border:0;margin:0;padding:0}.nowrap{white-space:nowrap}.none{display:none}.inline{display:inline}.float_left{float:left}.logo3{margin-top:9px;padding-bottom:10px}a:visited{color:#551a8b}a:link{color:#00c}a:active{color:#00c}a:hover{color:#00c}

.q{color:#00c;padding:4px 0 4px 4px;margin:0;white-space:nowrap}.q a:visited{color:#00c}.q a:link{color:#00c}.q a:hover{color:#00c}.q a:active{color:#00c}span{margin:0px}div{border:0;margin:0;padding:0}div#basic{margin:7px}div#advanced{margin:7px}div#search_box{padding-top:30px;padding-bottom:30px}div#line{background-color:#39c;height:1px}div#bottomquery{background-color:#e8f4f7}

div#querybuttons{padding-top:20px;padding-bottom:20px;text-align:center}div#bottom_links{text-align:center;font-size:small;padding-bottom:80px;white-space:nowrap}p#copyright{padding-top:3px;font-size:x-small;white-space:nowrap}div#home_bottom span#homelink{display:none}div#pref_bottom div#bottom_links{padding-bottom:10px}h1{color:#335cec;font-size:large;font-weight:bold}

div.centerwarning{text-align:center}h4#fixmsg,h4#lowdisk{color:#f60}

input#q { margin-bottom:1px }

div#idxprogress {

text-align: center;

color: #f60;

}

h4#idxongoing {

padding-top: 5px;

padding-bottom: 5px;

}

</style>

<script>

<!--

function sf() {

document.f.q.focus();

}

function sw() {

window.location = "http://www.google.com/search?sourceid=GGXD&rlz=1L1GGXD&hl=zh-CN&oe=UTF-8&sa=N&tab=xw&q=" + encodeURIComponent(document.f.q.value);

}

-->

</script>

</head>

<body onLoad=sf()>

<center>

<br>

<img src="image/hp-logo.gif?hl=zh_CN"

width=276 height=110 alt="Google 桌面">

<br><br>

<form name=f action="http://192.168.1.120/search" method=get>

<input type="hidden" name="hl" value="zh_CN">

<input type="hidden" name="s" value="IKfIRNbuy8oqOJPMZBNzffceB6c">

<div class="q">

<style>TD.q {white-space: nowrap}</style><style>#lgpd{display:none}</style><script defer><!--
function qs(el){if(window.RegExp&&window.encodeURIComponent){var ue=el.href,qe=encodeURIComponent(document.f.q.value);if(ue.indexOf("q=")!=-1){el.href=ue.replace(new RegExp("q=[^&$]*"),"q="+qe);}else{el.href=ue+"&q="+qe;}}return 1;}
//-->
</script><table border=0 cellspacing=0 cellpadding=4><tr><td nowrap><font size=-1><a class=q href="http://www.google.com/webhp?source_id=GGXD&rlz=1L1GGXD&hl=zh-CN&oe=UTF-8&q=GOOOOG&tab=xw" onclick="return qs(this)">网页</a>&nbsp;&nbsp;&nbsp;&nbsp;<a class=q href="http://images.google.com/imghp?source_id=GGXD&rlz=1L1GGXD&hl=zh-CN&oe=UTF-8&q=GOOOOG&tab=xi" onclick="return qs(this)">图片</a>&nbsp;&nbsp;&nbsp;&nbsp;<a class=q href="http://groups.google.com/grphp?source_id=GGXD&rlz=1L1GGXD&hl=zh-CN&oe=UTF-8&q=GOOOOG&tab=xg" onclick="return qs(this)">论坛</a>&nbsp;&nbsp;&nbsp;&nbsp;<a class=q href="http://news.google.com/nwshp?source_id=GGXD&rlz=1L1GGXD&hl=zh-CN&oe=UTF-8&q=GOOOOG&tab=xn" onclick="return qs(this)">资讯</a>&nbsp;&nbsp;&nbsp;&nbsp;<a class=q href="http://ditu.google.com/maps?source_id=GGXD&rlz=1L1GGXD&hl=zh-CN&oe=UTF-8&q=GOOOOG&tab=xl" onclick="return qs(this)">地图</a>&nbsp;&nbsp;&nbsp;&nbsp;<b>桌面</b>&nbsp;&nbsp;&nbsp;&nbsp;<!--ENTERPRISE--><b><a href="http://www.google.com/intl/zh-CN/options/" class=q>更多&nbsp;&raquo;</a></b></font></td></tr></table></div>

<table cellspacing=0 cellpadding=0><tr><td width=25%>&nbsp;</td>

<td align=center>

<input id="q" maxlength=512 size=55 name=q value="" title="Google 桌面"><br>

<input type=submit value="搜索桌面">

<input type=button value="搜索网络" onclick=sw()>

</td>

<td valign=top nowrap width=25%><font size=-2>

&nbsp;&nbsp;<a href="http://192.168.1.120/options?hl=zh_CN&s=KfaPHSQRTXpTQAiCn0SSk8QuG2U">桌面使用偏好</a><br>

&nbsp;&nbsp;<a href="http://192.168.1.120/adv?hl=zh_CN&s=RInjEvQpp_ec0xuHJfV2RMld_nM">高级搜索</a><br>

</font></td>

</tr></table>

</form>

<br>

<div class="centerwarning">

<br>



</div>



<br>

<div id="home_bottom">

<div id="bottom_links">

<span id="homelink">

Google 桌面主页

-&nbsp;

</span>

<a href="http://192.168.1.120/status?hl=zh_CN&s=1lSl07UkINdDG1B8XKs5981ICpM">索引状态</a>

-&nbsp;

<a href="http://192.168.1.120/privacy?hl=zh_CN&s=7MTauqWsrjfyih2nmolpmXA1mfc">隐私权</a>

-&nbsp;

<a href="http://192.168.1.120/about?hl=zh_CN&s=9H5yl44biBXVBAjJUC77vy7RfPY">关于</a>

<p id="copyright">&copy;2007 Google</p>

</div>

</div>

</center>

</body>

</html>

Every occurrence of the string "http://192.168.1.120" above was added by me; a start page opened this way can then trigger the search proxy server.
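
The hand edit described above could also be scripted. The following Python sketch is an assumption-laden illustration, not the method used here: the function name `absolutize` is hypothetical, it only touches quoted `href` and `action` attributes, and it assumes the saved page used relative or root-relative addresses such as `/search`.

```python
import re

# Quoted href/action values that are not already absolute (no scheme, no "//")
# and are not pure fragments. Group 3 is the path without its leading slash.
_RELATIVE_URL = re.compile(
    r'\b(href|action)=(["\'])(?![a-zA-Z][a-zA-Z0-9+.-]*:|//|#)/?([^"\']*)\2',
    re.IGNORECASE,
)


def absolutize(html: str, prefix: str) -> str:
    """Prefix every relative href/action value in html with prefix."""
    prefix = prefix.rstrip("/")

    def repl(match: re.Match) -> str:
        attr, quote, path = match.group(1), match.group(2), match.group(3)
        return f"{attr}={quote}{prefix}/{path}{quote}"

    return _RELATIVE_URL.sub(repl, html)


if __name__ == "__main__":
    sample = '<form name=f action="/search" method=get><a href="adv?hl=zh_CN">adv</a></form>'
    print(absolutize(sample, "http://192.168.1.120"))
```

Links that already point to an absolute address, such as the ones to www.google.com, are left untouched. The prefix is a parameter, so the same helper works for any proxy address.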

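    With my basic requirement met, the result still fell short of what I wanted. The reason Apache runs on my server in the first place is to provide web access to the Samba file server across network segments, so the start page above can also be reached through that original Apache, but it cannot provide search (I have a limited number of IP addresses and cannot map every internal address to the outside). So next, I adapted the setup above to make it suitable for Internet use.
    Obviously it can no longer be proxied at the root, because the root has to serve as the file server's home page, so I proxied it at /googlesearch instead. The proxy section becomes: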

CODE:
        ProxyPass /googlesearch/ http://127.0.0.1:30043/
        ProxyPassReverse /googlesearch/ http://127.0.0.1:30043/
        <Proxy http://127.0.0.1:30043>
                Allow from all
        </Proxy>

        <Location /googlesearch/redir>
                Deny from all
        </Location>

        <Location /googlesearch/openfolder>
                Deny from all
        </Location>

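A proxy set up like this can open the home page but cannot run any search, because the URLs Google Desktop uses when it starts a search all begin at the root, /search. So the URL has to be rewritten, as follows: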

CODE:
        RewriteEngine On
        <Directory /Fileserver>
                # /Fileserver is the DocumentRoot directory.
                Options Indexes FollowSymLinks MultiViews
                AllowOverride None
                Order allow,deny
                Allow from all
                RedirectMatch ^/search /googlesearch/search
        </Directory>
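
The intended effect of the RedirectMatch rule can be sketched in Python. This models only the mapping the rule is meant to achieve, not the internals of mod_alias; in particular it assumes the query string is carried over to the redirect target, and the function name `redirect_target` is hypothetical.

```python
import re
from typing import Optional

# The pattern and target from the RedirectMatch line above.
SEARCH_RULE = (re.compile(r"^/search"), "/googlesearch/search")


def redirect_target(path: str, query: str = "") -> Optional[str]:
    """Return the redirect location for a request path, or None if the rule does not apply."""
    pattern, target = SEARCH_RULE
    if not pattern.search(path):
        return None
    # Assumption: the query string travels with the redirect.
    return f"{target}?{query}" if query else target


if __name__ == "__main__":
    print(redirect_target("/search", "q=report&hl=zh_CN"))
    print(redirect_target("/googlesearch/search"))
```

Because the pattern is anchored at the start, a request that already begins with /googlesearch/ does not match, so the rule should not redirect its own target. Any other path that starts with /search would match as well, since the pattern is not anchored at the end.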

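At last, Google Desktop is integrated into Apache. The final step is to modify the home page file and save it as index.html in the /Fileserver/文件搜索/ ("file search") directory, so that Apache opens the home page directly when that directory is visited.
    The change to the home page file is simple: replace every "http://192.168.1.120" above with "/googlesearch".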

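Closing notes:
    The remaining problem is opening the files that a search returns. The configuration above simply blocks that; getting an effect like DNKA's requires output rewriting. I am simply pasting the mod_sar documentation here.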

QUOTE:
NAME
mod_sar - apache2 module which works as output filter and it's
          purpose is to Search And Replace strings found in web
          content before it's sending to the client.


COMPILE
mod_sar can be compiled with apxs(8) or manually by hand.

1. Using apxs for compilation:
apxs -c mod_sar.c

If everything goes fine, you will find mod_sar.so under .libs in your
current directory.

2. Compiling mod_sar manually:
gcc -pthread -I/usr/include/httpd -c mod_sar.c
gcc -shared mod_sar.o -Wl,-soname -Wl,mod_sar.so -o mod_sar.so

If needed, modify path to your httpd include directory and if everything
goes fine, you will find mod_sar.so in your current directory.


INSTALL
mod_sar can be installed with apxs(8) or manually by hand.

1. Using apxs for installation:
This command will compile and install your mod_sar module.
apxs -i -a -c mod_sar.c

Restart apache by first stopping it and then starting it:
apachectl stop
apachectl start

2. Installing mod_sar manually:
cp mod_sar.so /usr/lib/httpd/modules
chown root: /usr/lib/httpd/modules/mod_sar.so
chmod 755 /usr/lib/httpd/modules/mod_sar.so

If needed, modify path to your httpd modules directory.
Now, you have to modify your httpd.conf file. Find the bunch of
LoadModule directives and append your own line under them:
LoadModule sar_module modules/mod_sar.so

Restart apache by first stopping it and then starting it:
apachectl stop
apachectl start


DESCRIPTION
mod_sar ("sar" stands for Search And Replace) is apache2 module which
works as output filter. It's purpose is to search and replace strings
found in web content before it's sending to the client.
Search performed can be case sensitive or case insensitive, depending
on configuration.
Perfect example of common usage of this module is reverse proxy.

Reverse proxy is proxy in front of the local server, which can be
accessed from Internet only through that proxy. In some cases such
configuration can be used effectively to prevent worms and other
unwanted guests but most commonly it just present a false layer of
security for those who do not understand server - client communication.

Whatever reason you have, for usable reverse proxy you will have to
solve two problems: modification of headers and modification of
content before it's sending to client.

1. Header modification
Header modification is not problem at all. It can be achieved two
ways.
You can use mod_proxy_http:
    <IfModule mod_proxy.c>
        <Proxy *>
            Order deny,allow
            Allow from all
        </Proxy>
        ProxyRequests On
        ProxyPass / http://some-domain.local/
        ProxyPassReverse / http://some-domain.local/
        ProxyErrorOverride On
    </IfModule>
Or, you can use mod_rewrite:
    <IfModule mod_rewrite.c>
        RewriteEngine on
        RewriteRule ^/(.*) http://some-domain.local/$1 [P]
        RewriteOptions inherit
    </IfModule>

2. Content modification
Header modification will make all relative links look like they are
coming from external domain some-domain.com instead of real, local
domain some-domain.local. But if the server behind the reverse proxy
serves pages with absolute links, we will have to modify content of
that pages on the fly, using apache2 output filter mechanism.

There are three choices: mod_proxy_html, mod_ext_filter and mod_sar.
The first uses a libxml2 and because of that, it is not good for
purpose such as reverse proxy. For example, libxml2 will seriously
corrupt HTML code in case of a minor errors in HTML such as missing
quote. mod_proxy_html inherits that nasty habit from libxml2 but
if you want to try it your own, you can find that module at
http://apache.webthing.com/mod_proxy_html/
The second one is not a third party module, it comes with apache2
and it can suit needs for reverse proxy but it is not good for heavy
loaded sites because external command is executed for every request.
Here is example of mod_ext_filter usage:
    <IfModule mod_ext_filter.c>
        ExtFilterDefine fixtext mode=output intype=text/html \
            cmd="/bin/sed s/some-domain\.local/some-domain\.com/g"
        <Location />
            SetOutputFilter fixtext
        </Location>
    </IfModule>
And the third one is the one you are just looking at: mod_sar.
See the DIRECTIVES and EXAMPLES sections for usage information.
mod_sar will do one simple thing. It will replace one string
with another, depending on configuration. It can perform case
insensitive search if needed. It has been tested under heavy load
without performance impact.


DIRECTIVES
SarStrings <search_string> <replace_string>
       This directive requires two parameters, search string and
       replace string enclosed with double quotes.
       It can be used in server config and virtual host context.

SarCaseInsensitive <On|Off>
       If set to On, case insensitive search will be performed instead
       of exact string match.
       Default is Off.
       It can be used in server config and virtual host context.

SarVerbose <On|Off>
       If set to On, every time mod_sar is used as filter, message is
       printed into apache error logs.
       Default is Off.
       It can be used in server config and virtual host context.


EXAMPLES
       <IfModule mod_sar.c>
           AddOutputFilterByType sar_filter text/html
           SarStrings "http://some-domain.local" "http://some-domain.com"
           SarCaseInsensitive Off
           SarVerbose Off
       </IfModule>


REQUIREMENTS
Apache-2.0.


COMPATIBILITY
It has been tested on Linux but there is no obvious reason why it
wouldn't work on other unix platforms supported by apache2.
             OS:  Linux
       compiler:  gcc-2.9x, gcc-3.x
         apache:  apache-2.0.x


BUGS
Current version of mod_sar does not contain known bugs.


SEE ALSO
apxs(8), http://www.apache.org/


AUTHOR
Josip Deanovic <djosip@linuxpages.org>

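Because the output URL rules of the new Google Desktop version are fairly complex, rewriting them is hard. On top of that, Linux file systems carry so many permissions that Apache is denied access to many directories. So I could not be bothered to keep tinkering; after all, the search output already shows the full location of each file, and finding it through the file server from there is easy enough.
    Finally, I hope an expert who reads this article can help write the mod_sar output rules and help me complete this Google Desktop for Linux with Apache2 on LAN. Thank you.
Breaking the zero-reply count myself. I look forward to an expert helping with the mod_sar rules. I cannot read source code, so I do not know the details of this module's rules.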