My #YaCy stopped crawling after an update of the #Docker image. The cause was corrupted ownership and permissions on /var/lib/docker/volumes/yacy_search_server_data/_data/WORK/robots.bheap. Without the robots.txt table, all start URLs of a crawl returned #HTTP status 404. Meanwhile, the #RSS importer was fine.
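A quick way to inspect the ownership and permissions of that file is sketched below. It is only a diagnostic aid under stated assumptions: the helper names are hypothetical, the check reflects the user running the script (not necessarily the user YaCy runs as inside the container), and it does not say what the correct owner or mode should be.

```python
import os
import stat

# Path quoted in the post above.
ROBOTS_TABLE = "/var/lib/docker/volumes/yacy_search_server_data/_data/WORK/robots.bheap"


def describe_access(path):
    """Return (uid, gid, mode string) for path, e.g. (1000, 1000, '-rw-r--r--')."""
    info = os.stat(path)
    return info.st_uid, info.st_gid, stat.filemode(info.st_mode)


def is_readable_and_writable(path):
    """True if the *current* process can both read and write the file."""
    return os.access(path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    if os.path.exists(ROBOTS_TABLE):
        uid, gid, mode = describe_access(ROBOTS_TABLE)
        print(f"{ROBOTS_TABLE}: uid={uid} gid={gid} mode={mode}")
    else:
        print(f"{ROBOTS_TABLE} not found or not visible to this user")
```

Comparing the reported owner and mode against the neighbouring files in the same WORK directory should show whether this one file is the odd one out.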
#OpenSource #YaCy #SearchEngine Make Your Own SearchEngine
Via @yacy_search @FlossWeekly @TWiT #GitHub
Website YaCy
YaCy on GitHub
Community/Forum
YaCy Tutorials on YouTube
Support it on Patreon
TWiT Floss Weekly Podcast/Video YaCy
Organizing a data hoard with YaCy.
It should come as little surprise to anyone out there that I have a bit of a problem with hoarding data. Books, music, and of course files of all kinds that I download and read or use in a project for something. Legal briefs, research papers (arXiv is the bane of my existence), stuff people ask me to review, the odd Humble Bundle... So much so that a scant few years ago I rebuilt Leandra to better handle the volume of data in my library. However, it's taken me this long to both figure out and get around to making it easier to find anything in all that mess. If I can't find it, I can't do anything with it, or even figure out what I do or don't have. I also don't often have console access so it's not as if I can SSH in and grep for what I need. I use Nginx as a web server on Leandra so actually getting access to files when I need them is trivial.
So, I did some research, did some thinking, ran quite a few miles, and decided that it made more sense to just set up another instance of YaCy on Leandra with a slightly different configuration. For a while I'd considered running YaCy on a separate machine on my network as a dedicated search appliance but when I ran the numbers I realized that it would be pulling terabytes of data on a daily if not weekly basis across the network. I ran a couple of tests on a RasPi and wasn't able to keep it online due to its limited memory (I've since found a couple of tips for doing just this on the YaCy wiki now that it's back online). That, coupled with the network traffic issue, ruled out this strategy. I tried a couple of other information management packages but none of them really did what I wanted in the way I wanted. Eventually I opted to take the bull by the horns and go with the most straightforward solution: Clone the source code for YaCy onto Leandra, build it from the instructions, and fire it up. Due to how YaCy uses databases it didn't make sense to try to hammer one copy installed from an Arch Linux AUR package into doing double duty, so the most logical thing to do was deal with the fact that there were two running copies and two separate indexes and go do more interesting things.
I had to configure YaCy to listen on a different port because the default of 8090/tcp is occupied by the other instance that indexes all of the stuff I throw at it. When it was up and running I configured it for Intranet Indexing mode (YaCy Administration -> Use Case & Account -> check Intranet Indexing) and gave the search engine a name. Intranet mode basically tells YaCy "Figure out the local network you're on and only index things you find on that network because you're acting as an in-house search portal." I'm not certain, but I think this also implicitly prevents YaCy from interacting with the global YaCy peer-to-peer network, because it doesn't make any sense to scatter information about private data to the four winds. Next was kicking off an initial indexing run of my library (YaCy Administration -> Load Web Pages, Crawler -> Site: http://leandra/ -> click "Start New Crawl"). Due to how networking functions in Linux, YaCy's search traffic basically hit the network card, took a hairpin turn right back around to the webserver, and at no time did it impact the rest of my network. While the initial indexing run was going I set the same job to run on a daily basis (YaCy Administration -> Process Scheduler -> pick the "crawl start for http://leandra/" -> Scheduler column -> change "no repetition" to "activate scheduler" -> every 1 day -> click "Execute Selected Actions") so that the index would always be relatively up to date.
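The port conflict mentioned at the start of the paragraph above (8090/tcp already taken by the first instance) can be checked before picking a port for the second copy. This is a minimal sketch and not part of YaCy; the helper name port_in_use and the candidate port 8091 are illustrative assumptions.

```python
import socket


def port_in_use(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8090 is YaCy's default port per the text; 8091 is a hypothetical alternative.
    for candidate in (8090, 8091):
        state = "in use" if port_in_use("127.0.0.1", candidate) else "free"
        print(f"{candidate}/tcp is {state}")
```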
Now, here's something weird: I tried adding this second YaCy server to the Searx configs that one of my web_search_bot/ instances uses, but for some reason it didn't seem to work. I tried a couple of variations but wasn't able to get any search results out of stuff I know I have. Rather than waste more time fighting with it, I opted to take the path of least resistance and set up another Searx instance on Leandra on a different port. This copy of Searx is configured to only support the YaCy search engine, so it is little more than an API provider (which also means that I don't have to write a YaCy-specific bot). I then made another copy of web_search_bot.conf called library_bot.conf on Leandra and configured it to use the second Searx API to communicate with the second copy of YaCy. The URL to do this looks a little weird because it only makes calls to YaCy and nothing else (another weird thing I have to look into) and happens to be this: http://127.0.0.1:9999/?format=json&q=%21yy%20
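To illustrate how that URL gets used, here is a minimal sketch that appends URL-encoded search terms to it and fetches the JSON response. The function names are hypothetical, and this is not how web_search_bot.py itself is implemented. The "%21yy%20" part decodes to "!yy " followed by a space, presumably the shortcut configured for the YaCy engine in this Searx instance; the structure of the returned JSON is deliberately not assumed.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the text; "%21yy%20" is the URL-encoded "!yy " prefix.
SEARX_BASE = "http://127.0.0.1:9999/?format=json&q=%21yy%20"


def build_library_query(terms):
    """Append URL-encoded search terms to the Searx base URL."""
    return SEARX_BASE + quote(terms, safe="")


def search_library(terms, timeout=10):
    """Fetch and decode the JSON response; its layout is not assumed here."""
    with urlopen(build_library_query(terms), timeout=timeout) as response:
        return json.load(response)
```

For example, build_library_query("legal briefs") yields http://127.0.0.1:9999/?format=json&q=%21yy%20legal%20briefs, which can also be pasted into a browser to test the Searx instance by hand.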
I made another copy of the run.sh wrapper script for web_search_bot/ called librarybot.sh and edited it to load the library_bot.conf file, like this:
#!/usr/bin/env bash

export VIRTUAL_ENV="$(pwd)/env"
export PATH=$VIRTUAL_ENV/bin:$PATH
export PS1="(virtualenv) $PS1"
unset PYTHON_HOME

eval exec "./web_search_bot.py --config library_bot.conf"
exit 0
I added a new message queue for the bot to the config file for the XMPP bridge, restarted the XMPP bridge, and started up my Librarybot variant. Lo and behold, I can now search my files on Leandra from an XMPP client.
YaCy wound up being the right tool for the job in the long run for the simple reason that it can parse and index practically every type of file I have, from standard, boring ASCII text files to tabular data and CAD drawings. Also, after trying a bunch of different search packages, from GNOME Tracker to Open Semantic Search to Perkeep to Ambar, YaCy just does what I need with a minimum of screwing around and writing shims to trick software into doing what I wanted. There's no shortage of search libraries and frameworks out there, and in theory I could have written my own document search engine (and I might do something like that one day). But ultimately it came down to a fairly simple decision: Do I want to spend time writing a search engine and deal with the pain in the ass of not being able to find stuff for much longer, or do I want to go with a solution that I know works, will do what I want, and can get me on the road?
Sometimes the old ways may be best.
A couple of weeks back, I found myself in a discussion with a couple of friends about searching on the Internet and how easy it is to get caught up in a filter bubble and not realize it. Not to put too fine a point on it, because the big search engines (Google, Bing, and so forth) profile users individually and tailor search results to analyses of their search histories (and other personal data they have access to), it's very easy to forget that there are other things out there that you don't know about for the simple reason that they don't show stuff outside of that profile they've built up. If you're a hardcore code hacker you might find it very difficult to find poetry or the name of a television show you saw once unless you take fairly drastic action. The up-side of this profiling is that, inside of your statistical profile, search results are great. You can find what you need, when you need it. But outside of that? Good luck.
The point of the discussion was that there were ways that we could escape this filter bubble through application of self-hosted software and a little cooperation.
Ironically, searching through my conversation history I can't seem to find the thread in question so I'm relying entirely upon on-board storage (as it were). So, go ahead and laugh while I geek out. First, a little bit of Internet history.
Way, way, way back when, before Google was a thing and people still maintained personal homepages, it was not uncommon for people to post and curate lists of links to things they liked. This was pretty much how the web as we know it came to be, people linking to stuff that linked to other stuff, and so forth. As time passed some folks began collaborating on what net.history refers to as web directories, which were basically lists of hyperlinks (usually on the same topic) maintained by small groups of people. Curated web directories got more and more popular until the grand-daddy of them all, DMOZ (directory.mozilla.org) came to be in 1998. It doesn't exist anymore because it was bought out by AOL (which was later bought out by Verizon, under their Oath brand) and shut down a few years ago, but an open-source copy still exists online at dmoz-odp.org. For a long while it was probably the largest index of curated online resources out there. It was for this reason that it was a popular starting point for many search engines back in the day, from Lycos and Yahoo all the way to the very earliest iteration of Google and college projects pertaining to web crawling, indexing, and search.
This was the direction the discussion I mentioned earlier went in. All of us curate our own lists of useful (and, most importantly, still active) bookmarks online and we all have an interest in search and indexing of data. This is what we came up with:
Take a personal bookmarking system like Shaarli, which stores personal databases of useful links and occasionally notes. They can be private (you can't see the contents unless you're logged in), they can be public (the contents are all out there for anyone to look at), or they can be a combination of the two (some links kept private, some links public). Shaarli offers extremely flexible RSS and ATOM feeds of new content added to its database. It's nowhere near the size of DMOZ but your average Shaarli install is a good place to start. You could certainly do worse.
Take a federated, open source data search and indexing system like YaCy. Configure it so that you have a personal search engine. Assuming for the sake of this discussion that you have enough bandwidth to participate in the global YaCy network, do not configure it for Robinson Mode (not participating in the global YaCy network), just leave it as-is. Now, pull up the RSS feed for your Shaarli instance by adding ?do=rss to the end of the URL; for example, here's mine. If you have any private links in there don't worry, they won't appear in the RSS or ATOM feed. Tell YaCy to load your Shaarli RSS feed:
YaCy Administration Page
Index Export/Import
RSS Feed Importer
Paste your Shaarli RSS feed's URL here
Click "Show RSS items" to make sure it can load the feed
Change the indexing selector so that it says "Scheduled"
Change the schedule to "Every 1 day"
Click the "Add All Items to Index (full content of url)" button to tell YaCy to pull that RSS feed every week and index every item it finds.
What this does is tell YaCy to pull the RSS feed of your bookmark collection daily and index whatever new entries it finds. This will slowly grow a search index of new websites as you bookmark them. If you've configured YaCy to participate in the global YaCy peer-to-peer network, you are also helping the global index of the Internet it's building to grow by adding carefully curated links to things you find personally useful and/or interesting.
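To see roughly what the importer has to work with, the sketch below builds the feed URL the way the text describes (appending ?do=rss) and lists the link of every item in an RSS 2.0 document. It is an illustration only, not YaCy's importer; the function names and the example.org address are made up for the example.

```python
import xml.etree.ElementTree as ET


def shaarli_feed_url(base_url):
    """Append ?do=rss to a Shaarli base URL, as described in the text."""
    return base_url + "?do=rss"


def extract_links(rss_text):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


if __name__ == "__main__":
    # Hypothetical Shaarli address, for illustration only.
    print(shaarli_feed_url("https://example.org/links/"))
```

Feeding the downloaded feed text to extract_links gives the list of bookmarked URLs, which is a handy sanity check alongside the "Show RSS items" button.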
The $64kus question is, will this replace Google? An honest answer is, no, probably not. Google is a megacorp that throws billions of dollars at software development every year, with armies of software developers working around the clock. The YaCy network is run by a bunch of computer enthusiasts. What this does accomplish, however, is help construct a search engine that does not have a filter bubble (or at least, less of a filter bubble) because there are no profiling and ranking algorithms deciding what to show or not show a given user at a given time. The contents of the YaCy search index will be more carefully curated because they were selected by people for specific reasons with specific outcomes in mind. You will probably not get the perfectly tailored search results of Google, but you will get search results that are definitely relevant to your interests because they were added by someone with similar interests and needs to your own. Could it be gamed? Probably. It would be difficult because a rogue YaCy instance would need to be hacked to index specific data in specific, bad actor defined ways. It would be difficult but not impossible, and probably not worth the time and effort which could be "better" spent gaming Google, Bing, et al with easier and better known tactics.

For all my Twitter followers: a YaCy search engine to get your status message links out of. Go to http://smokingwheels.ddns.net, run a search for your username, and copy the link to use elsewhere. Thanks...
A Note for a history lesson for the family