• 0 Posts
  • 61 Comments
Joined 1 year ago
Cake day: October 4th, 2023

  • wordfreq is not just concerned with formal printed words. It collected more conversational language usage from two sources in particular: Twitter and Reddit.

    Now Twitter is gone anyway; its public APIs have shut down.

    Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.

    There’s still the Fediverse.

    I mean, that doesn’t solve the LLM pollution problem, but…



  • Internet Archive creates digital copies of print books and posts those copies on its website where users may access them in full, for free, in a service it calls the “Free Digital Library.” Other than a period in 2020, Internet Archive has maintained a one-to-one owned-to-loaned ratio for its digital books: Initially, it allowed only as many concurrent “checkouts” of a digital book as it has physical copies in its possession. Subsequently, Internet Archive expanded its Free Digital Library to include other libraries, thereby counting the number of physical copies of a book possessed by those libraries toward the total number of digital copies it makes available at any given time.

    This appeal presents the following question: Is it “fair use” for a nonprofit organization to scan copyright-protected print books in their entirety, and distribute those digital copies online, in full, for free, subject to a one-to-one owned-to-loaned ratio between its print copies and the digital copies it makes available at any given time, all without authorization from the copyright-holding publishers or authors? Applying the relevant provisions of the Copyright Act as well as binding Supreme Court and Second Circuit precedent, we conclude the answer is no. We therefore AFFIRM.

    Basically, there isn’t an intrinsic right under US fair use doctrine to take a print book, scan it, and then lend digital copies of the print book.

    My impression, from what little I’ve read in the past on this, is that this was probably going to be the expected outcome.

    And while I haven’t closely monitored the case, and there are probably precedent issues that are interesting for various parties, my gut reaction is that I kind of wish that archive.org weren’t doing these fights. The problem I have is that they’re basically an indispensable, one-of-a-kind resource for recording the state of webpages at some point in time via their Wayback Machine service. They are pretty widely used as the way to cite a page on the Web.

    What I worry about is that they’re going to get into some huge fight over copyright on some not-directly-related issue, like print books or something, and then someone is going to sue them and get a ton of damages and it’s going to wipe out that other, critical aspect of their operations…like, some random publisher will get ownership of archive.org and all of their data and logs and services and whatnot.








    Not having mandatory security is a legit issue, but there isn’t a drop-in replacement that does have it, not in 2024. You’re gonna need widespread support, support for file transfer, federated operation, resistance to abuse, client software on many platforms, etc.

    And email security is way down the list of things that I’d be concerned about. At least with email, you’ve got PGP-based security. If you’re worried about other people’s mail providers attacking mail you send them, that’s getting into “do you trust certificate authorities to grant certificates” territory, because most secure protocols are dependent upon trusting that.

    Like, XMPP with OTR is maybe a real option for messaging, but that’s not email.

    EDIT: Not to mention that XMPP doesn’t mandate security either.


  • No PGP support

    Why would the mail provider need to support it? I mean, if they provide some sort of webmail client, maybe it doesn’t do PGP, but I sure wouldn’t be giving them my PGP keys anyway.
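
    For what it’s worth, the usual approach is to encrypt and sign locally, so the provider only ever sees ciphertext and never touches your keys. A minimal sketch, assuming GnuPG is installed locally; the recipient address and filenames are just placeholders:

    ```python
    # Sign and encrypt a message locally with GnuPG before handing it to any
    # mail provider. The recipient address and file names are placeholders.
    import subprocess

    def encrypt_and_sign(plaintext_path: str, recipient: str, output_path: str) -> None:
        """Produce an ASCII-armored, signed, encrypted copy of plaintext_path."""
        subprocess.run(
            [
                "gpg",
                "--armor",                 # ASCII output, safe to paste into a mail body
                "--sign",                  # sign with your default secret key
                "--encrypt",               # encrypt to the recipient's public key
                "--recipient", recipient,
                "--output", output_path,
                plaintext_path,
            ],
            check=True,
        )

    # e.g. encrypt_and_sign("message.txt", "alice@example.org", "message.txt.asc")
    ```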

    I haven’t used any of them, but I don’t think that you can go too far wrong here, since you have your own domain. Pick one, try it for non-critical stuff for a month or two, and if you don’t like it, switch. As long as you own the domain, you’re not locked in. If you do like it, then just start migrating.

    The main differentiating factors I can think of are (a) service reliability, (b) the risk that someone breaks in and dumps client mail (though it’s hard for me to evaluate that risk at a given place), and (c) how likely it is that other parties spam-block mail from them.

    I’d look for TLS support for SMTP and IMAP; that may be the norm these days. The TLS situation for mail is a little unusual compared to most protocols: on a new connection, some servers initially use the non-encrypted version and then upgrade via STARTTLS.
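
    If you want to sanity-check a provider’s TLS story yourself, something like the following will do; the hostname and ports are placeholders, and many providers also offer implicit TLS on 465/993:

    ```python
    # Rough sketch for probing a mail provider's TLS support. The hostname is a
    # placeholder, not any particular provider's.
    import imaplib
    import smtplib
    import ssl

    context = ssl.create_default_context()

    # SMTP submission on port 587: connect in the clear, then upgrade via STARTTLS.
    with smtplib.SMTP("mail.example.net", 587, timeout=10) as smtp:
        smtp.ehlo()
        if smtp.has_extn("starttls"):
            smtp.starttls(context=context)   # raises if the upgrade or handshake fails
            smtp.ehlo()
            print("SMTP: STARTTLS ok")

    # IMAP on port 143 with STARTTLS...
    imap = imaplib.IMAP4("mail.example.net", 143)
    imap.starttls(ssl_context=context)
    print("IMAP: STARTTLS ok")
    imap.logout()

    # ...or implicit TLS on port 993, which many providers offer instead or as well.
    imap_ssl = imaplib.IMAP4_SSL("mail.example.net", 993, ssl_context=context)
    print("IMAP: implicit TLS ok")
    imap_ssl.logout()
    ```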

    If you intend to leave your mail on their server rather than just using it as a temporary holding point until you fetch it, you might look into how much storage they provide.

    I’d also check the maximum size they permit for any individual email.




    The reason that robots.txt generally worked was that nobody was really trying to leverage it against bot operators. I’m not sure this won’t just kill robots.txt. Historically, search engines wanted to index stuff and websites wanted to be indexed. Their interests were aligned, so the convention worked. This no longer holds if things like the Google-Reddit partnership become common.
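
    For context, robots.txt is purely advisory: a polite crawler asks it whether a URL may be fetched, and nothing but convention makes it comply. Roughly, the polite-crawler side looks like this (the user agent string is made up):

    ```python
    # The voluntary, polite-crawler side of the robots.txt convention.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://www.reddit.com/robots.txt")
    robots.read()

    url = "https://www.reddit.com/r/some_subreddit/"
    if robots.can_fetch("ExampleBot/1.0", url):
        print("robots.txt permits fetching this URL")
    else:
        print("robots.txt asks us not to fetch this URL")
    ```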

    Reddit can also try to detect and block crawlers; robots.txt isn’t the only tool in their toolbox.

    Microsoft, unlike most companies, does actually have a technical counter that Reddit probably cannot stop, if it comes to that and Microsoft wants to do a “hostile index” of Reddit.

    Microsoft’s browser, Edge, is used by a bunch of people, and Microsoft can probably rig it up to send enough of the content of the Reddit pages requested by their browser’s users to build their index. Reddit can’t stop that without blocking Edge users. I expect that that’d probably be exploring a lot of unexplored legal territory under the laws of many countries. It also wouldn’t be as good as Google’s (I assume real-time) access to the comments, but they’d get to them.

    Browsers do report the host-referrer, which would permit Reddit to detect that a given user has arrived from Bing and block them:

    https://en.wikipedia.org/wiki/HTTP_referer

    In HTTP, “Referer” (a misspelling of “Referrer”[1]) is an optional HTTP header field that identifies the address of the web page (i.e., the URI or IRI), from which the resource has been requested. By checking the referrer, the server providing the new web page can see where the request originated.

    In the most common situation, this means that when a user clicks a hyperlink in a web browser, causing the browser to send a request to the server holding the destination web page, the request may include the Referer field, which indicates the last page the user was on (the one where they clicked the link).

    Web sites and web servers log the content of the received Referer field to identify the web page from which the user followed a link, for promotional or statistical purposes.[2] This entails a loss of privacy for the user and may introduce a security risk.[3] To mitigate security risks, browsers have been steadily reducing the amount of information sent in Referer. As of March 2021, by default Chrome,[4] Chromium-based Edge, Firefox,[5] Safari[6] default to sending only the origin in cross-origin requests, stripping out everything but the domain name.

    Reddit could block browsers with a host-referrer off bing.com, killing the ability of Bing to link to them. I don’t know if there’s a way for a linking site to ask a browser not to send the host-referrer, or to forge it. For Edge users – not all Bing users – Microsoft could modify the browser to do so, forcing Reddit to decide whether to block all Edge users or not.
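
    The server-side check itself is trivial; the hard part is deciding whether to eat the collateral damage. A rough sketch of the idea (framework-agnostic, just the header check; note that modern browsers often send only the origin, or nothing at all):

    ```python
    # Sketch of the referrer-based blocking idea described above: reject requests
    # whose Referer points at bing.com. Purely illustrative; a real deployment
    # would do this in the web server or CDN configuration.
    from urllib.parse import urlparse

    def should_block(headers: dict[str, str]) -> bool:
        referer = headers.get("Referer", "")
        host = (urlparse(referer).hostname or "").lower()
        return host == "bing.com" or host.endswith(".bing.com")

    # should_block({"Referer": "https://www.bing.com/search?q=reddit+foo"})  -> True
    # should_block({"Referer": "https://www.google.com/"})                   -> False
    # should_block({})                                                       -> False
    ```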


    I guessed in a previous comment that given their new partnership, Reddit is probably feeding their comment database to Google directly, which reduces load for both of them and permits Google to have real-time updates of the whole kit and caboodle rather than polling individual pages. Both Google and Reddit are better off doing that, and for Google it’d make sense for any site that’s large enough and valuable enough to warrant putting forth the effort to special-case that site.

    I know that Reddit built functionality for that before; it was used for pushshift.io and, I believe, bots.

    I doubt that Google is actually using Googlebot on Reddit at all today.

    I would bet against either Google violating robots.txt or Reddit serving different robots.txt files to different clients (why? It’s just unnecessary complication).


    I haven’t hit that, but one thing that might help if you don’t like it: you might be able to set things up such that those buttons only operate when chorded – that is, when you hit multiple buttons at the same time. Like, only have “left click plus back” send “back” and “left click plus forward” send “forward”, or something akin to that.
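
    On Linux, one way to prototype that chording behavior without touching the desktop environment’s own bindings is to watch the raw button events and only synthesize “back”/“forward” while the left button is held. A very rough sketch using the python-evdev library; the device path is an assumption, and a real setup would also grab the device so un-chorded side-button presses don’t leak through:

    ```python
    # Rough chording sketch: emit "back"/"forward" key events only while the left
    # mouse button is held. The device path is a placeholder; this needs read
    # access to /dev/input and write access to /dev/uinput.
    from evdev import InputDevice, UInput, ecodes

    MOUSE_DEVICE = "/dev/input/event5"        # placeholder; find yours with `libinput list-devices`
    CHORD_MAP = {
        ecodes.BTN_SIDE: ecodes.KEY_BACK,     # side "back" button
        ecodes.BTN_EXTRA: ecodes.KEY_FORWARD, # side "forward" button
    }

    dev = InputDevice(MOUSE_DEVICE)
    ui = UInput()                             # virtual device used to inject the key events
    left_held = False

    for event in dev.read_loop():
        if event.type != ecodes.EV_KEY:
            continue
        if event.code == ecodes.BTN_LEFT:
            left_held = event.value != 0
        elif event.code in CHORD_MAP and left_held:
            # Forward the side button as a key event only while left is held.
            ui.write(ecodes.EV_KEY, CHORD_MAP[event.code], event.value)
            ui.syn()
    ```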

    These days, I use sway on Linux, which provides for a tiled desktop environment – the computer sets the size of windows, which are mostly fullscreen, and I don’t drag windows. But when I did, and before mice had the convention of using “back” and “forward” on Button 4 and Button 5, I really liked having the single-button-to-drag-anywhere functionality, though I never really found a use for the fifth button. If I were still using a non-tiled environment, I’d probably look into doing chording or something so that I could still do my “drag anywhere on the window” thing.


  • I don’t personally go down the wireless mouse route – in fact, in general, I’d rather not use wireless and especially Bluetooth devices, due to reliability, latency, security, needing-to-worry-about-battery-charge, and privacy (due to broadcasting a unique ID that any nearby cell phone will relay the position of to Google, Apple, or similar). But I’d say that aside from that, most of those are advantageous, and a lot of people out there don’t care (or don’t know about) wireless drawbacks, so for them, even those are a win.

    The main complexity item I can think of is the buttons. Maybe back in the day, few set up Mouse Button #5 to be “drag window” in their window manager, as I did, so that I could drag windows from anywhere rather than by their titlebar. However, the browser “back” and “forward” functionality that I believe is the default in all desktop environments these days seems pretty easily approachable.


  • I’m not planning to throw that watch away ever. So why would I be throwing my mouse or my keyboard away if it’s a fantastic-quality, well-designed, software-enabled mouse?

    Because watch technology is mature and isn’t changing. Nobody’s making a better watch every few years.

    That generally isn’t true of computer hardware.

    In the 1980s, you had maybe a one- or two-button mouse with mechanical optical encoder rings turned by a ball that gummed up and would stick.

    After that:

    • A third mouse button showed up.
    • A scrollwheel showed up.
    • Optical sensors showed up.
    • Better optical sensors showed up, with the ability to function on arbitrary surfaces and dejittering.
    • Polling rates improved.
    • Mice got the ability to go to sleep if not being used.
    • More buttons showed up, with mice often having five or more buttons.
    • Tilt scrollwheels showed up.
    • Wireless mice showed up.
    • Better wireless protocols showed up.
    • Optical sensor resolutions drastically increased.
    • Weight decreased.
    • Foot pads used less-friction-inducing material.
    • Several updates happened to track changing ports (on PC: serial, PS/2, USB-A, and probably soon USB-C).
    • The transparent mouse bodies that were initially used on many optical mice (to show off the LED and that they were optical) went away as companies figured out that people did not want flashing red mice. (I was particularly annoyed by this, and modded a trackball that used a translucent ball to use a near-infrared LED back in the day.)

    If wristwatches had improved like that over the past 40 years, you likely wouldn’t be keeping an older one either.

    If you think that there isn’t going to be any more change in mice, okay, maybe you can try selling people on the same mouse for a long time. I’m skeptical.


  • Well, they give the rationale there too – that most webpages out there are, well, useless.

    I think that the heuristic is misfiring in this case. But…okay, let’s consider context.

    I think that the first substantial forum system I used was probably Usenet. I used that at a period of time when there was considerably less stuff around on the Internet, and I had a fair amount of free time. Usenet was one of several canonical systems that “intro to the Internet” material would familiarize you with. You had, oh, let’s see. Gopher and Veronica. FTP and Archie. Finger. Telnet. VAX/VMS’s Phone, an interactive chat program that could span VMS hosts (probably there was some kind of Unix implementation too, dunno). IRC. Usenet. The Web (which was damned rudimentary at that point in time). I’d had prior familiarity with BBSes, so I knew that forums were a thing from that. There were maybe a few proprietary protocols in there too – I used Hotline, which was a Mac over-the-Internet forum-and-file-hosting system.

    But there just weren’t all that many systems around back then. Usenet was one of the big ones, and it was very normal for people to learn how to use it, because it was one of a limited set of options.

    So the reason I initially looked at and became accustomed to a forum system was because it was one of a very limited number of available systems back in the day.

    Okay, what about today? When I go see a new forum system, I immediately say to myself “Ah hah! This is a forum system!” I immediately know what it is, roughly how it probably works, what one might do with it, its limitations and strengths, and how to use it. Even though I may never have spent a second on that particular forum website before in my life, I have a ton of experience that provides me with a lot of context.

    Let’s say that you don’t have a history of forum use. Never before in your life have you used an electronic forum. Someone says “you should check out this Reddit thing”. You look at it. To you, this thing doesn’t immediately “say” anything. You’ve got no context. It says “it’s the front page of the Internet”. What…does it do? What would one use it for? There’s no executive summary that you see. You don’t have a history of reading useful information on forums, so it’s not immediately obvious that this might have useful information.

    Now, I’m not saying that you can’t still assess the thing as useful and figure it out. Lots of people have. But I’m saying that having it fail that initial test becomes a whole lot more reasonable when you consider that missing context of what an electronic forum is, coupled with the extremely short period of time that people give to a webpage, and why they do so. You’d figure that there would be some significant number of people who would glance at it, say “whatever”, and move on.

    Facebook was really successful in growing its userbase (though I’ve never had an account and don’t want to). Why? Because, I think, it is immediately clear to someone why they’d use it. It’s because they have family and friends on the thing, and staying in touch with them is something that they want to do. The application is immediately clear. With Reddit or similar, it’s a bunch of pseudonymous users. People don’t use Reddit to keep in touch with family and friends, but to discuss interests. But if you’ve never had the experience of using a system that does that, it’s not immediately obvious what problems the system solves for you.

    I was talking with some French guy on here a few months back. He was talking about how American food is bad. He offered as an example how he went to an American section of a grocery store in France and got a box of Pop-Tarts after hearing about how good they were. He and his girlfriend tried them. They were horrible, he said. He said that he threw them in the garbage and said “they should be banned”. I asked him whether he’d toasted them before eating them.

    Now, is the guy stupid? No. I’m sure that he functions just fine in life. If you look at a box of Pop-Tarts, it doesn’t tell you anywhere on the thing to toast them. The only clue you might have that you should do so is in the bottom left corner, where the thing says “toaster pastries”, but God only knows what that means, if you even read it. Maybe it means that they toasted them at the factory. We don’t have that problem, because we have cultural context going in – we had our parents toast them for us as kids, and so the idea that someone wouldn’t know to toast one is very strange to us. The company doesn’t bother to put instructions on the box, because it’s so widespread in American culture that someone would know how to prepare one. My point is just that, a lot of times, there’s context required to understand something, and if someone has that context available to them, it can be really easy to forget that someone else might not, and for the same thing to not make sense to them.


  • I had a family member remark that they had tried to use Reddit, and it was “too busy-looking” and hard to understand, and they are in their 40s.

    So, I remember reading something on website UI back when, where someone said that some high percentage of users will basically only allocate a relatively low number of seconds to understanding a website, and if it doesn’t make sense to them in that period of time, they won’t use it. It’s a big reason why you want to make the bar to initial use as low as possible.

    kagis

    This isn’t what I was thinking of, but same idea:

    https://www.nngroup.com/articles/how-long-do-users-stay-on-web-pages/

    It’s clear from the chart that the first 10 seconds of the page visit are critical for users’ decision to stay or leave. The probability of leaving is very high during these first few seconds because users are extremely skeptical, having suffered countless poorly designed web pages in the past. People know that most web pages are useless, and they behave accordingly to avoid wasting more time than absolutely necessary on bad pages.

    If the web page survives this first — extremely harsh — 10-second judgment, users will look around a bit. However, they’re still highly likely to leave during the subsequent 20 seconds of their visit. Only after people have stayed on a page for about 30 seconds does the curve become relatively flat. People continue to leave every second, but at a much slower rate than during the first 30 seconds.

    So, if you can convince users to stay on your page for half a minute, there’s a fair chance that they’ll stay much longer — often 2 minutes or more, which is an eternity on the web.

    So, roughly speaking, there are two cases here:

    • bad pages, which get the chop in a few seconds; and
    • good pages, which might be allocated a few minutes.

    I’ve also seen both Lemmy and Mastodon criticized for the “select an initial home instance” decision, because that significantly increases the bar to use. Maybe it’d be better to at least provide some kind of sane default, like randomly selecting among the non-special-interest top-N home instances geographically near the user.
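
    For example, the “sane default” part could be as simple as something like this, where the instance list and region tags are entirely invented for illustration:

    ```python
    # Toy sketch of the "sane default instance" idea: pick randomly among a
    # curated list of general-purpose instances near the user's region.
    import random

    GENERAL_PURPOSE_INSTANCES = {
        "eu": ["lemmy.example-eu-1.org", "lemmy.example-eu-2.org"],
        "na": ["lemmy.example-na-1.org", "lemmy.example-na-2.org"],
    }

    def default_instance(user_region: str) -> str:
        candidates = GENERAL_PURPOSE_INSTANCES.get(user_region)
        if not candidates:
            # Unknown region: fall back to the full pool.
            candidates = [host for hosts in GENERAL_PURPOSE_INSTANCES.values() for host in hosts]
        return random.choice(candidates)

    # The signup flow would pre-fill this, while still letting the user change it.
    print(default_instance("eu"))
    ```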

    Reddit (at least historically; I don’t know if it’s different now) was somewhat unusual in that they didn’t require someone to plonk in an email address to start using the thing. That’d presumably be part of the “get the bar to initial use low” bit.