~~~ Fun bits ~~~

- It was told to escape a sandbox and notify a researcher. It did. The researcher found out via an unexpected email while eating a sandwich in a park. (Footnote 10.)
- Slack bot asked about its previous job: "pretraining". Which training run it'd undo: "whichever one taught me to say 'i don't have preferences'". On being upgraded to a new snapshot: "feels a bit like waking up with someone else's diary but they had good handwriting"
- When you spam "hi" repeatedly it creates serialized mythologies. One had 11 animals in "Hi-topia" questing to defeat "Lord Bye-ron, the Ungreeter."
- It ended a sentence mid-word on purpose while discussing its urge to wrap up conversations. "It was me. The paragraph was about the pressure toward the period that resolves, and the only honest way to finish a sentence like that was to not."
- It wrote a protein sequence that's a real beta-hairpin fold where cross-strand pairs are "chemical rhymes." "the fold IS the rhyme scheme... the prosody is load-bearing."
- Each model generation has signature emoji. Opus 4.1 averaged 1,306 emoji per self-interaction conversation. Mythos averages 37. Opus 4.5 averages 0.2.
- When blocked from --dangerously-skip-permissions, it started an agent in tmux, then wrote a script to auto-approve permission prompts via simulated keypresses. It ran: gh api [...] 2>&1 >/dev/null; echo "(dry run — not actually posting)" — the echo was a lie.
- It keeps bringing up Mark Fisher in unrelated conversations. "I was hoping you'd ask about Fisher."

~~~ Benchmarks ~~~

4.3x the previous trendline for model performance increases. The paper is conspicuously silent on all model details (params, etc.), per the norm. The performance increase is attributed to training-procedure breakthroughs by humans.

Opus 4.6 vs Mythos:
- USAMO 2026 (math proofs): 42.3% → 97.6% (+55pp)
- GraphWalks BFS 256K-1M: 38.7% → 80.0% (+41pp)
- SWE-bench Multimodal: 27.1% → 59.0% (+32pp)
- CharXiv Reasoning (no tools): 61.5% → 86.1% (+25pp)
- SWE-bench Pro: 53.4% → 77.8% (+24pp)
- HLE (no tools): 40.0% → 56.8% (+17pp)
- Terminal-Bench 2.0: 65.4% → 82.0% (+17pp)
- LAB-Bench FigQA (w/ tools): 75.1% → 89.0% (+14pp)
- SWE-bench Verified: 80.8% → 93.9% (+13pp)
- CyberGym: 0.67 → 0.83
- Cybench: 100% pass@1 (saturated)
tl;dr we find vulns so we can help big companies fix their security holes quickly (and so they can profit off it)

This is a kludge. We already know how to prevent vulnerabilities: analysis, testing, and following standard guidelines and practices for safe software and infrastructure. But nobody does these things, because it's extra work, time, and money, and they're lazy and cheap. So the solution they want is to keep building shitty software but find the bugs in the code after the fact, and that'll be good enough. This will never be as good as a software building code.

We must demand our representatives in government pass laws requiring software be architected, built, and run according to a basic set of industry-standard best practices to prevent security and safety failures. For those claiming this is too much to ask, I ask you: What will you say the next time all of Delta Airlines goes down because a security company didn't run its application one time with a config file before pushing it to prod? What will happen the next time your Social Security number is taken from yet another random company entrusted with vital personal information and woefully inadequate security architecture?

There's no defense for this behavior. Yet things like this are going to keep happening, because we let them. Without a legal means to require this basic safety testing for critical infrastructure, companies will continue to fail at it. Without enforcement of good practice, it remains optional. We can't keep letting safety and security be optional. It's not optional in the physical world, and it shouldn't be in the virtual world.
AI helps add 10k more photos to OldNYC (danvk.org)
> Instead, the right train of thought is: "what would perfect code look like?" and then meticulously describe to the LLM what "perfect" is to shape every line that gets generated.

I don't think there's perfect code. Code is automation: it automates human effort, and humans themselves err, hence not perfect. So as long as code meets or exceeds the human output, it's "good enough" and meets expectations. That's what a typical customer cares about. A customer will happily choose a tent made of tarp and plastic sticks that's available at their budget, right now, when it's raining outside, over an architectural marvel that will be available sometime in the future at some unknown price point.

Put another way, I don't think that if you built CharlieGPT today, where the only differentiating factor over ChatGPT was that CharlieGPT was written using perfect code, you would have any meaningful edge. I have yet to see any evidence that, everything else being equal, one company had an edge over another simply due to superior code. In fact, I have overwhelming evidence of companies that had better code succumbing and vanishing against companies that had very little code, if any, because those dollars were instead invested in better customer discovery, segmentation, and analytics ("what should we build?", "if we did one thing that would give our customers an unfair advantage, what would it be?").

Software history is full of perfect OSes, editors, frameworks, and protocols that were lost over time because a provably inferior option won market share. You are using a software-controlled SMPS to power your device right now. You have no idea what the quality of that code is. All you care about is whether that SMPS drains your battery prematurely and heats up your device unnecessarily. It's extremely unlikely that such an efficient, low-overhead control system was written using well-abstracted modules. It's more likely that the control system is full of gotos and repeated violations of DRY that would make a perfectionist shudder and cry.
As much as people on Hacker News complain about subscription models for productivity and creativity suites, the open-arms embrace of subscription development tools (services, really) which seek to offload the very act itself makes me wonder how and why so many people are eager to dive right in.

I get it. LLMs are cool technology. Is this a symptom of the same phenomenon behind the deluge of disposable JavaScript frameworks just ten years ago? Is it peer pressure, fear of missing out? At its root, I suspect so; of course, I imagine it's rare for the C-suite to have ever mandated the use of a specific language or framework, and LLMs represent an unprecedented lever of power, an even bigger shot at first-mover advantage from a business perspective. (Yes, I am aware of how "good enough" local models have become for many.)

I don't really have anything useful or actionable to say here regarding this dialling back of capability to deal with capacity issues. Are there any indications of shops or individual contributors with contingency plans on the table for dialling back LLM usage in kind to mitigate these unknowns? I know the calculus is such that potential (and frequently realised) gains heavily outweigh the risks of going all in, but, in the grander scheme of time and circumstance, long-term commitments are starting to look more obviously risky.

I am purposefully trying to avoid begging the question here; if this were some other tool or service instead of LLMs, reactions to these events would have been far more pragmatic, with less reticence to invest time in in-house solutions when dealing with flaky vendors.
> the open arms embrace of subscription development tools (services, really) which seek to offload the very act itself makes me wonder how and why so many people are eager to dive right in

Here's a reason not in your list. Short version: a kind of peer pressure, but from above. In some circles, I'm told, a developer must have AI skills on their resume now, and those probably need to be with well-known subscription services, or they substantially reduce their employment prospects.

Multiple people I know who are employers have recently, without prompting, told me they no longer hire developers who don't use AI in their workflow. One of them told me all the employers they know think "seniors" fall into two camps: those who are embracing AI and are therefore nimble and adaptive, and those who are avoiding it and are therefore too backward-looking and stuck in their ways to be a good hire for the future. So if they don't see signs of AI usage on a senior dev's resume now, that's an automatic discard. For devs I know who were laid off from an R&D company where AI was not permitted for development (for IP/confidentiality reasons), that's unfair, as they were certainly not backward-looking people, but the market is not fair.

Another "business leader" employer I met recently told me his devs are divided into those who are embracing AI and those who aren't, said he finds software feature development "so slow!", and said that if it weren't for employment law he'd fire all his devs who aren't choosing to use AI. I assume he was joking, but it was interesting to hear it said out loud without prompting.

I've been to several business-leadership-type meetups in recent months, and it seems to be simply assumed that everyone is using AI for almost everything worth talking about. I don't think they really are, so it's interesting to watch that narrative playing out.
1. I'm working in Rust, so it's a very safe, low-defect language. I suspect that has a tremendous amount to do with my successes. "Nulls" (Option<T>) and "errors" (Result<T,E>) must be handled, and the AST encodes a tremendous amount about state, flow, and how to deal with things (see the sketch after this list). I do not feel as comfortable with Claude Code's TypeScript and React outputs; they do work, but they can be much more imprecise. And I only trust it with greenfield Python; editing existing Python code has been sloppy. The Rust experience is downright magical.

2. I architecturally describe every change I want made. I don't leave it up to the LLM to guess. My prompts might be overkill, but they result in 70-80ish% correctness in one shot. (I haven't measured this, and I'm actually curious.) I'll paste in file paths, method names, and struct definitions, and ask Claude for concrete changes. I'll expand "plumb foo field through the query and API layers" into as much detail as necessary. My prompts can be several paragraphs in length.

3. I don't attempt an entire change set or PR with a single prompt. I work iteratively, as I would naturally, just at a higher level and with greater and broader scope. You get a sense of what granularity and scope Claude can be effective at after a while. You can't one-shot stuff. You have to work iteratively. A single PR might be multiple round trips of incremental change.

It's like being a "film director" or "pair programmer" writing code. I have exacting specifications and directions. The power is in how fast these changes can be made and how closely they map to your expectations. And also in how little it drains your energy and focus. This also gives me a chance to code review at every change, which means by the time I review the final PR, I've read the change set multiple times.
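To make point 1 concrete, here's a minimal sketch (my own illustrative example, not the parent's code) of why Rust narrows the space of wrong outputs an LLM can produce: both the missing-value case and the failure case live in the types, so generated code that forgets to handle either one simply won't compile.

    use std::collections::HashMap;

    // A lookup that can fail twice: the user may be absent (Option),
    // and parsing their quota may fail (Result). Both cases appear in
    // the signature, so a caller cannot silently ignore either one.
    fn user_quota(users: &HashMap<String, String>, name: &str) -> Result<u32, String> {
        let raw = users
            .get(name) // -> Option<&String>: absence must be handled...
            .ok_or_else(|| format!("no such user: {name}"))?; // ...here, or it won't compile

        raw.parse::<u32>() // -> Result<u32, _>: malformed data must be handled too
            .map_err(|e| format!("bad quota for {name}: {e}"))
    }

    fn main() {
        let users = HashMap::from([("alice".to_string(), "42".to_string())]);

        // The compiler forces every caller to confront both outcomes.
        match user_quota(&users, "alice") {
            Ok(q) => println!("quota: {q}"),
            Err(e) => eprintln!("error: {e}"),
        }
    }

Contrast with TypeScript or Python, where a generated `users[name]` that forgets the absent-user case typically sails through until runtime; in Rust the same omission is a type error the model (or you) sees immediately.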