I've become a fan of unique, relatively short and "human-readable" IDs, such at ...

vanderZwan · on June 5, 2024

Nice!

For the record: the valid chars string is 62 characters, so naively using a modulo on a random byte will technically introduce a bias (since dividing 256 values by 62 leaves a remainder). I don't expect it to really matter here, but since you're putting in the effort of using crypto.randomBytes I figured you might appreciate the nitpick ;).

Melissa E. O'Neill has a nice article explaining what the problem is, and includes a large number of ways to remove the bias as well:

https://www.pcg-random.org/posts/bounded-rands.html

(in this case adding two more characters to the validChars string would be the easiest and most efficient fix, but I'm not sure if that is a possibility here)

naasking · on June 5, 2024

> For the record: the valid chars string is 62 characters, so naively using a modulo on a random byte will technically introduce a bias

Indeed, there's no reason you couldn't just add "_" and "-" or "." as well to complete the set. Your identifier will still be URL-safe. I've been using this type of encoding for years [1] for these kinds of ids to use in URLs, and encoding/decoding is super-fast with some bit shifts. And unlike Base64, you don't need padding characters.

[1] https://sourceforge.net/p/sasa/code/ci/default/tree/Sasa.Web...

notfed · on June 5, 2024

- and _ tend to break text selection.

naasking · on June 5, 2024

I'm not sure what you mean by "break". If you mean that touching or double-clicking on a block of text only extends up to the nearest symbols, that's true. But if your text selection UX is not terrible then it should be simple to extend that further.

That said, iOS and Android text selection have gotten worse recently, IMO.

Too · on June 6, 2024

Usually one is not in control over every place where text can be selected.

As a developer I will be exposed to ids being displayed in code, terminals, browsers of various sorts, database editors, json dumps, text editors, api responses, chat messages, you name it.

naasking · on June 6, 2024

Sure, I agree. I'm not sure what point you're trying to make though.

praestigiare · on June 6, 2024

Because - and _ break text selection in existing systems you do not have control over, if you use those characters your ids will become harder to select.

naasking · on June 6, 2024

Someone already made this point, and once again, "break" is completely undefined. It is not at all impossible or even difficult to select text with - or _, so what's "broken" exactly? At worst it takes one extra step to extend the default selection. These are such weird objections.

poops · on June 7, 2024

I work across multiple machines with different pointing devices (regular mouse, vertical mouse, touchpad), and have no issues double clicking to select a word. Dragging from the start of a word to the end can sometimes take multiple tries. I may miss the first letter. I may drag too far. The vertical mouse isn't great at holding a selection. It's not a huge deal, but it's an annoyance that I don't run into working with Stripe IDs.

helpfulContrib · on June 7, 2024

The entire point of making this form of ID is to make it a friendly user experience, to create an ID that can easily be communicated by a person over the phone to another phone, to make an ID that is easy to click, double-click, tap (with fingers on a touch surface), double-tap, hold-and-select and tap to 'copy', and so on.

The _ and - symbols make this difficult in the edge cases of all of the above. Do you call it dash, or underscore? Line? Hyphen? Binder-strich?

&etc.

I would have gone with adding the @ and ~ symbols, those are at least parsable in human form as well .. ".. email symbol and squiggly thing .. "

cogman10 · on June 5, 2024

A problem with this approach is it's not monotonical.

Especially if you want to use this thing as an index in a database, you'll run into problems where you try doing middle insertions frequently, which causes fragmentation.

The solution to this problem is making the higher order characters time sorted [1]. You don't need to go all out like uuid, you can have a pretty low resolution. It's more important that new insertions tend to be on the same page. If you have a low frequency insertions then minute resolution is probably good enough. (Minutes since 2000 is an easy calculation).

To implement that here, I'd suggest looking at how base64 or 85 encoders work and use that instead of repeated mods. You can then dedicate the upper bits to a time component and the lower bits can remain random. [2]

[1] https://vladmihalcea.com/uuid-database-primary-key/

[2] https://github.com/mklemm/base-n-codec-java/blob/master/src/...

reissbaker · on June 5, 2024

IMO it's nice to have two keys:

1. An auto-incremented 64-bit (unless you have a good reason, in which case 32-bit is fine) primary key, used internally for foreign key relations. This will generally result in less index bloat on associated tables, and fast initial inserts.

2. A public-facing random string ID. Don't use this internally (other than in an index on the table it's defined for), since it's large. But this should be the only key you expose to end-users, to prevent leaking data via the German Tank Problem: https://en.wikipedia.org/wiki/German_tank_problem

Only create the second key if this is data you're exposing to users, of course — for data that's only used internally, just use the 64-bit auto-incremented PK and skip the added index bloat entirely.

cogman10 · on June 5, 2024

For the number 2, I think one issue is that you are going to be semi-frequently whacking the db to do a mapping of that random string id back to the real id. OK for smaller entities but might be a pain if there's a lot of those ids to wrangle. You can throw a secondary index on it, but that will still have some minor fragmentation issues.

One benefit of a random id is if you are working with more complex data models it can make creating those easier/faster. Instead of having a centralized location to get new ids from (the DB) you can create ids on the fly from the application which can turn the write into a single action from the application rather than a dance of inserting the main table, getting the new id, then inserting to the normalized tables.

reissbaker · on June 6, 2024

Assuming you've set up an index on the public ID string, that should be a super fast query even on a table with billions of rows. Beyond billions of rows you're looking at something a lot more custom than a standard SQL DB, but assuming you shard the data, it should still be a very fast query regardless of dataset size.

Worst case you can cache, as others have mentioned, but TBH I think you don't need to for a query that simple and should save the cache space for something more complicated. Most SQL DBs are excellent for ready-heavy workloads; it's writes that tend to make them fall over.

caeril · on June 5, 2024

That's why the Good Lord invented caching. In most applications, 90% of your workload will be over ids less than a week old, so your hit rate is likely to be pretty high for this sort of mapping.

cogman10 · on June 5, 2024

First hit can be a beast. It's workload/entity determinant if caching is enough for this.

Not great if you are spending 1 minute on the first lookup just to do the mapping.

reissbaker · on June 6, 2024

Although I think it'll be fine even without a cache for most workloads (it's super easily indexed since the cardinality of random data is extremely high), I'm curious what cache setup you've got that it would take a minute to do a KV lookup — Memcached or Redis for example should be single digit milliseconds or better.

senderista · on June 5, 2024

I don't understand why you need to maintain two separate keys: instead of generating a random key, why not just encrypt the auto-increment key using a secret key? This is the approach used by e.g. cloud providers that use auto-increment keys internally but don't want them to be guessable.

cogman10 · on June 5, 2024

I think the biggest problem with this approach is it effectively pins you to a encryption key and algorithm (unless you embed some information in the key that lets you version the key, gotta think of that upfront).

Imagine, for example, that you picked DES and "kangaroo" as the secret several years back. You are now pinned to an algorithm and key with known security problems and a weak key.

Too · on June 6, 2024

Especially since you can try the encryption algorithm.

Create 1000 users very quickly in a row and you now have 1000 samples of enc(i) all the way up to enc(i+1000) that can help you break the algorithm.

Not saying it will be easy but it surely lowers the guarantees given by the encryption. Someone better at crypto can probably quantify this risk better than me.

reissbaker · on June 6, 2024

Sure, that works too. The upside of just storing a random string is you don't ever need to deal with secret management, key rotation, ciphers getting broken, etc — the random string truly has nothing to do with the real underlying ID, so there's just less work involved in hiding it. The downside is some extra bytes and index space.

WorldMaker · on June 5, 2024

A different approach to solve both 1 and 2 is timestamp-oriented IDs. You can get useful cache locality/less "index bloat"/fast initial inserts if your keys can be easily ordered in time. Sorted by timestamp means very similar behavior to B-Tree appends of a monotonic integer, even sometimes in the worst cases where "same moment" IDs aren't monotonic and rely more on random entropy.

I got some great DB cache/index performance from ULIDs with a bit of work to order the ULID timestamp bits in the way the DB's 128-bit column "uuid" sort best supported.

Now that UUIDv7 is standardized we should hopefully see good out-of-the-box collation for UUIDv7 in databases sooner rather than later.

reissbaker · on June 6, 2024

Yep, that works too as long as the timestamp data isn't sensitive.

rav · on June 5, 2024

Instead of a random string ID, you can devise a fixed secret key and expose the auto-incremented ID xor the fixed secret key as the public-facing ID. This saves you the separate index but still avoids the German tank problem. But it gives you a new problem, namely a secret that's hard or impossible to rotate.

cogman10 · on June 5, 2024

This is insecure. Assuming the user can get a few key examples (which, we assume they would be able to if the german tank problem is a problem) then the secret can easily be revealed. [1]

[1] https://dev.to/wrongbyte/cryptography-basics-breaking-repeat...

senderista · on June 5, 2024

XOR isn't secure enough, but you're on the right track. Instead, use an actual block cipher.

groestl · on June 5, 2024

> A problem with this approach is it's not monotonical

Whether or not that's bad fully depends on your platform and the number of writes you do. If you're using a massively distributed database like Datastore, Spanner etc, you want random keys as to avoid hot spots for writes. They produce contention.

cogman10 · on June 5, 2024

Well, you'd still likely want psuedo-random keys. You'd rather not have the underlying database doing extra work to shuffle around records as the pages get jumbled.

One solution to that is having more complex keys. For example, in one of our more contentious tables the index includes an account id (32bit int) and then the id of the entity being inserted. This causes inserts for a given account to still be contiguous (resulting in less fragmentation) while not creating a writing hotspot since those writes are distributed across various clients.

groestl · on June 5, 2024

Not disagreeing. Point is, you need to know your domain, your technology, your write patterns, your downstream systems, etc to decide if a specific key scheme works to your advantage or not. All the more reason not to use natural keys, as they lock you in in that regard.

cogman10 · on June 5, 2024

Absolutely agree.

I don't know how you can successfully maintain or develop software without developing an understanding of the underlying domain. I've seen devs try that route and the quality of their work has never been high.

hot_gril · on June 5, 2024

Depends heavily on what kind of database/index. Going with anything other than a random uuid4 adds complexity; for one, do you want to expose time info? I'd rather default to uuid4 as the client-exposed ID* and only change if there's a solid measured reason to.

* not the same as your internal DB row primary keys, which in Postgres should usually be bigserial

cogman10 · on June 5, 2024

Yes but also no.

Yes, in that there are DB technologies not built in a fashion where records are stored in a sorted order of some fashion. No, in that they are very much not common technologies. Most databases, relational, non-relational, etc have some form of a B-Tree at their core somewhere.

hot_gril · on June 5, 2024

I can see this. Spanner is an example where you don't want this, idk if that's considered common enough. Postgres and MySQL both support hash indexes that are unordered, but the default in both is btree, and Postgres hash indexes used to have some caveats that made them unsuitable (idk about now) so I've gotten in the habit of just using the default.

MySQL docs claim that a hash index is much faster if you only need kv lookups, so it seems like uuid4 with hash index would be suitable. Never tried it though, and can't say whether it's faster than using a btree with uuid7. Seems like in theory it would be.

cogman10 · on June 5, 2024

As always, depends on the implementation.

Hash maps should generally always have faster lookups than Btree based structures. However, they'll have slower writes especially when contested. A key issue hash tables have to deal with is what happens when a remapping needs to happen. For example, when 2 keys have the same hash. In that case, locking becomes a lot more messy.

For a btree this is simpler. It's built to be able to handle reshuffling and rebalancing in a way that's semi easy to have fine grained locks around.

hot_gril · on June 5, 2024

Hm yeah, now I'm wondering how MySQL implements it. Consistent hashing is one way to limit the scope of a remap.

hot_gril · on June 7, 2024

(But it seems like an ordered index is fundamentally easier for write performance)

echelon · on June 5, 2024

This is a great technical modification that can be made to work with "Stripe"-alike IDs or tokens.

Another hack for advanced active-active situations where you may need to route events before replication completes: encoding the author shard / region in the lower order bytes.

There are lots of interesting primary key hacks for dealing with physical or algorithmic complications.

jordanthoms · on June 5, 2024

This is dependent on the database you are using - if it's a key-sharded distributed database, you want to have insertions evenly spread across the key space in order to avoid having all the inserts go into a single shard (which could overload it)

cogman10 · on June 5, 2024

This is the great thing about using random bits for the lower bits. Because you are unlikely to use more than say 2^64 database nodes, any sharding algorithm will have to figure out how to spread a key with 64 bits (or however many bits are in your key) across n nodes.

Because of the random portion of the key, that means you'll get good distribution so long as the distribution algorithm isn't something stupid like relying solely on the highest order bits.

cimnine · on June 5, 2024

Yes, and I like to combine two established concepts instead of rolling my own: URI and UUIDv7. So my IDs become `uri:customer_shortname:product_or_project_name:entity_type:uuid`. An example ID could be `uri:cust:super_duper_erp:invoice:018fe87b-b1fc-7b6f-a09c-74b9ef7f4196`. It's even possible to cascade such IDs, for example: `uri:cust:super_duper_erp:invoice:018fe87b-b1fc-7b6f-a09c-74b9ef7f4196:line_item:018fe882-43b2-77bb-8050-a1139303bb65`.

It's immediately clear, when I see an ID in a log somewhere or when a customer sends me an ID to debug something, to which customer, system and entity such an ID belongs.

UUIDv7 is monotonic, so it's nice for the database. Those IDs are not as 'human-readable' for the average Joe, but for me as an engineer it's a bliss.

Often I also encode ID's I retrieve from external systems this way: `uri:3rd_party_vendor:system_name:entity_type:external_id` (e.g. `uri:ycombinator:hackernews:item:40580549:comment:40582365` might refer to this comment).

michaelt · on June 5, 2024

> It's even possible to cascade such IDs, for example: `uri:cust:super_duper_erp:invoice:018fe87b-b1fc-7b6f-a09c-74b9ef7f4196:line_item:018fe882-43b2-77bb-8050-a1139303bb65`.

Let me guess - you're a Java developer, right?

cimnine · on June 6, 2024

Nope.

whyever · on June 5, 2024

Please don't use % to generate integers from a range, it's not uniform, which can be disastrous if you rely on your numbers not being predictable. You can use crypto.randomInt instead.

CodesInChaos · on June 5, 2024

That is definitely an improvement, but I find that concern a bit exaggerated.

If all you need is unpredictability, a minor bias is sub-optimal but not disastrous. The 256 % 62 bias used here should reduce the min-entropy per character by 5%. And you can easily minimize the bias by using larger integer than a single 8-bit byte. There few algorithms where minor biases cause a disaster, like DSA.

strnisa · on June 5, 2024

Thanks for all the improvement suggestions! Taking them into account, the `makeSlug` function becomes:

    function makeSlug(length: number): string {
        const alphabet = "0123456789abcdefghjkmnpqrstvwxyz";
        let result = "";
        for (let i = 0; i < length; i++) {
            result += alphabet[crypto.randomInt(alphabet.length)];
        }
        return result;
    }

inopinatus · on June 5, 2024

Having a zero in your alphabet can be problematic, because leading zeros are often stripped (e.g. Excel notoriously mangles phone numbers thinking they are integers).

Multiple calls to a randomness generator can be expensive, and a waste of entropy; production-scale random string generators should still respect this and ask for a block of bytes, then encode them, but with bias correction. You're off the hook in this case, I think node's implementation of randomInt is doing exactly that for you and conserving remaining entropy in a cache.

biztos · on June 5, 2024

I think that 0 and 1 are likely to cause problems when customers end up reading their "user ID" back to your employees in Customer Support Country over the phone.

"It's one-three-oh-dee-ee-el. Yes, I'm sure, EL as in elephant."

kstenerud · on June 5, 2024

I developed safe32 for this reason.

https://github.com/kstenerud/safe-encoding/blob/master/safe3...

Notably, confusable characters are interchangeable when being ingested (although a machine encoder MUST always produce canonical output). https://github.com/kstenerud/safe-encoding/blob/master/safe3...

So a user can confuse 1 for l, 0 for o, I for l, u for v, uppercase, lowercase etc, or the agent can say any of those over the phone, and it won't matter.

vanderZwan · on June 5, 2024

Oh that looks well-thought out and is probably the sanest way to solve this particular problem!

It's obvious why the safe64/safe80/safe85 cannot do this, but is there a reason why the safe16 version doesn't have the same features?

kstenerud · on June 5, 2024

Oh whoops that's an oversight! I'll fix that up tonight.

ss64 · on June 5, 2024

Another approach is using base31 with all vowels removed https://ss64.com/ps/syntax-base31.html

kstenerud · on June 5, 2024

You still have the problem of 1 vs l. And also it doesn't support user-error (reading 0 as O, or reading V as U).

what · on June 5, 2024

Isn’t this just crockford encoding?

kstenerud · on June 5, 2024

Every base-32 style encoding is Crockford at the core. The difference is in the alphabet, and also whether it requires padding or not (safe32 does not).

Crockford also incorporates error correction, which is unnecessary in modern systems since the underlying protocols do that already.

inopinatus · on June 5, 2024

My favourite ambiguous readback is to say "M for Movember".

I have a note from a few years ago that 367CDFGHJKMNPRTWX may be a sufficiently unambiguous alphabet. Drop the one you like the least (probably N) to obtain a faux hex encoding.

gvx · on June 5, 2024

Anything that needs to be read over the phone should probably be written out using something like the NATO phonetic alphabet, split into smaller chunks if needed: "The code? It's kilo eight niner; one three mike; delta echo lima."

8372049 · on June 5, 2024

Having come from a military background where using that is second nature, I'm constantly surprised how rarely I meet civilians who understand it effortlessly. When picking up a package I say "the code is Oscar Foxtrot three-fife" and you see the person processing for a long time to extract the first letter of the word. I've started saying "OF, that's Oscar Foxtrot, 3-5" to help them out.

In other words, asking a customer/consumer to be able to recite something in phonetics is not realistic in most cases.

Fortunately the code already takes this into consideration and removes ambiguous characters.

mannykannot · on June 5, 2024

My experience in the USA is that if I don't include the phrase "as in" (as in "X as in Xray") most people still will not realize what I am doing (the alternative "for" can be confused with the digit.)

I also ask them to check my readback of key information they have given me and vice-versa; usually that works well.

hprotagonist · on June 5, 2024

digikey phone personnel all speak NATO. it’s wonderful.

hobs · on June 5, 2024

First thing I drilled into Apple phone support folks.

gbacon · on June 7, 2024

*tree :-)

jagged-chisel · on June 5, 2024

One can convert those to the correct characters. Since “oh” and “el” are excluded from the alphabetic range, they become “zero” and “one” - deciding whether that is done in software for the help desk, or in the brain of the help desk staff is left as an exercise to management.

smaudet · on June 5, 2024

I wonder if Unicode could be used to alter the characters such that these mistakes would be less possible, e.g. using ⓪.

WorldMaker · on June 5, 2024

If you are trying for URL safe, Unicode is problematic because of Punycode conversions and differing browser behavior with Unicode URLs. (Some browsers always show Unicode as Unicode in URLs. Some browsers always show Unicode as Punycode in URLs. Some browsers switch between the two based on a huge number of variables such as gTLD, user preference, phase of the moon, etc.)

fhars · on June 5, 2024

That is "AT", isn't it?

(No, it isn't, if you look closely enough.)

Piskvorrr · on June 5, 2024

It's obviously a �. Or perhaps a □ . Maybe an Â¾, on odd Wednesdays?

(Unicode has its strengths. Making up replacement characters isn't one.)

codr7 · on June 5, 2024

I happen to know that the biggest ski resort reservation system in Scandinavia contains a function called MaybeOnATuesday(), but to my knowledge it's never called.

smaudet · on June 5, 2024

Yeah...what I'd really like to do would be to give a character a "natural" background color, e.g.

Then its simple for support to say "red is one, green is ell". But you can't just add a color to a character, because copy paste/rich formatting don't work everywhere, or even transfer well...

Alternatively, if you use ⓪ and ⒈ it matters less if the user says "at" or "zero", and more that they didn't say "oh" or "one".

jgalt212 · on June 5, 2024

Or just drop the L and O, like CUSIP does.

8372049 · on June 5, 2024

i, L and o are left out of the alphabet in the snippet, so there's not really any ambiguity.

kodisha · on June 5, 2024

Or, just use cuid2 [1] + prefix.

[1] https://github.com/paralleldrive/cuid2

fnordsensei · on June 5, 2024

One thing I’ve been looking for in an ID generator is a way to supply a blocklist. There are a number of character combinations I’d like to avoid in IDs, because they might be offensive or get stuck in filters when copy-pasted (e.g. in a URI).

This can be solved in user space by regenerating if the character sequences are detected, but this a) skews the distribution, and b) potentially takes time, especially when the ID generator is made to not be “too fast”. I want to generate a single ID that passes the blocklist in a timeframe that is not too fast, if that makes sense.

Is there an ID generator that takes this into consideration?

ceejayoz · on June 5, 2024

Just take out the vowels and numbers that can look like vowels. Nixing 0 means no b00bs IDs, and avoids 0/O; I usually take out 1/I as well.

fnordsensei · on June 5, 2024

That goes some of the way, but I can think of a few problematic sequences that are only consonants and/or numbers.

ceejayoz · on June 5, 2024

I like to nix vowels and things that look like them, i.e. 0, to avoid random b00bs sort of tokens.

madcaptenor · on June 5, 2024

Sure, but I can see people still getting offended if "fck" showed up.

Tangurena2 · on June 5, 2024

This is why an open source project is now "CK Editor". It was the author's initials, but too many people saw an extra vowel in the name of the project.

https://en.wikipedia.org/wiki/CKEditor

travisgriggs · on June 5, 2024

Let’s not get sexist here! It also avoids random d1ck and c0ck sorts of tokens.

garblegarble · on June 5, 2024

This approach will randomly generate profanity. If the ID is visible to users it can cause some to get upset (and best-case simply looks unprofessional). On a purely technical level if visible in URLs it can cause links to be blocked/altered by e-mail filters / filtering proxies.

It's generally a good idea to drop vowels for this reason.

Knufferlbert · on June 5, 2024

We once had a rather angry Irish customer calling our support complaining that we called him a pikey (slur for gypsy). After some back and forth it turns out we just gave him an apikey.

We never had a similar issue with our random numbers/letters/reset passwords or anything like that which don't have any kind of "dont return profanity" protections. Though I agree, someone getting a randomly generated customer portal url or something containing fuck or similar would look bad. Our cloudfront or something (or was it main public facing s3 bucket? can't remember) starts with "gay" and was never picked up on.

biztos · on June 5, 2024

Totally off topic, but "gypsy" is itself a slur for Romani people in much of the world.

https://en.wikipedia.org/wiki/Romani_people

isaacremuant · on June 5, 2024

I'd say those articles show Wikipedia's political bias and a tendency to overly politically correct instead of portraying reality.

Many in real life do refer to themselves as gitano or Gypsy and would ask for others to refer to them as such.

Of course it's very easy to find an article saying otherwise and then using that as the end of discussion for Wikipedia editors.

coooolbear · on June 5, 2024

By ‘find an article’ you mean find ~10 real citations including the resolution of an authority a long time ago and to tell the reader it is not clear or definitive?

Better to be careful and let any individuals or communities tell you what they want. I have Roma connections in my family and at one point the word we’d use is ‘gypsy’. But, because I’m not Roma myself, if I came across some other group I wouldn’t assume I’m just allowed to say it to them.

vsuperpower2020 · on June 5, 2024

I don't care what they want. A lot of people are tired of playing these language games.

bigstrat2003 · on June 5, 2024

Also, "do what people want" is fine for your interactions with an individual. But it's not a viable general rule for language, where we need one single approach. I think saying gypsy unless someone personally tells you they would rather you don't call them a gypsy is perfectly reasonable.

coooolbear · on June 6, 2024

Everybody, in fact, takes innumerable social parameters into consideration when you say anything, especially with strangers.

For the sake of mass communication where you can’t really know your receiver, you have to do your best to just communicate whatever you need to (i.e. ‘a single approach’). Choosing to use a word that is ambiguous as to whether it is a slur is a bit unwise. I think it is probably unwise to do the same in personal interactions.

coooolbear · on June 6, 2024

Then “a lot of people” (lol) shouldn’t complain when they piss someone off when someone already warned them

isaacremuant · on June 6, 2024

What you're saying very well seems like a threat. A threat of violence for speech.

That attitude defaults to better at violence in a particular context gets to impose their will. Or whoever has the security forces to back them out of an inferiority situation.

You miss the part when I can arbitrarily warn you about a lot of things myself and then use any interpretation of rule breaking on your part to attack you.

I know it may sound harsh but this is where many end up going so let's make it explicit.

isaacremuant · on June 6, 2024

Your first paragraph missed the point. Your second is how you deal with it. I've just told you I had a different experience. Your experience doesn't supersede mine. Your "be careful" (or else) doesn't sit well with people who don't like to be threatened.

mst · on June 5, 2024

It's in the annoying category where it can be used as a slur but also gets used as not-a-slur, including but not limited to by the people it describes.

Locally (north west england) people generally use "traveller" as a description ... but there are definitely people who use that as a slur.

Language be like that sometimes.

fregonics · on June 5, 2024

True! For instance the example itself. "cus" is profanity in Portuguese. If you were to localize the application, this would be a factor.

account42 · on June 5, 2024

fvck cvnt b1tch

ThePowerOfFuet · on June 5, 2024

I hear Scunthorpe is lovely this time of year.

brazzy · on June 5, 2024

I hear someone got buttbuttinated there recently.

bigstrat2003 · on June 5, 2024

At first I thought that this was the result of my cloud-to-butt extension, and I was trying to figure out what "cloudcloudinated" meant.

Akronymus · on June 6, 2024

Why would you need a butt-to-butt extension?

mst · on June 5, 2024

A problem we've been having since the medireview age.

theophaniel · on June 5, 2024

I feel like https://github.com/jetify-com/typeid is the solution to this

vincentdm · on June 5, 2024

I don't like the prefix idea: besides the duplication of information, it also becomes a liability if you ever rename things.

Imagine you prefix all customer IDs with `cus_`, but at some point decide to rename Customer to Organization in your codebase (e.g. because it turns out some of the entities you are storing are not actually customers). Now you have some legacy prefix that cannot be changed, is permanently out of sync with the code and will confuse every new developer.

codeulike · on June 5, 2024

I wouldn't worry about that - I still think its worth it. I've had systems which during development we thought the Contact page was going to be called 'Contact' in the UI but at the end it got re-labelled to 'Individual' but in all the code it was still called Contact and the IDs all started with a C - but you know what? It was still useful to look at an ID, see the C and know that it was an Individual.

shawabawa3 · on June 5, 2024

> Now you have some legacy prefix that cannot be changed

Yes you can.

You can support the old cus_<ID> prefix as well as the new org_<ID> prefix, but always return org_<ID> from now on

wongarsu · on June 5, 2024

Reddit prefixes their IDs with t1_ for comments, t2_ for accounts, etc. That sidesteps the renaming issue.

Though I believe they mostly do it because their IDs are sequential, so without prefix you wouldn't easily notice if you use the wrong kind of id. They also only apply prefixes at the api boundary when they base36 encode the IDs, the database stores integers

chuckadams · on June 5, 2024

I have that exact issue with a couple different identifiers, and it's not a big deal. Usually it goes along with some data model change you already have to write compatibility code for, the new and old names tend to be related, and the old name tends to stick around other parts of the code anyway. Opaque IDs don't reduce the confusion there, documentation in appropriate places does.

Merad · on June 6, 2024

I'd recommend using Crockford base 32 [0] to encode the bytes. It makes the text more human friendly by eliminating case sensitivity, removing similar/ambiguous letters, and preventing accidental profanity.

And in most cases I think you're also better off just using a uuid and encoding its bytes as base 32, in which case you're basically doing type id's [1]. If the "slug" portion of the id actually encodes a uuid, then it gives you the option to store it in your database using an appropriate uuid type. This will make your database much happier than using long string PK's.

0: https://www.crockford.com/base32.html

1: https://github.com/jetify-com/typeid

giancarlostoro · on June 5, 2024

If I'm going to do that I think I'd use Bitcoins BASE58 which avoids letters that could be confused for each other. The number of times I see an O and 0 and wonder which is which, because the font does not make it clear really annoys me.

Edit:

Other honorable mentions: ObjectID's as used by MongoDB which contain the creation timestamp. Also Discord's snowflakes (inspired by Twitter's iirc), which also contain the creation timestamp.

kibwen · on June 5, 2024

If you want random IDs to be human-readable (and human-communicatable), I'd just recommend base 32 or even base16. You don't actually save that many bytes from base58 or base64 when it comes to short IDs.

Case in point, the parent poster's base64 ID is 14 characters long. When encoded as base32 that's still only 17 characters (or 19 in base16), and now you have completely gotten rid of all notion of casing, which is annoying to communicate verbally.

giancarlostoro · on June 5, 2024

I think the more I think of it, the more I favor the Discord snowflake ID's because they're just integers, and they can be generated on the fly. I think new messages generated them on the client if I'm not mistaken.

psychoslave · on June 5, 2024

To my mind, it always felt so saddening that adoption of a truly straightforwardly readable notation for numbers never took of. I mean it’s so easy to do. You can start for example with a single syllable per digit, and for example only target CV syllables.

From this there is many possibilities, but for example, let’s consider only a base ten. Starting with vowels order o, i, e, a, u with mnemonic o, i graphically close to 0, 1 and then cyclically continue the reverse order in alphabet (<-a, <-e, <-i, <-o*, |-u). We now only need two consonants for the two series of 5 cardinals in our base ten, let’s say k and n.

So in a quick and dirty ruby implementation that could be something like:

    $digits = %w{k n}.product(%w{o i e a u}).map{it.join('')}
    def euphonize(number) = number.to_s.split('').map{$digits[it.to_i]}.join('-')
    euphonize(1234567890) # => "ki-ke-ka-ku-no-ni-ne-na-nu-ko"

That’s just one simple example of course, there are plenty of other options in the same vein. It’s easy to create "syllabo-digit" sets for larger bases just adding more consonants, go with some CVC or even up to C₀C₁VC₀C₁ if sets for C₀ and C₁ are carefully picked.

jmcphers · on June 5, 2024

There is a old system for making numbers pronounceable as words using a mapping from each number to a consonant value. It's typically used to help memorize numbers:

https://en.wikipedia.org/wiki/Mnemonic_major_system

samatman · on June 5, 2024

This is how the Urbit address scheme works, btw: https://urbit.org/blog/the-urbit-address-space

psychoslave · on June 5, 2024

And of course a naive implementation of the reverse is also trivial:

   def numerize(euphonism) = euphonism.split(?-).map{$digits.find_index(it)}.map{it.to_s}.join.to_i

phkahler · on June 5, 2024

Something like that should have a built in check "digit" if people are going to see it and possibly type it in.

For numeric values, making them all a multiple of 11 is a simple way to catch all single digit errors or single transpositions.

chuckadams · on June 5, 2024

That's why VINs make a decent natural key, because they do have a check digit. Plus they're not opaque: if you look up the VIN and the make/model/year is completely different than the car in front of you, you know you either have the wrong VIN or the wrong car.

bambax · on June 5, 2024

Nice, but better to remove ambiguous / hard to read letters: ijlo, IJLO, and 01 (and maybe 7 as well?)

metafunctor · on June 11, 2024

Type safe and unguessable IDs are what I've been using in my projects for the past, oh, 10 years maybe? Inspired by Stripe!

In my databases, I often prefer integer primary keys for performance reasons. On the other hand, I don't want to expose my primary keys because they are easy to guess.

Recently I've been playing with Rust, and ended up publishing a library to encrypt IDs in they way I like:

https://crates.io/crates/cryptid-rs

scrollaway · on June 5, 2024

I don’t recommend using random bytes for this. What I have done in my previous projects is take a uuid6, reserve a couple of bytes inside of it to replace with an object ID, and I convert that to obj_XXXXXXX.

This means you can store them in the db not as a string but as a uuid, which is a lot more performant. You also get time stamping for free.

inopinatus · on June 5, 2024

I'm not convinced that all Stripe IDs are wholly random strings. I just decoded the base62 part of four genuine "acct_" objects, which at 16 characters are just shy of representing a 12 byte value (log2 62^16 =~ 95.3), and they all have a leading byte of "00000011" and two of them even have a leading 32 bits that is very suspiciously close to an epoch timestamp of a couple of years ago.

There is a similarly suspicious pattern in some of their longer identifiers, invoices for example at 24 characters (ca.143 bits) all seem to have a first byte of "00000010".

Even in the article you've linked to, look closely at the IDs:

      pi_3LKQhvGUcADgqoEM3bh6pslE
      pm_1LaXpKGUcADgqoEMl0Cx0Ygg
     cus_1KrJdMGUcADgqoEM
    card_1LaRQ7GUcADgqoEMV11wEUxU

Notice the consistently leading low-integer values (a 3, then three 1s)? and how it's almost always followed by a K or an L? That isn't random. In typical base62 encoding of an octet string, that means the first five or six bits are zero and the next few bits have integer adjacency as well. It also looks like part of the customer ID (substring here, "GUcADgqoEM", which is close to a 64-bit value) is embedded inside all of the other IDs and then followed by 8 base62 characters, which might correspond to 48 bits of actual randomness (this is still plenty, of course).

Based on these values it seems there's a metadata preamble in the upper bits of the supposedly "random" value, and it's quite possible that some have an embedded timestamp, possibly timeshifted, and a customer reference, as well as a random part, and who knows maybe there's a check digit as well.

It's possible - albeit this is not analytical but more of a guess - that the customer ID includes an epoch-ish timestamp followed by randomness or (worst case, left field) is actually a sequence ID that's been encrypted with a 64-bit block cipher and 32 bits of timestamp as the salt, or something similar (pro tip: don't try that at home).

My view is that either Stripe's engineering blog is being disingenuous with the truth of their ID format, or they're using a really broken random value generator. If the latter, I hope it's only in scope of their test/example data.

wslwsl · on June 5, 2024

In the comments it's mentioned, that the IDs contain a shard key for faster lookups.

https://dev.to/stripe/designing-apis-for-humans-object-ids-3...

dividuum · on June 5, 2024

I took a look at a bunch of stripe customer ids I have stored and at least mine look very random on first glance. I assume their blog post uses demo keys or something similar.

codeulike · on June 5, 2024

Except the Stripe ones are case sensitive which can be annoying with some databases

datavirtue · on June 5, 2024

You know what's annoying? People storing GUIDs in a case-insensitive database.

codeulike · on June 5, 2024

Classic RFC9562 GUIDs were often represented as hex like {550e8400-e29b-41d4-a716-446655440000} and so were case insensitive

ssahoo · on June 6, 2024

UUIDv7's premise to solve this, isn't it? it's 32 chars without dashes.

mettamage · on June 5, 2024

Why is such a thing called a slug?

inopinatus · on June 5, 2024

It's an old typesetting term that found its way into content management systems. https://archive.nytimes.com/www.nytimes.com/times-insider/20...

corobo · on June 5, 2024

Ooh I had this wonder a while back and jotted it down just in case anyone else ever wondered about it:

Comes from the ye olde paper-based blogs they call newspapers. When an article is being put together it’s given a short name, sort of like a project name. This name would remain the same throughout the article’s life - from reporter through to editor - it left its trail through the process. Like a slug.

meindnoch · on June 5, 2024

That's not what Wikipedia says about its etymology though.

>The origin of the term slug derives from the days of hot-metal printing, when printers set type by hand in a small form called a stick. Later huge Linotype machines turned molten lead into casts of letters, lines, sentences and paragraphs. A line of lead in both eras was known as a slug.

corobo · on June 5, 2024

Ooh interesting. It looks like I have either misinterpreted or found a source that misinterpreted (it was a few years back, unsure if I came to the conclusion or found it). I'll have to update my notes, cheers!

Apologies for the wetbrain hallucination, HN!

8372049 · on June 5, 2024

I'm not sure you're hallucinating. The dictionary I checked lists the printing and journalism terms separately. It's quite possible they have diverging etymologies, meaning both can be correct:

5. Print. a. a thick strip of type metal less than type-high. b. such a strip containing a type-high number or other character for temporary use. c. a line of type in one piece, as produced by a Linotype. 8. Journalism. a. a short phrase or title used to indicate the story content of a piece of copy. b. the line of type carrying this information.

hprotagonist · on June 5, 2024

or the journalism term itself diverged from the typographic one.

corobo · on June 5, 2024

Aye this is what it seems to be having double checked the reply's claim

Got to the Wikipedia page https://en.wikipedia.org/wiki/Slug_(publishing) which could possibly support the slimy conclusion of "it's a trail through the process" but that article has an etymology section that refers to the metal slug

I guess it could mean both depending on whether you're looking for the meaning of the word or the meaning of the concept but I didn't find any other slimy grub references (via an admittedly limited double check)

hprotagonist · on June 5, 2024

another fun etymological rabbit hole for you: stereotype and cliché both probably originated as typographer jargon.

longerd2 · on June 7, 2024

How is this readable?