johor day 2

  • hotel beds feel so nice and comfy that it's so hard to leave them
  • reread a bunch of Ava posts from yesterday and took some notes on the car
  • she's like a big sister that's imparting a ton of life advice and I feel lucky to have access to all these

on getting things done

  • stick to a schedule. your schedule needs to be realistic: it's not what the ideal you would do, but what you actually think you'll be able to do
  • touch everything once. run for 5 minutes. write one sentence. read one page. the most important thing is just touching it.
  • resistance is the problem to getting most things done, not time. do every task as soon as you think of it. procrastination can ruin your life. spend less time thinking about doing things and more time doing them.
  • have more constraints, it makes you value limited time more.
  • slice tasks into chunks of time
  • manage your stress. don't let it affect your sleep. work out. take walks. don't overthink about worst case unless there's something you can do about it.
  • no notifications, they're poison for your mind
  • take notes constantly: use iPhone notes to jot down whenever you remember something or have a good thought
  • multitasking is okay sometimes. there's two modes everyone operates in: 1) total focus and 2) getting things done while also doing other things. you can do #2 on certain things to save time.

the right conversations

  • in literature, there's the idea of an abyss between two people: no matter how close you are to someone, they remain a wilderness to you
  • "how well is it possible to understand someone else?"
  • we all long for moments of pure recognition, the sense that someone else gets us
  • we seek commonality on an emotional and intellectual level, and it's hard to find someone that fills both gaps
  • in a relationship that isn't working, it's like a boat dragging dead weight – you're the only one providing momentum and direction
  • or the other way around – they're the one driving and you're just along for the ride, the other person's choices don't reflect your sensibilities
  • a right conversation feels like you both are side by side, yelling back and forth about which direction to turn. a mix of intuition and negotiation. both sides have influence on each other
  • there is no comfort greater than being on the same page, talking about the same things
  • the combination of understanding and agreement is the rarest thing to find in a relationship
  • too many differences will leave you going in circles, it'll wear you down
  • disagreements between two people are inevitable, but how you navigate them marks the difference between a passage and an impasse
  • when you meet people who are different, in what they say and how they behave, don't give up. notice that difference and get better at sharing, try harder at explaining, evangelize for your view of the world, convert them into a believer

on making and keeping friends

  • finding really good friends is more important than almost anything else you can be doing
  • friendship takes a lot of work. you have to be ready for friction and willing to work to resolve it
  • having good friends is enabled by a set of learnable skills – how to find people you like, how to put yourself out there, how to listen, how to make space, how to propose fun things to do
  • most successful friendships stem from shared preoccupations (technology, hobbies) and shared context (core life experience like same university)
  • good friends can see your potential before you can, nihilism results from a lack of people who really see you
  • things that work for her
    • carve out time for friends, whether calls or in-person hangouts
    • live in the city where the largest number of your good friends are
    • find 3-5 people you can be close friends with for many years, and hang out with their friends, and try to align your life choices around them
    • go to a lot of things, parties, dinners. maximize serendipity
    • meet people through hobbies, get really into running or something
    • ride the coattails of your one really social friend
    • be friends with people you can genuinely praise to everyone, you should be super proud of your friendships
    • meet people through Twitter, tweet like your life depends on it and enjoy the returns
    • write in public to meet people who like how you think, think in public
    • develop an internal friend clock that tells you when to reach out to a friend "it's been 3 weeks since you saw X, you should text them"
    • proactively ask about your friends' lives. many people have trouble talking about stuff that's troubling them, it's on you to ask
    • be willing to take feedback, be willing to apologize
  • you can't "win" an argument, you can only lose the relationship
  • it's okay to let go of a friendship that isn't working anymore
  • your mistakes can and will damage the relationship; you cannot start over, but your mistakes can also be forgiven

begins to regret it as I feel intensely carsick, shorter notes now

  • closeness
    • how to quantify what makes it possible for you to feel close to someone?
    • closeness has a few elements: proximity, both physical and emotional, as well as mutual empathy, shared context and experience
    • that Joan Didion never had a thought without saying it aloud to her husband is aspirational
    • a desire to share consciousness
    • the appeal of extreme closeness: "I find it both thrilling and exhausting to be in my own mind and sometimes I want to share the burden of it. I want you to know everything, to share the inexhaustible accumulation of experience with me"
    • the ability to really understand someone is often the product of innate compatibility and similarity
  • knowing what i like
    • many people think they know what they want, but they're pursuing the wrong things.
    • when you get something that you've desired for a long time and it feels hollow, it means you've been chasing the wrong thing.
    • there are things in life that feel good to get
    • when what you get is what you want, it means you have taste, and it takes a lot of work to get there
  • loving imperfection
    • if someone is externally perfect, they usually have all sorts of terrible wounds.
    • if you love someone you are happy to pay the price, people assume love is about celebrating someone's amazing qualities, but true love is about accepting someone's flaws
    • everyone is imperfect, and we just have to choose the imperfections we can love
    • the goal is to find a tradeoff you more than tolerate – someone who makes you think, "you're so imperfect, and I'm so lucky, I can't believe I get to spend my life with you"
    • we have to look for the trauma that slots into our trauma, the imperfection that moves us
  • love in the time of hyperfixation
    • we don't get to choose which fixation lasts and which won't
    • the things we stick with are just the things we repeatedly fail to give up
    • the entirety of what I know about love: "in the beginning it felt like I couldn't control myself. And then it felt like I could control myself, but I still wanted to continue"
  • fewer, better thoughts: making the right decision involves paying careful attention to the state of the world around you, and the state of your own internal landscape. watch people and watch yourself. what makes you feel good, what frightens you, what aches and why.
  • 17 hot takes about dating
    • getting into a relationship is like buying a car; being in the relationship is like driving it. Don't spend two years trying to buy the car if you don't even know if you'll like driving it
    • a great relationship is one in which both people are like "Wow I got so lucky". an okay one is one in which only one person feels that way.
  • two years: consistency creates inspiration (if you know you have to write everyday, you'll think of something to write about)
  • what makes me feel grounded: spending time with the people you love amplifies your "youness"
  • practical magic: on rejection and secure attachment
    • ask for things at the exact edge of rejection. if you're never rejected, you're too risk-averse
    • be comfortable with rejection and with failure
    • the willingness to be pushy and ask for things that aren't directly offered to you
    • always go to the source. It’s important to know whose thoughts you’re referencing, and why. And whose thoughts they’re referencing.
    • double down on your talents
    • get good at building a strong feedback loop between "I perceive X" and "I've confirmed that X is in fact true". you'll start to trust yourself more. accurate self trust is more important than anything else

all my relatives keep asking me to take care of health and that's what matters the most

and that now that I'm relatively healthier there's less worry, and that I've gone through it before and I'll be fine.

I do think i'll enjoy life there a lot more than staying here.

The only thing stressing me out the most, the thing that really strikes fear deep into my core, is the pressure to find a good job that pays well (where I'll enjoy the work) to make my parents proud and to pay them back for the dent that I'll be making in my dad's bank account. the money that was hard-earned from going to the office everyday, taking multiple calls every hour.

But I'm sure I'll make it, either way. I have to trust in God more. Stay grounded, act according to my core values, surround myself with the right people, and take more risks. I just have to keep doing what I'm doing. And actually implement the advice that I've been consuming the past year or so in these blogs.

It feels like the week before the war, and you're on the frontline. you know you're subjecting yourself to a ton of challenges and discomfort and obstacles, and you constantly feel under-prepared and anxious about what's to come. but you still have to march forward. it's like you're trapped somewhere in a sinking ship, the water is rising rapidly around you, you're floating and weightless, and air is running out, and you're about to take a deep breath before you dive under and look for an escape. you have to hold your breath, but you're not sure for how long. You just have to keep holding on.



johor day 1

  • woke up feeling anxious and sad that I'll be alone in a room and my parents and sister won't be calling me to get out of bed
  • saw primary school students on their field trip, I feel so old
  • yong peng fishball noodles
  • she spends her days and maybe entire life farming, and going to restaurants to sell her produce, and she's satisfied with life. she mentions her friend's durian is for sale. her face lights up with a big smile, her skin tanned yet healthy. her body strong and fit. isn't she the lucky one? what more does she seek? her life is simple yet fulfilling
  • most people don't get to choose. they grow up in a small town and work as a waitress in a fishball noodle restaurant, and might never get the chance to even leave the town, country, to see and experience what I have experienced. what is life to these people? i'm worrying about not getting a high paying job and not passing FAANG interviews while they probably have to worry about whether they can save enough to sustain themselves, their parents and for their own future. perspective is important. step outside of the bubble more
  • a few notes from having meals with relatives in their 60s
    • they repeat things, at their age they talk about the same things a few times, almost like they forgot that they already mentioned it earlier
    • they like to talk about travelling experience, where they went and where we went, at this age they love travelling because time is limited
    • they talk about their own kids (marriage, life, career)
    • they talk about how high the cost of living in singapore is
    • they will say you look like <INSERT RELATIVE NAME> a few times
    • they love asking questions about your life, they're very curious about young people's thoughts on things too
    • the uncles in IT will ask you about AI and ask you to share
    • they surprisingly know how much other relative's kids earn per month, that's what parents flex to other parents I suppose
    • they'll mention how times have changed, how young people today are different, e.g. living together but not married, LGBTQ
    • they'll mention a few times that health is what matters the most
    • they'll give advice to you, the uncle mentioned travel and reading books are the two important things for career
  • i think I need to talk to old people more
  • a lot of singaporeans came to Johor because it's a holiday, and everything is 3.48x cheaper
  • overheard a group of young singaporeans (probably my age) in a restaurant talk about career and it made me worry about my own
  • reminded of Steve Jobs's quote on career, "The enemy of most dreams and intuition, and one of the most dangerous and stifling concepts ever invented by humans, is the 'Career'"
  • but realizing that as an international student, I might not have the luxury to think about what I love doing until I get a stable job in the US, so the idea of doing what you love is a privilege not accessible to everyone
  • read a total of 13 posts by Ava while shopping with mom, it feels like I'm having conversations with her. it would be so cool to meet her in SF.
  • I always used to wait outside while mom shopped for things, but actually shopping with her is a pleasant experience, I found a nice jacket at brands outlet
  • i've been buying clothes at uniqlo and starting to feel guilty about buying nice clothes using my dad's money while he wears the same old clothes he's owned for years
  • growing up and getting old is realizing your parents are people with their own issues and problems and insecurities and flaws and quirks, and not expecting them to know everything about the world and about you, and noticing they're getting slower and more fragile and more accepting of everything.
  • i remember watching a video of a son bringing his parents to japan, booking all michelin restaurants months prior, growing up is also treating your parents to nice food and experiences once you're capable, paying back for all the sacrifices they've made for you.

Writing Tips For Work

Ryan shares three writing tactics that help you be more effective at work

  1. Make the first line interesting
    • show them why they should care immediately.
    • you have to know your audience well to do this.
    • examples
      • (Sharing launch results): share the metric movements that the audience cares about.
      • (Asking for collaboration): highlight the expected outcomes from the work that are most important to the audience.
      • (Raising awareness for a problem): explain the severity of the problem in a way the audience understands.
  2. make your ask clear
    • either share it first thing
    • or at the end after providing the relevant context for them to understand your ask
    • Before:

      Service X is broken. I think it's because of the push. Could you help me revert a diff? This change went out at Y time and I can see a clear drop in the success rate just after that. I'm not sure though still and want a second opinion

    • After:

      Prod is down, can you help me revert a diff?

      Service X is broken. I think it's because of the push. This change went out at Y time and I can see a clear drop in the success rate just after that. I'm not sure though still and want a second opinion

  3. write simply
    • the simpler the writing, the more your idea will get through to the audience
    • if you can remove a word and preserve the meaning, do it
    • if you can replace a complex word with a simple one, do it

these were for short writing, he shared more tactics for longer writing, e.g. planning docs, announcements, design docs, and more.

The main ideas are to include a tl;dr and format the writing to optimize for skimming (people skim in an F-shaped pattern)

  • add a tl;dr: put what your audience must know in a few lines, the most important takeaways
  • add a table of contents: for longer posts, let people jump to what they care about
  • add section headings: gives people a sense of how relevant a section is
  • use bullets and lists: much easier to skim than paragraphs, the spacing separates ideas clearly, delivering the key points to readers
  • bold the main points: draw attention to what matters most, use your judgement and use appropriately
  • break up long paragraphs: add line breaks where it makes sense, don't let paragraphs get too long

more tips in this article by Nielsen Norman Group: 5 Formatting Techniques for Long-Form Content

Lastly, you get better at writing by doing and by asking for feedback from people who write well.

Write, reflect on feedback, iterate.



Naval's advice to his younger self

if Naval could say something to his younger self, this is what he would say:

chill out

don't stress so much

live in the moment

everything will be fine

be more of yourself

don't try to do what you think society wants or needs

don't try to live up to other people's expectations

self actualize

say no to more things

protect your time because it's very precious.

on your dying day you will give everything you have for another day.

the discount rate / marginal value of that extra day goes higher as you get older.

less fear, more love

love people more

everyone wants to be loved, everyone needs to be deeply loved

it's not something you can buy, no amount of money can give you true unconditional love

you can give love, it's free to give

you don't necessarily get it, but if you just get in the mindset of "I'm just going to give it", eventually, on a long enough time scale, you get what you deserve

just work on yourself and be ready, and then good things will happen

"when the student is ready, the master appears"



chatting with two DS managers

I talked to two DS managers over the past two days, one at C3.AI, an enterprise AI company, and the other at BlockFi, a crypto exchange company.

I've been asking questions like:

  • what kinds of projects are you working on?
  • what is the culture like?
  • interview process?
  • what made you hire your last team member?
  • what specific skills do you look for?
  • what would impress you about a new-grad candidate?
  • what kinds of projects would impress you?
  • what do you enjoy about company X?

Here are a few things I took away from the chats (from my goldfish memory)

  • it's difficult to hire gen AI talent, people's experience is not directly relevant since it's a new field, mostly toy projects and nothing substantial
  • a diverse team should have research people and builders
  • technical experience is the baseline, if you can't pass the python, sql, stats and prob, and ML basics, don't even think about getting into FAANG (side note: paraphrasing here, but this made me slightly stressed about interviews even though i haven't started my program yet and I'm still in Malaysia)
  • in your internship, if you're not doing what you like, create your own projects if you can, focus on delivering results that you can show for your next job (e.g. if you're doing analytics, but you want to do ML)
  • first job should matter, it's the foundation for career, but don't be afraid to experiment and explore if you don't know what your niche is
  • two step process for DS job / interview
    • step 1: find a fit, there's ML core, product, full stack DS, etc., customize your resume for the role, further customize for company if you really like the company (highlight experiences and projects)
    • step 2: skills are python (up to lc medium), sql (do all LC Qs), get a cheat sheet for stats and prob, they like asking about hypothesis testing and p-value, ML algorithms like bagging and boosting, and specific domains (NLP, CV) study the algorithms (transformers, CNNs, etc.)
  • highlight your contribution in your work experience, and if it lacks technical depth (e.g. you worked on analytics but you want to pivot to ML), use projects and focus on technical aspects
  • there's offense vs defense analytics, offense is about growing the product, defense is fraud and risk
  • lots of candidates can't pass python medium questions
  • for projects like forecasting, accuracy doesn't have to be the main goal. instead of using more advanced methods, what's more important is using a simple solution and being able to explain how you obtained that number; how well you're able to tell a story and having evidence and data to back it up.
  • junior DS struggle initially with red-tape involved in working at a company, lack of documentation and having to ask around for help, and spending at least 6 months understanding the data
  • for resume projects, focus on telling a story, talk about why you chose this project, what you learned about the data, why you chose that methodology, have a short executive summary, include charts and graphs. less on the technical, more on the result



Developing Technical Taste

Sam Schillace is the founder of six startups, including Writely, which was acquired by Google and became Google Docs. He served as the VP of engineering for Google Maps. He is currently the CVP and Deputy CTO at Microsoft.

I love the way Sam talks, his words flow smoothly and you can easily capture what he's saying. Hearing him talk about the creation of Google Docs, his thoughts on AI, and his advice for future software engineers was so compelling, I had to capture some of his ideas down.

Below are notes from his chat on First Round Capital, Developing technical taste: A guide for next-gen engineers

lesson on market timing

  • the best time to act on an idea is when it feels uncomfortable and challenging because it's still early
  • the things that are worth doing are really uncomfortable
  • JS was buggy and cloud infra was nonexistent when sam was developing Gdocs
  • don't wait till things get easier, by then you lose the first mover advantage and there's more competition
  • engage with difficult problems early, criticisms on AI today are engineering problems to be solved

What is technical taste?

  • a "bag of heuristics" you've collected over time to make accurate technical judgements on new things
  • how? look at your past experiences and develop a sense of what works and what doesn't
  • it's a clearly articulated set of principles that's iterative, e.g. at SpaceX, they know what they're optimizing for, the cost to get a pound into orbit, chipping away at all the things.
  • it's good judgement about how they approach technical problems and decide what to optimize for and what to ignore, and how the tech is going to work, and not distracted by all kinds of extraneous stuff.
  • use your technical taste and fail early rather than late, this way you can course correct and iterate
  • don't be a pessimist, be optimistic and make a lot of mistakes

ask "What If" questions

  • have a "what if" mentality, exploring possibilities rather than being defensive and focusing on "why not" (why this or that wouldn't work)
  • a healthy team embraces new ideas and experiments, engages in constructive arguments on potential solutions, and collaborates openly
  • ask "what if we tried this?" more

advice for future engineers

  • cultivate playfulness and curiosity
  • you get the most value from messing around, playing with things, experimenting and following your nose and doing stuff that you don't know why you're doing it, because it's just interesting to you
  • focus on breaking complex problems into simpler parts, solve the minimal thing
  • don't be attached to technical skills, emphasize problem-solving skills
  • learn quickly and adapt to change

building a winning product

  • prioritize user convenience and speed over feature completeness
  • continuously refine the product by listening and talking to customers
  • avoid feature wars, focus on your unique value prop rather than matching competitor features



How To Make Millions Before Grandma Dies

watched this movie today with my mom at Wangsa Walk, which brings back memories because I used to come here with my mom back in high school.

  • pre-movie ads were a lot of thai ghost movies, who watches these??
  • qingming with relatives is something I have never experienced, or I just don't remember it.
  • the thai was bothering and confusing me a little bit at first, I'm not usually exposed to thai at all, it sounds so alien on normal human faces, I wished they spoke chinese or english lol
  • such a waste of food, flowers I can understand
  • Did M drop out during high school or college? I'm guessing he's high school age but not sure
  • gaming on smartphones is such an ick for me, it reminds me of my past life wasting time on mobile games
  • lots of practices of rituals and prayers to goddess throughout the movie made me glad I'm christian, I find it sad people stick to these practices because their parents, and their parents, observed them, it's passed down knowledge and tradition that's hard to break off due to reverence and obedience
  • the cousin looks like one of the NewJeans members
  • never thought much about chasing after inheritance, seems like a lazy and dishonorable thing to do, but if you're poor and weren't born into a good family that's well off, it's understandable, my privilege is showing.
  • wonder what happened to M's dad
  • I like that montage of M taking pictures as he travelled to amah's house, I'm guessing he's doing that cause it's his first time going at that age? or maybe it was all for the house listing for areas and things around.
  • I like the green of amah's house
  • wish my grandma was just a train ride away
  • this grandma is so sassy in a chinese way
  • pray to guanyin = no beef
  • you have to spend enough time around them to not even notice the smell
  • the only thing old people want is time with them
  • amah's life isn't that bad, she sells congee early in the morning, she sees an old friend on the way back
  • the street where they walk, with a train along the path has japan vibes
  • the film locations remind me a lot of Malaysia, specifically Penang or Ipoh or Johor?
  • a rich son that trades stock, a daughter that works at a supermarket, and a son that owes money (from gambling?), seems like there were issues with agong, money related, which resulted in children like this.
  • M is really good at responding to everything in a lighthearted and fun way, I got what my sister meant when she said he has high EQ
  • the way she says I'm used to being alone, that's what life is like, but the hardest day is the day after chinese new year, when the fridge is full of leftovers and she's the only one to finish them is SO heart-wrenching.
  • what's the solution to loneliness for the aging? i read so much about maintaining and building friendships, but what about how to keep parents close to us (distance-wise and emotionally), for those who have to immigrate for a better life?
  • this movie unearthed my deep-rooted fear of getting old
  • stairs are the bane of existence for old people
  • amah crying about pain in bed and asking mom and dad to take her with them is so heartbreaking
  • the way the rich son prioritizes his own family more than amah made me think about how an uncle at church said "老公是属于老婆的" (a husband belongs to his wife)
  • he steals money and disappears, but amah still gives the house to him, a mother has the most heart to forgive all transgressions
  • old folks home scene made me think about how many old people spent their last moments alone with only nurses and caretakers around, and not their loved ones.
  • sons get the assets while daughters get the cancer is such a dark joke
  • asking a mom which child / grandchild they love the most is kind of an impossible question to answer
  • there is no (earthly) love greater than a mom's love
  • do all old chinese people like watching chinese opera, wait the agong watched Victoria's Secret lol
  • M's lullaby to amah in her final moments :,)
  • I like how he knocks on the coffin to talk to amah
  • flashbacks where the main character discovers something that makes them realize all the things they never realized until it's too late (how amah has been depositing money for him since he was young just because he asked) are so emotionally satisfying and healing, it gives me goosebumps
  • I'm so reactive to people crying that I cry a little bit too, specifically that scene when the children and M were on the back of the truck with the coffin.
  • when M said "we're reaching the house I bought for you", that really got me

overall, I'm not close to my grandmas and I used to see them only once a year for CNY, so this movie just made me think of my mom the entire time and who will take care of her when she's old and sick. if I'm in the US (fingers crossed I make it) or somewhere else, and my parents are thousands of miles away, I'm worried about their well-being, and about not being able to visit them often. it's still 20 years out (they're early 60s now), but it's something to think about. My dream is to move them somewhere nice, like Canada or Europe, but that's a pipe dream for now. right now my focus is to graduate and get a good job. it's the best thing I can do for them for now.


10 Rules of Reliable Data Science

Since I'm starting an MSDS next month, I have to pick up Data Science again.

I found this paper with a ton of insights into best practices for data science.

below are some notes:


  • good: data science provides a powerful toolbox for increasing efficiency, driving growth, automating expensive processes, and making better decisions.
  • bad: Too many real-world projects are a chaotic stew of hand-tweaked algorithms, cherry picked examples, and brittle, undocumented, untested research code.
  • the main bottlenecks in data science are no longer compute power or sophisticated algorithms, but craftsmanship, communication, and process
  • the 3 basic premises
    • data science is a kind of software work,
    • correctness and reliability of software depends on development practices
    • data science quality depends on SWE quality
  • the aim: work that is accurate and correct, but also can be understood, work that others can collaborate on, and that can be improved and built upon in the future even if original contributors have left

Rule 1: Start organized, stay organized

  • good analysis is often the result of scattershot and serendipitous explorations; tentative experiments and trying out approaches that might work are all part of the process
  • but it should start with clean and logical structure
  • the cookiecutter DS template is an effective structure where:
    • data is always in /data, raw is in /data/raw, and final for analysis is in /data/processed
    • notebooks are in /notebooks with a numbering scheme
    • project-wide code is in /src
  • this sensible and self-documenting structure allows others to understand, extend, and reproduce your analysis
  • other people will thank you: newcomers can understand without suffering through documentation, every data scientist in a team can open any historical project, and immediately know where to find the inputs/outputs, the explorations, the final models, and any reports generated
  • you will thank yourself: if you try to reproduce your analysis, you're not asking questions about which .py file to run to get things done

Below is the full structure of their template

├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├──                    <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         {{ cookiecutter.module_name }} and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── {{ cookiecutter.module_name }}   <- Source code for use in this project.
    │
    ├── __init__.py    <- Makes {{ cookiecutter.module_name }} a Python module
    │
    ├──                <- Store useful variables and configuration
    │
    ├──                <- Scripts to download or generate data
    │
    ├──                <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├──            <- Code to run model inference with trained models
    │   └──            <- Code to train models
    │
    └──                <- Code to create visualizations
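
To make the structure concrete for myself, here is a minimal sketch of creating the folder skeleton by hand with the standard library. The folder names come from the template above; the `scaffold` and `notebook_name` helpers are names I made up for illustration, and adding the `.ipynb` extension is my assumption. The real template is generated by the cookiecutter tooling, not by code like this.

```python
from pathlib import Path
import tempfile

# Folders from the template above (files such as LICENSE or Makefile are left out).
PROJECT_DIRS = [
    "data/external",
    "data/interim",
    "data/processed",
    "data/raw",
    "docs",
    "models",
    "notebooks",
    "references",
    "reports/figures",
]


def scaffold(root: Path) -> list[Path]:
    """Create the folder skeleton under `root` and return the created paths."""
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run on an existing project
        created.append(path)
    return created


def notebook_name(order: str, initials: str, description: str) -> str:
    """Number (for ordering), creator's initials, then a `-` delimited description."""
    slug = "-".join(description.lower().split())
    return f"{order}-{initials}-{slug}.ipynb"  # the extension is my own addition


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for created_path in scaffold(Path(tmp) / "my_project"):
            print(created_path.relative_to(tmp))
    print(notebook_name("1.0", "jqp", "initial data exploration"))
```

The point is less the code and more the habit: every project starts from the same skeleton, so nobody has to guess where the raw data or the notebooks live.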

Rule 2: Everything comes from somewhere and the raw data is immutable

Every piece of knowledge must have a single, unambiguous, authoritative representation within a system. – Andy Hunt and Dave Thomas

  • the fundamental theorem of reproducibility: every conclusion drawn in an analysis must come from somewhere
  • DAG: Every piece of data or work product in an analysis tree should be the result of a dependency graph that can be traced backwards to examine what combination of code and data it came from or run forwards to recreate any artifact of the analysis
  • if you trace any product far enough upstream, you will end up at one or more scripts that extract raw data, or the raw dataset itself
  • do not have misc data files in the project, at least write down how the data was acquired in the README
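
To make the dependency-graph idea concrete, here is a minimal sketch of tracing an artifact back to its sources. The graph, the file names, and the `trace_upstream` helper are all made up for illustration; a real project would more likely get this from its build tool.

```python
# Hypothetical dependency graph: each artifact maps to the inputs it was built from.
# Every path here is invented for illustration.
DEPENDENCIES = {
    "reports/figures/churn.png": ["data/processed/features.csv", "scripts/plot.py"],
    "data/processed/features.csv": ["data/interim/clean.csv", "scripts/featurize.py"],
    "data/interim/clean.csv": ["data/raw/export.csv", "scripts/clean.py"],
}


def trace_upstream(artifact, graph):
    """Return every root (a node with no recorded inputs) that `artifact` depends on."""
    roots = set()
    seen = set()
    stack = [artifact]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        inputs = graph.get(node, [])
        if not inputs:
            roots.add(node)  # nothing upstream: a raw dataset or a script
        stack.extend(inputs)
    return sorted(roots)


sources = trace_upstream("reports/figures/churn.png", DEPENDENCIES)
```

Traced far enough upstream, the figure ends at one raw file plus the scripts that touched it, which is the property the rule asks for.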

Rule 3: Version control is basic professionalism

  • data should (mostly) not be in VCS
    • not usually a good idea to store intermediate or cleaned data products, whole idea of reproducible data pipelines is everything should be obtainable with clear provenance from original, raw dataset
    • impractical to store data of a certain size, will clutter commit history and inflate repository size on disk, tracking code changes alongside data change will be unbearable
    • putting data in code tracker is conceptual mismatch: we expect raw data to change often over time (daily report requires new data every day)
  • data stored in db or warehouse has its own versioning (timestamp, IDs)
  • raw data can be archived and shared using basic storage (hard drive, S3, etc.)
  • check out data versioning tools
  • why source control?
    • look at "diff" to spot unintended changes and extra debugging code
    • DS accretes small decisions and assumptions, helps track individually and get feedback on mathematical validity, statistical relevance, and implementation of the decisions
    • promote sharing knowledge

Rule 4: Notebooks are for exploration, source files are for repetition

  • notebooks are tools for rapid, iterative, and serendipitous explorations that give a tight feedback loop showing the immediate results of change
  • also invaluable as artifacts for communicating and explaining analyses
  • why notebooks are bad for reproducibility
    • manually opening and running cells is not automation
    • hard to test and introduce complications that go against the grain of common system conventions (logging, where output is directed STDERR vs STDOUT, return of error codes, what process fails on uncaught exceptions, working directories, etc.)
    • challenging for source control, merge conflicts
  • naming convention that shows "owner" and gives sense of order of analysis
    • <step>-<initials>-<description>.ipynb
    • ex: 3.1-BN-visualize-distributions.ipynb where step is a loose idea of where in the end-to-end workflow this notebook falls
  • the solution: .ipynb -> .py
    • continuously extracting common building blocks out of the notebook into source files that can be centrally tracked and used from any other notebook
  • benefits:
    • separate multiple concerns into logical units, so data layer (ETL) is not mixed with modeling and experimentation (where notebooks shine), and are not entangled with output layer (final result of workflow, where predictions, reports sent to a database)
    • prevents duplication of code (how errors creep in), rather than being stuck in one notebook, changes to commonly used code propagate across all notebooks
    • allows pieces of pipeline to run without running the entire pipeline (ex: tweak hyperparameter and refit model doesn't require sitting through long-running data extraction task)
    • enables tests to verify functionality and correctness
    • shows code changes across commits in VCS
  • embrace refactoring
    • write utility code in commonly accessible modules, write a couple of quick tests, import it into notebooks
  • when to move out of notebook?
    • matter of judgement, taste, and pragmatism
    • notebooks > code, for demonstration and experimentation
    • if a piece of code is the focus of the work, or solely designed for the analysis, leave it in
    • if it's a necessary building block but not worth showing, move it to a separate file that can be tracked in VCS and tested
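
The naming convention above is easy to check mechanically. This is a rough sketch, assuming the step is a number like `3` or `3.1`, the initials are letters only, and the description is `-` delimited; the regex and the `parse_notebook_name` helper are my own reading of the convention, not something from the article.

```python
import re

# Assumed reading of <step>-<initials>-<description>.ipynb
NOTEBOOK_NAME = re.compile(
    r"^(?P<step>\d+(?:\.\d+)?)"
    r"-(?P<initials>[A-Za-z]+)"
    r"-(?P<description>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)"
    r"\.ipynb$"
)


def parse_notebook_name(filename):
    """Split a notebook filename into its parts, or return None if it breaks the convention."""
    match = NOTEBOOK_NAME.match(filename)
    return match.groupdict() if match else None


parts = parse_notebook_name("3.1-BN-visualize-distributions.ipynb")
```

A check like this could run in a pre-commit hook so stray `Untitled.ipynb` files get flagged early.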

Rule 5: Tests and sanity checks prevent catastrophes

Code without tests is bad code. It doesn't matter how well written it is; it doesn't matter how pretty or object-oriented or well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don't know if our code is getting better or worse. – Michael Feathers

  • testing data science code is hard
    • testing on large datasets = long-running test
    • we expect data to change over time, downstream values fluctuate
    • often use randomness on purpose (to sample or fit model), hard to assert whether changes are meaningful or due to expected variation
    • visualizations are challenging to test
    • notebook env lacks the same level of tooling for test discovery and running
  • what code to test?
    • sanity check and smoke test
    • helpful when examples have possible edge cases such as null values, zeros, shape mismatch, etc.
  • what kind?
    • tests that operate on "toy examples" with tiny amounts of data or extremely small arrays make it clear what is being tested and what values are expected
    • unit test: the right tool for verifying processing or math code, tests focused on a specific operation being correct in isolation
      • how? take a sensible number of expected input-output pairs (taken from reference, calculated with an alternative package, or worked out by hand) and demonstrate that the new code produces the expected output
  • recommendation
    • write tests for any code refactored out of notebooks
    • write tests with sample data to confirm logic works as expected
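
A toy-example unit test might look like the sketch below. The `zscore` function is a stand-in I made up for "some math code refactored out of a notebook", and the expected values were worked out by hand. The `test_` names follow the usual pytest discovery convention, but nothing here needs pytest to run.

```python
import math


def zscore(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    if not values:
        raise ValueError("zscore needs at least one value")
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("zscore is undefined when all values are equal")
    return [(v - mean) / std for v in values]


def test_zscore_toy_example():
    # By hand: mean = 5, squared deviations sum to 32, 32 / 8 = 4, so std = 2.
    result = zscore([2, 4, 4, 4, 5, 5, 7, 9])
    expected = [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
    assert all(math.isclose(r, e, abs_tol=1e-12) for r, e in zip(result, expected))


def test_zscore_rejects_constant_input():
    # Edge case: zero spread would mean dividing by zero.
    try:
        zscore([3, 3, 3])
    except ValueError:
        return
    raise AssertionError("expected ValueError for constant input")
```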

Rule 6: Fail loudly, fail quickly

This is a problem that occurs more for machine learning systems than for other kinds of systems. Suppose that a particular table that is being joined is no longer being updated. The machine learning system will adjust, and behavior will continue to be reasonably good, decaying gradually. Sometimes tables are found that were months out of date, and a simple refresh improved performance more than any other launch that quarter! – Martin Zinkevich, "Rules of Machine Learning"

  • ml is dangerous: models can often appear to give reasonable predictions despite serious programming errors or data quality issues
  • principles of defensive programming
    • conspicuous: worst failure is a silent one
    • fast: if a function is going to fail eventually, it might as well fail now
    • informative: provide messages
  • ex: a model fitting pipeline, one of the features is mean(col_from_raw)
    • assume raw data is not generally expected to contain missing values
    • mean() function in many packages ignores nulls without complaint
    • a damaged sensor or buggy code change causes column to silently start having non-random null values
    • this new mean value could be strongly biased and lead to bad predictions, which in turn lead to costly business decisions
  • solution?
    • make assumptions concrete and enforce them at runtime
    • bail out immediately and loudly if assumptions are violated
    • from example above:
      • check input column before applying transformation
      • immediately log that the non-null assumption was violated
      • halt processing script with a failure exit status
    • point and call
      • your pipeline should not be more permissive than necessary
      • set a norm of thinking explicitly about where to be permissive and where to be strict
      • ex: adding an assertion that values can be floating point or null is an intentional way to show sources of possible error
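
Here is one way the mean(col_from_raw) example could be made to fail loudly, using only the standard library. The names `strict_mean` and `run_step` are made up for this sketch. For reference, pandas `Series.mean()` skips NaN by default (`skipna=True`), which is exactly the silent behavior described above.

```python
import logging
import math

logger = logging.getLogger("pipeline")


def strict_mean(values, column_name):
    """Mean that refuses to silently skip missing values."""
    missing = [
        i for i, v in enumerate(values)
        if v is None or (isinstance(v, float) and math.isnan(v))
    ]
    if missing:
        # conspicuous and informative: say what broke and where
        message = f"{column_name}: {len(missing)} missing value(s), first at row {missing[0]}"
        logger.error(message)
        raise ValueError(message)
    if not values:
        raise ValueError(f"{column_name}: no values to average")
    return sum(values) / len(values)


def run_step(values):
    """Return a process exit status: 0 on success, 1 if the assumption was violated."""
    try:
        feature = strict_mean(values, "col_from_raw")
    except ValueError:
        return 1  # a script would hand this to sys.exit()
    logger.info("mean(col_from_raw) = %s", feature)
    return 0
```

The check runs before the transformation, the violation is logged, and the non-zero status lets a build tool or scheduler stop the rest of the pipeline.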

Rule 7: Project runs are fully automated from raw data to final outputs

  • it should be obvious for anyone to initiate the process for raw data -> finished product
  • instructions in README are not enough, you can miss crucial steps, or it's too vague, or not updated.
  • when data pipelines are not automated, they are not reproducible
  • use a build tool
    • keep pipeline working and run it every time you make a non-trivial change
    • CI/CD: mainline branch is always in a runnable state and has its tests run every time a change is committed
    • for a data project, to reproduce the results of a project, anyone should be able to run a "default" pipeline without typing out or understanding the various knobs and settings
    • it can be as intuitive as running a single command from proj dir
  • make env reproducible
    • same interpreter and same libraries with identical versions to ensure matching results
    • requirements.txt with specific versions of packages
    • for more complex requirements, use docker
  • make randomness reproducible
    • set a random seed in a central location
      • note that random, numpy.random, torch.rand each use their own seeding mechanism
      • use an explicit random number generator object where you set seed in one location and track its use
    • reproducing an analysis using probabilistic methods means pseudorandom number generators must be initialized with a known state to induce the same values each time
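
A small sketch of the "explicit generator object" idea, using the standard library's `random.Random`. The constant `RANDOM_SEED` and the helpers are illustrative names, not from the article. NumPy has a similar object via `numpy.random.default_rng(seed)`.

```python
import random

RANDOM_SEED = 42  # the one place the seed is set (the value itself is arbitrary)


def make_rng(seed=RANDOM_SEED):
    """Create an explicit generator instead of relying on the global random state."""
    return random.Random(seed)


def train_test_split(rows, train_ratio, rng):
    """Shuffle a copy of rows with the given generator, then split it."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Two runs that start from the same seed produce the same split.
first = train_test_split(range(10), 0.5, make_rng())
second = train_test_split(range(10), 0.5, make_rng())
```

Passing `rng` around as an argument makes it visible which functions consume randomness, which is harder to see with a global seed.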

Rule 8: Important parameters are extracted and centralized

  • magic numbers are the bad practice of sprinkling critically important values, which affect the program's behaviour, throughout the codebase
  • data projects often end up with an excessive number of ways to parameterize scripts and functions
  • consider a model training pipeline that cleans raw data, fits several models, and outputs predictions
    • all of the model parameters multiply possibilities with the "meta" choices like train/test ratio, cross-val params, ensemble voting params, to make a combinatorially overwhelming universe of possible inputs for a single run
  • AVOID: one notebook for each model and then copy pasting setup code, or reuse notebooks by changing parameters by hand, or have parameters in multiple code files.
  • remember the goal is not just making changes, it's to see how our changes affect outcomes
  • ideal: running another experiment = changing settings in a central config file, getting a cup of coffee, comparing the most recent log to past experiments
    • it also documents the changes with the output for yourself and for colleagues
  • imagine a declarative settings file, config.yml, like below
n_threads: 4
random_seed: 42
train_ratio: 0.5
log_level: debug
features:
  use_log_scale: true
  n_principal_components: 4
models:
  xgboost:
    max_depth: [2,5,10]
    n_estimators: [50,100,150,200]
  random_forest:
    criterion: ["gini","entropy"]
ensembling:
  voting: ["soft","hard"]
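
To see how a config like this drives a run, here is a rough sketch that expands the per-model option lists into concrete parameter combinations. The dict mirrors the `models` section by hand so the snippet stays standard-library only; a real project would presumably load config.yml with a YAML parser. `expand_grid` is a made-up helper name.

```python
from itertools import product

# Mirrors the `models` section of the config as a plain dict (hand-copied for the sketch).
MODELS = {
    "xgboost": {"max_depth": [2, 5, 10], "n_estimators": [50, 100, 150, 200]},
    "random_forest": {"criterion": ["gini", "entropy"]},
}


def expand_grid(param_lists):
    """Turn {name: [options]} into a list of {name: option} dicts, one per combination."""
    names = list(param_lists)
    combos = product(*(param_lists[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


grids = {model: expand_grid(params) for model, params in MODELS.items()}
```

Changing an experiment then means editing one list in the config, not hunting for values across notebooks.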

Rule 9: Project runs are verbose by default and result in tangible artifacts

  • capturing useful output during data pipeline runs makes it easy to figure out where results came from, and to look back and pick up from where it was left off
  • example of a run of experiment codified in a config file
2022-11-01 14:10:02 INFO starting run on branch master (HEAD @ 37c02ab)
2022-11-01 14:10:02 DEBUG set random seed of 42
2022-11-01 14:10:02 DEBUG reading config file
2022-11-01 14:10:02 DEBUG ... settings: {"n_threads": 4, <...snip...>}
2022-11-01 14:10:03 INFO reading in and merging data files
2022-11-01 14:10:39 INFO finished loading data: 2,835,824 rows
2022-11-01 14:10:40 WARNING ... dropped 126,664 rows where ID was duplicated (4.47%)
2022-11-01 14:10:41 WARNING ... dropped 2,706 rows where column 'total' was null (0.09%)
2022-11-01 14:10:41 INFO creating train/test split
2022-11-01 14:10:41 INFO ... train: 0.5 [1,353,227 rows]
2022-11-01 14:10:41 INFO ... test : 0.5 [1,353,227 rows]
2022-11-01 14:10:42 INFO starting grid search cross validation ..
2022-11-01 14:21:01 DEBUG ... 30/120
2022-11-01 14:31:46 DEBUG ... 60/120
2022-11-01 14:42:31 DEBUG ... 90/120
2022-11-01 14:53:03 DEBUG ... 120/120
2022-11-01 14:53:03 INFO finished cross validation, writing best parameters
2022-11-01 14:53:03 DEBUG ... runs/2022-11-01_14-10-02/parameters/xgboost.yml
2022-11-01 14:53:03 DEBUG ... runs/2022-11-01_14-10-02/parameters/random_forest.yml
2022-11-01 14:53:03 DEBUG ... runs/2022-11-01_14-10-02/parameters/adaboost.yml
2022-11-01 14:53:03 INFO training voting classifier on ensemble of 3 best models..
2022-11-01 14:53:19 INFO making predictions
2022-11-01 14:53:22 INFO writing results to runs/2022-11-01_14-10-02/results.yml
2022-11-01 14:53:22 DEBUG ... results: precision=0.9624 recall=0.9388 f1=0.9514
2022-11-01 14:53:23 INFO writing predictions to runs/2022-11-01_14-10-02/predictions.csv
  • above we are storing important history like how many rows were dropped, what the train/test split was, how long it took to run, the performance, the best parameters, etc.
  • imagine if you introduce a new feature, you want to see what impact that has on overall performance as well as processing time, instead of eyeballing, you can simply look at the result data files
  • this allows rapid iteration through experimental settings, you worry less about missing settings or performance backsliding from misc changes, and focus more on actual modeling subject matter and business case
  • this ability to audit how results came about -> eases detecting and fixing any downstream problem
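
Log lines in that shape can be produced with Python's built-in `logging` module. The sketch below is an assumption about how one might wire it up, not the article's code; `make_run_logger` and `drop_duplicate_ids` are hypothetical helpers, and rows are plain dicts with an `"id"` key.

```python
import logging


def make_run_logger(name="run", level=logging.DEBUG):
    """Logger whose lines look like `2022-11-01 14:10:02 INFO message`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate lines if called twice
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
        )
        logger.addHandler(handler)
    return logger


def drop_duplicate_ids(rows, logger):
    """Keep the first row per ID and log how many rows were dropped."""
    seen, kept = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        kept.append(row)
    dropped = len(rows) - len(kept)
    if dropped:
        logger.warning(
            "... dropped %s rows where ID was duplicated (%.2f%%)",
            f"{dropped:,}",
            100 * dropped / len(rows),
        )
    return kept


kept = drop_duplicate_ids([{"id": 1}, {"id": 2}, {"id": 1}], make_run_logger())
```

The point is that the filtering step reports its own side effects, so the dropped-row counts end up in the run history without anyone remembering to print them.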

Rule 10: Start with the simplest possible end-to-end pipeline

"A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system."

– John Gall, Systemantics

  1. start with proper form, fill in proper function
    • work from raw data all the way to finished product before going back to improve all of the pieces
    • all projects have a time constraint, so it's better to get the entire pipeline glued together, even if some parts are basic or even faked
      • this holds even if you use a naive model with poor accuracy, clean very few features from the raw data, or produce rough outputs
    • aim for an automated pipeline, then work on it piece by piece
  2. start with basic tools and models
    • for numerical predictions from tabular dataset, the potential accuracy boost from deep learning often does not justify the added complexity and decreased interpretability, compared to plain old regression or decision trees
    • the decision to add additional complexity should not be made lightly, but delayed until the last responsible moment when it's clear the tradeoff is going to pay dividends
    • ex: there are powerful libraries for orchestrating from raw -> clean data to trained models to new predictions, but consider whether the old and boring Make tool can do the job
    • ex: instead of perfectly handling every quirk in a dataset, purposefully decline to handle edge cases by failing fast on bad inputs (rule 6) or loudly filtering unhandled cases (rule 9)
      • take a column with values like "12.10 USD": instead of doing something clever like a custom struct with an enumerated data type representing the currency in use, if you know it'll only ever be "USD", just do a split on space " " and assert the second value is "USD" (rule 5)
  3. you don't have to use all of the data all the time
    • exploratory analysis and scaling an analysis are two different chores, they can be tackled separately as long as the bigger picture is in mind
    • keep exploratory feedback loop fast by working with representative samples
    • sometimes a naive sample is enough for the task at hand, i.e. getting all data cleaning code in order or making sure a preprocessing step results in useful features
    • save the parallelization for later, once you feel confident about the modeling assumptions and data pipeline, they can be translated for parallelization
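
The "12.10 USD" example from point 2 is short enough to write out. This is a minimal sketch; `parse_usd` is a made-up name, and using `Decimal` for the amount is my choice here, not something the article specifies.

```python
from decimal import Decimal


def parse_usd(raw):
    """Parse a string like "12.10 USD" into a Decimal, failing fast on anything else."""
    parts = raw.split(" ")
    assert len(parts) == 2, f"expected '<amount> USD', got {raw!r}"
    amount, currency = parts
    assert currency == "USD", f"unexpected currency {currency!r} in {raw!r}"
    return Decimal(amount)  # raises decimal.InvalidOperation on a non-numeric amount


price = parse_usd("12.10 USD")
```

One caveat: `assert` statements are stripped when Python runs with `-O`, so a pipeline that might run optimized would probably raise `ValueError` instead.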

software lessons

  1. use version control
  2. keep it simple, stupid (KISS): simple solutions are easier to reason about, easier to debug, and easier for new collaborators to understand. only add complexity if it serves the ultimate goal of the project, after careful consideration
  3. separation of concerns: aka loose coupling, minimize the amount each component of a program needs to "know" about other pieces
  4. separate configuration from code: dispersing settings throughout codebase makes it difficult for anyone trying to understand the logical flow, centralize and codify important settings to make the effect they have on outputs obvious.
  5. You aren't gonna need it (YAGNI): abstraction is powerful and costly, start with a concrete case and only generalize when needed.
  6. premature optimization is the root of all evil: rule of thumb is to "make it work, make it right, make it fast", focus on getting the job done first before trying to speed up or cover all edge cases. in data work, understand the problem well first instead of dealing with massive datasets and clusters
  7. don't repeat yourself (DRY): when functions get copied and pasted all over, it's a problem. for DS, that means refactoring notebook code -> "real" modules
  8. composability: unix philosophy, "write programs that can communicate easily with other programs" so that developers break down projects into small simple programs rather than overly complex monolithic programs. instead of packing everything into one function, separate functions that read data, assemble data, and graph the result
  9. test the critical bits: simple basic tests and sanity checks are important for data projects, they prevent the most common errors and increase confidence about correctness
  10. fail fast and loudly: adopt a defensive way of thinking about error cases and focus on bailing out early if unanticipated errors happen



How to be popular?

Bal du moulin de la Galette, Pierre-Auguste Renoir, 1876

if you were ever wondering why you're not popular.

here's Dr. Ana to answer that burning question.

The main trait of popular kids is they're fun

But what does it mean to be fun?

There are a few characteristics.

first is humor. We like to laugh, and we like people who can make us laugh. laughing induces pleasant emotions. humor is a defense mechanism that keeps people from experiencing the hardships of life as painfully as they otherwise would. That's why we like comedians. so if you're innately funny, let that shine in social interactions.

second is charisma. Charisma is comprised of three components: presence, warmth and power. if you can make people feel like the most important person in the room (presence), and they have positive feelings towards you (warmth) and you are somebody worth being around (power), then you're fun to be around.

third is being a good conversationalist. what makes conversations free flowing and fun? it's not the deep questions, some people don't like to go too deep and intimate into their traumas or personal lives, especially if they don't know you that well yet. Instead, have surface level conversations that can involve humor, then go deeper

example, you're talking to a therapist

  • you ask: "what is it like to be a therapist, do you feel drained at the end of the day?", they share their experience, talk about their day, and you just nod and try to think of the next question, and there's no flow
  • you say: "I always wanted to be a therapist in undergrad but I decided against it", they ask why, you say "I don't think I had enough empathy for it", you both laugh, they ask "what did you go into instead?", and the conversation flows from there

fourth is liveliness. they exude energy, they dance enthusiastically, they converse enthusiastically, they are very uninhibited, they laugh wholeheartedly. they're the ones pulling people on to the dance floor. naturally you'll be drawn to high energy people because they will pull us up, whereas low energy people are going to be a little bit more static. Extroverts have the advantage here, whereas introverts are the people who retreat from loud environments. Introverts might be grounding for people, creating a calm and pleasant environment, but that isn't fun. So, if you're an introvert, you have to put in more effort to be lively and recharge your social battery.

last is being a risk taker. fun people are not as reactive when something scary happens, their bodies are not as affected in dangerous situations, they seek the adrenaline rush. They're the ones that have fun with life and don't mind things getting a little dangerous. They don't think, they act. They're the ones that convince everyone to go out when everyone's already in bed, the ones that take the first dive at a cliff, the ones that tell you to ask out that person at the bar. learn to live a little, you do only live once. but don't be too reckless.

one thing to keep in mind: greater popularity does not relate to being a good person or a good friend, it's just about being fun.

if you want to be popular, ask yourself in every single interaction, how can I make this more fun for the other person?



Biggest mistakes in your 20s

Luxe, Calme et Volupté ("Luxury, Calm and Pleasure"), Henri Matisse, 1904

Every twenty-something year old should know these four facts

  1. Your energy is a limited resource that you're consciously or unconsciously investing every day.
  2. How you invest that energy determines the outcome of your life.
  3. Most people will convince you to invest your energy into things that benefit them, not you.
  4. You will never have as much disposable energy as you do in your 20s ever again

So the biggest mistake 20 year olds can make is chasing pleasure and comfort, and not investing energy into building a strong foundation for their lives.

If you become disciplined and sow the right seeds early, you'll reap an abundance of health, wealth, intelligence, love and strength in the future.

But how do you become disciplined?

You need to know these two concepts: compound interest and purchasing power

Compound interest is what turns small investments into huge payouts over time, it's the only sure way to generate wealth in any domain.

Here's an example:

Jill invests $5,000 a year in an asset that increases by 7% in value each year, while Jack invests in an asset that decreases by 7% in value each year. After ten years of investing, Jill has about $69,000 in her account, but Jack only has around $37,000
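
The numbers can be checked with a few lines of Python. This sketch assumes each $5,000 is added at the end of the year, after that year's gain or loss is applied; the example doesn't state the timing, and a different assumption shifts the totals a bit.

```python
def balance_after(years, yearly_deposit, yearly_rate):
    """Balance when the total changes by yearly_rate each year and a deposit is added at year end."""
    balance = 0.0
    for _ in range(years):
        balance = balance * (1 + yearly_rate) + yearly_deposit
    return balance


jill = balance_after(10, 5_000, 0.07)   # roughly 69,000
jack = balance_after(10, 5_000, -0.07)  # roughly 37,000
```

Both put in the same $50,000, so the whole gap comes from the direction of the yearly rate.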

Applying this to habits, the same logic applies.

If you invest in good habits early, they will compound and accelerate your growth.

As James Clear once said: "Habits are the compound interest of self-improvement"

The second piece is purchasing power.

Back to the financial metaphor:

Imagine now that instead of having $5,000 to invest each year into an asset, Jack and Jill lose $200 worth of purchasing power each year

The amount of energy you have available to invest in yourself declines as you get older.

As you get older, you take on more responsibilities that place a demand on your energy. You have to manage your time and energy to balance work, family, and self-care.

The easiest time to form good habits and break bad ones is while you're younger.

To make things easy, you can focus on only 5 assets to invest in.

Five best assets to invest in

  1. productive capital
    • what: your expertise that makes you valuable to society
    • how: stay curious, consistently learn and improve, do hard things, put in the work
  2. spiritual capital
    • what: how strong your conscience is in making wise decisions
    • how: self-reflect, meditate, be truthful and honest, volunteer and mentor, support local causes
  3. intellectual capital
    • what: ability to turn negative emotions into insights and actions
    • how: practice labeling emotions, use creative outlets to process emotions
  4. social capital
    • what: depth and quality of relationships
    • how: invest in quality time with friends and family, seek out mentors, network and step out of comfort zone
  5. physical capital
    • what: health and physical well-being
    • how: consistent exercise routine, balanced nutrition, staying hydrated, quality sleep and rest

Look at your daily/weekly/monthly schedule, try to squeeze in time to grow and develop each of the five forms of capital.

This way you're investing time and energy that will have ROI in the long run.
